Object detection from a target image has attracted increasing research attention because of a wide range of emerging new applications, such as those on smart mobile phones. Traditionally, pattern recognition methods are used to train a classifier with a large number, possibly thousands, of image samples.220.127.116.11.18.104.22.168.–10 An object in a sample image is a composite of many visual patches or parts recognized by some sparse feature analysis methods. In the detection process, sparse features are to be extracted in a testing image and a trained classifier is used to locate objects in a testing image. Unfortunately, in most real applications there are always insufficient training samples for robust object detection. Most likely, we may just have a few samples about the object we are interested in, such as the situations in passport control at airports, image retrieval from the Web, and object detection from video or images without preprocessed indexes. In these cases, the template matching approach based on a small number of samples often has been used.
Designing a robust template matching method remains a significant effort.11 Most template matching approaches use query image to locate instances of the object by densely sampling local features. Shechtman and Irani provided a single-sample method12 that uses a template image to find instances of the template in a target image (or a video). The similarity between the template and a target patch is computed by a local self-similarity descriptor. In Refs. 13 and 14, one sample, representing human behavior or action, is used to query videos. Based on this training-free idea, Seo and Milanfar proposed the locally adaptive regression kernels (LARK) feature as the descriptor to match with the object in a target image using only one template.15 This LARK feature, which is constructed by local kernels, is robust and stable, but this LARK feature brings overfitting problem and results in low computational efficiency.In Ref. 16, the authors constructed an inverted location index (ILI) strategy to detect the instance of an object class in a target image or video. This ILI structure saves the feature locations of one sample and indexes feature values according to the locations to locate the object in target image. But this ILI structure just processes one sample. In order to improve the efficiency and accuracy based on a small number of training samples, these methods have to run a few times on each of those samples.
Different from the dense feature like LARK, key-point sampled local features, such as scale invariant feature transform (SIFT)17 and speeded up robust features (SURF),18 always obtain a good performance in the case of using thousands of samples to train classifiers. And these key-point features are always in a high-dimensional feature space. If one has thousands of samples and needs to learn classifiers such as support vector machine (SVM),3,8 key-point features have obtained good performance. Previous works15,1920.–21 pointed out that the densely sampled local features always give better results in classification tasks than that of key-point sampled local features like SIFT17 and SURF.18
Recently, some interesting researches based on few samples have emerged. Pishchulin et al. proposed a person detection model from a few training samples.22 Their work employs a rendering-based reshaping method in order to generate thousands of synthetic training samples from only a few persons and views. However, the samples are not well organized and their method is not applicable on generic object detection. In Ref. 23, a new object detection model is proposed named the fan shape model (FSM). FSM uses a few samples very efficiently, which handles some of the samples to train out the tolerance of object shape and makes one sample the template. However, FSM method is not scalable in terms of samples and is only for contour matching.
In this paper, we propose a novel approach for generic object detection based on few samples. First, a new type of feature at each pixel is considered, called locally adaptive steering (LAS) feature, which is designed for a majority voting strategy. The LAS feature at 1 pixel can describe the local gradient information in the neighborhood, which consists of the dominant orientation energy, the orthogonal dominant orientation energy, and the dominant orientation angle. Then, for each member of this feature, a cell is constructed at each pixel of the template image, whose length is the range of the feature member value. The cells for all pixels construct a voting space. Since this feature is in three dimensions, three voting spaces are to be constructed. We use a few samples to train these voting spaces, which represent the tolerance of appearance of an object class at each pixel location.
Our idea of using a LAS feature is motivated by earlier work on adaptive kernel regression24 and the work of LARK feature.15 In Ref. 24, to perform denoising, interpolating, and deblurring efficiently, localized nonlinear filters are derived that adapt themselves to the underlying local structure of the image. LARK feature can describe the local structure very well. After densely extracting LARK features from template and target images, matrix cosine similarity is used to measure the similarity between query image and a patch from target image. This method is resilient to noises and distortions, but the computation of LARK features is time-consuming with heavy memory usage. Our LAS feature simplifies the computation of LARK and saves the memory. Moreover, LAS feature also exactly captures the local structure of a specific object class.
Our voting strategy is inspired by the technology of Hough transformation. Many works2522.214.171.124.–30 have contributed to model spatial information at locations of local features or parts as opposite to the object center by Hough voting. The Hough paradigm starts with feature extraction and each feature casts votes for possible object positions.4 There are two differences from Hough voting. The first one is each cell size of the voting spaces is trained out by samples. But each cell size of the Hough voting space is all fixed. Our trained space can tolerate more deformation. The second one is our voting strategy is based on template matching. Previous Hough voting is based on a trained code-book with thousands of samples.
This paper is structured as follows. Section 2 gives an overview of our method. In Sec. 3, we propose LAS feature. Section 4 describes the concept of voting space and the training processing. Section 5 introduces the procedure of object matching in voting spaces. Section 6 gives the experimental results and compares our method with the state-of-the-art methods. This paper is extended from an early conference paper31 with improved algorithms and comprehensive experimental results.
Overview of Our Approach
An overview of our method is shown in Fig. 1. In the training processing, for each dimension of LAS feature, a voting space is constructed by image coordinates and corresponding value ranges of the feature member. Each cell in the voting space is formed by the corresponding pixel position and its value range of this feature member, which is trained by several samples. Since the voting space is three dimensional (3-D), for simplicity we use a 3-D box to represent a cell. The longer the box is, the higher the cell length. The different sizes of boxes mean the different tolerances of object appearance changes at corresponding pixels. In Fig. 1, LAS feature is specially designed for object matching in voting spaces, where , , and represent the dominant orientation energy, the orthogonal dominant orientation energy, and the dominant orientation angle, respectively, in the neighborhood at each location. Thanks to the merit of voting, only a few samples (2 to 10 samples in this paper) are enough to train cells.
In the detection process, by randomly choosing one sample image as the query , the instances of an object class are located in target image , which is always larger than . First, the patches extracted from by a sliding window are detected by densely voting in the trained voting spaces. Each component of LAS feature of and is voted in each voting space to obtain a value of similarity. If the values of similarity of all LAS features are larger than the corresponding thresholds, then is a similar patch of . Then a refined hypotheses step is used to accurately locate the multiple instances by computing the histogram distance corresponding to the feature . The refined step is just processing the similar patches that are obtained in the voting step. If the histogram distance of between a similar patch and is small enough, then the similar patch is the object instance.
Locally Adaptive Steering Feature
The basic idea of our LAS is to obtain the locally dominant gradient orientation of image. For gray image, we compute the gradient directly. If the image is RGB, the locally dominant gradient orientation is almost the same on each channel. To reduce the computation cost of transforming RGB to gray image, we just use the first channel of RGB. The dominant orientation of the local gradient field is the singular vector corresponding to the smallest singular value of the local gradient matrix.24,32 (The proof of transformation invariance of singular value decomposition (SVD) can be reviewed by interested readers from Ref. 24.) For each pixel , one can get the local gradient field shown in Fig. 2(a). The local gradient field is a patch in the gradient map around the pixel . Here, the size is set as . The dominant orientation means the principal gradient orientation in this gradient field. To estimate the dominant orientation, we compute the horizontal and orthogonal gradients of the image. Then, the matrix is concatenated column-like as follows:
The eigenvalues and represent the gradient energies on the principal and minor directions, respectively. Our LAS feature at each pixel is denoted as . We define a measure to describe the dominant orientation energy as follows:
The measure is the rotation angle of , which represents the dominant orientation angle.
The LAS feature can describe the local gradient distribution information (see Fig. 2). , , and are from the computation of the local gradient field, which can yield invariance to brightness change, contrast change, and white noise as shown in Fig. 3. The results of Fig. 3 are from the computation of LAS feature under different corresponding conditions. Due to the SVD decomposition of local gradients, the conditions of Fig. 3 on each pixel do not change the dominant orientation energy enormously. One can find the proof details of the tolerance of white noises, brightness change, and contrast change from Ref. 24.
Some studies15,1920.–21 have already pointed out that the densely sampled local features always give better results in classification tasks than that of key-point sampled local features, such as SIFT17 and SURF.18 These key-point sampled features are always in a high-dimensional feature space in which no dense clusters exist.15 Comparing to the histogram of gradient (HOG) feature,27 our LAS feature has smaller memory usage. Each location of the HOG feature is 32 dimensions histogram, while our LAS feature is just three dimensions. In Ref. 33, the authors also proposed dominant orientation feature. But this dominant orientation is a set of representative bins of the HOG.33 Our dominant gradient orientation is computed by the SVD decomposition of the local gradient values, which have more local shape information. Comparing to the LARK feature,15 our LAS feature has 27 times smaller memory usage, for the LARK feature is 81 dimensions at each pixel location. In Ref. 24, the authors mentioned these three parameters, but no one has used them as features. Next, we train the voting spaces based on this LAS feature to obtain three voting spaces.
So why can the LAS deal with only a few image samples well? That is because our LAS feature contains more local gradient information than other dense features like LARK15 and HOG.27 There are three components of our LAS feature , , and , which represent the dominant orientation energy, the orthogonal direction energy of dominant orientation, and the dominant orientation angle, respectively. For LARK,15 there is only gradient energy information, which cannot reflect the energy variations. For HOG,27 there are just values of region gradient intensity in different gradient orientations, which cannot reflect dominant orientation energy and angle.
Training Voting Spaces
The template image coordinates and the value ranges of the LAS feature component at the corresponding locations form the voting spaces (denoted as , , and , respectively, for three components of LAS feature). To match the template image and the patch from the target image (testing image) accurately, the cell length should be trained to reflect the tolerance of appearances at each pixel location. Several samples (2 to 10 samples in this paper) are enough to train the cells in each voting space.
Assume the query samples as and is the cardinality of . We use Eqs. (3) to (5) to compute the LAS feature matrices of samples and obtain , , and . We want to get the tolerance at each location from the matrices , , and for (). Because our LAS feature is from local gradients at each location, each value in matrices , , and reflects the local edge orientation. To reflect the variation range of samples at each location, we define the cell sizes , , and as follows:
Our definition of cell size is not the only choice. However, this definition is very simple and effective. Different from traditional training scheme, our training method, based on LAS feature, is not computationally expensive.
A query image is randomly selected from the sample images. And the target image is divided into a set of overlapping patches by a sliding window with the same size as the query image , where is the number of patches in .
In our object matching scheme, there are two steps to search similar patches in . First step is voting in the trained voting spaces (see Fig. 5). To combine the votes from three voting spaces, we use a joint voting strategy to detect similar patches from . After the first step, one can get some similar patches ( and is the cardinality of ). Then a refined hypotheses step follows. In this step, the histogram distance of the LAS feature between and is used to measure integral similarity, which can precisely locate the instances of the query .
Voting in Trained Spaces
We associate each patch in with a hypothesis as follows:
Because there are three components with respect to the LAS feature, we have three estimated conditional densities , , and . These conditional densities are defined as the results of voting. Specifically,
To compute the function , we define three variables , , and as , , and , where means to take absolute value of the elements in the matrix. In our framework, the functions , , and are defined as12) to (14) of function , we can find that the estimated conditional densities in Eqs. (9) to (11) represent, for each LAS component, the ratio of votes at the size of the query image .
The estimated conditional densities , , and between and each element of are computed after voting. So how can we discriminate the similar patches from these densities?
Our answer is organizing the samples to train the density thresholds between and the set . In Ref. 15, the authors use a tunable threshold to detect possible objects presented in the target image and nonmaxima suppression strategy to locate the objects in a similarity potential map. But in our scenario, we make use of several samples sufficient to obtain the thresholds, written as , , and , off-line. These three thresholds must contain two properties. The first one is that these thresholds reflect the tolerance of the cells. The second one is that the thresholds must be different when the query image changes. Here, the computation formulas of , , and are the following:15) to (17) that , , and are also changed. It is worth noting that the min function is just one of the alternative functions in Eqs. (15) to (17). One can choose mean function, median function, even max function, or so on. The reason that we choose min function is that our samples in the experiment are without rotation, strong noises, and brightness change. Other functions to handle more complex cases need further research. In our experiments, we just use Eqs. (15) to (17) to compute the thresholds.
Next, we use the estimated conditional densities and trained thresholds to obtain the similar patches. For the LAS feature containing three components, our work is to combine these three components to detect the similar patches . So we define a map (, 2, 3) and the combination , where
For each , if , , 2, 3, then . In Fig. 4, we show the voting results between and the elements in . The densities in the red bounding box are all larger than the thresholds. So is the patch similar to . We compute the combination function for all and put patches whose function values equal to 1 into the set . In Fig. 5, we draw the graphical illustration of the detection process.
Refined Hypotheses Step
After the density voting step, we obtain the similar patch set . The refined step just a process of this set , which is obtained from the voting. However, the first step is just a local voting method at each pixel location. It is not enough to describe the integral information of the object. The construction of LAS feature shows that is related to the orientation of the local edge, which is mentioned in Ref. 24. To use the contour information sufficiently, we compute the histogram distance between and . For the features and , after being quantized here in the bin of 10 deg, one can calculate the histograms denoted as and , respectively. The distance between and is defined as
The more similar two histograms are, the smaller is. So we use the max function to compute the . If is satisfied, will be the instance of the query . It is efficient to use the distance [see Fig. 5(c)]. The reason is that the histogram distance between and reflects the integral difference.
In fact, besides using the histogram distance of , we can also use the histogram distance of and . But in experiments, we find that using the histogram distance of or cannot enhance the precision of detection result, and is better than and . The reason is that the feature more precisely describes the contour information of an object.
Previous works3,3435.36.37.–38 have already shown that the histogram is a popular representation for feature description. That is because the histogram encodes the distribution of spatially unordered image measurements in a region.36 The distance is used to compare the distance between two histograms in Ref. 3. So, we use this quadratic- measurement to discriminant histogram distance.
The experiments consist of three parts using car detection, face detection, and generic object detection, respectively. To handle object variations on scale and rotation in the target image, we use the strategies provided in Ref. 15, which construct a multiscale pyramid of the target image and generate rotated templates (from ) in 30-deg steps. The receiver operating characteristic (ROC) curves are drawn to describe the performance of object detection methods. We use the definition in Ref. 15 that Recall and Precision are computed as
Now, we show the performance of our method on the University of Illinois at Urbana-Champaign (UIUC) car data set.39 The UIUC car data set contains the learning and test sets. The learning set consists of 550 positive car images and 500 noncar images. The test set consists of two parts: 170 gray-scale images containing 200 side views of cars of size and 108 gray-scale images containing 139 cars.
In Fig. 6, we show some detected examples of UIUC car on the single-scale test set with the trained parameters , , , and . The query image and training samples are of size .
To demonstrate the performance improvement of our method, we compare our method to some state-of-the-art works15,3940.–41 (see Fig. 7). Seo et al.15 proposed the LARK features that detect instances of the same object class and get the best accuracy detection, resulting in template matching methods. This method is referred to as LARK. In Ref. 39, the authors used a sparse, part-based representation and gave an automatically learning method to detect instances of the object class. Wu et al.41 showed a method based on the per-pixel figure-ground assignment around a neighborhood of the edgelet on the feature response. Their method needs to learn the ensemble classifier with a cascade decision strategy from the base classifier pool.41 In Ref. 40, the authors introduced a conditional model for simultaneous part-based detection and segmentation of objects of a given class, which needs a training set of images with segmentation masks for the object of interest. However, these works39,40,41 are all based on the training methods, which need hundreds or thousands of samples.
From Fig. 7, it can be observed that our method is better than the methods in Refs. 15 and 39 and the recall is lower than that in Refs. 40 and 41, which need hundreds or thousands of samples. The precision of our method can be improved more if the detected results are combined by querying the object using the training samples one by one. But this is not our main point. Our focus is that detecting the instances of an object by one query using our method is competitive to or better than that of the one-query method executing several times. Compared with the LARK method, because we organize the training samples reasonably, our detection results have more appearance tolerance of the object. Although we just have few samples in hand, the detection result of our method is better than that of the previous works,39 which need hundreds or thousands of samples.
The comparisons of detected equal-error rates (EER)15 are shown in Tables 1 and 2. One can also find that our proposed method is competitive to or better than those state-of-the-art methods. Here, we compare our method to the state-of-the-art training-based methods3,3940.–41 and the one-query method (LARK). The EER on the single- and multiscale test sets are shown in Tables 1 and 2, respectively. From Table 1, it can be found that the EER of our method is higher than that of methods in Refs. 15 and 39, and lower than that of the methods in Refs. 3, 40, and 41. In Table 2, the EER of our method is higher than that of the methods in Refs. 15 and 39 and lower than that of the methods in Refs. 3 and 40. As our strategy is based on few samples, the prior knowledge of the object class is limited. However, the EER of our method also reaches 92.15 and 91.34%, respectively, which are competitive to the methods in Refs. 3, 40, and 41. These methods always need thousands of samples to train classifiers. But our method, whose training processing is much simpler than that of these methods, just needs several samples. Compared to these three methods, our method is also competitive.
Detection equal-error rates on the single-scale UIUC car test set.
Detection equal-error rates on the multiscale UIUC car test set.
In this section, we demonstrate the performance of our method to face detection on Massachusetts Institute of Technology—Carnegie Mellon University (MIT-CMU) face data set42 and Caltech face data set.11,16 The several training samples in our face detection experiments are all chosen from Fundação Educacional Inaciana (FEI) face data set.43 Since we just have few samples in hand, in this section the comparison is only made to the template matching method. As mentioned before, in the template matching methods, the LARK method15 shows good performance. So we take it as our baseline object detector.
First, we show the detection results of our strategy on MIT-CMU face data set. There are 45 images with 157 frontal faces of various sizes in our test set. The query image and training samples are all adjusted to the size . The scale of faces in the data set between the largest and smallest is from 0.4 to 2.0. One can see some of the results in Fig. 8. Although the target image is blurry or contains a cartoon human face, our detection method can localize the faces. Especially in Fig. 9, we detect 56 faces correctly among 57 faces and the precision rate is higher than the results in Refs. 15 and 44.
To make a fair comparison, we use the union LARK detection results from several images. For example, if there are six training samples, LARK processes them one by one as the query image. For each target image, we record the true positives of six queries and get the total number of true positives without repeat. In this way, this union multisamples detection result of LARK can be compared with our method fairly. In Fig. 10, we show the comparison between our method and LARK.15 The curve of our method is the average of five query images with the same training samples. One can find that our method is superior to the LARK.
The proposed method on three data sets achieves higher accuracy and a lower false alarm rate than that of the union LARK. The organization of the several training samples is more efficient than the one-by-one detection strategy. We draw ROC curves with respect to different query images from the same training samples and to the same query image on different training samples [see Figs. 11(a) and 11(b)] on MIT-CMU face data set. Figure 11 demonstrates that our detection strategy can achieve consistent precisions for both different training samples and different query images. This means our method is robust on different training samples and query images. In fact, we also obtained the same result on other data sets used in this paper. To describe the result clearly, we give the ROC curve on the MIT-CMU face data set.
From above, it can be seen that our detection strategy is consistent and robust on different query images and training samples. This is because our detection method has two steps, voting step (VS) and refined hypotheses step (RHS), which measure the object locally and integrally, respectively. Here, we show how these two steps affect our detection results. We compare the detection results of VS + RHS, VS and RHS on the Caltech data set (see Fig. 12). Each curve is drawn and averaged with the same seven query images and three training samples. One can see that the combination of both steps can get a higher precision rate than that of using each step alone, and that the voting strategy along has a higher accuracy than RHS. A similar conclusion can be drawn with other data sets, which are not shown here.
Generic Object Detection
We have already shown the detection results of our proposed method on the car and face data set. In this section, we use our strategy to some general real-world images containing hearts, flowers, and footballs. To the best of our knowledge, there does not exist a data set for object detection based on a few samples. So we download some real-world images from Google as our data set. One can find these images from our website http://cvlab.uestc.edu.cn/xupei. There are 34 images of red hearts with 49 positive samples and 40 images of sunflowers with 101 positive samples. In all of these data sets, the scale is from 0.2 to 1.8.
The detection examples can be found in Fig. 13. In the real world, the red-heart shape can be found with complex display. So, our detection results contain some false alarms.
Time Efficiency Comparison and Feature Comparison
Now we compare the efficiency between our proposed scheme and the detection scheme.15 For these two methods, there are the same two steps: feature construction and object detection. We compare the time efficiency of these two steps between our strategy and LARK. To formalize the efficiency, and are, respectively, defined as the evaluation time of the feature construction. and are the evaluation times of the detection step. Here, we define and to describe the time efficiency of LAS and LARK features, respectively, where
One can find the comparison results of LAS and LARK features in Table 3. In the experiment, we evaluate 10 testing times (each testing contains 30 images) to record the evaluation time of the two steps in both our method and the LARK, respectively. In Table 3, we can see that the construction time for LARK feature is more than that of our LAS feature. This is because the LAS feature just needs to compute the gradients and SVD decomposition to get three parameters. But for LARK feature, after SVD decomposition, the local kernel must be computed and then the principal component analysis (PCA) method is used to reduce the dimension. The superiority of our LAS feature is not just saving memory, but also cutting down the computing steps for each pixel. In Table 3, one can find that is also lower % than (blue). Our detection step is based on the idea of voting. In Ref. 15, the matrix cosine similarity for each and is computed. Then, the salience map is constructed which is very time-consuming. In our method, the time of the training step can be ignored. In the experiments, we find that the consuming time of the training step is of the whole running time for detecting objects in a target image.
Comparison of efficiency between LAS and LARK.
We further compare the performance with some art local features on Shechtman and Irani’s test set.12 We use our LAS feature to compare with gradient location-orientation-histogram (GLOH),45 LARK,15 Shape Context,46 SIFT,17 and Commission Internationale de L’Eclairage (CIE).12 The ROC curves can be seen in Fig. 14. Compared to previous works, our LAS feature is much better than SIFT, Shape Context, and GLOH in the case of a few samples. Comparing to CIE and LARK, our LAS feature is comparable or even better.
Conclusion and Future Work
In this paper, we proposed a generic objects detection method based on few samples. We used the local principal gradient orientation variation information, namely LAS, as our feature. The voting spaces are trained based on a few samples. Our detection method contains two steps. The first step is adopting a combination densely voting method in trained voting spaces to detect similar patches in target image. Through the construction of a voting space, the advantage of our approach is resilient to local deformation of appearance. Then, a refined hypotheses step is used to locate object accurately.
Compared with the state-of-the-art methods, our experiments confirm the effectiveness and efficiency of our method. Our LAS feature has more efficiency and memory saving than that of LARK. Besides, the strategy we proposed in this paper gives a method of object detection when the samples are limited. Previous template matching method is to detect objects using samples one by one, while our method is to organize several samples to detect objects once. In the future, we will extend our work to the problem of multiple-object detection and improve the efficiency further.
This work was supported in part by the National Natural Science Foundation of China (61375038) 973 National Basic Research Program of China (2010CB732501) and Fundamental Research Funds for the Central University (ZYGX2012YB028, ZYGX2011X015).
Pei Xu received his BS degree in computer science and technology from SiChuan University of Science and Engineering, ZiGong, China, in 2008 and his MS degree in condensed matter physics from University of Electronic Science and Technology of China, Chengdu, China, in 2011. He is currently a PhD student in University of Electronic Science and Technology of China, Chengdu, China. His current research interests include machine learning and computer vision.
Mao Ye received his PhD degree in mathematics from Chinese University of Hong Kong in 2002. He is currently a professor and director of CVLab at University of Electronic Science and Technology of China. His current research interests include machine learning and computer vision. In these areas, he has published over 70 papers in leading international journals or conference proceedings.
Xue Li is an Associate Professor in the School of Information Technology and Electrical Engineering at University of Queensland in Brisbane, Queensland, Australia. He obtained the Ph.D degree in Information Systems from the Queensland University of Technology 1997. His current research interests include Data Mining, Multimedia Data Security, Database Systems, and Intelligent Web Information Systems.
Lishen Pei received her BS degree in computer science and technology from Anyang Teachers College, Anyang, China, in 2010. She is currently an MS student in the University of Electronic Science and Technology of China, Chengdu, China. Her current research interests include action detection and action recognition in computer vision.
Pengwei Jiao received his BS degree in mathematics from Southwest Jiaotong University, Chengdu, China, in 2011. He is currently a postgraduate student in the University of Electronic Science and Technology of China, Chengdu, China. His current research interests are machine vision, visual surveillance, and object detection.