Clustering-weighted SIFT-based classification method via sparse representation

Abstract. In recent years, sparse representation-based classification (SRC) has received significant attention due to its high recognition rate. However, the original SRC method requires a rigid alignment, which is crucial for its application. Therefore, features such as SIFT descriptors are introduced into the SRC method, resulting in an alignment-free method. However, a feature-based dictionary always contains considerable useful information for recognition. We explore the relationship of the similarity of the SIFT descriptors to multitask recognition and propose a clustering-weighted SIFT-based SRC method (CWS-SRC). The proposed approach is considerably more suitable for multitask recognition with sufficient samples. Using two public face databases (AR and Yale face) and a self-built car-model database, the performance of the proposed method is evaluated and compared to that of the SRC, SIFT matching, and MKD-SRC methods. Experimental results indicate that the proposed method exhibits better performance in the alignment-free scenario with sufficient samples.


Introduction
Sparse representation (SR) 1,2 has become a hot topic in recent years. SR considers a query signal $y$ as a linear representation of the columns in $A$, i.e., $y = Ax + e$, where $A$ is the dictionary (each column in $A$ is typically referred to as an atom), $x$ is a sparse representation coefficient vector over the dictionary $A$, and $e$ denotes the noise. In Ref. 3, Wright et al. presented a new method, sparse representation-based classification (SRC), which achieved high recognition accuracy in face recognition. Owing to its promising performance in image classification, SRC has been widely used in many pattern recognition applications, such as face recognition, 4,5 gender, 6 digit, 7,8 biological data, 9,10 and medical image 11,12 classification.
For robustness, many improved methods have been presented. To handle face recognition with contiguous occlusion, such as disguise or expression variation, a modular weighted global sparse representation method was proposed in Ref. 13, which divides the image into modules and determines the reliability of each module based on its sparsity and residual. A reconstructed image is then formed from the modules, weighted by their reliability, for robust recognition. To obtain rotation and scale invariance, the authors of Ref. 14 constructed a dictionary from a large number of vehicle images captured at different angles and distances, which made the dictionary large and the method time consuming. In Ref. 15, a practical face recognition system was presented, which gained robustness to registration and illumination by minimizing the sparsity of the registration error and by capturing a sufficient set of training illuminations for linearly interpolating practical lighting conditions, respectively. In Ref. 16, the authors presented a block-based face-recognition algorithm based on a sparse linear-regression subspace model via a locally adaptive dictionary constructed from past observable data (i.e., training samples). Although these methods obtain high recognition rates, they always require prealignment and a certain scale, i.e., they are more suitable for applications in constrained environments. To handle the problem of alignment, the authors of Ref. 17 introduced SIFT descriptors 18 into the SRC framework and proposed the multikeypoint descriptor SRC (MKD-SRC) method, which has achieved preliminary success in both holistic and partial face recognition. Additionally, a modified MKD-SRC based on Gabor ternary pattern (GTP) descriptors was proposed in Ref. 19. These two methods can be regarded as feature-based SRC methods, which show good robustness to misalignment and affine transformation and thus may extend the application of SRC.
Obviously, a feature-based dictionary is the core, and it may contain considerable useful information for recognition, which may be omitted with present methods.
Although several researchers who focus on SRC have paid attention to the similarity of atoms, 20-22 they only use it to optimize the dictionary rather than to improve the recognition rate. For example, in Ref. 22, the authors presented an efficient face recognition algorithm based on the SRC using an adaptive K-means method, which clustered similar atoms of the same class and merged them into one atom while preserving the accuracy. Obviously, the method has not considered the similarity of the atoms belonging to different classes, which will affect the recognition performance.
In this paper, focusing on the scenario of disguises or partial targets and scale and illumination or expression variation without alignment, we propose a clustering-weighted SIFT descriptor-based SRC (CWS-SRC) method.
The remainder of this paper is organized as follows. Motivation for the proposed method is given in Sec. 2. Section 3 presents the CWS-SRC method. The experimental results on the AR database, 23 the Yale face database, 24 and a self-built car-model database are shown in Sec. 4. The conclusions and future research directions are presented in Sec. 5.

Motivation
In this section, we first describe the principle of the MKD-SRC method. 17 Given a set of sample images collected from $c$ different subjects, $c$ subdictionaries $A_k$ ($k = 1, \ldots, c$) can be constructed by pooling all of the descriptors extracted from the samples of each subject, and a gallery dictionary $A = [A_1, \ldots, A_c]$ can be obtained. A probe image $Y$ can be denoted by a set of SIFT descriptors, i.e., $Y = [y_1, y_2, \ldots, y_m]$, where $y_i$ ($i = 1, \ldots, m$) is the $i$'th probe descriptor. Thus, the problem of recognizing $Y$ is converted to the problem of solving a multitask $l_1$-minimization problem:
$$\hat{X} = \arg\min_{X} \sum_{i=1}^{m} \|x_i\|_1, \quad \text{s.t.} \quad Y = AX, \qquad (1)$$
where each column in $A$ is a descriptor extracted from the sample images, $X = [x_1, x_2, \ldots, x_m]$ is the sparse coefficient matrix, and $\|\cdot\|_1$ denotes the $l_1$ norm of a vector. Finally, the following multitask SRC is adopted to determine the identity of the probe image:
$$\mathrm{identity}(Y) = \arg\min_{k} \frac{1}{m} \sum_{i=1}^{m} \|y_i - A_k \delta_k(\hat{x}_i)\|_2^2, \qquad (2)$$
where $\delta_k(\cdot)$ is a function that selects only the coefficients corresponding to the $k$'th class, and $\|\cdot\|_2$ denotes the $l_2$ norm of a vector. With a SIFT descriptor-based dictionary, MKD-SRC 17 has not only successfully resolved the problem of alignment, but also handled the affine transformation to some extent. Although several images, or even one, per subject are sufficient as samples for face recognition with the MKD-SRC method, 17 this approach may not always work well for a general three-dimensional (3-D) target, which may be due to different application requirements. For frontal face recognition, a few (even one) samples are sufficient. For a general 3-D target, more sample images are necessary for recognizing an image in an arbitrary view. For example, for vehicle recognition, rotation invariance is important, and many more vehicle images taken from different angles are crucial. 14 Such images are often similar to one another. In such scenarios, there will be more similar SIFT descriptors. For convenience, groups of similar descriptors in the dictionary are called similar subsets. These subsets influence the sparse representation result of the orthogonal matching pursuit (OMP) algorithm. 25 The reason is explained next.
It is known that with OMP, the sparsest linear combination of y is obtained by calculating the correlation and projecting orthogonally, alternately, and iteratively. OMP selects the atom with the highest correlation to the current residual at each step. Once the atom is selected, the signal y is orthogonally projected to the space spanned by the selected atoms. The residual is subsequently recomputed, and the process is repeated. Though the most correlated atom is selected in each iteration, the final linear combination of the atoms may not be the best representation for y. It seems that such a SIFT descriptor-based dictionary is far from the requirement of the restricted isometry property (RIP), 26,27 which is discussed in Ref. 28. However, the distribution of similar descriptors in classes can characterize their discrimination. 29 Therefore, studying and utilizing the distribution of similar descriptors to improve recognition performance are beneficial.
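The OMP iteration described above can be illustrated with a minimal pure-Python sketch. It is not the implementation used in the experiments; the function names `dot`, `solve`, and `omp`, the stopping tolerance `tol`, and the use of normal equations for the orthogonal projection are all assumptions made for illustration, and the atoms are assumed to have unit norm.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(G, b):
    """Solve the small system G z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(G, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = s / M[r][r]
    return z

def omp(atoms, y, sparsity, tol=1e-10):
    """Greedy sparse coding of y over unit-norm atoms.

    Returns a dict {atom index: coefficient} with at most `sparsity` entries.
    """
    residual = list(y)
    support, coeffs = [], []
    for _ in range(sparsity):
        # Step 1: select the atom most correlated with the current residual.
        scores = [abs(dot(a, residual)) for a in atoms]
        best = max(range(len(atoms)), key=scores.__getitem__)
        if scores[best] < tol or best in support:
            break
        support.append(best)
        # Step 2: project y orthogonally onto the span of the selected atoms
        # (least squares via the normal equations G z = b).
        G = [[dot(atoms[i], atoms[j]) for j in support] for i in support]
        b = [dot(atoms[i], y) for i in support]
        coeffs = solve(G, b)
        # Step 3: recompute the residual.
        residual = [
            yi - sum(c * atoms[i][d] for c, i in zip(coeffs, support))
            for d, yi in enumerate(y)
        ]
    return dict(zip(support, coeffs))

# Toy usage: three orthogonal atoms, a signal built from atoms 0 and 2.
code = omp([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [2.0, 0.0, 0.5], sparsity=2)
```

Because the selection in Step 1 is greedy, two nearly identical atoms from different classes compete for the same residual, and only one of them is typically chosen; this is the effect of similar subsets discussed above.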

Proposed Approach
As mentioned in Sec. 2, considerable discriminative information may be included in similar SIFT descriptors, which will affect the recognition rate. To tackle this problem, we propose a clustering-weighted SIFT descriptor-based SRC method in this paper.

Extracting the SIFT descriptors
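Given a set of sample images of $c$ different subjects, we extract the SIFT descriptors 18 $a \in \mathbb{R}^{128 \times 1}$ from them and subsequently construct the following dictionary:
$$A = [A_1, \ldots, A_c] = [a_{11}, \ldots, a_{1M_1}, \ldots, a_{k1}, \ldots, a_{kM_k}, \ldots, a_{c1}, \ldots, a_{cM_c}], \qquad (3)$$
where the vector $a_{ki}$ denotes the $i$'th descriptor extracted from images of the $k$'th subject, and $M_k$ is the number of descriptors of the $k$'th subject. Then, $T = \sum_{k=1}^{c} M_k$ is the total number of atoms in $A$.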

Clustering for each atom in A according to similarity
In this paper, the similarity is measured by the normalized inner product $s = a_i \cdot a_j / (|a_i||a_j|)$. If $s$ is greater than a threshold $t_s$, atoms $a_i$ and $a_j$ are treated as similar. For each atom $a_j$ in the dictionary, we cluster the atoms similar to it and pool them together as a subset $C_j$. Then, $T$ clustering subsets, denoted as $C = \{C_j = [a_1, \ldots, a_{G_j}],\ j = 1, \ldots, T\}$, are obtained, where $G_j$ is the number of descriptors in the $j$'th subset.
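The clustering step can be sketched as follows. This is a brute-force illustration rather than the implementation used in the experiments; the names `cosine` and `cluster_subsets` are hypothetical, and the sketch assumes that each subset $C_j$ contains the atom $a_j$ itself (its similarity to itself is 1).

```python
import math

def cosine(u, v):
    """Normalized inner product s = u.v / (|u||v|)."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def cluster_subsets(atoms, t_s=0.97):
    """For each atom a_j, collect the indices of all atoms whose similarity
    to a_j exceeds the threshold t_s. Returns T index lists, one per atom."""
    T = len(atoms)
    return [
        [i for i in range(T) if cosine(atoms[i], atoms[j]) > t_s]
        for j in range(T)
    ]

# Toy usage with 2-D "descriptors": atoms 0 and 1 are nearly parallel.
subsets = cluster_subsets([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
```

The sketch compares all pairs of atoms, so its cost grows quadratically with $T$; for a large dictionary, an approximate nearest-neighbor search would be a natural substitute.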

Determining the Weight of the Atoms in Dictionary A
To resolve the multitask problem, we introduce a weighted-voting classifier in this paper. The primary challenge is how to assign the appropriate weight to each atom in the dictionary.

Relationship between the distribution of the similar atoms and their weight
After clustering, we obtain $T$ clustering subsets. Similar atoms in each subset $C_j$ may belong to either the same or different classes. The distribution of atoms determines how discriminative the corresponding atom is in dictionary $A$. Consider the extreme case: if the atoms in subset $C_j$ all belong to the $i$'th class, atom $a_j$ is the most representative and discriminative for the $i$'th class. In this instance, if a probe descriptor matches only this atom via the sparse representation, we can reliably deduce that it belongs to the $i$'th class. Otherwise, if the similar atoms of a subset are distributed over many classes, a misjudgment is likely to occur. Therefore, considering the distribution of similar atoms in a subset, we can infer the following: (1) for sufficient samples, if the atoms of subset $C_j$ concentrate on the same class as $a_j$, $a_j$ can be regarded as common and representative for that class. The larger the quantity of similar atoms in $C_j$ that belong to the same class as $a_j$, the more important $a_j$ is. We call this intraclass similarity; (2) if a large percentage of similar atoms belong to a certain class, i.e., the distribution is more concentrated, the corresponding atom characterizes the class more effectively and has greater discrimination ability. Conversely, if the distribution is dispersed, the discrimination ability of the corresponding atom is smaller. We refer to this as interclass discrimination.
The purpose of the weighted method is to find the common and representative atoms for each subject and attach a weight to them. The weight of one atom is determined by both its intraclass similarity and interclass discrimination, which will be presented next.
Given a clustering subset $C_j$ ($j = 1, \ldots, T$) and the corresponding atom $a_j$, we determine a quantity vector $N_j = [n_1^j, \ldots, n_k^j, \ldots, n_c^j]^T$, $k \in \{1, \ldots, c\}$, where $n_k^j$ denotes the quantity of atoms of the $k$'th class in the $j$'th subset $C_j$. If there is no descriptor of the $k$'th class, $N_j$ does not include $n_k^j$. We determine the weight of the atom $a_j$ by two factors: the intraclass similarity and the interclass discrimination.

Calculating the intraclass similarity
For the atom $a_j$ in $A$, suppose it belongs to the $k$'th class; then its intraclass similarity is proportional to the quantity of similar atoms belonging to the $k$'th class in $C_j$, which is denoted as
$$w_1^j = \frac{n_k^j}{P_k}, \qquad (4)$$
where $P_k = \max_j \{n_k^j\}$, $j = 1, \ldots, T$, i.e., $P_k$ is the largest quantity of similar atoms of the $k$'th class over the $T$ clustering subsets. Thus, $w_1^j$ is between 0 and 1 and measures the importance of the atom $a_j$ for the $k$'th class. The larger the quantity of similar atoms of one class, the more important the corresponding atom is. If the quantity of similar atoms of the $k$'th class in $C_j$ is the largest among all $T$ subsets, the intraclass similarity is 1; it becomes smaller as this quantity decreases.
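The quantity vectors and the intraclass similarity can be sketched as follows, assuming the similarity is the ratio of the class count in $C_j$ to the largest such count $P_k$, as the description above indicates. The function names, the representation of $N_j$ as a `Counter` (which naturally omits absent classes), and the assumption that each subset contains its own atom are illustrative choices, not part of the original method description.

```python
from collections import Counter

def quantity_vectors(subsets, labels):
    """N_j for every subset: a Counter mapping class -> number of atoms of
    that class in C_j. Classes with no atom in C_j are simply absent."""
    return [Counter(labels[i] for i in subset) for subset in subsets]

def intraclass_similarity(subsets, labels):
    """w_1^j = n_k^j / P_k, where k is the class of atom a_j and P_k is the
    largest count of class-k atoms over all T subsets."""
    N = quantity_vectors(subsets, labels)
    P = Counter()
    for N_j in N:
        for k, n in N_j.items():
            P[k] = max(P[k], n)
    # Assumes subset j contains atom j, so N[j][labels[j]] >= 1.
    return [N[j][labels[j]] / P[labels[j]] for j in range(len(subsets))]

# Toy usage: four atoms, the first three from class "a", the last from "b".
labels = ["a", "a", "a", "b"]
subsets = [[0, 1, 2], [0, 1], [0, 2, 3], [2, 3]]
w1 = intraclass_similarity(subsets, labels)
```

In this toy case, atom 0 has the largest number of same-class neighbors among class "a" atoms and therefore receives the maximum similarity of 1.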

Calculating the interclass discrimination
The interclass discrimination of the atoms is determined by the distribution of all similar atoms in the corresponding clustering subset. We adopt the following method to measure the interclass discrimination of atoms:
$$w_2^j = \frac{\|N_j\|_2^2}{\|N_j\|_1^2}. \qquad (5)$$
How does it stand for discrimination? We examine this question briefly. For simplicity, in the following equations, the superscript or subscript $j$ for the $j$'th clustering subset is omitted; for example, $n_k^j$ is replaced with $n_k$, $N$ replaces $N_j$, etc. Thus, according to the definition of the norm, Eq. (5) can be written as
$$w_2 = \frac{\sum_{r \in \{1, \ldots, c\}} n_r^2}{\left(\sum_{r \in \{1, \ldots, c\}} n_r\right)^2}. \qquad (6)$$
The average and variance of the elements in $N$ are defined as
$$\bar{n} = \frac{\sum_{r \in \{1, \ldots, c\}} n_r}{\|N\|_0}, \qquad \sigma = \frac{\sum_{r \in \{1, \ldots, c\}} (n_r - \bar{n})^2}{\|N\|_0}. \qquad (7)$$
Using Eq. (7), Eq. (6) becomes
$$w_2 = \frac{\bar{n} \cdot (\|N\|_0 \bar{n})}{(\|N\|_0 \bar{n})^2} + \frac{1}{\|N\|_0 \bar{n}^2} \cdot \frac{\sum_{r \in \{1, \ldots, c\}} (n_r - \bar{n})^2}{\|N\|_0} = \frac{1}{\|N\|_0} + \frac{\sigma}{\|N\|_0 \bar{n}^2}, \qquad (8)$$
where $\|\cdot\|_0$ denotes the $l_0$ norm of a vector. Equation (8) shows that $w_2$ is positively correlated to the variance of $N$ and negatively correlated to the average and the $l_0$ norm of $N$, and its meaning can be highlighted with two extreme cases: (1) if the similar atoms in $C_j$ all belong to the $k$'th class, i.e., $\|N\|_0 = 1$, $\sigma = 0$, $w_2 = 1$, the corresponding atom $a_k$ is the most discriminative for the class; (2) if the atoms in $C_j$ are equally distributed among all classes, i.e., $\|N\|_0 = c$, $\sigma = 0$, $w_2 = 1/c$, $a_k$ is the least discriminative, and the discriminative power decreases as the number of classes increases. Thus, in a clustering subset, Eq. (5) shows the relationship between the distribution of the atoms over all classes and the interclass discrimination.
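The relationship between the norm form and the mean/variance form can be checked numerically with a short sketch. It assumes that the interclass discrimination is the squared ratio of the $l_2$ norm to the $l_1$ norm of $N$, which is the form consistent with the two terms of the expansion in Eq. (8) and with the two extreme cases; the function names are hypothetical.

```python
def interclass_discrimination(counts):
    """Assumed norm form: ||N||_2^2 / ||N||_1^2 for the nonzero class counts."""
    total = sum(counts)
    return sum(n * n for n in counts) / (total * total)

def via_mean_and_variance(counts):
    """Equivalent form: 1/||N||_0 + sigma / (||N||_0 * mean^2)."""
    k = len(counts)                      # ||N||_0: number of classes present
    mean = sum(counts) / k
    var = sum((n - mean) ** 2 for n in counts) / k
    return 1.0 / k + var / (k * mean * mean)

# Extreme case (1): all similar atoms in one class.
single_class = interclass_discrimination([5])
# Extreme case (2): similar atoms spread equally over four classes.
uniform = interclass_discrimination([2, 2, 2, 2])
# Intermediate case: the two forms agree up to rounding.
skewed = (interclass_discrimination([4, 1]), via_mean_and_variance([4, 1]))
```

Under this assumption the value ranges from $1/c$ for a uniform spread over $c$ classes to 1 for a subset confined to one class, so a more concentrated distribution always yields a larger value.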

Calculating the weight for each atom
Synthesizing Eqs. (4) and (5), we can measure the weight of $a_j$, denoted $w_j$ in Eq. (9). After computing the weights of all atoms in dictionary $A$, we obtain the weight vector as follows:
$$w = [w_1, w_2, \ldots, w_T]^T. \qquad (10)$$

Weighted-Voting Classifier
If there are $m$ SIFT descriptors detected for a probe image, we have
$$Y = [y_1, y_2, \ldots, y_m]. \qquad (11)$$
For $y_i$ ($i = 1, 2, \ldots, m$), we have the following sparse representation over the gallery dictionary $A$:
$$\hat{x}_i = \arg\min_{x_i} \|x_i\|_1, \quad \text{s.t.} \quad y_i = A x_i. \qquad (12)$$
If $y_i$ belongs to some class, the nonzero coefficients in the vector $\hat{x}_i$ will be concentrated on that class, i.e., the values for that class in $\hat{x}_i$ are larger. 3 In Ref. 17, the authors demonstrated that the concentration of the sparse representation coefficients can determine the best matching class. Thus, we have the following weighted-voting function to determine the identity of the probe image:
$$\max_k w_k(Y) = \sum_{i=1}^{m} \|\delta_k(\hat{x}_i \circ w)\|_1, \quad k = 1, \ldots, c, \qquad (13)$$
where $\hat{x}_i \circ w = [\hat{x}_{ij} \cdot w_j]_{1 \times T}$, $j = 1, \ldots, T$, is the Hadamard product of the two vectors.
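The weighted-voting function in Eq. (13) can be sketched as follows. The sparse codes are represented as dictionaries that map atom indices to coefficients, and the per-atom weights are taken as given; the function name `weighted_vote` and this data layout are assumptions made for illustration.

```python
def weighted_vote(codes, weights, labels):
    """Score each class k by sum_i || delta_k(x_i o w) ||_1 and return the
    class with the largest score together with all scores.

    codes:   list of m sparse codes, each a dict {atom index j: coefficient x_ij}
    weights: list of T per-atom weights w_j
    labels:  list of T class labels, one per atom
    """
    scores = {}
    for x_i in codes:
        for j, x_ij in x_i.items():
            k = labels[j]
            # Hadamard product entry x_ij * w_j, accumulated in the l1 sense.
            scores[k] = scores.get(k, 0.0) + abs(x_ij * weights[j])
    return max(scores, key=scores.get), scores

# Toy usage: three probe descriptors coded over four atoms of two classes.
labels = ["a", "a", "b", "b"]
weights = [1.0, 0.5, 0.2, 0.9]
codes = [{0: 0.8, 2: 0.3}, {1: 0.4, 3: -0.5}, {2: 1.0}]
identity, scores = weighted_vote(codes, weights, labels)
```

In this toy case the unweighted coefficient mass favors class "b", whereas the weighted vote favors class "a", because atom 2 carries a low weight; this illustrates how the atom weights can change the decision.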

Summary
The proposed CWS-SRC method can be summarized as follows:
1. Extract the SIFT descriptors from the sample images and construct the dictionary $A$ denoted as Eq. (3).
2. Cluster by similarity and obtain $T$ clustering subsets.
3. Compute the weight of each atom in $A$ using Eq. (9) and form the weight vector using Eq. (10).
4. Compute the sparse representation of each SIFT descriptor detected in a probe image, and then obtain the identity of the probe image by passing the SRC result of each descriptor to the weighted-voting classifier in Eq. (13).

Experiments
In this paper, three databases, i.e., the AR database, 23 the Yale face database, 24 and a self-built car-model database, are used for evaluation. A performance comparison is made among the proposed method, the SIFT matching approach, 18 the MKD-SRC method, 17 and the original SRC method. 3

Holistic Face Recognition with Occlusion
This experiment was conducted on the AR database. The AR database contains 120 subjects, including 65 males and 55 females. The images were captured in two different sessions, with different expressions and occlusions, such as sunglasses, scarf, and so on. For each subject, 26 images were taken, of which 14 images are nonoccluded. We randomly selected three images from the nonoccluded ones as samples and all occluded ones as probes. Thus, there were 360 face images in the sample set and 1440 images in the probe set. All images were cropped to 128 × 170 pixels. No alignment has been performed between the probes and the samples. Some examples of the sample and the probe are shown in Fig. 1.
To ascertain the relationship between the recognition performance and the similarity threshold $t_s$, we examined different values of $t_s$ and evaluated the resulting performance in terms of accuracy. The curve is shown in Fig. 2. Based on this curve, we set the value of $t_s$ to 0.97, which was also found to be suitable for the other databases and may be used as an empirical value. For the recognition rate, we compared the proposed CWS-SRC method to the other three algorithms. Following the experimental settings, we use 10 random splits of the data for the experiment. The average and deviation results of the algorithms are listed in Table 1. The results show that CWS-SRC achieves the highest recognition rate, 93.89% ± 0.84 ($t_s = 0.97$), which is slightly higher than that of MKD-SRC and much higher than those of the other methods. Because no alignment has been performed between the sample and the probe sets, the recognition rate of SRC is considerably lower. Therefore, for occluded holistic face recognition without the alignment process, the CWS-SRC method can achieve a better performance.

Partial Face Recognition with Arbitrary Patch
The cropped Yale database consists of 165 frontal face images of 15 subjects with an image size of 170 × 230. We randomly selected two images per subject as samples and the remaining as the probes. For each probe image, one patch of random size h × w at a random position was cropped as a partial face, where h and w were randomly selected from (120,180) and (90,130), respectively. Thus, there were 135 partial images (nine images per subject) in the probe set and 30 images in the sample set (two images per subject). Examples are shown in Fig. 3.
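The random cropping protocol can be sketched as follows. The sketch only draws the patch geometry and does not read any image; it assumes that the stated image size 170 × 230 is width × height, that h and w are integers drawn uniformly with inclusive bounds, and that the patch must lie entirely inside the image. The function name `random_patch` and its parameters are hypothetical.

```python
import random

def random_patch(img_w=170, img_h=230, h_range=(120, 180), w_range=(90, 130), rng=random):
    """Draw a random patch (left, top, w, h) that fits inside the image."""
    h = rng.randint(*h_range)           # patch height, inclusive bounds
    w = rng.randint(*w_range)           # patch width, inclusive bounds
    top = rng.randint(0, img_h - h)     # keep the patch inside the image
    left = rng.randint(0, img_w - w)
    return left, top, w, h

# One partial-face geometry per probe image, with a fixed seed for repeatability.
rng = random.Random(0)
patches = [random_patch(rng=rng) for _ in range(135)]
```

Each tuple can then be used to slice the corresponding probe image, for example as rows `top` to `top + h` and columns `left` to `left + w`.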
The threshold value of the similarity $t_s$ is still 0.97. Because the original SRC algorithm is unsuited to partial or scale variation scenarios, only three methods are compared in this part. Following the experimental settings, we use 10 random splits of the data for the experiment. The performance of the remaining three methods is shown in Table 2.
Fig. 2 The relationship between the recognition rate and the threshold value of similarity.

Car Model Image Recognition with Different Scales and Pitch Angles
The car-model database is self-built and is captured using the equipment shown in Fig. 4. By adjusting the photography parameters, e.g., distance, pitch angle, illumination, we can capture car images of different scales and postures. The database consists of 10 vehicles (e.g., Touran, Tiguan, Polo, Passat, etc.), which are shown in Fig. 5(a). Examples of the sample and probe set are shown in Figs. 5(b) and 5(c), whose photography parameters are listed in Table 3.
In this experiment, we took different quantities of the samples to evaluate the performance of the CWS-SRC method. The quantity of the sample set per subject was increased from 20 to 60 with a step of 10, and the newly added sample images were randomly selected. Simultaneously, the number of similar descriptors grew rapidly. The experimental results are shown in Fig. 6 (where t s ¼ 0.97). It is shown that the CWS-SRC and the MKD-SRC methods are superior to the SIFT matching. With the quantity of sample images increasing, the result shows that the CWS-SRC method is more suitable for a target recognition task when many more samples are available.
The results of the three experiments demonstrate that the weighted-voting classifier based on the similarity of features has contributed to improving the recognition rate, and the proposed CWS-SRC method can obtain a better performance in alignment-free scenarios and also exhibits good robustness for scale variation and affine transformation.
Comparing the experimental results, we find that the result of the holistic face with an occlusion is the best, possibly due to its relatively simple experimental condition. The result shows that sufficient information is necessary to improve the performance of the SRC-based method; therefore, it makes sense to explore optimization based on the similarity of the features.

Conclusions and Future Work
In this work, a novel framework for robust target recognition with sufficient sample images is proposed, the CWS-SRC method. With this method, each image is represented by a set of SIFT descriptors. First, we obtain subsets by clustering based on the similarity. Next, based on the subsets, we calculate each atom's weight, and a weighted-voting classifier is created. Finally, each descriptor detected in a probe image can be sparsely represented by the dictionary, and the identity of the probe image can be inferred via the classifier. We evaluated the proposed approach on three conditions, i.e., the holistic face with occlusion (AR database), the partial face (Yale database), and the car-model with affine transformation and scale variation. Compared to the SIFT matching, the MKD-SRC and the original SRC methods, the experimental results clearly and consistently indicate that the proposed method is more robust with an increase in the number of sample images for alignment-free image recognition. Meanwhile, there are still methods that may improve the robustness, such as dictionary optimization, which will be studied in the future.
Bo Sun received his BSc degree in computer science from Beihang University, China, and his MSc and PhD degrees from Beijing Normal University, China. He is currently a professor in the Department of Computer Science and Technology at Beijing Normal University. His research interests include pattern recognition, natural language processing, and information systems. He is a member of ACM and a senior member of the China Society of Image and Graphics.