1 June 2009 Learning a subspace for face image clustering via trace ratio criterion
Author Affiliations +
Abstract
Face clustering is gaining ever-increasing attention due to its importance in optical image processing. Because traditional clustering methods do not specify the particular characters of the face image, they are not suitable for face image clustering. We propose a novel approach that employs the trace ratio criterion and specifies that the face images should be spatially smooth. The graph regularization technique is also applied to constrain that nearby images have similar cluster indicators. We alternately learn the optimal subspace and the clusters. Experimental results demonstrate that the proposed approach performs better than other learning methods for face image clustering.
Hou, Nie, Zhang, and Wu: Learning a subspace for face image clustering via trace ratio criterion

1.

Introduction

Human face clustering is an important research direction in the field of optical image processing. It has been successfully used in many fields. For example, in face image retrieval, if we can automatically cluster different kinds of face images, then it will beneficial to discover the latent similarities among various face images. The computational cost of searching for an interested face image will also reduce considerably.

Currently, a lot of research has been proposed for face image processing. However, most of it focuses on learning a subspace for face classification (i.e., classify the face images in the learned subspace with a certain number of labeled samples). The most prominent approach is called the Fisherface.1 It employs the supervised method, linear discriminant analysis (LDA),2 for face classification.

There is little work dedicated to learning a subspace for face image clustering. One direct way is to employ the traditional clustering techniques (e.g., K-means3) for clustering. Nevertheless, it is difficult for K-means to achieve a satisfied accuracy because the vector representations of face images have very high dimensionality (commonly, the number of pixels of a image). Then, one of the most famous works, which is named PCA+K -means, is proposed. It employs the unsupervised principle component analysis (PCA)2 technique to compute the famous Eigenface 1 and then applies the K-means approach for clustering. Although it has achieved prominent performances in many applications, PCA+K -means seems unsuitable for face image clustering. It takes into account few considerations about the special characters and manifold structures of face images.4 Recently, Ding extended the traditional supervised LDA method for clustering.5 The method, which is called LDA-Km, combines LDA with K-means and learns the subspace and clusters alternately. It has been reported that LDA-Km performs better than the K-means and the PCA+K -mean approaches.5 However, it also gives no considerations to the manifold structure and the spatial smooth character6 of face images. Additionally, LDA-Km applies the traditional ratio trace criterion, not the more prominent trace ratio criterion.7 The accuracy of LDA-Km is not as high as expected for face image clustering.

In this paper, we propose a new method, which is named as subspace clustering via trace ratio (SCTR) for face image clustering. First, it applies the trace ratio criterion, which has been shown more effective for discriminative subspace learning.7 Second, we employ the spatial smooth regularizer, which represents the particular character of images. Finally, we consider the manifold structure by intentionally adding the graph-smooth constraint. After computing the optimal clustering and learning the subspace alternately, we can finally derive the subspace and clustering results simultaneously. Experiments show that SCTR performs better than the state-of-the-art approaches.

2.

Leaning a Subspace for Face Image Clustering

2.1.

Problem Formulation

Assume X=[x1,x2,,xl] are vector representations of l face images, which are all of D1×D2 resolutions. Here xiRD for i=1,2,,N and D=D1×D2 . The number of clusters is predefined as C . The purpose of our method is to find the subspace in which these face images can be optimally clustered. More concretely, under a given criterion, we plan to find the transformation matrix WRD×d and simultaneously, the indicator matrix FRl×C . Here, d is the dimensionality of the subspace. Fij=1lj if xi belong to the jth cluster, and Fij=0 otherwise, where lj is the number of samples in the jth cluster.

It has been shown7 that the trace ratio criterion is much more effective than the ratio trace, which is widely used in the traditional LDA approach and its variants. We employ the trace ratio criterion in our approach. Mathematically, assume Y=[y1,y2,,yl] are the embddings of X (i.e., Y=WTX ). The within scatter matrix Sw and the total-scatter matrix St are defined as follows:

1.

Sw=j=1Ci=1lj(yimj)(yimj)T=YLwYT,
St=i=1l(yim)(yim)T=YLtYT,
where mj=(1lj)i=1ljyi (j=1,2,,C) is the mean of the samples in the j ’th cluster, m=(1l)i=1lyi is the mean of all samples. The corresponding Lw and Lt are

2

Lw=IFFT,Lt=I1leeT,
where e is an l -dimensional column vector with all ones. The trace ratio criterion proposed in Ref. 7 is

3

argminWTW=Itr(WTSwW)tr(WTStW).

We now consider the spatial character of images. Because an image represented in the plane is intrinsically a matrix. The pixels spatially close to each other may be correlated. Although we have D1×D2 pixels per image, this spatial correlation suggests the real number of freedom is far less. We employ a spatial regularizer (i.e., the Laplacian penalty) to constrain the coefficients to be spatially smooth.6 In brief, we define the Dj×Dj (j=1,2) matrices that yield the discrete approximation of the second-order derivation as follows:

4

Mj=1hj2(110121121011).
Here, hj=1Dj for j=1,2 . The discrete approximation for the two-dimensional Laplacian is a D×D matrix

5

Δ=M1I2+I1M2,
where Ij is the Dj×Dj identity matrix for j=1,2 . ⊗ represents the Kronecker product of two matrixes.

For a D1×D2 image vector x , Δx2 is proportional to the sum of squared differences between nearby grid points in that image. It can be used to measure the smoothness of an image on a D1×D2 lattice. Because each column vector of W can be regarded as a basis image, we add αΔTΔ to Sw where 0<α<1 is a balance parameter that controls the smoothness of the estimator.

Finally, as in most learning-based face image processing approaches1 we assume that the face images, which belong to the same cluster, are nearly lying on a low dimensional manifold. The graph Laplacian4 (i.e., L ) is employed to represent this character. Intuitively, the cluster indicators of face images belonging to the same cluster should be identical. In other words, if we regard that the cluster indicators are the output of a function defined on all face images, this intuition indicates that the predefined function should be smooth on these manifolds. Mathematically, it is equal to minimize tr(FTLF) .

In summary, we define SCTR as the solution to the following problem:

6

argminWTW=I,FTF=I(1β)tr[(1α)WTSwW+αWTΔTΔW]tr(WTStW)+βtr(FTLF)=argminWTW=I,FTF=I(1β)tr{(1α)WT[X(IFFT)XT+αΔTΔ]W}tr(WTXLtXTW)+βtr(FTLF),
where 0<β<1 is also a balance parameter.

2.2.

Solution

There are totally two different kinds of variables that should be optimized (i.e., W and F ). It is difficult to compute them simultaneously. We alternately optimize them.

2.2.1.

Stage one: fixing F and optimizing W

In this situation, the optimization problem in Eq. 6 becomes

7

argminWTW=Itr{WT[(1α)X(IFFT)XT+αΔTΔ]W}tr(WTXLtXTW).

It has the same form as the problem proposed in Ref. 7, except for a little differences in the formulation of the numerator. Thus, we can directly employ the same technique. It is an iterative procedure. For our problem, in the p ’th step, we solve a trace difference problem

argminWTW=Itr{WT[(1α)X(IFFT)XT+αΔTΔλpXLtXT]W},
where λp is the trace ratio value calculated from the previous projection matrix Wp1 in the previous step. Please see Ref. 7 for more details.

2.2.2.

Stage two: fixing W and optimizing F

In this case, because Lt and Δ have no relation with F , the optimization problem in Eq. 6 is equivalent to

8

argmaxFTF=Itr{FT[(1β)(1α)XTWWTXβtr(WTXLtXTW)L]F}.

This problem can be effectively solved by spectral decomposition technique. In fact, the optimal F to the problem in Eq. 8 is formed by the eigenvectors corresponding to the C largest eigenvalues of (1β)(1α)XTWWTXβtr(WTXLtXTW)L .

2.3.

Discussions

There are two essential parameters (i.e., 0<α<1 and 0<β<1 ) in our method. They control the smoothness of estimator and the balance of two objectives. When α=0 , SCTR totally ignores the spatial character of the face image. When α1 , SCTR tends to only choose a spatial smoothest basis and ignore the discriminative information. When β=0 , we ignore the smooth constraint on F . If β=1 , SCTR only concerns the smoothness of F .

Parameter determination is an essential task in most of the learning algorithms.6 Among various kinds of methods, grid search is probably the simplest and most widely used one for unsupervised learning. Because we have constrained 0<α<1 and 0<β<1 , we apply the grid search technique in this paper. More concretely, we set α and β by searching the grid {0.1,0.2,,0.9}×{0.1,0.2,,0.9} .

There are another three problems that should be indicated here. (i) Because SCTR is an iterative approach, a direct way to initialize the iteration is to first compute F by K -means and then optimize the problems in Eqs. 7, 8 alternately. (ii) The optimal F that maximizes Eq. 8 may not have the form of the indicator matrix that we specified in the previous part. We apply the discretization procedure in Ref. 8 to solve this problem. (iii) Because we have derived W and F , it is easy to compute the embedding and the cluster of a new image by using W and F directly.

3.

Experiments

We employ four different kinds of face image data sets for illustration. They are the Yale data, the ORL data, the Umist data, and the first 20 classes of Feret data. All these images are rescaled to 16×16 resolutions.

We compare SCTR with K-means, PCA+K -means, and LDA-Km. Two standard measurements—the accuracy (Acc) and normalized mutual information (NMI)—are used. The parameters α and β are determined by grid search and d=C1 . We randomly initialize K-means and repeat for 100 times. The average values of these Accs are listed in Table 1. Figure 1 shows the corresponding mean of these NMIs.

Fig. 1

Mean normalized mutual information (NMI) on the Yale, the ORL, the Umist and the Feret data sets.

060501_1_1.jpg

Table 1

Mean accuracy (Acc) results of different methods on four face image data sets.

K-means PCA+K -meansLDA-KmSCTR
Yale0.3790.3850.4590.491
ORL0.4970.5130.6070.674
Umist0.3810.3830.4680.501
Feret0.3740.3720.5260.551

As seen from Table 1 and Fig. 1, SCTR performs the best. It achieves both the highest mean Acc and the largest mean NMI in all cases. Because all the images are rescaled to be low resolution and the dimensionality is not so high, it seems that K-means and PCA+K -means have the similar performances on these data sets. LDA-Km performs better than K-means and PCA+K -means.

We have also done some experiments on the Yale data set with different parameters, which are set by the grid search. The Acc results versus different α and β are shown in Fig. 2. It seems that for Yale, smaller α and β are more suitable. The best parameter α=0.3 and β=0.1 .

Fig. 2

Acc versus α and β on Yale by grid search.

060501_1_2.jpg

4.

Conclusions

We propose a new subspace learning method for face image clustering. It integrates the spatial characters and the manifold structures of face images. By alternately computing the subspace and clusters, it performs best among all the involved approaches.

Acknowledgments

This work is partly supported by NSF China (Grants No. 60673090 and No. 60721003), the Hunan Provincial Innovation Foundation for Postgraduate and the Graduate School of National University of Defense Technology.

References

1.  P. Belhumeur, J. Hepanha, and D. Kriegman, “Eigenfaces vs. fisherfaces: recognition using class specific linear projection,” IEEE Trans. Pattern Anal. Mach. Intell.0162-8828 19(7), 711–720 (1997). 10.1109/34.598228 Google Scholar

2.  R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. Wiley-Interscience, Hoboken, NJ (2000). Google Scholar

3.  L. Parsons, E. Haque, and H. Liu, “Subspace clustering for high dimensional data: a review,” SIGKDD Explorations 6, 90–105 (2004). 10.1145/1007730.1007731 Google Scholar

4.  M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization: a geometric framework for learning from examples,” J. Mach. Learn. Res.1532-4435 7, 2399–2434 (2006). Google Scholar

5.  C. Ding and T. Li, “Adaptive dimension reduction using discriminant analysis and K-means clustering,” Proc. 24th Int. Conf. on Machine Learning pp. 521–528, ACM Press, Madison, WI (2007). 10.1145/1273496.1273562 Google Scholar

6.  T. Hastie, A. Buja, and R. Tibshirani, “Penalized discriminant analysis,” Ann. Stat.0090-5364 23, 73–102 (1995). 10.1214/aos/1176324456 Google Scholar

7.  H. Wang, S. Yan, D. Xu, X. Tang, and T. Huang, “Trace ratio vs. ratio trace for dimensionality reduction,” IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2007), June 2007, IEEE Computer Society, Washington, DC (2007). Google Scholar

8.  S. Yu and J. Shi, “Multiclass spectral clustering,” IEEE International Conference on Computer Vision (ICCV 2003), pp. 313–319, October 2003, IEEE Computer Society, Washington, DC (2003). Google Scholar

Chenping Hou, Feiping Nie, Changshui Zhang, Yi Wu, "Learning a subspace for face image clustering via trace ratio criterion," Optical Engineering 48(6), 060501 (1 June 2009). https://doi.org/10.1117/1.3149850
JOURNAL ARTICLE
3 PAGES


SHARE
Back to Top