## 1. Introduction

Human face clustering is an important research direction in the field of optical image processing and has been used successfully in many applications. For example, in face image retrieval, automatically clustering different kinds of face images helps to discover the latent similarities among them, and the computational cost of searching for a face image of interest is also reduced considerably.

Currently, much research has been devoted to face image processing. However, most of it focuses on learning a subspace for face classification (i.e., classifying the face images in the learned subspace with a certain number of labeled samples). The most prominent approach is the *Fisherface*,^{1} which employs the supervised method of *linear discriminant analysis* (LDA)^{2} for face classification.

There is little work dedicated to learning a subspace for face image clustering. One direct way is to employ traditional clustering techniques (e.g., K-means^{3}). Nevertheless, it is difficult for K-means to achieve satisfactory accuracy because the vector representations of face images have very high dimensionality (commonly, the number of pixels of an image). One of the most famous remedies, named $\mathrm{PCA}+\mathrm{K}$-means, employs the unsupervised *principal component analysis* (PCA)^{2} technique to compute the famous *Eigenface*^{1} and then applies K-means for clustering. Although it has achieved prominent performance in many applications, $\mathrm{PCA}+\mathrm{K}$-means seems unsuitable for face image clustering: it takes little account of the special characteristics and manifold structures of face images.^{4} Recently, Ding extended the traditional supervised LDA method for clustering.^{5} The resulting method, called LDA-Km, combines LDA with K-means and learns the subspace and the clusters alternately. It has been reported that LDA-Km performs better than the K-means and $\mathrm{PCA}+\mathrm{K}$-means approaches.^{5} However, it also gives no consideration to the manifold structure and the spatially smooth character^{6} of face images. Additionally, LDA-Km applies the traditional ratio trace criterion rather than the more prominent trace ratio criterion.^{7} Consequently, the accuracy of LDA-Km is not as high as expected for face image clustering.

In this paper, we propose a new method, named *subspace clustering via trace ratio* (SCTR), for face image clustering. First, it applies the trace ratio criterion, which has been shown to be more effective for discriminative subspace learning.^{7} Second, we employ a spatial smoothness regularizer, which captures a particular characteristic of images. Finally, we consider the manifold structure by adding a graph-smoothness constraint. By alternately computing the optimal clustering and learning the subspace, we derive the subspace and the clustering results simultaneously. Experiments show that SCTR performs better than the state-of-the-art approaches.

## 2. Learning a Subspace for Face Image Clustering

### 2.1. Problem Formulation

Assume $X=[{x}_{1},{x}_{2},\dots ,{x}_{l}]$ are the vector representations of $l$ face images, all of resolution ${D}_{1}\times {D}_{2}$ . Here ${x}_{i}\in {\mathbb{R}}^{D}$ for $i=1,2,\dots ,l$ and $D={D}_{1}\times {D}_{2}$ . The number of clusters is predefined as $C$ . The purpose of our method is to find the subspace in which these face images can be optimally clustered. More concretely, under a given criterion, we seek the transformation matrix $W\in {\mathbb{R}}^{D\times d}$ and, simultaneously, the indicator matrix $F\in {\mathbb{R}}^{l\times C}$ . Here, $d$ is the dimensionality of the subspace, ${F}_{ij}=1/\sqrt{{l}_{j}}$ if ${x}_{i}$ belongs to the $j\text{th}$ cluster, and ${F}_{ij}=0$ otherwise, where ${l}_{j}$ is the number of samples in the $j\text{th}$ cluster.
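The scaled indicator matrix can be built directly from a label vector. The sketch below (our own helper name, assuming cluster labels coded as $0,\dots ,C-1$) also illustrates that this scaling yields ${F}^{T}F=I$ whenever every cluster is nonempty:

```python
import numpy as np

def indicator_matrix(labels, C):
    """Build the scaled cluster-indicator matrix F (l x C).

    F[i, j] = 1/sqrt(l_j) if sample i is in cluster j, else 0,
    where l_j is the size of cluster j.  With this scaling the
    nonzero columns of F are orthonormal, so F^T F = I.
    """
    labels = np.asarray(labels)
    F = np.zeros((labels.size, C))
    for j in range(C):
        members = labels == j
        lj = members.sum()
        if lj > 0:
            F[members, j] = 1.0 / np.sqrt(lj)
    return F
```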

It has been shown^{7} that the trace ratio criterion is much more effective than the ratio trace criterion, which is widely used in the traditional LDA approach and its variants. We therefore employ the trace ratio criterion in our approach. Mathematically, assume $Y=[{y}_{1},{y}_{2},\dots ,{y}_{l}]$ are the embeddings of $X$ (i.e., $Y={W}^{T}X$ ). The within-class scatter matrix ${S}_{\mathrm{w}}$ and the total scatter matrix ${S}_{\mathrm{t}}$ are defined as follows:

## Eq. 1

$${S}_{\mathrm{w}}=X{L}_{\mathrm{w}}{X}^{T},\phantom{\rule{1em}{0ex}}{S}_{\mathrm{t}}=X{L}_{\mathrm{t}}{X}^{T},$$

## Eq. 2

$${L}_{\mathrm{w}}=I-F{F}^{T},\phantom{\rule{1em}{0ex}}{L}_{\mathrm{t}}=I-\frac{1}{l}e{e}^{T},$$

where $e\in {\mathbb{R}}^{l}$ is the all-ones vector. Under the trace ratio criterion, the subspace is sought by

## Eq. 3

$$\mathrm{arg}\phantom{\rule{0.2em}{0ex}}\underset{{W}^{T}W=I}{\mathrm{min}}\phantom{\rule{0.2em}{0ex}}\frac{\mathrm{tr}\left({W}^{T}{S}_{\mathrm{w}}W\right)}{\mathrm{tr}\left({W}^{T}{S}_{\mathrm{t}}W\right)}.$$

We now consider the spatial character of images. Because an image represented in the plane is intrinsically a matrix, pixels spatially close to each other tend to be correlated. Although there are ${D}_{1}\times {D}_{2}$ pixels per image, this spatial correlation suggests that the true number of degrees of freedom is far smaller. We employ a spatial regularizer (i.e., the Laplacian penalty) to constrain the coefficients to be spatially smooth.^{6} In brief, we define the ${D}_{j}\times {D}_{j}$ $(j=1,2)$ matrices that yield the discrete approximation of the second-order derivative as follows:

## Eq. 4

$${M}_{j}=\frac{1}{{h}_{j}^{2}}\left(\begin{array}{ccccc}-1& 1& & & 0\\ 1& -2& 1& & \\ & \cdots & \cdots & \cdots & \\ & & 1& -2& 1\\ 0& & & 1& -1\end{array}\right),$$

where ${h}_{j}$ is the discretization step. For a ${D}_{1}\times {D}_{2}$ image vector $x$ , ${\Vert \Delta x\Vert}^{2}$ is proportional to the sum of squared differences between nearby grid points in that image, where $\Delta $ is the two-dimensional discrete Laplacian assembled from ${M}_{1}$ and ${M}_{2}$ . It can be used to measure the smoothness of an image on a ${D}_{1}\times {D}_{2}$ lattice. Because each column vector of $W$ can be regarded as a basis image, we add $\alpha {\Delta}^{T}\Delta $ to $(1-\alpha ){S}_{\mathrm{w}}$ , where $0<\alpha <1$ is a balance parameter that controls the smoothness of the estimator.
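As a concrete sketch (our own helper names, not the authors' code), the scatter matrices and the penalty matrices can be formed as below; we assemble the 2-D operator $\Delta $ as the Kronecker sum of ${M}_{1}$ and ${M}_{2}$ , which is one common choice and an assumption on our part:

```python
import numpy as np

def scatter_matrices(X, F):
    """S_w = X L_w X^T and S_t = X L_t X^T, with
    L_w = I - F F^T and L_t = I - (1/l) e e^T.

    X is D x l (one image vector per column); F is the
    scaled l x C indicator matrix."""
    l = X.shape[1]
    Lw = np.eye(l) - F @ F.T               # L_w = I - F F^T
    Lt = np.eye(l) - np.ones((l, l)) / l   # L_t = I - (1/l) e e^T
    return X @ Lw @ X.T, X @ Lt @ X.T

def second_diff(Dj, hj=1.0):
    """The D_j x D_j second-difference matrix M_j."""
    M = np.diag(-2.0 * np.ones(Dj)) \
        + np.diag(np.ones(Dj - 1), 1) + np.diag(np.ones(Dj - 1), -1)
    M[0, 0] = M[-1, -1] = -1.0   # one-sided differences at the boundary
    return M / hj**2

def laplacian_penalty(D1, D2):
    """A 2-D discrete Laplacian Delta for row-stacked D1 x D2 images,
    assembled as the Kronecker sum of M_1 and M_2 (an assumed choice)."""
    return np.kron(second_diff(D1), np.eye(D2)) \
        + np.kron(np.eye(D1), second_diff(D2))
```

A constant image is perfectly smooth, so $\Delta x=0$ for it; this is a quick sanity check on the construction.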

Finally, as in most learning-based face image processing approaches,^{1} we assume that face images belonging to the same cluster lie near a low-dimensional manifold. The graph Laplacian^{4} (i.e., $L$ ) is employed to represent this characteristic. Intuitively, the cluster indicators of face images belonging to the same cluster should be identical. In other words, if we regard the cluster indicators as the output of a function defined on all face images, this intuition indicates that the function should be smooth on these manifolds. Mathematically, this is equivalent to minimizing $\mathrm{tr}\left({F}^{T}LF\right)$ .
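To make the constraint concrete, here is a small sketch (an unnormalized graph Laplacian built from a hypothetical adjacency matrix; helper names are ours): $\mathrm{tr}({F}^{T}LF)$ vanishes exactly when the indicators are constant on each connected group of similar faces, and grows when indicators differ across an edge.

```python
import numpy as np

def graph_laplacian(A):
    """Unnormalized graph Laplacian L = Deg - A of an adjacency matrix A."""
    return np.diag(A.sum(axis=1)) - A

def smoothness(F, L):
    """tr(F^T L F): small when the cluster indicators change little
    across strongly connected (i.e., similar) samples."""
    return np.trace(F.T @ L @ F)
```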

In summary, we define SCTR as the solution to the following problem:

## Eq. 6

$$\mathrm{arg}\phantom{\rule{0.2em}{0ex}}\underset{{W}^{T}W=I,{F}^{T}F=I}{\mathrm{min}}(1-\beta )\frac{\mathrm{tr}\left[(1-\alpha ){W}^{T}{S}_{\mathrm{w}}W+\alpha {W}^{T}{\Delta}^{T}\Delta W\right]}{\mathrm{tr}\left({W}^{T}{S}_{\mathrm{t}}W\right)}+\beta \phantom{\rule{0.3em}{0ex}}\mathrm{tr}\left({F}^{T}LF\right)=\mathrm{arg}\phantom{\rule{0.2em}{0ex}}\underset{{W}^{T}W=I,{F}^{T}F=I}{\mathrm{min}}(1-\beta )\frac{\mathrm{tr}\left\{{W}^{T}[(1-\alpha )X(I-F{F}^{T}){X}^{T}+\alpha {\Delta}^{T}\Delta ]W\right\}}{\mathrm{tr}\left({W}^{T}X{L}_{\mathrm{t}}{X}^{T}W\right)}+\beta \phantom{\rule{0.3em}{0ex}}\mathrm{tr}\left({F}^{T}LF\right).$$

### 2.2. Solution

There are two kinds of variables to be optimized (i.e., $W$ and $F$ ). It is difficult to compute them simultaneously, so we optimize them alternately.

#### 2.2.1. Stage one: fixing $F$ and optimizing $W$

In this situation, the optimization problem in Eq. 6 becomes

## Eq. 7

$$\mathrm{arg}\phantom{\rule{0.2em}{0ex}}\underset{{W}^{T}W=I}{\mathrm{min}}\phantom{\rule{0.2em}{0ex}}\frac{\mathrm{tr}\left\{{W}^{T}[(1-\alpha )X(I-F{F}^{T}){X}^{T}+\alpha {\Delta}^{T}\Delta ]W\right\}}{\mathrm{tr}\left({W}^{T}X{L}_{\mathrm{t}}{X}^{T}W\right)}.$$

This has the same form as the problem proposed in Ref. 7, except for a slight difference in the formulation of the numerator. Thus, we can directly employ the same technique. It is an iterative procedure: for our problem, in the $p\text{th}$ step, we solve a trace difference problem.
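The iterative scheme can be sketched as follows (a minimal re-implementation in the spirit of the trace ratio method of Ref. 7, not the authors' code). Here `Sn` stands for the regularized numerator matrix $(1-\alpha )X(I-F{F}^{T}){X}^{T}+\alpha {\Delta}^{T}\Delta $ and `St` for $X{L}_{\mathrm{t}}{X}^{T}$:

```python
import numpy as np

def trace_ratio_step(Sn, St, d, n_iter=100, tol=1e-10):
    """Minimize tr(W^T Sn W) / tr(W^T St W) over orthonormal W (D x d).

    Each iteration solves the trace-difference problem
        min_{W^T W = I}  tr[W^T (Sn - lam * St) W],
    whose solution is the d eigenvectors with the smallest
    eigenvalues; lam is then updated to the newly attained ratio."""
    D = Sn.shape[0]
    W = np.eye(D)[:, :d]                       # any orthonormal start
    lam = np.trace(W.T @ Sn @ W) / np.trace(W.T @ St @ W)
    for _ in range(n_iter):
        _, vecs = np.linalg.eigh(Sn - lam * St)
        W = vecs[:, :d]                        # d smallest eigenvalues
        new_lam = np.trace(W.T @ Sn @ W) / np.trace(W.T @ St @ W)
        converged = abs(lam - new_lam) < tol
        lam = new_lam
        if converged:
            break
    return W, lam
```

Each step cannot increase the ratio, since the current $W$ already makes the trace difference zero at the current $\lambda$.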

#### 2.2.2. Stage two: fixing $W$ and optimizing $F$

In this case, because ${L}_{\mathrm{t}}$ and $\Delta $ do not depend on $F$ , the optimization problem in Eq. 6 is equivalent to

## Eq. 8

$$\mathrm{arg}\phantom{\rule{0.2em}{0ex}}\underset{{F}^{T}F=I}{\mathrm{max}}\phantom{\rule{0.2em}{0ex}}\mathrm{tr}\left\{{F}^{T}[(1-\beta )(1-\alpha ){X}^{T}W{W}^{T}X-\beta \phantom{\rule{0.3em}{0ex}}\mathrm{tr}\left({W}^{T}X{L}_{\mathrm{t}}{X}^{T}W\right)L]F\right\}.$$

This problem can be solved effectively by the spectral decomposition technique. In fact, the optimal $F$ for the problem in Eq. 8 is formed by the eigenvectors corresponding to the $C$ largest eigenvalues of $(1-\beta )(1-\alpha ){X}^{T}W{W}^{T}X-\beta \phantom{\rule{0.2em}{0ex}}\mathrm{tr}\left({W}^{T}X{L}_{\mathrm{t}}{X}^{T}W\right)L$ .
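This stage reduces to a single symmetric eigendecomposition; a sketch (our helper name, returning the relaxed $F$ before any discretization):

```python
import numpy as np

def optimize_F(X, W, L, Lt, alpha, beta, C):
    """Relaxed maximizer of Eq. 8: the eigenvectors belonging to
    the C largest eigenvalues of
    (1-beta)(1-alpha) X^T W W^T X - beta * tr(W^T X Lt X^T W) * L."""
    G = (1 - beta) * (1 - alpha) * (X.T @ W @ W.T @ X) \
        - beta * np.trace(W.T @ X @ Lt @ X.T @ W) * L
    G = (G + G.T) / 2.0          # enforce exact symmetry for eigh
    _, vecs = np.linalg.eigh(G)  # eigenvalues in ascending order
    return vecs[:, -C:]          # columns for the C largest
```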

### 2.3. Discussions

There are two essential parameters in our method (i.e., $0<\alpha <1$ and $0<\beta <1$ ). They control the smoothness of the estimator and the balance between the two objectives. When $\alpha =0$ , SCTR totally ignores the spatial character of the face images. When $\alpha \to 1$ , SCTR tends to choose only the spatially smoothest basis and ignores the discriminative information. When $\beta =0$ , we ignore the smoothness constraint on $F$ . If $\beta =1$ , SCTR concerns only the smoothness of $F$ .

Parameter determination is an essential task in most learning algorithms.^{6} Among the various methods, grid search is probably the simplest and most widely used for unsupervised learning. Because we have constrained $0<\alpha <1$ and $0<\beta <1$ , we apply the grid search technique in this paper. More concretely, we set $\alpha $ and $\beta $ by searching the grid $\{0.1,0.2,\dots ,0.9\}\times \{0.1,0.2,\dots ,0.9\}$ .
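A minimal version of this search is sketched below; the scoring function is a placeholder for whatever clustering-quality measure one chooses, and the helper name is ours:

```python
import itertools

def grid_search(score):
    """Return the (alpha, beta) pair on the 9 x 9 grid
    {0.1,...,0.9} x {0.1,...,0.9} that minimizes `score`."""
    grid = [round(0.1 * k, 1) for k in range(1, 10)]
    return min(itertools.product(grid, grid),
               key=lambda ab: score(ab[0], ab[1]))
```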

Three other points should be noted here. (*i*) Because SCTR is an iterative approach, a direct way to initialize the iteration is to first compute $F$ by K-means and then optimize the problems in Eqs. 7, 8 alternately. (*ii*) The optimal $F$ that maximizes Eq. 8 may not have the form of the indicator matrix specified above. We apply the discretization procedure in Ref. 8 to solve this problem. (*iii*) Once $W$ and $F$ have been derived, it is easy to compute the embedding and the cluster of a new image by using them directly.
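Point (*i*) can be sketched as follows, with a plain K-means (our own minimal version, not the authors' implementation) providing the labels that seed $F$:

```python
import numpy as np

def simple_kmeans(X, C, n_iter=20, seed=0):
    """A plain K-means on the columns of X, used only for initialization."""
    rng = np.random.default_rng(seed)
    centers = X[:, rng.choice(X.shape[1], C, replace=False)].copy()
    for _ in range(n_iter):
        # squared distance from every sample to every center
        d2 = ((X[:, :, None] - centers[:, None, :]) ** 2).sum(axis=0)
        labels = d2.argmin(axis=1)
        for j in range(C):
            if (labels == j).any():
                centers[:, j] = X[:, labels == j].mean(axis=1)
    return labels

def init_F(X, C):
    """Seed the scaled indicator matrix F from a K-means run."""
    labels = simple_kmeans(X, C)
    F = np.zeros((labels.size, C))
    for j in range(C):
        idx = labels == j
        if idx.any():
            F[idx, j] = 1.0 / np.sqrt(idx.sum())
    return F
```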

## 3. Experiments

We employ four face image data sets for illustration: the Yale data, the ORL data, the Umist data, and the first 20 classes of the Feret data. All images are rescaled to a $16\times 16$ resolution.

We compare SCTR with K-means, $\mathrm{PCA}+\mathrm{K}$ -means, and LDA-Km. Two standard measurements, the accuracy (Acc) and the normalized mutual information (NMI), are used. The parameters $\alpha $ and $\beta $ are determined by grid search, and $d=C-1$ . We randomly initialize K-means and repeat it 100 times. The average Acc values are listed in Table 1, and Fig. 1 shows the corresponding mean NMI values.
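For reference, one common NMI variant (normalizing the mutual information by the geometric mean of the two entropies; our own implementation, which may differ from the exact variant used in the experiments) can be computed as:

```python
import numpy as np

def nmi(a, b):
    """NMI between two label vectors: I(a;b) / sqrt(H(a) * H(b))."""
    a, b = np.asarray(a), np.asarray(b)
    ua, ub = np.unique(a), np.unique(b)
    # joint label distribution
    P = np.array([[np.mean((a == x) & (b == y)) for y in ub] for x in ua])
    pa, pb = P.sum(axis=1), P.sum(axis=0)
    nz = P > 0
    I = (P[nz] * np.log(P[nz] / np.outer(pa, pb)[nz])).sum()
    Ha = -(pa[pa > 0] * np.log(pa[pa > 0])).sum()
    Hb = -(pb[pb > 0] * np.log(pb[pb > 0])).sum()
    return I / np.sqrt(Ha * Hb) if Ha > 0 and Hb > 0 else 0.0
```

NMI is 1 for identical partitions (up to relabeling) and 0 for independent ones.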

## Table 1

Mean accuracy (Acc) results of different methods on four face image data sets.

| | K-means | PCA+K-means | LDA-Km | SCTR |
|---|---|---|---|---|
| Yale | 0.379 | 0.385 | 0.459 | 0.491 |
| ORL | 0.497 | 0.513 | 0.607 | 0.674 |
| Umist | 0.381 | 0.383 | 0.468 | 0.501 |
| Feret | 0.374 | 0.372 | 0.526 | 0.551 |

As seen from Table 1 and Fig. 1, SCTR performs best: it achieves both the highest mean Acc and the largest mean NMI in all cases. Because all images are rescaled to a low resolution and the dimensionality is therefore not very high, K-means and $\mathrm{PCA}+\mathrm{K}$ -means have similar performance on these data sets. LDA-Km performs better than both K-means and $\mathrm{PCA}+\mathrm{K}$ -means.

We have also run experiments on the Yale data set with different parameters set by the grid search. The Acc results for different $\alpha $ and $\beta $ are shown in Fig. 2. It seems that for Yale, smaller $\alpha $ and $\beta $ are more suitable; the best parameters are $\alpha =0.3$ and $\beta =0.1$ .

## Acknowledgments

This work is partly supported by NSF China (Grants No. 60673090 and No. 60721003), the Hunan Provincial Innovation Foundation for Postgraduate and the Graduate School of National University of Defense Technology.

## References

P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, "Eigenfaces vs. fisherfaces: recognition using class specific linear projection," *IEEE Trans. Pattern Anal. Mach. Intell.* **19**(7), 711–720 (1997). https://doi.org/10.1109/34.598228

L. Parsons, E. Haque, and H. Liu, "Subspace clustering for high dimensional data: a review," *SIGKDD Explorations* **6**(1), 90–105 (2004). https://doi.org/10.1145/1007730.1007731

M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: a geometric framework for learning from labeled and unlabeled examples," *J. Mach. Learn. Res.* **7**, 2399–2434 (2006).

C. Ding and T. Li, "Adaptive dimension reduction using discriminant analysis and K-means clustering," in *Proc. 24th Int. Conf. on Machine Learning*, 521–528 (2007). https://doi.org/10.1145/1273496.1273562

T. Hastie, A. Buja, and R. Tibshirani, "Penalized discriminant analysis," *Ann. Stat.* **23**(1), 73–102 (1995). https://doi.org/10.1214/aos/1176324456

H. Wang, S. Yan, D. Xu, X. Tang, and T. S. Huang, "Trace ratio vs. ratio trace for dimensionality reduction," in *Proc. IEEE Conf. on Computer Vision and Pattern Recognition* (2007).