## 1. Introduction

Human face pose estimation has a variety of applications, such as face recognition, face tracking, and human-computer interaction (HCI). Because 3-D quantities estimated from 2-D data are of limited quality, estimating face pose from 2-D face images is difficult. Many factors exacerbate the problem, for example, illumination conditions, facial expressions, and spatial scale. More importantly, the appearance of the human head can change drastically across viewing angles, mainly because of nonlinear deformations during in-depth rotations of the head.^{1} Many approaches have been proposed to solve this problem. Existing pose estimation methods can be broadly classified into two categories: feature-based^{2} and appearance-based methods.^{3, 4}

The existing approaches leave four major problems unsolved. First, the face region must be extracted from the whole face image, and it is very difficult to locate the face region in a side-view or profile face image. Second, the original face images are normalized manually, which is tedious and costly. Third, face features are hard to extract accurately, and more so from a side-view face image than from a frontal one. Lastly, face images with varying intrinsic features such as illumination, face pose, and facial expression are considered to constitute highly nonlinear manifolds in the high-dimensional observation space. Pose estimation systems that use linear approaches [for example, principal components analysis (PCA)] therefore ignore the subtleties of these manifolds. Manifold learning algorithms are better alternatives. However, the discriminant ability of the low-dimensional subspaces obtained by manifold learning is often lower than that of subspaces obtained by traditional dimensionality reduction approaches. Furthermore, the original feature vectors may include high-order correlation, which cannot be removed by manifold learning algorithms.

Therefore, a new approach based on manifold learning is proposed to address these four problems. In our proposed approach, face images that have not been separated from the background are first transformed by Gabor filters. Then, a novel supervised locality preserving projection (SLPP) is proposed to project the Gabor-based data, including out-of-sample data, into a common low-dimensional subspace. For simplicity, the combination of Gabor filters (GF) and SLPP is abbreviated as $\mathrm{GF}+\mathrm{SLPP}$. Last, a support vector machine (SVM) classifier is applied to estimate the face pose.

## 2. Proposed Combination Approaches of Gabor Filters and the Supervised Locality Preserving Projection

Gabor filters are particularly appropriate for face pose estimation because they incorporate smoothing and can reduce sensitivity to spatial misalignment and illumination change. The Gabor wavelet transform (GWT) also yields image representations that are locally normalized in intensity and decomposed in spatial frequency and orientation.^{5} In addition, Gabor filters can enhance pose-specific face features. Moreover, Gabor filters transform the face images into the frequency domain, where information that is hard to notice in the spatial domain becomes apparent. The transformed face images therefore improve the discriminant ability of SLPP.

In our studies, the system processes face images as follows. A set of Gabor kernels $h_{mn}(x,y)$ is specified and the original image $I(x,y)$ is convolved with those kernels at each pixel. The result is a set of 2-D coefficient arrays,

$$W_{m,n}(x,y)=I(x,y)*h_{mn}(x,y), \tag{1}$$

where $W_{m,n}(x,y)$ is the convolution result corresponding to the Gabor kernel at scale $m$ and orientation $n$, and $*$ denotes the convolution operator.

Since the outputs $W_{m,n}(x,y)$ consist of different localities, scales, and orientation features, we concatenate all these features into a feature vector $\mathbf{X}$. Without loss of generality, assume each output $W_{m,n}(x,y)$ is a column vector, which can be constructed by concatenating the rows (or columns) of the output. Before the concatenation, each output $W_{m,n}(x,y)$ is down-sampled by a factor $\rho$ to reduce the dimensionality of the original vector space. Then, it is normalized to zero mean and unit variance. Let $W_{m,n}^{(\rho)}$ denote a normalized output, and then the feature vector $\mathbf{X}^{(\rho)}$ is defined as:

$$\mathbf{X}^{(\rho)}=\left[W_{0,0}^{(\rho)\,t},W_{0,1}^{(\rho)\,t},\dots ,W_{4,7}^{(\rho)\,t}\right]^{t}. \tag{2}$$

After high-order information features are extracted by the Gabor filters, an immediate problem is to reduce the dimensionality and uncover the intrinsic low-dimensionality manifold. In this work, we propose a SLPP approach.
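The construction of the feature vector $\mathbf{X}^{(\rho)}$ described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions, not the authors' implementation: the text does not give the kernel formula, so a commonly used Gabor wavelet form is assumed (with hypothetical parameters `k_max`, `f`, `sigma`, and a 9×9 kernel), the magnitude of the complex response is used, down-sampling is read as keeping every $\rho$-th coefficient of the flattened response, and a random array stands in for a face image.

```python
import numpy as np

def gabor_kernel(scale, orientation, size=9, n_orient=8,
                 k_max=np.pi / 2, f=np.sqrt(2), sigma=2 * np.pi):
    """Complex Gabor kernel for one scale and orientation (assumed standard form)."""
    k = k_max / f ** scale                       # radial frequency at this scale
    phi = np.pi * orientation / n_orient         # orientation angle
    kx, ky = k * np.cos(phi), k * np.sin(phi)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    envelope = (k ** 2 / sigma ** 2) * np.exp(-k ** 2 * (x ** 2 + y ** 2) / (2 * sigma ** 2))
    carrier = np.exp(1j * (kx * x + ky * y)) - np.exp(-sigma ** 2 / 2)  # DC-compensated
    return envelope * carrier

def convolve_same(image, kernel):
    """2-D linear convolution via zero-padded FFTs, cropped to the image size."""
    H, W = image.shape
    kh, kw = kernel.shape
    shape = (H + kh - 1, W + kw - 1)
    full = np.fft.ifft2(np.fft.fft2(image, s=shape) * np.fft.fft2(kernel, s=shape))
    top, left = (kh - 1) // 2, (kw - 1) // 2
    return full[top:top + H, left:left + W]

def gabor_feature_vector(image, n_scales=5, n_orient=8, rho=4):
    """Convolve, down-sample, normalize, and concatenate all filter responses."""
    parts = []
    for m in range(n_scales):
        for n in range(n_orient):
            kernel = gabor_kernel(m, n, n_orient=n_orient)
            response = np.abs(convolve_same(image, kernel))   # magnitude (assumption)
            flat = response.ravel()[::rho]                    # keep every rho-th coefficient
            flat = (flat - flat.mean()) / (flat.std() + 1e-12)  # zero mean, unit variance
            parts.append(flat)
    return np.concatenate(parts)

rng = np.random.default_rng(0)
image = rng.random((24, 18))          # stand-in for a 24 x 18 face image
x = gabor_feature_vector(image)
print(x.shape)
```

Under this reading of the down-sampling step, each of the 40 responses contributes 108 coefficients. A different down-sampling scheme (for example, per axis) would change the vector length but not the overall structure.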

LPP seeks a transformation $W$ to project high-dimensional input data $X=[x_{1},x_{2},\dots ,x_{m}]$ into a low-dimensional subspace $Y=[y_{1},y_{2},\dots ,y_{m}]$. The linear transformation $W$ can be obtained by minimizing an objective function as follows:^{6}

$$\min \sum_{i,j}\left(w^{T}x_{i}-w^{T}x_{j}\right)^{2}S_{ij}, \tag{3}$$

where

$$S_{ij}=\begin{cases}\exp\left(-\dfrac{\|x_{i}-x_{j}\|^{2}}{t}\right) & \text{if } x_{i} \text{ and } x_{j} \text{ are close}\\ 0 & \text{otherwise.}\end{cases} \tag{4}$$

The $d$-dimensional data from LPP are further mapped into a $d'$-dimensional discriminant subspace through the linear discriminant analysis (LDA) algorithm. To minimize the intraclass distances while maximizing the interclass distances of the face manifold, the column vectors of the discriminant matrix $\mathbf{W}'$ are calculated as the eigenvectors of $\mathbf{S}_{w}^{-1}\mathbf{S}_{b}$ associated with the largest eigenvalues, where $\mathbf{S}_{b}$ is the between-class scatter matrix and $\mathbf{S}_{w}$ is the within-class scatter matrix. The matrix $\mathbf{W}'$ then projects vectors in the low-dimensional face subspace into the common discriminant subspace, which can be formulated as follows:

$$Z=\mathbf{W}'Y=\mathbf{W}'WX,\quad Z\subset R^{d'},\quad Y\subset R^{d},\quad \mathbf{W}'\in R^{d'\times d}. \tag{7}$$
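The two-stage projection $Z=\mathbf{W}'WX$ can be sketched in NumPy as below. This is a minimal sketch under assumptions, not the authors' code: "close" is taken to mean $k$-nearest neighbours, the LPP objective is minimized under the usual constraint $w^{T}XDX^{T}w=1$ (which leads to a generalized eigenproblem), the heat-kernel width `t` defaults to the mean squared distance, a small ridge term `reg` keeps the matrices invertible, and the data are synthetic. The helper names (`generalized_eigh`, `lpp`, `lda`) are hypothetical.

```python
import numpy as np

def generalized_eigh(A, B):
    """Solve A v = lambda B v (A symmetric, B positive definite).

    Returns eigenvalues in ascending order and eigenvectors as columns.
    """
    C = np.linalg.cholesky(B)                 # B = C C^T
    C_inv = np.linalg.inv(C)
    M = C_inv @ A @ C_inv.T                   # equivalent standard symmetric problem
    vals, U = np.linalg.eigh((M + M.T) / 2)
    return vals, C_inv.T @ U

def lpp(X, d, k=5, t=None, reg=1e-6):
    """LPP for X of shape (D, m), one sample per column. Returns W of shape (d, D)."""
    D, m = X.shape
    sq = np.sum(X ** 2, axis=0)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X.T @ X, 0.0)
    if t is None:
        t = dist2.mean()                      # assumed heuristic for the kernel width
    masked = dist2.copy()
    np.fill_diagonal(masked, np.inf)          # a sample is not its own neighbour
    nn = np.argsort(masked, axis=1)[:, :k]
    rows = np.repeat(np.arange(m), k)
    S = np.zeros((m, m))
    S[rows, nn.ravel()] = np.exp(-dist2[rows, nn.ravel()] / t)
    S = np.maximum(S, S.T)                    # close if either one is a neighbour of the other
    deg = np.diag(S.sum(axis=1))
    A = X @ (deg - S) @ X.T                   # X L X^T with graph Laplacian L = D - S
    B = X @ deg @ X.T + reg * np.eye(D)       # X D X^T, ridge-regularized
    _, V = generalized_eigh(A, B)
    return V[:, :d].T                         # smallest eigenvalues minimize the objective

def lda(Y, labels, d_prime, reg=1e-6):
    """Top eigenvectors of Sw^{-1} Sb, stored as rows so that Z = W' Y."""
    d = Y.shape[0]
    mean = Y.mean(axis=1, keepdims=True)
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(labels):
        Yc = Y[:, labels == c]
        mc = Yc.mean(axis=1, keepdims=True)
        Sw += (Yc - mc) @ (Yc - mc).T
        Sb += Yc.shape[1] * (mc - mean) @ (mc - mean).T
    _, V = generalized_eigh(Sb, Sw + reg * np.eye(d))
    return V[:, ::-1][:, :d_prime].T          # largest eigenvalues first

# Synthetic stand-in data: 7 pose classes, 20 samples each, 30 features.
rng = np.random.default_rng(0)
n_poses, per_pose, D = 7, 20, 30
labels = np.repeat(np.arange(n_poses), per_pose)
centers = rng.normal(scale=3.0, size=(D, n_poses))
X = centers[:, labels] + rng.normal(size=(D, labels.size))

W = lpp(X, d=20)
Y = W @ X
W_prime = lda(Y, labels, d_prime=n_poses - 1)
Z = W_prime @ Y
print(W.shape, W_prime.shape, Z.shape)
```

Because both stages are linear, a new sample is projected with the same two matrices as the training data. For Gabor features whose dimension exceeds the number of training samples, a larger `reg` or a preliminary PCA step would likely be needed; that choice is not specified in the text.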

## 3. Experimental Results

In this section, we manually selected two collections of face images from the JDL-PEAL face database.^{7} Each collection includes 130 randomly selected subjects, each with seven face images in different poses and with varying intrinsic features such as pose, illumination, and expression. The first collection is used as the training set and the second as the testing set. In the first collection, all face images were resized to $24\times 18$. Some samples are illustrated in Fig. 1. Before performing the proposed approach, several parameters need to be fixed. First, for the Gabor filters, we chose five scales and eight orientations, and the down-sampling factor $\rho$ is 4. Second, the two reduced dimensions $d$ and $d'$ of the proposed method are fixed: $d$ is set to 20, and the reduced discriminant dimension $d'$ is generally no more than $L-1$, where $L$ denotes the number of face poses.
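The sizes implied by these settings can be checked with a few lines of arithmetic. The feature length below assumes that down-sampling by $\rho$ keeps one coefficient in every $\rho$ of a flattened filter response; the text does not state the scheme, so that line is an assumption and the variable names are illustrative.

```python
# Settings quoted in the text
n_scales, n_orientations = 5, 8
height, width = 24, 18
rho = 4
n_subjects, n_poses = 130, 7
d = 20

n_filters = n_scales * n_orientations              # Gabor kernels in the bank
images_per_collection = n_subjects * n_poses       # images in each collection
max_d_prime = n_poses - 1                          # upper bound L - 1 on d'

# Assumed reading: one coefficient kept out of every rho per filter response
feature_length = n_filters * (height * width // rho)

print(n_filters, images_per_collection, max_d_prime, feature_length)
# prints: 40 910 6 4320
```

With seven poses the bound $L-1$ equals 6, which matches the value of $d'$ reported with Table 1.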

We compared our proposed $\mathrm{GF}+\mathrm{SLPP}$ algorithm with PCALDA, $\mathrm{GF}+\mathrm{PCALDA}$, and SLPP. For PCALDA, the algorithm is applied directly to the training set to obtain the subspace. For SLPP, we use the SLPP approach without Gabor filters to learn the subspace from the training set. The $\mathrm{GF}+\mathrm{PCALDA}$ approach is similar to $\mathrm{GF}+\mathrm{SLPP}$, except that the dimensionality reduction step is replaced by PCALDA.

In the $\mathrm{GF}+\mathrm{SLPP}$ approach, the reduced discriminant dimension $d'$ influences performance. Fig. 2 shows that the accuracy rate of $\mathrm{GF}+\mathrm{SLPP}$ rises as $d'$ increases.

The experimental results with the optimal reduced dimensions are listed in Table 1. Table 1 shows that the discriminant ability of the SLPP approach is better than that of the PCALDA approach, and that the $\mathrm{GF}+\mathrm{SLPP}$ method achieves the best performance.

## Table 1

The accuracy rate (percent) of the combination of dimensionality reduction and SVM classification. d=20 and d′=6.

Face pose | −45° | −30° | −15° | 0° | +15° | +30° | +45° |
---|---|---|---|---|---|---|---|
$\mathrm{GF}+\mathrm{SLPP}$ accuracy rate | 96.23 | 96.53 | 95.97 | 97.49 | 95.93 | 96.11 | 96.29 |
$\mathrm{GF}+\mathrm{PCALDA}$ accuracy rate | 64.14 | 65.76 | 69.85 | 73.52 | 68.59 | 66.35 | 64.62 |
SLPP accuracy rate | 75.23 | 78.51 | 78.84 | 83.85 | 79.15 | 77.58 | 73.39 |
PCALDA accuracy rate | 58.21 | 59.68 | 61.18 | 64.38 | 61.23 | 58.92 | 57.98 |
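To compare the methods at a glance, the per-pose rates in Table 1 can be averaged. The short script below only re-uses the numbers from the table; the unweighted mean over the seven poses is a summary added here for illustration and is not a figure reported by the authors.

```python
# Per-pose accuracy rates (percent) copied from Table 1
poses = [-45, -30, -15, 0, 15, 30, 45]
accuracy = {
    "GF+SLPP":   [96.23, 96.53, 95.97, 97.49, 95.93, 96.11, 96.29],
    "GF+PCALDA": [64.14, 65.76, 69.85, 73.52, 68.59, 66.35, 64.62],
    "SLPP":      [75.23, 78.51, 78.84, 83.85, 79.15, 77.58, 73.39],
    "PCALDA":    [58.21, 59.68, 61.18, 64.38, 61.23, 58.92, 57.98],
}

means = {}
best_pose = {}
for method, rates in accuracy.items():
    means[method] = sum(rates) / len(rates)              # unweighted mean over poses
    best_pose[method] = poses[rates.index(max(rates))]   # pose with the highest rate
    print(f"{method:10s} mean = {means[method]:.2f}  best pose = {best_pose[method]} deg")
```

Every method reaches its highest rate on the frontal (0°) pose, and the ranking by mean accuracy is the same as the ranking at each individual pose.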

## Acknowledgments

The research is sponsored by the Fundamental Project of the Committee of Science and Technology, Shanghai, under contract 03DZ14015.