Person reidentification (PR-ID) is a very important branch of computer vision and has been widely used in many safety-critical applications, such as video surveillance and forensics. The basic task of PR-ID shown in Fig. 1 is to determine whether or not two images from nonoverlapping cameras show the same person of interest. However, in real-world applications, there are many significant challenges for PR-ID because an image pair of a person is usually captured by different cameras with significantly different backgrounds, levels of illumination, viewpoints, occlusions, and image resolutions. To overcome these issues, many PR-ID methods have been proposed in recent years and can be generally classified into two categories: feature representation1,2 and metric learning methods.3,4 For feature representation methods, Schwartz and Davis1 proposed a high-dimensional feature extraction algorithm. Baltieri et al.2 proposed a view-independent signature method by mapping the local descriptors extracted from RGB-D sensors on an articulated body model. The pose priors and subject-discriminative features were used to reduce the effects of viewpoint changes.5 Li et al.6 proposed a cross-view multilevel dictionary learning model to improve the representation power, which contains dictionary learning at different representation levels, including image level, horizontal part level, and patch level. For metric learning methods, Cheng et al.3 introduced a new and essential ranking graph Laplacian term, which can minimize the intrapedestrian compactness and maximize the interpedestrian dispersion. Li and Wang7 presented a method that learns different metrics from the images of a person obtained from different cameras. 
In addition, Jing et al.4 combined semicoupled low-rank discriminant dictionary learning to achieve super-resolution PR-ID, and Li et al.8 also proposed a method for low-resolution PR-ID that jointly learns a pair of dictionaries and a mapping to bridge the gap between lower- and higher-resolution images, incorporates positive and negative pair information, and uses a projective dictionary to boost PR-ID efficiency.
With the development of deep-learning methods, deep representation learning has recently achieved great success due to its highly effective learning ability. Several deep PR-ID models achieve a great improvement in the accuracy, such as deep metric learning (DML) for practical PR-ID,9 a multitask deep network (MDN) for PR-ID,10 and a deep linear discriminant analysis of Fisher networks for PR-ID.11 However, existing deep-learning-based methods require learning a deep metric network by maximizing the distance among interclass samples and minimizing the distance among intraclass samples simultaneously. These methods do not effectively use the discriminant information among different samples. Therefore, triplet-based PR-ID models have been proposed to improve the efficiency of exploiting discriminant information through three samples, including a multiscale triplet CNN,12 distance metric learning with asymmetric impostors (LISTEN),13 and a body-structure-based triplet convolutional neural network.14
Although these triplet-based methods can improve the performance of PR-ID, they do not consider constraints from impostors' congener samples (IC samples). As shown in Fig. 2, new impostors may be produced when existing impostor-based methods remove existing impostors. Therefore, how to alleviate the effects of these samples is an important problem in PR-ID.
Research in Refs. 12–14 has demonstrated that triplet-based methods can exploit more discriminant information than pairwise-based methods. However, existing triplet-based methods cannot handle the difficulties caused by IC samples, such as being transformed into new impostors or being dispersed after projection; they cannot fully use the discriminant information contained in IC samples. To address this problem, three aspects need to be considered in triplet-based methods. (i) Existing triplet-based methods12–14 exploit the information in impostors alone, without IC samples. (ii) An impostor and its congeners may be dispersed after projection, which reduces the matching accuracy for PR-ID. (iii) Most deep PR-ID models are limited to handcrafted image features processed by DML rather than convolutions of the original images.
The major contributions of this study are summarized as follows.
1. We propose a deep triplet-group network that fully employs symmetric and asymmetric information (DSAN) for triplets and IC samples (denoted as a triplet group). It learns a deep neural network by convolving the original images of a person and trains the network with a symmetric and asymmetric constraint loss function, which ensures the clustering effect of an impostor and its congeners and makes them more efficient and discriminable.
2. We design a triplet-group constraint objective function that requires the distance between a negative pair to be larger than that between a positive pair while the distances between an impostor and its congeners (denoted as an impostor group) are simultaneously minimized.
3. We conduct a number of matching accuracy experiments in this study. The experimental results show that our DSAN approach outperforms various triplet-based methods and other deep-learning methods.
The corresponding relationships between an impostor and its relevant positive sample pair can be classified into two cases: a symmetric correspondence relationship and an asymmetric correspondence relationship (ACR). Given an impostor $x_k$ and the corresponding positive sample pair $(x_i, x_j)$, if $x_k$ is an impostor of $x_i$ with respect to $x_j$ and an impostor of $x_j$ with respect to $x_i$, the correspondence relationship between $x_k$ and $(x_i, x_j)$ is symmetric, as shown in Fig. 2(a). Otherwise, the correspondence relationship is asymmetric, as shown in Fig. 2(b). The ratio of impostors in several PR-ID datasets is presented in Ref. 13, which shows the importance of impostors for PR-ID. For the distance between two samples $x_i$ and $x_j$, we compute the Euclidean distance as follows:
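To make the two cases concrete, the following Python sketch classifies an impostor's correspondence relationship, assuming the standard definition that $x_k$ is an impostor of $x_i$ with respect to $x_j$ when $d(x_i, x_k) < d(x_i, x_j)$. The function names are ours, introduced only for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def impostor_relation(xi, xj, xk):
    """Classify the correspondence relationship between an impostor xk
    and the positive pair (xi, xj): symmetric if xk intrudes on both
    sides of the pair, asymmetric if on one side only."""
    d_pos = euclidean(xi, xj)
    imp_of_i = euclidean(xi, xk) < d_pos  # xk closer to xi than xj is
    imp_of_j = euclidean(xj, xk) < d_pos  # xk closer to xj than xi is
    if imp_of_i and imp_of_j:
        return "symmetric"
    if imp_of_i or imp_of_j:
        return "asymmetric"
    return "none"
```

For instance, a sample lying between the two positive samples triggers the symmetric case, whereas one lying just outside one of them triggers the asymmetric case.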
Existing Triplet-Based Methods
The impostor-based metric learning methods15–17 exploit the impostors with a "normal" triplet constraint [i.e., for a triplet $(x_i, x_j, x_k)$, it requires $d(x_i, x_j) < d(x_i, x_k)$, where $d(\cdot, \cdot)$ is a distance function], meaning that they cannot effectively remove the impostors in the case of an ACR. For this reason, Zhu et al.13 proposed LISTEN; it requires that $d(x_i, x_j) < d(x_i, x_k)$ and $d(x_i, x_j) < d(x_j, x_k)$ simultaneously. However, LISTEN does not consider the relationship between the impostor $x_k$ and the other samples in the same class as $x_k$. This may produce new impostors when removing the existing impostors, as in Figs. 2(a) and 2(b).
Our Overasymmetric and Oversymmetric Relationship Constraints on Triplets
In our method, we handle both the symmetrically correlated impostor and the asymmetrically correlated impostor (Fig. 2) by imposing an overasymmetric relationship (OAR) and an oversymmetric relationship (OSR) on the positive pair and the IC samples. Given an impostor $x_k$ with its congeners in the same class and the corresponding positive sample pair $(x_i, x_j)$, we want them to reach the desirable status of Fig. 2(c) regardless of their previous status, in which $d(x_i, x_j)$ is as small as possible while $d(x_i, x_k)$ and $d(x_j, x_k)$ are as large as possible. To an extreme degree, the correlation in the triplet can be considered a symmetric relationship because $d(x_i, x_k)$ and $d(x_j, x_k)$ are much larger than $d(x_i, x_j)$. Meanwhile, we cluster $x_k$ with its congeners for better classification within the class, avoiding the circumstances in Figs. 2(a) and 2(b).
We now present our deep triplet-group network and the person reidentification method based on it; the details are described below.
Deep Triplet-Group Network
For our deep triplet-group network, we use a deep convolutional network inspired by Schroff et al.18 The network architecture is outlined in Fig. 3. We use $M$ layers, where the last layer is our OAR and OSR loss function. The input of the network is the triplet samples with the impostor's congeners. For an image $x$, the output of the first layer is $h^{(1)} = s(W^{(1)}x + b^{(1)})$, where $W^{(1)}$ is the projection matrix, $b^{(1)}$ is the bias vector to be learned in the first layer of our network, and $s(\cdot)$ is a nonlinear activation function that is applied in a component-wise manner. The output of the second layer is $h^{(2)} = s(W^{(2)}h^{(1)} + b^{(2)})$, where $W^{(2)}$ is the projection matrix and $b^{(2)}$ is the bias vector to be learned in the second layer of our network. Similarly, the output of the $m$'th layer ($2 \le m \le M$) is $h^{(m)} = s(W^{(m)}h^{(m-1)} + b^{(m)})$, and that of the top layer is
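As an illustration of this layer-wise computation, here is a minimal pure-Python sketch. The paper does not specify the nonlinearity, so tanh is assumed for $s(\cdot)$, and the function names are ours.

```python
import math

def dense_layer(W, b, h):
    """One layer h' = s(W h + b), with tanh standing in for the
    component-wise nonlinear activation s(.)."""
    return [math.tanh(sum(w * x for w, x in zip(row, h)) + bi)
            for row, bi in zip(W, b)]

def forward(params, x):
    """Propagate input x through the M layers; params is a list
    [(W1, b1), ..., (WM, bM)]. The top-layer output serves as the
    deep feature representation of the image."""
    h = x
    for W, b in params:
        h = dense_layer(W, b, h)
    return h
```

In the actual network the early layers are convolutions rather than dense projections, but the recursion $h^{(m)} = s(W^{(m)}h^{(m-1)} + b^{(m)})$ has the same shape.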
According to Eq. (1), we compute the distance between the outputs of the $M$'th layer from $x_i$ and $x_j$ as follows:
To increase the image classification performance, we expect the outputs of all positive pairs and IC samples through the network to simultaneously satisfy the OAR and OSR constraints. In a desired status, the impostor $x_k$ should be at a maximal distance from both $x_i$ and $x_j$ simultaneously, so that the relationship among $x_i$, $x_j$, and $x_k$ can be considered symmetric. However, this symmetric relationship is hard to meet in reality, so we impose it on the cluster center of the impostor and its congeners (denoted as the impostor group), which not only maintains the asymmetric relationship within the triplet but also exploits the discriminative information in the congeners to make the impostor group more discriminative. In other words, our strategy ensures that the impostor group meets the OAR constraint and the OSR constraint with respect to $x_i$ and $x_j$. In our network, for each triplet group $(x_i, x_j, x_k)$ and the congeners of $x_k$, the outputs satisfy the following objective function:
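The objective function itself appears as a display equation not reproduced in this text, so the following sketch shows only one plausible reading of the triplet-group constraint described above: a hinge term pushing the impostor-group center away from the positive pair with margin alpha, plus a compactness term weighted by beta that clusters the impostor with its congeners. The exact form, the defaults, and all names are our assumptions, not the paper's equation.

```python
def centroid(samples):
    """Cluster center of the impostor group (impostor + congeners)."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_group_loss(fi, fj, group, alpha=0.35, beta=0.25):
    """Hypothetical triplet-group objective:
    - hinge terms: the positive pair distance should be smaller than
      the distance from either positive sample to the impostor-group
      center, by a margin alpha;
    - compactness term: the impostor and its congeners should cluster
      tightly around their center, weighted by beta."""
    c = centroid(group)
    d_pos = sq_dist(fi, fj)
    oar = max(0.0, d_pos - sq_dist(fi, c) + alpha) \
        + max(0.0, d_pos - sq_dist(fj, c) + alpha)
    compact = beta * sum(sq_dist(g, c) for g in group) / len(group)
    return oar + compact
```

When the impostor group is far from the positive pair and internally compact, this loss is zero, which matches the desirable status of Fig. 2(c).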
Our DSAN algorithm
|Input: Training set $X$, number of network layers $M$, learning rate $\lambda$, parameters $\alpha$ and $\beta$, and convergence error $\varepsilon$;|
|Output: Parameters $W^{(m)}$ and $b^{(m)}$, $m = 1, 2, \ldots, M$.|
|Initialization: Initialize $W^{(m)}$ and $b^{(m)}$ with appropriate values.|
|Compute the triplet-group collection.|
|Compute the network outputs of $x_i$, $x_j$, and the impostor group using the deep network.|
|Obtain the gradients according to the backpropagation algorithm.|
|Update $W^{(m)}$ and $b^{(m)}$ using the obtained gradients.|
|Calculate the objective value using Eq. (5).|
|If the change of the objective value is smaller than the convergence error $\varepsilon$, go to Return.|
|Return: $W^{(m)}$ and $b^{(m)}$, where $m = 1, 2, \ldots, M$.|
Person Reidentification Method
For an image $x_p$ of a pedestrian in the probe of the testing image set, we use $x_p$ as the input of our network with the learned parameters and obtain its deep feature representation $h_p^{(M)}$. Then, we compute the distances between $h_p^{(M)}$ and the deep feature of each image in the gallery of the testing image set by Eq. (3). Finally, we choose the smallest of these distances and obtain the label of the gallery sample that has the smallest distance to $x_p$ as follows:
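This matching rule amounts to a nearest-neighbor search over the gallery features. A minimal sketch (identifiers are ours, not the paper's):

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def identify(probe_feat, gallery_feats, gallery_labels):
    """Return the label of the gallery image whose deep feature is
    nearest to the probe's deep feature (the rank-1 match)."""
    best = min(range(len(gallery_feats)),
               key=lambda g: sq_dist(probe_feat, gallery_feats[g]))
    return gallery_labels[best]
```

In practice the probe and gallery features would both be the top-layer outputs of the trained network.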
We conducted extensive experiments using five widely used datasets: CUHK03,19 CUHK01,20 VIPeR,21 iLIDS-VID,22 and PRID2011.23 Here, we compare the performance of our approach with triplet-based state-of-the-art approaches.
Datasets and Experimental Settings
Experiments are conducted with one large dataset and four small datasets. The large dataset is the CUHK03 dataset, which contains 13,164 images from 1360 persons. We randomly selected 1160 persons for training, 100 persons for validation, and 100 persons for testing, following exactly the same settings as in Refs. 19 and 24. The four small datasets are the CUHK01, VIPeR, iLIDS-VID, and PRID2011 datasets. For these four datasets, we randomly divided the individuals into two equal parts, with one used for training and the other for testing. Moreover, we created triplet collections following the method of Schroff et al.18
To validate the effectiveness of our DSAN approach, we compare the DSAN model with several state-of-the-art metric-learning-based methods: keep it simple and straightforward metric learning (KISSME)25 and relaxed pairwise metric learning (RPML).26 In addition, we compare our DSAN model with several state-of-the-art deep-learning-based methods: the improved deep-learning architecture (IDLA),24 deep ranking PR-ID (DRank),27 and an MDN (MTDnet).10 Moreover, we compare our DSAN model with some state-of-the-art triplet-based networks: efficient impostor-based metric learning (EIML),17 LISTEN,13 an improved triplet loss network (ImpTrLoss),28 and Spindle Net.29
To evaluate our DSAN, we use the TensorFlow30 framework to train it. Note that we used the same network configuration as in Ref. 18. For all datasets, our network contains six convolutional layers, three max pooling layers, and one fully connected (FC) layer for each image. These layers are configured as follows: (1) convolution, stride = 2, feature maps = 64; (2) max pool, stride = 2; (3) convolution, stride = 1, feature maps = 192; (4) max pool, stride = 2; (5) convolution, stride = 1, feature maps = 384; (6) max pool, stride = 2; (7) convolution, stride = 1, feature maps = 256; (8) convolution, stride = 1, feature maps = 256; (9) convolution, stride = 1, feature maps = 256; and (10) FC, output dimension = 128.
For the small datasets, we adopt an unsupervised image generation strategy31 to address the lack of training samples. In detail, we use the small dataset as the target domain and map 10,000 images from the CUHK03 dataset into it, so that the 10,000 generated images follow the distribution of the target small dataset. Then, we use these generated images to train our model and fine-tune it with the target small dataset.
Results and Analysis
Table 1 shows our rank-1 matching accuracies, and Figs. 4–8 show the cumulative match characteristic (CMC) curves at different ranks on the five datasets. We describe the evaluations on each dataset below.
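For reference, a CMC curve of the kind plotted in Figs. 4–8 can be computed from a probe-gallery distance matrix as follows. This is a generic sketch of the standard metric, not the authors' evaluation code.

```python
def cmc_curve(distances, probe_labels, gallery_labels, max_rank=5):
    """Cumulative match characteristic: for k = 1 .. max_rank, the
    fraction of probes whose correct identity appears among the k
    nearest gallery images."""
    hits = [0] * max_rank
    for p, row in enumerate(distances):
        # gallery indices sorted by increasing distance to probe p
        order = sorted(range(len(row)), key=row.__getitem__)
        for k, g in enumerate(order[:max_rank]):
            if gallery_labels[g] == probe_labels[p]:
                for r in range(k, max_rank):  # a hit at rank k counts
                    hits[r] += 1              # for all higher ranks too
                break
    return [h / len(distances) for h in hits]
```

The rank-1 entry of this curve is the rank-1 matching accuracy reported in Table 1.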
Top-ranked matching rates (%) for five datasets.
Evaluation with the CUHK03 dataset
The CUHK03 dataset contains 13,164 images of 1360 pedestrians captured by six surveillance cameras. Each identity is observed by two disjoint camera views. On average, there are 4.8 images per identity for each view. This dataset provides both manually labeled pedestrian bounding boxes and bounding boxes automatically obtained by running a pedestrian detector.32 We report results for both versions of the data (labeled and detected). Following the protocol used in Ref. 19, we randomly divided the 1360 identities into nonoverlapping training (1160), test (100), and validation (100) sets. This yielded about 26,000 positive pairs before data augmentation. We used a minibatch size of 150 samples and trained the network for 200,000 iterations. We used the validation set to design the network architecture. In Table 1 and Fig. 4, we compare our method against KISSME, IDLA, MTDnet, ImpTrLoss, and Spindle Net, and we observe that DSAN outperforms all of these methods except Spindle Net with regard to the rank-1 matching accuracy. We achieve a rank-1 accuracy of 77.35% with the parameters $\alpha = 0.35$ and $\beta = 0.25$.
Evaluation with the CUHK01 dataset
The CUHK01 dataset has 971 identities, with two images per person for each view. Most previous papers have reported results on the CUHK01 dataset using 486 identities for testing. With 486 identities in the test set, only 485 identities remain for training. This leaves only 1940 positive samples, which makes it practically impossible for a reasonably sized deep architecture not to overfit if trained from scratch on these data. One option is to use a model trained with the transformed CUHK03 dataset and test directly on the 486 identities of the CUHK01 dataset; however, this is unlikely to work well because the network does not know the statistics of the CUHK01 test data. Instead, our model was trained with the transformed CUHK03 dataset and adapted to the CUHK01 dataset by fine-tuning with the 485 training identities (nonoverlapping with the test set). Table 1 and Fig. 5 compare the performance of our approach with that of the other methods. We used a minibatch size of 150 samples and trained the network for 180,000 iterations. Our method obtains a rank-1 accuracy of 79.35%, surpassing all other methods.
Evaluation with the VIPeR dataset
The VIPeR dataset contains 632 pedestrian pairs from two views, with only one image per person for each view. The testing protocol is to split the dataset in half: 316 pairs for training and 316 pairs for testing. This dataset is extremely challenging for a deep neural network architecture for two reasons: (a) there are only 316 identities for training, with one image per person for each view, giving a total of just 316 positive pairs, and (b) the resolution of the images is lower than that of the CUHK01 dataset. We trained a model using the transformed CUHK03 dataset and then adapted it to the VIPeR dataset by fine-tuning with the 316 training identities. Because the number of negatives is small for this dataset, hard negative mining does not improve the results after fine-tuning, as most of the negatives were already used during fine-tuning. The results in Table 1 and Fig. 6 show that DSAN outperforms the state-of-the-art methods by a large margin. We used a minibatch size of 150 samples and trained the network for 130,000 iterations. Our rank-1 accuracy is 49.05%, surpassing all other methods.
Evaluation with the iLIDS dataset
The iLIDS-VID dataset has 300 different pedestrians observed across two disjoint camera views in a public open space. This dataset is very challenging owing to the clothing similarities among people and the lighting and viewpoint variations across camera views. There are two versions, a static-image-based version and an image-sequence-based version, and we chose the static images for our experiments. This version contains 600 images of 300 distinct individuals, with one pair of images from the two camera views for each person. We divided the set into 150 individuals for training and the rest for testing. With the iLIDS-VID dataset, we encounter a problem similar to that of the CUHK01 and VIPeR datasets, so we used the model pretrained on the transformed CUHK03 dataset and fine-tuned it with the iLIDS-VID training data. From Table 1 and Fig. 7, DSAN outperforms the state-of-the-art methods. We used a minibatch size of 150 samples and trained the network for 180,000 iterations. Our rank-1 accuracy is 62.55%.
Evaluation with the PRID2011 dataset
This dataset has 385 trajectories from camera A and 749 trajectories from camera B. Among them, only 200 people appear in both cameras. This dataset also has a single-shot version, which consists of randomly selected snapshots. The division and pretraining procedure is similar to that for the iLIDS-VID dataset: half of the identities for training and the rest for testing. Furthermore, the transformed CUHK03 dataset is used for pretraining, followed by fine-tuning with the PRID2011 dataset. In our experiments, we used a minibatch size of 150 samples and trained the network for 160,000 iterations. We obtained a rank-1 accuracy of 55.86%; the detailed results are presented in Table 1 and Fig. 8.
In this section, we discuss the effects of the OAR and OSR constraints and of the clustering center symmetric constraint, as well as a parameter analysis.
Effects of the OAR and OSR constraints
To evaluate the effects of the OAR and OSR constraints, we perform experiments with and without them. The results obtained using DSAN without the OAR constraint and without the OSR constraint are denoted as DSN and DAN, respectively. Table 2 reports the rank-1 matching rates of DSAN, DSN, and DAN for the five datasets. Using both the OAR and OSR constraints improves the rank-1 matching rate by at least 3.55%, which indicates that these constraints can exploit discriminative information that is useful for PR-ID.
Effects of the OAR and OSR constraints.
Effects of our clustering center symmetric constraint
To evaluate the effects of our clustering center symmetric constraint, we conduct experiments without it, using only the impostor in the triplet constraint (denoted as DTN). Table 1 reports the top-rank matching accuracies of this experiment and of the triplet-based methods (LISTEN and ImpTrLoss). The results show that our clustering center symmetric constraint improves the accuracy by 7.081% on average.
In this experiment, we investigate the effects of the parameters $\alpha$ and $\beta$. Parameter $\alpha$ balances the effect of the intraclass term, and parameter $\beta$ controls the effect of the interclass term. When one of the parameters is evaluated, the other is fixed at the value given in the evaluation of the datasets.
We take the experiment on the CUHK03 dataset as an example. Figures 9 and 10 show the rank-1 matching rates of our approach versus different values of $\alpha$ and $\beta$ on the CUHK03 dataset. We can observe that (1) DSAN is not sensitive to the choice of $\alpha$ in the range of [0.10, 0.30]; (2) DSAN achieves the best performance when $\alpha$ and $\beta$ are set as 0.35 and 0.25, respectively; and (3) DSAN obtains relatively good performance when $\beta$ is in the range of [0.20, 0.30]. Similar effects can be observed on the other datasets. In addition, the training and testing times are reported in Table 3.
Training time and testing time.
We have developed a deep triplet-group network that exploits symmetric and asymmetric information on the clustering center of an impostor and its congeners. It differs from existing methods in that it uses the OAR and OSR constraints to exploit more discriminative information from the relationships between the positive samples and their impostor clustering center. From the results of extensive experiments, we draw the following conclusions. (1) DSAN outperforms several state-of-the-art deep-learning-based methods in terms of the matching rate. (2) With the designed OAR and OSR constraints, DSAN can more effectively exploit discriminative information. (3) Useful information exists in the impostor-based clustering center, and proper utilization of this information can improve performance.
Benzhi Yu is a PhD student at Hubei Key Laboratory of Transportation of Things, Wuhan University of Technology. His current research interests include image processing and computer vision.
Ning Xu received his PhD in electronic science and technology from the University of Electronic Science and Technology of China, in 2003. Later, he was a postdoctoral fellow with Tsinghua University from 2003 to 2005. Currently, he is a professor at the Computer Science Department of Wuhan University of Technology. His research interests include computer-aided design of VLSI circuits and systems, computer architectures, data mining, and highly combinatorial optimization algorithms.