Discriminative deep transfer metric learning for cross-scenario person re-identification

Abstract. A discriminative deep transfer metric learning method called DDTML is proposed for cross-scenario person re-identification (Re-ID). Developing a Re-ID model in a new scenario normally requires a large number of pairwise cross-camera-view person images, and collecting and labeling them is expensive in both money and time. To address this problem, DDTML uses data transferred from other scenarios to help build a Re-ID model in a new scenario. Specifically, to measure the distribution difference across scenarios, a maximum mean discrepancy based on class distribution, called MMDCD, is proposed by embedding the discriminative information of data into the concept of the maximum mean discrepancy. Unlike most metric learning methods, which usually learn a linear distance to project data into the feature space, DDTML uses a deep neural network to build multilayer nonlinear transformations for learning a nonlinear distance metric, while transferring discriminative information from the source domain to the target domain. By embedding the MMDCD criterion, DDTML minimizes the distribution divergence between the source domain and the target domain. Experimental results on widely used Re-ID datasets show the effectiveness of the proposed method.


Introduction
Surveillance systems contain a large number of cameras, which provide huge amounts of video data. Analyzing the video obtained in a surveillance system often requires the ability to track people across multiple cameras. [11] In the past five years, a large number of Re-ID models have been proposed. [18][19][20] In this paper, we focus on the second type, i.e., we learn the optimal distance measure to give correct matches in Re-ID.
However, it is not easy to develop a deployable and efficient Re-ID model in a new scenario (e.g., when moving from an indoor classroom to an outdoor square). First, because illumination, posture, and view angle differ, robust features obtained in one scenario do not perform well in another. Second, to obtain a robust Re-ID model, one must collect a large number of labeled person images from the new scenario for training.
However, this work is very expensive in both monetary cost and labeling time. Some unsupervised methods have been proposed to address this problem. For example, Ma et al. [21] introduced a time shift dynamic time warping model for unsupervised person representation. Ye et al. [22] proposed a dynamic graph matching method to mine intermediate estimated labels across disjoint cameras; with the estimated labels, the remaining steps can be treated as supervised learning. However, compared with supervised Re-ID methods, unsupervised methods match less effectively when a person undergoes severe appearance changes [23].
Recently, transfer learning has been widely used in Re-ID. The principal goal of transfer learning is to help build a Re-ID model in a new scenario (target domain) by leveraging the data collected from other scenarios (source domain) [24]. For example, in a crowded station, a large amount of data may already exist that was used to build Re-ID models for their own respective scopes. To build a Re-ID model for a new scenario, we may use these existing data in the source domain without collecting a lot of labeled data in the target domain. Ref. 25 demonstrated that certain discriminative information or common variations (such as pose and resolution) shared across scenarios can lead to significant performance gains in a new scenario. Unlike multitask learning, which aims to benefit all tasks in both the target domain and the source domain, transfer learning for Re-ID mainly aims to benefit the target one.
In this work, we first propose a maximum mean discrepancy based on class distribution, called MMDCD, to measure the distribution difference across domains. MMDCD embeds the discriminative information of data taken from the source domain into the concept of the maximum mean discrepancy (MMD) [26]. Minimizing MMDCD minimizes the distribution difference across domains in a supervised way. We then propose a discriminative deep transfer metric learning method called DDTML for cross-scenario transfer Re-ID. Figure 1 shows the basic idea of the proposed method. Using a deep neural network, DDTML learns a set of multilayer nonlinear transformations to transfer discriminative information from the source domain to the target domain; meanwhile, DDTML reduces the distribution divergence between the source data and the target data by minimizing MMDCD at the top layer of the network.
The contributions of this work can be summarized in the following three aspects.
(1) Unlike MMD, which works in an unsupervised way, MMDCD works in a supervised way: it not only exploits the discriminative information of data taken from the source domain but also sets different coefficients for matched and mismatched pairs. Minimizing MMDCD can enhance the discriminant ability of DDTML.
(2) By embedding MMDCD into a deep metric network, DDTML learns a set of multilayer nonlinear transformations to better exploit the discriminative information for cross-scenario Re-ID tasks.
(3) Extensive experiments on several Re-ID datasets are conducted, and the experimental results demonstrate that the proposed DDTML method performs better than several state-of-the-art methods.

Related Work
According to the stage of the Re-ID process they address, existing works can be generally divided into two categories: methods that seek robust features and methods that seek the optimal distance. The goal of the robust feature methods is to increase the representative capability of the features. For example, Ma et al. [27] proposed a BiCov descriptor based on Gabor filters and the covariance descriptor to track persons. Kviatkovsky et al. [28] constructed an invariant intradistribution structure of color to adapt to a wide range of imaging conditions. Yang et al. [29] developed a robust semantic salient color names-based color descriptor to calculate photometric variance.
However, descriptors of visual appearance are so highly susceptible to cross-view variations and heavily rely on foreground segmentations that it is difficult for them to achieve a balance between discriminative power and robustness.
Metric learning methods, the most popular family of distance learning methods, aim to find a distance or similarity function over the features extracted from different persons' images that makes the correct match most likely. For example, Pedagadi et al. [30] applied a two-stage method in a low-dimensional manifold learning framework that combines principal component analysis (PCA) and local Fisher discriminant analysis (LFDA). Kostinger et al. [16] proposed a metric learning principle of keeping it simple and straightforward (KISSME) to learn a distance metric from equivalence constraints based on a statistical inference perspective. Hu et al. [31] exploited discriminative information to propose discriminative deep metric learning (DDML), which is a major reference of this paper.
Note that cross-scenario transfer learning has been adopted for Re-ID in the hope that the target domain (new scenario) can exploit transferable discriminative information from the source domain (other scenarios) with labeled images. For example, Wang et al. [25] proposed the constrained asymmetric multitask discriminative component analysis (cAMT-DCA) method to explore discriminative modeling in the shared latent space for cross-scenario transfer learning. Cheng et al. [32] proposed a transfer metric learning method, OurTransD, to jointly learn both the commonalities and the personality of the data from different scenarios. Zhang et al. [33] proposed a two-stage transfer metric learning (TSTML) method, which transfers generic knowledge from the source set in the first stage and then transfers the distance metric for each probe-specific person in the second stage. Table 1 summarizes seven Re-ID methods, i.e., LFDA, KISSME, DDML, TSTML, cAMT-DCA, OurTransD, and the DDTML proposed in this study, in terms of similarity function, optimization method, and whether each is a transfer learning or deep learning method. Unlike the other three transfer learning methods, our proposed DDTML uses a deep network to learn a set of multilayer nonlinear projections for cross-scenario transfer learning. In particular, MMDCD is proposed to measure the distribution difference across domains.
Proposed Methods

Discriminative Deep Metric Learning
The DDML method was originally proposed for face verification in the wild. DDML uses a deep neural network to learn a nonlinear mapping function that projects face samples into the feature space.
Assume that DDML constructs a deep neural network with $M + 1$ layers and that $p^{(m)}$ is the number of units in the $m$'th layer, where $m = 1, 2, \ldots, M$. For a given person image sample $x \in \mathbb{R}^{d}$, $h^{(0)} = x$ is the original input of the network and $h^{(1)} = \varphi(W^{(1)} x + b^{(1)}) \in \mathbb{R}^{p^{(1)}}$ is the output of the first layer, where $W^{(1)}$ and $b^{(1)}$ are the projection matrix and bias vector of the first layer, respectively. $\varphi(\cdot)$ is a nonlinear activation function that operates component-wise, such as the widely used tanh or sigmoid functions. Using $h^{(1)}$ as the input of the second layer, we obtain the output of this layer, $h^{(2)} = \varphi(W^{(2)} h^{(1)} + b^{(2)}) \in \mathbb{R}^{p^{(2)}}$. Proceeding in this way, we obtain the output of the topmost layer

$$f(x) = h^{(M)} = \varphi\left(W^{(M)} h^{(M-1)} + b^{(M)}\right) \in \mathbb{R}^{p^{(M)}}, \tag{1}$$

where $f: \mathbb{R}^{d} \mapsto \mathbb{R}^{p^{(M)}}$ is a parametric nonlinear function determined by the parameters $W^{(m)}$ and $b^{(m)}$ ($m = 1, 2, \ldots, M$). Two person images $x_i$ and $x_j$ are finally represented as $f(x_i)$ and $f(x_j)$ at the topmost layer of the network. Using the squared Euclidean distance, the distance between $x_i$ and $x_j$ at the top level can be measured as

$$d_f^2(x_i, x_j) = \left\| f(x_i) - f(x_j) \right\|_2^2. \tag{2}$$

The optimization problem of DDML is designed as follows:

$$\arg\min_f J = \frac{1}{2} \sum_{i,j} g\left(1 - l_{ij}\left(\tau - d_f^2(x_i, x_j)\right)\right) + \frac{\lambda}{2} \sum_{m=1}^{M} \left( \left\| W^{(m)} \right\|_F^2 + \left\| b^{(m)} \right\|_2^2 \right), \tag{3}$$

where the function $g(z) = \frac{1}{\beta} \log\left[1 + \exp(\beta z)\right]$ is the smoothed approximation of $[z]_+ = \max(z, 0)$, $\beta$ is a sharpness parameter, $\|A\|_F$ is the Frobenius norm, $\lambda$ is a regularization parameter, and $\tau$ is a threshold. The pairwise label $l_{ij}$ denotes the similarity of the pair $\{x_i, x_j\}$: $l_{ij} = 1$ means that $x_i$ and $x_j$ are a matched image pair, and $l_{ij} = -1$ means that they are a mismatched image pair. With $y_i$ and $y_j$ denoting the identity labels of $x_i$ and $x_j$, $l_{ij}$ can be determined as follows:

$$l_{ij} = \begin{cases} 1, & y_i = y_j, \\ -1, & y_i \neq y_j. \end{cases} \tag{4}$$

From the optimization problem in Eq. (3), it can be seen that when the training data in a new scenario are insufficient, DDML cannot directly use data collected from different scenarios to help build the Re-ID model in this new scenario. This is the key problem we aim to solve in this work.
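The forward mapping and pairwise loss described above can be illustrated with a minimal NumPy sketch. It assumes tanh as the activation and a per-pair term of the form $g(1 - l_{ij}(\tau - d_f^2))$ as used in DDML; the function names, the random initialization, and the default values of `tau` and `beta` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def forward(x, weights, biases):
    """Top-layer representation f(x) = h^(M), with h^(m) = tanh(W^(m) h^(m-1) + b^(m))."""
    h = x
    for W, b in zip(weights, biases):
        h = np.tanh(W @ h + b)
    return h

def squared_distance(x_i, x_j, weights, biases):
    """Squared Euclidean distance between the top-layer representations."""
    diff = forward(x_i, weights, biases) - forward(x_j, weights, biases)
    return float(diff @ diff)

def smoothed_hinge(z, beta=10.0):
    """g(z) = (1/beta) * log(1 + exp(beta * z)), computed stably via logaddexp."""
    return np.logaddexp(0.0, beta * z) / beta

def pair_loss(x_i, x_j, l_ij, weights, biases, tau=3.0, beta=10.0):
    """Loss of one pair; l_ij is +1 for a matched pair and -1 for a mismatched pair."""
    d2 = squared_distance(x_i, x_j, weights, biases)
    return float(smoothed_hinge(1.0 - l_ij * (tau - d2), beta))

# Toy usage with hypothetical layer sizes (input 500, then 200 -> 200 -> 100).
rng = np.random.default_rng(0)
sizes = [500, 200, 200, 100]
weights = [0.01 * rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]
x_i, x_j = rng.standard_normal(500), rng.standard_normal(500)
print(pair_loss(x_i, x_j, +1, weights, biases))
```

A matched pair is penalized when its distance exceeds roughly $\tau - 1$, and a mismatched pair is penalized when its distance falls below roughly $\tau + 1$, which is what the sign of `l_ij` controls in `pair_loss`.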

Discriminative Deep Transfer Metric Learning Method
Based on the projection scheme of the deep neural network, we learn a set of multilayer nonlinear transformations to project the data in the source domain and target domain into the same transformed space. We therefore need to measure the distribution difference between the source domain and target domain in this transformed space. MMD, a well-known criterion for estimating the distance between distributions, is nonparametric and does not need an intermediate density estimate [26]. Let $X_s = \{(x_{si}, y_{si}) \mid i = 1, 2, \ldots, N_s\}$ and $X_t = \{(x_{ti}, y_{ti}) \mid i = 1, 2, \ldots, N_t\}$ be the training sets in the source domain and target domain, respectively, where $x_{si}$ and $x_{ti}$ are samples of dimensionality $d$, $y_{si}$ and $y_{ti}$ are the labels of $x_{si}$ and $x_{ti}$, and $N_s$ and $N_t$ are the numbers of training samples in the source domain and target domain. The distance between the distributions of the two domains is equivalent to the distance between the means of the total-class data across domains, which can be written as follows [26]:

$$\mathrm{MMD}(X_t, X_s) = \left\| \frac{1}{N_t} \sum_{i=1}^{N_t} f(x_{ti}) - \frac{1}{N_s} \sum_{i=1}^{N_s} f(x_{si}) \right\|_2^2. \tag{5}$$

However, MMD measures the distribution difference between two domains in an unsupervised way; that is, it ignores the label information of samples. In addition, in a practical transfer Re-ID task, there is often an imbalance between matched (positive) image pairs and mismatched (negative) pairs. To carry out effective transfer learning, we propose MMDCD, which embeds the discriminative information of data taken from the source domain into the concept of the MMD in Eq. (6), where $x_{si}^{+}$ and $x_{si}^{-}$ are the matched and mismatched image samples in the source domain, respectively, and $N_{s+}$ and $N_{s-}$ ($N_{s+} + N_{s-} = N_s$) are the numbers of matched and mismatched image samples in the source domain, respectively. Following the deep network learning strategy in Ref. 29, the nonlinear representation $f(x)$ can be computed using Eq. (1) at the topmost layer of the network. To measure the distance between the means of the data across domains, MMDCD not only uses the label information of data taken from the source domain but also sets different coefficients to weight matched and mismatched pairs according to their different sizes.
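As a rough illustration of the unsupervised MMD that MMDCD builds on, the sketch below computes the squared distance between the mean target representation and the mean source representation. It assumes the samples have already been mapped by the network, so each row of `F_t` and `F_s` is one top-layer representation; the function name is hypothetical, and the class-dependent coefficients of MMDCD are deliberately not reproduced here.

```python
import numpy as np

def mmd_of_means(F_t, F_s):
    """Squared Euclidean distance between the mean target and mean source representations.

    F_t: array of shape (N_t, p), top-layer representations of target samples.
    F_s: array of shape (N_s, p), top-layer representations of source samples.
    """
    diff = F_t.mean(axis=0) - F_s.mean(axis=0)
    return float(diff @ diff)

# Toy usage: two domains whose means differ by 1 in each of two dimensions.
F_t = np.ones((3, 2))
F_s = np.zeros((4, 2))
print(mmd_of_means(F_t, F_s))
```

Because only the two domain means enter the computation, this quantity cannot distinguish matched from mismatched samples, which is the limitation the supervised MMDCD is meant to address.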
As shown in Fig. 1, DDTML constructs a deep neural network to obtain the representations of data in the source domain and target domain through multiple layers of nonlinear transformations. With MMDCD minimized at the top layer of the network, the optimization problem of DDTML, $\arg\min_f J$, is given in Eq. (7), where $\mathrm{MMDCD}_{ts}^{(M)}(X_t, X_s)$ is the MMDCD at the $M$'th layer of the deep neural network, and $\alpha$ ($\alpha \geq 0$) and $\beta$ ($\beta \geq 0$) are the regularization parameters.
To solve the optimization problem in Eq. (7), we use the stochastic subgradient descent scheme to obtain the parameters $W^{(m)}$ and $b^{(m)}$, where $m = 1, \ldots, M$. The gradients of the objective function $J$ with respect to the parameters $W^{(m)}$ and $b^{(m)}$ are given in Eqs. (8) and (9), where $h_i^{(0)}$ and $h_j^{(0)}$ are the original inputs.
For the $M$'th layer of our network, the updating equations are given in Eqs. (12) to (14). For the other layers $m = 1, 2, \ldots, M - 1$ of our network, the updating equations take the corresponding layer-wise form, in which $\odot$ denotes element-wise multiplication. $c$ and $z_i^{(m)}$ ($m = 1, 2, \ldots, M$) are given in Eqs. (20) and (21). Then $W^{(m)}$ and $b^{(m)}$ can be updated using the gradient descent algorithm until convergence as follows:

$$W^{(m)} \leftarrow W^{(m)} - \lambda \frac{\partial J}{\partial W^{(m)}}, \qquad b^{(m)} \leftarrow b^{(m)} - \lambda \frac{\partial J}{\partial b^{(m)}}, \tag{22}$$

where $\lambda$ is the learning rate.
Based on the analysis above, we summarize the entire construction procedure of DDTML in Algorithm 1.

Datasets and Experimental Setting
In our experiments, four Re-ID datasets are adopted: 3DPeS [34], i-LIDS [35], CAVIAR [19], and VIPeR [36]. The 3DPeS dataset is a collection of 1011 person images of 192 individuals from eight different surveillance cameras on an academic campus. The i-LIDS dataset contains images of 119 persons captured in an airport, with an average of four images per person, so it consists of 476 images in total. The CAVIAR dataset is a collection of 1220 person images of 72 individuals, with 10 to 20 images per person. The VIPeR dataset contains 632 persons captured from two different camera views, so it consists of 1264 images. To construct the transfer learning Re-ID model, we choose one dataset as the target dataset and one of the other three datasets as the source dataset, following the same settings as Ref. 25. There are therefore 12 cross-scenario transfer learning tasks in total.
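The count of 12 tasks follows from choosing an ordered (source, target) pair from the four datasets, i.e., 4 × 3 = 12. A short sketch, with variable names chosen only for illustration:

```python
from itertools import permutations

datasets = ["3DPeS", "i-LIDS", "CAVIAR", "VIPeR"]

# Each ordered pair (source, target) with source != target is one transfer task.
tasks = list(permutations(datasets, 2))

for source, target in tasks:
    print(f"{source} -> {target}")
```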
In our experiments, all person images from the above four datasets are scaled to 128 × 48 for feature extraction. Following the same settings as Ref. 25, three kinds of feature descriptors (color, LBP, and HOG) are generated for each image. After extracting the feature vectors, we use PCA to compress them into 500-dimensional feature vectors.
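The PCA compression step can be sketched with a plain SVD, as below. This is a generic PCA under the assumption that the concatenated color, LBP, and HOG descriptors are stored row-wise in a matrix `X`; the function name and the toy dimensions are hypothetical, and the paper's setting corresponds to `n_components=500`.

```python
import numpy as np

def pca_fit_transform(X, n_components):
    """Project the rows of X onto the top principal components.

    Returns the projected data, the feature mean, and the component matrix,
    so that new samples can be projected as (x - mean) @ components.T.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    return Xc @ components.T, mean, components

# Toy usage: 50 samples with 20 raw features compressed to 5 dimensions.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
Z, mean, components = pca_fit_transform(X, n_components=5)
print(Z.shape)
```

Note that the number of retained components cannot exceed the number of training samples or raw features, so the projection would normally be fitted on the pooled training images and then applied to the test images with the stored `mean` and `components`.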
For comparison purposes, six state-of-the-art Re-ID methods are compared against our proposed DDTML. They fall into two groups: (1) nontransfer learning methods: LFDA [30], KISSME [16], and DDML [31]; and (2) transfer learning methods: geometry preserving large margin nearest neighbor (GPLMNN) [37], OurTransD [32], and cAMT-DCA [25]. Furthermore, to better observe the behavior of MMDCD, we develop another transfer learning Re-ID method, called DDTML-MMD, by replacing MMDCD in DDTML with the MMD criterion. We train a deep network with three layers for DDTML, with 200 → 200 → 100 neural nodes for all datasets. Based on our extensive experiments, the tanh function is used as $\varphi(\cdot)$, and the parameters $\alpha$, $\beta$, $\tau$, and $\lambda$ are set to $10^{-1}$, 10, 3, and 0.3, respectively.
In our experiments, we randomly split the target dataset into two equal partitions; one partition is used as the target training set and the other as the target testing set. For the five transfer learning methods, all person images in the source dataset and the target training set are used for training, and all images in the target testing set are used for testing. Following Ref. 38, the performance of each method is evaluated in terms of the cumulative matching characteristic (CMC). The CMC represents the probability of finding the correct match within the top r ranked images, with r varying from 1 to 20. The CMC is usually used to measure performance on the closed-set Re-ID problem, which assumes that the same person can be found in both the probe set and the gallery set. In many real-world scenarios, however, this assumption is not satisfied, e.g., in scenarios with imposters. To simulate these open-set scenarios, the images of 40% of the gallery people are randomly removed. The receiver operating characteristic (ROC) curve with i-LIDS as the target dataset is used as the evaluation metric to compare DDTML with the other algorithms. To make our results fair, we repeat the aforementioned partition 10 times for each dataset, and both the CMC and ROC curves for the 10 runs are recorded.
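A minimal sketch of how a closed-set CMC curve can be computed from a probe-to-gallery distance matrix is given below. It assumes that every probe identity appears in the gallery, as the closed-set setting requires; the function name and the toy distances are illustrative and do not come from the paper's evaluation code.

```python
import numpy as np

def cmc_curve(dist, probe_ids, gallery_ids, max_rank=20):
    """Cumulative matching characteristic for the closed-set setting.

    dist[i, j] is the learned distance between probe i and gallery image j.
    Entry r-1 of the result is the fraction of probes whose correct match
    appears among the r nearest gallery images.
    """
    probe_ids = np.asarray(probe_ids)
    gallery_ids = np.asarray(gallery_ids)
    order = np.argsort(dist, axis=1)           # gallery indices, nearest first
    ranked_ids = gallery_ids[order]            # identities in ranked order
    hits = ranked_ids == probe_ids[:, None]
    first_hit = hits.argmax(axis=1)            # 0-based rank of first correct match
    max_rank = min(max_rank, dist.shape[1])
    return np.array([(first_hit < r).mean() for r in range(1, max_rank + 1)])

# Toy usage: three probes against a gallery of three identities.
dist = np.array([[0.1, 0.9, 0.5],
                 [0.2, 0.8, 0.3],
                 [0.7, 0.6, 0.1]])
cmc = cmc_curve(dist, probe_ids=[0, 1, 2], gallery_ids=[0, 1, 2])
print(cmc)
```

In this toy case two of the three probes are matched correctly at rank 1 and the remaining probe is matched only at rank 3, so the curve stays flat from rank 1 to rank 2 and reaches 1 at rank 3.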

Results and Analysis
In this section, we examine the effectiveness of the proposed method DDTML by comparing its performance with LFDA (LFDA-S, LFDA-T, and LFDA-Mix), KISSME
(4) From the aforementioned four tables, we can observe that when VIPeR is used as the target dataset, the performance of all methods is lower. This is because there are about 316 persons in the test set for VIPeR, whereas there are about 60 on average in the other three datasets, and it is harder to find the correct match in a larger gallery. However, DDTML achieves the best performance on this dataset. This further indicates that DDTML can specifically consider the essential discrepancy across domains.
(5) Similar results are also observed on the ROC curves with i-LIDS as the target dataset under the open-set setting, where DDTML achieves satisfactory performance. It can be clearly seen that our proposed DDTML is well suited to transfer learning Re-ID tasks.

Conclusion
In this paper, by integrating DDML with transfer learning, we propose the DDTML method to learn a distance metric that measures the similarity between image pairs in a Re-ID dataset. However, DDTML is not simply a transfer learning version of DDML. Taking into account the discriminative information of data and the inherent characteristics of Re-ID datasets, the developed method also uses MMDCD to minimize the distribution divergence between source data and target data.
Extensive experimental results on the 3DPeS, i-LIDS, CAVIAR, and VIPeR datasets have shown that our method outperforms the state-of-the-art methods on most of the cross-scenario transfer Re-ID tasks. Because the formulation of MMDCD is simple, how to take fuller advantage of the Re-ID dataset remains an interesting direction for future work.

Disclosures
This paper has been listed in the proceedings of 2018 SPIE Commercial + Scientific Sensing and Imaging (SI18C), volume DL10670.

For the three nontransfer learning methods, all images in the source dataset are used for training. In particular, to observe how the performance of nontransfer learning methods changes on transfer datasets, LFDA and KISSME are trained in three cases. LFDA-S and KISSME-S use only the source dataset for training; LFDA-T and KISSME-T use only the target dataset for training; and LFDA-Mix and KISSME-Mix use both the source and target training datasets for training.

Fig. 2 Performance comparison using ROC curves on the i-LIDS dataset as target dataset, with (a) 3DPeS, (b) CAVIAR, and (c) VIPeR as source dataset.

Table 2
Matching rate (%) on the VIPeR dataset as target dataset.

Table 3
Matching rate (%) on the i-LIDS dataset as target dataset.

Table 4
Matching rate (%) on the CAVIAR dataset as target dataset.

Table 5
Matching rate (%) on the 3DPeS dataset as target dataset.

Compared with the other three transfer learning methods, GPLMNN, OurTransD, and cAMT-DCA, DDTML achieves satisfactory performance. In particular, it obtains the best average matching rate in 10 of the 12 transfer tasks. This is because DDTML uses a deep neural network to learn a set of multilayer nonlinear transformations, so that more reliable representations of the data in the feature space can be exploited.