In this work, we train convolutional neural networks for person re-identification. However, all datasets of sufficient size for training consist of data from fixed camera networks. We show that the resulting models, while performing strongly on camera network data, struggle with the different characteristics of aerial imagery, likely because they overfit to biases inherent in the training data. To address this issue, we combine the deep features with hand-crafted covariance features, which introduce a higher degree of invariance into the combined representation. The fusion of both feature types is achieved by including the covariance information in the training process of the deep model.
We evaluate the combined representation on a dataset of twelve people moving through a scene recorded by four fixed cameras and one mobile aerial camera. We discuss the strengths and weaknesses of both feature types and show that our combined approach outperforms baselines as well as previous work.