Infrared-visible cross-modality person re-identification (IV-ReID) is a challenging task that aims to match infrared person images with visible person images of the same identity. The person images of the two modalities are captured by visible cameras and infrared cameras, respectively. Because of the discrepancy between the two modalities, most existing methods extract the common features of different modalities through a shared network. However, by ignoring the contribution of single-modality features, methods that merely extract common features lose part of the single-modality information. To address this problem, we propose an end-to-end model, the multi-complement feature network (MFN), which complements common features with single-modality features. MFN consists of two modules, a feature extracting module (FEM) and a feature complementing module (FCM). In the FEM stage, we employ a two-stream network with a multi-granularity architecture to extract single-modality features and common features. In the FCM stage, we exploit a graph convolutional network (GCN) to associate the multiple features of different modalities. Within the FCM, we design a concise but effective graph structure that takes the features extracted by the FEM as the input to the GCN. Compared with previous methods, our method preserves single-modality features and makes them work together with common features. Extensive experiments on two mainstream IV-ReID datasets, SYSU-MM01 and RegDB, demonstrate that our method achieves state-of-the-art performance.
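As an illustration of the two-stream idea, the following PyTorch sketch pairs modality-specific stems with a shared tail and fuses the resulting features through one graph convolution; the layer sizes, the two-node graph, and the module names are our own simplifications, not the actual MFN/FEM/FCM architecture.

```python
import torch
import torch.nn as nn

class TwoStreamFEM(nn.Module):
    """Toy feature extracting module: one stem per modality plus a shared tail."""
    def __init__(self, dim=256):
        super().__init__()
        self.vis_stem = nn.Conv2d(3, dim, 3, padding=1)    # visible-specific
        self.ir_stem = nn.Conv2d(3, dim, 3, padding=1)     # infrared-specific
        self.shared = nn.Conv2d(dim, dim, 3, padding=1)    # weights shared by both modalities

    def forward(self, x, modality):
        stem = self.vis_stem if modality == "visible" else self.ir_stem
        single = torch.relu(stem(x))              # single-modality feature map
        common = torch.relu(self.shared(single))  # common feature map
        # global average pooling -> one vector per image for each feature type
        return single.mean(dim=(2, 3)), common.mean(dim=(2, 3))

class GCNLayer(nn.Module):
    """One graph convolution that lets single-modality and common features interact."""
    def __init__(self, dim=256):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, feats, adj):
        # feats: (num_nodes, dim) node features; adj: row-normalized adjacency
        return torch.relu(self.lin(adj @ feats))

fem = TwoStreamFEM()
single, common = fem(torch.randn(1, 3, 128, 64), "visible")
nodes = torch.cat([single, common], dim=0)   # two graph nodes: single + common
adj = torch.full((2, 2), 0.5)                # fully connected, row-normalized
fused = GCNLayer()(nodes, adj)
```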
Recently, trackers composed of a target estimation module and a target classification module have achieved excellent accuracy with high efficiency. However, they underperform when encountering background semantic interference, large scale variation, and long-term tracking. To address these problems, we propose a two-stage tracking framework. First, we propose an objective function better suited to tracking tasks, named metrizable intersection over union, which considers both the alignment mode and the center distance between two bounding boxes. Second, multilevel features are used to eliminate semantic ambiguity by exploiting diverse semantic information. Third, a meta-synthetic decision strategy is proposed to determine the optimal location of the target. In comprehensive experiments on OTB100, LaSOT, TrackingNet, TColor-128, UAV123, and UAV20L, our method performs favorably against state-of-the-art trackers.
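The abstract does not give the exact form of metrizable intersection over union, so the sketch below uses a DIoU-style stand-in, an IoU score penalized by the normalized distance between box centers, purely to illustrate how a center-distance term can enter an IoU objective.

```python
import torch

def center_distance_iou(box1, box2, eps=1e-7):
    """IoU penalized by normalized center distance (a DIoU-style stand-in for
    the paper's metrizable IoU; the exact formulation may differ).
    Boxes are (x1, y1, x2, y2) tensors of shape (N, 4)."""
    # intersection area
    x1 = torch.max(box1[:, 0], box2[:, 0])
    y1 = torch.max(box1[:, 1], box2[:, 1])
    x2 = torch.min(box1[:, 2], box2[:, 2])
    y2 = torch.min(box1[:, 3], box2[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area1 = (box1[:, 2] - box1[:, 0]) * (box1[:, 3] - box1[:, 1])
    area2 = (box2[:, 2] - box2[:, 0]) * (box2[:, 3] - box2[:, 1])
    iou = inter / (area1 + area2 - inter + eps)
    # squared distance between box centers
    c1 = (box1[:, :2] + box1[:, 2:]) / 2
    c2 = (box2[:, :2] + box2[:, 2:]) / 2
    center_dist = ((c1 - c2) ** 2).sum(dim=1)
    # squared diagonal of the smallest enclosing box normalizes the distance
    ex1 = torch.min(box1[:, 0], box2[:, 0]); ey1 = torch.min(box1[:, 1], box2[:, 1])
    ex2 = torch.max(box1[:, 2], box2[:, 2]); ey2 = torch.max(box1[:, 3], box2[:, 3])
    diag = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    return iou - center_dist / (diag + eps)  # higher is better; loss = 1 - score
```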
We propose a parallel network with spatial–temporal attention for video-based person re-identification. Many previous video-based person re-identification methods use two-dimensional convolutional neural networks to extract spatial features and then extract temporal features by temporal pooling or recurrent neural networks. Unfortunately, such serial networks lose spatial information when extracting temporal information. Different from previous methods, our parallel network extracts temporal and spatial features simultaneously, which effectively reduces the loss of spatial information. In addition, we design a global temporal attention module that obtains attention weights from the correlation between the current frame and all frames in the sequence. At the same time, the temporal module guides the information extraction of the spatial module, which strengthens the temporal and spatial constraints. Experiments show that our method effectively improves re-identification accuracy and outperforms state-of-the-art methods.
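A minimal PyTorch sketch of one plausible reading of the global temporal attention module, in which each frame is re-weighted by its correlation with every frame in the sequence; the projection layer and feature dimensions are illustrative, not the paper's design.

```python
import torch
import torch.nn as nn

class GlobalTemporalAttention(nn.Module):
    """Weights each frame by its correlation with every frame in the sequence
    (an illustrative reading of the global temporal attention module)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, feats):
        # feats: (T, dim) per-frame features for one tracklet
        q = self.proj(feats)                          # (T, dim) projected queries
        corr = q @ feats.t() / feats.shape[1] ** 0.5  # (T, T) frame-to-frame correlation
        weights = torch.softmax(corr, dim=1)          # attention over the sequence
        attended = weights @ feats                    # (T, dim) temporally mixed features
        return attended.mean(dim=0)                   # sequence-level feature

feats = torch.randn(8, 256)  # 8 frames, 256-dim features each
video_feat = GlobalTemporalAttention(256)(feats)
```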
The emergence of low-cost depth cameras creates potential for RGB-D-based human action recognition. However, most existing RGB-D-based approaches simply concatenate heterogeneous features without discovering the latent relations among different modalities. We propose a discriminative common structure learning (DCSL) model for human action recognition from RGB-D sequences. Specifically, we extract deep learning-based features and hand-crafted features from multimodal data (skeleton, depth, and RGB). In particular, we propose a deep architecture based on a 3-D convolutional neural network to automatically extract deep spatiotemporal features from raw sequences. The proposed DCSL model utilizes a generalized version of collective matrix factorization to learn features shared among different modalities. To perform supervised learning and preserve intermodal similarity, we formulate a graph regularization term that considers both label information and the similar geometric structure of multimodal data, which is intended to improve the discriminative power of the shared features. Moreover, we solve the objective function with an iterative optimization algorithm. An improved collaborative representation classifier is then employed to perform computationally efficient action recognition. Experimental results on four action datasets demonstrate the superior performance of the proposed method.
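To make the shared-feature objective concrete, the NumPy sketch below minimizes a collective matrix factorization loss with a graph regularization term, sum_m ||X_m - U_m V||_F^2 + lam * tr(V L V^T), by plain gradient descent; the solver, step size, and initialization are our simplifications, not the paper's iterative algorithm.

```python
import numpy as np

def shared_factorization(views, L, k=16, lam=0.1, lr=1e-3, iters=500):
    """Toy collective matrix factorization with graph regularization.
    views: list of per-modality matrices X_m of shape (d_m, n).
    L: symmetric (n, n) graph Laplacian built from labels/geometry.
    Returns per-view bases U_m and the shared representation V (k, n)."""
    n = views[0].shape[1]
    rng = np.random.default_rng(0)
    Us = [rng.standard_normal((X.shape[0], k)) * 0.1 for X in views]
    V = rng.standard_normal((k, n)) * 0.1
    for _ in range(iters):
        residuals = [X - U @ V for X, U in zip(views, Us)]  # X_m - U_m V
        grad_V = 2 * lam * V @ L                            # d/dV of lam*tr(V L V^T)
        for m in range(len(views)):
            grad_V += -2 * Us[m].T @ residuals[m]           # d/dV of the fit term
            Us[m] -= lr * (-2 * residuals[m] @ V.T)         # d/dU_m of the fit term
        V -= lr * grad_V
    return Us, V
```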
Human action recognition is a challenging task in machine learning and pattern recognition. This paper presents an action recognition framework based on depth sequences. An effective feature descriptor named the depth motion maps pyramid (DMMP), inspired by depth motion maps (DMMs), is developed. First, a series of DMMs at multiple temporal scales is constructed to effectively capture the spatial–temporal motion patterns of human actions. These DMMs are then fused to obtain the final descriptor, the DMM pyramid. Second, we propose a discriminative collaborative representation classifier (DCRC), in which an extra constraint on the collaborative coefficient is imposed to provide prior knowledge for the representation coefficient. We apply the DCRC to encode the obtained features and recognize human actions. The proposed framework is evaluated on the MSR Action3D dataset, the MSR hand gesture dataset, UTD-MHAD, and the MSR Daily Activity3D dataset. The experimental results indicate the effectiveness of the proposed method for human action recognition.
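The basic DMM computation, accumulating thresholded differences of consecutive projected depth frames, can be sketched in a few lines of NumPy; the threshold value and the temporal subsampling used to mimic the pyramid's multiple scales are illustrative choices.

```python
import numpy as np

def depth_motion_map(depth_seq, threshold=10):
    """Accumulate thresholded frame differences of a projected depth sequence
    into a single motion map (the basic DMM operation)."""
    # depth_seq: (T, H, W) array of one 2-D projection (front, side, or top)
    diffs = np.abs(np.diff(depth_seq.astype(np.float32), axis=0))
    return (diffs > threshold).astype(np.float32).sum(axis=0)

def dmm_pyramid(depth_seq, scales=(1, 2, 4)):
    """Toy DMM pyramid: compute DMMs over temporally subsampled copies of the
    sequence and collect them (the scale choice here is illustrative)."""
    return [depth_motion_map(depth_seq[::s]) for s in scales]
```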
Face recognition is a challenging task in computer vision. Numerous efforts have been made to design low-level hand-crafted features for face recognition, but such features depend heavily on prior knowledge, which is difficult to obtain without learning new domain knowledge. Recently, ConvNets have attracted great attention for their feature-learning ability and have achieved state-of-the-art results on many computer vision tasks. However, typical ConvNets are trained by gradient descent in supervised mode, which results in high computational complexity. To solve this problem, an efficient unsupervised deep learning network is proposed for face recognition in this paper, which combines 2-D Gabor filters and (2D)2PCA to learn multistage convolutional filters. To speed up the computation, the learned high-dimensional features are further encoded into short binary hashes. Finally, a linear SVM is trained on the resulting features. Extensive experimental results on several facial benchmark databases show that the proposed network obtains competitive performance and robust distortion tolerance for face recognition.
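A compact NumPy sketch of the filtering front end: a small bank of 2-D Gabor kernels is applied to a face crop and the responses are binarized, with sign thresholding standing in for the paper's binary hashing scheme; all parameter values here are illustrative.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(size=7, sigma=2.0, theta=0.0, lam=4.0, gamma=0.5):
    """Real part of a 2-D Gabor filter (parameter values are illustrative)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)   # rotate coordinates by theta
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + gamma ** 2 * yr ** 2) / (2 * sigma ** 2))
    return envelope * np.cos(2 * np.pi * xr / lam)

# filter a face crop with a small bank of orientations, then binarize the
# responses (sign thresholding stands in for the paper's binary hashing)
image = np.random.rand(32, 32)  # placeholder face crop
bank = [gabor_kernel(theta=t) for t in np.linspace(0, np.pi, 4, endpoint=False)]
responses = [convolve2d(image, k, mode="same") for k in bank]
hashes = (np.stack(responses) > 0).astype(np.uint8)  # binary feature maps
```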
We propose a new representation, 3DGBOJ, to quickly and precisely classify human actions from a series of depth maps. We use Shotton et al.'s method to predict the best candidates for the 3D skeletal joint locations from Kinect depth maps. By normalizing and retargeting each human skeleton to a common skeleton, we eliminate the noise introduced by subject diversity and view dependence. Impossible motions are removed according to kinematic constraints. We design a 3D Gaussian space that maps each joint to a bin-based sparse feature vector. To reduce timescale variation, which occurs when actions are performed at different speeds and in different styles, we remove consecutive repeated vectors. We cluster the motion feature vectors with Affinity Propagation and treat each motion exemplar as a word in a bag-of-features (BoF) vocabulary. To better handle overlapping features and contextual dependencies, we train a linear-chain CRF model over these features. Experimental results show that our representation adapts well to variations across subjects of different genders and sizes, performing at different speeds and in different styles, and observed from different viewpoints.
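Two of the steps lend themselves to a short NumPy sketch: soft-assigning a joint to Gaussian spatial bins to form the sparse feature vector, and dropping consecutive repeated vectors to reduce speed variation. The bin centers, bandwidth, and sparsification rule below are assumptions, not the paper's exact construction.

```python
import numpy as np

def gaussian_bin_vector(joint, centers, sigma=0.2):
    """Soft-assign one 3-D joint position to a set of spatial bin centers with
    Gaussian weights, then sparsify by keeping only the strongest bins
    (an illustrative reading of the 3-D Gaussian-space mapping).
    joint: (3,) position; centers: (B, 3) bin centers."""
    d2 = ((centers - joint) ** 2).sum(axis=1)  # squared distance to each bin
    w = np.exp(-d2 / (2 * sigma ** 2))
    w[w < 0.05 * w.max()] = 0.0                # zero out weak bins -> sparse vector
    return w / (w.sum() + 1e-8)

def drop_consecutive_repeats(vectors, tol=1e-6):
    """Remove consecutive (near-)duplicate feature vectors so that performance
    speed has less effect on the sequence length."""
    kept = [vectors[0]]
    for v in vectors[1:]:
        if np.linalg.norm(v - kept[-1]) > tol:
            kept.append(v)
    return np.stack(kept)
```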