The emergence of low-cost depth cameras creates potential for RGB-D-based human action recognition. However, most existing RGB-D-based approaches simply concatenate the original heterogeneous features without discovering the latent relations among different modalities. We propose a discriminative common structure learning (DCSL) model for human action recognition from RGB-D sequences. Specifically, we extract deep-learning-based features and hand-crafted features from multimodal data (skeleton, depth, and RGB). In particular, we propose a deep architecture based on a 3-D convolutional neural network to automatically extract deep spatiotemporal features from raw sequences. The proposed DCSL model utilizes a generalized version of collective matrix factorization to learn shared features among the different modalities. To perform supervised learning and preserve intermodal similarity, we formulate a graph regularization term that considers both label information and the similar geometric structure of the multimodal data, which is intended to improve the discriminative power of the shared features. Moreover, we solve the objective function using an iterative optimization algorithm. An improved collaborative representation classifier is then employed to perform computationally efficient action recognition. Experimental results on four action datasets demonstrate the superior performance of the proposed method.
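The shared-feature idea can be illustrated with a plain collective matrix factorization: each modality's feature matrix is approximated with a latent factor shared across modalities. This is a minimal alternating-least-squares sketch of the basic CMF idea only; the paper's DCSL additionally includes the graph regularization term and supervised label information, which are not modeled here.

```python
import numpy as np

def collective_mf(views, k=5, lam=0.1, iters=50, seed=0):
    """Alternating least-squares collective matrix factorization.

    Each modality X_m (n samples x d_m features) is approximated as
    U @ V_m.T, where U (n x k) is the latent representation SHARED
    across all modalities and V_m is modality-specific.
    """
    rng = np.random.default_rng(seed)
    n = views[0].shape[0]
    U = rng.normal(size=(n, k))
    I = lam * np.eye(k)
    for _ in range(iters):
        # Update each modality-specific factor V_m with U fixed
        # (ridge-regularized least squares in closed form).
        Vs = [np.linalg.solve(U.T @ U + I, U.T @ X).T for X in views]
        # Update the shared factor U using all modalities jointly.
        A = sum(V.T @ V for V in Vs) + I
        B = sum(X @ V for X, V in zip(views, Vs))
        U = np.linalg.solve(A, B.T).T
    return U, Vs
```

After convergence, the rows of `U` serve as the common representation fed to the downstream classifier.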
Human action recognition is a challenging task in machine learning and pattern recognition. This paper presents an action recognition framework based on depth sequences. An effective feature descriptor named depth motion maps pyramid (DMMP), inspired by DMMs, is developed. First, a series of DMMs at different temporal scales is constructed to effectively capture the spatial-temporal motion patterns of human actions; these DMMs are then fused to obtain the final descriptor, the DMM pyramid. Second, we propose a discriminative collaborative representation classifier (DCRC), in which an extra constraint is imposed on the collaborative coefficients to provide prior knowledge for the representation. DCRC is then applied to encode the obtained features and recognize human actions. The proposed framework is evaluated on the MSR three-dimensional (3-D) action datasets, the MSR hand gesture dataset, UTD-MHAD, and the MSR Daily Activity 3D dataset. The experimental results indicate the effectiveness of the proposed method for human action recognition.
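The collaborative representation step has a closed-form core that is worth seeing. The sketch below is the standard CRC-RLS baseline (regularized least-squares coding over all training samples, followed by class-wise residuals); the paper's DCRC differs in that it adds an extra discriminative constraint on the coefficients, which is not reproduced here.

```python
import numpy as np

def crc_classify(X_train, y_train, x_test, lam=0.01):
    """Standard CRC-RLS: code x_test over ALL training samples at once,
    then assign the class with the smallest regularized residual."""
    # Columns of D are l2-normalized training samples (the dictionary).
    D = X_train / np.linalg.norm(X_train, axis=0, keepdims=True)
    # Closed-form ridge coding: alpha = (D^T D + lam I)^-1 D^T x.
    alpha = np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]),
                            D.T @ x_test)
    best_score, best_class = np.inf, None
    for c in np.unique(y_train):
        idx = (y_train == c)
        # Reconstruct x_test using only class c's samples/coefficients.
        resid = np.linalg.norm(x_test - D[:, idx] @ alpha[idx])
        score = resid / (np.linalg.norm(alpha[idx]) + 1e-12)
        if score < best_score:
            best_score, best_class = score, c
    return best_class
```

Because the coding matrix `(D^T D + lam I)^-1 D^T` depends only on the training set, it can be precomputed once, which is what makes this family of classifiers computationally efficient.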
Face recognition is a challenging task in computer vision. Numerous efforts have been made to design low-level hand-crafted features for face recognition, but such features depend heavily on prior knowledge, which is difficult to obtain without learning new domain knowledge. Recently, ConvNets have attracted great attention for their feature-learning ability and have achieved state-of-the-art results on many computer vision tasks. However, typical ConvNets are trained by gradient descent in supervised mode, which results in high computational complexity. To solve this problem, an efficient unsupervised deep learning network is proposed for face recognition in this paper, which combines 2-D Gabor filters and (2D)²PCA to learn multistage convolutional filters. To speed up computation, the learned high-dimensional features are further encoded as short binary hashes. Finally, the obtained output features are classified with a linear SVM. Extensive experimental results on several facial benchmark databases show that the proposed network achieves competitive performance and robust distortion tolerance for face recognition.
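Learning convolutional filters without gradient descent can be sketched in a few lines: take the leading eigenvectors of the patch covariance matrix as filters, in the style of PCANet. This stands in for, but is not identical to, the paper's Gabor + (2D)²PCA construction; the patch size and filter count below are illustrative choices.

```python
import numpy as np

def pca_filters(images, patch=5, n_filters=8):
    """Learn convolutional filters as the top eigenvectors of the
    patch covariance matrix (PCANet-style unsupervised filter
    learning) -- a simplified stand-in for (2D)^2 PCA."""
    patches = []
    for img in images:
        h, w = img.shape
        for i in range(h - patch + 1):
            for j in range(w - patch + 1):
                p = img[i:i + patch, j:j + patch].ravel()
                patches.append(p - p.mean())   # remove the patch mean
    P = np.asarray(patches)                    # (num_patches, patch*patch)
    cov = P.T @ P / len(P)
    vals, vecs = np.linalg.eigh(cov)
    # eigh returns eigenvalues in ascending order; keep the top ones.
    top = vecs[:, ::-1][:, :n_filters]
    return top.T.reshape(n_filters, patch, patch)
```

Each returned filter is then applied by convolution, and the binarized (sign-thresholded) responses can be packed bit-wise into the short hash codes the abstract mentions.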
We propose a new representation, 3DGBOJ, to quickly and precisely classify human actions from a series of depth maps. We use Shotton et al.'s method to predict the best candidate 3-D skeletal joint locations from Kinect depth maps. By normalizing and retargeting each human skeleton to a common skeleton, we eliminate the noise introduced by subject diversity and view dependence, and impossible motions are discarded according to kinematic constraints. We design a 3-D Gaussian space that maps each joint to a bin-based sparse feature vector. To reduce the timescale variation that arises when actions are performed at different speeds and in different styles, we remove consecutive repeated vectors. We cluster the motion feature vectors with Affinity Propagation and treat each motion exemplar as a word in a bag-of-features (BoF) vocabulary. To better handle overlapping features and contextual dependencies, we train a linear-chain CRF model over these features. Experimental results show that our representation adapts well to variations across subjects of different gender and size, and to differences in speed, style, and viewpoint.
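The Gaussian binning step can be sketched as a soft assignment of each normalized joint position to a 3-D grid of bins with Gaussian weights. This is one hypothetical reading of the abstract's "3-D Gaussian space"; the grid size, the bandwidth `sigma`, and the per-joint normalization are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def joint_bin_features(joints, grid=4, sigma=0.1):
    """Soft-assign 3-D joint positions (normalized into [0,1]^3) to a
    grid^3 lattice of bins using Gaussian weights, yielding a sparse
    per-frame feature vector. Illustrative sketch; grid and sigma are
    hypothetical parameters."""
    # Bin centers along each axis, e.g. [0.125, 0.375, 0.625, 0.875].
    centers = (np.arange(grid) + 0.5) / grid
    cx, cy, cz = np.meshgrid(centers, centers, centers, indexing="ij")
    bins = np.stack([cx, cy, cz], axis=-1).reshape(-1, 3)  # (grid^3, 3)
    feat = np.zeros(len(bins))
    for j in joints:
        # Gaussian vote of this joint over all bins, normalized to 1.
        w = np.exp(-np.sum((bins - j) ** 2, axis=1) / (2 * sigma ** 2))
        feat += w / w.sum()
    return feat / len(joints)
```

The per-frame vectors produced this way are what would then be deduplicated, clustered (e.g. with Affinity Propagation) into BoF exemplars, and fed to the CRF.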