Human action recognition for videos has been applied extensively in man–machine interaction systems, video surveillance, virtual reality, and patient monitoring, but it remains a challenging problem in computer vision because of complex backgrounds, variable movement speeds, and different shooting scales and viewpoints. To improve the robustness and accuracy of recognition algorithms, many state-of-the-art methods have been proposed.
Recently, local spatiotemporal features,1–4 which describe human movements by treating the action volume as a rigid three-dimensional (3-D) object, have achieved promising performance on many datasets.5 The low-level features are extracted from local regions where the temporal and spatial characteristics change markedly, or are obtained by a dense sampling strategy, to represent the patterns of each 3-D volume. These spatiotemporal features are usually combined with the bag-of-visual-words (BoVW) pipeline and its improved variants6–9 to model human behaviors; this approach does not require any human detection procedure and is robust to illumination and background changes. Then, the global representation, which is constructed from a set of local features, is fed into support vector machines (SVMs) to achieve action classification.9,10 Ample research progress has been made on local feature extraction and feature encoding, the two critical steps of this classic and effective process. Laptev and Lindeberg1 proposed the detector of space-time interest points (STIPs), which extends two-dimensional (2-D) Harris corner detection, and employed the histogram of oriented gradients (HOG)11 and the histogram of optical flow (HOF)12 to describe the extracted regions. Because STIPs are usually sparse and cannot capture richer information about human movement, many improved algorithms were put forward.13,14 Wang et al.15 demonstrated that dense sampling of local video blocks is more efficient than sparse corner detection. Dense trajectories (DTs)16 and improved dense trajectories (IDTs),5 which obtained good performance in various experiments, are based on the dense sampling strategy.
In the feature encoding stage, several methods can be used to produce a suitable dictionary, such as voting-based encoding,16–18 Fisher vectors (FVs),3,8,19 and sparse coding techniques.20,21 As a super vector encoding method, FVs were applied to large-scale image classification by Perronnin et al.22 The vector of locally aggregated descriptors (VLAD)23 is a simplified variant of FVs, in which only the nearest cluster center of each feature point and the per-dimension differences from that center are considered. Although the recognition accuracy of VLAD is slightly lower than that of FVs, it is more efficient to execute.
The above research for action recognition bypasses body poses and achieves promising results using local spatiotemporal features. Although action recognition and pose estimation have different goals, the two tasks are not only highly coupled but also complementary, and it is desirable to study them in a common framework.24 The prevailing methods for pose estimation24–26 from still images adopt a pictorial structure model, which resembles the human skeleton and allows for efficient inference based on tree structures.27 Jhuang et al.28 used various types of descriptors, including joint positions, translation information, and the direction of the translation vector, all of which are derived from joint annotations, to represent human postural characteristics, and employed the pose estimation algorithm of Ref. 25. Pishchulin et al.29 revealed the potential complementarity between holistic methods and pose-based methods by analyzing two kinds of fusion, namely feature- and classifier-level fusion. Meanwhile, Yao et al.30 proposed a method that requires training videos captured from multiple angles and uses pose information to optimize the manifold of each action category, then conducts the two tasks iteratively. Nie et al.24 presented a spatial–temporal and-or graph (AOG) model, which adopts a latent structural SVM for learning, to describe actions at three scales, where coarse-level features are regarded as a priori knowledge for pose estimation, and the two tasks benefit from each other in experiments.
Action characteristics in videos ordinarily have many attributes that describe the categories from different aspects, such as appearance, motion trajectory, motion boundary, and pose information. Reasonable fusion algorithms31–33 can use the extracted features efficiently and adequately and thus boost the performance of the constructed system. There are generally three typical combination methods in the field of action recognition:8 descriptor-,2,34 kernel-,3,35 and score-level fusion.36,37 Wang et al.34 integrated multiple descriptors into a new descriptor for the subsequent stages of the BoVW framework using a simple feature weighting strategy. Jain et al.35 presented an innovative motion descriptor named divergence–curl–shear (DCS), where the kernel matrices of the individual local descriptors are combined linearly by kernel averaging and then fed into the linear SVM. For score-level fusion, Myers et al.36 presented a method that uses cross validation on the training set to obtain the weight of each descriptor; these weights are then used to combine the scores from multiple classifiers into the final recognition results.
The core purpose of feature fusion is to enhance recognition accuracy by fully exploiting the complementarity among multiple features. In general, each fusion method has its own pros and cons under different circumstances when there are only a few types of action features.8 However, as research deepens, the forms of action description are becoming increasingly numerous. To establish an extensible and universal fusion framework, we focus on score-level fusion, which does not suffer from the curse of dimensionality that is prevalent in descriptor- and kernel-level fusion methods. Although typical dimensionality reduction algorithms exist, including principal component analysis (PCA),38 locally linear embedding,33 and linear discriminant analysis,39 feature reduction in many cases loses some motion information, which leads to a decrease in recognition accuracy.
However, many score-level fusion methods obtain the weights of different features through a learning step, in which the randomness and incompleteness of training data are usually neglected. The Dempster–Shafer (DS) evidence theory can narrow the scope of hypotheses continually through the accumulation of evidence and handle the uncertainty of information. Decision results that conform to objective conditions can then be inferred without prior probabilities. In view of these advantages, in this paper the evidence theory, which has not received enough attention in previous research on action recognition, is employed in our weighted score-level fusion method. Concretely, first, the local spatiotemporal features and the optimized pose features are extracted from the validation samples, which are selected from the training set, to obtain credible evidence. Second, the evidence combination strategy is used to calculate weight vectors of all feature types for each action class, which are then optimized by the proposed rule of survival of the fittest. Subsequently, the classification results are deduced by the weighted summation strategy. The main contributions of this paper are as follows:
• According to the characteristics of local features and pose features, the corresponding encoding methods and SVM classifiers are employed. Moreover, to describe the human joints in videos more reasonably, the translation matrix and the angle matrix are constructed to obtain the optimized pose features. The effectiveness of the extendible and universal score-level feature fusion method for action recognition is then demonstrated on the Penn Action40 dataset and a subset of the joint-annotated human motion database (sub-JHMDB).28
• Considering the randomness and incompleteness of training data, the weighted score-level feature fusion method based on DS evidence theory (WSF-DS) is proposed, in which a validation set is constructed from the dataset, and the weight vectors of multiple features for each action are obtained by evidence combination.
• The rule of survival of the fittest is proposed to eliminate the components of the weight vectors that are inefficient or adverse for the recognition of a particular action category, and the weighted summation strategy is proposed to calculate the classification results.
The rest of this paper is organized as follows: Sec. 2 presents the overview of the proposed action recognition framework. Section 3 elaborates the local spatiotemporal features extraction, optimized pose features extraction, the pipeline of BoVW frameworks for different features, and the proposed weighted score-level feature fusion method. Section 4 evaluates the performance of the proposed action recognition framework on the Penn Action and sub-JHMDB datasets and provides the comparisons with other methods. Finally, we conclude this paper in Sec. 5.
The framework of the proposed action recognition method, which consists of two stages, is shown in Fig. 1. In the first stage, to obtain the optimized weight vectors of every feature for each action, the original videos in the training set are divided into two parts: the 3/4 training videos and the validation videos (i.e., the remaining 1/4 of the training videos). The validation videos are chosen so that the scenes and human bodies in their clips are more distinctive than those in the other training videos, to ensure the validity of the evidence. Then, different BoVW frameworks are adopted for modeling human behaviors based on the local features and pose features.
In the local spatiotemporal features thread, we extract features that include trajectory shape, HOG, HOF, and motion boundary histogram (MBH)12 based on the IDT method from each video clip in the 3/4 training set. Because the primary low-level local features are usually high dimensional and strongly correlated, PCA with whitening38 is used to reduce the dimensionality and weaken the correlation of the features. Then, we randomly sample a subset of features from the 3/4 training set to estimate the Gaussian mixture model (GMM), which is learned through maximum likelihood estimation and regarded as a codebook. Unlike the vector quantization (VQ) encoding8,16 used in the work on DTs,16 we employ FVs3,22 to encode features and obtain video descriptors. In the pose features thread, the full-body joints in every frame of the video clips are estimated via the tree-based model of part mixtures.25 The pose descriptors are extracted from two hierarchies to represent different attributes, and the translation matrix and the angle matrix are constructed to optimize the descriptors of the time hierarchy. All the data of each descriptor type in the 3/4 training set are used to generate a codebook by $k$-means clustering41 independently; after VQ16,17 encoding, the resulting histograms are concatenated as the pose feature of each video. The library for support vector machines (LIBSVM)42 is used to train multiclass classifiers for the two threads.
The probability values of validation videos that belong to each action category are obtained by feeding them to the SVM classifiers. The average probability values of the correct classification for positive and negative samples are utilized as two sources of evidence for DS evidence theory, respectively. Then, the weight vectors are calculated by evidence combination strategy and optimized via the proposed rule of survival of the fittest.
In the second stage, the same BoVW frameworks described above for different features are utilized. Unlike the first stage, the input of the BoVW frameworks is replaced by all the training videos. During the testing stage, when the score matrices of each feature are obtained, the final score matrix of testing videos is calculated by the weighted summation strategy, in which the optimized weight vectors are regarded as a priori knowledge. Subsequently, the recognition results (labels) are efficiently inferred by calculating the row maximum of the score matrix.
Action Recognition Framework Based on Weighted Score-Level Feature Fusion
In this section, details of the local spatiotemporal features extraction, optimized pose features extraction, the pipeline of BoVW frameworks for different features, and the proposed WSF-DS are presented.
Local Spatiotemporal Features Extraction
This section presents the dense sampling strategy15 and multiple-features extraction principle. When the trajectory shape, HOG, HOF, and MBH are extracted, a feature preprocessing method is employed to make features have the same variance, which is beneficial for training GMM.
The feature extraction method based on IDT,5 which conforms to the visual attention mechanism of the human eye, is insensitive to the background and motion speed and describes the appearance information of motion well. Multiple features, including HOG,11 HOF, and MBH,12 are extracted around densely sampled points. These points are tracked on each image scale individually. The optical flow field formed by frame $I_t$ and frame $I_{t+1}$ on a certain scale is defined as $\omega_t = (u_t, v_t)$, where $u_t$ and $v_t$ represent the horizontal and vertical components of optical flow, respectively. When a point $P_t = (x_t, y_t)$ in frame $I_t$ is given, its position $P_{t+1}$ in frame $I_{t+1}$ can be obtained via median filtering in the dense optical flow field:
$$P_{t+1} = (x_{t+1}, y_{t+1}) = (x_t, y_t) + (M * \omega_t)\big|_{(\bar{x}_t, \bar{y}_t)},$$
where $M$ is the median filtering kernel and $(\bar{x}_t, \bar{y}_t)$ is the rounded position of $(x_t, y_t)$.
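The tracking step described above can be illustrated with a minimal sketch. It assumes that the flow components are given as plain 2-D lists and that the median is taken over a 3 × 3 window around the rounded point; the function names and the `radius` parameter are hypothetical and are not taken from the IDT implementation.

```python
from statistics import median


def median_flow_at(flow, x, y, radius=1):
    """Median of one flow component in a window centered at (x, y).

    `flow` is indexed as flow[row][col]; the window is clipped at the border.
    The 3 x 3 default window is an assumption of this sketch.
    """
    rows, cols = len(flow), len(flow[0])
    window = [
        flow[r][c]
        for r in range(max(0, y - radius), min(rows, y + radius + 1))
        for c in range(max(0, x - radius), min(cols, x + radius + 1))
    ]
    return median(window)


def track_point(point, u, v, radius=1):
    """Move a point by the median-filtered flow sampled at its rounded position."""
    x, y = point
    rows, cols = len(u), len(u[0])
    # Clamp the rounded position so that the window is never empty.
    xr = min(max(int(round(x)), 0), cols - 1)
    yr = min(max(int(round(y)), 0), rows - 1)
    return (x + median_flow_at(u, xr, yr, radius),
            y + median_flow_at(v, xr, yr, radius))


# Toy flow field: uniform motion of one pixel to the right with one outlier.
u = [[1.0] * 5 for _ in range(5)]
u[2][2] = 10.0
v = [[0.0] * 5 for _ in range(5)]
print(track_point((2.0, 2.0), u, v))
```

In this toy example the outlier at the point itself is suppressed by the median, which is the reason for filtering the flow before tracking.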
For the MBH feature, it is defined as the gradient values for horizontal and vertical components of optical flow field, and therefore, two histograms (i.e., MBHx and MBHy) can be calculated.16 To remove the influence of camera motion on recognition accuracy and processing speed, the descriptor for speeded up robust features (SURF)13 is used to implement frame matching in view of its strong robustness to motion blur, and then the motion vectors are reserved. The IDT method also extracts motion vectors from dense optical flow by employing dense matching strategy among frames. Finally, the DTs are corrected through the global motion vector, which is estimated from the homography matrix calculated by the random sample consensus43 algorithm.
The low-level local features are usually high-dimensional, and there is a strong correlation among different dimensions. To enhance the clustering accuracy of GMM and $k$-means,41 PCA is used to eliminate the correlation of feature vectors and reduce the dimensionality of features. Peng et al.8 showed that combining whitening with PCA38 can effectively boost the recognition accuracy in the BoVW framework. After the above steps, each dimension of the features will have the same variance. The mathematical expression of PCA-whiten is as follows:
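A common textbook form of PCA whitening is sketched here; the notation ($\bar{x}$, $C$, $U_d$, $\Lambda_d$, $\tilde{x}$) is assumed for illustration and may differ from the notation used by the authors. Given the covariance matrix $C$ of the training features with eigendecomposition $C = U \Lambda U^{\top}$, the $d$ leading eigenvectors $U_d$ and eigenvalues $\Lambda_d$ are kept, and each feature $x$ is mapped to $\tilde{x}$:

```latex
\[
\tilde{x} = \Lambda_d^{-1/2}\, U_d^{\top} (x - \bar{x}),
\qquad
\Lambda_d = \mathrm{diag}(\lambda_1, \ldots, \lambda_d),
\]
\[
\mathrm{Cov}(\tilde{x}) = \Lambda_d^{-1/2}\, U_d^{\top} C\, U_d\, \Lambda_d^{-1/2} = I_d .
\]
```

The second line shows why every retained dimension ends up decorrelated with unit variance, which is the property the text relies on for GMM training.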
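For high-dimensional local features, the voting-based encoding method (e.g., the VQ encoding method) only expresses the subordinate relationship between feature vectors and visual words (i.e., clustering centers), which produces quantization errors. In comparison, the FV encodes both first- and second-order statistics between the feature vectors and a GMM. So, we randomly sample a subset of features from the training data to estimate the GMM with $K$ components, which will be regarded as a codebook and employed to calculate the FV. The parameter set of the GMM is $\theta = \{\pi_k, \mu_k, \Sigma_k;\ k = 1, \ldots, K\}$, where $\pi_k$ is the mixture weight of the $k$'th Gaussian, $\mu_k$ is the mean vector, and $\Sigma_k$ is the covariance matrix. The probability distribution model of the GMM is defined as follows:44
$$p(x; \theta) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x; \mu_k, \Sigma_k), \qquad \mathcal{N}(x; \mu_k, \Sigma_k) = \frac{\exp\left[-\frac{1}{2}(x - \mu_k)^{\top} \Sigma_k^{-1} (x - \mu_k)\right]}{\sqrt{(2\pi)^{D} |\Sigma_k|}},$$
where $D$ is the dimension of the feature vector $x$.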
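To make the first- and second-order statistics concrete, the widely used improved FV formulation can be written as below. This is a sketch under stated assumptions: diagonal covariances $\Sigma_k = \mathrm{diag}(\sigma_k^2)$, $N$ local features $x_1, \ldots, x_N$ per video, and element-wise division and squaring; the symbols $\gamma_n(k)$, $\mathcal{G}_{\mu,k}$, and $\mathcal{G}_{\sigma,k}$ are notation introduced here.

```latex
\[
\gamma_n(k) = \frac{\pi_k\, \mathcal{N}(x_n; \mu_k, \Sigma_k)}
                   {\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n; \mu_j, \Sigma_j)},
\]
\[
\mathcal{G}_{\mu,k} = \frac{1}{N\sqrt{\pi_k}} \sum_{n=1}^{N} \gamma_n(k)\, \frac{x_n - \mu_k}{\sigma_k},
\qquad
\mathcal{G}_{\sigma,k} = \frac{1}{N\sqrt{2\pi_k}} \sum_{n=1}^{N} \gamma_n(k)
\left[ \frac{(x_n - \mu_k)^2}{\sigma_k^2} - 1 \right].
\]
```

Concatenating $\mathcal{G}_{\mu,k}$ and $\mathcal{G}_{\sigma,k}$ over all $K$ components gives a video descriptor of dimension $2DK$ for $D$-dimensional features.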
Optimized Pose Features Extraction
IDT features can extract appearance and motion information from videos and achieve a global representation for the action. However, high-level pose features focus on describing the distribution and coupling relationship of human joints. The two types of features are strongly complementary.29 In this section, the procedure of the optimized pose features extraction will be presented in detail.
The popular methods based on the pictorial structure framework for human pose estimation24–26 from 2-D video frames imitate the human skeleton and, thanks to their tree structures, enable systems to infer the positions of human joints efficiently.27 We follow the framework of Ref. 25 to achieve pose estimation because it is clear and representative in principle. It is worth noting that our proposed score-level feature fusion method for action recognition is a universal framework, and the pose estimation task is not restricted to one method.
For each image $I$, the pixel location of human joint $i$ is represented as $p_i = (x_i, y_i)$. $t_i$ is the type of joint $i$, which is defined by the position relation between $i$ and its parent. The number of types is equivalent to the number of $k$-means cluster centers. A $K$-node tree-structured graph $G = (V, E)$ is constructed, where $V$ is the joint points set and $E$ is the edges set used to represent the parent–child relationships among all joints. Then, a generalized support function can be defined as
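For reference, the score function of the flexible mixtures-of-parts model in Ref. 25 is usually written in the form below. The symbols $w$, $b$, $\phi$, and $\psi$ are notation assumed here for illustration and may not match the authors' equation exactly: $\phi(I, p_i)$ is an appearance feature extracted at $p_i$, $\psi(p_i - p_j)$ is a quadratic deformation feature, and $b$ collects the type compatibility terms.

```latex
\[
S(I, p, t) = S(t)
  + \sum_{i \in V} w_i^{t_i} \cdot \phi(I, p_i)
  + \sum_{(i,j) \in E} w_{ij}^{t_i, t_j} \cdot \psi(p_i - p_j),
\]
\[
S(t) = \sum_{i \in V} b_i^{t_i} + \sum_{(i,j) \in E} b_{ij}^{t_i, t_j},
\qquad
\psi(p_i - p_j) = \left[\, dx,\ dx^2,\ dy,\ dy^2 \,\right]^{\top},
\quad dx = x_i - x_j,\ dy = y_i - y_j .
\]
```

Because $G$ is a tree, the maximization of $S$ over $p$ and $t$ decomposes along the edges, which is what makes inference by dynamic programming possible.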
Then, the latent SVM framework is used to train the detection model by the coordinate-descent solver,27 where the types of joints are treated as latent variables. In the detection stage, due to the relational graph being a tree structure, human joints in all video frames can be estimated by dynamic programming and nonmaximum suppression. The details can be found in Ref. 25.
Human joints description
When the human pose in the video frame is estimated, the joint data are mined to obtain various descriptors, which are then fed into the BoVW framework. Because the action in a video clip is decomposed into a series of poses that might change over time, the descriptors for pose need to be extracted from two hierarchies (i.e., space and time) and concatenated into the final pose feature. We follow Ref. 28 to denote the full-body pose through 15 joints. For the space hierarchy, joint coordinates are split into $x$ and $y$, which has proved to be more effective,28 so 30 descriptor types can be obtained from one frame. For the time hierarchy, the translation of joint coordinates on the time axis (i.e., $dx$ and $dy$) and the angle of the space-time displacement vector [i.e., $\arctan(dy/dx)$] are calculated according to a certain frame step.
Note that the configuration of descriptors is different from Ref. 28 here. We find that the translation of joint position in the starting and ending frames of a video clip is usually not salient, and the translation in the middle is more representative for movement tendency. Accordingly, to improve the distinguishing ability of the pose feature, the weakening factor is set as , and then the translation matrix of joint coordinates for a video with frames can be written as follows:
Finally, these 75 descriptor types (30 for joints coordinates, 30 for translations, and 15 for angle of space-time displacement vector) are separately fed into the subsequent clustering algorithm to generate codebooks and then encoded by VQ.17
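The bookkeeping behind the 75 descriptor types can be sketched as follows. This is a simplified illustration under stated assumptions: the weakening factor and any coordinate normalization are omitted, `atan2` is used instead of a plain arctangent to avoid division by zero, and the function name and descriptor keys are hypothetical.

```python
import math


def pose_descriptors(joints, step):
    """Collect per-joint scalar descriptors over one clip.

    joints[t][j] = (x, y) is the position of joint j in frame t.
    Returns a dict that maps a descriptor-type name to its list of values.
    """
    n_frames, n_joints = len(joints), len(joints[0])
    desc = {}
    for j in range(n_joints):
        xs = [joints[t][j][0] for t in range(n_frames)]
        ys = [joints[t][j][1] for t in range(n_frames)]
        # Translation along the time axis for the given frame step.
        dxs = [xs[t + step] - xs[t] for t in range(n_frames - step)]
        dys = [ys[t + step] - ys[t] for t in range(n_frames - step)]
        desc[f"x_{j}"] = xs
        desc[f"y_{j}"] = ys
        desc[f"dx_{j}"] = dxs
        desc[f"dy_{j}"] = dys
        # Angle of the displacement vector (atan2 is an assumption of this sketch).
        desc[f"angle_{j}"] = [math.atan2(dy, dx) for dx, dy in zip(dxs, dys)]
    return desc


# Toy clip: 10 frames, 15 joints moving on straight lines.
clip = [[(float(t + j), float(2 * t)) for j in range(15)] for t in range(10)]
descriptors = pose_descriptors(clip, step=3)
print(len(descriptors))  # 15 joints x 5 descriptor kinds
```

Each of the resulting lists holds one-dimensional values, which is why every descriptor type can be clustered and encoded independently.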
Bag-of-Visual-Words Frameworks for Different Features
In this work, the BoVW pipeline is employed to build a model for each video clip from the extracted low-level local features and high-level pose features. However, the algorithms used in each subunit are different because the dimensionalities of the two categories of features differ markedly. In the local feature thread, the features obtained from each training video are processed by PCA-whiten and then encoded by the FV, where the parameters of the GMM are learned from a subset of the features. In the pose feature thread, all training data are used to construct a small codebook for each descriptor type by the $k$-means algorithm,41 and no preprocessing is applied, because every descriptor for pose is a one-dimensional vector.
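The VQ encoding used in the pose thread can be sketched as a nearest-center histogram for each descriptor type. The L1 normalization of each histogram and the function names are assumptions of this sketch; the codebook centers would come from $k$-means in practice and are simply given here.

```python
def vq_encode(values, centers):
    """Histogram of nearest-center assignments for 1-D descriptor values."""
    hist = [0.0] * len(centers)
    for v in values:
        nearest = min(range(len(centers)), key=lambda k: abs(v - centers[k]))
        hist[nearest] += 1.0
    total = sum(hist)
    # L1 normalization is an assumption; an empty input keeps the zero histogram.
    return [h / total for h in hist] if total else hist


def encode_video(descriptors, codebooks):
    """Concatenate the per-type histograms in a fixed (sorted) order."""
    feature = []
    for name in sorted(codebooks):
        feature.extend(vq_encode(descriptors.get(name, []), codebooks[name]))
    return feature


codebooks = {"dx_0": [0.0, 1.0, 4.0], "x_0": [0.0, 10.0]}
video = {"dx_0": [0.1, 0.2, 0.9, 5.0], "x_0": [9.0, 8.0]}
print(encode_video(video, codebooks))
```

With 75 descriptor types and 20 centers per codebook, the concatenated pose feature of a video would have 1500 dimensions under this scheme.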
In the classification stage, the features of both threads are classified by SVMs using the LIBSVM implementation.42 Because the encoding methods are different, the linear SVM is chosen as the classifier for local features, as it has been proven to be more efficient in combination with the FV,8 and the SVM with a radial basis function kernel (RBF-SVM) is selected for pose features, where the optimal parameters are obtained by fivefold cross validation.
Weighted Score-Level Feature Fusion Method Based on Dempster–Shafer Evidence Theory
Multiple features represent actions with different emphases. For example, local spatiotemporal features describe the structure and motion around a sampling point, and pose features focus on expressing the joint positions and tree structure of a moving human body. There is a strong complementarity among these features.29 We find that the recognition accuracy for an action class differs when different features are adopted, which means that the features differ in strength for a specified action. Therefore, we propose a weighted score-level feature fusion framework, where the weight vectors of all feature types for each action are obtained by DS evidence theory45 through the constructed validation set.
The concept of lower and upper bounds of a probability distribution, proposed by Dempster46 to solve the problem of multivalued mapping, is the original work of evidence theory. Shafer47 proposed a mathematical technique to deal with uncertainty reasoning via a series of rules of evidence combination and introduced the belief function to complete the evidence theory.
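We set $\Theta$ as the frame of discernment. The basic belief assignment function (i.e., mass function) $m$ is a mapping from the power set $2^{\Theta}$ to [0, 1]. For an arbitrary subset $A$ of $\Theta$, $m$ satisfies the following equation set:
$$m(\emptyset) = 0, \qquad \sum_{A \subseteq \Theta} m(A) = 1.$$
Let $m_1, m_2, \ldots, m_n$ be a set of mass functions on the same frame of discernment $\Theta$. When the focal elements are defined as $A_1, A_2, \ldots, A_n$, the combination rules of DS evidence theory can be utilized to implement information fusion as follows:
$$m(A) = (m_1 \oplus m_2 \oplus \cdots \oplus m_n)(A) = \frac{1}{1 - \kappa} \sum_{A_1 \cap \cdots \cap A_n = A}\ \prod_{i=1}^{n} m_i(A_i), \quad A \neq \emptyset, \qquad \kappa = \sum_{A_1 \cap \cdots \cap A_n = \emptyset}\ \prod_{i=1}^{n} m_i(A_i),$$
where $\kappa$ measures the conflict among the evidences. In the case of two evidence combination, the rules can be formulated as
$$m(A) = \frac{\sum_{B \cap C = A} m_1(B)\, m_2(C)}{1 - \sum_{B \cap C = \emptyset} m_1(B)\, m_2(C)}, \quad A \neq \emptyset.$$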
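The two-evidence case of Dempster's rule of combination can be illustrated with a short sketch. Mass functions are represented here as dictionaries from `frozenset` focal elements to masses; this representation and the function name `combine` are choices made for the example only.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule for two mass functions over the same frame.

    Each mass function maps a frozenset (focal element) to its mass.
    """
    fused = {}
    conflict = 0.0
    for (b, mass_b), (c, mass_c) in product(m1.items(), m2.items()):
        inter = b & c
        if inter:
            fused[inter] = fused.get(inter, 0.0) + mass_b * mass_c
        else:
            # Mass assigned to the empty set measures the conflict.
            conflict += mass_b * mass_c
    if conflict >= 1.0:
        raise ValueError("total conflict: the evidences cannot be combined")
    return {a: mass / (1.0 - conflict) for a, mass in fused.items()}


A, B = frozenset({"a"}), frozenset({"b"})
AB = A | B
m1 = {A: 0.6, B: 0.3, AB: 0.1}
m2 = {A: 0.5, B: 0.2, AB: 0.3}
for element, mass in combine(m1, m2).items():
    print(sorted(element), round(mass, 4))
```

In this toy example the conflicting mass is 0.27, so the remaining products are rescaled by 1/0.73 and the fused masses again sum to one.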
Weighted score-level feature fusion method
To obtain convincing evidence that can reflect the difference in effectiveness between different features in the recognition process of a particular action class, we create a validation set for original dataset (i.e., Penn Action dataset and sub-JHMDB dataset) based on its training samples. More specifically, when the training set is divided into several equal parts, one part is treated as the validation set, in which the scenes and human body in video clips are more distinctive than the other parts to ensure the validity of evidence.
In stage 1, the samples of training sets except validation videos are used to train multiclass classifier belonging to each feature through the BoVW framework presented in Sec. 3.3. For multiclassification, we adopt a one-versus-all cross-validation5 training scheme and obtain the prediction with probability scores of each sample in the validation set. Then, the probability scores matrix is defined as
In stage 2, the 3-D scores matrix is split into 2-D score matrices defined as , in which its elements denote the probability score of sample predicted by classifier . Assuming that the number of samples belonging to class is and the number of samples not belonging to class is , the effectiveness of feature for a particular action class can be reflected by and , which are as follows:
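As a minimal sketch, the two quantities can be read as the mean probability assigned to the class over its positive samples and the mean probability of rejecting the class over its negative samples. This reading is an assumption made for illustration, and the function name and data layout are hypothetical.

```python
def class_evidence(scores, labels, cls):
    """Return the two evidence values of one feature for class `cls`.

    scores[n][c] is the probability that validation sample n belongs to
    class c according to this feature's classifier; labels[n] is the true
    class. The result is (mean P(cls) over positives,
    mean 1 - P(cls) over negatives), which is an assumed definition.
    """
    pos = [scores[n][cls] for n in range(len(labels)) if labels[n] == cls]
    neg = [1.0 - scores[n][cls] for n in range(len(labels)) if labels[n] != cls]

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(pos), mean(neg)


scores = [[0.8, 0.2], [0.6, 0.4], [0.3, 0.7], [0.1, 0.9]]
labels = [0, 0, 1, 1]
print(class_evidence(scores, labels, cls=0))
```

Repeating this for every feature and every class yields two evidence values per feature-class pair, which can then be turned into mass functions and combined.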
In stage 3, we define a set of focal elements as for the two evidences, where means that the positive role of feature in the recognition process of action class . The mass functions and for the evidence can be assigned as follows:
The weight vector of sensitivity for different features belonging to action is calculated by the strategy of evidence combination [i.e., Eq. (11)] and expressed as follows:
In stage 4, inspired by the idea of survival of the fittest, is optimized during the experiment. Specifically speaking, the features with low weight are not only inefficient for the recognition of a specific action class but also adverse for the final classification score, which should be penalized by the penalty thresholds and . For the six action feature types (i.e., trajectory shape, HOG, HOF, MBHx, MBHy, and pose features) in this paper, the rule of elimination is formulated as follows:
• Given the weight vector , the values of its components, which are greater than , are defined as . When , the weights less than will be reset to 0.
• Let be the minimum value of components in . When and , the smallest value in will be reset to 0.
The corrected weight vectors of every feature type for each action class are obtained and then utilized for the subsequent classification of testing samples.
In the final stage, the scores matrix for all samples in the testing set is calculated by summing the weighted score matrices of each feature, which can be written as
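The weighted summation and the final labeling step can be sketched as follows. The sketch assumes one score matrix per feature and one weight per (class, feature) pair; the function names and the nested-list layout are hypothetical.

```python
def fuse_scores(score_mats, weights):
    """Weighted sum of per-feature score matrices.

    score_mats[f][n][c]: score of test sample n for class c from feature f.
    weights[c][f]: weight of feature f for class c.
    Returns fused[n][c].
    """
    n_feat = len(score_mats)
    n_samples = len(score_mats[0])
    n_classes = len(score_mats[0][0])
    return [
        [sum(weights[c][f] * score_mats[f][n][c] for f in range(n_feat))
         for c in range(n_classes)]
        for n in range(n_samples)
    ]


def predict(fused):
    """The label of each sample is the index of its row maximum."""
    return [max(range(len(row)), key=row.__getitem__) for row in fused]


# Two features, two test samples, two classes.
score_mats = [
    [[0.9, 0.1], [0.4, 0.6]],  # feature 0
    [[0.2, 0.8], [0.7, 0.3]],  # feature 1
]
weights = [
    [1.0, 0.0],  # class 0 relies on feature 0 only (feature 1 eliminated)
    [0.5, 0.5],  # class 1 weighs both features equally
]
fused = fuse_scores(score_mats, weights)
print(predict(fused))
```

A weight of zero, as produced by the elimination rule, simply removes that feature's score from the sum for the class in question.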
Note that the proposed pipeline of action recognition is a universal and extendible framework, which has the following characteristics:
• Our weighted score-level feature fusion method can be embedded in different versions of the BoVW pipeline combined with SVM classifier, which only needs to establish a validation set for the corresponding dataset to obtain the weight vectors in the process of method transplantation.
• When an innovative feature needs to be applied in our recognition framework, its effectiveness for different action categories can be analyzed by the WSF-DS method, and the weight vectors will be updated simultaneously. Furthermore, the extendibility of the framework is also reflected in that the local spatiotemporal features and pose features can be replaced by the parallel algorithms for feature extraction to improve overall performance. For instance, the state-of-the-art works for the pose estimation proposed in Refs. 24, 26, and 48 can be employed to replace the algorithm in Ref. 25.
• The strongly targeted weight vectors of each action are calculated by evidence combination and then optimized via the proposed rule of elimination, which embodies the idea of survival of the fittest. They not only enhance the contribution of discriminative features but also restrain interference from relatively ineffective features. Features with low weights for a specific action are eliminated in the experiments, which has been found to improve the accuracy of action recognition.
The performance of our method is evaluated on two publicly available datasets: Penn Action dataset40 and sub-JHMDB.28 Both datasets are proposed for the purposes of action recognition and pose estimation for the full body, and the annotations of human joints and activity labels for each video clip are provided. The experimental results are presented, including the difference in the effectiveness of different features for each action, the evaluation of our proposed weighted score-level feature fusion, a comparison between WSF-DS and multiple-feature fusion baselines, the performance analysis for the proposed rule of survival of the fittest, and a comparison with state-of-the-art action recognition methods.
The Penn Action dataset40 consists of 15 different actions, 13 annotated human joints for each frame, and 2326 video clips collected from the internet; its challenges include large scale and appearance variations, low-resolution images, and occluded human bodies. The list of action categories is as follows: baseball pitch, baseball swing, bench press, bowling, clean and jerk, golf swing, jump rope, jumping jacks, pull up, push up, sit up, squats, strumming guitar, tennis forehand, and tennis serve. We follow Ref. 24 and discard the class “strumming guitar” as well as several video clips in which most of the human body is invisible, which makes full-body pose estimation difficult.
The sub-JHMDB dataset28 is a subset of JHMDB that contains 316 video clips with 15 annotated human joints per frame. The dataset comprises 12 action categories: catching, climbing stairs, golfing, jumping, kicking ball, picking, pulling up, pushing, running, shooting ball, swinging baseball, and walking. The threefold cross-validation configuration presented in Ref. 28 is adopted for testing on the sub-JHMDB dataset. Each split contains on average 229 training samples and 87 testing samples, and the experimental results reported in this paper are the average accuracy over the three splits. For the Penn Action dataset, we follow the train/test split released in Ref. 40 (which has been pruned and includes 1206 samples for training and 1017 samples for testing) and report the average accuracy. Sample frames of the Penn Action and sub-JHMDB datasets are shown in Fig. 2.
The numbers of validation samples in Penn Action and sub-JHMDB are about 302 and 57, respectively, i.e., about 1/4 of the full training sets or, equivalently, 1/3 of the 3/4 training samples.
The proposed action recognition framework is implemented in 64-bit MATLAB® R2015a and run on a 3.50-GHz Intel Core i7-5930K processor with 64 GB of RAM.
For the local spatiotemporal features, we use the same settings as in Ref. 5, where the size of the spatiotemporal grid is $2 \times 2 \times 3$ and the gradient direction is quantized into 8 directions, so that the dimension of HOG is 96. Because HOF has an additional bin for the stationary state (9 bins in total), its dimension is 108. The gradients of the horizontal and vertical components of optical flow are defined as MBHx and MBHy, respectively, and the dimension of both features is 96. In addition, the dimension of the trajectory shape is 30 when the trajectory length is set to 15 frames. In the remaining experiments, PCA-whiten38 is employed to achieve feature reduction and eliminate correlation: the dimensions of HOG, MBHx, and MBHy are reduced to 48, that of the trajectory shape to 15, and that of HOF to 54. In the stage of codebook generation, 256,000 features randomly sampled from each feature set are used to train a separate GMM with 256 Gaussian components. For the FV encoding, the VLFeat Toolbox49 is employed, and power and L2 normalization50 are applied to the FV of each feature.
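The dimensions quoted above can be checked with a few lines of arithmetic. The sketch assumes a 2 × 2 × 3 spatiotemporal grid (the usual IDT setting), an FV of size 2DK for D-dimensional features and K Gaussians, and a power-normalization exponent of 0.5; the function names are hypothetical.

```python
import math


def descriptor_dim(n_x, n_y, n_t, n_bins):
    """Histogram descriptor size: one n_bins histogram per grid cell."""
    return n_x * n_y * n_t * n_bins


def fisher_vector_dim(d, k):
    """FV size with first- and second-order statistics for k Gaussians."""
    return 2 * d * k


def power_l2_normalize(vec, alpha=0.5):
    """Signed power normalization followed by L2 normalization.

    alpha = 0.5 (signed square root) is an assumption of this sketch.
    """
    powered = [math.copysign(abs(v) ** alpha, v) for v in vec]
    norm = math.sqrt(sum(p * p for p in powered))
    return [p / norm for p in powered] if norm > 0 else powered


print(descriptor_dim(2, 2, 3, 8))   # HOG, MBHx, MBHy
print(descriptor_dim(2, 2, 3, 9))   # HOF with the extra stationary bin
print(2 * 15)                       # trajectory shape: (dx, dy) for 15 frames
for name, d in [("HOG/MBHx/MBHy", 48), ("HOF", 54), ("trajectory", 15)]:
    print(name, fisher_vector_dim(d, 256))
```

Under these assumptions, reducing HOG from 96 to 48 dimensions before encoding halves the FV length, which is one practical motivation for applying PCA-whiten first.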
For the pose features, the model of 26 human joints with 6 types (which is learned by the pose estimation algorithm and has been shown to offer good efficiency and performance)25 is trained to detect the human pose in each video frame. Denser joints describe the human pose in more detail but reduce the distinguishing ability of the joints, because the translations of some points (e.g., joints on the torso) are small and contribute little to action recognition. Therefore, 15 key points generated from the 26 joints are used as the pose data, which is similar to the work of Jhuang et al.28 For the descriptors of pose, the frame step size and the weakening factor are both set to 3, which has been found to give good performance. It is worth noting that the 3225 descriptor types proposed in Ref. 28, including a set of relational features, perform better than using only normalized joint coordinates. However, their running time for an ordinary video clip with 42 frames is about 6.17 s when the spatial resolution is . In contrast, the running time of the 75 descriptor types optimized in this work is about 0.0058 s under the same conditions, and their recognition accuracy on the sub-JHMDB dataset is essentially equal to the 52.9% achieved by the 3225 descriptor types. For each descriptor type, all the training samples are used to generate an exclusive codebook by the $k$-means algorithm with 20 clustering centers.
For multiclass classification, a one-against-rest approach is adopted to select the prediction with the highest score.
Evaluation of effectiveness on different features for each action
The performance of the six features, including five local spatiotemporal features extracted by the IDT method and the pose features, is evaluated separately on the two public datasets based on the BoVW frameworks presented in Sec. 3.3. The classification accuracies of different features for each action are shown in Figs. 3 and 4. For brevity, we only provide the classification accuracy comparison on split-3 of the sub-JHMDB dataset.
The results indicate that the recognition accuracy of a given feature varies greatly across actions. For instance, HOG and pose features achieve accuracies as high as 90.33% and 95.2% for individual actions, whereas their lowest accuracies for some actions are only 58.64% and 45.1%, as shown in Fig. 3; the phenomenon is more pronounced in Fig. 4. Furthermore, a specific action category is shown to be more sensitive to several feature types, which is the basis for designing the weighted score-level feature fusion approach. From Fig. 3, the recognition accuracy of the action “squats” achieved by the HOF or MBHy feature is 89.02%, which outperforms the other four features by 13.29% on average. From Fig. 4, the best classification accuracy for the action “swinging baseball,” achieved by pose features, is 85.71%, whereas HOF, the most accurate of the other five features, reaches only 42.86%. Figures 3 and 4 also show that, owing to low resolution and large intracategory variation, the overall recognition performance of the six features on sub-JHMDB is much lower than that on Penn Action.
Based on the above results, we find that a weight vector over the features for each action is necessary for improving the overall classification accuracy in the decision-making stage. It is worth noting that the sensitivity of a particular action category to the same set of features differs greatly between datasets because of the influences of image resolution, human scale, and viewpoint, so the corresponding weight vectors must be calculated separately for each dataset.
Evaluation of weighted score-level feature fusion based on Dempster–Shafer evidence theory
The effectiveness of our proposed WSF-DS method is demonstrated by testing it on two public datasets for human action recognition. To obtain the weight vectors of all feature types for a specific action category, the samples extracted from the training set are assembled as a validation set, and then the evidence used for DS evidence theory is computed by the approach presented in Sec. 3.4. Specifically, to obtain the robust sensitivity information of an action category about every feature type, about 1/4 of training samples that are significantly different from the other 3/4 samples in motion scenes and body appearance are chosen as the validation set.
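For readers unfamiliar with DS evidence theory, the sketch below shows the generic Dempster rule of combination for two mass functions whose focal elements are all singletons. The hypotheses and mass values are hypothetical, and the way the evidence is actually constructed from the validation set is the one defined in Sec. 3.4, which this sketch does not reproduce.

```python
import numpy as np

def dempster_combine(m1, m2):
    """Dempster's rule for two mass functions defined on the same singleton hypotheses.

    With singleton focal elements only, two pieces of evidence agree exactly when
    they support the same hypothesis, so the conflict K is 1 minus the sum of the
    elementwise products, and the combined mass is the product renormalized by 1 - K.
    """
    m1, m2 = np.asarray(m1, dtype=float), np.asarray(m2, dtype=float)
    agreement = m1 * m2
    conflict = 1.0 - agreement.sum()          # K in the usual notation
    if np.isclose(conflict, 1.0):
        raise ValueError("total conflict: the rule is undefined")
    return agreement / (1.0 - conflict)

# Two hypothetical bodies of evidence over three mutually exclusive hypotheses.
m_a = [0.6, 0.3, 0.1]
m_b = [0.5, 0.4, 0.1]
combined = dempster_combine(m_a, m_b)
```

Here the products are 0.30, 0.12, and 0.01, so the conflict is K = 0.57 and the combined masses are these products divided by 0.43. The hypothesis supported by both sources is reinforced.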
The influence of different parameters and on classification results will be elaborated in Sec. 4.3.3. Here, we report the best performance of our recognition framework. Note that although an average accuracy is reported for both datasets, we follow Ref. 28 to calculate the per-video accuracy for the sub-JHMDB, which does differ much from the per-class accuracy adopted in Penn Action.40 The confusion matrices computed by the proposed WSF-DS method for Penn Action and for the three splits of sub-JHMDB are shown in Fig. 5. Table 1 presents the comparison of the average accuracies achieved on the two datasets by the different feature types and by our WSF-DS method, where the five local spatiotemporal features are extracted by IDT.
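The two accuracy measures can be contrasted with a small sketch, assuming the usual definitions: per-video accuracy is the fraction of correctly classified clips, and per-class accuracy is the mean of the per-class recalls. The labels below are made up for illustration.

```python
import numpy as np

def per_video_accuracy(y_true, y_pred):
    """Fraction of clips whose predicted label matches the ground truth."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def per_class_accuracy(y_true, y_pred):
    """Mean over classes of the fraction of that class's clips classified correctly."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Imbalanced toy example: four clips of class 0 and two clips of class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]
video_acc = per_video_accuracy(y_true, y_pred)   # 5 of 6 clips correct
class_acc = per_class_accuracy(y_true, y_pred)   # mean of 1.0 and 0.5
```

With imbalanced classes the two measures diverge (5/6 versus 0.75 here), which is why the measure used should be stated alongside each reported number.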
Table 1 Comparison of the performance of WSF-DS and different feature types on the two datasets.
|Split 1||Split 2||Split 3||Average||Penn Action|
From Table 1, when a single feature is employed, the classification accuracies achieved by MBHy are 60.6% and 90.1% on sub-JHMDB and Penn Action, respectively, which are close to the results achieved by HOF (i.e., 58.1% and 90.8%) but significantly better than those of the other feature types. This suggests that the optical flow field and the motion boundaries of the image are more effective than image appearance, motion trajectory, and human pose for action recognition on the two datasets. It should be noted that the estimated joint positions are not precise compared with the ground truth. We leave the improvement of pose estimation, which has been confirmed in Ref. 28 to raise the accuracy of action recognition, as significant future work. Moreover, we observe that the proposed WSF-DS method improves on the performance of each single feature type, following the order in Table 1, by 7.9%, 17.0%, 3.7%, 5.0%, 4.4%, and 23.1% on Penn Action and by 27.3%, 24.1%, 12.9%, 21.0%, 10.4%, and 18.2% on sub-JHMDB. These results also demonstrate that the proposed score-level feature fusion approach can adequately exploit the complementarity among multiple features and can be applied robustly to different datasets.
As shown in Fig. 5, our WSF-DS performs well on actions such as “golfing,” “pulling up,” “pushing,” and “shooting ball” in sub-JHMDB. However, it achieves low accuracies on several actions; for instance, “swinging baseball” is easily confused with “golfing” because the motion patterns of the two actions are similar. For Penn Action, only the accuracy for “sit up” is significantly lower than that of the other actions, because of large intracategory variation and varied shooting angles. The proposed method accurately classifies the vast majority of action categories, which demonstrates its effectiveness.
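As a quick consistency check on these figures, the sketch below adds the stated improvements to the stated single-feature accuracies of HOF and MBHy. It assumes that the improvements are absolute percentage points and that the third and fifth entries of each list belong to HOF and MBHy; the ordering of Table 1 is inferred here, not quoted.

```python
# (sub-JHMDB, Penn Action) accuracies in percent, as quoted in the text.
single = {"HOF": (58.1, 90.8), "MBHy": (60.6, 90.1)}
# Stated improvements of WSF-DS over each feature (assumed mapping, see lead-in).
gains = {"HOF": (12.9, 3.7), "MBHy": (10.4, 4.4)}

implied = {
    name: (round(single[name][0] + gains[name][0], 1),
           round(single[name][1] + gains[name][1], 1))
    for name in single
}
```

Under these assumptions both features imply the same fused accuracies, about 71.0% on sub-JHMDB and 94.5% on Penn Action, so the two lists of improvements are mutually consistent.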
Comparison with multiple-feature fusion baselines
This section demonstrates the advantage of the proposed WSF-DS method by comparing our results with two baseline fusion strategies in the field of action recognition (i.e., kernel- and score-level fusion). Descriptor-level fusion, mentioned in Sec. 1, concatenates multiple features extracted from a local cuboid of video into an integrated whole as the input of the BoVW framework. Pose, however, is a holistic feature that describes the distribution of human joints in the entire image, so local sampling is meaningless for it and descriptor-level fusion is not applicable in this work. For kernel-level fusion, each feature is fed into the BoVW framework individually to obtain different descriptions of the action video, which represent various aspects of the motion characteristics; these descriptions are then fused into a single one to perform action classification by SVM.8 Score-level fusion follows a similar process, but the fusion is performed on the scores produced by the multiclass classifiers, each of which is trained independently on a different feature. We compare our WSF-DS with two typical score-level fusion methods. In the first, presented in Ref. 8, the geometric mean is employed to combine the scores. In the second, presented in Ref. 36, a single set of fixed weights for the different features is learned by cross validation on the training set and then used to obtain the final recognition score. The experimental results on the two datasets are shown in Table 2.
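The two score-level baselines can be sketched as follows, assuming that each feature's classifier outputs positive, probability-like scores (the geometric mean is undefined for negative values). The scores and weights are illustrative and are not taken from Refs. 8 or 36.

```python
import numpy as np

def geometric_mean_fusion(scores):
    """scores: (n_features, n_classes) positive scores -> fused (n_classes,) scores."""
    return np.exp(np.mean(np.log(scores), axis=0))

def fixed_weight_fusion(scores, weights):
    """Weighted sum with one weight per feature, shared by all classes."""
    return np.asarray(weights, dtype=float) @ scores

# Three features scoring three classes (made-up values).
scores = np.array([[0.7, 0.2, 0.1],
                   [0.4, 0.5, 0.1],
                   [0.6, 0.3, 0.1]])
weights = [0.5, 0.2, 0.3]        # hypothetical weights, e.g., chosen by cross validation

geo = geometric_mean_fusion(scores)
weighted = fixed_weight_fusion(scores, weights)
```

Both baselines apply the same rule to every action class, in contrast to a scheme that assigns a separate weight vector to each class.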
Table 2 Comparison with multiple-feature fusion baselines.
|WSF-DS (no survival of the fittest)||94.1||69.2|
From Table 2, we observe that the WSF-DS method achieves higher accuracies than the other fusion methods on both the Penn Action and sub-JHMDB datasets, outperforming the best baseline results by 1.1% and 4.7%, respectively. For fusing local spatiotemporal features with pose features, score-level fusion proves to be more effective: the best accuracies of our weighted score-level fusion are almost 1.8% and 5.4% higher than those of kernel-level fusion. Furthermore, the differences of accuracy rates between two typical score-level fusion methods are , but the method in Ref. 36 requires an additional learning step.
We also compare our WSF-DS with a variant that does not use the rule of survival of the fittest. The former is 0.4% and 1.8% more accurate than the latter on the two datasets, which demonstrates the effectiveness of the proposed rule. The effect of varying and on the accuracy of action recognition on the two datasets is considered in Fig. 6, where , 0.10, and 0.15 are compared. The idea is that the penalty threshold should not be larger than the average weight of six feature types (i.e., ).
Figure 6 shows that increasing the value of will decrease performance, because some valuable features are removed in the decision-making stage. It also shows that the values of corresponding to the optimal accuracies for two datasets are both and larger values can cause the proposed survival of the fittest to fail. We report the performance of and for Penn Action and and for sub-JHMDB in this work.
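The interplay between a penalty threshold and class-specific weights can be illustrated with the sketch below. It assumes, purely for illustration, that features whose weight for a class falls below the threshold are discarded and the remaining weights are renormalized; the actual rule of survival of the fittest is defined in Sec. 3.4 and may differ. The names `prune_weights` and `weighted_score_fusion` and all numbers are hypothetical.

```python
import numpy as np

def prune_weights(weights, threshold):
    """weights: (n_classes, n_features), each row summing to 1.
    Zero the weights below the threshold and renormalize each row (assumed rule)."""
    kept = np.where(weights < threshold, 0.0, weights)
    return kept / kept.sum(axis=1, keepdims=True)

def weighted_score_fusion(scores, weights):
    """scores: (n_features, n_classes); weights: (n_classes, n_features).
    The fused score of class c is sum_f weights[c, f] * scores[f, c]."""
    return np.einsum("cf,fc->c", weights, scores)

weights = np.array([[0.6, 0.3, 0.1],     # class 0 relies mostly on feature 0
                    [0.2, 0.2, 0.6]])    # class 1 relies mostly on feature 2
scores = np.array([[0.8, 0.2],
                   [0.6, 0.4],
                   [0.1, 0.9]])

pruned = prune_weights(weights, threshold=0.15)
fused = weighted_score_fusion(scores, pruned)
```

In this sketch, a threshold no larger than the average weight guarantees that at least one feature survives for every class, because the largest weight in a row that sums to 1 is never below the average. A larger threshold could remove every feature for some class, which is one plausible reading of why large values cause the rule to fail.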
Comparison with the state-of-the-art
The recognition accuracies obtained by our WSF-DS are compared with state-of-the-art methods on Penn Action and sub-JHMDB datasets, and the results are shown in Table 3.
Table 3 Comparison of our WSF-DS with the state-of-the-art methods.
|Dense + pose28||2013||—||52.9|
|IDT-FV + pose26||2017||92.9||74.6|
For Penn Action, our WSF-DS outperforms the recent state-of-the-art methods. For the sub-JHMDB dataset, only the recent work in Ref. 26, which combines improved pose and IDT with FV encoding (IDT-FV),5 achieves a better result than our method, because it employs a more advanced pose estimation algorithm. Moreover, the accuracy achieved by a single local spatiotemporal feature or pose feature is generally lower than that of their combination. In our experiments, the proposed WSF-DS method achieves better recognition accuracy than most of the recently proposed methods based on feature fusion using dendrograms and convolutional neural network features, such as MST,52 AOG,24 and P-CNN.53
In this paper, we proposed an extensible and universal weighted score-level feature fusion method for human action recognition using DS evidence theory. Concretely, the BoVW pipeline is employed to build a model for each video clip from the extracted local spatiotemporal features and pose features. The DS evidence theory and the proposed rule of survival of the fittest are used to combine the evidence and to calculate the optimal weight vector of the feature types for each action class. The recognition accuracies of WSF-DS on the Penn Action and sub-JHMDB datasets are obtained by the weighted summation strategy, and the experimental results show that WSF-DS achieves promising performance, outperforming most other state-of-the-art methods on the two datasets.
The proposed WSF-DS can enhance the accuracy of classification by adequately exploiting the complementarity among multiple features and can perform the task of action recognition efficiently. However, the overall recognition accuracy of the multifeature fusion framework depends, to a certain extent, on the performance of the individual features. For example, a more advanced pose estimation algorithm can effectively improve the action recognition performance of pose features28 and thereby improve the effectiveness of feature fusion.26
In the future, methods for obtaining the distribution of human joints and the structure information of an incomplete body will be studied to extend the applicability of pose estimation. Furthermore, the two types of features will be optimized to mine richer information about the appearance and structure of human actions and to further improve the recognition accuracy and the efficiency of the proposed WSF-DS method.
The authors declare that there is no conflict of interests regarding the publication of this paper.
This research was financially supported by the 2017 Beijing University of Technology United Grand Scientific Research Program on Intelligent Manufacturing (No. 040000546317552) and the National Natural Science Foundation of China (Nos. 61175087 and 61703012). The detailed splits and instructions of the 3/4 training set and the validation set for Penn Action and sub-JHMDB can be obtained by contacting us (firstname.lastname@example.org).
Guoliang Zhang is currently a PhD candidate at the Faculty of Information Technology, Beijing University of Technology (BJUT), China. His research interests include action recognition, machine learning, and man–machine interaction system of robots.
Songmin Jia received her PhD from the University of Electro-Communications, Japan, in 2002. Currently, she is a professor at the Faculty of Information Technology, BJUT. Her research interests include distributed robotics, machine learning, visual computation, and image processing.
Xiuzhi Li received his PhD from Beihang University, China, in 2008. Currently, he is an associate professor at the Faculty of Information Technology, BJUT. His research interests include computer vision, three-dimensional image reconstruction, and mobile robot control and navigation.