Saliency-based foreground trajectory extraction using multiscale hybrid masks for action recognition

Abstract. Action recognition in realistic scenes is a challenging task in the field of computer vision. Although trajectory-based methods have demonstrated promising performance, background trajectories cannot be filtered out effectively, which leads to a reduction in the ratio of valid trajectories. To address this issue, we propose a saliency-based sampling strategy named foreground trajectories on multiscale hybrid masks (HM-FTs). First, the motion boundary images of each frame are calculated to derive the initial masks. According to the characteristics of action videos, image priors and the synchronous updating mechanism based on cellular automata are exploited to generate an optimized weak saliency map, which will be integrated with a strong saliency map obtained via the multiple kernels boosting algorithm. Then, multiscale hybrid masks are achieved through the collaborative optimization strategy and masks intersection. The compensation schemes are designed to extract a set of foreground trajectories that are closely related to human actions. Finally, a hybrid fusion framework for combining trajectory features and pose features is constructed to enhance the recognition performance. The experimental results on two benchmark datasets demonstrate that the proposed method is effective and improves upon most of the state-of-the-art algorithms.


Introduction
Human action recognition from videos is one of the research hotspots in the field of computer vision. As recognition algorithms have improved continuously, recognizing a small number of action categories in a controlled environment is no longer a significant challenge. However, for videos captured in realistic scenes, where problems such as camera movement, viewpoint change, and target occlusion are widespread, the stability of a recognition system still needs to be improved.
In general, the methods used to achieve action recognition are divided into two categories according to the feature types, 1 i.e., handcrafted methods and deep-learned methods. Recently, handcrafted methods [2][3][4][5][6] using global or local representations have achieved promising performance on a variety of datasets. 7,8 For the global representation methods, Bobick and Davis 9 extracted the motion energy image (MEI) and the motion history image (MHI) and then used the Hu invariant moments of MEI and MHI as templates to achieve template matching. Yilmaz and Shah 10 exploited contour information to extract the three-dimensional (3-D) spatiotemporal volume (STV), and the peak, valley, and saddle points on the surface of the STV are treated as the expression of human behaviors. Sadanand and Corso 5 generated cascaded features based on space-time pyramids, which are utilized as action representations to train a variety of templates and construct a behavior warehouse named action bank. Action recognition is then achieved by calculating the response of a testing video to the templates.
For the local representation methods, the final recognition performance is determined by the strategies of feature extraction and feature encoding. An early approach 2 extracts space-time interest points from videos, and the descriptors of the histogram of oriented gradients (HOG) 11 and the histogram of oriented flow (HOF) 12 are then computed at these points. Wang et al. 13 demonstrated that dense sampling is more efficient than all the tested interest point detectors in realistic video settings. Since the dense trajectory (DT) 3,6 method extracts trajectories by tracking densely sampled points across frames and obtains good performance in various experiments, it is frequently employed as a baseline feature for comparison with other methods. However, the original DT feature adopts an indiscriminate dense sampling strategy in all regions of each frame, which has some unavoidable drawbacks for complex action scenes. For example, when there are other moving objects in the background or the camera is in motion, background trajectories are generated extensively because the area of the background is usually much larger than that of the action subjects. These action-irrelevant trajectories do not contain any information that facilitates action recognition, thereby limiting the performance of trajectory features. To improve the DT features, Wang et al. 7,8 extracted feature point matches between frames using the speeded up robust features (SURF) and dense optical flow. A homography matrix estimated from the matches is used to remove the trajectories consistent with homography and cancel out the camera motion from the optical flow. Peng et al. 14 proposed a motion boundary (MB)-based sampling strategy, which can effectively filter out background trajectories while retaining the discriminative power of DT features. Yi et al. 15 utilized appearance saliency and motion saliency to classify the dense trajectories into two categories. The salient foreground trajectories are then obtained by subtracting the possible background trajectories based on the low-rank property of background motion. For feature encoding, current methods can be roughly classified into three categories, 16 i.e., voting-based encoding, 3 reconstruction-based encoding, 17 and super vector-based encoding. 7,16 As a super vector encoding method, the Fisher vector (FV) aggregates information using the first- and second-order statistics and performs well on many challenging datasets 7,16,18 when handcrafted features are employed. Another representative encoding method is the vector of locally aggregated descriptors (VLAD), 19 which is a simplified variant of FV and only retains the first-order statistics. Despite the high efficiency of VLAD, its recognition accuracy is slightly lower than that of FV.
Besides, the human pose feature constructed using joint information is typically designed by experts and considered another form of handcrafted feature. It is mainly generated in two steps, i.e., pose estimation and pose feature description. Yang and Ramanan 20 achieved pose estimation in static images based on a flexible mixture model, which captures contextual co-occurrence relations between human parts and extends the conventional spring model that encodes spatial relationships. Jhuang et al. 21 evaluated the pose estimation algorithm in Ref. 20 by using various types of descriptors derived from joint annotations. The result suggests that even though the estimated joint positions are not entirely accurate, the performance of the resulting pose features is not inferior to that of other handcrafted features. Nie et al. 22 introduced a spatial-temporal and-or graph (ST-AOG) model, where each action is described as a tree structure composed of poses, ST-parts, and parts, and action recognition and pose estimation benefit from each other in the same framework.
Due to the success of deep learning technology in image classification, many studies on human behavior analysis based on deep architectures have been launched. Ji et al. 23 developed a 3-D convolutional neural network (3-D CNN) model that constructs features from both spatial and temporal dimensions by performing 3-D convolutions. Karpathy et al. 24 extended CNN connectivity in the time domain and proposed an architecture that processes input at two spatial resolutions for accelerated training. Simonyan and Zisserman 25 designed a two-stream CNN structure combining spatial and temporal networks and verified that a CNN trained on multiframe dense optical flow can effectively improve recognition performance. Cheron et al. 26 presented a pose-based CNN (P-CNN) feature and demonstrated the importance of a representation extracted from poses. Similarly, an action conditioned pictorial structure based on CNN is proposed in Ref. 27. By utilizing long short-term memory (LSTM), Krishnan et al. 28 proposed a recurrent neural network variant to keep track of joints and train the network on joint information across an ordered sample of several frames from a video. Mavroudi and Tao 29 used deep appearance and motion features extracted from STVs defined along body part trajectories to learn midlevel classifiers. Wang et al. 30 proposed the trajectory-pooled deep-convolutional descriptor (TDD), which combines the advantages of handcrafted features and deep-learned features. Wang et al. 31 combined the ideas of segmentation and sparse sampling into the two-stream network and proposed the temporal segment network. Overall, deep-learned methods have improved the state-of-the-art performance on many datasets; 25,31,32 however, some handcrafted features [e.g., improved dense trajectory (iDT) 7,8 ] are still comparable in performance. 18 In fact, the optimal classification results achieved by deep learning methods are usually obtained by combining them with trajectory features. 30,32,33
In this work, we propose a trajectory feature and construct an efficient action recognition framework that combines multiple trajectory features and pose features. Although they describe different aspects of human behavior, the two types of features have potential complementarity. In Ref. 34, this property has been revealed by analyzing their combination at both the feature level and the classifier level. Nie et al. 22 implemented feature fusion based on an ST-AOG model, in which dense trajectories are extracted to generate the coarse features, and human poses are estimated to construct the fine-level feature. Peng et al. 16 proved that each fusion method has its pros and cons, and that a practical fusion strategy needs to be formulated by analyzing the correlation of descriptors at different processing levels. Iqbal et al. 27 presented a pictorial structure model to incorporate high-level activity information and then combined pose-based action recognition with FV encoding of iDT using late fusion. Zhang et al. 35 applied an improved score-level fusion to trajectory features and pose features based on the bag-of-visual-words (BoVW) 16 model and Dempster-Shafer evidence theory and demonstrated that score-level fusion is the most effective strategy for the combination of these two types of features. For the trajectory features, inspired by the breakthroughs in saliency detection, 36,37 we propose a foreground trajectory extraction method according to the characteristics of video frames. An overview of our method is shown in Fig. 1.
Concretely, the MB image derived from the optical flow is first processed to obtain a binary image, which is used as an initial mask for dense sampling. Second, the center-bias 38 and dark channel 39 priors are exploited to detect the foreground region, which is optimized by the synchronous updating mechanism based on cellular automata 40 and then treated as a weak saliency map. The strong saliency map is calculated through a superpixel classification model, which is constructed via the multiple kernels boosting (MKB) 41 method. We apply a collaborative optimization strategy to the integration of the two saliency maps and obtain the final foreground detection for each frame. The multiscale hybrid masks are generated by the intersection of the initial mask and the generalized foreground region. Finally, we extract a set of foreground trajectories that are closely related to human actions by means of the compensation schemes.
Moreover, considering the complementarity between multiple features, we design a hybrid fusion framework to integrate the foreground trajectory features, iDT features, and pose features by referring to the correlation between different feature descriptors. The contributions of this paper are as follows:
• To obtain the trajectories closely related to the action subject and filter out the trajectories derived from camera motion and inherent movements in the background, a saliency-based sampling strategy named foreground trajectories on multiscale hybrid masks (HM-FTs) is proposed. Specifically, according to the characteristics of action videos, a foreground region detection algorithm is presented using the weak saliency map optimized by the synchronous updating mechanism of cellular automata and the strong saliency map achieved through the MKB method.
• The collaborative optimization strategy is formulated to amend abnormal detection results by exploiting the cooperation between frames. Furthermore, compensation schemes are designed to improve the robustness of the foreground trajectory features.
• A hybrid feature fusion framework, which combines representation- and score-level fusions, is constructed based on the BoVW pipelines. The effectiveness of the HM-FT features and the multifeature fusion method is demonstrated on the Penn Action 42 and sub-JHMDB 21 datasets.
The rest of this paper is organized as follows. In Sec. 2, we describe each extraction step of the proposed HM-FT feature in detail and briefly introduce the iDT feature and an efficient pose feature. A hybrid feature fusion framework for the three types of features is presented in Sec. 3. In Sec. 4, the performance of the HM-FT feature and the feature fusion framework is evaluated on the public datasets and compared with state-of-the-art action recognition methods. The paper is concluded in Sec. 5.

Multifeature Extraction Strategy
In this section, the HM-FT feature is presented based on Fig. 1. Different from previous works, we use the optical flow to constrain the sampled points to the MBs to obtain the initial masks. The saliency detection algorithm is improved in applicability and detection performance according to the characteristics of action videos to generate foreground masks. These masks are integrated to extract trajectory features that are closely related to actions. We also design the collaborative optimization strategy and the compensation schemes to deal with abnormal and failed detections. In addition, we briefly introduce two features that are complementary to HM-FT; they will be used for feature fusion to improve the overall recognition performance.

Foreground Trajectory Feature Extraction Based on Multiscale Hybrid Masks

Motion boundary detection
Original DT features need to track densely sampled points on multiple spatial scales of frames and therefore generate a large number of motion trajectories. Although the sampled points in a smooth region are removed when the smaller eigenvalue of the autocorrelation matrix is below a threshold, 3 a large number of points are still distributed in the background region. Once there is any moving nontarget human body in these regions, or the camera is shaking, background trajectories are inevitably generated, which significantly reduces the discrimination performance of trajectory features. To solve these problems, we first focus on making the sampled points distribute as much as possible on the boundaries of the regions where significant movement occurs in a frame.
The Sobel operator is used to calculate the gradients of the horizontal and vertical components of the optical flow to obtain two gradient magnitude images. We compute the pixel-wise maximum of the two gradient magnitude images to get an MB image I_B. The binary image of I_B, which is obtained by the Otsu algorithm 43 and denoted as mask_1, is used as a mask when an image is densely sampled. The dense sampling strategy based on MBs can filter out most of the sampled points in the background, which do not fall in the foreground region of mask_1, so the part of the background trajectories generated by camera motion can be removed, as shown in the first two rows of Fig. 2. However, for regions with rich contours and textures, this method has a poor effect on filtering out background trajectories, as shown in the third row of Fig. 2. The MB of a human can be detected completely, as shown in Fig. 2(b).
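As a concrete illustration of this step, the following sketch computes I_B and mask_1 for a pair of frames; it assumes OpenCV's Farneback algorithm as the dense optical flow estimator (the text does not specify which flow method is used) and treats the Otsu output directly as the sampling mask:

```python
import cv2
import numpy as np

def motion_boundary_mask(prev_gray, curr_gray):
    """Compute the motion boundary image I_B and its Otsu binarization mask_1."""
    # Dense optical flow between consecutive frames (Farneback used as a stand-in).
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    fx, fy = flow[..., 0], flow[..., 1]

    # Sobel gradient magnitude of one flow component.
    def grad_mag(channel):
        gx = cv2.Sobel(channel, cv2.CV_32F, 1, 0, ksize=3)
        gy = cv2.Sobel(channel, cv2.CV_32F, 0, 1, ksize=3)
        return np.sqrt(gx ** 2 + gy ** 2)

    # Motion boundary image: pixel-wise maximum of the two gradient magnitude images.
    I_B = np.maximum(grad_mag(fx), grad_mag(fy))

    # Otsu thresholding of I_B gives the binary sampling mask mask_1.
    I_B_u8 = cv2.normalize(I_B, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    _, mask1 = cv2.threshold(I_B_u8, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return I_B, mask1
```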

Moving foreground detection
By researching and summarizing action videos in many datasets, we find that background trajectories are mainly produced by camera motion and inherent movements in the background. Typical inherent movements in the background include pedestrians passing, vehicle movement, objects shaking in the wind, and so on. To further eliminate the interference of background trajectories, the center-bias and dark channel priors [37][38][39] are exploited to achieve moving foreground detection in each frame.
The process of shooting an action follows the visual attention mechanism of human eyes, and the purpose of almost all intentional camera motions is to lock the moving human in the center of the lens. The dark channel prior 39 was proposed for image haze removal. It is a statistics-based prior summarized by analyzing a large number of foggy images. The observation shows that, in regions that do not include the sky, there are one or more pixels whose intensity values are approximately equal to zero in at least one of the RGB color channels. The dark channel of an image is mainly generated by shadow regions and the surfaces of colored or dark objects, which generally appear in the foreground regions, see Figs. 3(a)-3(c).
Therefore, the dark channel property is exploited as prior information in the process of moving foreground detection. Assuming that the dark channel prior value of pixel p is S(p), it is calculated as

S(p) = min_{q ∈ Q} ( min_{C ∈ {R,G,B}} V_C(q) ),   (1)

where Q denotes a 5 × 5 image patch centered on p, and V_C(q) denotes the color value of pixel q in channel C. However, for images with a brighter foreground or darker background, the dark channel prior may lead to a failure of moving foreground detection, as shown in Fig. 3(d). To this end, we calculate the mean value of all S(p_e), where p_e is a pixel on the borders of a frame. If it is greater than 0.8, the influence of the dark channel prior on the frame is eliminated, and the value of S(p) is set to zero.
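A minimal sketch of the per-pixel dark channel prior of Eq. (1) and the border-mean check described above; realizing the 5 × 5 patch minimum with a morphological erosion is an implementation choice, not something stated in the text:

```python
import cv2
import numpy as np

def dark_channel_prior(frame_bgr, patch=5, border_thresh=0.8):
    """Per-pixel dark channel S(p) of Eq. (1); disabled when the border mean exceeds 0.8."""
    img = frame_bgr.astype(np.float32) / 255.0
    # Minimum over the color channels, then minimum over a 5x5 patch (erosion).
    min_rgb = img.min(axis=2)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (patch, patch))
    S = cv2.erode(min_rgb, kernel)

    # Border check: if the mean dark channel value on the frame borders is greater
    # than the threshold, the prior is deemed unreliable and zeroed out.
    border = np.concatenate([S[0, :], S[-1, :], S[:, 0], S[:, -1]])
    if border.mean() > border_thresh:
        S = np.zeros_like(S)
    return S
```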
To obtain structure information of the moving foreground, the simple linear iterative clustering 44 algorithm is exploited to achieve multiscale superpixel segmentation of an input frame. The numbers of superpixels at the different scales are set to 100, 150, 200, and 250, respectively, to avoid the incomplete structure information caused by single-scale superpixel segmentation. Let b_i denote a superpixel, i = 1, ..., N, where N is the number of superpixels at a scale. The saliency value of b_i is calculated as

m(b_i) = g(b_i) S(b_i) Σ_{f ∈ F} Σ_{j=1}^{N_B} d_f(b_i, e_j),   (2)

where d_f(b_i, e_j) represents the Euclidean distance in feature space f between b_i and a border superpixel e_j, and N_B is the number of superpixels along the frame border. f is the feature type, and F includes the RGB, CIELab, and LBP features, because RGB and CIELab are complementary, and using color and texture features simultaneously is more robust to complex backgrounds. S(b_i) is the mean value of all S(p), p ∈ b_i. g(b_i) is the weight of the center-bias prior, and its value is equal to the normalized spatial distance between the center of b_i and the frame center.
The value of m(b_i) is assigned to all the pixels in the region b_i, and Gaussian filtering is used to generate the moving foreground map M_0.
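The following sketch assembles a single-scale version of the weak foreground map under the reconstructed form of Eq. (2); it uses only CIELab boundary contrast (the paper also uses RGB and LBP features), and it interprets the center-bias weight as favoring superpixels near the frame center, so the exact weighting may differ from the original implementation:

```python
import numpy as np
from skimage.segmentation import slic
from skimage.color import rgb2lab

def weak_saliency_map(frame_rgb, S_dark, n_segments=200):
    """Per-superpixel score m(b_i) from boundary contrast, the dark channel prior
    S(b_i), and a center-bias weight g(b_i). Single-scale sketch of Eq. (2)."""
    h, w, _ = frame_rgb.shape
    labels = slic(frame_rgb, n_segments=n_segments, compactness=10, start_label=0)
    n = labels.max() + 1
    lab = rgb2lab(frame_rgb)

    feat = np.array([lab[labels == i].mean(axis=0) for i in range(n)])   # mean CIELab per superpixel
    S_sp = np.array([S_dark[labels == i].mean() for i in range(n)])      # S(b_i)
    ys, xs = np.mgrid[0:h, 0:w]
    centers = np.array([[ys[labels == i].mean(), xs[labels == i].mean()] for i in range(n)])
    dist = np.hypot(centers[:, 0] - h / 2, centers[:, 1] - w / 2)
    g = 1.0 - dist / dist.max()      # center bias interpreted as favoring central superpixels

    # Superpixels touching the frame border act as background seeds e_j.
    border_ids = np.unique(np.concatenate(
        [labels[0, :], labels[-1, :], labels[:, 0], labels[:, -1]]))
    contrast = np.mean(
        [np.linalg.norm(feat - feat[e], axis=1) for e in border_ids], axis=0)

    m = g * S_sp * contrast
    M0 = m[labels]                   # assign m(b_i) to every pixel of b_i
    return (M0 - M0.min()) / (M0.max() - M0.min() + 1e-8)
```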

Foreground region optimization
To obtain a more accurate foreground region, all the superpixels at each scale are treated as a set of cells, and the synchronous updating mechanism based on cellular automata 40 is exploited to optimize M_0. Unlike the original cellular automata models, the influences of neighbors on a cell are not fixed. The influence of any pair of cells is closely related to their similarity in the CIELab color space. Accordingly, the impact factor z_ij between cell b_i and its neighbor b_j is calculated as

z_ij = exp(−d(b_i, b_j)/μ),   (3)

where d(b_i, b_j) represents the Euclidean distance between b_i and b_j in CIELab color space, and μ is a regulatory factor. We follow Ref. 45 and set the value of μ to 0.1. The z_ij of every pair of adjacent cells is calculated to construct an impact factor matrix Z = [z_ij]_{N×N} whose main diagonal elements are zero. Note that all the cells along the frame borders are considered to be interconnected because all of them are regarded as background seeds. Then, each row of Z is normalized by d_i = Σ_j z_ij, i, j = 1, 2, ..., N, yielding the normalized impact factor matrix Z*. A coherence matrix T = diag{c_1, c_2, ..., c_N} is established to make the moving foreground more complete and avoid losing fine structures on a human body, where c_i is calculated as

c_i = 1 / max_j z_ij.   (4)

If there is a significant difference between a cell and its neighbors, the state of the cell at the next moment will be determined primarily by itself. On the contrary, if the cell is more similar to a neighbor, it is likely to be assimilated by that neighbor. Considering that the evolution of a cell will produce extreme results when c_i is too high or too low, we follow Ref. 46 and convert the value of c_i to the range [γ, γ + η] by

c_i* = γ + η · (c_i − min_j c_j) / (max_j c_j − min_j c_j),   (5)

where j = 1, ..., N. The synchronous updating mechanism for the cellular automata is formulated based on the impact factor matrix and the coherence matrix as

M^{t+1} = T* · M^t + (I − T*) · Z* · M^t,   (6)

where T* = diag{c_1*, c_2*, ..., c_N*} and I is the identity matrix. When t = 0, the initial M^t is the moving foreground map M_0. The optimized saliency map M_C is achieved by iteratively executing the updating mechanism. Note that we use the saliency value of each superpixel as its state, which describes the relationship between the cells more comprehensively and reasonably. In the iterative process, since the influences of neighbors are changed, cellular automata based on the broader definition of neighborhoods can enhance saliency consistency among similar regions and naturally form a clear boundary between the action subject and the background. Besides, when salient superpixels are mistakenly selected as background, they will automatically increase their saliency values under the influence of the local environment. The Otsu algorithm is used to calculate the binary image M_B^i of the saliency map M_C^i at the i'th superpixel scale. We consider both M_B^i and M_C^i to construct the weak saliency map M_W by

M_W = (1/n) Σ_{i=1}^{n} M_B^i ∘ M_C^i,   (7)

where n is the number of superpixel scales and ∘ denotes the element-wise product.
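A compact sketch of the synchronous update under the reconstructed Eqs. (3)-(6); the neighborhood matrix, the number of iterations, and the stopping rule are assumptions, since the text does not specify them:

```python
import numpy as np

def cellular_automata_refine(M0, feat_lab, adjacency, mu=0.1, gamma=0.2, eta=0.6, n_iter=20):
    """Synchronous cellular-automata refinement of the per-superpixel saliency values M0.
    `feat_lab` holds the mean CIELab color of each superpixel and `adjacency` is a
    boolean N x N neighborhood matrix (border superpixels interconnected)."""
    N = len(M0)
    d = np.linalg.norm(feat_lab[:, None, :] - feat_lab[None, :, :], axis=2)
    Z = np.where(adjacency, np.exp(-d / mu), 0.0)                  # impact factors z_ij, Eq. (3)
    np.fill_diagonal(Z, 0.0)
    Z_star = Z / np.maximum(Z.sum(axis=1, keepdims=True), 1e-8)    # row normalization

    c = 1.0 / np.maximum(Z.max(axis=1), 1e-8)                      # coherence, Eq. (4)
    c = gamma + eta * (c - c.min()) / (c.max() - c.min() + 1e-8)   # map to [gamma, gamma+eta], Eq. (5)
    T_star = np.diag(c)

    M = M0.copy()
    for _ in range(n_iter):                                        # synchronous update, Eq. (6)
        M = T_star @ M + (np.eye(N) - T_star) @ (Z_star @ M)
    return M
```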

Multiscale superpixels classification
We select training samples from every M_W^i, where i = 1, ..., n indexes the superpixel scales, and then construct a superpixel classification model based on the MKB 41 method to obtain the strong saliency map at each superpixel scale. Specifically, if V_j ≥ λ_max × V_M, the j'th superpixel is regarded as a positive sample; otherwise, if V_j < λ_min, the j'th superpixel is considered a negative sample. V_M represents the average saliency value of M_W^i, and V_j represents the average saliency value of the j'th superpixel. To control the number of training samples, shorten the training time, and ensure the balance of positive and negative samples, λ_max and λ_min are set to 1.5 and 0.05, respectively.
The RGB, CIELab, and LBP features extracted from the training samples are utilized to train the strong classifier by MKB. The discriminant function of MKB is constructed based on the traditional multiple kernel learning method as

F(x) = Σ_{r=1}^{R} β_r ( Σ_{h=1}^{H} α_h y_h k_r(x_h, x) + b_r ),   (8)

where H is the number of training samples, R is the number of weak classifiers, x_h denotes a training sample, and y_h ∈ {−1, 1} denotes the label corresponding to x_h. Moreover, β_r is the weight of the kernel function k_r(x_h, x), α_h is the Lagrange multiplier, and b_r is a constant term. The α_h and b_r of each weak classifier can be obtained by solving the corresponding quadratic programming problem. We obtain 3 × K basic classifiers based on the three feature sets and K types of kernel functions. The AdaBoost algorithm is used to solve the weight β_r of each weak classifier iteratively and output the weak classifier models. All the superpixels at the n superpixel scales derived from a frame are treated as testing samples. Equation (8) is used to output the decision value of every superpixel, which is assigned to all the pixels in the corresponding region and then normalized to obtain the strong saliency map at each scale. Gaussian filtering is used to generate a smoother saliency map M_S^i, and M_H^i denotes the binary image of M_S^i. The strong saliency map M_S for a frame is constructed as

M_S = (1/n) Σ_{i=1}^{n} M_H^i ∘ M_S^i,   (9)

where ∘ denotes the element-wise product. Considering that the guided filter preserves strong edges and blurs weak edges, we use it to optimize M_S; the filtered result is taken as the final strong saliency map M_S.
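The reconstructed decision function of Eq. (8) can be evaluated as below once the per-kernel weak classifiers have been trained; the dictionary layout of a weak classifier and the example kernel parameters are illustrative assumptions:

```python
import numpy as np

def mkb_decision(x, weak_classifiers):
    """Decision value of the MKB strong classifier, following Eq. (8). Each weak
    classifier is a dict with its boosting weight beta, kernel function, support
    samples X, labels y, Lagrange multipliers alpha, and bias b."""
    score = 0.0
    for clf in weak_classifiers:
        k_vals = np.array([clf["kernel"](x_h, x) for x_h in clf["X"]])
        score += clf["beta"] * (np.sum(clf["alpha"] * clf["y"] * k_vals) + clf["b"])
    return score

# Example kernels that could instantiate the 3 x K basic classifiers
# (linear, polynomial, and RBF, as used in the paper):
linear_k = lambda a, b: float(np.dot(a, b))
poly_k = lambda a, b, deg=2: float((np.dot(a, b) + 1.0) ** deg)
rbf_k = lambda a, b, s=1.0: float(np.exp(-np.linalg.norm(a - b) ** 2 / (2 * s ** 2)))
```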

Multiscale hybrid masks acquisition
The weak saliency map captures the local structure information of the moving foreground more easily. In contrast, the strong saliency map achieved by the MKB 41 method, which transforms the task of foreground detection into a binary classification problem for superpixels, tends to describe the global information of objects. The two saliency maps are integrated by a weighted fusion, M_E = ω_S M_S + ω_W M_W, to obtain the final foreground detection result M_E, where the ratio factors are ω_S = 0.7 and ω_W = 0.3.
The action scene and the appearance of the moving human in all frames of the same video are usually highly consistent, so the detections for these frames are similar, especially for adjacent frames. Although subtle local changes occur in the foreground region due to lens movement and human pose adjustment, there is strong cooperation between the detections of frames. The collaborative optimization strategy is proposed to amend abnormal detections based on the above analysis. The specific steps are as follows.
First, we concatenate the saliency values of all pixels in the normalized M_E to generate a feature vector f_i. Assuming that an action video has m frames, the Euclidean distance between any two saliency feature vectors is d(f_i, f_j), where i, j = 1, 2, ..., m and i ≠ j. The sum of d(f_i, f_{i+1}) and d(f_i, f_{i−1}) is denoted as φ_i. Then, abnormal frames are selected according to the threshold

ξ = (σ/m) Σ_{i=1}^{m} φ_i,   (10)

where the scale factor σ is set to 1.5. If φ_i ≥ ξ, the i'th frame is regarded as an abnormal frame; otherwise, it is defined as a keyframe. Finally, the saliency values of abnormal frames that are not adjacent to each other are reset to the average of the saliency values of the previous and subsequent keyframes. If the abnormal frames are consecutive, we calculate the abnormality degree of each of them by Σ_{j=1, j≠i}^{m} d(f_i, f_j) and regard the frame with the minimum abnormality degree as a relative keyframe ψ. Let ψ_c be the nearest keyframe to ψ; the saliency values of the frames between ψ and ψ_c are reset to the average of those of ψ and ψ_c.
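A simplified sketch of the collaborative optimization step, using the reconstructed threshold of Eq. (10); the handling of consecutive abnormal frames is reduced here to copying or averaging the nearest keyframes rather than the relative-keyframe procedure described above:

```python
import numpy as np

def collaborative_optimization(saliency_maps, sigma=1.5):
    """Flag abnormal frames from the distances of flattened saliency maps and
    reset them from neighboring keyframes (simplified sketch)."""
    f = np.array([smap.ravel() for smap in saliency_maps])      # feature vector per frame
    m = len(f)
    phi = np.zeros(m)
    for i in range(m):
        if i > 0:
            phi[i] += np.linalg.norm(f[i] - f[i - 1])
        if i < m - 1:
            phi[i] += np.linalg.norm(f[i] - f[i + 1])
    xi = sigma * phi.mean()                                      # threshold of Eq. (10)
    abnormal = phi >= xi

    fixed = list(saliency_maps)
    key_idx = [i for i in range(m) if not abnormal[i]]
    for i in range(m):
        if abnormal[i] and key_idx:
            prev_keys = [k for k in key_idx if k < i]
            next_keys = [k for k in key_idx if k > i]
            if prev_keys and next_keys:
                fixed[i] = 0.5 * (saliency_maps[prev_keys[-1]] + saliency_maps[next_keys[0]])
            else:
                nearest = prev_keys[-1] if prev_keys else next_keys[0]
                fixed[i] = saliency_maps[nearest].copy()
    return fixed, abnormal
```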
We apply two iterations of morphological dilation to the binary image of M_E to generate a robust foreground mask denoted as mask_2. To make the human body covered by mask_2 more complete and to overcome the problem that the extremities and head are lost in detection due to low image resolution, mask_2 is generalized as follows: if the area of the foreground region in mask_2 is less than or equal to 0.08 of the image area, the bounding box of the foreground is constructed using the maximum and minimum values of its pixel coordinates in the horizontal and vertical directions. The distances between the center of the bounding box and its four borders are each increased by 3 pixels (this margin decreases progressively as the spatial scale of a frame decreases) to obtain a generalized mask_2.
We calculate the intersection of mask_1 and mask_2 on each spatial scale of a frame separately to obtain the multiscale hybrid masks for the moving foreground.
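A single-scale sketch of the mask construction just described; the 3 × 3 dilation kernel is an assumption, and the 3-pixel margin is kept fixed instead of shrinking with the spatial scale:

```python
import cv2
import numpy as np

def hybrid_mask(mask1, M_E_binary, area_ratio_thresh=0.08, margin=3):
    """Build the hybrid mask: dilate the binary foreground detection twice (mask_2),
    generalize it with an enlarged bounding box when the detected region is small,
    and intersect the result with the motion boundary mask mask_1."""
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    mask2 = cv2.dilate(M_E_binary, kernel, iterations=2)

    h, w = mask2.shape
    ys, xs = np.nonzero(mask2)
    if ys.size and ys.size <= area_ratio_thresh * h * w:
        # Small foreground: replace mask_2 by its bounding box enlarged by `margin` pixels.
        y0, y1 = max(ys.min() - margin, 0), min(ys.max() + margin, h - 1)
        x0, x1 = max(xs.min() - margin, 0), min(xs.max() + margin, w - 1)
        mask2 = np.zeros_like(mask2)
        mask2[y0:y1 + 1, x0:x1 + 1] = 255

    return cv2.bitwise_and(mask1, mask2)
```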

Foreground trajectory features extraction
Two-dimensional grids are constructed for each frame with a sampling step size of 5 pixels on eight spatial scales spaced by a factor of 1/√2 to extract foreground trajectories. The multiscale hybrid masks are used to refine the sampled points: when a sampled point does not fall in the foreground region of the mask, it is removed. Foreground trajectories are generated by tracking the remaining points on multiple spatial scales of a frame. To fully mine the motion information from foreground trajectories, multiple descriptors [i.e., trajectory shape (TS), HOG, HOF, and motion boundary histogram (MBH) 12 ] within a space-time volume around each trajectory are computed. We use the same settings as Ref. 3, so the final dimensions of the descriptors are 30 for TS, 96 for HOG, 108 for HOF, and 192 for MBH.
When the variations between adjacent frames are too subtle to generate a large number of MBs, relatively few sampled points are extracted by the proposed method. However, as long as the hybrid masks of each frame completely cover the moving foreground, the frames with fewer sampled points are only a small part of a video. In extreme cases, when the colors of the background and foreground are highly consistent, or there are objects in the background that are more salient than the moving foreground, the detection will deviate from the foreground region. These deviations leave too few foreground trajectories related to the human motion to describe actions adequately, thereby reducing the discrimination power of the trajectory features. To solve these problems, we formulate two compensation schemes (a minimal sketch of both checks follows this list):
• Sampled points are extracted from each frame using the hybrid masks, and the number of frames in which the number of points in the first layer of the image pyramid is not larger than τ_1 is counted as m_f. If m_f/(m − 1) ≥ 0.5, the hybrid masks are replaced by mask_1, and the trajectory features for the video are re-extracted. Since the number of sampled points is proportional to the image resolution, τ_1 is adjusted adaptively by τ_1 = p · τ_2, where p is the baseline number of points and τ_2 is the scaling factor for resolution.
• When only mask_1 is employed, the real background trajectories are sparser than the real foreground trajectories, so a failed foreground detection leads to a decrease in the number of trajectories. We denote the number of trajectories for a video as N_ts. If N_ts/m ≤ τ_3, where τ_3 = d · τ_2 and d is the baseline number of trajectories, we re-extract the original DT features for the video.
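A minimal sketch of the two compensation checks, assuming τ_2 = 1 at the 320 × 240 baseline resolution:

```python
def needs_mask1_fallback(points_per_frame, p=8, tau2=1.0):
    """First compensation scheme: if at least half of the frames yield no more than
    tau_1 = p * tau_2 sampled points on the first pyramid level, fall back to mask_1."""
    tau1 = p * tau2
    m = len(points_per_frame)
    m_f = sum(1 for n_pts in points_per_frame if n_pts <= tau1)
    return m > 1 and m_f / (m - 1) >= 0.5

def needs_dt_fallback(num_trajectories, num_frames, d=5, tau2=1.0):
    """Second compensation scheme: if the trajectory count per frame is no more than
    tau_3 = d * tau_2, re-extract the original DT features for the video."""
    return num_trajectories / max(num_frames, 1) <= d * tau2
```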
Note that the first scheme is given priority; trajectory features processed by the first scheme are judged and corrected again by the second scheme. In the end, the HM-FTs are obtained. Figure 4 shows the visualization of different stages of the foreground trajectory extraction, including the trajectories by DT, the MB image, the trajectories by MB, the foreground detection results, and the proposed HM-FTs. From Fig. 4, we find that the MBs can filter out most of the trajectories in the smooth regions of the background. However, for the background regions with rich contours and textures, the method cannot remove the background trajectories generated by camera motion, as shown in the first, third, and sixth rows of Fig. 4(d). For the trajectories generated by inherent movements in the background (e.g., pedestrians passing, nontarget object movement, water surface fluctuations, etc.), the method does not have any effective removal mechanism, as shown in the second, fourth, and fifth rows of Fig. 4(d).
The hybrid masks achieved by exploiting MB detection and foreground detection are utilized to refine the densely sampled points. The resulting foreground trajectories not only suppress the influence of background trajectories on the discrimination power of the features but also contain abundant information about the human action, which makes the trajectory features more expressive, as shown in Fig. 4(f).

Improved Dense Trajectory Feature Extraction
Although the above method can effectively filter out background trajectories, the offset of foreground trajectories caused by camera motion lacks the necessary amendment. Therefore, we combine the foreground trajectory features with the iDT features to make up for this deficiency. The iDT feature is an improved version of DT, which makes a reasonable estimation and effective use of the camera motion information so that the trajectory feature focuses more on describing the subject of an action. Specifically, iDT assumes that there is a homography transformation between adjacent frames because the changes between them are relatively slight. Camera motion estimation can then be solved by calculating a homography matrix between adjacent frames. SURF and dense optical flow are used to match frames and obtain the matching point pairs. The global homography matrix is calculated by the random sample consensus algorithm 47 based on these point pairs. The original DT is then amended by the camera motion information.
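For illustration, the camera motion estimation used by iDT can be sketched as follows; it assumes the opencv-contrib SURF implementation and brute-force matching, and it omits the dense optical flow matches that iDT additionally uses:

```python
import cv2
import numpy as np

def estimate_camera_motion(prev_gray, curr_gray):
    """Homography between adjacent frames from SURF matches and RANSAC, as used by
    iDT to cancel camera motion (sketch; requires the opencv-contrib SURF module)."""
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    kp1, des1 = surf.detectAndCompute(prev_gray, None)
    kp2, des2 = surf.detectAndCompute(curr_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(des1, des2)
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 1.0)
    return H
```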

Pose-Based Feature Extraction
Trajectory features describe the apparent structure and motion state around trajectories, whereas pose features focus on describing the distribution and coupling relationships of human joints. These two types of features are highly complementary. 35 The popular methods 20,22 for pose estimation usually describe human joints as a tree-structured graph and use a dynamic programming algorithm to infer the position of every joint. Considering that the framework in Ref. 20 is representative, it is employed to achieve pose estimation. Some pose estimation results for the full body with 26 human joints are shown in Fig. 5. The pose descriptors are designed at both the time and space levels based on the results of pose estimation. To remove redundant joints, we follow Ref. 21 and retain 15 joints for describing a full body. When the frame step is set to s, if a joint coordinate is (x, y), the coordinate displacements are d_x and d_y, and the angle of the space-time displacement vector is arctan(d_x/d_y).
To improve the pose features, a weakening factor R is used to attenuate the effect of the joint information in the initial and last frames, because the motion amplitude in the middle frames is usually more apparent. For a video with m frames, the multiple sets of time-level descriptors can be obtained by constructing the coordinate displacement matrix P_tr and the vectorial angle matrix P_an,

P_tr = [ (f_{1+s} − f_1)/R, f_{1+2s} − f_{1+s}, ..., (f_m − f_{m−s})/R ],   (11)

P_an = [ θ(f_1, f_{1+s})/R, θ(f_{1+s}, f_{1+2s}), ..., θ(f_{m−s}, f_m)/R ],   (12)

where f represents all the joint coordinate data for a frame, its subscript indicates the frame number, and θ(f_1, f_{1+s}) is a column vector consisting of the 15 angles of the space-time displacement vectors. Each type of time-level descriptor for a video is composed of the data in the same dimension of all elements in P_tr or P_an. Therefore, we can obtain 75 types of descriptors derived from the human joints.
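The following sketch computes the time-level displacement and angle descriptors under the reconstructed Eqs. (11) and (12); the (m, 15, 2) joint array layout and the way the weakening factor is applied to the first and last frame pairs are assumptions:

```python
import numpy as np

def pose_time_descriptors(joints, s=3, R=3.0):
    """Time-level pose descriptors: per-joint coordinate displacements and displacement
    angles between frames f_t and f_{t+s}, with the first and last pairs attenuated
    by the weakening factor R. `joints` has shape (m, 15, 2) holding (x, y) per joint."""
    m = joints.shape[0]
    disp, ang = [], []
    for t in range(m - s):
        d = joints[t + s] - joints[t]                     # (15, 2) displacements d_x, d_y
        a = np.arctan2(d[:, 0], d[:, 1])                  # angle arctan(d_x / d_y) per joint
        w = 1.0 / R if t == 0 or t == m - s - 1 else 1.0  # attenuate the initial and last pairs
        disp.append(w * d)
        ang.append(w * a)
    return np.stack(disp), np.stack(ang)                  # P_tr-like and P_an-like stacks
```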

Features Fusion and Classification
In this work, different pipelines of BoVW are employed to construct the video-level representation from a set of descriptors. The trajectory features focus on describing the appearance structures, motion states, and MBs of a video, whereas the pose features focus on describing the changes in the position and movement of joints at both the temporal and spatial hierarchies.
To make full use of the complementarity between these two types of features and exert their respective advantages, we design a simple and effective feature fusion method.
For the trajectory features, we extract two sets of trajectories from a given video, namely, HM-FTs and iDTs. For iDTs, we follow the framework of Ref. 7 and compute multiple descriptors under the default settings. For each set of trajectories, the four types of descriptors are encoded separately to obtain the video-level representations. The global representation derived from a specific type of descriptor is obtained by concatenating the same type of video-level representations. Due to the strong correlation between different descriptors, representation-level fusion is exploited, which has been proved to be the best choice in Ref. 16. Generally speaking, the different global representations are concatenated as the final representation of a video. Principal component analysis (PCA) is employed to reduce the dimension of the descriptors to half of the original dimension, and whitening is combined with PCA to ensure that each dimension of the reduced vector has the same variance. We randomly select 256,000 descriptors from each descriptor set to train a Gaussian mixture model with 256 components. FV is utilized to encode the processed descriptors using the VLFeat toolbox, 48 and the encodings are then normalized by L2 and power normalization. A linear SVM with fixed C = 100 is used for classification because it has been proven to be more efficient in combination with FV. 7 The decision matrix M_tr of the combination of HM-FT and iDT for all testing samples is calculated by the one-against-rest approach.
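A compact NumPy/scikit-learn stand-in for this encoding pipeline (the paper uses the VLFeat toolbox in Matlab); the GMM training subset size and the FV formulation follow the text, while the sampling details are simplified:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def fisher_vector(desc, gmm):
    """First- and second-order Fisher vector of a descriptor set w.r.t. a diagonal GMM,
    followed by power and L2 normalization."""
    T = desc.shape[0]
    q = gmm.predict_proba(desc)                              # posteriors, (T, K)
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_  # (K, D), (K, D), (K,)
    diff = (desc[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]
    g_mu = (q[..., None] * diff).sum(0) / (T * np.sqrt(w)[:, None])
    g_var = (q[..., None] * (diff ** 2 - 1)).sum(0) / (T * np.sqrt(2 * w)[:, None])
    fv = np.concatenate([g_mu.ravel(), g_var.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                   # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)                 # L2 normalization

def build_encoder(train_desc, n_components=256):
    """PCA-whitening to half the dimension, then a 256-component diagonal GMM trained
    on up to 256,000 sampled descriptors."""
    pca = PCA(n_components=train_desc.shape[1] // 2, whiten=True).fit(train_desc)
    idx = np.random.choice(len(train_desc), min(256000, len(train_desc)), replace=False)
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(pca.transform(train_desc[idx]))
    return pca, gmm

# Usage: pca, gmm = build_encoder(all_training_descriptors)
#        video_repr = fisher_vector(pca.transform(video_descriptors), gmm)
```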
For the various descriptors of the pose features, we employ representation-level fusion to obtain a global representation of a video. All the training samples of a particular descriptor type are exploited to generate a codebook of size 20 by the k-means 49 algorithm. We use vector quantization to encode these descriptors, which are then normalized and concatenated to generate a 1500-dimensional pose feature for a video. An SVM with an RBF kernel is selected for classification, where fivefold cross-validation is used to determine the optimal parameters. The one-against-rest approach is utilized to calculate the decision matrix M_po for the testing samples.
Since the trajectory features and the pose features are independent of each other, score-level fusion 35 is chosen to achieve their integration. Let the final decision matrix be Z_f = M_tr + M_po; the prediction with the highest score in each row of Z_f is selected as the classification result.
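The score-level fusion then reduces to the following:

```python
import numpy as np

def score_level_fusion(M_tr, M_po):
    """Score-level fusion of the trajectory and pose decision matrices:
    Z_f = M_tr + M_po, then pick the highest-scoring class per test sample."""
    Z_f = M_tr + M_po
    return np.argmax(Z_f, axis=1)       # predicted class index for each row / sample
```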

Datasets
Our method is compared and analyzed on two benchmark action datasets, Penn Action 42 and sub-JHMDB. 21 These two datasets are used to evaluate algorithms for pose estimation and action recognition, and the research object is the full body. The Penn Action dataset contains 2326 video clips that belong to 15 action categories: "baseball pitch," "baseball swing," "bench press," "bowling," "clean and jerk," "golf swing," "jump rope," "jumping jacks," "pull up," "push up," "sit up," "squats," "strumming guitar," "tennis forehand," and "tennis serve." To enable the extraction of pose features, we remove the action "strumming guitar" and several samples according to Ref. 22 because most of the human body in those data is invisible. The pruned dataset contains 1206 training samples and 1017 testing samples, and its average accuracy is reported using the train/test split provided in Ref. 42.
We considered using the complete HMDB51 dataset, but found that it is not suitable as a benchmark for the performance evaluation of our method. Pose features cannot be extracted from a large number of samples in the HMDB51 dataset because they do not contain any fully visible person. Some actions only contain the head and shoulders, such as "smile," "chew," "laugh," "talk," "drink," "kiss," "eat," and "smoke." Moreover, the pose estimation algorithm used in this paper also cannot be applied to actions where more than one half or even two thirds of a person is invisible, such as "shake hands," "sit down," "brush hair," "pour," "hug," and "clap hands." To this end, all the performance evaluations and comparisons are conducted on the sub-JHMDB dataset. 21 Some sample frames from Penn Action, sub-JHMDB, and HMDB51 are shown in Fig. 6.

Basic performance evaluation for HM-FT feature
The effectiveness of the HM-FT feature is demonstrated by testing it on the two public datasets for human action recognition. All experiments are conducted on a lab computer running Windows 10 with a 3.50-GHz Intel Core i7-5930K CPU and 64 GB of RAM. Matlab R2015a and Visual Studio 2013 are used for implementation.
For the HM-FT feature, the range of c_i in the coherence matrix during the foreground detection process is set to [0.2, 0.8]. With η fixed to 0.6, the results are virtually unchanged when γ varies from 0.1 to 0.3. In the superpixel classification stage, three kinds of kernel functions, including the linear, polynomial, and RBF kernels, are utilized to train the basic classifiers. Moreover, considering that the image size of all videos in sub-JHMDB is 320 × 240 pixels, it is set as the baseline resolution. The influence of the parameters p and d on the recognition results is discussed in Sec. 4.2.4. Here, we report the best performance of the HM-FT feature with p = 8 and d = 5. With the HM-FT feature, the average recognition rates on Penn Action and sub-JHMDB are 88.91% and 63.19%, respectively. Note that although the average accuracy is reported for both datasets, we follow Ref. 21 and calculate the per-video accuracy for sub-JHMDB, which differs from the per-class accuracy employed for Penn Action. 42 The confusion matrices on the two datasets are shown in Fig. 7.
Figure 7(a) shows that on the Penn Action dataset, we achieve high accuracies on most of the actions, such as "clean and jerk," "jump rope," and "bowling." However, "bench press" has the lowest recognition accuracy of 0.71. In most cases, its testing samples are incorrectly recognized as "push up" because both actions only include the up-and-down motion of the arms, and their movement ranges are similar.
As for sub-JHMDB in Figs. 7(b)-7(d), although the numbers of testing samples for the same action differ across the three splits, the proposed HM-FT feature generally performs well on actions such as "golf" and "shoot ball." Moreover, we find that "climb stairs" and "walk" are easily confused with each other. Compared to the latter, there is an upward trend in the trajectories of "climb stairs," but the camera always adjusts the subject to the center of the visual field, which makes this difference insignificant.
From the four confusion matrices, we can infer that even though the extracted HM-FT features effectively suppress the interference of background trajectories in action recognition, they are not entirely robust to action classes with highly similar motion patterns. Future work will focus on identifying motion-related objects in the scene to provide the necessary semantic information for different actions, which can serve as an auxiliary discrimination basis to improve the discrimination power of the HM-FT features.

Overall recognition performance
For comparison, the recognition performance of different features, including trajectory features, pose features, and their combinations, is evaluated on the two public datasets. To ensure the objectivity of the results, we apply the same BoVW pipeline to the different types of trajectory features. The specific settings of each step of BoVW (i.e., feature preprocessing, codebook generation, feature encoding, and normalization) and the selection of classifiers are determined by referring to Sec. 3.
For the pose features, we set both the frame step s and the weakening factor R to 3. Unlike the 3225 types of descriptors used in Ref. 21, the optimized pose features only contain 75 types of descriptors, which significantly reduces the running time while preserving discriminative power (note that the accuracy achieved by the combination of the 3225 descriptor types and DT on the sub-JHMDB dataset is 52.9% 21 ). For example, for a video containing 42 frames with a resolution of 320 × 240 pixels, the running time of the optimized pose features is about 0.0058 s, which is far less than the 6.17 s consumed by the 3225 descriptor types.
Table 1 presents the comparison of the average accuracies achieved by different methods, where Comb. 1 is the combination of HM-FT and iDT, and Comb. 2 is the combination of HM-FT, iDT, and pose. We observe that the iDT feature demonstrates higher accuracies than the other trajectory features on both the sub-JHMDB and Penn Action datasets, outperforming HM-FT by 2.3% and 3.4%, respectively. As an improved version of DT, HM-FT improves the accuracies on the two datasets by 10.1% and 6.2%, which shows that the discrimination performance of the original DT is significantly enhanced after filtering out background trajectories. We also use two state-of-the-art saliency detection methods presented in Refs. 36 and 37 to generate masks individually and test the recognition performance on the two datasets. However, their recognition accuracies are significantly inferior to that of HM-FT, where the multiscale hybrid masks are exploited. This can be attributed to failed saliency detections. Indeed, due to the inherent challenges of saliency detection and the characteristics of action videos, where a frame does not necessarily contain a salient motion subject, saliency detection methods alone are not sufficient to provide reliable prior information for trajectory features without any auxiliary strategy.
Furthermore, the combination of HM-FT and iDT always performs better than either set of trajectories alone, but worse than the combination of trajectory features and pose features, which achieves the best accuracies of 72.4% and 95.2%. Thus, we conclude that the proposed feature fusion framework can effectively exploit the complementarity between the two types of features, thereby boosting the overall recognition performance.

Comparison of the performance of different trajectory features
The recognition results of each class based on the different trajectory features are computed on the two datasets. For the sub-JHMDB dataset, to show the comparison results intuitively, the recognition accuracy for a class is defined as the quotient of the number of samples that have been correctly classified and the total number of testing samples in all three splits. As shown in Fig. 8, HM-FT achieves higher accuracies than DT for 7 out of 12 classes on sub-JHMDB, while the same results are obtained on "pick" and "run." In particular, the accuracy of "shoot ball" achieved by HM-FT is 100%, which outperforms DT by 75%. Moreover, HM-FT+iDT is better than HM-FT alone by 13.51% on average for seven classes of actions and only worse on "kick ball" and "swing baseball." From Fig. 9, HM-FT+iDT achieves the highest recognition accuracies for almost all classes on the Penn Action dataset, especially on the easily confused "bench press," "tennis forehand," and "tennis serve." In addition, HM-FT is greater than or equal to DT on 10 classes, but obtains a much lower accuracy than DT on "bench press." By comparing the different trajectory features, we conclude that although the detections deviate from the foreground region in a few cases and lead to a decline in accuracy, HM-FT is more effective than DT for most actions. In the vast majority of cases, the combination of HM-FT and iDT can further improve the classification power of HM-FT.
To quantify the computational cost of the proposed HM-FT, we compare its performance with three trajectory feature extraction methods in different aspects, including the time taken to process a video frame, the average number of trajectories per video clip, and the recognition accuracy. We randomly select 12 video clips from the sub-JHMDB dataset with a resolution of 320 × 240 and 14 videos from the Penn Action dataset with a minimum resolution of 480 × 270 and a maximum resolution of 480 × 393.
From Table 2, since the computational cost of tracking sampled points decreases significantly when DT-MB 14 is used, its time taken to process a video frame is the lowest. However, its recognition accuracy is not improved compared to DT. iDT-RCB 50 is an improved strategy based on DT, where the warped optical flow is exploited to adjust the interest point sampling and remove subtle motions. Although the recognition accuracy of DT is enhanced, its computational cost is higher than that of DT, which should be attributed to the calculation of optical flow and saliency detection. The proposed HM-FT further filters out invalid points by the multiscale hybrid masks to produce the smallest number of trajectories, which further reduces the computational cost of tracking points compared to DT-MB. However, since the moving foreground detection in the proposed scheme requires additional computation, the final computational cost of HM-FT is higher than that of DT on the two datasets. This disadvantage diminishes as the image resolution increases because more invalid points are removed, as can be seen from Table 2. HM-FT significantly improves the recognition accuracy of DT. Indeed, a limited reduction and efficient selection of trajectories tend to improve the accuracy at a minor computational cost. Taking into account the subsequent recognition procedure, fewer trajectories also lead to a faster video encoding process.

Evaluation of the parameters for compensation schemes
We evaluate the impact of the compensation scheme parameters on recognition performance. The relationships between the performance of HM-FT and the two parameters of the compensation schemes (p and d) are shown in Fig. 10. Overall, increasing the baseline number of trajectories from 0 to 5 improves performance on both datasets.
Increasing the baseline number of trajectories further (from 5 to 10) yields significant performance degradation. In the cases of p ≤ 8 and d ≤ 5, increasing the baseline number of sampled points improves performance, likely because the samples with foreground detection deviations are corrected. We find that p = 8 with d = 5 provides a good tradeoff between performance and computation.

Comparison with the state-of-the-art
The recognition accuracies achieved by our method are compared with the state-of-the-art methods on the Penn Action and sub-JHMDB datasets, as shown in Table 3, where F-level indicates feature-level fusion and S-level indicates score-level fusion. For Penn Action, the average accuracy achieved by this work is 95.2%, which improves upon the state-of-the-art methods. For sub-JHMDB, only the work in Ref. 27, which combines iDT and a pose feature based on CNN, produces a better result than ours. However, if we replace the pose estimation results with the ground truth (GT) provided by the datasets, the recognition rate of "Comb. 2 (Pose-GT)" is 81.3%, which means that the insufficiency of the pose estimation does not affect our contributions in improving trajectory features and designing the hybrid multifeature fusion framework. We find that the multifeature fusion strategies in Refs. 16, 21, 22, 27, 34, and 35 can always improve the recognition performance of a single feature by integrating more abundant human motion information. Our method, which benefits from the proposed HM-FT features and the appropriate fusion strategy, improves upon most of the similar algorithms. Moreover, although deep-learned methods have improved the state-of-the-art performance on many datasets based on massive video data and large-scale training, they have no significant advantage over the handcrafted methods when the two datasets have less training data. The comparisons with the deep-learned methods (i.e., P-CNN, 26 Pose, 27 iDT+Pose, 27 ARRNET, 28 and Deep Moving Poselets 29 ) confirm this conclusion. From Ref. 27, we find that the fusion of deep-learned features and handcrafted features at different levels is a promising research direction to improve recognition performance.

Conclusion
In this paper, a saliency-based sampling strategy (i.e., HM-FTs) and a hybrid multifeature fusion framework are proposed to efficiently improve the action recognition rate in realistic scenes. To obtain the trajectories closely related to the action subject and filter out the trajectories derived from camera motion and inherent movements in the background, multiscale hybrid masks, which are generated from the weak saliency map optimized by the synchronous updating mechanism of cellular automata and the strong saliency map achieved through the MKB method, are utilized to refine the original dense sampling points. The collaborative optimization strategy is used to ensure that the foreground detection results are more reasonable and effective, and the compensation schemes are employed to improve the fault tolerance of the proposed features. The experimental results show that the HM-FT feature effectively improves the recognition performance of the original DT. Furthermore, the discriminative power of the overall recognition framework can be enhanced significantly using the hybrid feature fusion strategy. However, during the experiments, we found that when the motion patterns and amplitudes of two types of actions are highly similar, neither trajectory features nor pose features can satisfactorily resolve the confusion between their testing samples. In the future, we will focus on identifying critical objects in the scene that provide auxiliary discrimination information for action classification and on incorporating deep learning methods into the proposed framework to improve the recognition accuracy in realistic scenes.

Fig. 1
Fig. 1 Flowchart of the foreground trajectory extraction based on multiscale hybrid masks.

Fig. 2
Fig. 2 Comparison of the original DT and the dense trajectories on MB. (a) Video frame, (b) MB image, (c) trajectories by DT, and (d) trajectories by MB.

Fig. 3
Fig. 3 Examples of the dark channel prior for the frames of action videos. (a)-(c) show the dark channel generated by foreground regions and (d) shows the dark channel generated by an image with a brighter foreground or darker background.

Fig. 4
Fig. 4 Visualizations of trajectories in different stages of the proposed method. (a) Video frame, (b) trajectories by DT, (c) MB image, (d) trajectories by MB, (e) foreground detection results, and (f) trajectories by HM-FT.

Fig. 5
Fig. 5 Some pose estimation results for the full body with 26 human joints on datasets.

Fig. 6
Fig. 6 Sample frames from Penn Action, sub-JHMDB, and HMDB51. The frames in the first four rows are from Penn Action and sub-JHMDB, and the last row shows frames with failed pose estimations from HMDB51.

Fig. 7
Fig. 7 The confusion matrices on the two datasets: (a) for Penn Action dataset; (b), (c), and (d) for three splits of the sub-JHMDB dataset.

Fig. 8
Fig. 8 Accuracy comparison of each class by DT, HM-FT, and HM-FT+iDT on sub-JHMDB.

Fig. 9
Fig. 9 Accuracy comparison of each class by DT, HM-FT, and HM-FT+iDT on Penn Action.

Fig. 10
Fig. 10 Performance of HM-FT as a function of p and d on (a) sub-JHMDB and (b) Penn Action.

Table 1
Overall recognition performance of different methods for the sub-JHMDB and Penn Action datasets.

Table 2
Comparison of HM-FT with other trajectory feature extraction methods.

Table 3
Comparison of our method with the state-of-the-art methods.