Energy flow: image correspondence approximation for motion analysis

Abstract. We propose a correspondence approximation approach between temporally adjacent frames for motion analysis. First, an energy map is established to represent image spatial features on multiple scales using Gaussian convolution. On this basis, the energy flow at each layer is estimated using Gauss-Seidel iteration according to the energy invariance constraint. More specifically, at the core of the energy invariance constraint is an "energy conservation law," which assumes that the spatial energy distribution of an image does not change significantly with time. Finally, the energy flow field at different layers is reconstructed by considering different smoothness degrees. Owing to its multiresolution origin and energy-based implementation, our algorithm is able to quickly address correspondence searching issues in spite of background noise or illumination variation. We apply our correspondence approximation method to motion analysis, and experimental results demonstrate its applicability.


Introduction
Motion analysis is a significant topic in computer vision because of its demand in areas such as human-computer interaction, video surveillance, and intelligent transportation systems. As motion is a time-varying quantity reflecting the variation of an object's status, motion analysis, in contrast to static image analysis, can draw on additional information about change obtained by comparing spatial features between frames. 1,2 Therefore, the central task of motion analysis is to represent different motions according to their dissimilarities in space-time. From this perspective, techniques for analyzing motion can be divided into two categories: spatial dissimilarity-oriented and temporal dissimilarity-oriented methods.
To be definite, we regard spatial dissimilarity-oriented methods as techniques that focus on exploring dissimilarities of image features and then combine or extend these features with time labels for motion representation. As a good example, Gilbert and Bowden 3 proposed a dense interest point detection algorithm for human action feature extraction, in which the features are further grouped temporally for classification. Recently, spatiotemporal shape templates 4-7 for motion representation have attracted much attention for their effectiveness; however, the templates rely strongly on spatial shape representation. Similarly, approaches based on bags of spatiotemporal interest points 8-11 have had great success in the field of motion analysis owing to their space-time invariance. Generally, although spatial dissimilarity-oriented methods are very suitable for motion representation where spatial characteristics are obvious, they often fail to extract adequate global relationships of motion.
In contrast, temporal dissimilarity-oriented methods tend to first extract image features, and then focus on exploiting the relationship and dissimilarities between motion frames. Frame difference is a very direct and useful scheme to express motion temporal dissimilarities. For example, in Ref. 12, motion energy image (MEI) is built up through image difference, based on which motion history image is formulated by fusing MEI for human movement recognition. Moreover, optical flow 13 is another popular temporal dissimilarity-oriented scheme by assuming brightness is constant between adjacent frames. Inspired by optical flow, Liu and Torralba 14 developed scale-invariant feature transform (SIFT) flow using SIFT points substituting raw pixels for dense correspondence analysis, which is further applied for motion field prediction and face recognition. Furthermore, Huang et al. 15 presented a correspondence map-based algorithm which can be employed for object recognition. Generally speaking, temporal dissimilarity-oriented methods cover both global and local features of motion, and many attempts have been made to address the motion analysis problem from the perspective of image correspondence approximation, as it is more accessible and applicable than frame difference techniques in most cases.
Motivated by the aforementioned observations, this paper addresses the motion analysis problem by developing an image correspondence approximation scheme called energy flow, which can be used for dissimilarity searching in space-time between temporally adjacent frames. In particular, our work first generates a multiscale energy map as an effective spatial representation of the image, which preserves image detail while extracting the main features. Using the energy map, the energy flow at each scale is computed by Gauss-Seidel iteration based on the energy invariance constraint as well as the global smoothness assumption. 16 Ultimately, we reconstruct an energy flow field across the different scales for accurate image correspondence approximation.
The proposed scheme is capable of finding dissimilarities between two images, which gives it broad prospects in the computer vision domain. Compared with optical flow techniques, 13 our algorithm is more reliable and has higher tolerance to illumination changes since multiscale energy rather than brightness is employed for pattern flow searching. When applied to motion analysis, our approach is also practical in contrast to SIFT flow 14 and other spatiotemporal representation methods, because it is computationally cheap and easy to apply.
The remainder of this paper is organized as follows. Sec. 2 gives an overview of related work. In Sec. 3, our energy flow concept is introduced. Section 4 shows the motion analysis results using energy flow. Finally, Sec. 5 concludes this paper.

Related Work
Energy flow is an image correspondence-based scheme, and motion analysis is a very broad topic allied closely with image segmentation, background modeling, tracking, object recognition, and others. We therefore review previous work from three aspects: image correspondence, motion detection, and human action recognition.

Image Correspondence Approximation
Initially, Horn and Schunck 16 proposed an optical flow estimation method to find dense correspondence fields between images. Optical flow is very efficient for small motions, so a great deal of research 13,17,18 following this pipeline has been done for correspondence approximation. However, optical flow relies on the brightness constancy assumption and therefore fails to deal with large lighting changes. It also cannot accurately describe the motion region if there is overlap or noise on the brightness layer.
Another popular image correspondence technique is SIFT, 19 which matches images using sparse points that are robust to geometric and photometric variations on multiple scales. SIFT flow, 14 mentioned earlier, is actually an extension of SIFT that fuses it into the optical flow formulation. Unfortunately, SIFT-based algorithms are either computationally expensive or too sparse to achieve precise correspondence approximation. To deal with these shortcomings, Tau and Hassner 20 propagate image scale information from detected interest points to the context of their neighboring pixels by considering the locations where scales are detected, and then use this context both within each image separately and across correlated images, which yields more useful features for dense correspondence while keeping the computational burden low. Similarly, Zhang et al. 21 proposed an energy flow equation that replaces brightness with image temperature features within the Horn-Schunck optical flow framework and employed it for video segmentation.
Moreover, researchers have presented many approaches for approximating image correspondence from other points of view, such as Refs. 15 and 22. Whether they work on pixels or on interest points, the trade-off between accuracy and efficiency remains challenging, especially for wide-range practical applications.

Motion Detection
Broadly speaking, existing work on motion detection can be divided into model-based and appearance-based detection. Model-based methods detect motions by comparing the target with a built model. If the scenario is static, it is ideal to directly use the background image 22 without interference as the model, but more often it is more practical to use a model estimated from a priori knowledge. For example, the Gaussian mixture model (GMM) 23 was proposed for dynamic model estimation according to the Gaussian mixture distribution of pixels and is widely applied for object tracking. In a very recent work, Haines and Xiang 24 further used a Dirichlet process GMM to provide a per-pixel density estimate for background computation. Model-based techniques are quick but rely strongly on the established model. Appearance-based approaches instead learn from a large number of sample features and then accomplish motion detection by classification. For example, the histogram of oriented gradients (HOG) 25 was formulated to represent the gradient features of an image, according to which pedestrians can be detected within a support vector machine (SVM) framework. 26 In Ref. 4, a detector named action bank is presented for human motion detection, and on this basis, motion can be accurately localized through SVMs. Tamrakar et al. 10 introduced a bag of SIFT features for complex event detection.

Human Action Recognition
As human action involves a very large volume of digital data, the heart of action recognition is to extract spatiotemporal features 3 to represent actions. Considering the characteristics of action, many action descriptors have been presented. For example, Derpanis et al. 6 developed a spatiotemporal orientation template generated via three-dimensional Gaussian filtering on raw image intensity features to reflect the dynamics of actions. In Ref. 7, action videos are segmented into spatiotemporal graphs expressing hierarchical, temporal, and spatial relationships of actions, and then a matching algorithm is formulated for action recognition. Additionally, many techniques that originated from image correspondence and motion detection are widely applied for action recognition. For example, Laptev et al. 8 built a spatiotemporal bag of words (BoW) model to represent action interest points consisting of HOG and optical flow features. Furthermore, the context of interest points can be used for action representation. For example, in Ref. 27, the action context feature is defined as the relative coordinates of pairwise interest points in space-time, and then GMMs are used to describe the context distributions of interest points.

Methodology
Our goal is to explore correspondence between images for motion analysis. In this work, a temporal dissimilarity-oriented scheme is presented in which the spatial features of images are also thoroughly extracted. Given two temporally adjacent frames, we start by building a multilayer Laplacian stack for each frame using Gaussian kernel convolution, and an energy map is then established for image feature extraction. We compute the energy flow between the two energy maps based on the energy invariance constraint, and the energy flow field is reconstructed to approximate the correspondence.

Energy Map
To exploit the local features of an image, the first step of our algorithm is to represent an image $I$ on multiple scales employing Laplacian stacks. Let $G(\sigma)$ denote a two-dimensional normalized Gaussian kernel with standard deviation $\sigma$, and let $\otimes$ denote the convolution operator. The image $I$ can then be decomposed into an $m$-scale ($m \ge 1$) descriptor $\{L_S(I) \mid 0 \le S \le m\}$, where

$$L_S(I) = \ldots \tag{1}$$

Although Laplacian stacks capture full detail thanks to their multiresolution origin, each subband is band limited. 28 Therefore, in order to describe an image more accurately and with less noise by considering the dissimilarity between different scales, a rectification process is implemented in our work. Based on the Laplacian stacks, and inspired by the power maps proposed in Refs. 22 and 28, we establish our energy map from the absolute value of the Laplacian coefficients, because our concern is the variation produced by the difference of Laplacian stacks rather than its orientation. For $I$ on the $S$th scale, we define the transfer energy as

$$T_S(I) = \ln |L_S(I)| \otimes G(\sigma_{S+1}). \tag{2}$$

Here, we transform the absolute value of the Laplacian coefficients into the logarithmic domain. Since the value of $|L_S(I)|$ is 0 at many pixels, where the logarithm diverges and would corrupt the following computation, we make the following revision:

$$T'_S(I) = \ldots \tag{3}$$

where

$$|L'_S(I)| = \ldots \tag{4}$$

Then we continue to define the energy map considering both the absolute value of $L_S(I)$ and the exponential of the weighted transfer energy:

$$E_S(I) = |L_S(I)| \, e^{\lambda T'_S(I)}, \tag{5}$$

where $\lambda$ is an adaptable parameter.
Since the revision process adds noise to $e^{\lambda T'_S(I)}$ by conserving the zeros of $|L_S(I)|$, we further modify it using $P_S(I)$:

$$P_S(I) = \ldots \tag{6}$$

where $\epsilon$ is an infinitely small quantity, and $\rho$ is a parameter determined by image quality.
Finally, the energy map is built up as follows:

$$E_S(I) = |L_S(I)| \, P_S(I). \tag{7}$$

Thus, we can conclude that our energy map is essentially a multilayer stack of Laplacian energies for action spatial feature extraction. Figure 1 shows an example of an energy map. Note that the four layers of the energy map are displayed at the same size, although each successive layer is actually one-fourth the size of the preceding one. In our work, $\sigma$ is set to 2, $m$ is chosen as 4, $\lambda$ is selected as −0.3, and $\rho$ ranges from −2 to −0.5; these values worked well in practice.
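As an illustration only, the decomposition and the energy of Eq. (5) can be sketched in plain Python. The sketch rests on several assumptions that are not taken from the equations above: the stack is built as a difference of Gaussians with $\sigma_S = \sigma^S$, the kernel is truncated at three standard deviations with edge replication, and the zero handling of Eqs. (3) and (4) is replaced by a simple floor `eps` on $|L_S(I)|$ before the logarithm. The rectification term $P_S(I)$ of Eq. (6) is not modeled, and all function names are hypothetical.

```python
import math

def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel truncated at three standard deviations."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur(image, sigma):
    """Separable Gaussian blur of a list-of-rows image with edge replication."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    height, width = len(image), len(image[0])
    offsets = range(-radius, radius + 1)

    def clamp(i, n):
        return min(max(i, 0), n - 1)

    # Horizontal pass, then vertical pass.
    rows = [[sum(kernel[k + radius] * row[clamp(x + k, width)] for k in offsets)
             for x in range(width)] for row in image]
    return [[sum(kernel[k + radius] * rows[clamp(y + k, height)][x] for k in offsets)
             for x in range(width)] for y in range(height)]

def laplacian_stack(image, m, sigma=2.0):
    """Assumed difference-of-Gaussians stack: m band-pass layers plus a residual."""
    layers, previous = [], image
    for s in range(m):
        smoothed = blur(image, sigma ** (s + 1))  # assumed sigma_{S+1} = sigma^(S+1)
        layers.append([[p - q for p, q in zip(row_p, row_q)]
                       for row_p, row_q in zip(previous, smoothed)])
        previous = smoothed
    layers.append(previous)  # low-pass residual, so the layers sum back to the image
    return layers

def energy_map(layer, sigma_next, lam=-0.3, eps=1e-6):
    """Eq. (5)-style energy |L| * exp(lam * T), with T the blurred log-magnitude.

    Flooring |L| at eps before the log is an assumption standing in for
    the revision of Eqs. (3) and (4).
    """
    magnitude = [[abs(v) for v in row] for row in layer]
    log_magnitude = [[math.log(max(v, eps)) for v in row] for row in magnitude]
    transfer = blur(log_magnitude, sigma_next)
    return [[a * math.exp(lam * t) for a, t in zip(row_a, row_t)]
            for row_a, row_t in zip(magnitude, transfer)]
```

For a frame `img`, `laplacian_stack(img, 4)` would return five layers under these assumptions, and `energy_map(stack[S], 2.0 ** (S + 1))` gives the corresponding energy layer.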

Energy Flow
To extract temporal features between frames, we regard motion as the apparent motion of the energy. As is well known, there are two smoothness assumptions 13 for optical flow computation: global smoothness, 16 which can produce a dense optical flow field but fails to describe boundaries, and local smoothness, 17 which is more robust but often results in a sparse motion description. Considering the advantage of the energy map in depicting boundaries, and motivated by the Horn-Schunck optical flow formulation, 16 we assume that the spatial energy at two consecutive times on the same scale is equal, and we adopt the global smoothness assumption. Accordingly, we define the "energy conservation law" as follows. Let $E_S(x, y, t)$ denote the energy of a pixel $(x, y)$ of an image $I$ at time $t$ on the $S$th scale. If the pixel has moved to the point $(x + \delta x, y + \delta y)$ after a small time interval $\delta t$, we require

$$E_S(x, y, t) = E_S(x + \delta x, y + \delta y, t + \delta t). \tag{8}$$

Based on this assumption, we expand the right-hand side of the above equation using a Taylor series:

$$E_S(x, y, t) + \delta x \frac{\partial E_S}{\partial x} + \delta y \frac{\partial E_S}{\partial y} + \delta t \frac{\partial E_S}{\partial t} + o(2) = E_S(x, y, t), \tag{9}$$

where $o(2)$ denotes the second- and higher-order infinitesimal terms. Dividing both sides of Eq. (9) by $\delta t$ and letting $\delta t \to 0$, we get

$$\frac{\partial E_S}{\partial x} \frac{dx}{dt} + \frac{\partial E_S}{\partial y} \frac{dy}{dt} + \frac{\partial E_S}{\partial t} = 0. \tag{10}$$

Here, we define the velocity of a pixel as $\nu_S = (\nu_{Sx}, \nu_{Sy})$ with $\nu_{Sx} = dx/dt$ and $\nu_{Sy} = dy/dt$, so we get the energy flow constraint equation:

$$\frac{\partial E_S}{\partial x} \nu_{Sx} + \frac{\partial E_S}{\partial y} \nu_{Sy} + \frac{\partial E_S}{\partial t} = 0. \tag{11}$$

We then describe energy flow using the energy flow field descriptor $\nu_S = (\nu_{Sx}, \nu_{Sy})$, which can be computed by minimizing the following objective function:

$$\ldots \tag{12}$$

where $\alpha_1$ ($\alpha_1 \ne 0$) and $\alpha_2$ are, respectively, the weights for the data and smoothness terms, indicating the energy invariance and global smoothness assumptions. 13 Likewise, the ratio $\alpha_2 / \alpha_1$ is determined by the image quality. 29 Utilizing the Gauss-Seidel iteration, Eq. (12) can be solved as follows:

$$\nu_{Sx}^{k+1} = \ldots \tag{13}$$

$$\nu_{Sy}^{k+1} = \ldots \tag{14}$$

where $k$ ($k \ge 0$) denotes the iteration number. In our work, $k$ is set to 100 to guarantee both efficiency and accuracy.
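Since Eqs. (12) to (14) follow the Horn-Schunck formulation, 16 the iteration can be illustrated with a short sketch. The code below is not the exact update of Eqs. (13) and (14); it assumes the classical Horn-Schunck update applied to two energy maps, with `alpha` standing in for the ratio $\alpha_2 / \alpha_1$, four-neighbor averaging, central differences, and edge replication. Updating the field in place means that each pixel uses the newest neighboring values, which is what makes the sweep a Gauss-Seidel iteration. The function name and the demo data are hypothetical.

```python
import math

def energy_flow(e_prev, e_next, alpha=1.0, iterations=100):
    """Horn-Schunck-style relaxation on two energy maps given as lists of rows."""
    height, width = len(e_prev), len(e_prev[0])

    def at(grid, y, x):
        # Edge replication outside the image.
        return grid[min(max(y, 0), height - 1)][min(max(x, 0), width - 1)]

    # Spatial derivatives averaged over both frames, and the temporal difference.
    ex = [[0.25 * (at(e_prev, y, x + 1) - at(e_prev, y, x - 1)
                   + at(e_next, y, x + 1) - at(e_next, y, x - 1))
           for x in range(width)] for y in range(height)]
    ey = [[0.25 * (at(e_prev, y + 1, x) - at(e_prev, y - 1, x)
                   + at(e_next, y + 1, x) - at(e_next, y - 1, x))
           for x in range(width)] for y in range(height)]
    et = [[e_next[y][x] - e_prev[y][x] for x in range(width)] for y in range(height)]

    vx = [[0.0] * width for _ in range(height)]
    vy = [[0.0] * width for _ in range(height)]
    for _ in range(iterations):
        for y in range(height):
            for x in range(width):
                # Neighbor means read the already-updated values (Gauss-Seidel).
                mean_x = 0.25 * (at(vx, y - 1, x) + at(vx, y + 1, x)
                                 + at(vx, y, x - 1) + at(vx, y, x + 1))
                mean_y = 0.25 * (at(vy, y - 1, x) + at(vy, y + 1, x)
                                 + at(vy, y, x - 1) + at(vy, y, x + 1))
                residual = ((ex[y][x] * mean_x + ey[y][x] * mean_y + et[y][x])
                            / (alpha + ex[y][x] ** 2 + ey[y][x] ** 2))
                vx[y][x] = mean_x - ex[y][x] * residual
                vy[y][x] = mean_y - ey[y][x] * residual
    return vx, vy

# Demo: a sinusoidal pattern shifted one pixel to the right between frames.
frame_a = [[math.sin(0.4 * x) for x in range(16)] for _ in range(12)]
frame_b = [[math.sin(0.4 * (x - 1)) for x in range(16)] for _ in range(12)]
flow_x, flow_y = energy_flow(frame_a, frame_b)
```

In the demo the pattern does not vary vertically, so the vertical component stays at zero while the horizontal component is positive, matching the rightward shift.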

Energy Flow Field Reconstruction
After iterating via Eqs. (13) and (14), from a macroscopic point of view, we obtain for two frames a final energy flow field sequence on multiple scales, abbreviated as $\{V_S = (\nu_{Sx}^{k+1}, \nu_{Sy}^{k+1}) \mid 0 \le S \le m\}$. For high-pass scales, the energy map averages the response over a larger region of the image. 28 Therefore, to represent the details produced by tiny variations during the time interval $\delta t$ while also avoiding noise, we reconstruct the energy flow field on the velocity layer rather than on the energy map layer, expressing the image correspondence relationship using $V_0$, which can be computed by iteration as follows:

$$\ldots \tag{15}$$

Experiments
As our algorithm is an image correspondence-based scheme for dissimilarity searching between adjacent frames, to better reveal its performance, we test our algorithm for motion analysis from two facets: motion detection and human action recognition. Also, we believe that our method can be used in more areas.

Motion Detection
We verify our algorithm for motion field prediction using frames from the ChangeDetection.NET 2014 change detection database 30 without additional processing. ChangeDetection.NET 2014 is a very complex benchmark for event and motion detection consisting of 31 videos depicting indoor and outdoor scenes with boats, cars, trucks, and pedestrians. To visualize energy flow velocities, we display oriented arrows of the energy flow field from the previous frame to the current status. One velocity vector per 2 × 2 or 5 × 5 pixel block is displayed, and the arrows are magnified by a factor of 5 or 10, depending on image quality. We also utilize color maps to show energy flow field regions according to the value of $\arctan(\nu_{0x}^{101} / \nu_{0y}^{101})$ at each pixel. Note that the previous frames are often not given but can be inferred from our visualizations, which reflect motion variations. Figure 2 gives example results of continuous human motion detection in a relatively static scenario: the grabbing motion is slow, a large part of the human body is not moving, and a small part moves slightly. From the detection results, we can see that our algorithm is able to depict moving parts effectively with little noise, and the boundaries are precisely detected. Also, the overlap within motions is successfully addressed. Figure 3 gives example results of motion detection in the lake and highway scenarios. The lake scenario is very challenging as it includes the motion of a man driving a boat, the motion of a black car far away from the lens, and the flow of the lake water. Nevertheless, our method deals with this case well, and the main motion variations are detected. For the highway scenario, the motion is very quick, leading to big variations. The results show that the motions are localized very accurately, but a part of the car's body is disregarded. Figure 4 gives example results of motion detection in a shadow scenario and at night. The results of pedestrian detection with shadow are promising, given that our aim is motion detection rather than pedestrian detection. Our approach is also very robust for motion detection at night with illumination changes.

Fig. 2 Example results of human motion detection. Images in the top row are continuous frames with oriented arrows describing energy flow velocities from its previous frame to the current status, and the bottom row shows the color maps. The previous frame of the first image is not given.
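The visualization just described (one arrow per pixel block, a magnification factor, and a per-pixel angle for the color map) can be sketched as follows. This is a minimal illustration with hypothetical function names, not the code used to produce the figures. It uses `math.atan2(vx, vy)`, which agrees with $\arctan(\nu_x / \nu_y)$ when $\nu_y > 0$ and remains defined when $\nu_y = 0$; that substitution is an assumption of the sketch.

```python
import math

def flow_angles(vx, vy):
    """Per-pixel flow angle in radians, usable as an index into a color map."""
    return [[math.atan2(a, b) for a, b in zip(row_x, row_y)]
            for row_x, row_y in zip(vx, vy)]

def sample_arrows(vx, vy, step=5, scale=10.0):
    """Keep one vector per step x step block; return (x, y, dx, dy) tuples.

    step corresponds to the 2 or 5 pixel spacing and scale to the
    magnification factor of 5 or 10 mentioned in the text.
    """
    arrows = []
    for y in range(0, len(vx), step):
        for x in range(0, len(vx[0]), step):
            arrows.append((x, y, scale * vx[y][x], scale * vy[y][x]))
    return arrows
```

The returned tuples can be handed to any plotting routine that draws arrows from `(x, y)` with displacement `(dx, dy)`.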
As a comparison, Fig. [...]

Moreover, to further validate our approach, we compare its overall results with those of four other methods for motion detection on ChangeDetection.NET 2014, as shown in Table 1. We select three popular metrics for evaluation: recall [$Re = N_{tp} / (N_{tp} + N_{fn})$], false positive rate [$Fpr = N_{fp} / (N_{fp} + N_{tn})$], and precision [$Pr = N_{tp} / (N_{tp} + N_{fp})$], which are determined by the number of true positives ($N_{tp}$), true negatives ($N_{tn}$), false positives ($N_{fp}$), and false negatives ($N_{fn}$). From the comparison, we can see that our method outperforms popular optical flow methods, 16,17 and can handle real-time action detection well in contrast to GMM 23 and background modeling 24 based algorithms.

To evaluate our energy flow errors, we compute the average angular error (AAE) of energy flow using ground-truth sequences ("TxtRMovement," "TxtLMovement," "blow1Txtr1," "drop1Txtr1," "roll1Txtr1," and "roll9Txtr2") from the University College London (UCL) database 18 by averaging all the angular errors (AE) calculated by the following equation:

$$\ldots \tag{16}$$

where $(\nu_x, \nu_y)$ denotes the velocity of the ground truth at $(x, y)$. As a comparison, the AAE of Refs. 16 and 17 is also shown in Fig. 6.
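The three detection metrics follow directly from the four pixel counts. The helper below is a small sketch with a hypothetical function name; returning 0.0 when a denominator is zero is a convention chosen for this example and is not specified in the text.

```python
def detection_metrics(n_tp, n_tn, n_fp, n_fn):
    """Return (recall, false positive rate, precision) from pixel counts."""
    def ratio(numerator, denominator):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    recall = ratio(n_tp, n_tp + n_fn)
    false_positive_rate = ratio(n_fp, n_fp + n_tn)
    precision = ratio(n_tp, n_tp + n_fp)
    return recall, false_positive_rate, precision
```

For instance, with 80 true positives, 900 true negatives, 20 false positives, and 20 false negatives, recall and precision are both 0.8 and the false positive rate is 20/920.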

Human Action Recognition
For action recognition, we select sequences from the Kungl Tekniska Högskolan (KTH) (2391 video clips including 6 actions performed by 25 persons) 26 and human motion database (HMDB) (6849 video clips divided into 51 action categories) 31 action databases. Using the energy flow field between two frames as features, we cluster 100k energy flow field descriptors using the k-means algorithm with k set to 4000, encode them via a BoW as depicted in Ref. 10, and finally classify actions within an SVM framework with a radial basis function kernel, which has proven robust in practice. For each action, as in Refs. 26 and 31, we select 16 persons' video clips for training and the rest for testing on KTH, while we choose 70 video clips for training and 30 video clips for testing on HMDB. Figure 7 gives the confusion matrix using our method on the KTH database, and the average recognition rate (ARR) reaches 93.65%. Table 1 compares our algorithm with other related works. 6,11,23,26 Meanwhile, with the same settings except that SIFT 31 and optical flow features 17 replace our energy flow features, we obtain the ARRs shown in Table 2. Table 3 shows the recognition results of our approach (ARR is 27.92%) and others 26,32,33 on the HMDB database. We also replace energy flow with optical flow and SIFT features for comparison, and the corresponding recognition rates are given as well. From the experimental results, we can see that our method is very effective.
Finally, we record in Fig. 8 the ARRs obtained on the HMDB database under different settings of the parameters m, σ, and λ. We can observe that both the standard deviation σ and the threshold λ perform well only within a limited range, which verifies that noise contaminates the contributing data if the parameters are too small, while useful information is omitted if they are too large. Also, the number of Laplacian stack layers m should be chosen as large as the image resolution permits. Note that we change only one parameter's value at a time while keeping the others at their defaults in our experiments.
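The BoW encoding step of this pipeline can be illustrated with a short sketch. It assumes that a codebook has already been learned (by k-means with k = 4000 in the text; a tiny hypothetical codebook is used in the usage note below), that descriptors are assigned to their nearest codeword by squared Euclidean distance, and that the histogram is L1-normalized. These encoding details are assumptions of the sketch and are not stated in the text.

```python
def bow_histogram(descriptors, codebook):
    """Encode descriptors as a normalized histogram of nearest codewords."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        # Index of the codeword with the smallest squared Euclidean distance.
        nearest = min(
            range(len(codebook)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(descriptor, codebook[j])),
        )
        counts[nearest] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(codebook)
```

With the hypothetical codebook `[(0, 0), (1, 0), (0, 1)]` and descriptors `[(0.1, 0), (0.9, 0.1), (1.1, -0.1), (0, 0.8)]`, the second codeword receives two of the four descriptors. The resulting histograms are the vectors that would be passed to the SVM classifier.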

Conclusion
In this paper, we present an image correspondence framework for motion analysis that estimates the energy flow field between two adjacent frames. An energy map is introduced for image feature extraction, based on which an energy invariance constraint is proposed for energy flow calculation. The reconstructed energy flow field, which takes into account the smoothness degrees of multiple scales, is applied to both motion field prediction and human action recognition. A number of experiments are carried out, and promising results are given.
The energy flow scheme is well suited for real-time motion analysis regardless of background noise or illumination change. However, we also find a limitation: in some cases, a part of the energy flow field within the object's boundaries may be lost because of poor image quality. Additional postprocessing should therefore be considered if the whole motion silhouette is needed. In our future work, we are interested in applying our approach to more computer vision fields.