Human tracking is an essential task for human gesture recognition, surveillance applications, augmented reality, and human–computer interfaces. Therefore, the tracking of humans in videos has received considerable attention in the computer vision field, and many successful human tracking approaches have been proposed in recent years.
Human tracking research can be divided into two categories according to the type of sensor used. Electro-optical (EO) sensors such as charge-coupled devices (CCDs) are the most widely used cameras for human detection and tracking. Human tracking based on the input images captured by RGB-EO sensors already produces reliable performance using color information when the illumination is constant and the target image quality is good.1 However, much of the human tracking research based on EO sensors is not applicable to certain tasks in dark indoor and outdoor environments because of the changeable illumination, existence of shadows, and cluttered backgrounds. In contrast to EO sensors, thermal sensors allow the robust tracking of a human body in outdoor environments during day or night, regardless of poor illumination conditions and the body posture.
In general, thermal sensors can detect relative differences in the amounts of thermal energy emitted or reflected from different parts of a human body in a scene.2 That is, the temperature of the background is largely different from that of the human being. Moreover, the price of a thermal camera has fallen significantly with the development of infrared technology, and thermal cameras have been used in many industrial, civil, and military fields.3,4 However, there are still many problems to solve for reliable human tracking with thermal sensors.
• Nonhuman target objects such as buildings, cars, animals, and light poles having intensities similar to those of humans.4
• Persons overlapping while crossing paths.5
• Low signal-to-noise ratios and white-black or hot-cold polarity changes.6
• Halos appearing around very hot or cold objects.6
• Differences in temperature intensity between humans and backgrounds depending on weather and season.
Other important disadvantages of many successful tracking methods are the assumptions that the background is static, the target appearance is fixed, the image quality is good, and the illumination is constant.7 However, in practice, the appearances of humans and the lighting conditions are changing constantly. Further, the background is not static, especially in the case of a camera that is installed in a mobile platform.
In our work, we use a long-wave infrared (LWIR) thermal camera instead of EO sensors to track humans, particularly at night and in outdoor environments, under the assumption that the camera is freely oriented for a mobile platform application.
Object tracking has been studied widely in the field of video surveillance. Object tracking approaches are of two types: multiple target tracking and single target tracking. In this paper, we focus on single target tracking. The purpose of our research is to automatically track the object’s bounding box in every frame, starting from a bounding box that the user draws around the object of interest in the first frame.
In the current research, object tracking can be classified as follows.
First, deterministic methods8 typically track an object by performing an iterative search for the local maxima of a similarity cost function of the template image and the current image. Jurie and Dhome9 employed the color distribution, with a metric derived from the Bhattacharyya coefficient as the similarity measure, and used the mean-shift procedure to perform the optimization. The mean-shift algorithm10 is a popular algorithm for deterministic methods.
Second, the statistical methods solve tracking problems by taking the uncertainties of the measurements and model into account during object state estimation.11 The statistical correspondence methods use the state space approach to model object properties such as position, velocity, and acceleration. Kalman filters12 are used to estimate the state of a linear system when the state is assumed to have a Gaussian distribution. One limitation of Kalman filters is this assumption that the state variables are normally distributed: they give poor estimates of state variables that do not follow a Gaussian distribution. This limitation can be overcome by using particle filters,13 also known as condensation algorithms or sequential Monte Carlo methods, which are efficient statistical methods to estimate target states. Most recent studies7,8,14–16 have attempted to apply particle filters to tracking systems so that dependable object tracking results can be achieved. Yang et al.8 proposed hierarchical particle filters for tracking fast multiple objects by using integral images for efficiently computing the color features and edge orientation histograms. The observation likelihood based on multiple features is computed in a coarse-to-fine manner. Deguchi et al.14 employed the mean-shift algorithm to track the target and incorporated particle filters into the mean-shift result in order to cope with a temporal occlusion of the target and reduce the computational cost of the particle filters. Khan et al.15 also employed particle filters and mean shift jointly to reduce computational cost and detect occluded objects by estimating the dynamic appearances of objects with online learning of a reference object. Sidibe et al.16 presented an object tracking method based on the integration of visual saliency information into the particle filter framework to improve the performance of particle filters against occlusion and large illumination variations.
For online learning, Klein et al.7 proposed a visual object tracking method using a strong classifier that comprises an ensemble of Haar-like center-surround features. This classifier is learned from a single positive training example with AdaBoost and quickly updated for new object and background appearances with every frame. Saffari et al.17 and Shi et al.18 proposed the online random forest (RF) for the object tracking by continuous self-training of an appearance model while avoiding wrong updates that may cause drifting.
The first challenge in object tracking is to build an observation model. The color histogram is a well-known feature for object tracking because it is robust against noise and partial occlusion. However, it becomes ineffective in the presence of illumination changes or when the background and the target have similar colors.16 A combination of color and edge features is also used for mutual complements.8,13
The second challenge in object tracking is to design an estimation of the likelihood (distance) between the target object and candidate regions. Several types of distances, such as histogram intersection or Euclidean distance, are used to compute the similarity between feature distributions.8 The most popular method to estimate likelihood is using the Bhattacharyya coefficient as a similarity measure.14,16
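As an illustration of the similarity measure mentioned above, the following minimal sketch computes the Bhattacharyya coefficient between two normalized histograms, together with the derived distance that is commonly used in mean-shift trackers. The function names are hypothetical and are not taken from any of the cited works.

```python
import math


def bhattacharyya_coefficient(p, q):
    """Similarity in [0, 1] between two normalized histograms of equal length."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))


def bhattacharyya_distance(p, q):
    """Distance derived from the coefficient: sqrt(1 - coefficient)."""
    # max() guards against tiny negative values caused by rounding.
    return math.sqrt(max(0.0, 1.0 - bhattacharyya_coefficient(p, q)))
```

A coefficient of 1 indicates identical distributions, and a coefficient of 0 indicates histograms with no overlapping bins.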
The third challenge in object tracking is to recognize and track objects in images taken by a moving camera, such as one mounted on a robot or a vehicle, which is much more challenging than real-time tracking with a stationary camera. In moving camera applications, the background is not static and the appearance, pose, and scale of a human vary significantly. To track humans in a moving environment, Jung and Sukhatme19 proposed a probabilistic approach for moving object detection when using a single camera on a mobile robot in outdoor environments. Klein et al.7 proposed an object tracking method based on particle filters by adapting new observation models for object and background appearances that change over time with a moving camera. Leibe et al.20 integrated information over long time periods to revise earlier decisions and recover from mistakes by considering new evidence from different camera environments (such as static or moving cameras) and large-scale background changes. Kalal et al.21 proposed a tracking framework (TLD) that explicitly decomposes the long-term tracking task into tracking, learning, and detection. The detector localizes all appearances that have been observed so far and corrects the tracker if necessary. The learning component estimates the detector’s errors and updates the detector to avoid these errors in the future.
However, since much of the human tracking research based on CCD cameras has many limitations, especially for dark indoor and outdoor environments owing to poor illumination, a few algorithms for tracking humans in thermal images have been tried.
Li and Gong4 constructed the regions-of-interest histogram in an intensity-distance projection space model with a particle filter to overcome the disadvantage of insufficient intensity features in thermal infrared images. Padole and Alexandre5 used two types of spatial and temporal data association to reduce false decisions for motion tracking with thermal images alone. Xu et al.22 proposed a method for pedestrian detection and tracking with a single night-vision video camera installed on a vehicle. The tracking phase for the heads and bodies of pedestrians is a combination of Kalman filter prediction and mean shift.
Fernandez-Caballero et al.23 proposed an approach to real-time human detection and tracking through the processing of thermal images from a camera mounted on an autonomous mobile platform. This method simply used static analysis for the detection of humans through image normalization and optical flow for enhancing the human segmentation in moving and still images.
However, there exist nonhuman target objects, such as buildings, cars, animals, and light poles, which have intensities similar to that of humans in thermal images.4 Therefore, it is very difficult to maintain correct tracking when humans overlap while crossing paths. To solve these problems, some recent tracking systems use additional information from color CCD cameras.24 Leykin and Hammoud1 proposed a system to track pedestrians by using the combined input from RGB and thermal cameras. First, a background model is constructed with color and thermal images. Then, a pedestrian tracker is designed using particle filters. Han and Bhanu25 proposed an automatic hierarchical scheme to find the correspondence between the preliminary human silhouettes extracted from synchronous color and thermal image sequences for image registration without tracking. Cielniak et al.24 proposed a method for tracking multiple persons with a combination of color and thermal vision sensors on a mobile robot. To detect occlusion, they proposed a machine learning classifier for a pairwise comparison of persons using both the thermal and color features provided by the tracker.
However, these human tracking methods based on thermal sensors or thermal and color sensors have the following typical disadvantages:
• Even though a combination of thermal and color images aids human tracking in daylight, color images are useless in darkness.
• Combinations of thermal and color sensors impose additional costs for camera equipment and computation time.
To improve the human tracking performance for moving cameras while minimizing the computation time in darkness, this study proposes a novel human tracking approach for thermal videos that is based on online RF learning and a combination of a local intensity distribution (LID) with oriented center-symmetric local binary patterns (OCS-LBP). As shown in Fig. 1, we design a real-time RF, which is an ensemble of decision trees for confidence estimation, and the confidences of the RF are converted into a likelihood function of the target state. In the initial stage, the target model is selected by the user and particles are sampled. In the second and third stages, subblock-based RFs are generated using the long-term positive and negative examples with LID and OCS-LBP features by online learning. The learned RF classifiers are used to detect the most likely target position in the subsequent frame in the fourth and fifth stages. Then, the RFs are learned again by means of fast retraining with the tracked object and background appearance in the new frame.
This human tracking method based on RF combined with an LID and OCS-LBP allows human tracking to be performed in near real-time with a mobile thermal camera. Moreover, the tracking accuracy increases compared with that of a conventional human tracking method for thermal images.
The remainder of this paper is organized as follows. Section 2 describes the target representation method using LID and OCS-LBP features. Section 3 introduces the basic human tracking method using particle filters. Section 4 introduces the proposed human tracking method that incorporates online RF learning to avoid tracking drift caused by pose variations, illumination changes, and occlusion. Section 5 presents an experimental evaluation of the accuracy and applicability of the proposed human tracking method. Section 6 summarizes our conclusions and discusses the scope for future work.
Target Representation Using LID and OCS-LBP
To track a human, a feature space should be chosen for the target. Choosing an optimal feature for the target model is a particularly critical step because a thermal image has different characteristics than a color image. Therefore, we combine two appearance features: LID and OCS-LBP.
Local Intensity Distribution
A color histogram based on distance is a frequently used feature for object tracking.4,10,14 However, the major characteristic of a human body in a thermal image is high intensity without color information, and individual humans exhibit distinct temperatures. Therefore, intensity is a better feature than color to distinguish humans from a background and other objects. In this research, we divide the bounding boxes of a target model and a target candidate into adjacent subblocks to create a robust feature model for object occlusion as shown in Fig. 2. Partitioned subblocks are beneficial if the size of a box is relatively large. This was justified by the experiment of Khan et al.,15 which revealed that an object often contains multiple local modes and that partitioned subblocks track objects more correctly than a single box when occlusion occurs.
In accordance with the research of Deguchi et al.14 and Comaniciu et al.,10 the pixel locations in each subblock of the target model are normalized with respect to the subblock center. The normalized LID of each subblock is represented by a histogram with a fixed number of components (bins). Because pixels in the peripheral region of a subblock are more likely than central pixels to be affected by occlusions and interference from the background, a plain histogram is unreliable. Therefore, we use the Epanechnikov kernel $k(r)$, which is an isotropic kernel that assigns greater weights to pixels at the central points of the subblocks as follows:13
$$k(r) = \begin{cases} 1 - r^2, & r < 1 \\ 0, & \text{otherwise,} \end{cases}$$
where $r$ is the normalized distance of a pixel from the subblock center.
Finally, the probability of each histogram component in the target model of a subblock is estimated as the kernel-weighted, normalized count of the pixels falling into that bin, as in Ref. 10.
The LID of each subblock of the target candidate, centered at the candidate position in the current frame, is a histogram over the same bins. It is estimated in the same manner as for the target model, using the same Epanechnikov kernel but a different bandwidth that depends on the size of the candidate box.
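The kernel-weighted histogram described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the 256 gray levels, and the coordinate normalization (the subblock corners are placed at unit distance from the center) are assumptions. The default of eight bins follows the eight LID input variables mentioned in Sec. 4.

```python
def epanechnikov_profile(r2):
    """Kernel profile evaluated on the squared normalized distance r2."""
    return 1.0 - r2 if r2 < 1.0 else 0.0


def local_intensity_distribution(block, n_bins=8, levels=256):
    """Kernel-weighted, normalized intensity histogram of one subblock.

    block: list of rows of integer intensities in [0, levels).
    """
    height, width = len(block), len(block[0])
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    # Half-sizes used to normalize coordinates (assumed normalization);
    # the lower bound avoids division by zero for one-pixel-wide blocks.
    hy, hx = max(cy, 0.5), max(cx, 0.5)
    hist = [0.0] * n_bins
    for y, row in enumerate(block):
        for x, value in enumerate(row):
            r2 = 0.5 * (((y - cy) / hy) ** 2 + ((x - cx) / hx) ** 2)
            # Central pixels contribute more than peripheral ones.
            hist[value * n_bins // levels] += epanechnikov_profile(r2)
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist
```

With this weighting, a hot pixel at the center of a subblock contributes more to its bin than a pixel of the same intensity at the border, which is where background interference is most likely.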
Oriented Center-Symmetric LBP
In human detection, texture features such as the histogram of oriented gradient (HOG)26 and LBP (Ref. 27) are popular features to discriminate humans from backgrounds. Recently, the LBP texture operator has been successfully used in various computer vision applications, such as face recognition,28 human detection,29 and human tracking,30 because it is robust against illumination changes, very fast to compute, and does not require many parameters.31 LBP describes the gray-scale local texture of the image with low computational complexity by using a simple method. The original LBP descriptor forms different patterns based on the number of pixels by thresholding a specific range of neighboring sets with the central gray-scale intensity value. Even though LBP are widely used as a texture operator, they produce rather long histograms. Ma et al.32 combined HOG and LBP to compute oriented LBP feature. First, they define the arch of a pixel as all continuous “1” bits of its neighbors. Then, the orientation and magnitude of a pixel is defined as its arch principle direction and the number of “1” bits in its arch, respectively.
CS-LBP (Ref. 33) uses a modified scheme for comparing the neighboring pixels of the original LBP to simplify the computation while keeping characteristics such as tolerance against illumination changes and robustness against monotonic gray-level changes. CS-LBP differs from LBP in that differences between pairs of opposite pixels in a neighborhood are calculated, rather than comparing each pixel with the center. This halves the number of comparisons for the same number of neighbors and, for eight neighbors, produces only 16 ($2^4$) different binary patterns. However, since the original CS-LBP loses the orientation and magnitude information, we introduce a new lower-dimensional feature, oriented CS-LBP (OCS-LBP), using an approach different from that of the oriented LBP.32
In order to extract an oriented histogram of OCS-LBP from a subblock, gradient orientations are estimated at every pixel and a histogram of each ’th orientation in a neighborhood is binned using Eqs. (5) and (6). Each pixel influences the gradient magnitude for an orientation according to the closest bin in the range from 0 to 360 deg at 45 deg intervals. In Eq. (6), robustness is maintained in flat image regions by thresholding the intensity-level differences using a small value in Eq. (5), as follows:
In Fig. 2, a gradient orientation is confirmed when the difference between a pair of opposite pixels in a neighborhood exceeds the threshold. For example, the absolute difference between a pair of opposite pixels with values 130 and 80 exceeds the threshold, and the first value is greater than the second, so the absolute difference (magnitude) is assigned to the zero bin. The gradient orientation histogram for each orientation of a subblock is obtained by summing all the gradient magnitudes whose orientations belong to the corresponding bin. After that, the final set of OCS-LBP features of a single subblock is normalized by min-max normalization.
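The binning rule described above can be sketched as follows. This is an illustrative reading of the text, not the authors' code: the neighbor ordering (starting at the east neighbor and proceeding in 45 deg steps), the threshold value, and all names are assumptions.

```python
# Offsets (dy, dx) of the eight neighbors n0..n7, 45 degrees apart.
# The starting point and rotation direction are assumed for illustration.
NEIGHBORS = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
             (0, -1), (1, -1), (1, 0), (1, 1)]


def ocs_lbp_histogram(block, threshold=6):
    """Eight-bin oriented histogram of center-symmetric differences."""
    height, width = len(block), len(block[0])
    hist = [0.0] * 8
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            for i in range(4):  # four pairs of opposite neighbors
                dy_a, dx_a = NEIGHBORS[i]
                dy_b, dx_b = NEIGHBORS[i + 4]
                diff = block[y + dy_a][x + dx_a] - block[y + dy_b][x + dx_b]
                # Flat regions are ignored by thresholding the difference.
                if abs(diff) > threshold:
                    # The sign of the difference selects one of two
                    # opposite orientation bins; the magnitude is added.
                    hist[i if diff > 0 else i + 4] += abs(diff)
    return hist


def min_max_normalize(values):
    """Scale values linearly to [0, 1]; a constant input maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]
```

For the example in the text, a pixel whose east neighbor is 130 and whose west neighbor is 80 adds a magnitude of 50 to bin 0 under this assumed ordering.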
Using the same method as for the LID, the bounding boxes of a target model and a candidate are divided into adjacent subblocks and OCS-LBP histograms are extracted from each subblock. The number of subblocks is decided according to the experimental results of Ref. 7. In Ref. 7, the target object was divided into nonoverlapping subblocks to make the target model robust to occlusion. However, we change the number of subblocks to six, based on the human body ratio. All local OCS-LBP histograms are then used for online learning of the RF classifier.
An object tracking algorithm based on particle filters13 has drawn much interest over the last decade. This is a sequential Monte Carlo method, which recursively approximates the posterior distribution using a finite set of weighted samples. In addition, it weights particles based on a likelihood score and then propagates these particles according to a motion model.
Originally, particle filters consisted of the following three steps.34
Given all available observations up to time $t-1$, the prediction stage uses the probabilistic system transition model to predict the posterior at time $t$.
At time $t$, the observation is available, so the state can be updated using Bayes’ rule.
The candidate samples are drawn from an importance distribution, and the weights of the samples are computed accordingly.
In the case of bootstrap filters, particle weights are iteratively estimated from the observation likelihood.
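The steps above can be sketched for a one-dimensional state as follows. This is a generic bootstrap filter written only for illustration: the random-walk transition model, the Gaussian observation likelihood, and all parameter values are assumptions and do not correspond to the motion model or the RF-based likelihood used in this paper.

```python
import math
import random


def bootstrap_step(particles, observation, rng,
                   motion_sigma=0.1, obs_sigma=0.5):
    """One predict-update-resample cycle for a 1-D state."""
    # Prediction: propagate each particle through the transition model
    # (a random walk, chosen only for illustration).
    predicted = [x + rng.gauss(0.0, motion_sigma) for x in particles]
    # Update: weight each particle by the observation likelihood.
    weights = [math.exp(-0.5 * ((observation - x) / obs_sigma) ** 2)
               for x in predicted]
    total = sum(weights)
    if total == 0.0:
        weights = [1.0 / len(predicted)] * len(predicted)
    else:
        weights = [w / total for w in weights]
    # State estimate: weighted mean of the predicted particles.
    estimate = sum(w * x for w, x in zip(weights, predicted))
    # Resampling: draw a new, equally weighted set in proportion
    # to the normalized weights.
    resampled = rng.choices(predicted, weights=weights, k=len(predicted))
    return resampled, weights, estimate
```

In the bootstrap variant the transition model itself serves as the importance distribution, which is why the weights reduce to the normalized observation likelihoods.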
Human Tracking Based on Online RF Learning
To estimate the observation likelihood for the weighting of the particles, the Bhattacharyya distance is generally used to measure the object appearance similarity.10 In this paper, we estimate the observation likelihood for each particle by using an RF classifier instead of conventional distance measures. Even though Saffari et al.17 and Shi et al.18 demonstrated the robustness of object tracking by using online RF learning, the two methods cannot avoid the template drift problem when images are taken by a moving camera because they use only the positive and negative samples from the current frame. In addition, because the two methods train only one RF on the full body region regardless of the extent of occlusion, they cannot track an object correctly when it is severely occluded.
Therefore, we design a new classifier with online RF learning with long-term samples as well as subblock-based RFs to avoid tracking drift caused by pose variation, illumination changes, and long-term occlusion.
Initialization of Target Model
Particle filters are sequential Monte Carlo methods that recursively approximate the posterior distribution using a finite set of particles over time . In this paper, we define the set of particles as , where the ’th particle at time consists of its weight and state vector
In the initial stage, the position and bounding box are manually selected and the state vector of the initial target is then set automatically according to the user selection. is the classifier determined by online learning at time .
In the prediction stage of the particle filters, particles are propagated through the second-order autoregressive motion model35 to predict the particle positions. The center position of the ’th particle is interpolated from the previous position (), the average velocities at times and , and white Gaussian noise .
In the case of the second frame, only the velocity is linearly combined with the previous position and white Gaussian noise.
The box size of the ’th particle is linearly interpolated from the previous box size of the target object () and white Gaussian noise .7
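The propagation described above can be sketched as follows. Because the exact coefficients of the autoregressive model are not reproduced here, the sketch assumes that the two past velocities are averaged with equal weight; the noise levels, the minimum box size, and the function names are likewise hypothetical.

```python
import random


def propagate_center(prev_center, vel_t1, vel_t2, rng, sigma=2.0):
    """Predict a particle center from the previous center, the velocities
    at the two preceding time steps, and white Gaussian noise."""
    # Equal weighting of the two velocities is an assumption.
    avg_vx = 0.5 * (vel_t1[0] + vel_t2[0])
    avg_vy = 0.5 * (vel_t1[1] + vel_t2[1])
    return (prev_center[0] + avg_vx + rng.gauss(0.0, sigma),
            prev_center[1] + avg_vy + rng.gauss(0.0, sigma))


def propagate_size(prev_size, rng, sigma=1.0, min_size=1.0):
    """Perturb the previous box size (width, height) with Gaussian noise."""
    return (max(min_size, prev_size[0] + rng.gauss(0.0, sigma)),
            max(min_size, prev_size[1] + rng.gauss(0.0, sigma)))
```

For the second frame, where only one velocity is available, the same function can be called with that velocity passed for both arguments.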
Subblock-Based Random Forest Learning
An RF proposed by Breiman36 is a decision tree ensemble classifier, with each tree grown using some type of randomization. This RF has a capacity for processing vast amounts of data, with high learning speeds, based on a decision tree.
For the learning of the initial RF, training data are constructed using one positive example that is selected by the user and two negative examples that are randomly sampled from the background of the first frame. In the second frame, the training data are increased to two positive examples and four negative examples. Negative samples are randomly selected from outside the tracked object regardless of background clutter. The training data continue to grow in this 1:2 ratio of positive to negative examples until the 15th frame. The memory capacity is 15 for positive examples and 30 for negative examples. Every new target is added to the positive memory and the RF is learned using the limited number of positive examples until the 15th frame. In contrast, negative examples are updated from the background at each frame. After the 15th frame, we always keep five positive examples from the 1st through 15th frames in order to avoid the template drift problem, modifying the idea of Klein et al.7 The remainder of the positive memory is occupied by new examples, and the oldest example is discarded, as in a queue, because a more similar history of positive examples produces more confident classifiers.
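The memory policy described above can be sketched as follows. The text does not state which five of the first 15 positive examples are retained, so this sketch assumes, purely for illustration, that the first five are kept as permanent anchors; the class and method names are hypothetical.

```python
from collections import deque


class SampleMemory:
    """Bounded memory of positive and negative training examples."""

    def __init__(self, pos_capacity=15, neg_capacity=30, n_anchor=5):
        self.n_anchor = n_anchor
        # Early positives kept permanently to limit template drift
        # (assumed here to be the first n_anchor examples).
        self.anchors = []
        # Remaining positive slots behave like a queue: when full,
        # appending a new example discards the oldest one.
        self.recent = deque(maxlen=pos_capacity - n_anchor)
        self.negatives = deque(maxlen=neg_capacity)

    def add_positive(self, example):
        if len(self.anchors) < self.n_anchor:
            self.anchors.append(example)
        else:
            self.recent.append(example)

    def add_negatives(self, examples):
        self.negatives.extend(examples)

    def positives(self):
        return self.anchors + list(self.recent)
```

With the default capacities, the memory holds every positive example up to the 15th frame, after which each new example displaces the oldest non-anchor example.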
In this research, each particle is divided into six subblocks as mentioned in Sec. 2.2, and two types of RF classifiers for the ’th subblock are learned using the LID and OCS-LBP extracted from the corresponding blocks in the 45 training examples.
We learn a set of RFs in which each subblock has its own classifiers. For each subblock, we construct two RFs: one uses only the LID feature and the other uses only the OCS-LBP feature, rather than combining these into one feature vector, following the experiments of Ko et al.,37 because the basic characteristics of the LID and OCS-LBP are different. Therefore, the total number of RFs at each time step is 12 (2 RFs × 6 subblocks).
The learning of the RFs in ’th subblock at time is summarized below.
1. Set the number of decision trees for two RFs.
2. Choose the number of variables for and . These variables are used to split each node from eight LID input variables and eight OCS-LBP input variables. By using different ’th variables, the split function iteratively splits training data into left and right subsets.
3. Each tree for an individual RF is grown according to the following steps:
When the class label set is denoted by , a leaf node has a posterior probability, and the class distributions of trees, , are estimated empirically as a histogram of leaf nodes on a class label, .
The depth of the trees is set at 20 according to the results of Ko et al.,31 and the number of trees is four each for and . The experimental results for deciding the appropriate number of trees are described in Sec. 5.3.
Likelihood Estimation Using RFs
After a set of RFs is learned on positive and negative training examples of frame , the observation likelihoods for each particle of frame are estimated using RF classifiers. The reference feature histogram, the LID, and the OCS-LBP of the ’th subblock of a test particle are applied to the corresponding . The likelihood of the ’th subblock is estimated by combining the probabilities of and . The test image is used as input to the learned RF, and the probability distribution (likelihood) of the ’th subblock in the positive class is generated by ensemble (arithmetic) averaging of each distribution of all trees using Eqs. (15) and (16).
Hence, the final likelihood of a particle is estimated from Eq. (17).
This process is continued iteratively until the likelihoods of all particles are computed.
Once the final likelihood () of the ’th particle is estimated, the weight () of the ’th particle at time is replaced by using the likelihood obtained from the RF and each weight is normalized.
The state of the current target is updated as the top particles having greater weight.
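The likelihood and weight computation described above can be sketched as follows. Only the arithmetic averaging over the trees of one RF is taken from the text. How the two RF outputs of a subblock are combined, how the subblock likelihoods are merged, and how many top particles form the state are not reproduced here, so the sketch assumes plain averaging throughout and a hypothetical `top_k` parameter.

```python
def forest_positive_probability(tree_posteriors):
    """Arithmetic average of the per-tree positive-class posteriors."""
    return sum(tree_posteriors) / len(tree_posteriors)


def particle_likelihood(subblocks):
    """subblocks: list of (lid_tree_posteriors, ocs_lbp_tree_posteriors).

    Averaging the two RFs and then the subblocks is an assumption.
    """
    per_block = [0.5 * (forest_positive_probability(lid)
                        + forest_positive_probability(ocs))
                 for lid, ocs in subblocks]
    return sum(per_block) / len(per_block)


def normalize(weights):
    """Scale the weights so that they sum to one."""
    total = sum(weights)
    if total == 0:
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]


def estimate_state(states, weights, top_k=10):
    """Weighted mean of the top_k particles with the largest weights."""
    ranked = sorted(zip(weights, states), key=lambda p: p[0], reverse=True)
    ranked = ranked[:top_k]
    total = sum(w for w, _ in ranked)
    dim = len(ranked[0][1])
    if total == 0:
        return tuple(sum(s[d] for _, s in ranked) / len(ranked)
                     for d in range(dim))
    return tuple(sum(w * s[d] for w, s in ranked) / total
                 for d in range(dim))
```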
Online Relearning of RF Classifiers
When a tracking human target is detected in a current frame, the RFs should be relearned using the updated history including positive and negative examples. The purpose of online RF learning is to avoid tracking drift caused by pose variation, illumination changes, and occlusion.
The basis of the proposed online RF learning is to compute the difference in target state between the current and previous targets and only relearn the RF for the current frame if this difference and the probability of the target satisfy the conditions. In this paper, the learning condition is adaptively changed by using Eq. (21) according to the variance in thermal intensity of a target region.
Online RF learning consists of the procedures described below.
: Center of a tracked target specified by the current state vector at time .
OC: Counter for duration check of full occlusion ()
1. Compute the difference between centers of previous and current target regions.
2. If // normal tracking.
2.1 Compute the probability of current target region by using RF.
Where threshold is the half width of the current target.
2.2 Compute learning condition using intensity variance of target region and Eq. (21).
2.4 If // full occlusion
2.5 If // partial occlusion
3. If // abnormal tracking
4. If // tracking terminal condition
In procedure 2.2, we use the intensity variance of the target region to determine the learning condition. This condition is based on the fact that the probability of the RF on the current target is low when the intensity variance of the target is high. The test result for a minimum RF learning threshold of 0.72 is described in Sec. 5.1. In procedure 2.4.3, we check the duration of full occlusion: the occlusion counter (OC) is incremented for every consecutive frame in which full occlusion occurs. Then, if OC exceeds the terminal condition, tracking is terminated in procedure 4. The terminal condition T2 is an application-dependent threshold, which we set to 30 frames.
To evaluate the performance of the proposed algorithm, we used four types of LWIR thermal videos containing moving objects with background clutter, sudden shape deformation, unexpected motion change, and long-term partial or full occlusion between objects at night.
• Type I: Four thermal videos captured by a static camera in a dynamic background (OTCBVS benchmark dataset38).
• Type II: Four thermal videos captured by a static camera in a dynamic background.
• Type III: Two thermal videos captured by a moving camera.
• Type IV: Two thermal videos captured by moving and static cameras.
The frame rates of the video data varied from 15 to 30 Hz, while the size of the input images was . All test videos were captured in outdoor environments. Table 1 lists the detailed descriptions of the 12 test videos.
Table 1. Properties of 12 test videos (S, static camera; M, moving camera).

| Video type | Video sequence | Total frames | Description | Season |
|---|---|---|---|---|
| Type I (OTCBVS) | Video 1 | 300 | Two persons walking in the woods (S) | Unknown, outdoors |
| Type I (OTCBVS) | Video 2 | 274 | Multiple persons walking in the street (S) | Unknown, outdoors |
| Type I (OTCBVS) | Video 3 | 209 | Multiple persons walking in the street (S) | Unknown, outdoors |
| Type I (OTCBVS) | Video 4 | 733 | One person walking in the yard (S) | Unknown, outdoors |
| Type II (our data) | Video 5 | 550 | Two persons walking in the street (S) | Winter night, outdoors |
| Type II (our data) | Video 6 | 400 | Two persons walking in the yard (S) | Summer night, outdoors |
| Type II (our data) | Video 7 | 197 | Two persons walking in the street (S) | Winter night, outdoors |
| Type II (our data) | Video 8 | 500 | One person walking in the yard (S) | Summer night, outdoors |
| Type III (our data) | Video 9 | 338 | Multiple persons walking in the yard (M) | Summer night, outdoors |
| Type III (our data) | Video 10 | 880 | Multiple persons walking in the yard (M) | Summer night, outdoors |
| Type IV (YouTube data) | Video 11 | 371 | Multiple persons walking in the same direction (M) | Unknown, outdoors |
| Type IV (YouTube data) | Video 12 | 196 | One person walking in the cluttered background (S) | Unknown, outdoors |
Note: Video (MPEG, 16.2 MB) [URL: http://dx.doi.org/10.1117/1.OE.52.11.113105.1].
To evaluate the performance of the proposed method, we use the spatial overlap metric defined in Ref. 39. Ground-truth (GT) tracks and system (ST) tracks can overlap in both space and time. After the ground truth and the estimated bounding box of the target in a given frame of a sequence are determined, the spatial overlap is defined as the amount of overlap between the GT and ST tracks in that frame.
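A common way to quantify the overlap between two bounding boxes in a single frame is the ratio of the intersection area to the union area. The following minimal sketch assumes that form of the metric and an (x, y, width, height) box representation; both are assumptions made for illustration, and the function name is hypothetical.

```python
def spatial_overlap(gt_box, st_box):
    """Intersection-over-union of two boxes given as (x, y, width, height)."""
    gx, gy, gw, gh = gt_box
    sx, sy, sw, sh = st_box
    # Width and height of the intersection rectangle (zero if disjoint).
    inter_w = max(0.0, min(gx + gw, sx + sw) - max(gx, sx))
    inter_h = max(0.0, min(gy + gh, sy + sh) - max(gy, sy))
    inter = inter_w * inter_h
    union = gw * gh + sw * sh - inter
    return inter / union if union > 0 else 0.0
```

Averaging this value over all frames of a sequence gives a single score per video, where 1 means that the tracked box coincides with the ground truth in every frame.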
The initialization of the rectangle including the tracking object is manually selected by the user. The proposed human tracking system has been implemented in Visual C++ and tested using a PC with an Intel Core 2 Quad processor.
Tests on Minimum Threshold and Condition for RF Learning
In our study, the most appropriate minimum threshold in Eq. (21) for updating the training data and RF learning was found experimentally to be 0.72. To determine the proper threshold for RF learning, four test videos were selected from the dataset shown in Table 1, namely, Videos 1 and 2 (OTCBVS data) and Videos 5 and 6 (our data). We selected Videos 1 and 2 from the OTCBVS data because in these videos two persons walk in a cluttered background and become fully occluded by each other. We selected Videos 5 and 6 from our data because two persons walk in different directions and become fully occluded by a tree and by each other. In the first experiment, the minimum threshold for RF learning was estimated by varying the value of a static threshold. As shown in Fig. 3, a minimum threshold of 0.72 exhibited the best performance, with an average value of 80.3%. Therefore, 0.72 was adopted in Eq. (21) as the minimum threshold for RF learning.
For RF learning, we imposed the learning condition [Eq. (21)] using the minimum threshold. The purpose of this condition is to design an adaptive RF classifier that depends on the variation of intensity. To verify the performance of the learning condition, we compared the results obtained with the static threshold determined in Fig. 3 against those obtained with the learning condition [Eq. (21)] for the same four test videos. As shown in Fig. 4, the adaptive learning condition exhibited better performance for all four videos, with an average of 84.5% versus 80.3%.
Determination of Optimal Number of Particles
The main disadvantage of particle filters is the computational cost of using a large number of particles, even though particle filters are known to be robust in visual tracking through occlusions and cluttered backgrounds.15 Therefore, it is essential to find a proper number of particles by considering the computational cost. Figure 5 shows the results of experiments using five possible values for the number of particles. As shown in Fig. 5, although 40 particles gave the shortest processing time, they also gave the worst tracking performance. In contrast, 120 particles gave the best tracking performance and a relatively good processing time, so 120 was adopted as the number of particles.
Determination of Optimal Number of Trees
The RF is known to be very fast in learning and testing compared with other classifiers, e.g., multiclass support vector machines.31 The important parameters of the RF are the depth of the trees and the number of trees. Although increasing the depth and the number of trees improves the performance, the runtime cost also grows with both. In our study, we set the maximum depth of the trees to 20 according to the experiments of Ref. 31.
To determine the proper number of trees for a local RF, we used the same four test videos and compared the tracking performance while changing the number of trees. As shown in Fig. 6, when the number of trees for a local RF was four, the tracking performance was the best and the processing time was relatively good. Therefore, we adopted four trees for a local RF. In this study, we constructed two RFs per subblock: one uses only the LID feature and the other uses only the OCS-LBP feature, so the total number of trees for a target is 48.
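The tree count can be checked with a short calculation. The sketch below takes the three figures given in the text (four trees per local RF, two RFs per subblock, 48 trees per target) and derives the number of subblocks they imply, since that number is not restated in this section. The function and constant names are hypothetical.

```python
TREES_PER_LOCAL_RF = 4       # adopted in the text (Fig. 6)
RFS_PER_SUBBLOCK = 2         # one RF for LID, one for OCS-LBP
TOTAL_TREES_PER_TARGET = 48  # total stated in the text


def implied_subblocks(total_trees: int,
                      trees_per_rf: int,
                      rfs_per_subblock: int) -> int:
    """Number of subblocks implied by the stated tree counts."""
    trees_per_subblock = trees_per_rf * rfs_per_subblock
    if total_trees % trees_per_subblock != 0:
        raise ValueError("total is not a multiple of trees per subblock")
    return total_trees // trees_per_subblock


n_subblocks = implied_subblocks(TOTAL_TREES_PER_TARGET,
                                TREES_PER_LOCAL_RF,
                                RFS_PER_SUBBLOCK)
```

With these values each subblock carries eight trees, so 48 trees correspond to six subblocks per target.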
Performance Comparison for Online RF Learning Versus Static RF
Online RF learning is the main technique used to track occluded humans and to avoid tracking drift caused by pose variation, illumination changes, and occlusion in a cluttered background captured by a moving camera. To evaluate the effectiveness of the proposed online RF learning, we compared the tracking performance with online learning to that without online learning (static RF). A static RF is learned only once, when the user selects the human rectangle, and human tracking is then performed by the static RF classifier without relearning.
Figure 7 shows a performance comparison of the two approaches for the same four test videos. As shown in Fig. 7, online RF learning produced better tracking performance, with an average tracking success rate of 84.5% compared with 75.1% for the static RF. The main reason for the higher tracking success rate of the proposed online RF learning is that the long-term full or partial occlusion between persons and the tree is reflected in the training history and the RF learning.
Comparisons Between Different Algorithms
To evaluate the performance of the proposed algorithm, the proposed method was compared with OCS-LBP with RF and LID with RF. In addition, we evaluated three related works: (1) LID with particle filters using thermal images,4 (2) simple online RF learning using Haar-like features,17 and (3) the TLD tracker,21 which is known as a robust object tracking algorithm for a moving camera. The experiments were performed using the same dataset as described in Table 1. As shown in Fig. 8, the overall performance of our proposed approach exceeded that of the other two combinations, the particle filters,4 simple online RF,17 and the TLD tracker,21 with percentages of 81.9, 69.6, 57.2, 69.9, 70.9, and 62.2%, respectively. From the results, we can infer that an individual intensity feature is not a distinguishing feature for human tracking in thermal images, particularly in cases of human occlusion. In contrast, the OCS-LBP feature produced reasonable tracking results even in thermal images. Although simple online RF and the particle filter produced the second- and third-best tracking performance overall, they still showed a few missing or false detections when occlusions occurred. The TLD tracker showed the worst tracking results of the three related methods, indicating that the learning and detection algorithm of TLD is not appropriate for human tracking in thermal images. The test results showed that, for robust and practical tracking in thermal video, the combination of the two features is superior to a human representation model based on an individual feature.
For a more detailed evaluation of tracking performance, Fig. 9 shows comparisons between the proposed method and the two methods [particle filters4 with LID and simple online RF (Ref. 17)] in terms of the performance versus the frame number for Videos 1, 5, and 9. The ground truth of the target object is marked manually.
In Videos 5 and 9, the tracking methods based on particle filters4 and simple online RF (Ref. 17) lost the target object when occlusions occurred or the camera was moving. Simple online RF17 showed the worst performance in Video 9 because it could not distinguish the real target from the background when occlusions occurred. However, for all three videos, the proposed scheme had a significantly smaller error and more robust results than the other methods, regardless of full or partial occlusion and camera movement.
Figure 10 shows the computational speeds of the six methods. As shown in Fig. 10, the proposed method (10.2 fps) requires more computation time than the other methods (14.8, 18.8, 15, 20.2, and 22 fps) because it uses subblocks of particles, online RF learning, and two types of RF. When the online learning was not applied (static RF), the tracking speed was approximately the same as that of the () and faster than that of the method using conventional particle filters. Simple online RF (Ref. 17) showed the highest computational speed, 22 fps, because it uses a simple RF learning scheme with Haar-like features. Because the main cause of the computational delay is the online RF learning for individual subblocks, optimization of the real-time learning may be considered for the next version.
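To read these frame rates as per-frame processing times, the conversion below may help. It is a plain unit conversion and not a measurement; the function name is hypothetical.

```python
def frame_time_ms(fps: float) -> float:
    """Average processing time per frame, in milliseconds, for a frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


# Frame rates quoted in the text for the slowest and fastest methods.
proposed_ms = frame_time_ms(10.2)    # proposed method
simple_rf_ms = frame_time_ms(22.0)   # simple online RF (Ref. 17)
```

At 10.2 fps the proposed method spends roughly 98 ms on each frame, compared with roughly 45 ms for the simple online RF at 22 fps.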
Figure 11 shows the tracking results obtained for Videos 4, 5, 9, 11, and 12 by using our proposed method. From the results in Figs. 11(a) to 11(e), we deduce that our proposed method accurately and robustly tracks moving objects, despite background clutter with similar intensity distributions [(b), (c), and (e)], object intersections [(a) to (d)], long-term full (or partial) occlusion [(a) to (d)], and camera movement [(c) and (d)].
The complete video sequences can be viewed at the following webpage: http://cvpr.kmu.ac.kr.
In this paper, we have demonstrated that the proposed online RF learning method with particle filters improves human tracking performance for thermal videos, especially in cases of poor illumination, object occlusion, background clutter, and moving cameras.
To track a human region, an RF is relearned using the updated history, including positive and negative examples, whenever a new target is detected. Once a set of RFs is learned, the observation likelihood for each particle of a frame is estimated using the RF classifiers. The proposed online RF learning computes the difference in target state between the current and previous targets, and the RF for the current frame is updated only if the difference and the probability of the target satisfy the learning conditions. In this study, the learning condition was adaptively changed according to the variance of the thermal intensity of a target region.
This paper also proposed a new lower-dimensional OCS-LBP feature and showed that a combination of the OCS-LBP feature with the LID produces more robust and practical tracking results than a human representation model based on an individual feature, especially for thermal videos.
In the future, we plan to improve our human tracking algorithm to track multiple persons in dynamic environments by designing a faster learning algorithm that uses a smaller number of particles and a robust feature model appropriate to thermal images.
This research was partially supported by the Basic Science Research Program through the National Research Foundation of Korea funded by the Ministry of Education, Science and Technology, and it was also financially supported by the Ministry of Education, Science and Technology and the National Research Foundation of Korea through the Human Resource Training Project for Regional Innovation.
Byoung Chul Ko received his BS degree from Kyonggi University, Korea, in 1998 and his MS and PhD degrees in computer science from Yonsei University, Korea, in 2000 and 2004, respectively. He was a senior researcher at Samsung Electronics from 2004 through 2005. He is currently an assistant professor in the Department of Computer Engineering, Keimyung University, Daegu, Korea. His research interests include content-based image retrieval, fire detection, and robot vision.
Joon-Young Kwak received his BS and MS degrees from Keimyung University, Korea, in 2011 and 2013, respectively. He is currently a PhD student at Keimyung University, Korea. His research interests include fire detection and human tracking.
Jae-Yeal Nam received his BS and MS degrees from Kyongbuk National University, Korea, in 1983 and 1985, respectively. He received his PhD degree in electronic engineering from the University of Texas at Arlington in 1991. He was a researcher at ETRI from 1985 through 1995. He is currently a professor in the Department of Computer Engineering, Keimyung University, Daegu, Korea. His research interests include video compression and content-based image retrieval.