Multispectral detection and tracking of multiple moving targets in cluttered urban environments

Abstract. This paper presents an algorithm for target detection and tracking by fusion of multispectral imagery. In all spectral bands, we build a background model of the pixel intensities using a Gaussian mixture model, and pixels not belonging to the model are classified as foreground pixels. Foreground pixels from the spectral bands are weighted and summed into a single foreground map and filtered to give the fused foreground map. Foreground pixels are grouped into target candidates and associated with targets from a tracking database by matching features from the scale-invariant feature transform. The performance of our algorithm was evaluated with a synthetically generated data set of visible, near-infrared, midwave infrared, and long-wave infrared video sequences. With a fused combination of the spectral bands, the proposed algorithm lowers the false alarm rate while maintaining high detection rates. All 12 vehicles were tracked throughout the sequence, with one instance of a lost track that was later recovered.


Introduction
Automatic detection and tracking of moving targets in full motion video from aerial imaging systems such as unmanned aerial vehicles (UAV) and satellites are of significant interest in the defense and security communities. [1][2][3] These aerial platforms can remain undetected from prospective targets and encompass a large surveillance area. Satellites have been a primary "spy" tool for decades and continue to provide for their respective nations, but their coverage is limited by orbital mechanics, and is hence not always sufficiently timely, nor can a satellite generally be launched on demand to address a short-term tactical matter in the field. Vast amounts of research have been invested in UAV surveillance, and UAVs have been a significant resource for intelligence gathering. 1,4,5 Large areas, such as open waters or borders, can be surveyed for intrusions, regions can be assessed for building of weapon facilities, or urban areas can be checked for potential threats.
The urban environment is of interest for this work. Urban environments provide significant challenges to the problem of automatically detecting and tracking moving vehicles. These areas generally contain complicated clutter and a collection of different targets, e.g., humans, buildings, roads, and vehicles. Each of these entities also varies greatly in shape and size, which challenges automatic target detection and recognition algorithms. Trees, buildings, tunnels, and other formations result in object occlusions that affect the appearance of the targets and sometimes completely block the targets from view for a few to several consecutive frames.
Images in the visible spectrum (0.4 to 0.7 μm) provide reflected spectral information that creates contrast between targets, and between targets and the background. Imaging in the visible spectrum requires good illumination during the daytime hours. Imaging in the long-wave infrared (LWIR) band (8 to 14 μm) is dependent on the temperature and thermal emissivity of the target, but is not dependent on solar illumination, and thus provides nighttime imaging capabilities. The effects of atmospheric aerosols also play a role in these imaging modalities. For example, Mie scattering significantly hinders the performance of visible imaging in the presence of fog aerosols. 6 The wavelengths of the midwave infrared (MWIR) and LWIR bands are longer than the visible wavelengths, making them less susceptible to the attenuation due to Mie scattering, and thus providing some immunity to the effects of fog and other aerosols on image quality. 6
The approach of multispectral detection and tracking fuses information obtained from images in different spectral bands to improve detection statistics. Various approaches have been taken in algorithm development for detecting and tracking using multispectral imagery, where the fusion takes place at one of three stages of the processing: pixel level, 7-11 feature level, [12][13][14] and decision level. 15 Fusion at the pixel level creates a single image that is a composition of the pixels in the multispectral images. It is often used to create a single image that is interpreted by an operator. 16,17 The combination of pixels into a single image is difficult as there is not always a correlation of the pixel values from the different spectral images, and it has been found that a mild anticorrelation exists between the visible and LWIR bands. 18 Feature-level fusion combines the by-products of processing the individual spectral bands. These processing products include numerous classes of features, such as foreground maps, histograms, edge contours, and texture features. Processing of individual spectral bands allows feature extraction algorithms to be optimized for each band. In decision-level fusion, processing is performed independently on each spectral band until a decision is made, such as object size and location. These decisions are fused based on band-specific confidence levels to give an overall decision.
To exploit benefits of each spectral band, feature-level and decision-level fusion allow algorithm development tailored for their respective bands. Algorithms fusing background models from different image modalities to create a common foreground for target detection have been demonstrated. [12][13][14][15] Chen and Wolf 13 model the foreground in both visible and LWIR imagery with the mixture-of-Gaussians model, while using an adaptive learning rate that is based on the decision of each spectral band. They also fuse the two spectral bands for their appearance model to increase the performance of target association. Torresan et al. 15 perform the background subtraction on each individual spectral modality and merge the results by picking a master and slave foreground map based upon the confidence of each modality. By modeling each background pixel's intensity as a single Gaussian distribution, Davis and Sharma 12 extract regions-of-interest by the intersection of the visible and LWIR foreground maps. Salient contours from the regions-of-interest are then calculated from both visible and LWIR images and fused to create a single contour saliency map.
The aforementioned works consider visible and LWIR bands; our algorithm additionally exploits near-infrared (NIR) and MWIR bands, and the combinations of spectral bands. Table 1 shows the spectral bands used and their associated wavelengths. The main contribution of this work is an algorithm to fuse multispectral data sets to reliably detect and track moving targets with a high detection probability and a low false alarm rate. We focus on detection and tracking of vehicles through an urban scene that includes partial occlusions and crowded traffic intersections. A block diagram of our proposed algorithm is shown in Fig. 1. To compensate for fluctuating pixel intensities in each spectral band, background models using a Gaussian mixture model (GMM) adapt to the evolving scenes and detect foreground pixels. Foreground pixels from different spectral bands are fused into a foreground region and filtered to obtain a single foreground map that represents pixel regions belonging to target candidates. Features based on the scale-invariant feature transform (SIFT) 19 are extracted from these target regions and used for two purposes: detecting targets missed by the segmentation detection, and associating targets from a tracking database constructed from prior frames. Lastly, locations for each target are estimated and the GMM is updated.
To develop and evaluate the algorithm, we created a UAV imaging scenario that was synthetically generated from the digital imaging and remote sensing image generation (DIRSIG) toolset. 20 DIRSIG is a mature and widely used simulation package for 0.4-μm to 20-μm wavelengths. An urban scene with 12 vehicles was simulated at visible, NIR, MWIR, and LWIR wavelengths. A normal traffic scenario was simulated using the open-source tool simulation of urban mobility (SUMO) to provide realistic traffic maneuvers. Figure 2 shows a 2000 × 2000 pixel frame from each spectral band. By visual inspection, the appearance of target vehicles varies between the scenes, providing different intensity information.
The remainder of the paper is organized as follows: Sec. 2 presents the method for foreground extraction using the GMM and the region growing process to group disjoint pixels. We also discuss our method for fusing the spectral modalities in Sec. 2. Section 3 presents the association of target candidates with track sequences. Experimental results on the performance of the algorithm are presented in Sec. 4. In Sec. 5, conclusions are presented.

Detection and Segmentation Algorithm
In this section, we describe the detection and segmentation algorithms; we then present the fusion process used to combine foreground maps to build pixel regions that represent target candidates. Pixel intensities fluctuate due to changes in illumination and movement from both background and target objects. This does not allow a single value to characterize the time history of the intensity of a single pixel for a given video sequence. To compensate for these changes, background modeling techniques are used to describe the probability distribution of the pixels' intensity by empirically deriving the parameters from the video sequence. The GMM has been successfully demonstrated to compensate for the fluctuations in pixel intensities. [21][22][23] In a scene where the sensor is fixed, keeping the viewpoint stationary, we use statistical information extracted from the time history of the intensity fluctuations to understand the probability distribution of intensity at each pixel, and use these distributions to make hypotheses about the label of each pixel. Each pixel in the scene is classified as a foreground or background pixel, and we update the parameters of the GMM during each frame. We now describe this process in detail.
We define $X(x, y, t)$ as the pixel intensity at location $(x, y)$ and time $t$. The goal is to classify this pixel as a background or foreground pixel by fitting it to a distribution model. The distribution of the time history of the intensity, $P[X(x, y, t)]$, is modeled as a sum of weighted Gaussian distributions:

$$P[X(x, y, t)] = \sum_{j=1}^{K} w_{j,t}(x, y)\, N[X(x, y, t), \mu_{j,t}(x, y), \Sigma_{j,t}(x, y)], \tag{1}$$

where $K$ is the number of Gaussian distributions; $\mu_{j,t}(x, y)$ is the mean of the $j$'th distribution; and the covariance matrix, which is assumed to be diagonal, is given by $\Sigma_{j,t}(x, y) = \sigma^2_{j,t}(x, y) I$, where $I$ is the identity matrix.
The weighting factor $w_{j,t}(x, y)$ represents the portion of the entire model that the $j$'th Gaussian comprises, and is dependent on the number of occurrences for the particular distribution. This weighting has range $0 < w_{j,t} \le 1$, and is normalized such that $\sum_{j=1}^{K} w_{j,t} = 1$. The Gaussian probability density function is

$$N[X(x, y, t), \mu_{j,t}(x, y), \Sigma_{j,t}(x, y)] = \frac{1}{(2\pi)^{n/2} |\Sigma_{j,t}(x, y)|^{1/2}} \exp\left\{ -\frac{1}{2} [X(x, y, t) - \mu_{j,t}(x, y)]^T \, \Sigma_{j,t}(x, y)^{-1} \, [X(x, y, t) - \mu_{j,t}(x, y)] \right\}, \tag{2}$$

where $n$ is the dimension of $X$. From the $K$ distributions, it must be determined how many distributions are classified as belonging to the background. We select the top $B$ weighted distributions as the background, where

$$B = \arg\min_{b} \left( \sum_{j=1}^{b} w_{j,t}(x, y) > \mathrm{Thr} \right). \tag{3}$$

The threshold Thr is user defined with range (0,1) and is dependent on the scene. In a complex scene with multiple moving targets where pixel distributions vary among targets, and among targets and background, more Gaussian modes will be present and thus require a higher Thr. In the scenes tested with this algorithm, few objects were moving and typically only one Gaussian mode was needed to describe the background. By executing the GMM algorithm with a series of parameters on the test data set, the optimal Thr was empirically derived for each spectral band by comparing correctly detected pixels to falsely detected pixels. The resulting values of Thr are shown in Table 2. In the algorithm, LWIR had the lowest Thr at 0.5, which is attributed to the distributions of the target intensities being similar, along with being lower than the background surrounding the targets.
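The selection of background components can be illustrated with a short sketch. It assumes the cumulative-weight criterion that is common in GMM background subtraction: components are sorted by weight, and the smallest leading set whose weights sum past Thr is kept. The function name and the example weights are hypothetical and are not taken from the paper.

```python
def select_background_components(weights, thr):
    """Return indices of the top-weighted components, in decreasing weight
    order, up to the first point where their cumulative weight exceeds thr.

    Assumption: cumulative-weight rule; the weights are taken to sum to 1.
    """
    order = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
    background = []
    cumulative = 0.0
    for j in order:
        background.append(j)
        cumulative += weights[j]
        if cumulative > thr:
            break
    return background


# Hypothetical mixture weights for one pixel.
print(select_background_components([0.2, 0.7, 0.1], thr=0.5))
print(select_background_components([0.2, 0.7, 0.1], thr=0.8))
```

With a low threshold a single dominant component describes the background, whereas a higher threshold admits additional components, which matches the qualitative behavior described for complex scenes.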
To evaluate whether the current pixel intensity $X(x, y, t)$ is a background or foreground pixel, we calculate the a priori probability of that pixel intensity belonging to each of the $K$ distribution components. If the intensity value falls within 2.5 standard deviations of any background distribution, it is labeled background; otherwise, it is labeled as foreground. Following the classification of the pixel, the distribution parameters are updated as 23

$$w_{j,t+1}(x, y) = w_{j,t}(x, y) + \alpha [1 - w_{j,t}(x, y)], \tag{4}$$

$$\mu_{j,t+1}(x, y) = \mu_{j,t}(x, y) + [\alpha / w_{j,t}(x, y)] [X(x, y, t) - \mu_{j,t}(x, y)], \tag{5}$$

$$\sigma^2_{j,t+1}(x, y) = \sigma^2_{j,t}(x, y) + [\alpha / w_{j,t}(x, y)] \{ [X(x, y, t) - \mu_{j,t}(x, y)]^T [X(x, y, t) - \mu_{j,t}(x, y)] - \sigma^2_{j,t}(x, y) \}, \tag{6}$$

where $\alpha$ is the learning rate. In a scene where objects typically move slowly, the update equations should also update at a slower rate and require a smaller $\alpha$. After the experimentation, we found the optimal $\alpha$ for each data set as shown in Table 2. All spectral bands used a low $\alpha$, with the lowest value in the LWIR band, which can be attributed to no shadows being present.
The GMM algorithm produces intermediate foreground maps in all spectral bands that do not represent the complete target region and do not necessarily correlate with one another. This is a consequence of discrepancies in the foreground modeling, and is caused by low SNR between the target and background. Examples of intermediate foreground maps at frame 600 are shown in Fig. 3. In the NIR band, the bottom target has a high number of foreground pixels in comparison with the other bands. In the MWIR band, the target on the left has a low number of foreground pixels whereas the other bands have a high number of pixels.
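The per-pixel classification and update can be sketched as follows for a scalar (single-band) intensity. This is a minimal illustration under stated assumptions rather than the authors' implementation: the dictionary layout, the choice to update only the closest matching component, and the renormalization of the weights are all assumptions, and the handling of unmatched components and the creation of new components are omitted.

```python
import math


def classify_and_update(x, comps, background, alpha, k_sigma=2.5):
    """Label one scalar pixel intensity and update its mixture in place.

    comps      : list of dicts {"w": weight, "mu": mean, "var": variance}
    background : indices of the components treated as background
    alpha      : learning rate
    """
    # Components whose mean lies within k_sigma standard deviations of x.
    matches = [j for j, c in enumerate(comps)
               if abs(x - c["mu"]) <= k_sigma * math.sqrt(c["var"])]
    is_background = any(j in background for j in matches)
    if matches:
        # Assumption: only the closest matching component is updated.
        best = min(matches, key=lambda j: abs(x - comps[j]["mu"]))
        c = comps[best]
        rho = alpha / c["w"]      # alpha / w_{j,t}, using the weight at time t
        diff = x - c["mu"]        # X - mu_{j,t}, using the mean at time t
        c["w"] += alpha * (1.0 - c["w"])
        c["mu"] += rho * diff
        c["var"] += rho * (diff * diff - c["var"])
        # Assumption: renormalize so that the weights still sum to 1.
        total = sum(k["w"] for k in comps)
        for k in comps:
            k["w"] /= total
    return "background" if is_background else "foreground"


# Hypothetical two-component model for one pixel.
pixel = [{"w": 0.8, "mu": 100.0, "var": 25.0},
         {"w": 0.2, "mu": 200.0, "var": 25.0}]
print(classify_and_update(103.0, pixel, background=[0], alpha=0.01))
print(classify_and_update(150.0, pixel, background=[0], alpha=0.01))
```

A small learning rate changes the matched component only slightly per frame, which is why slowly moving scenes call for a small value of the learning rate.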
The fusion of foreground maps from multispectral video creates combined foreground maps that accurately estimate the centroid of the target with a low-false alarm rate. This is distinct from previous efforts [12][13][14][15][16][17] in that we have considered additional spectral bands for foreground fusion, and use SIFT features for unique target identification and detection of missed targets.
We define a fused foreground map, $FG_{FUS}(x, y)$, as the sum of individual weighted foreground maps, where $w$ represents the weighting and the subscript represents the respective band,

$$FG_{FUS}(x, y) = w_{VIS} FG_{VIS}(x, y) + w_{NIR} FG_{NIR}(x, y) + w_{MWIR} FG_{MWIR}(x, y) + w_{LWIR} FG_{LWIR}(x, y). \tag{7}$$

$FG_{FUS}(x, y)$ is spatially filtered with a 3 × 3 Gaussian filter with $\sigma = 0.5$. Thresholding of $FG_{FUS}(x, y)$ is then performed to remove pixels that have a low probability of belonging to the foreground, i.e., pixels for which $FG_{FUS}(x, y) < th$. The spectral combinations and their respective thresholds are shown in Table 3. Thresholds were chosen by the lowest false alarm rate produced by the detection algorithm. False alarm rates for the series of tested thresholds are presented in Table 4 in Sec. 4.
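The fusion, smoothing, and thresholding steps can be sketched in a few lines. The sketch below is an illustration only: the band weights, the threshold, the zero padding at the image border, and the normalization of the kernel are assumptions, and the toy 5 × 5 maps are hypothetical.

```python
import math


def gaussian_kernel_3x3(sigma=0.5):
    """Normalized 3 x 3 Gaussian kernel."""
    k = [[math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
          for dx in (-1, 0, 1)] for dy in (-1, 0, 1)]
    s = sum(sum(row) for row in k)
    return [[v / s for v in row] for row in k]


def fuse_foreground(maps, weights, th, sigma=0.5):
    """Weighted sum of binary foreground maps, 3 x 3 Gaussian smoothing,
    then thresholding (pixels with a smoothed value below th are removed)."""
    bands = list(maps)
    rows, cols = len(maps[bands[0]]), len(maps[bands[0]][0])
    fused = [[sum(weights[b] * maps[b][r][c] for b in bands)
              for c in range(cols)] for r in range(rows)]
    k = gaussian_kernel_3x3(sigma)
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:  # zero padding
                        acc += k[dr + 1][dc + 1] * fused[rr][cc]
            out[r][c] = 1 if acc >= th else 0
    return out


# Toy maps: both bands see a 3 x 3 target; VIS adds one isolated false pixel.
block = [[1 if 1 <= r <= 3 and 1 <= c <= 3 else 0 for c in range(5)]
         for r in range(5)]
vis = [row[:] for row in block]
vis[0][0] = 1
fused_map = fuse_foreground({"VIS": vis, "LWIR": block},
                            {"VIS": 1.0, "LWIR": 1.0}, th=1.5)
for row in fused_map:
    print(row)
```

In this toy case the pixel detected in only one band falls below the threshold and is removed, while the region supported by both bands survives, which is the intended effect of fusing before thresholding.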
An example of a fused foreground is shown in Fig. 4(a). A smoothing of the combined foreground map is applied using a two-dimensional Gaussian filter and shown in Fig. 4(b). The filtering results in filling of gaps where pixels were missed from the foreground map without overdilating the region. The final foreground map is shown in Fig. 4(c). A zoomed area on a car region depicting the foreground fusion process is shown in Fig. 5. The effect of thresholding the fused and filtered foreground map is illustrated; the target shadow is removed from the foreground region.
The final step of creating the pixel regions that represent the detected candidates is an image closing, which consists of a dilation followed by an erosion. The structuring element of this procedure is a disk with a radius of four pixels. The dilation operation fills in voids between pixel segments and grows the size of the region. The erosion operation then removes unnecessary region growth that is a by-product of the dilation. Pixel regions that do not exceed an area of 200 pixels are removed, as they are unlikely to represent vehicle-sized objects.
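The closing and area filtering can be sketched with foreground pixels stored as a set of (row, column) coordinates. This is a simplified illustration: it works on an unbounded grid (no image border handling), uses 4-connectivity to group pixels, and the function names and the toy rectangles are hypothetical.

```python
def disk_offsets(radius):
    """Offsets of a digital disk structuring element."""
    return [(dr, dc)
            for dr in range(-radius, radius + 1)
            for dc in range(-radius, radius + 1)
            if dr * dr + dc * dc <= radius * radius]


def closing(pixels, radius=4):
    """Morphological closing: dilation followed by erosion with a disk."""
    offs = disk_offsets(radius)
    dilated = {(r + dr, c + dc) for (r, c) in pixels for (dr, dc) in offs}
    return {(r, c) for (r, c) in dilated
            if all((r + dr, c + dc) in dilated for (dr, dc) in offs)}


def regions_above_area(pixels, min_area=200):
    """Group pixels into 4-connected regions; keep those with area > min_area."""
    remaining = set(pixels)
    kept = []
    while remaining:
        seed = remaining.pop()
        stack, region = [seed], {seed}
        while stack:
            r, c = stack.pop()
            for n in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if n in remaining:
                    remaining.remove(n)
                    region.add(n)
                    stack.append(n)
        if len(region) > min_area:
            kept.append(region)
    return kept


# Two 10 x 12 fragments of one vehicle, separated by a 3-pixel gap.
left = {(r, c) for r in range(10) for c in range(12)}
right = {(r, c) for r in range(10) for c in range(15, 27)}
fragments = left | right
print(len(regions_above_area(fragments)))           # before closing
print(len(regions_above_area(closing(fragments))))  # after closing
```

Before closing, each 120-pixel fragment is discarded by the area filter; after closing, the gap is bridged and the merged region exceeds the 200-pixel limit.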

Target Tracking
The association of targets involves relating a track sequence from prior frames with target candidates detected in the current frame. This task is trivial in the case where targets stay separated and no occlusions exist. However, in actual practice and in this data set, targets become merged or occluded, making distinguishing between targets difficult. We have chosen to use SIFT features for identification due to their robustness with respect to changes in rotation and scale, and their invariance to changes in camera viewpoint and illumination. 19 Due to our reliance on these features to uniquely identify targets, we require them to be robust in long-term tracking applications. A disadvantage of SIFT is the heavy computation required for the keypoints, where typical processing times are tenths of seconds to multiple seconds per frame in a normal CPU implementation. 24,25 Developments in graphics processing units (GPUs) and field programmable gate arrays (FPGAs) have created opportunities for real-time algorithms. SIFT implementations have been developed for both GPUs [25][26][27] and FPGAs, 24 where the results demonstrate real-time SIFT calculations.
SIFT features are composed of a keypoint that gives subpixel location and orientation of the feature, along with a descriptor that is calculated based on local pixel texture. In the SIFT algorithm, keypoints are first identified at multiple scales. A scale space of the image is created with varying amounts of blur applied to each image using the Gaussian kernel. The blurred image is defined as

$$L(x, y, k\sigma) = G(x, y, k\sigma) * I(x, y), \tag{8}$$

where $I(x, y)$ is the input image, $*$ denotes convolution, and the Gaussian kernel $G(x, y, k\sigma)$ with standard deviation $k\sigma$ is

$$G(x, y, k\sigma) = \frac{1}{2\pi (k\sigma)^2} \exp\left[ -\frac{x^2 + y^2}{2 (k\sigma)^2} \right]. \tag{9}$$

Within the scale space, difference of Gaussian (DoG) images are calculated by

$$D(x, y, \sigma) = L(x, y, k\sigma) - L(x, y, \sigma). \tag{10}$$

The local extrema in the DoG images at each scale are found by comparing the pixel value with its eight surrounding pixels and the nine neighboring pixels from each of the nearest blurred images. To create an invariance to scale, the extrema must exist on multiple scales. The detected extrema in the DoG images are then filtered based on intensity; an extremum with a low intensity is susceptible to changes in illumination and is therefore unstable and removed from the keypoints. A reference orientation for subsequent processing is given to the keypoint to provide invariance to rotation.
From the blurred image in which the extremum was located, the gradient magnitude $m(x, y, k\sigma)$ is calculated by

$$m(x, y, k\sigma) = \sqrt{ [L(x + 1, y, k\sigma) - L(x - 1, y, k\sigma)]^2 + [L(x, y + 1, k\sigma) - L(x, y - 1, k\sigma)]^2 }. \tag{11}$$

A histogram of 10 deg bins is created of the orientations, and the magnitudes added to the histograms are the gradient magnitudes that are Gaussian weighted with a variance of $1.5k\sigma$. The peaks of the histograms are detected, where the highest peak and any peaks above 80% of the highest peak are selected as orientations for the new keypoints. Peaks in the histogram represent dominant directions of the local gradients.
Unique identifications, referred to as descriptors, are generated for each keypoint. A 16 × 16 region around the keypoint, with respect to the calculated orientation, is divided into 4 × 4 subregions. Gradient magnitudes and orientations are calculated for each pixel in these subregions, and histograms with 45 deg bins are calculated for each subregion. Through the use of a Gaussian weighting mask with $\sigma$ equal to one half of the descriptor window width (16 pixels for our case), points are weighted inversely with their distance from the keypoint to decrease their contributions and reduce errors caused by window displacements.
To match features from a tracked objects database to features from the current scene, a matching score is calculated by the Euclidean distance between two descriptors. The matching score between a tracked object and a frame object is calculated by

$$\mathrm{score} = \sqrt{ \sum_{i=1}^{n} \left( d^{obj}_i - d^{frm}_i \right)^2 }, \tag{13}$$

where $D^{obj} = (d^{obj}_1, d^{obj}_2, \ldots, d^{obj}_n)$ is the tracked object descriptor, $D^{frm} = (d^{frm}_1, d^{frm}_2, \ldots, d^{frm}_n)$ is the frame object descriptor, and $n$ is the length of the descriptor vector, which is 128 for our case.
The feature with the shortest Euclidean distance, i.e., the nearest neighbor, is selected as the matching feature. To remove ambiguous matches, a comparison is made between the nearest neighbor and the next nearest neighbor. If the ratio of the match scores between the nearest and the next nearest neighbor is >0.8, the match is rejected. Lowe 19 found that this method rejects 90% of all incorrect keypoints and only removes 5% of the correct matches.
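The nearest-neighbor matching with the ratio test can be sketched as follows. The function names are hypothetical, the toy descriptors have 4 elements instead of 128 for readability, and rejecting a query when fewer than two candidates exist is an assumption.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def match_descriptor(query, candidates, max_ratio=0.8):
    """Return the index of the nearest candidate descriptor, or None when the
    ratio of nearest to next-nearest distance exceeds max_ratio."""
    if len(candidates) < 2:
        return None  # assumption: the ratio test needs two neighbors
    dists = sorted((euclidean(query, c), i) for i, c in enumerate(candidates))
    (d1, i1), (d2, _) = dists[0], dists[1]
    if d2 == 0.0 or d1 / d2 > max_ratio:
        return None
    return i1


query = [1.0, 0.0, 0.0, 0.0]
distinct = [[0.9, 0.1, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
ambiguous = [[0.9, 0.1, 0.0, 0.0], [0.9, 0.0, 0.1, 0.0]]
print(match_descriptor(query, distinct))   # one clear nearest neighbor
print(match_descriptor(query, ambiguous))  # two equally close candidates
```

The first query has one candidate that is far closer than the rest and is accepted; the second has two nearly equidistant candidates and is rejected as ambiguous.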
Updating the track location is based on several factors. The search region for a matching target candidate is limited to the track's estimated bounding box, preventing erroneous associations with targets of similar appearance but at a distance away. In the event that multiple targets are located in the track bounding box, such as at a road intersection when cars become merged, SIFT features are used to select the correct target. If no target is found in the bounding box, SIFT features are matched in the bounding box region and provide a velocity measurement for a linear motion model. Tracks are propagated if no target is matched and no SIFT features are found, a typical occurrence when the target may be partially or fully occluded from view of the sensor. The propagation projects the location of the bounding box linearly into future frames based on the most recent position and velocity prior to the occlusion.
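The linear propagation of a track during an occlusion can be sketched as below. The bounding box layout, the velocity units (pixels per frame), and the function names are assumptions made for illustration.

```python
def estimate_velocity(prev_center, curr_center):
    """Velocity in pixels per frame from two consecutive track centers."""
    return (curr_center[0] - prev_center[0], curr_center[1] - prev_center[1])


def propagate_box(box, velocity, frames=1):
    """Shift a bounding box (x_min, y_min, x_max, y_max) linearly in time."""
    x0, y0, x1, y1 = box
    vx, vy = velocity
    return (x0 + vx * frames, y0 + vy * frames,
            x1 + vx * frames, y1 + vy * frames)


# Hypothetical track: last two centers before the target is occluded.
v = estimate_velocity((100, 200), (104, 198))
print(v)
print(propagate_box((90, 190, 118, 206), v, frames=3))
```

Because the box size is kept fixed and only its position is extrapolated, the search region stays where the target is expected to reappear after the occlusion.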

Experiment
The performance of this algorithm was evaluated with a synthetically generated data set using the DIRSIG toolset. 20 A standard midlatitude summer model was used for the atmospheric model MODTRAN. 28 The thermal signatures for 12 vehicles were simulated with the thermal prediction software MuSES. 29 Realistic traffic patterns were generated using the SUMO traffic simulator. 30 The video sequence consists of 600 frames of 2000 × 2000 pixels sampled at 20 frames/s. The ground sample distance is 0.0635 m, and frames are coaligned so that pixels correspond geometrically between frames and are registered between the spectral bands. As this is a synthetically generated data set, the locations of the pixels corresponding to each vehicle are known, providing ground truth centroids of vehicles in the scene.

Detection
We now present the performance for detecting moving targets using our fusion algorithm applied to the DIRSIG data set. For the evaluation of segmented detection rates, a successful detection is a group of pixels that has a centroid with a Euclidean distance within 0.95 m of the centroid of a ground truth object; otherwise, it is considered a false alarm. False alarms are reported on a per frame basis. Targets occluded by 20% or more are not factored into the target detection score.
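The detection scoring rule can be sketched for a single frame as follows. The 0.95 m limit and the 0.0635 m ground sample distance come from the text; the function name, the nearest-truth assignment, and the treatment of several detections that fall on the same ground truth object are assumptions.

```python
import math

GSD_M = 0.0635      # ground sample distance in meters per pixel
MAX_DIST_M = 0.95   # maximum centroid distance for a successful detection


def score_frame(detected, truth, gsd=GSD_M, max_dist=MAX_DIST_M):
    """Return (number of ground truth objects detected, false alarms).

    detected, truth : lists of (x, y) centroids in pixels.
    """
    hit = set()
    false_alarms = 0
    for dx, dy in detected:
        best = None
        for i, (tx, ty) in enumerate(truth):
            d = math.hypot(dx - tx, dy - ty) * gsd  # distance in meters
            if d <= max_dist and (best is None or d < best[0]):
                best = (d, i)
        if best is None:
            false_alarms += 1
        else:
            hit.add(best[1])
    return len(hit), false_alarms


# Hypothetical frame: one detection 10 pixels (0.635 m) from a vehicle,
# and one detection far from every vehicle.
print(score_frame([(110, 100), (300, 300)], [(100, 100), (500, 500)]))
```

At this ground sample distance the 0.95 m limit corresponds to just under 15 pixels, so a detection 15 pixels from the nearest ground truth centroid already counts as a false alarm.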
False alarm rates for a series of examined thresholds are shown in Table 4, where the optimal thresholds are given in bold. The false alarm rate is presented by the number of false alarms per frame. The optimal thresholds for image filtering were selected by choosing the lowest false alarm rate for their respective spectral combination. LWIR resulted in the highest false alarm rate at 1.30. Three of the fusion combinations have false alarms less than 1: MWIR-LWIR, VIS-NIR-MWIR, and NIR-MWIR-LWIR. MWIR-LWIR presented the lowest false alarm rate of 0.96. In the single spectral bands, NIR had the lowest false alarm rate with 1.07.
Detection rates by segmentation for a series of examined thresholds are shown in Table 5, where the results for the optimal thresholds for each spectral combination are given in bold. LWIR achieved the highest detection rate of 0.94 and VIS had a detection rate of 0.93. VIS-MWIR and VIS-LWIR resulted in detection rates of 0.93, while the weighted VIS-NIR-MWIR-3*LWIR (TOT3) and VIS-LWIR3 had results of 0.91 and 0.92, respectively.
Total detection rates and false alarm rates are shown in graph form in Fig. 6. The total detection rates include detections by both segmented objects and features. In the single spectral bands, LWIR resulted in a detection rate of 0.94, but suffered from the highest false alarm rate of 1.30. The detection rate of VIS was slightly lower at 0.93, but lowered the false alarm rate to 1.14. VIS-MWIR had a detection rate of 0.94, while lowering the false alarm rate to 1.14. MWIR-LWIR produced the lowest false alarm rate at 0.96 with a detection rate of 0.91. These presented fusion results demonstrate that fusing multiple spectral bands lowers false alarms while maintaining high detection rates.
The contribution to the overall detection by segmented objects and features is shown in Fig. 7. The black bar indicates the rate by segmented detection and the gray bar is the additional detection rate by using the SIFT features. Detection by segmentation is the primary detection mechanism and contributes to the bulk of the detection rate, whereas feature detection is secondary and has a smaller impact on the overall detection rate. Pixel texture varies in each spectral band, providing different spatial features that are independent of one another. Extracting features from different spectral bands provides additional features for tracking and identification. Single spectral bands did not have any detection by features, which we attribute to an insufficient number of features to match between the target database and the scene. NIR-MWIR had the highest contribution for detections by features at 0.040. We attribute this to the high number of false pixels that were detected by segmentation, incorporating features that belong to the background. VIS-MWIR-LWIR had the next highest contribution of feature detections with 0.028. The overall detection contribution by features is not significant for this data set, but provides a means to track targets in difficult situations such as busy intersections or partial occlusions when they would otherwise be lost.
Algorithm performance for estimating the target's true centroid by correctly detecting pixels that belong to ground truth objects will now be discussed. A correctly detected pixel is defined as belonging to a ground truth object; otherwise, it is classified as a false pixel. A pixel scoring example is shown in Fig. 8. Figure 8(a) is a ground truth vehicle in MWIR and Fig. 8(b) shows the pixels labeled as foreground. Blue pixels represent true positives that belong to the ground truth object, and red pixels represent pixels that were falsely detected. For this example, false pixels are attributed to the vehicle shadow.
We define the pixel detection rate for the full video sequence as

$$\text{pixel detection rate} = \frac{\sum_{i=1}^{N} \text{detected ground truth pixels}}{\sum_{i=1}^{N} \text{total ground truth pixels}}, \tag{14}$$

where $N$ is the number of frames. High pixel detection rates result in accurate estimates of target centroids, but falsely detected pixels can negatively affect the centroid calculation, resulting in less accurate results. The false pixel rate is measured per frame and presented as

$$\text{false pixel rate} = \frac{\sum_{i=1}^{N} \text{false pixels detected}}{P \times N}, \tag{15}$$

where $P$ is the number of nontarget pixels in the frame and $N$ is the number of frames. Pixel detection rates and false pixel rates for all spectral bands and fusion combinations are shown in Fig. 9. LWIR produced the highest pixel detection rate of 0.95 and a false pixel rate of 0.0007. The fusion combination VIS-MWIR produced a high pixel detection rate of 0.94, but suffered the highest false pixel rate of 0.0017. TOT3 resulted in a detection rate of 0.89 and false pixel rate of 0.0003. Seven fusion combinations resulted in false pixel rates <0.0005.
Displacement error between detected objects and their respective ground truth centroids is a measure of how accurately an algorithm estimates the true centroid of the target. For target tracking, centroids are input to filters that predict future target locations, e.g., Kalman filtering, which require accurate estimates. Displacements of detected targets over all frames were measured and the root-mean-square error (RMSE) was calculated as

$$\mathrm{RMSE} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left[ (\hat{x}_i - x_i)^2 + (\hat{y}_i - y_i)^2 \right] }, \tag{16}$$

where $(\hat{x}, \hat{y})$ are the coordinates of the measured centroid, $(x, y)$ are the ground truth centroids, and $N$ is the number of detections. The results are presented in Fig. 10. NIR-LWIR had an RMSE of 0.07 m, which was the lowest for all spectral combinations.
LWIR had the next lowest RMSE at 0.08 m, whereas the other single spectral bands had errors >0.3 m. The fusion results presented highlight the centroid accuracy improvements made by fusing spectral bands as compared with using single bands.
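The three pixel-level and centroid metrics can be sketched as follows. The per-frame counts and centroids in the example are hypothetical, and the RMSE is assumed to be the square root of the mean squared Euclidean displacement between measured and ground truth centroids.

```python
import math


def pixel_detection_rate(detected_per_frame, truth_per_frame):
    """Detected ground truth pixels over total ground truth pixels."""
    return sum(detected_per_frame) / sum(truth_per_frame)


def false_pixel_rate(false_per_frame, nontarget_pixels):
    """False pixels over (nontarget pixels per frame x number of frames)."""
    return sum(false_per_frame) / (nontarget_pixels * len(false_per_frame))


def centroid_rmse(measured, truth):
    """Root-mean-square displacement between paired centroids."""
    sq = sum((mx - tx) ** 2 + (my - ty) ** 2
             for (mx, my), (tx, ty) in zip(measured, truth))
    return math.sqrt(sq / len(measured))


# Hypothetical two-frame sequence.
print(pixel_detection_rate([90, 95], [100, 100]))
print(false_pixel_rate([10, 30], nontarget_pixels=10000))
print(centroid_rmse([(3.0, 4.0), (0.0, 0.0)], [(0.0, 0.0), (0.0, 0.0)]))
```

Centroids given in pixels yield an RMSE in pixels; multiplying by the ground sample distance converts the result to meters.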

Tracking
We now evaluate the performance of the fusion algorithm to associate targets between scenes and create a tracking profile using the foreground combination map (VIS-LWIR). This weighted combination was chosen due to the low centroid error, along with the high detection rate and low false alarm rate. In this evaluation, the bottom 400 rows of pixels are not considered for the tracking results due to trees obstructing the view of the imaging sensors, preventing full vehicle segmentation. Sample images from the tracking sequence are shown in Fig. 11. A yellow box indicates that the track algorithm has used a segmented object to update the track location. A teal box indicates that no segmented object matched and SIFT features were used to update the track location. The motion of 12 vehicles was simulated for an urban traffic environment. Tracks were initiated on all 12 vehicles during the video. Of those 12 vehicle tracks, 11 were tracked through the entire video sequence with no errors. One track was lost because the vehicle was idle for an extended period of time, but a new track was initiated after the vehicle resumed movement. There were no instances where track identities were switched between vehicles and only one instance of a lost track. Three false tracks were produced, two of which are attributed to the idle vehicle. The tracking results are summarized in Table 6.

Conclusion
In this paper, we proposed an algorithm to fuse multispectral data sets to increase detection accuracy of a video tracker, while maintaining a high detection rate and a low number of false alarms per frame. Previous works consider visible and LWIR data sets; [12][13][14][15][16][17] we extend previous work to include NIR and MWIR. In these four spectral bands, we build a GMM to detect foreground pixels by modeling the time history of the pixels' intensities. Foreground pixels from all spectral bands are weighted and fused into foreground maps, and formed into target candidates. Target candidates are tracked through the frame sequence using SIFT features to track missed detections and uniquely identify targets. Our proposed algorithm was tested on visible, NIR, MWIR, and LWIR imagery synthetically generated using the DIRSIG toolset. Compared with the single spectral band baseline, the fused algorithm improves detection accuracy and lowers false alarm rates while maintaining high detection rates. The detection results provided input to a video tracker that detected the 12 moving vehicles in the scene. Of those 12 targets, 11 were tracked with no failures; one vehicle showed track loss, but this track was reinitiated; and three false tracks occurred.