Aerial videos acquired by airplanes or fixed-wing unmanned aerial vehicles (UAVs) are a good source of data for ground surveillance. Such kind of data can be used for applications, such as automatic traffic monitoring, border surveillance, or protection of critical infrastructures. In this article, we will focus on video data acquired with a frame rate of 25 Hz by a small UAV in top-down view at an altitude of 400 m. The ground coverage is up to with a ground sampling distance (GSD) of about . Such parameters are usual for so-called full-motion video (FMV) sequences. Additionally, there is one specific property that we can find in top-view videos with small-sized objects on the ground that are observed; there is no significant change in appearance of the objects in the image over time. This is crucial for the successful application of image stacking.
Image stacking is a well-known method that is used to improve the quality of images in video data. A set of consecutive images is aligned by applying image registration and warping.1 In the resulting image stack, each pixel position contains redundant information about its intensity or color value. This redundant information can be used to perform image fusion to suppress image noise,2,3 handle motion blur,45.–6 detect independent motion in images using background models,7,8 or even enhance the spatial image resolution as done in multiframe super-resolution.910.11.–12 However, the performance of the above-mentioned methods is usually dependent on the assumption that the scene is static and that the viewing angle does not change significantly over time. At the same time, moving objects can get blurred or distorted as the images inside the image stack are registered and aligned with respect to (w.r.t.) the stationary background. Hence, moving objects need to be handled explicitly as this is done for super-resolution.13,14
In this article, we apply image stacking in an innovative way: image alignment is not performed w.r.t. the stationary background but to small moving objects instead. In our aerial FMV data, moving objects are mainly cars but also other vehicles on the ground such as motorcycles. Then, we apply a pixelwise temporal filter (e.g., a median filter as done in Ref. 3) to the image stack to remove noise and compression artifacts from the moving objects. More interestingly, we intentionally add motion blur to the background that surrounds the moving objects. This effect is visualized in Fig. 1. We aim at detecting the moving vehicle that is marked with a red cross. As we can see in the zoomed original image (left box inside each image), there are potential distractors, such as parked and overtaking vehicles. By applying our proposed image stacking method (right box inside each image), it takes only 28 images () to blur the distractors effectively. Although camera and objects are moving, the observed vehicle’s appearance is nearly constant over time due to the top-view camera angle and due to the large distance between camera and scene. In this way, stationary image content is removed that can modify the observed object’s appearance and, thus, can disturb the detection and segmentation process, such as (1) partial occlusions by trees, power supply lines, or buildings, (2) stationary objects close to the observed object, such as parked vehicles or buildings, and (3) street textures, such as cobblestones or road markings. To demonstrate the effectiveness of the proposed approach, two baseline object detection and segmentation algorithms are implemented, improved by our image stacking approach, and evaluated using FMV sequences that contain occlusions, parked vehicles, and street textures.
The main contributions of this article are:
1. the introduction of image stacking for moving ground objects observed by a moving airborne camera to intentionally blur the surrounding stationary background,
2. handling and management of multiple potentially overlapping image stacks that occur when observing multiple moving ground objects in parallel, and
3. how to use the proposed image stacking approach to improve standard moving object detection and segmentation methods.
We briefly mentioned the proposed image stacking approach in prior work already,15 but no details were given, such as how image stacks can be initialized, how multiple image stacks can be managed, and how our proposed approaches can be implemented efficiently. Furthermore, we provide an extensive study of the image stacking parameters that contribute most to the object detection and segmentation performance. Parts of this article are based on the first author’s doctoral thesis.16
The remainder of this article is organized as follows: literature is reviewed in Sec. 2. The baseline moving object detection and segmentation algorithm are described in Sec. 3. The proposed image stacking approach is presented in Sec. 4. Experimental results are given in Sec. 5. We conclude in Sec. 6.
Moving object detection in aerial videos is usually based on the detection of image regions that move independently of the camera followed by either segmentation techniques or machine learning approaches. Independent motion is then represented either by pixel clusters as done in background subtraction8,17 and frame differencing18,19 or by clusters of tracked local image features, such as Kanade–Lucas–Tomasi features or Harris corners.2021.–22 Segmentation techniques can be based on thresholding,23,24 morphological operations,25 edge detection,15,26 or superpixels27,28 in combination with connected component labeling while machine learning approaches use trained classifiers in a sliding-window framework2930.–31 often only applied to independently moving image regions.3233.–34
To further improve those methods, several approaches exist for spatial information fusion15,26,31,35,36 and consideration of context knowledge, such as street networks or tracking statistics.18,25,32,33,3738.–39 Temporal information fusion, however, is often introduced by using single or multiple object tracking that is based on initial detections. Pure feature tracking20 is possible, too but not sufficient in general since no information is gained about individual object instances. Hence, it remains unclear if the track contains one object completely or partially, or even multiple objects moving in the same direction at similar velocity. Actually, our method is applied between object detection and object tracking and, thus, can be used to further improve multiple object tracking by providing more reliable initial detections.
Few approaches for image stacking exist in the literature that are somehow related to our proposed method. Prokaj and Medioni40 align and average several samples of a moving vehicle to set up a template for a regression tracker in wide-area motion imagery (WAMI) data. Ali et al.41 calculated an intraclass variation between aligned samples of a vehicle and compared it to other vehicles in the scene (interclass variation) to reacquire the vehicle after an occlusion in FMV data. Finally, Mise and Breckon42 apply multiframe super-resolution to low-resolution moving objects to enhance the object’s image quality and, thus, improve the performance of correlation-based object tracking. However, while the above-mentioned approaches use image stacking to improve object tracking, our proposed image stacking approach is applied at an earlier stage to improve object detection and segmentation.
Moving Object Detection
In this section, we present the baseline processing chain, in which the proposed image stacking approach is embedded. An overview of the algorithm’s pipeline is given in Sec. 3.1, and the two main submodules are described in more detail in Secs. 3.2 and 3.3.
The concept of the entire moving object detection process, including the proposed image stacking approach, is visualized in Fig. 2. Incoming image sequences are processed by two main modules: (1) independent motion detection and clustering and (2) object detection and segmentation. The first module is depicted in yellow color and mainly aims at determining motion that is independent of the camera motion. Image regions of independent motion are represented by motion clusters that can contain multiple objects or even parts of objects due to split and merge effects. There may even be no object at all, if the motion cluster is the outcome of imprecise image registration or parallax effects. In the second module that is depicted in red color, these motion clusters are used as regions of interest (ROIs) to detect and segment individual moving objects. Although “object segmentation” and “object detection” can be considered as two different topics in computer vision, they can be used to solve the same problem in the context of this article: improved localization and boundary determination of individual objects. The proposed image stacking approach (bright red color) is part of this module and supports the detection and segmentation process. It is described in detail in Sec. 4.
Independent Motion Detection
Independent motion detection is used to detect image regions that move independently of the camera. In our FMV data, these regions usually represent moving objects, such as vehicles or motorcycles. To detect independent motion, it is crucial to estimate the camera motion first. Therefore, we only use the images of the video sequence without any additional meta-information, such as camera calibration parameters, digital elevation models, or UAV flight parameters.
The image sequence is compensated for camera motion by the application of image registration.1 Local image features are detected with subpixel accuracy and tracked over time.43 We use Förstner–Harris corners44,45 as local features. Corresponding corners between subsequent images are used to estimate frame-to-frame homographies as global image transformations to register images.46 Outlier correspondences are removed by applying random sample consensus47 with subsequent refinement. Using homographies is possible since the depth differences in the evaluated scenes are small compared to the distance of the observing camera, and the scene can therefore be approximated well with a ground plane. Those homographies represent the camera motion as shown in Fig. 2. Robust feature tracks that do not fit to the camera motion are assumed to be part of moving objects. Since these features have been tracked for a few frames, they can be represented not only by their positions but by motion vectors. Similar motion vectors are grouped to motion clusters. Therefore, single-linkage clustering48 is performed based on position and velocity of the motion vectors. The choice of distance thresholds is based on the known GSD and the expected size of objects. As its output, the independent motion detection module delivers a set of motion vectors and motion clusters.
This process is visualized in Fig. 3. One original image taken from an image sequence of the test dataset is shown in Fig. 3(a). In Fig. 3(b), we see all detected and tracked local features represented by red vectors. There are usually around 20,000 motion vectors per image. We apply frame-to-frame homography estimation to calculate a global image registration that represents the camera motion. Using this homography, we can distinguish between stationary and independently moving feature tracks. Stationary features are depicted in red color and motion vectors in yellow color in Fig. 3(c). Clustered motion vectors are used to calculate a local image registration for local image alignment. On average, each moving object in the test dataset produces a cluster of 15.57 motion vectors. Hence, robust image alignment based on small moving objects, which is crucial for the proposed image stacking approach, is possible as we need at least two motion vectors to estimate the stack alignment (translation and rotation). Motion vector clustering leads to the cyan boxes in Fig. 3(d), where each box represents one motion cluster.
Instead of motion vector clustering, background subtraction8,17 or frame differencing18,19 is the popular method for independent motion detection. However, for image stacking, we want to use the temporal information that is implicitly contained in the motion vectors. Furthermore, motion vector clustering is less prone to be affected by noise.20,22,49
Object Detection and Segmentation
For object detection and segmentation, we use two different approaches and implement them as a baseline that we want to improve by the application of image stacking. The first approach is motivated by appearance-based object detection using machine learning34 and the second approach is based on object segmentation by the clustering of edge pixels.49 The result of both methods is bounding boxes that surround the moving objects. For simplicity and to be consistent with existing literature, we call those bounding boxes “detections” even if they are the result of object segmentation.
The feature clustering during independent motion detection that provides the motion clusters is prone to produce split and merged detections of moving objects. Split detections can occur for weakly textured vehicles, and merged detections appear due to the similar motion direction and velocity in dense urban traffic.49 Thus, the motion clusters are considered as initial object hypotheses and define a search space. Before object detection and segmentation are applied, the motion clusters are spatially extended in the motion direction to fully contain objects that are detected partially by independent motion detection. Those extended regions are denoted by “motion ROIs.”
For object detection, we train a classifier using 790 vehicle samples and 790 nonvehicle samples. Each sample has a size of . We use Integral Channel Features as a descriptor and train a boosted decision trees (BDT) classifier using Real AdaBoost.50 This classifier is applied locally to the motion ROIs in a sliding-window framework. To consider different vehicle sizes, image rescaling is done with three different scales.34 Note that, in this work, image rescaling is not used to detect objects in different distances (as done in pedestrian detection, for example), since we perform a scale normalization using the known GSD before we run the detector. Finally, a nonmaximum suppression is applied based on the intersection-over-union (IoU) criterion and the classifier confidence.
For object segmentation, we calculate gradient magnitudes using local binary patterns16 and use quantile-based thresholding for binarization.51 A typical value for the quantile is 0.15. Thus, we assume that 15% of the pixels inside a motion ROI belong to moving objects. After the application of morphological closing and connected-component labeling, we calculate bounding boxes for each blob that exceeds a certain area or size (w.r.t. number of pixels). Each bounding box represents one moving object. The above-described approach is chosen because it is able to outperform16 other segmentation methods that are based on global probability of boundary,52 superpixels,53 spatiotemporal saliency,54 top-hat transform,25 or Canny gradient thresholding.26 The main reason, therefore, is the image quality, which is severely limited by low resolution and atmospheric effects causing noise and blurry object edges.
Finally, duplicate and outlier detections are removed. Duplicate detections occur due to overlapping motion ROIs. They can be rejected by applying the earlier mentioned nonmaximum suppression for object detection and by fusing all overlapping bounding boxes that exceed a certain IoU threshold for object segmentation.16 Outliers are detections that no motion vectors can be associated with. Therefore, we simply check the number of motion vectors inside the detection bounding box. Detections with associated motion vectors are rejected, as we assume them to be stationary.16,34
Image Stacking for Moving Objects
The main contribution of this article is presented in this section: (1) the introduction of image stacking for moving objects, (2) handling and management of multiple potentially overlapping image stacks, and (3) the replacement of motion clusters by image stacks to improve object detection and segmentation.
Object detection and segmentation as presented before are working well as long as the object appearance is clearly visible. However, there are three effects that can affect the object appearance and decrease the detection and segmentation performance by causing inaccurate object boundaries or even producing false positive (FP) or false negative (FN) detections:
• partial occlusions by trees, power supply lines, or buildings,
• stationary objects close to the observed object, such as parked vehicles or buildings, and
• street textures, such as cobblestones or road markings.
To suppress these effects, image stacking is introduced to mask and blur the stationary background around a moving object. Several motion ROIs of the same object in subsequent images are registered and aligned using the motion vectors, i.e., we apply local image registration as described and visualized in Figs. 3(c) and 3(d). Each aligned motion ROI is one layer of the image stack. By calculating the pixelwise mean or median pixel value, the stationary background and even other moving objects with a nonzero relative velocity compared to the observed object are blurred as seen in Fig. 1. The red cross visualizes a motion vector that is tracked for 250 consecutive images. The image region (“stack region”) where stacking is applied is generated using the motion direction and fixed values for width and length of the stack area. This stack region is zoomed and visualized without stacking in the left and with stacking in the right area of each image. The center of the stack region is fixed by the position of the motion vector (red cross). After that, the stack region, which is oriented in the motion vector direction in the original image, is rotated upright and added to the stack. The pixelwise mean image of the entire stack is visualized in Fig. 1. For a pixel position in the stack and in each aligned image with , the corresponding pixel value is given as
The concept of the proposed image stacking approach is visualized in Fig. 4. It shows the processing that is implemented in the image stacking module depicted in bright red color in Fig. 2. As there can be multiple moving objects in the scene, there usually are multiple image stacks and we aim at having at least one image stack per object. New image stacks are initialized either by using motion vectors or detections. New motion vectors are associated to existing stacks based on position and motion to improve the robustness of image alignment for small moving objects. Image stacks are updated by adding the corresponding aligned image area of the current image to the stack. Finally, image stacks replace related motion ROIs in the original unstacked image for the application of the object detection and segmentation algorithms. As seen in Fig. 2, image stacking can be skipped if image alignment is not robust or if the stack height is not sufficiently large, yet. In this case, the motion clusters in the original unstacked image are used for object detection and segmentation.
Image Stack Initialization
The implementation of image stacking in a fast and efficient way is not easy. In a feasibility study,15 it was demonstrated that generating and initializing a stack for each motion vector in the image can lead to hundreds of detections (and stacks) per object, but adequate duplicate removal of sufficiently overlapping bounding boxes improves the performance of object segmentation with respect to detection rate and precision. However, this method is highly inefficient w.r.t. processing time and needs to be improved for implementation. The main difference between the feasibility study and this concept is the introduction of “master vectors” as representatives for clustered motion vectors. Each master vector is a virtual motion vector and owns exactly one image stack. In Fig. 5, master vectors are visualized as colored crosses. Thus, the total number of image stacks decreases from several hundred to per image. The master vector lies in the center of its related image stack.
To initialize an image stack, one has to detect a cluster of similar motion vectors. This cluster can be different from the motion clusters, which are the result of independent motion detection and clustering. The reason is that moving objects driving with a small relative velocity close to each other may be clustered together in the same motion cluster while image stacking aims at initializing at least one stack per object. Only in this way, it is possible to benefit from the blurring effect since the first object is assumed to be focused and sharp in the first stack while the second object gets blurred over time due to its velocity difference, and vice versa, the second object stays sharp in the second stack while the first object is blurred. If only one stack is initialized for two objects, the blurred object may be missed by the detection and segmentation algorithm. This effect is visualized in Fig. 5 with two vehicles overtaking each other. Each master vector has between 3 and 50 related motion vectors. These related motion vectors are represented by dots in the same color as their master vector. The blue master vector is not in the center of its cluster since it was initialized while both objects were merged in one large bounding box. In contrast, the magenta master vector was initialized later when both objects were detected separately. After 1 s or 25 stacked images, the pixelwise mean image shows that usually only one object stays sharp per stack. Two strategies have been implemented for image stack initialization.
1. The results of object detection and segmentation are called “detections.” These detections are assumed to have a higher detection rate compared to the motion clusters that are prone to produce splits and merges. By using them as feedback information to initialize stacks, the ratio of stacks to objects is close to 1, which is most efficient. Even for merged detections there will be separate image stacks, if each single object has been correctly detected at least once. Immediately after that, the individual stacks are initialized. This is the case in Fig. 5. The blue master vector is not in the center of its cluster as it was initialized while both objects were merged in one large bounding box. The magenta master vector was initialized later when both objects were detected separately for at least one time step. To avoid any information loss, the blue master vector is kept and not reinitialized. However, objects that cannot be detected and segmented separately at all end up in a common image stack. In such a case, the blurring effect is assumed to be weak for each object, and detection performance is similar to using no image stacks.
2. A weak detection performance can lead to the initialization of image stacks for merged detections. In this way, image stacking may even decrease the overall detection performance. To use the motion vectors and motion clusters instead of the detections, “-means clustering”55 is introduced. is the number of clusters after clustering and can be chosen based on the known standard vehicle size and the size of the related motion cluster. Then, the motion vectors are grouped using position and motion. This approach is less efficient since usually more stacks than objects are initialized and the calculation of -means is time-consuming.
Both approaches are evaluated in Sec. 5.
Association of Motion Vectors to Image Stacks
Due to occlusions or varying object appearance, existing motion vectors can disappear and new motion vectors can appear at any time in the image sequence. A master vector is deleted if it loses all related motion vectors. While each master vector can have many related motion vectors, each motion vector can be related to only one master vector. A new motion vector is associated to a master vector by applying a k-nearest neighbors (-NN) algorithm.48 -NN is a voting algorithm searching for the nearest motion vectors in the image area around the new motion vector. Each neighbor votes for its master vector, and the new motion vector is associated to the master vector with the most votes. The search space can be significantly reduced by only considering motion vectors in the same motion cluster. A typical value for is 3. This process is shown in Fig. 6. Each master vector (image stack) is visualized with a cross in a different color. The associated motion vectors are displayed as dots in the same color as their related master vectors. After -NN using the motion vectors and clusters, the new motion vectors are associated to master vectors.
Image Stack Update
With each new incoming image, image stacks are updated in two steps: (1) the position of the master vector is updated and (2) the current stack area is added as a new layer to the image stack. Since the master vector is a virtual motion vector, its position and orientation can only be updated by its related motion vectors. Each related motion vector stores its relative position and orientation to the master vector right after it has been associated. This way, it can suggest a position and orientation of its related master vector in the current image. The final master vector position and orientation are determined by calculating the median of all suggested positions and orientations. Hence, even few incorrectly tracked related motion vectors will not affect the final master vector.
The stack area surrounding the updated master vector is rotated in upright position and added to the image stack as shown in Fig. 7. There are two considered strategies to arrange the image stack.
One image with the size of the rotated stack area is initialized. Each pixel value is zero. To append a new stack area, it is rotated in an upright position, and each pixel value of the rotated stack area is added to the accumulation image. The stack height is stored by incrementing a counter variable with each appended stack area. The accumulation image is a very efficient way to arrange the image stack since only one image has to be kept in the memory and new layers are simply added to this image. The motion ROI can be calculated very fast by dividing each pixel by the stack height. This is the mean pixel value of all stack areas appended to the stack just as described by Eq. (1). The main disadvantage of this approach is that a slightly varying object appearance due to changes in the camera angle or UAV altitude will strongly affect the resulting motion ROI by blurring the observed object.
To avoid this self-blurring effect, old stack areas can be replaced after a certain time and, hence, the stack can be arranged as a circular buffer. The size of the circular buffer directly corresponds to the stack height. It is fixed and determined a priori. In this way, the maximum number of stack areas in the stack is limited. New stack areas are added to the buffer and as soon as the buffer is full, the oldest stack area is replaced by the new one. The motion ROI is calculated either by the pixelwise mean or median pixel value of all stack areas. This is much more time-consuming than the first approach with the accumulation image. Figure 7 shows a circular buffer with stack height and where the stack area at position 4 is replaced. The standard buffer size used in this article is since all stationary objects and most objects moving relative to the observed object disappear in the motion ROI.
The comparison and evaluation of both approaches will be presented in Sec. 5. An image stack is initialized as soon as the master vector is initialized. Hence, a region of the stack area can be outside of the image borders. To add the whole area to the stack but minimize the influence of the outer region for the calculation of motion ROIs, the pixels in these regions are set to the mean of the gray-value range. We set this value to 127.
Replacement of Motion Clusters by Image Stacks
The replacement of motion clusters by motion ROIs that are provided by image stacking is optional. Motion ROIs are used, if suitable stacks are available, and motion clusters are used otherwise. A suitable image stack has to meet the following four criteria:
1. The velocity and the direction of the master vector have to be close to the velocity and the direction of the motion cluster.
2. The extended motion cluster has to be fully inside the stack area.
3. The master vector has to be inside the motion cluster.
4. The stack height has to exceed a minimum threshold .
This process of finding suitable image stacks is shown in Fig. 8. Image stacks shall be found for the lower motion cluster (cyan rectangle). The extended motion cluster is visualized with a dashed cyan rectangle. There are two vehicles inside the motion cluster, and for each vehicle, an image stack has been initialized (shown in magenta and blue color). For both stacks, all four criteria are fulfilled, so two suitable image stacks have been found. These image stacks replace the original motion cluster by replacing the original image with the stacked image. This stacked image can be calculated using the pixelwise mean of all images in the accumulation image or all images in the circular buffer as described in Eq. (1) or the pixelwise median of all images in the circular buffer using the following equation:
Image stacking is introduced to improve the object detection performance by considering temporal context. Since only motion vectors and no object representations are used, image stacking takes place on feature level before multiple object tracking is applied on object level. Object detection and segmentation are applied to each motion ROI (stacked image) separately and the original image motion cluster is discarded, if suitable image stacks have been found. As soon as a motion cluster has been replaced by suitable image stacks, the aim is to detect only the sharp object in each stack and, thus, avoid a potential merged detection containing background structures or several moving objects. If a blurred object is detected, too, this detection is suppressed by the outlier and duplicate removal module. In contrast to the first implementation of image stacking,15 the approach presented here can be processed in real time. The runtime will be analyzed in detail in Sec. 5.
Figure 9 shows examples for each of the three situations where image stacking improves object detection as mentioned in the beginning of Sec. 4. These are (1) partial occlusion by a tree in Fig. 9(a), (2) a parked vehicle close to the observed moving vehicle in Figs. 9(b) and 9(c), and (3) disturbing street textures in Fig. 9(d). The observed object is located in the center of each image. Since image stacking will improve object detection and segmentation only in such situations, only minor improvement of detection performance is expected. Furthermore, image stacking can even cause additional false detections if moving objects with small relative motion merge due to blurring. This effect becomes apparent in Fig. 5.
Important parameters for image stacking are the size of the stack area , the minimum stack height , and the stacking strategy. has to be chosen large enough so that the stack region fully covers even large objects such as trucks. is the threshold of stack height that has to be exceeded before a stack is returned as motion ROI. This is important since small stacks of or less are prone to cause objects merging due to blurring. If is too large (e.g., 50 or 100), fewer image stacks are used, and potential benefit decreases as objects may already be outside of the camera view before is reached and the motion ROI is returned. Good results have been achieved with , , and circular buffer instead of accumulation image. The influence of varying these parameters for object detection and segmentation is analyzed in Sec. 5.
Experiments and Results
Four gray-value video sequences are used in our experiments. Sample frames for each sequence are visualized in Fig. 10. While the first three sequences SEQ 1, SEQ 2, and SEQ 3 are coming from your own top-view aerial video data, the fourth is the EgTest01 sequence taken from the publicly available video verification of identity (VIVID) dataset.56 To the best of our knowledge, no further publicly available FMV datasets currently exist that contain challenging urban scenarios with dense traffic. WAMI datasets, such as Columbus large image format57 or Wright-Patterson air force base,58 cannot be used here due to their low frame rate.
Although the EgTest01 sequence is recorded in oblique view, we consider this sequence in our experiments as we want to show some limitations of applying our proposed method to oblique view videos. The number of frames is approximately between 200 and 2000, and the frame rate is between 25 and 30 Hz. Our own data consist of busy urban scenes. There are between 5 to 20 moving objects per frame. With a GSD of about , a standard car covers an area of in the image. Some statistics of the evaluated sequences are given in Table 1.
Statistics of the aerial video datasets.
|Video||Frames||Frame rate (Hz)||GT|
|Moving objects||Single detections||Moving vehicles||Single detections|
Moving objects are represented by bounding boxes. The ground truth (GT) was labeled manually. We distinguish between vehicles (red boxes in Fig. 10) and nonvehicles, such as moving motorcycles or pedestrians (orange boxes in Fig. 10). This is important as we compare a model-based vehicle detection approach and a model-free object segmentation approach. Since the vehicle detection approach is not able to detect nonvehicles, we declare nonvehicles as “do not care” objects for the quantitative evaluation to guarantee a fair comparison between the two approaches. In the original GT of the EgTest01 sequence, only one object is labeled for an evaluation of single object tracking. We extended the GT manually to all six moving vehicles.
A detection is true positive (TP), if the bounding boxes of GT and detection overlap at least 10% w.r.t. to the IoU criterion.59 In contrast to the PASCAL criterion of 50% overlap,60 we choose this small overlap as for small objects covering only 50 to 200 pixels even small deviations in size or position of the detection bounding box from the GT bounding box can induce a significantly lower overlap. For the quantitative evaluation, we use standard measures, such as precision, recall, and -score,61 as well as the normalized multiple objects detection accuracy (N-MODA) and the normalized multiple object detection precision (N-MODP).62
Four parameters are selected for optimization: minimum stack height and stack area are continuous parameters, whereas stack initialization and stack arrangement are discrete parameters. The height of a stack has to exceed before the motion cluster is replaced by the stack for object detection and segmentation. If is not exceeded, the motion cluster is used instead. is the spatial extend of a stacked area in the image. With a larger size, it is more likely that extended motion clusters are inside the stack area and can be replaced by stacks. Two approaches for stack initialization are proposed in Sec. 4.2: initialization by -means clustering of moving corners and initialization by detections. Finally, two different methods are introduced for stack arrangement in Sec. 4.4: accumulation image and circular buffer. The stack image of a circular buffer can be calculated by pixelwise mean or median. Maximization of the -score for SEQ 1 is chosen as optimization criterion.
The optimization results are visualized in Fig. 11. Stack initialization is evaluated for both the model-based vehicles detection approach and the model-free object segmentation approach in Fig. 11(a). For both approaches, -means clustering performs worse compared to initialization by detections, so the latter one is chosen to initialize stacks. In Fig. 11(b), the -score of for all three stack arrangement methods is depicted. is the chosen range clearly demonstrating that is the best value for each method. Figure 11(c) shows the optimization for stack area . The -score of the three stack arrangement approaches is plotted against the stack area size in pixels. With and , two different ratios of width to length of are evaluated. The variance of the -score is smaller for ratio compared to , so is further considered. is the best value for the stack area since there is no improvement of the -score anymore for larger values. In all evaluations, the accumulation image achieves a worse maximum -score than the circular buffer. Eventually, the circular buffer with pixelwise median image is chosen since it performs slightly better compared to the pixelwise mean image.
In this section, the object segmentation and the vehicle detection approach are evaluated with and without the proposed image stacking method for each of the four aerial videos. The aim is to determine the situations in which the application of image stacking is beneficial and to measure the potential improvement. The quantitative evaluation is presented in Table 2. Bold font indicates an improvement compared to the approach without image stacking. In SEQ 1, image stacking reduces the number of FPs and FNs by 76 for the segmentation approach and by 14 for the detection approach, respectively. A similar result is achieved for SEQ 2, where the number of FPs and FNs is decreased by 29 and 19, respectively. The improvement of the -score is between 0.002 and 0.011.
Quantitative evaluation for image stacking.
|Video||Evaluation measure||Object segmentation||Object segmentation + stacking||Vehicle detection||Vehicle detection + stacking|
Note: Improvement by image stacking is highlighted in bold.
However, image stacking does not really improve the results for the sequences SEQ 3 and EgTest01. The number of FPs and FNs is reduced by only 16 and 15 for object segmentation, respectively. At the same time, the performance of the vehicle detection approach is even decreased. In SEQ 3, there are no occlusions and only few merged detections, so that image stacking cannot generate much benefit here. Furthermore, slightly imprecise tracking of the motion vectors can result in blurry observed objects, which lead to an even higher number of FN detections. The reason is that due to object blurriness, the quantile-based threshold for object segmentation and the classifier’s confidence value threshold for object detection are not exceeded anymore. In the EgTest01 sequence, there are no merged detections, no partial occlusions, and only few disturbing street textures. Image stacking works well as long as the object appearance is stable. If this is not the case, the object gets blurred or deformed in the image stack. As mentioned earlier in Sec. 5, the camera angle and, thus, the vehicle appearance are varying in EgTest01. This is difficult to handle for the classifier that was only trained with top-view samples. It is even more challenging, when the turning vehicles at the beginning of the scene get deformed during image stacking as demonstrated in Fig. 12.
In general, there is stronger improvement in both accuracy (-score and N-MODA) and precision (N-MODP) for object segmentation compared to the vehicle detection approach. The main reason is that outlier removal as described in Sec. 3.3 is able to cover most situations where stacking can improve vehicle detection: due to the fixed size of the sliding window, usually no undersegmentation occurs in case of merged detections. Instead, objects are either missed or FP detections appear. These FPs are reduced by both image stacking and outlier removal. At the same time, undersegmentation as occurring regularly in object segmentation cannot be handled well by outlier removal but only by image stacking.
The qualitative evaluation is done separately for object segmentation in Fig. 13 and vehicle detection in Fig. 14. Four image sections are chosen for each of the two methods. GT is visualized in the top row, and the approach without and with image stacking is depicted in the center and lower row. Orange GT bounding boxes represent moving motorcycles, bicycles, or persons that are not considered for evaluation (ignore regions). The cyan and red boxes in the center and lower row visualize motion clusters and detections, respectively. The red arrows point at the improvement by image stacking. In Figs. 13(a) and 13(b), typical merge situations are shown where two objects drive close to each other and cause undersegmentation. They can be separated by image stacking since each object has its own stack without disturbing background. The right truck in Fig. 13(c) is partially occluded by a tree, while there is disturbing street texture in Fig. 13(d). Object segmentation is still possible but imprecise. Image stacking compensates for the occlusion and the street texture leading to more precise segmentation results. In Fig. 14, merged detections (a), FPs (b), and FNs (c) occur for the vehicle detection approach. The reason is ambiguities of the classifier when there are more prominent object contours between objects than for the real individual objects. In these cases, image stacking helps since the classifier confidence is higher inside the individual stacks with exactly one sharp object per stack. The partial occlusion in Fig. 14(d) is successfully handled with image stacking although only the shadow is detected for the left truck.
In summary, image stacking is able to improve both the object segmentation and the vehicle detection approach. The improvement may look minor; however, there are only few situations where the application of image stacking is beneficial, such as partial occlusions, merged detections, and ambiguous detections in dense traffic. There is less improvement for vehicle detection since both image stacking and outlier removal handle similar problems and, thus, complement each other. In our data, vehicle detection outperformed object segmentation in all sequences. The reason is that vehicle detection, which is based on a sliding-window approach, makes assumptions about vehicle size and appearance that are satisfied in the evaluated sequences. The model-free object segmentation approach achieves better performance in cases where those assumptions are violated, such as in sequence EgTest01.
The runtime for managing about 20 stacks per image varies between 48 and 870 ms. This strong variation is dependent on the choice of the stack arrangement: while stack initialization and association of motion vectors to stacks (Sec. 4.2), and replacement of motion clusters by stacks (Sec. 4.3) take less than 20 ms altogether, stack update (Sec. 4.4) claims between 28 ms, if the accumulation image is used, and up to 850 ms, if the stack is arranged as a circular buffer. Optimization can be achieved by fast, approximated, or incremental calculation of the median pixel values in the circular buffer.63,64 Furthermore, we do not perform any parallel processing of the image stacks, which is expected to strongly reduce the runtime.
Conclusions and Outlook
The main goal of our proposed image stacking approach is to remove disturbing image structures from the background of moving objects on the ground observed by airborne cameras in top view. Those structures are usually not moving and can thus be smoothed to isolate the observed object from the background. Especially, model-free object segmentation can benefit from this approach since object contours can be well-isolated by image stacking. There is less benefit for sliding window-based vehicle detection since the assumptions that are made implicitly (e.g., about object size or object width/length ratio) tackle similar problems as the image stacking approach. Furthermore, imprecise image registration of moving image regions that are used for image stacking is the most prominent limiting factor for vehicle detection as vehicles can be blurred affecting the classifier’s confidence. However, improvement may be possible when using deep learning-based local feature matching for image registration with higher accuracy.65
There are several potential applications for image stacking besides improved object detection and segmentation. Among these applications is super-resolution for moving objects, generating appearance templates without background for visual object tracking or reidentification, or temporal filtering to suppress image noise, compression artifacts, or artifacts of a disturbed wireless connection.
The research reported in this contribution was funded by the Technical Center for Information Technology and Electronics (WTD 81) of the German Federal Office for Armament and Procurement.
Michael Teutsch received his diploma degree in computer science and PhD from the Karlsruhe Institute of Technology (KIT) in 2009 and 2014, respectively. From 2009 to 2016, he worked as a research scientist at the Fraunhofer IOSB, Karlsruhe, Germany. Since 2016, he has been with Hensoldt Optronics, Oberkochen, Germany. His research interests include computer vision, visual surveillance, object detection, object tracking, and machine learning. He has authored or coauthored more than 20 scientific publications.
Wolfgang Krüger received his diploma degree in computer science and his PhD from the University of Karlsruhe, Germany, in 1986 and 1991, respectively. Since 1986, he has been a research scientist at the Fraunhofer IOSB, Karlsruhe. His research interests include computer vision, analysis of unmanned aerial vehicle video data, image registration, motion detection, object detection, visual tracking, and machine learning.
Jürgen Beyerer is a full professor for informatics at the Institute for Anthropomatics and Robotics at the KIT since March 2004 and director of the Fraunhofer Institute of Optronics, System Technologies and Image Exploitation IOSB in Ettlingen, Karlsruhe, Ilmenau, and Lemgo. His research interests include automated visual inspection, signal and image processing, variable image acquisition and processing, active vision, metrology, information theory, fusion of data and information from heterogeneous sources, system theory, autonomous systems, and automation.