17 October 2014 Detection of people in military and security context imagery (withdrawal notice)
This paper has been withdrawn by the publisher because it was already published in the following conference:

Electro-Optical Remote Sensing, Photonic Technologies, and Applications VIII; and Military Applications in Hyperspectral Imaging and High Spatial Resolution Sensing II

The correct record for this manuscript can be found here: http://dx.doi.org/10.1117/12.2071902
Shannon, Spier, and Wiltshire: Detection of People in Military and Security Context Imagery



The Development, Concepts and Doctrine Centre (DCDC) paper on the Future Character of Conflict1 discusses how the future operational military environment will likely be congested, contested, constrained, cluttered and connected. The world’s population is expected to grow significantly over the next decades, particularly in the developing world, where the risk of tensions triggered by demand for improved social conditions and access to scarce natural resources is highest. Equally, competition for influence and prosperity may force the spread of instability into more developed nations. An increasing proportion of the global population lives in urban areas, suggesting that future conflicts will be hybrid in nature, combining conventional, irregular and asymmetric threats within the same time and space. This potential environment is further complicated by the likely presence of friendly troops, innocent bystanders or demonstrators who do not pose any threat of direct hostile action.

The traditional use of aerial platforms for intelligence gathering is greatly challenged by the nature of cluttered and congested urban operating environments and restricted lines of sight. It is likely that ground-based acquisition systems will become more prevalent in future urban operations, with exploitation drawn from new and emerging electro-optic surveillance technologies. Recent and current operations in Iraq and Afghanistan undertaken by US and British forces have clearly demonstrated that for ground forces to fight effectively in built-up areas, or to act as aids to the civil power, they must first have access to current and pertinent intelligence about the existing and likely threats they face. In many cases, we remain dependent on the ability of human operators to manually fuse disparate sources of information to assist ground force commanders in making quick, reasoned judgments and assessments in complex and rapidly changing scenarios. Overlaying computer vision methods to provide a level of automation that assists and focuses operator attention, with the goal of reducing cognitive burden, could be of significant value.


People detection from planar images

Significant technical challenges remain in automatically detecting people from ground-based cameras. These include the variability in acquired image quality and scene illumination, changes in the apparent shape of a subject depending on their angle of approach relative to a given camera, occlusion of subjects by objects (including other people) in the field of view, and the resolution of complex interacting and intersecting trajectories of individuals in crowds of unknown size. Our research aimed to identify and test state-of-the-art people detectors that could perform reliably and quickly in challenging military scenarios where individuals may be partially obscured by structures, objects, items they may be carrying or the pose they have adopted. Published research based on feature descriptors such as the histograms of oriented gradients with support vector machine approach (HOG-SVM)2 has shown that although this method is robust in detecting humans in images of limited quality, it fails in cases where the human subject is partially occluded or overlaps another subject.

This research addresses these shortcomings by revisiting the problem based on work published by Felzenszwalb et al.3 into the detection of partially-occluded objects using rigorous probabilistic models. In contrast to the HOG-SVM2 approach, which is based on a hard decision algorithm, this newer method uses a probabilistic framework for object class detection. The published improvements in performance compared to existing methods (reduced false positives and false negatives) in the presence of partial occlusion make it a good candidate that warranted further investigation. Figure 1. depicts an example of a bounded person instance detection using the Felzenszwalb et al.3 method applied to a military context image.

Figure 1.

Occluded detected person example using the Felzenszwalb et al.3 method.


We have also assessed the people detection capabilities of a new linear classification method based upon a two-stage cascaded ranking SVM published by Zhang et al.4 with results suggesting that it may further reduce computational overhead when compared to the state-of-the-art.




Pre-requisite stage

A pre-requisite stage undertaken for each approach was to annotate, with a ground-truth bounding box, every person identified in each test image drawn from the on-line pedestrian and acquired military context databases.


Cascade Object Detection with Deformable Part Models - Non-Linear Classification Method

The first method selected for investigation was that published by Felzenszwalb et al.3 as it remains widely cited in the literature. The approach was based on the construction of cascaded non-linear classifiers from part-based deformable models including pictorial structures. The researchers developed an algorithm based on partial hypothesis pruning, with published results indicating a greater than one order of magnitude improvement in detection speed without sacrificing accuracy when compared to alternative state-of-the-art approaches.

The simplest and most common approach was to apply a binary classification using a sliding window applied at all positions, scales and possible orientations of objects in an image. A significant disadvantage of this method is that testing all points in a search space can be unacceptably slow if the maintenance of detection accuracy remains a desired outcome. An effective solution to this problem was to apply a cascade of simple tests to each hypothesised location to eliminate most of them early in the process. Detection with a deformable part model can be done by considering all possible locations of a distinguished root part and for each of these to find the best configuration of the remaining parts. The emphasis of the method was to focus on root locations that yielded high scoring configurations and to prune low scoring hypotheses using thresholds.
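The pruning idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the published implementation: the part scores, the stage thresholds and the names `cascade_score` and `detect` are all hypothetical, and each hypothesis is assumed to arrive as a root score followed by the best score of each remaining part.

```python
from typing import List, Optional, Sequence, Tuple


def cascade_score(part_scores: Sequence[float],
                  prune_thresholds: Sequence[float]) -> Optional[float]:
    """Accumulate scores stage by stage (root first, then parts).

    Returns None as soon as the partial sum falls below the threshold
    for that stage, so the remaining parts are never evaluated.
    """
    total = 0.0
    for score, threshold in zip(part_scores, prune_thresholds):
        total += score
        if total < threshold:
            return None  # hypothesis pruned early
    return total


def detect(hypotheses: Sequence[Sequence[float]],
           prune_thresholds: Sequence[float],
           final_threshold: float) -> List[Tuple[int, float]]:
    """Return (hypothesis index, score) for every surviving hypothesis."""
    kept = []
    for index, part_scores in enumerate(hypotheses):
        score = cascade_score(part_scores, prune_thresholds)
        if score is not None and score >= final_threshold:
            kept.append((index, score))
    return kept


# Hypothetical scores for three root locations: [root, part 1, part 2].
hypotheses = [
    [1.2, 0.8, 0.5],   # strong root and parts: survives every stage
    [-1.5, 0.9, 0.9],  # weak root: pruned after the first stage
    [0.4, -1.0, 2.0],  # pruned after the second stage
]
prune_thresholds = [-1.0, 0.0, 0.5]  # hypothetical per-stage thresholds
detections = detect(hypotheses, prune_thresholds, final_threshold=1.0)
print(detections)
```

The third hypothesis would have reached a total of 1.4 had all its parts been evaluated, which illustrates the trade-off: aggressive stage thresholds save computation but can discard hypotheses that a full evaluation would accept.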


Proposal Generation Based on Ranked SVM - Linear Classification Method

The second method selected was based on a two-stage cascaded model with flexibility for future stages to be added to potentially further enhance detection sensitivity.

The algorithmic approach taken by Zhang et al.4 was to convolve an image with a set of linear classifiers at varying scales and aspect ratios to produce response images. Local maxima were extracted from each response image and the corresponding windows forwarded to the second stage of the cascade. Each window was associated with a feature vector and a second round of ranking was applied to order the proposals such that true positives were given a higher ranking during training. The method output the highest ranking windows in a final list. Detection threshold values were applied at each scale to train a linear classifier for each scenario such that the ranking scores for positive training windows exceeded those of windows defined as negative. The problem was thus cast as a ranking SVM with the purpose of building a proposal pool to be passed to the second stage of the cascade. Proposal selection was achieved by determining the local maxima within the response image of each classifier and defining the maximum number of windows to be passed to the second stage, which then re-ranked these globally to identify the best proposals across all scales.
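The two-stage flow described above (local maxima per response image, then a global re-ranking) might be sketched as follows. This is an illustrative sketch only: the response images are tiny hand-made grids, and the second-stage score used here (the stage-one score plus a per-classifier offset) is a hypothetical stand-in for the learned ranking function applied to each window's feature vector.

```python
from typing import Callable, List, Sequence, Tuple

# (stage-one score, classifier index, row, column)
Proposal = Tuple[float, int, int, int]


def local_maxima(response: Sequence[Sequence[float]],
                 classifier: int, max_windows: int) -> List[Proposal]:
    """Return up to max_windows strict local maxima (8-neighbourhood)."""
    rows, cols = len(response), len(response[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            value = response[r][c]
            neighbours = [response[rr][cc]
                          for rr in range(max(0, r - 1), min(rows, r + 2))
                          for cc in range(max(0, c - 1), min(cols, c + 2))
                          if (rr, cc) != (r, c)]
            if all(value > n for n in neighbours):
                peaks.append((value, classifier, r, c))
    return sorted(peaks, reverse=True)[:max_windows]


def propose(responses: Sequence[Sequence[Sequence[float]]],
            max_windows: int,
            rerank: Callable[[Proposal], float],
            n_final: int) -> List[Proposal]:
    """Stage one builds the proposal pool; stage two re-ranks it globally."""
    pool: List[Proposal] = []
    for classifier, response in enumerate(responses):
        pool.extend(local_maxima(response, classifier, max_windows))
    return sorted(pool, key=rerank, reverse=True)[:n_final]


# Two hypothetical response images, one per scale/aspect-ratio classifier.
responses = [
    [[0.1, 0.2, 0.1], [0.2, 0.9, 0.3], [0.1, 0.3, 0.2]],
    [[0.7, 0.1, 0.0], [0.1, 0.0, 0.1], [0.0, 0.1, 0.6]],
]
offsets = [0.0, 0.25]  # hypothetical second-stage adjustment per classifier
final = propose(responses, max_windows=2,
                rerank=lambda p: p[0] + offsets[p[1]], n_final=2)
print(final)
```

In this toy example the global re-ranking promotes a window from the second classifier above the strongest stage-one response, which is the purpose of comparing proposals across scales rather than trusting the raw per-classifier scores.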



We evaluated the Felzenszwalb et al.3 person detector using both on-line upright pedestrian and military context image databases. For this purpose, we used the part-based model provided in the open source computer vision software library, OpenCV5. This model was trained using the PASCAL datasets6, which contained different content to the datasets used for the evaluation. We used our annotated ground truth to compute performance metrics on the detection results. Efficacy was established by calculating the precision, recall and Receiver Operating Characteristic (ROC) curves of the method and comparing the outcomes obtained from upright pedestrian and military context imagery. Performance testing included presenting images containing an individual fully visible or partly occluded in differing relative orientations, poses and distances from the camera to assess how well the method coped with differing subject sizes, aspect ratios and degrees of obscuration within each image.

The approach of Zhang et al.4 was at an early point in development during our research, so it was evaluated by determining whether the expected proposals had been created, as indicated by the presence of a box bounding the person in images from the military context database.


On-line pedestrian database

The Penn-Fudan upright pedestrian database7 consists of 170 images of 345 labeled and upright pedestrians at different scales, orientations, actions and degrees of occlusion. The data has been used extensively in people detection research and was considered a useful source to determine the baseline performance of the Felzenszwalb et al.3 approach prior to testing with the potentially more demanding military context imagery.


Military context database

As part of the study, 431 images of trained infantrymen using light infantry weapons or civilian tools were acquired with the goal of providing challenging and pertinent data to test the detection methods. One soldier was dressed in helmet, woodland disruptive pattern material uniform (CS 95) and personal webbing, with the other wearing civilian clothes. The infantrymen were instructed to use conventional weapon handling skills and tactics when using NATO force weapons, and to apply more flexible handling and tactics when using more widely available equipment, to simulate both fully and partially trained combatant scenarios. Civilian poses likely to be encountered in current areas of operation were also included to further enhance the investigation. Table 1. lists the unloaded, training or deactivated weapons and equipment that were handled by the infantrymen during the acquisition session. Table 2. lists the poses captured during the acquisition session.

Table 1.

Weapons and equipment used during the data collection.

| Weapon or equipment | Description |
|---|---|
| AK47 | Assault rifle |
| SA80 | L85 A2 Individual weapon |
| SA80 | L86 A2 Light support weapon |
| LMG | MiniMi Light Machine Gun |
| RPG-7 | Hand-held anti-tank grenade launcher |
| L109A1 | HE Fragmentation Grenade |

Table 2.

Poses and activities assumed during the acquisition session.

| Pose and Activities | Pose and Activities |
|---|---|
| Aimed - standing unsupported | Observation - prone |
| Aimed - kneeling unsupported | Grenade throwing |
| Aimed - sitting unsupported | Standing Surrender – hands up |
| Aimed - prone | Standing Surrender – hands forward |
| Carrying Weapons - on march | Standing Surrender – with object or weapon |
| Carrying Weapons - pre-assault | Walking – carrying no object |
| Carrying Weapons - assault | Walking – carrying inoffensive object |
| Aimed - standing supported - below chest occlusion | Standing – carrying inoffensive object |
| Aimed - standing supported - side body occlusion | Sweeping |
| Observation - standing observation - below chest occlusion | Sitting |
| Observation - standing windowed | Sitting - holding inoffensive object |
| Observation - kneeling | Squatting |



We evaluated the performance of the non-linear part-based person detector method described by Felzenszwalb et al.3 using the Penn-Fudan pedestrian7 and the military context databases. The former dataset contained crowded scenes including self-occlusions and the latter contained images of partially occluded soldiers, in some cases wearing military camouflage and personal equipment. The classifier returned a set of detections $D = \{d_i \mid i = 1, \dots, N\}$, with each detection $d_i$ identified by a bounding box embedded in the image and a confidence score corresponding to the classifier's assessment of the likely presence of a person within that box. If a box corresponded to a true person in the image, it was defined as a true positive detection; otherwise it was regarded as a false positive outcome.

Based on the confidence score, we filtered the results to obtain a desired trade-off between the rates of false positive and true positive detections. We used a pre-defined score ratio, Θ, to filter the detection results. If the top detection score in a test image $I$ was $S_M$, then all detection scores that lay below the high confidence threshold defined by $(1 - \operatorname{sgn}(S_M)\,\Theta)\,S_M$ were rejected. The remaining detection results were then compared with the ground truth annotations based on a 50% overlap acceptance criterion. A false negative detection most likely occurred if the overlap criterion was not met. To prevent unwanted false positives in images not containing any person instances, we also set a minimum confidence score.
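A minimal sketch of this filtering step is given below. Two points are assumptions rather than details stated in the text: the 50% overlap is computed here as intersection over union, and the minimum confidence score is applied as a floor on the threshold. The function names and the example boxes are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Detection = Tuple[Box, float]            # (bounding box, confidence score)


def sgn(x: float) -> int:
    return (x > 0) - (x < 0)


def high_confidence_threshold(s_max: float, theta: float) -> float:
    """(1 - sgn(S_M) * theta) * S_M, for the top score S_M in the image."""
    return (1 - sgn(s_max) * theta) * s_max


def filter_detections(detections: List[Detection], theta: float,
                      score_floor: float) -> List[Detection]:
    """Keep detections at or above the threshold and the score floor."""
    if not detections:
        return []
    s_max = max(score for _, score in detections)
    threshold = max(high_confidence_threshold(s_max, theta), score_floor)
    return [(box, score) for box, score in detections if score >= threshold]


def overlap(a: Box, b: Box) -> float:
    """Intersection over union of two boxes (assumed overlap measure)."""
    width = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    height = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = width * height
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - intersection)
    return intersection / union if union > 0 else 0.0


# Hypothetical detections and ground truth for one image.
ground_truth: Box = (10, 10, 50, 110)
detections: List[Detection] = [
    ((10, 10, 50, 110), 0.8),
    ((12, 14, 52, 112), 0.5),
    ((200, 40, 230, 100), -0.9),
]
kept = filter_detections(detections, theta=0.5, score_floor=-0.5)
accepted = [d for d in kept if overlap(d[0], ground_truth) >= 0.5]
print(len(kept), len(accepted))
```

Note that the sign term makes the threshold fall below the top score whether that score is positive or negative: with Θ = 0.5, a top score of 0.8 gives a threshold of 0.4, while a top score of -0.2 gives -0.3.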


Outcomes from the analysis of the military context imagery using the non-linear classification method.

Please note the black anonymity bars have been added after analysis.

Figure 2(a). depicts an example of a true positive outcome from an image analysis using the Felzenszwalb et al.3 method, as demonstrated by the close correspondence between the ground truth bounding box (blue) and the person detection outcome (red). Figure 2(b). depicts an example where the outcome bounding box overlapped less than 50% of the ground truth box. As such detections failed the acceptance criterion, they were defined as false negative outcomes. Figure 2(c). depicts an example where an inanimate object was incorrectly classified as a person instance and so was defined as a false positive outcome.

Figure 2.

(a) True positive (b) False negative and (c) False positive detections using the Felzenszwalb et al.3 method. (Blue bounding box = ground truth, Red box = detection).



Evaluation of the overall performance of the non-linear classification method

We evaluated the overall performance of the Felzenszwalb et al.3 part-based people detection method on the reported experiments using precision-recall and ROC (Receiver Operating Characteristic) graphs. ROC curves are commonly used to present the results of binary decision problems in machine learning, whereas precision-recall curves offer additional insight into an algorithm's performance, particularly when highly skewed datasets are analyzed. Both curves are presented in this study to illustrate the performance of the method on both datasets.

We considered different ratio thresholds, Θ, and compared the performance metrics using the confidence scores of the filtered detection results. Θ can be considered as a constant that is applied to the results to set the desired performance of the people detection algorithm in terms of the relation between precision and recall. The precision-recall curves for the two datasets are shown in Figure 3. for different values of Θ. Similarly, the ROC curves corresponding to these evaluations are shown in Figure 4. The diagonal line in Figure 4. depicts the baseline performance of a random classifier and therefore a good classifier's curve should be above it.

Figure 3.

Precision-Recall curves depicting results from (a) Upright pedestrian and (b) Military context image databases using the Felzenszwalb et al.3 method.


Figure 4.

ROC curves depicting results from (a) Upright pedestrian and (b) Military context image databases using the Felzenszwalb et al.3 method.
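The way Θ moves a detector along a precision-recall curve can be illustrated with a small sketch. The data below are invented for illustration and are not results from this study; each detection is assumed to have already been matched against the ground truth, and `precision_recall` is a hypothetical helper name.

```python
from typing import List, Tuple

# Per image: ([(score, matches_ground_truth), ...], number of annotated persons)
ImageResult = Tuple[List[Tuple[float, bool]], int]


def precision_recall(images: List[ImageResult],
                     theta: float) -> Tuple[float, float]:
    """Precision and recall after filtering each image with ratio theta."""
    true_pos = false_pos = annotated = 0
    for detections, person_count in images:
        annotated += person_count
        if not detections:
            continue
        s_max = max(score for score, _ in detections)
        sign = (s_max > 0) - (s_max < 0)
        threshold = (1 - sign * theta) * s_max
        for score, is_person in detections:
            if score >= threshold:
                if is_person:
                    true_pos += 1
                else:
                    false_pos += 1
    kept = true_pos + false_pos
    precision = true_pos / kept if kept else 1.0
    recall = true_pos / annotated if annotated else 1.0
    return precision, recall


# Invented example: three images containing four annotated persons in total.
images: List[ImageResult] = [
    ([(1.0, True), (0.6, True), (0.3, False)], 2),
    ([(0.8, True), (0.5, False)], 1),
    ([(0.4, False), (0.1, True)], 1),
]
curve = [(theta, *precision_recall(images, theta)) for theta in (0.0, 0.5, 1.0)]
for theta, precision, recall in curve:
    print(f"theta={theta:.1f} precision={precision:.2f} recall={recall:.2f}")
```

In this toy data, raising Θ admits more detections per image, so recall rises from 0.50 to 1.00 while precision falls from about 0.67 to about 0.57, which is the trade-off the curves in Figure 3. summarise.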



Outcomes from the analysis of the military context database using the linear classification method

Figure 5. depicts examples of true positive outcomes using the Zhang et al.4 method, as demonstrated by a white bounding box correctly identifying the presence of persons within the images.

Figure 5.

True positive detections using the Zhang et al.4 method.



People detection speed estimation

Five hundred 1280 x 720 military context images were chosen to investigate the speed performance of the Felzenszwalb et al.3 part-based people detection method, implemented in 32-bit C++ and based on the OpenCV5 published algorithm. The computer used for the tests contained an Intel® Core™ i5-2520M 2.5 GHz processor and 4 GB of memory. The mean time to a solution per image was 10.38 ± 0.2 s (mean ± 1 SD).


Detection improvement and operator focus strategies

We investigated a number of strategies to further improve detection in images including setting likely person instance size boundaries, masking areas in fixed camera images that contain objects likely to generate false positives, displaying lower confidence bounding boxes and fixed camera image background subtraction. To enhance operator focus we introduced the concept of path tracking tails.


Predefined instance detection size boundaries

A review of the Penn-Fudan7 and military context databases, together with supporting video streams of persons in scenes, indicated that the false positive rate could be reduced further if the range of acceptable bounding box sizes was pre-defined. We observed that car headlights, circular wall lamps, warning signage displaying no-entry symbols, textured surfaces and trees were particularly prone to false detection.
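A size gate of this kind could look like the sketch below. The pixel limits are hypothetical values chosen for illustration; in practice they would depend on the camera geometry and the expected range to subjects, which the text does not specify.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def within_size_bounds(box: Box,
                       min_width: float, max_width: float,
                       min_height: float, max_height: float) -> bool:
    """True if the box width and height both fall inside the allowed range."""
    width = box[2] - box[0]
    height = box[3] - box[1]
    return min_width <= width <= max_width and min_height <= height <= max_height


# Hypothetical limits (pixels) for a person at the ranges of interest.
LIMITS = dict(min_width=30, max_width=250, min_height=80, max_height=600)

candidates: List[Box] = [
    (100, 50, 160, 230),   # 60 x 180: plausible standing person
    (300, 200, 330, 230),  # 30 x 30: e.g. a headlight-sized response
    (0, 0, 900, 700),      # 900 x 700: far larger than any expected person
]
plausible = [box for box in candidates if within_size_bounds(box, **LIMITS)]
print(plausible)
```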


Image Masking

We observed that warning signs, particularly those that used a circular shape on a textured surface or contained embedded numbers, could generate higher than acceptable levels of false positives. One solution, suitable only for fixed camera applications, was to apply a mask by setting the affected pixels to a single uniform color prior to analysis by the Felzenszwalb et al.3 method. A disadvantage of the approach was that it could not be used if any mask lay within areas likely to contain a person instance.


Display of lower confidence bounding boxes

The original approach was to reject all detections where the score lay below a high confidence threshold value, defined by $(1 - \operatorname{sgn}(S_M)\,\Theta)\,S_M$, where $S_M$ was the maximum detection score obtained in a given image and Θ was the pre-defined score ratio. We found some value, particularly when attempting to identify partially obscured person instances, in also including detections whose scores lay below the threshold, described as low confidence bounding boxes. The resulting boxes were displayed in a different color to highlight their differing status. The method applied was to set a pre-defined minimum and accept all detections that lay between this minimum and the high confidence threshold value. Table 3. lists the average number of low confidence detections obtained for differing minima settings, taken from 100 frames of a sequence containing one walking person identified by the presence of a high confidence bounding box.

Table 3.

Low confidence detections for differing minima settings.

| Minima Setting (Θ = 0) | Average Number of False Positive Low Confidence Detections/Frame (Mean ± 1SD) | Number of Frames Containing a False Positive Low Confidence Detection (100 maximum) |
|---|---|---|
| None | 67.24 ± 6.37 | 100 |
| - High Confidence Threshold | 59.74 ± 15.08 | 100 |
| -1.0 | 16.57 ± 2.74 | 100 |
| -0.5 | 6.54 ± 1.62 | 100 |
| 0 | 1.75 ± 0.79 | 95 |
| High Confidence Threshold | 0 | 0 |
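The split into high confidence, low confidence and rejected detections can be sketched as follows. The scores are invented for illustration and `split_by_confidence` is a hypothetical name; the bands follow the description in the text, with low confidence scores lying strictly above the minimum and below the high confidence threshold.

```python
from typing import Dict, List


def split_by_confidence(scores: List[float], theta: float,
                        minimum: float) -> Dict[str, List[float]]:
    """Assign each detection score to a display band."""
    s_max = max(scores)
    sign = (s_max > 0) - (s_max < 0)
    high_threshold = (1 - sign * theta) * s_max
    return {
        # drawn in the normal box color
        "high": [s for s in scores if s >= high_threshold],
        # drawn in a different color to flag lower confidence
        "low": [s for s in scores if minimum < s < high_threshold],
        # not displayed
        "rejected": [s for s in scores if s < high_threshold and s <= minimum],
    }


# Invented scores for one frame, with theta = 0 and a minimum of -0.5.
bands = split_by_confidence([0.9, 0.2, -0.3, -0.7, -1.4],
                            theta=0.0, minimum=-0.5)
print(bands)
```

Lowering the minimum admits more low confidence boxes, which is consistent with the rising false positive counts per frame that Table 3. reports as the minima setting becomes more permissive.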

The principle was tested by comparing detection performance when displaying only scores that equaled or exceeded the high confidence threshold against performance when lower confidence scores, greater than -0.5 and below the high confidence threshold, were also included. The imagery used for the test contained a single subject carrying an AK47 assault rifle, running behind a low wall and then aiming at the camera over the wall in the kneeling supported pose. Table 4. lists the results obtained.

Table 4.

High and low confidence detection performance.

| Detection | High Confidence Detections | High and Low Confidence Detections |
|---|---|---|
| True Positive | 150 | 198 |
| False Positive | 15 | 106 |
| False Negative | 135 | 20 |
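The trade-off in Table 4. can be made explicit by converting the counts to precision and recall. The sketch below reads the table as 150, 15 and 135 (true positive, false positive, false negative) for high confidence detections only, and 198, 106 and 20 when low confidence detections are included; these derived figures are a worked illustration and are not reported in the paper.

```python
from typing import Tuple


def precision_and_recall(true_pos: int, false_pos: int,
                         false_neg: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


high_only = precision_and_recall(150, 15, 135)
high_and_low = precision_and_recall(198, 106, 20)
print(f"high only:    precision={high_only[0]:.2f} recall={high_only[1]:.2f}")
print(f"high and low: precision={high_and_low[0]:.2f} recall={high_and_low[1]:.2f}")
```

On this reading, including the low confidence boxes raises recall from roughly 0.53 to 0.91 while precision drops from roughly 0.91 to 0.65.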


Background subtraction

The approach taken for fixed camera applications was to define a background reference image as one that did not contain any persons in the scene. The updating frequency of the reference image was dependent on changes in environmental conditions, particularly the impact of variations in ambient lighting. We performed background subtraction on each image containing a person instance using the well understood Mixture of Gaussians (MOG) algorithm8 to generate foreground masks that were then overlaid to isolate any person instance detections prior to analysis using the deformable part-based model. We found the approach to be potentially useful when circular objects such as parked car headlights, hub caps, lights or signage were present in a scene.
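The foreground mask idea can be illustrated with a deliberately simplified model. The sketch below is not the Mixture of Gaussians algorithm used in the study: it keeps a single running Gaussian (mean and variance) per pixel on grayscale values, which is enough to show how a reference image yields a mask but lacks the multi-modal background handling of MOG. The class name and all parameter values are hypothetical.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel intensities


class RunningGaussianBackground:
    """Single-Gaussian-per-pixel background model (simplified stand-in for MOG)."""

    def __init__(self, reference: Image, learning_rate: float = 0.05,
                 k: float = 2.5, initial_variance: float = 25.0,
                 min_variance: float = 4.0) -> None:
        self.mean = [[float(v) for v in row] for row in reference]
        self.var = [[initial_variance] * len(row) for row in reference]
        self.lr = learning_rate
        self.k = k
        self.min_var = min_variance

    def apply(self, frame: Image) -> List[List[int]]:
        """Return a mask (1 = foreground) and update background pixels only."""
        mask = []
        for r, row in enumerate(frame):
            mask_row = []
            for c, value in enumerate(row):
                diff = value - self.mean[r][c]
                # Foreground if more than k standard deviations from the mean.
                is_foreground = diff * diff > self.k ** 2 * self.var[r][c]
                mask_row.append(1 if is_foreground else 0)
                if not is_foreground:
                    self.mean[r][c] += self.lr * diff
                    new_var = self.var[r][c] + self.lr * (diff * diff - self.var[r][c])
                    self.var[r][c] = max(self.min_var, new_var)
            mask.append(mask_row)
        return mask


# Empty-scene reference, then a frame with slight lighting drift at (0, 0)
# and a new bright object at (1, 2).
reference = [[100.0] * 4 for _ in range(3)]
frame = [row[:] for row in reference]
frame[0][0] = 103.0
frame[1][2] = 180.0

model = RunningGaussianBackground(reference)
mask = model.apply(frame)
print(mask)
```

The small lighting change is absorbed into the background model while the new object is flagged, which mirrors the role of the reference image update described above.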


Tracking tails

To improve operator estimation of the likely location of people moving behind objects in a scene, we implemented a very simple Kalman9 filter to present a tracking tail. Figure 6. depicts the concept applied to a single subject.

Figure 6.

Tracking tails using a basic Kalman9 filter.
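A tracking tail of this kind can be sketched with a constant-velocity Kalman filter run independently on the x and y coordinates of the bounding box centre. This is an assumed formulation for illustration: the state model, the noise values and the names `AxisKalman` and `tracking_tail` are hypothetical, and the filter used in the study may have differed.

```python
from typing import List, Optional, Tuple


class AxisKalman:
    """Constant-velocity Kalman filter for one coordinate (state: position, velocity)."""

    def __init__(self, position: float, process_noise: float = 1e-2,
                 measurement_noise: float = 1.0) -> None:
        self.p, self.v = float(position), 0.0
        self.P = [[1.0, 0.0], [0.0, 1.0]]  # state covariance
        self.q, self.r = process_noise, measurement_noise

    def predict(self, dt: float = 1.0) -> float:
        (a, b), (c, d) = self.P
        self.p += dt * self.v
        # P = F P F^T + Q with F = [[1, dt], [0, 1]] and Q = q * I
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]
        return self.p

    def update(self, measurement: float) -> None:
        (a, b), (c, d) = self.P
        s = a + self.r             # innovation covariance (H = [1, 0])
        k0, k1 = a / s, c / s      # Kalman gain
        residual = measurement - self.p
        self.p += k0 * residual
        self.v += k1 * residual
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]


def tracking_tail(centres: List[Optional[Tuple[float, float]]],
                  tail_length: int = 10) -> List[Tuple[float, float]]:
    """Filtered path; frames with no detection (None) are bridged by prediction."""
    first = next(c for c in centres if c is not None)
    kx, ky = AxisKalman(first[0]), AxisKalman(first[1])
    tail = []
    for centre in centres:
        x, y = kx.predict(), ky.predict()
        if centre is not None:
            kx.update(centre[0])
            ky.update(centre[1])
            x, y = kx.p, ky.p
        tail.append((x, y))
    return tail[-tail_length:]


# Hypothetical box centres: a person walks right, then passes behind an object.
centres = [(0, 50), (5, 50), (10, 50), (15, 50), (20, 50), None, None, (35, 50)]
tail = tracking_tail(centres)
print(tail)
```

During the two frames with no detection the tail continues to advance along the estimated direction of travel, which is the cue intended to help an operator anticipate where an occluded person is likely to reappear.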




The graphs depicted in Figures 3. and 4. provide an indication of the robustness of the person detection results obtained using the Felzenszwalb et al.3 deformable part-based model method. Within the military context dataset at the high confidence threshold value of 0, the person detector yielded a precision of approximately 65% for a recall rate of around 85%. For the same recall rate, a precision of approximately 85% was achieved when applied to the Penn-Fudan upright pedestrian dataset7. We expected the results obtained from the upright pedestrian dataset to be higher and observed that the outcomes of this study compared favourably with those reported by Felzenszwalb et al.3 when applied to a range of other on-line imagery databases. Further work could be undertaken in future studies to create a more comprehensive military context training dataset with the goal of further improving precision-recall performance. Furthermore, it could be argued that the true performance may actually be greater than the reported rates. We believe this may be due to discrepancies in the ground truth annotations of both the military context and upright pedestrian datasets (for example, non-inclusion of a cyclist or out-of-focus people in the annotations). These discrepancies are hard to weed out manually due to the large number of images and annotations, and the missing ground truth data for these cases penalises correct detection results. It should also be noted that a significant number of false negative outcomes in the military context dataset correctly identified the presence of a person in an image but were rejected under the pre-defined 50% overlap criterion. The preliminary investigation of the potentially more computationally efficient method of Zhang et al.4 demonstrated that it was capable of successfully detecting obscured persons in planar images, making it an alternative approach worthy of further investigation as it gains wider acceptance.

We have presented a number of simple strategies aimed at improving detection performance for both fixed and moving camera applications. In all cases, however, a caveat applies: potential improvements in person detection performance can come at the cost of increased false positive rates, possible operator distraction and loss of scene information. The usefulness of a Kalman9 filter to provide tracking tails in two-dimensional images is highly dependent on the absence of false positive detections and on robust de-confliction approaches where the trajectories of multiple co-located people are likely to cross.

Song et al.10 have recently published an encouraging paper indicating that the Felzenszwalb et al.3 deformable part-based model (DPM) could be applied as a parallel implementation, making it suitable for use with readily available parallel computational devices such as Graphics Processing Units (GPU). DPMs have been shown in a number of independent studies to yield high levels of classification accuracy in multiple benchmark challenges but remain computationally demanding, which limits their current practical usefulness when applied to the analysis of military and security context imagery. In implementing the DPM under the GPU (NVidia®, Santa Clara, CA.) Compute Unified Device Architecture programming paradigm11, Song et al.10 reported solution time improvements of an order of magnitude for single object classes when compared to the approaches we investigated. The research outcomes of Song et al.10 suggest that the opportunity may exist to apply the Felzenszwalb et al.3 method to challenging person detection problems today.


The research work was supported by the United Kingdom Defence Science and Technology Laboratory. The significant contribution to the research by our former colleague, Dr. Ali Shahrokni, is also gratefully acknowledged.



[1] Development, Concepts and Doctrine Centre, “Future Character of Conflict,” 2011, https://www.gov.uk/government/publications/future-character-of-conflict (Accessed 8th August 2014).

[2] Dalal, N. and Triggs, B., “Histograms of oriented gradients for human detection,” Proc. Computer Vision and Pattern Recognition, 886-893 (2005).

[3] Felzenszwalb, P., Girshick, R., McAllester, D. and Ramanan, D., “Object Detection with Discriminatively Trained Part Based Models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 32 (9), 1627-1645 (2010).

[4] Zhang, Z., Warrell, J. and Torr, P.H.S., “Proposal Generation for Object Detection using Cascaded Ranking SVM,” Proc. Computer Vision and Pattern Recognition, 1497-1504 (2011).

[5] http://opencv.org (Accessed 8th August 2014).

[6] Everingham, M., Van-Gool, L., Williams, C.K.I., Winn, J. and Zisserman, A., “The PASCAL Visual Object Classes Challenge Results,” 2012, http://www.pascalnetwork.org/challenges/VOC/voc2012/workshop/index.html.

[8] KaewTraKulPong, P. and Bowden, R., “An improved adaptive background mixture model for real-time tracking with shadow detection,” Proc. 2nd European Workshop on Advanced Video-Based Surveillance Systems, 2001, http://personal.ee.surrey.ac.uk/Personal/R.Bowden/publications/avbs01/avbs01.pdf.

[9] Kalman, R. E., “A New Approach to Linear Filtering and Prediction Problems,” Journal of Basic Engineering 82 (1), 35–45 (1960).

[10] Song, H.O., Zickler, S., Althoff, T., Girshick, R. et al., “Sparselet Models for Efficient Multiclass Object Detection,” ECCV, Lecture Notes in Computer Science, 802-815 (2012).

[11] http://www.nvidia.com/CUDA (Accessed 8th August 2014).

© (2014) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Thomas M. L. Shannon, Ben Wiltshire, Emmet H. Spier, "Detection of people in military and security context imagery (withdrawal notice)", Proc. SPIE 9248, Unmanned/Unattended Sensors and Sensor Networks X, 92480N (17 October 2014); doi: 10.1117/12.2069906; https://doi.org/10.1117/12.2069906
