Vision-language foundation models for image classification, such as CLIP, suffer from poor performance when applied to images of objects dissimilar to their training data. A relevant example of such a mismatch is the classification of military vehicles. In this work, we investigate techniques to extend the capabilities of CLIP for this application. Our contribution is twofold: (a) we study various techniques to extend CLIP with knowledge of military vehicles, and (b) we propose a two-stage approach to classify novel vehicles based on only one example image.
Our dataset consists of 13 military vehicle classes, with 50 images per class. We studied various techniques to extend CLIP with knowledge of military vehicles, including context optimization (CoOp), vision-language prompting (VLP), and visual prompt tuning (VPT), of which VPT was selected. Next, we studied one-shot learning approaches that allow the extended CLIP model to classify novel vehicle classes based on only one image. The resulting two-stage ensemble approach was evaluated in a series of leave-one-group-out experiments to demonstrate its performance.
Results show that, by default, CLIP has a zero-shot classification performance of 48% for military vehicles. This can be improved to >80% by fine-tuning with example data, at the cost of losing the ability to classify novel (previously unseen) military vehicle types. A naive one-shot approach results in a classification performance of 19%, whereas our proposed one-shot approach achieves 70% for novel military vehicle classes.
In conclusion, our proposed two-stage approach can extend CLIP for military vehicle classification. In the first stage, CLIP is provided with knowledge of military vehicles through domain adaptation with VPT. In the second stage, this knowledge is leveraged for previously unseen military vehicle classes in a one-shot setting.
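To illustrate the second stage, the sketch below classifies a query image by comparing its CLIP image embedding against the embedding of a single reference image per novel class. This is a minimal sketch, not the authors' implementation: it assumes the Hugging Face transformers CLIP API with stock weights (the VPT-adapted backbone from the first stage would be loaded in its place), and the file paths and class labels are hypothetical.

```python
# One-shot classification by cosine similarity of CLIP image embeddings.
# Assumes the VPT-adapted encoder would replace the stock weights loaded here;
# reference/query paths and class names are illustrative placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def embed(paths):
    """Return L2-normalised CLIP image embeddings for a list of image paths."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

# One reference image per novel class (hypothetical paths).
references = {"T-72": "refs/t72.jpg", "Leopard 2": "refs/leopard2.jpg"}
ref_embs = embed(list(references.values()))   # (num_classes, dim)

query_emb = embed(["query.jpg"])              # (1, dim)
scores = query_emb @ ref_embs.T               # cosine similarities
predicted = list(references.keys())[scores.argmax().item()]
print(predicted, scores.tolist())
```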
A new approach for distinguishing neutral (e.g. walking) from threatening (e.g. aiming a handgun) poses, without training, is presented. Various AI-based models can classify human poses, but these often do not generalize to defence scenarios, and the lack of data with threatening poses makes it hard to train new models. Our approach circumvents re-training and is a zero-shot, rule-based classification method for threatening poses. We combine a pretrained body part keypoint detection model with the neuro-symbolic framework Scallop. We compare the pretrained models MMPose and YOLOv8x-pose for keypoint detection. We use images from the YouTube Gun Detection Dataset containing persons holding a weapon and label them manually as having a 'neutral' or 'aiming' pose; the latter is further subdivided into 'aiming a handgun' and 'aiming a rifle'. Scallop is used to define logic-based rules for classification, using the keypoints as input: e.g. the rule 'aiming a handgun' includes 'hands at shoulder height' and 'hands far away from the body'. Recall and precision results for aiming are 0.75/0.81 and 0.83/0.73 for MMPose and YOLOv8x-pose, respectively. Average recall and average precision for 'aiming a handgun' and 'aiming a rifle' are 0.78/0.36 and 0.76/0.43 for MMPose and YOLOv8x-pose, respectively. Combining neuro-symbolic AI with pretrained pose estimation techniques shows promising results for detecting threatening human poses. Performance of neutral-versus-aiming classification is similar for both approaches; however, MMPose performs better for multi-class classification. In future research, we will focus on improving rules, identifying more poses, and using videos to obtain sequences of poses or activities.
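The sketch below is a plain-Python analogue of the 'aiming a handgun' rule quoted above (hands at shoulder height and far from the body). It is illustrative only and not the authors' Scallop program: the COCO 17-keypoint indexing, the threshold values, and the function name are assumptions.

```python
# Plain-Python analogue of the rule 'aiming a handgun': at least one hand
# roughly at shoulder height and far from the body. Thresholds and the
# COCO 17-keypoint order are assumptions, not from the paper.
import numpy as np

# COCO keypoint indices (assumed convention for MMPose / YOLOv8x-pose output).
L_SHOULDER, R_SHOULDER = 5, 6
L_WRIST, R_WRIST = 9, 10
L_HIP, R_HIP = 11, 12

def is_aiming_handgun(kpts: np.ndarray, height_tol=0.15, reach_frac=0.6) -> bool:
    """kpts: (17, 2) array of (x, y) keypoints in image coordinates."""
    shoulder_y = kpts[[L_SHOULDER, R_SHOULDER], 1].mean()
    hip_y = kpts[[L_HIP, R_HIP], 1].mean()
    torso = abs(hip_y - shoulder_y)                 # scale reference
    body_x = kpts[[L_SHOULDER, R_SHOULDER], 0].mean()

    for wrist in (L_WRIST, R_WRIST):
        at_shoulder_height = abs(kpts[wrist, 1] - shoulder_y) < height_tol * torso
        far_from_body = abs(kpts[wrist, 0] - body_x) > reach_frac * torso
        if at_shoulder_height and far_from_body:
            return True
    return False
```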
Automatic object detection is increasingly important in the military domain, with potential applications including target identification, threat assessment, and strategic decision-making. Deep learning has become the standard methodology for developing object detectors, but obtaining the necessary large set of training images can be challenging due to the restricted nature of military data. Moreover, for meaningful deployment, an object detection model needs to work in various environments and conditions, in which prior data acquisition might not be possible. The use of simulated data for model development can be an alternative to real images, and recent work has shown the potential of training a military vehicle detector on simulated data. Nevertheless, fine-grained classification of detected military vehicles, when trained on simulated data, remains an open challenge.
In this study, we develop an object detector for 15 vehicle classes, containing visually similar types such as multiple battle tanks and howitzers. We show that combining a few real data samples with a large amount of simulated data (12,000 images) leads to a significant improvement compared with using either source alone. Adding just two real samples per class improves the mAP to 55.9 [±2.6], compared to 33.8 [±0.7] when only simulated data is used. Further improvements are achieved by adding more real samples and by using Grounding DINO, a foundation model pretrained on vast amounts of data (mAP = 90.1 [±0.5]). In addition, we investigate the effect of simulation variation, which we find to be important even when more real samples are available.
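As a minimal sketch of the data-mixing setup, the snippet below composes a training list from all simulated images plus a small number of real images per class (two, as in the experiment above). The directory layout, file names, and random seed are assumptions, and the detector training step itself (e.g. a YOLO-style trainer or Grounding DINO fine-tuning) is outside this snippet.

```python
# Compose a mixed training set: all simulated images + N real images per class.
# Paths and layout are hypothetical; only the sampling logic is illustrated.
import random
from pathlib import Path

random.seed(0)
REAL_PER_CLASS = 2

simulated = sorted(Path("data/simulated/images").glob("*.jpg"))

real_subset = []
for class_dir in sorted(Path("data/real/images").iterdir()):
    real_images = sorted(class_dir.glob("*.jpg"))
    real_subset += random.sample(real_images, min(REAL_PER_CLASS, len(real_images)))

train_list = simulated + real_subset
Path("train_mixed.txt").write_text("\n".join(str(p) for p in train_list))
print(f"{len(simulated)} simulated + {len(real_subset)} real images")
```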
Automated object detection is becoming more relevant in a wide variety of applications in the military domain, including the detection of drones, ships, and vehicles in video and IR video. In recent years, deep learning-based object detection methods, such as YOLO, have proven promising in many applications. However, current methods have limited success when objects of interest are small in terms of pixels, e.g. objects far away or small objects closer by. This matters because accurate small-object detection translates to early detection, and the earlier an object is detected, the more time is available for action. In this study, we investigate novel image analysis techniques designed to address some of the challenges of (very) small object detection by taking temporal information into account. We implement six methods, of which three are based on deep learning and use the temporal context of a set of frames within a video. These methods consider neighboring frames when detecting objects, either by stacking them as additional channels or by computing difference maps. We compare these spatio-temporal deep learning methods with YOLOv8, which only considers single frames, and with two traditional moving object detection methods. Evaluation is done on a set of videos that encompasses a wide variety of challenges, including various objects, scenes, and acquisition conditions, to demonstrate real-world performance.
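The sketch below illustrates the two temporal-context inputs mentioned above: stacking grayscale neighbouring frames as extra channels, and frame-difference maps with respect to the centre frame. It is a generic illustration under assumed parameters (window size, OpenCV video I/O), not the implementation evaluated in the study, and the detector that consumes these tensors is not shown.

```python
# Build temporal-context inputs around frame t of a video:
# (a) neighbouring grayscale frames stacked as channels,
# (b) absolute difference maps against the centre frame.
import cv2
import numpy as np

def temporal_inputs(video_path: str, t: int, window: int = 1):
    """Return (stacked, diffs) around frame index t.

    stacked: (H, W, 2*window+1) grayscale frames centred on t.
    diffs:   (H, W, 2*window) absolute differences with the centre frame.
    """
    cap = cv2.VideoCapture(video_path)
    frames = []
    for idx in range(t - window, t + window + 1):
        cap.set(cv2.CAP_PROP_POS_FRAMES, max(idx, 0))
        ok, frame = cap.read()
        if not ok:
            raise ValueError(f"could not read frame {idx}")
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    cap.release()

    stacked = np.stack(frames, axis=-1)
    centre = frames[window].astype(np.int16)
    diffs = np.stack(
        [np.abs(f.astype(np.int16) - centre).astype(np.uint8)
         for i, f in enumerate(frames) if i != window],
        axis=-1,
    )
    return stacked, diffs
```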