19 May 2016 Pre-trained D-CNN models for detecting complex events in unconstrained videos
Author Affiliations +
Abstract
Rapid event detection faces an emergent need to process large videos collections; whether surveillance videos or unconstrained web videos, the ability to automatically recognize high-level, complex events is a challenging task. Motivated by pre-existing methods being complex, computationally demanding, and often non-replicable, we designed a simple system that is quick, effective and carries minimal overhead in terms of memory and storage. Our system is clearly described, modular in nature, replicable on any Desktop, and demonstrated with extensive experiments, backed by insightful analysis on different Convolutional Neural Networks (CNNs), as stand-alone and fused with others. With a large corpus of unconstrained, real-world video data, we examine the usefulness of different CNN models as features extractors for modeling high-level events, i.e., pre-trained CNNs that differ in architectures, training data, and number of outputs. For each CNN, we use 1-fps from all training exemplar to train one-vs-rest SVMs for each event. To represent videos, frame-level features were fused using a variety of techniques. The best being to max-pool between predetermined shot boundaries, then average-pool to form the final video-level descriptor. Through extensive analysis, several insights were found on using pre-trained CNNs as off-the-shelf feature extractors for the task of event detection. Fusing SVMs of different CNNs revealed some interesting facts, finding some combinations to be complimentary. It was concluded that no single CNN works best for all events, as some events are more object-driven while others are more scene-based. Our top performance resulted from learning event-dependent weights for different CNNs.
© (2016) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Joseph P. Robinson, Yun Fu, "Pre-trained D-CNN models for detecting complex events in unconstrained videos", Proc. SPIE 9871, Sensing and Analysis Technologies for Biomedical and Cognitive Applications 2016, 98710O (19 May 2016); doi: 10.1117/12.2228504; https://doi.org/10.1117/12.2228504
PROCEEDINGS
9 PAGES


SHARE
RELATED CONTENT


Back to Top