To implement a data protection concept for our mobile sensor platform (MODISSA), we designed and built an anonymization pipeline. This pipeline contains plugins for reading, modifying, and writing different image formats, as well as methods to detect the regions that should be anonymized. These include a method to determine head positions and an object detector for license plates, both based on state-of-the-art deep learning methods. The methods are applied to all image sensors on the platform, regardless of whether they are panoramic RGB, thermal IR, or grayscale cameras. In this paper, we focus on the entire face anonymization process. We determine the face region to anonymize on the basis of body pose estimates from OpenPose, which proved to yield robust results. Our anonymization pipeline achieves nearly human performance while requiring almost no human resources. However, to achieve perfect anonymization, a quick additional interactive postprocessing step can be performed by a human. We evaluated our pipeline quantitatively and qualitatively on urban example data recorded with MODISSA.
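The abstract does not spell out how the face region is derived from the pose estimate, so the following is a minimal sketch under assumptions: OpenPose BODY_25 keypoint indexing, a (25, 3) keypoint array per person, and an assumed confidence threshold and margin factor.

```python
import numpy as np

# Head-related BODY_25 keypoint indices: nose, both eyes, both ears.
HEAD_IDS = [0, 15, 16, 17, 18]

def face_region(keypoints, min_conf=0.1, margin=0.5):
    """keypoints: (25, 3) array of (x, y, confidence) for one person.
    Returns an enlarged bounding box around the head, or None."""
    head = keypoints[HEAD_IDS]
    head = head[head[:, 2] > min_conf]       # keep confident joints only
    if len(head) == 0:
        return None                          # head not visible
    x_min, y_min = head[:, :2].min(axis=0)
    x_max, y_max = head[:, :2].max(axis=0)
    w, h = x_max - x_min, y_max - y_min
    # enlarge the box so hair and chin are covered before blurring
    return (x_min - margin * w, y_min - margin * h,
            x_max + margin * w, y_max + margin * h)
```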
Images or videos recorded in public areas may contain personal data such as license plates. According to German law, such data may not be stored without either the permission of the affected people or immediate anonymization of the personal information in the recordings. Since obtaining permission is practically impossible and manual anonymization is time consuming, an automated license plate detection and localization system is developed. For the implementation, a two-stage neural network approach is chosen that hierarchically combines a YOLOv3 model for vehicle detection with another YOLOv3 model for license plate detection. The model is trained on a specifically composed dataset that includes synthesized images and low-quality or non-annotated datasets, complemented by data augmentation methods. The license plate detection system is quantitatively and qualitatively evaluated, yielding an average precision (AP) of 98.73% at an intersection-over-union threshold of 0.3 on the openALPR dataset and showing outstanding robustness even for rotated, small-scale, or partly covered license plates.
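A minimal sketch of the two-stage cascade described above, assuming two trained YOLOv3 detectors behind a hypothetical common interface `detect(image) -> [(x, y, w, h, score)]` with integer pixel coordinates:

```python
def detect_license_plates(image, vehicle_detector, plate_detector):
    """Stage 1 finds vehicles; stage 2 searches each vehicle crop for plates."""
    plates = []
    for (vx, vy, vw, vh, _) in vehicle_detector.detect(image):
        crop = image[vy:vy + vh, vx:vx + vw]           # vehicle region only
        for (px, py, pw, ph, score) in plate_detector.detect(crop):
            # map plate coordinates from the crop back to the full image
            plates.append((vx + px, vy + py, pw, ph, score))
    return plates
```

Restricting the second stage to vehicle crops shrinks the search space and lets the plate detector operate at a higher effective resolution.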
In intelligent video surveillance, cameras record image sequences during day and night. Commonly, this demands different sensors. To achieve better performance, it is not unusual to combine them. We focus on the case where a long-wave infrared camera records continuously, a second camera additionally records in the visible spectral range during daytime, and an intelligent algorithm supervises the acquired imagery. More precisely, our task is multispectral CNN-based object detection. At first glance, images from the visible spectral range differ from thermal infrared ones in containing color and distinct texture information on the one hand, and in lacking information about the thermal radiation emitted by objects on the other. Although color can provide valuable information for classification tasks, effects such as varying illumination and the particularities of different sensors still pose significant problems. Moreover, obtaining sufficient and practical thermal infrared datasets for training a deep neural network remains a challenge. For this reason, training with the help of data from the visible spectral range could be advantageous, particularly if the data to be evaluated contains both visible and infrared imagery. However, there is no clear evidence of how strongly variations in thermal radiation, shape, or color information influence classification accuracy. To gain deeper insight into how Convolutional Neural Networks make decisions and what they learn from different sensor input data, we investigate the suitability and robustness of different augmentation techniques. We use the publicly available large-scale multispectral ThermalWorld dataset, consisting of images in the long-wave infrared and visible spectral range showing persons, vehicles, buildings, and pets, and train a Convolutional Neural Network for image classification. The training data is augmented with several modifications based on these different properties to determine which ones have which impact and lead to the best classification performance.
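The abstract does not list the concrete augmentations, so the following is an illustrative sketch of two contrasting pipelines, one perturbing color and one suppressing it entirely to mimic the absence of color in thermal imagery; all parameter values are assumptions:

```python
from torchvision import transforms

# Pipeline A: keep color but perturb it, probing robustness to illumination
# and sensor-specific color effects.
augment_color = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
])

# Pipeline B: remove color entirely, so the network must rely on shape and
# texture cues that are also present in long-wave infrared images.
augment_no_color = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomGrayscale(p=1.0),
    transforms.ToTensor(),
])
```

Comparing classification accuracy between networks trained with such pipelines isolates the contribution of individual image properties.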
We are living in a world dependent on sophisticated technical infrastructure. Malicious manipulation of such critical infrastructure poses an enormous threat to all its users. Thus, operating a critical infrastructure requires special attention to log planned maintenance and to detect suspicious events. Towards this end, we present a knowledge-based surveillance approach capable of logging visually observable events in such an environment. The video surveillance modules are based on appearance-based person detection, which is further used to modulate the outcome of generic processing steps such as change detection or skin detection. A relation between the expected scene behavior and the underlying basic video surveillance modules is established. We show that this combination already provides sufficient expressiveness to describe various everyday situations in indoor video surveillance. The whole approach is qualitatively and quantitatively evaluated on a prototypical scenario in a server room.
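As a hedged sketch of how person detections could modulate a generic module: change pixels inside a detected person box are attributed to that person, while the rest is flagged as an unexplained scene change. The box format, mask types, and function name are assumptions for illustration:

```python
import numpy as np

def classify_changes(change_mask, person_boxes):
    """change_mask: boolean (H, W) foreground mask from change detection.
    person_boxes: list of (x, y, w, h) person detections in pixels."""
    person_mask = np.zeros_like(change_mask, dtype=bool)
    for (x, y, w, h) in person_boxes:
        person_mask[y:y + h, x:x + w] = True
    explained = change_mask & person_mask      # change caused by a person
    unexplained = change_mask & ~person_mask   # suspicious: no person nearby
    return explained, unexplained
```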
In recent years, the wide use of video surveillance systems has caused an enormous increase in the amount of data that has to be stored, monitored, and processed. As a consequence, it is crucial to support human operators with automated surveillance applications. Towards this end, an intelligent video analysis module for real-time alerting in case of abandoned objects in public spaces is proposed. The overall processing pipeline consists of two major parts. First, person motion is modeled using an Interacting Multiple Model (IMM) filter. The IMM filter estimates the state of a person according to a finite-state, discrete-time Markov chain. Second, the location of persons who stay at a fixed position defines a region of interest, in which a nonparametric background model with dynamic per-pixel state variables identifies abandoned objects. If an abandoned object is detected, an alarm event is triggered. The effectiveness of the proposed system is evaluated on the PETS 2006 dataset and the i-Lids dataset, both reflecting prototypical surveillance scenarios.
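A minimal sketch of the IMM mode-probability update for two motion hypotheses ("moving" vs. "stationary"); the transition probabilities are assumed values, and `likelihoods` stands for the per-model measurement likelihoods produced by the underlying filters:

```python
import numpy as np

# Markov chain of the IMM filter: P[i, j] = Prob(model i -> model j).
# Models: 0 = moving, 1 = stationary (transition values are assumptions).
P = np.array([[0.95, 0.05],
              [0.05, 0.95]])

def update_mode_probabilities(mu, likelihoods):
    """mu: prior mode probabilities, likelihoods: per-model likelihoods."""
    predicted = P.T @ mu                  # propagate through the Markov chain
    posterior = likelihoods * predicted   # weight by how well each model fits
    return posterior / posterior.sum()    # normalize to probabilities

# A track whose "stationary" probability stays high over several seconds
# defines the region of interest handed to the background model.
```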
Human action recognition has emerged as an important field in the computer vision community due to its large number of applications, such as automatic video surveillance, content-based video search, and human-robot interaction. To cope with the challenges that this large variety of applications presents, recent research has focused on developing classifiers able to detect several actions in more natural and unconstrained video sequences. We address the invariance-discrimination tradeoff in action recognition by utilizing a Generalized Hough Transform. As a basis for action representation, we transform 3D poses into a robust feature space, referred to as pose descriptors. For each action class, a one-dimensional temporal voting space is constructed. Votes are generated by associating pose descriptors with their position in time relative to the end of an action sequence. Training data consists of manually segmented action sequences. In the detection phase, valid human 3D poses are assumed as input, e.g. originating from 3D sensors or monocular pose reconstruction methods. The human 3D poses are normalized to gain view independence and transformed into (i) relative limb-angle space, to ensure independence of non-adjacent joints, or (ii) geometric features. In (i), an action descriptor consists of the relative angles between limbs and their temporal derivatives. In (ii), the action descriptor consists of different geometric features. To circumvent the problem of time warping, we propose to use a codebook of prototypical 3D poses generated from sample sequences of 3D motion capture data; this idea is in accordance with the concept of equivalence classes in action space. Results of the codebook method are presented using the Kinect sensor and the CMU Motion Capture Database.
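A hedged sketch of the one-dimensional temporal voting described above: each frame's pose descriptor is matched to its nearest codebook prototype, which casts weighted votes for candidate action end times using the relative offsets observed in training. All names and the nearest-neighbor matching are illustrative assumptions:

```python
import numpy as np

def temporal_hough_votes(descriptors, codebook, offsets_per_entry, num_frames):
    """descriptors: per-frame pose descriptors of the test sequence,
    codebook: (K, D) prototypical pose descriptors,
    offsets_per_entry[k]: list of (offset_to_action_end, weight) pairs."""
    voting_space = np.zeros(num_frames)
    for t, d in enumerate(descriptors):
        k = np.argmin(np.linalg.norm(codebook - d, axis=1))  # nearest prototype
        for offset, weight in offsets_per_entry[k]:
            end = t + offset
            if 0 <= end < num_frames:
                voting_space[end] += weight  # vote for the action's end frame
    return voting_space  # local maxima indicate detected action endpoints
```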
Autonomously operating semi-stationary multi-camera components are the core modules of ad-hoc multi-view methods. On the one hand, a situation recognition system needs an overview of the entire scene, as given by a wide-angle camera; on the other hand, a close-up view of interesting agents, e.g. from an active pan-tilt-zoom (PTZ) camera, is required to gather additional information, for instance to identify those agents. To configure such a system, we set the field of view (FOV) of the overview camera in correspondence with the motor configuration of the PTZ camera. Images are captured from a uniformly moving PTZ camera until the entire field of view of the master camera is covered. Along the way, a lookup table (LUT) of PTZ motor coordinates and image coordinates in the master camera is generated. To match each pair of images, features (SIFT, SURF, ORB, STAR, FAST, MSER, BRISK, FREAK) are detected, selected by the nearest neighbor distance ratio (NNDR), and matched. A homography is estimated to transform the PTZ image to the master image. With that information, comprehensive LUTs are calculated via barycentric coordinates and stored for every pixel of the master image. In this paper, the robustness, accuracy, and runtime are quantitatively evaluated for different features.
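One matching-and-homography step of this calibration, sketched with OpenCV and ORB as a representative of the evaluated features; the NNDR threshold of 0.75 is a typical assumed value, not necessarily the one used in the paper:

```python
import cv2
import numpy as np

def ptz_to_master_homography(ptz_img, master_img, ratio=0.75):
    """Estimate the homography mapping a PTZ image into the master image."""
    orb = cv2.ORB_create()
    kp1, des1 = orb.detectAndCompute(ptz_img, None)
    kp2, des2 = orb.detectAndCompute(master_img, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    # NNDR selection: accept a match only if it is clearly better than the
    # second-best candidate.
    good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
            if m.distance < ratio * n.distance]
    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H
```

Applying this step for every captured PTZ position yields the correspondences from which the per-pixel LUTs are interpolated via barycentric coordinates.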