This PDF file contains the front matter associated with SPIE Proceedings Volume 10223, including the Title Page, Copyright information, Table of Contents, Introduction (if any), and Conference Committee listing.
One of the main tasks in a vision-based traffic monitoring system is the detection of vehicles. Recently, deep neural networks have been successfully applied to this end, outperforming previous approaches. However, most of these works rely on complex and computationally expensive region proposal networks. Others employ deep neural networks as a segmentation strategy to achieve a semantic representation of the object of interest, which has to be up-sampled later. In this paper, a new design for a convolutional neural network is applied to vehicle detection on highways for traffic monitoring. This network generates a spatially structured output that encodes the vehicle locations. Promising results have been obtained on the GRAM-RTM dataset.
High density pedestrian flows are a common occurrence. Pedestrian safety and comfort in high density flows can present serious challenges to organizers, businesses and safety personnel. Obtaining pedestrian density and velocity directly from Closed Circuit Television (CCTV) would significantly improve real-time crowd management. A study of high density crowd monitoring from video and its real-time application viability is presented. The video data is captured from CCTV. Both cross correlation based and optical flow based approaches are studied. Results are presented in the form of fundamental diagrams, velocity vectors and speed contours of the flow field.
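The cross-correlation idea mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the exhaustive search over integer shifts, and the `metres_per_pixel` and `fps` calibration parameters are all assumptions made for illustration.

```python
def estimate_shift(prev, curr, max_shift=2):
    """Return the integer (dy, dx) shift that best aligns `prev` with `curr`.

    Illustrative only: the score is the mean product of overlapping pixels,
    i.e. a plain (unnormalized) cross-correlation evaluated per shift.
    """
    h, w = len(prev), len(prev[0])
    best, best_score = (0, 0), float("-inf")
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            total, count = 0.0, 0
            for y in range(h):
                for x in range(w):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        total += prev[y][x] * curr[yy][xx]
                        count += 1
            score = total / count if count else float("-inf")
            if score > best_score:
                best, best_score = (dy, dx), score
    return best


def shift_to_velocity(shift, metres_per_pixel, fps):
    """Convert a per-frame pixel shift into metres per second.

    `metres_per_pixel` and `fps` are hypothetical calibration values.
    """
    dy, dx = shift
    return (dy * metres_per_pixel * fps, dx * metres_per_pixel * fps)


def blank(size):
    """Helper for the demo: a size x size patch of zeros."""
    return [[0.0] * size for _ in range(size)]


# Demo: a single bright pixel moves one row down and two columns right.
frame_a, frame_b = blank(6), blank(6)
frame_a[2][2] = 1.0
frame_b[3][4] = 1.0
shift = estimate_shift(frame_a, frame_b)
velocity = shift_to_velocity(shift, metres_per_pixel=0.5, fps=10)
```

Applied to many small interrogation windows, per-window shifts of this kind would give the velocity vectors from which speed contours can be drawn. A practical system would more likely use an FFT-based or normalized correlation than this brute-force search.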
Several approaches have been proposed to extract text from scanned documents. However, text extraction in heterogeneous documents is still a real challenge. Indeed, text extraction in this context is difficult because of variations in text size, style and orientation, as well as the complexity of the document region background. Recently, we proposed the improved hybrid binarization based on Kmeans method (I-HBK) [5] to suitably extract text from heterogeneous documents. In this method, the Page Layout Analysis (PLA), part of the Tesseract OCR engine, is used to identify text and image regions. Afterwards, our hybrid binarization is applied separately to each kind of region. On the one hand, gamma correction is applied to image regions before they are processed. On the other hand, binarization is performed directly on text regions. Then, a foreground and background color study is performed to correct inverted region colors. Finally, characters are located from the binarized regions based on the PLA algorithm. In this work, we extend the integration of the PLA algorithm within the I-HBK method. In addition, to speed up the step that separates text from images, we employ an efficient GPU acceleration. Through the performed experiments, we demonstrate the high accuracy of the PLA algorithm, which reaches an F-measure of 95% on the LRDE dataset. In addition, we compare the sequential and parallel versions of PLA. The obtained results show a speedup of 3.7x for the parallel PLA implementation on a GTX 660 GPU over the CPU version.
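For readers unfamiliar with the two figures of merit quoted above, the following sketch shows how an F-measure and a speedup are conventionally computed. The function names and the sample timings are hypothetical and are not taken from the paper.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def speedup(t_sequential, t_parallel):
    """Ratio of sequential to parallel run time."""
    if t_parallel <= 0:
        raise ValueError("parallel time must be positive")
    return t_sequential / t_parallel


# Hypothetical timings chosen only so that the ratio is 3.7.
example_speedup = speedup(7.4, 2.0)
example_f = f_measure(0.5, 1.0)
```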
Optical Character Recognition (OCR) systems have been designed to operate on text contained in scanned documents and images. They include text detection and character recognition, in which characters are described and then classified. In the classification step, characters are identified according to their features or template descriptions. Then, a given classifier is employed to identify characters. In this context, we have proposed the unified character descriptor (UCD) to represent characters based on their features. Then, matching was employed to perform the classification. This recognition scheme achieves good OCR accuracy on homogeneous scanned documents; however, it cannot discriminate characters with high font variation and distortion [3]. To improve recognition, classifiers based on neural networks can be used. The multilayer perceptron (MLP) ensures high recognition accuracy when robustly trained. Moreover, the convolutional neural network (CNN) is currently gaining a lot of popularity for its high performance. However, both CNN and MLP may suffer from the large amount of computation in the training phase. In this paper, we compare MLP and CNN. We provide MLP with the UCD descriptor and the appropriate network configuration. For CNN, we employ the convolutional network designed for handwritten and machine-printed character recognition (LeNet-5) and we adapt it to support 62 classes, including both digits and characters. In addition, GPU parallelization is studied to speed up both the MLP and CNN classifiers. Based on our experiments, we demonstrate that the real-time CNN used is 2x more relevant than MLP when classifying characters.
This paper presents an extension to our previously developed fusion framework involving a depth camera and an inertial sensor in order to improve its view invariance aspect for real-time human action recognition applications. A computationally efficient view estimation based on skeleton joints is considered in order to select the most relevant depth training data when recognizing test samples. Two collaborative representation classifiers, one for depth features and one for inertial features, are appropriately weighted to generate a decision-making probability. The experimental results on a multi-view human action dataset show that this weighted extension improves the recognition performance by about 5% over the equally weighted fusion deployed in our previous fusion framework.
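The difference between equally weighted and weighted fusion can be sketched as follows. This is only an illustration under the assumption that each classifier outputs one probability per action class. The names `fuse` and `predict`, the weight `w_depth`, and the sample probabilities are hypothetical, and the paper's actual weighting rule is not reproduced here.

```python
def fuse(p_depth, p_inertial, w_depth=0.5):
    """Convex combination of two per-class probability vectors.

    w_depth = 0.5 corresponds to equally weighted fusion.
    """
    if not 0.0 <= w_depth <= 1.0:
        raise ValueError("w_depth must lie in [0, 1]")
    if len(p_depth) != len(p_inertial):
        raise ValueError("both classifiers must score the same classes")
    return [w_depth * d + (1.0 - w_depth) * i
            for d, i in zip(p_depth, p_inertial)]


def predict(fused):
    """Index of the class with the highest fused probability."""
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical scores for two action classes.
p_depth = [0.6, 0.4]
p_inertial = [0.3, 0.7]
equal_choice = predict(fuse(p_depth, p_inertial))          # equal weights
weighted_choice = predict(fuse(p_depth, p_inertial, 0.8))  # trust depth more
```

With these made-up numbers the two settings pick different classes, which is the kind of change in decision that a view-dependent weighting is meant to exploit.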
Unmanned systems used for threat detection and identification are still not efficient enough to monitor the battlefield autonomously. The limitations on size and energy make those systems unable to use most state-of-the-art computer vision algorithms for recognition. A bio-inspired approach based on human peripheral and foveal vision has been reported as a way to combine recognition performance and computational efficiency. As a low-resolution camera observes a large zone and detects significant changes, a second camera focuses on each event and provides a high-resolution image of it. While existing biomimetic approaches usually separate the two vision modes according to their functionality (e.g. detection, recognition) and to their basic primitives (i.e. features, algorithms), our approach uses common structures and features for both peripheral and foveal cameras, thereby decreasing the computational load with respect to previous approaches.
The proposed approach is demonstrated using simulated data. The outcome proves particularly attractive for real-time embedded systems, as the primitives (features and classifier) have already shown good performance in low-power embedded systems. This first result reveals the high potential of the dual-view fusion technique in the context of long-duration unmanned video surveillance systems. It also encourages us to go further in mimicking the mechanisms of the human eye. In particular, it is expected that adding feedback from the fovea to the peripheral vision will further enhance the quality and efficiency of the detection process.
Systems that require multiple coordinated sensors (including sensor fusion) used for ISR, navigation in degraded environments, or infrared countermeasures are constantly trying to increase throughput to carry higher resolution images and video in real time and with low latency. The need for ever higher throughput challenges system designers on every level, including the physical interface. Simply moving video efficiently from point to point or within a network is in itself a challenge. ARINC 818, the Avionics Digital Video Bus, continues to expand into real-time video applications because of its low latency, robustness, and high throughput capabilities.
One in eight live births in the United States is premature and these infants have complications leading to life threatening events such as apnea (pauses in breathing), bradycardia (slowness of heart) and hypoxia (oxygen desaturation). Infant movement pattern has been hypothesized as an important predictive marker for these life threatening events. Thus estimation of movement along with behavioral states, as a precursor of life threatening events, can be useful for risk stratification of infants as well as for effective management of disease state. However, more important and challenging is the determination of the behavioral state of the infant. This information includes important cues such as sleep position and the status of the eyes, which are important markers for neonatal neurodevelopment state.
This paper explores the feasibility of using real-time video analysis to monitor the condition of premature infants. The image of the infant can be segmented into regions to localize and focus on specific areas of interest. Analysis of the segmented regions can be performed to identify different parts of the body including the face, arms, legs and torso. This is necessary due to real-time processing speed considerations. Such a monitoring system would be of great benefit as an aid to medical staff in neonatal hospital settings requiring constant surveillance. For obvious reasons, any such system would have to satisfy extremely stringent reliability and accuracy requirements before it can be deployed in a hospital care unit. The effect of lighting conditions and interference will have to be mitigated to achieve such performance.
HEVC is currently the cutting-edge encoding standard and the most efficient solution for transmitting video content. In this paper, a subjective quality improvement based on pre-processing algorithms that detect homogeneous and chaotic regions is proposed and evaluated for low bit-rate applications at high resolutions. This goal is achieved by means of a texture classification applied to the input frames. Furthermore, these calculations also help reduce the complexity of the HEVC encoder. Therefore, both the subjective quality and the HEVC performance are improved.
This paper presents advancements on the state-of-the-art High Efficiency Video Coding (HEVC) standard, in the context of the Joint Exploration Model (JEM). This model is still under development by the ITU and promises significant improvements in the rate-distortion performance of HEVC through an assortment of coding tools. These tools are presented in algorithmic detail and comparisons between HEVC and JEM are drawn.
Video recording is an essential feature of new-generation military imaging systems. Playback of the stored video on the same device is also desirable, as it provides several operational benefits to end users. Two very important constraints for many military imaging systems, especially hand-held devices and thermal weapon sights, are power consumption and size. To meet these constraints, it is essential to perform most of the processing applied to the video signal, such as preprocessing, compression, storage, decoding, playback and other system functions, on a single programmable chip, such as an FPGA, DSP, GPU or ASIC. In this work, H.264/AVC (Advanced Video Coding) compatible video compression, storage, decoding and playback blocks are efficiently designed and implemented on FPGA platforms using the FPGA fabric and the Altera NIOS II soft processor. Many subblocks used in video encoding are reused during video decoding in order to save FPGA resources and power. Computationally complex blocks are designed using the FPGA fabric, while blocks such as SD card write/read, H.264 syntax decoding and CAVLC decoding are implemented on the NIOS processor to benefit from software flexibility. In addition, to keep power consumption low, the system was designed to require limited external memory access. The design was tested using a 640x480, 25 fps thermal camera on a CYCLONE V FPGA, which belongs to Altera's lowest-power FPGA family, and consumes less than 40% of the CYCLONE V 5CEFA7 FPGA resources on average.
Networks of vision sensors are deployed in many settings, ranging from security needs to disaster response to environmental monitoring. Many of these setups have hundreds of cameras and tens of thousands of hours of video. The difficulty of analyzing such a massive volume of video data is apparent whenever there is an incident that requires foraging through vast video archives to identify events of interest. As a result, video summarization, which automatically extracts a brief yet informative summary of these videos, has attracted intense attention in recent years. Much progress has been made in developing a variety of ways to summarize a single video in the form of a key sequence or video skim. However, generating a summary from a set of videos captured in a multi-camera network remains a novel and largely under-addressed problem. In this paper, with the aim of summarizing videos in a camera network, we introduce a novel representative selection approach via joint embedding and capped ℓ2,1-norm minimization. The objective function is two-fold. The first is to capture the structural relationships of data points in a camera network via an embedding, which helps in characterizing the outliers and also in extracting a diverse set of representatives. The second is to use a capped ℓ2,1-norm to model the sparsity and to suppress the influence of data outliers in representative selection. We propose to jointly optimize both of the objectives, such that the embedding can not only characterize the structure, but also indicate the requirements of sparse representative selection. Extensive experiments on standard multi-camera datasets demonstrate the efficacy of our method over state-of-the-art methods.
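To make the capped ℓ2,1-norm concrete, the sketch below evaluates it for a small matrix using the commonly used definition: the sum over rows of min(row ℓ2-norm, θ). The function names, the threshold name `theta`, and the sample matrix are assumptions for illustration. The paper's full objective, including the embedding term, is not reproduced.

```python
import math


def row_norms(matrix):
    """Euclidean (l2) norm of each row."""
    return [math.sqrt(sum(v * v for v in row)) for row in matrix]


def l21_norm(matrix):
    """Plain l2,1-norm: the sum of the row l2-norms."""
    return sum(row_norms(matrix))


def capped_l21_norm(matrix, theta):
    """Capped l2,1-norm: each row contributes at most `theta`.

    Capping bounds the penalty of rows with very large norms, which is
    how such a term limits the influence of outliers.
    """
    return sum(min(norm, theta) for norm in row_norms(matrix))


# Hypothetical coefficient matrix: one small row, one zero row, one large row.
Z = [[3.0, 4.0], [0.0, 0.0], [6.0, 8.0]]
plain = l21_norm(Z)               # 5 + 0 + 10
capped = capped_l21_norm(Z, 7.0)  # 5 + 0 + 7
```

Rows that are driven to zero contribute nothing to either norm, which is why penalties of this family encourage row sparsity and so suit the selection of a few representatives.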
This study focuses on improving the capabilities of a single-band lossless JPEG-LS encoder with a preprocessing unit. The main motivation is to preserve its inherent low complexity, which enables high-throughput hardware implementations. Although the JPEG-LS standard describes procedures for near-lossless and multicomponent compression, a suitably designed preprocessor unit can give any single-band lossless JPEG-LS encoder these capabilities without changing the encoder itself. Similarly, its compression performance can be improved with selective compression. The idea relies on detecting regions out-of-interest (cloud, snow, etc.) by their distinct spectral signatures and remapping the pixels in these regions so that JPEG-LS achieves the highest compression its algorithm allows. Regions out-of-interest can be compressed regardless of the resulting quality, as they contain no significant information. Thorough analyses of the preprocessing approaches are performed in software on satellite images. In addition, the designed preprocessor unit is implemented on a field-programmable gate array (FPGA) and implementation details are provided.
In this paper, a novel approach for halftone images obtained by the Dot Diffusion (DD) method is proposed and implemented. The designed technique is based on an optimization of the so-called class matrix used in the DD algorithm, and it consists of generating new versions of the class matrix that have no baron and near-baron in order to minimize inconsistencies during the distribution of the error. The proposed class matrices have different properties, each designed for one of two applications: applications where inverse halftoning is necessary, and applications where it is not required. The proposed method has been implemented on a GPU (NVIDIA GeForce GTX 750 Ti) and on multicore processors (AMD FX(tm)-6300 Six-Core Processor and Intel Core i5-4200U), using CUDA and OpenCV on a PC running Linux. Experimental results have shown that the novel framework yields good quality for both the halftone images and the inverse halftone images obtained. The simulation results using parallel architectures have demonstrated the efficiency of the novel technique when it is implemented for real-time processing.
In this work, a robust steganography framework to hide a color image in stereo images is proposed. The embedding algorithm uses the Discrete Cosine Transform (DCT) and Quantization Index Modulation-Dither Modulation (QIM-DM) to hide the secret data. Additionally, the Arnold’s Cat Map Transform is applied to scramble the secret color image, providing better security and robustness for the proposed system. The novel framework has demonstrated better performance against JPEG compression attacks than other existing approaches. In addition, the proposed algorithm is developed with the parallel paradigm in mind so that it can be implemented on multi-core CPUs, increasing the processing speed. The results obtained by the proposed framework show high values of PSNR and SSIM, which demonstrate imperceptibility and sufficient robustness against JPEG compression attacks.
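The scrambling step can be illustrated with the textbook form of Arnold's Cat Map on a square N x N image, where pixel (x, y) moves to ((x + y) mod N, (x + 2y) mod N). This is a minimal sketch: the function names and the number of iterations are assumptions, and the exact map parameters used in the paper are not stated in the abstract.

```python
def arnold_scramble(img, iterations=1):
    """Scramble a square image (list of rows) with Arnold's Cat Map."""
    n = len(img)
    if any(len(row) != n for row in img):
        raise ValueError("Arnold's Cat Map needs a square image")
    out = img
    for _ in range(iterations):
        nxt = [[None] * n for _ in range(n)]
        for y in range(n):
            for x in range(n):
                # Forward map: (x, y) -> (x + y, x + 2y) mod n
                nxt[(x + 2 * y) % n][(x + y) % n] = out[y][x]
        out = nxt
    return out


def arnold_unscramble(img, iterations=1):
    """Invert arnold_scramble by applying the inverse map."""
    n = len(img)
    if any(len(row) != n for row in img):
        raise ValueError("Arnold's Cat Map needs a square image")
    out = img
    for _ in range(iterations):
        nxt = [[None] * n for _ in range(n)]
        for y in range(n):
            for x in range(n):
                # Inverse map: (x, y) -> (2x - y, y - x) mod n
                nxt[(y - x) % n][(2 * x - y) % n] = out[y][x]
        out = nxt
    return out


# Demo on a tiny 3 x 3 "image" of distinct values.
secret = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
scrambled = arnold_scramble(secret, iterations=1)
restored = arnold_unscramble(scrambled, iterations=1)
```

Because the map is a permutation of pixel positions, the number of iterations can act as a simple key: a receiver who knows it can restore the image exactly. For a color image the same permutation would be applied to each channel.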
To address the conflicting requirements of a multi-parameter H.265/HEVC encoder system, the present paper analyzes a set of optimizations in order to improve the trade-off between quality, performance and power consumption for different applications that demand reliability and accuracy. The method is based on Pareto optimization and has been tested at different resolutions on real-time encoders.
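Pareto optimization over encoder settings can be sketched as a dominance filter. The example below assumes three objectives (quality as PSNR and speed as frames per second, both maximized, and power in watts, minimized). The configuration names and all numbers are invented for illustration and are not measurements from the paper.

```python
def dominates(a, b):
    """True if configuration `a` is at least as good as `b` on every
    objective and strictly better on at least one."""
    no_worse = (a["psnr"] >= b["psnr"] and a["fps"] >= b["fps"]
                and a["watts"] <= b["watts"])
    better = (a["psnr"] > b["psnr"] or a["fps"] > b["fps"]
              or a["watts"] < b["watts"])
    return no_worse and better


def pareto_front(configs):
    """Keep only the configurations that no other configuration dominates."""
    return [c for c in configs
            if not any(dominates(other, c) for other in configs)]


# Hypothetical encoder configurations.
configs = [
    {"name": "A", "psnr": 40.0, "fps": 30.0, "watts": 10.0},
    {"name": "B", "psnr": 38.0, "fps": 60.0, "watts": 8.0},
    {"name": "C", "psnr": 37.0, "fps": 50.0, "watts": 9.0},  # dominated by B
    {"name": "D", "psnr": 41.0, "fps": 20.0, "watts": 12.0},
]
front_names = [c["name"] for c in pareto_front(configs)]
```

Every configuration left on the front represents a different trade-off, so the final choice among them depends on the priorities of the target application.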
Obtaining depth information of a scene is an important requirement in many computer-vision and robotics applications. For embedded platforms, passive stereo systems have many advantages over their active counterparts (e.g. LiDAR, infrared). They are power efficient, cheap, robust to lighting conditions and inherently synchronized to the RGB images of the scene. However, stereo depth estimation is a computationally expensive task that operates over large amounts of data. For embedded applications, which are often constrained by power consumption, obtaining accurate results in real time is a challenge. We demonstrate a computationally and memory efficient implementation of a stereo block-matching algorithm on an FPGA. The computational core achieves a throughput of 577 fps at standard VGA resolution whilst consuming less than 3 watts of power. The data is processed using an in-stream approach that minimizes memory-access bottlenecks and best matches the raster scan readout of modern digital image sensors.
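Block matching can be illustrated on a single rectified scanline with a sum-of-absolute-differences (SAD) cost. This is a software sketch only, not the FPGA design described above. The function names, window size, disparity range and sample rows are assumptions, and the abstract does not state which matching cost the authors use.

```python
def disparity_row(left, right, max_disp=4, half_win=1):
    """Per-pixel disparity along one scanline using SAD block matching.

    For each pixel in `left`, search disparities 0..max_disp and keep the
    one whose window in `right` has the smallest SAD cost. Border pixels
    without a full window are left at 0.
    """
    n = len(left)
    disp = [0] * n
    for x in range(half_win, n - half_win):
        best_d, best_cost = 0, float("inf")
        for d in range(max_disp + 1):
            if x - d - half_win < 0:
                break  # candidate window would leave the image
            cost = sum(abs(left[x + k] - right[x - d + k])
                       for k in range(-half_win, half_win + 1))
            if cost < best_cost:
                best_d, best_cost = d, cost
        disp[x] = best_d
    return disp


def pixel_rate(width, height, fps):
    """Pixels per second a core must sustain at a given frame rate."""
    return width * height * fps


# Demo: the left row is the right row shifted by two pixels.
right_row = [1, 3, 9, 5, 7, 2, 8, 4, 6, 0]
left_row = [0, 0, 1, 3, 9, 5, 7, 2, 8, 4]
disparities = disparity_row(left_row, right_row)
vga_rate = pixel_rate(640, 480, 577)
```

Taking VGA as 640x480, the reported 577 fps corresponds to about 177 million pixels per second. Only a few neighbouring pixels are needed for each output value, which is why the algorithm suits the in-stream processing described above.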