We investigate low-complexity convolutional neural networks (CNNs) for object detection in embedded vision applications. It is well known that building an embedded system for CNN-based object detection is more challenging than for problems such as image classification, because of the computation and memory requirements. To meet these requirements, we design and develop an end-to-end TensorFlow (TF)-based fully convolutional deep neural network for the generic object detection task, inspired by one of the fastest frameworks, YOLO [1]. The proposed network predicts the location of every object by regressing the coordinates of the corresponding bounding box, as in YOLO. Hence, the network is able to detect objects without any limitation on their size. However, unlike YOLO, all the layers in the proposed network are fully convolutional. Thus, it is able to take input images of any size. We pick face detection as a use case and evaluate the proposed model on the FDDB and Widerface datasets. As another use case of generic object detection, we evaluate its performance on the PASCAL VOC dataset. The experimental results demonstrate that the proposed network can predict object instances of different sizes and poses in a single frame. Moreover, the results show that the proposed method achieves accuracy comparable to state-of-the-art CNN-based object detection methods while reducing the model size by 3× and the memory bandwidth by 3–4× compared with one of the best real-time CNN-based object detectors, YOLO. Our 8-bit fixed-point TF model provides an additional 4× memory reduction while keeping the accuracy nearly as good as that of the floating-point model. Moreover, the fixed-point model is capable of 20× faster inference than the floating-point model. Thus, the proposed method is promising for embedded implementations.
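The 4× memory reduction quoted for the 8-bit model follows directly from storing each weight in one byte instead of a 32-bit float. The sketch below illustrates this with a generic per-tensor affine quantizer in NumPy. It is an assumption made for illustration, not the quantization scheme used in the paper, and the names quantize_uint8 and dequantize are hypothetical.

```python
import numpy as np

def quantize_uint8(w):
    """Affine-quantize a float32 array to uint8 codes (illustrative scheme only)."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0   # width of one quantization step
    q = np.round((w - lo) / scale).astype(np.uint8)  # codes in 0..255
    return q, scale, lo

def dequantize(q, scale, lo):
    """Map uint8 codes back to approximate float32 values."""
    return q.astype(np.float32) * scale + lo

# Hypothetical 3x3 convolution kernel bank: 16 input and 32 output channels.
rng = np.random.default_rng(0)
weights = rng.standard_normal((3, 3, 16, 32)).astype(np.float32)

q, scale, lo = quantize_uint8(weights)
restored = dequantize(q, scale, lo)

memory_ratio = weights.nbytes / q.nbytes             # float32 bytes / uint8 bytes
max_error = float(np.abs(weights - restored).max())  # bounded by about half a step
```

The rounding error per weight stays within roughly half a quantization step, which is consistent with the abstract's observation that accuracy can remain close to the floating-point model, although the actual accuracy impact depends on the network and is not something this sketch measures.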
Proc. SPIE. 8856, Applications of Digital Image Processing XXXVI
KEYWORDS: Cameras, Autostereoscopic displays, Visualization, Image processing algorithms and systems, Associative arrays, Detection and tracking algorithms, Gaussian filters, 3D displays, 3D video compression, Glasses
Autostereoscopic (AS) displays spatially multiplex multiple views, providing a more immersive experience by enabling
users to view the content from different angles without the need for 3D glasses. Multiple views could be captured by
multiple cameras at different orientations; however, this could be expensive, time-consuming and not applicable to some
applications. The goal of multiview synthesis in this paper is to generate multiple views from a stereo image pair and
disparity map by using various video processing techniques including depth/disparity map processing, initial view
interpolation, inpainting and post-processing. We specifically emphasize the need for disparity processing when
no depth information associated with the 2D data is available, and we propose a segmentation-based disparity
processing algorithm to improve the disparity map. Furthermore, we extend the texture-based 2D inpainting algorithm to 3D
and further improve the hole-filling performance of view synthesis. The benefit of each step of the proposed algorithm
is demonstrated by comparison with state-of-the-art algorithms in terms of visual quality and the PSNR metric. Our system is
evaluated in an end-to-end multiview synthesis framework in which only a stereo image pair is provided as input to the
system and 8 views are output and displayed on an 8-view Alioscopy AS display.
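The initial view interpolation step and the holes it leaves for inpainting can be illustrated with a minimal forward-warping sketch. It assumes the common convention that a left-image pixel at column x appears at column x − d in the right image, so an intermediate view at fraction alpha shifts it by alpha · d. The function name synthesize_view and the toy data are hypothetical, and this is not the paper's algorithm.

```python
import numpy as np

def synthesize_view(left, disparity, alpha):
    """Forward-warp `left` by alpha * disparity; return the view and a hole mask."""
    h, w = disparity.shape
    out = np.zeros_like(left)
    best = np.full((h, w), -np.inf)        # largest disparity written at each target
    filled = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            d = disparity[y, x]
            xt = int(round(x - alpha * d))  # target column in the virtual view
            # Nearer surfaces (larger disparity) overwrite farther ones.
            if 0 <= xt < w and d > best[y, xt]:
                out[y, xt] = left[y, x]
                best[y, xt] = d
                filled[y, xt] = True
    return out, ~filled

# Toy one-row image: two foreground pixels (disparity 2) on a flat background.
left = np.array([[10, 20, 30, 40, 50, 60]])
disparity = np.array([[0.0, 0.0, 2.0, 2.0, 0.0, 0.0]])
view, holes = synthesize_view(left, disparity, alpha=0.5)
```

The foreground pixels move one column to the left and occlude a background pixel, while the column they vacate is marked as a hole. Such disoccluded regions are what the inpainting stage has to fill, and errors in the disparity map translate directly into misplaced pixels, which is why disparity processing precedes interpolation.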
Frame rate up conversion (FRC) is the process of converting between different frame rates for targeted display
formats. Besides scanning format applications for large displays, FRC can be used to increase the frame rate of
video at the receiver end for video telephony, video streaming or playback applications for mobile platforms where
bandwidth savings are crucial. Many algorithms have been proposed for decoder/receiver-side FRC. However,
most of them take a video encoding/decoding point of view. We have systematically studied strategies for
utilizing the camera 3A (auto exposure, auto white balance and auto focus) information to assist the FRC process;
in this paper we focus on the technique that uses camera exposure information to assist the decoder FRC.
In the proposed strategy, the exposure information and other camera 3A-related information are packetized
as metadata, which is attached to the corresponding frame and transmitted together with the main video
bit stream to the decoder side for FRC assistance. The metadata contains information such as zooming, auto
focus, AE (auto exposure) and AWB (auto white balance) statistics, scene change detection, and global motion detected
by motion sensors. The proposed metadata consists of camera-specific information, which differs from merely
sending motion vectors or mode information to aid the FRC process. Compared to traditional FRC approaches used
in mobile platforms, the proposed approach is a low-complexity, low-power solution, which is crucial in
resource-constrained environments such as mobile platforms.
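As a rough illustration of metadata-assisted FRC, the sketch below attaches a small metadata record to each frame and lets the decoder choose between interpolation and frame repetition. The field names, the FrameMeta class and the decision rule are all assumptions made for this example. The abstract lists the metadata categories but does not say how the exposure statistics steer interpolation, so the sketch uses only the scene-change flag.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FrameMeta:
    """Per-frame camera metadata; field names are hypothetical."""
    zoom: float
    af_locked: bool
    ae_level: float        # auto-exposure statistic (carried, unused below)
    awb_gains: tuple       # auto-white-balance statistic (carried, unused below)
    scene_change: bool
    global_motion: tuple   # (dx, dy) reported by motion sensors

def make_middle_frame(prev, nxt, meta_next):
    """Create the frame between `prev` and `nxt` using the attached metadata."""
    if meta_next.scene_change:
        # Blending across a cut would ghost, so repeat the previous frame.
        return prev.copy()
    # Otherwise fall back to a plain average of the two frames.
    avg = (prev.astype(np.float32) + nxt.astype(np.float32)) / 2
    return avg.astype(prev.dtype)

prev = np.zeros((2, 2), dtype=np.uint8)
nxt = np.full((2, 2), 100, dtype=np.uint8)
same_scene = FrameMeta(1.0, True, 0.5, (1.0, 1.0, 1.0), False, (0, 0))
new_scene = FrameMeta(1.0, True, 0.5, (1.0, 1.0, 1.0), True, (0, 0))

blended = make_middle_frame(prev, nxt, same_scene)
repeated = make_middle_frame(prev, nxt, new_scene)
```

The point of the design is that the decoder reads a flag the camera has already computed instead of running its own scene-change or motion analysis, which is where the complexity and power savings would come from.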
We describe the design of a low-complexity lossless and near-lossless image compression system with random access,
suitable for embedded memory compression applications. This system employs a block-based DPCM coder using
variable-length encoding for the residual. As part of this design, we propose to use non-prefix (one-to-one) codes for
coding of residuals, and show that they offer improvements in compression performance compared to conventional
techniques, such as Golomb-Rice and Huffman codes.
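The comparison between prefix and one-to-one codes can be made concrete with a small codeword-length calculation. The sketch below assumes a left-neighbour DPCM predictor, a zigzag mapping of signed residuals to non-negative indices, and a one-to-one code that assigns index n the n-th shortest binary string (the empty string, 0, 1, 00, ...). These choices and all function names are illustrative assumptions, not the design described in the paper.

```python
def dpcm_residuals(block):
    """Left-neighbour DPCM: keep the first sample, then store differences."""
    return [block[0]] + [b - a for a, b in zip(block, block[1:])]

def zigzag(r):
    """Map a signed residual to an index: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return 2 * r if r >= 0 else -2 * r - 1

def rice_length(n, k):
    """Golomb-Rice codeword bits: unary quotient, one stop bit, k remainder bits."""
    return (n >> k) + 1 + k

def one_to_one_length(n):
    """Bits when index n is assigned the n-th shortest binary string."""
    return (n + 1).bit_length() - 1

# Hypothetical 8-sample block, coded independently so it can be accessed at random.
block = [100, 102, 101, 101, 104, 103, 103, 102]
indices = [zigzag(r) for r in dpcm_residuals(block)[1:]]

rice_bits = sum(rice_length(n, k=1) for n in indices)
one_to_one_bits = sum(one_to_one_length(n) for n in indices)
```

The one-to-one codewords are shorter because they need not satisfy the prefix condition, but a concatenation of them cannot be parsed without extra information about codeword boundaries. The totals above do not include that side information, so they overstate the gain a complete system would obtain.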