An image compression method utilizing multiple sampling-rate downsampling and super-resolution upconversion is proposed. Multiple sampling-rate downsampling modes and different quantization patterns are designed for each 32 × 32 macroblock at the encoder side, and a rate-distortion optimization strategy is investigated to select the optimal downsampling and quantization mode. The chosen mode is used to downsample and code the original macroblocks. At the decoder side, a deep learning-based multiple-model super-resolution upconversion is designed to reconstruct the decoded blocks at full resolution. The experimental results demonstrate that our method obtains higher-quality compressed images than JPEG and several state-of-the-art downsampling-based methods at almost all bit rates. At the same decoded image quality, our method achieves 30% to 55% bit savings at low bit rates and 15% to 30% bit savings at medium to high bit rates. In addition, the proposed framework is applicable to other image compression standards.
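To illustrate the rate-distortion mode selection this abstract describes, the following minimal Python sketch picks the downsampling/quantization mode with the lowest Lagrangian cost; the mode list, the cost form J = D + λR, and the encode_fn helper are hypothetical stand-ins, not the paper's implementation.

```python
# Illustrative modes: (label, sampling rate); the paper's actual mode set differs.
MODES = [("full", 1.0), ("half", 0.5), ("quarter", 0.25)]

def select_mode(block, encode_fn, lam):
    """Pick the (downsampling, quantization) mode minimizing J = D + lam * R.

    encode_fn is a hypothetical helper that performs an encode/decode round
    trip and returns (distortion, bits) for the given sampling rate.
    """
    best_cost, best_label = None, None
    for label, rate_factor in MODES:
        distortion, bits = encode_fn(block, rate_factor)
        cost = distortion + lam * bits  # Lagrangian rate-distortion cost
        if best_cost is None or cost < best_cost:
            best_cost, best_label = cost, label
    return best_label
```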
Convolution filtering is one of the most important algorithms in image processing. It is data-intensive, especially when dealing with high-definition images. Most previous studies on accelerating convolution in parallel focus on graphics processing units (GPUs), whereas the central processing unit (CPU) typically plays the role of host, managing the data buffers and control flow. However, recent CPU architectures have significantly improved parallel data computing capabilities, and the trend of integrating the CPU and GPU on a single chip is on the rise. We propose an approach to accelerate convolution filtering on an integrated CPU–GPU heterogeneous architecture. We exploit the parallel processing power of vector instructions on the CPU and make it work collaboratively with the on-chip GPU. Two task assignment methods, static and dynamic task partitioning, are proposed for CPU–GPU collaboration. We evaluate our approach with images and filters of different sizes. The experimental results demonstrate that we achieve up to 146 GFLOP/s using a quad-core CPU, which is 2.5 to 4.8 times faster than the single-GPU version of the OpenCV library, and a 90-times speedup over the single-threaded CPU version. These results demonstrate that the proposed algorithm is efficient.
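The dynamic task partitioning idea can be sketched as a shared work queue from which CPU and GPU workers pull row chunks until the image is filtered; the worker callables and chunk size here are illustrative, and the real implementation uses vector instructions and GPU kernels rather than Python threads.

```python
from queue import Queue, Empty
from threading import Thread

def dynamic_partition(n_rows, chunk, workers):
    """Dynamic CPU-GPU task partitioning, sketched: each worker repeatedly
    pulls a chunk of image rows from a shared queue until none remain."""
    tasks = Queue()
    for start in range(0, n_rows, chunk):
        tasks.put((start, min(start + chunk, n_rows)))

    def drain(worker):
        while True:
            try:
                span = tasks.get_nowait()
            except Empty:
                return
            worker(span)  # convolve the rows in [span[0], span[1])

    threads = [Thread(target=drain, args=(w,)) for w in workers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Static partitioning would instead split the rows once, in proportion to each device's measured throughput.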
By considering rate-distortion optimization (RDO) characteristics, a multilevel optimal λ decision algorithm for rate control is proposed. By deriving the optimal λ at the group-of-pictures (GoP) level, the frame level, and the coding tree unit (CTU) level, a relationship between the λ of the current level and that of the next level can be built. First, an approximately linear relationship between the bit rate and the motion-compensated prediction error is used to allocate bits more reasonably to every frame. Second, the multilevel optimal λ decision scheme is proposed: to minimize the overall distortion of a sequence, an effective way is to achieve the optimal RDO for every GoP. Then, through the RDO process of GoP-level rate control, an optimal frame λ ratio for each frame in the current GoP can be obtained. This ratio is mainly determined by the distortion effect, which is calculated with a source distortion propagation chain. Finally, an optimal CTU λ clip scheme is proposed to achieve the optimal RDO for every CTU. The experimental results show that the coding quality of the proposed rate control algorithm is improved significantly with little loss of bit rate accuracy.
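Two of the steps lend themselves to a short sketch: bit allocation proportional to the motion-compensated prediction error, and the CTU-level λ clip. Both functions below are simplified illustrations; the proportionality and the clip band are assumptions, not the paper's exact formulas.

```python
def allocate_frame_bits(gop_bits, pred_errors):
    """Split the GoP bit budget across frames in proportion to their
    motion-compensated prediction errors (the approximately linear
    relationship described in the abstract)."""
    total = float(sum(pred_errors))
    return [gop_bits * e / total for e in pred_errors]

def clip_ctu_lambda(lam_ctu, lam_frame, ratio=2.0):
    """Keep each CTU's lambda within a multiplicative band around the frame
    lambda; the band width here is illustrative."""
    return min(max(lam_ctu, lam_frame / ratio), lam_frame * ratio)
```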
Moving object detection is one of the most promising research areas and is required in many applications, such as video monitoring and surveillance systems, human activity recognition, vehicle counting, and anomaly detection. Various methods for object detection using a single sensor, and a few using multimodal techniques, have been reported in the literature. However, such systems fail under adverse or challenging conditions, such as illumination variations, scale and appearance changes of objects or targets, occlusions, and camouflage. We present an approach for the detection of moving objects using the structural similarity metric (SSIM) and a Gaussian mixture model (GMM). SSIM is used to compute the similarity between a reference mean background frame and the foreground frame of the visible spectrum (VIS) and thermal infrared (IR) streams independently. The similarity measure is computed in the image spatial domain. The thresholded SSIM results are fused using different pixel-level fusion methods, such as logical "OR," discrete wavelet transform, and principal components analysis. Temporal analysis with a GMM is performed on the fused results to eliminate noise and false positives (unwanted background regions). We compared the results with recent methods for different complex scenarios and found that the F-measure increases up to approximately 80%. Hence, the proposed method proves to be a robust moving object detection technique in the multimodality domain.
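A minimal sketch of the SSIM thresholding and OR-fusion stage, assuming grayscale uint8 inputs and scikit-image's structural_similarity; the threshold is illustrative, and the temporal GMM stage (e.g., OpenCV's createBackgroundSubtractorMOG2) would then be applied to the fused masks across frames.

```python
import numpy as np
from skimage.metrics import structural_similarity

def fused_motion_mask(bg_vis, frame_vis, bg_ir, frame_ir, thresh=0.6):
    """Threshold local SSIM maps of the VIS and IR streams against their mean
    background frames and fuse them with logical OR, the simplest of the
    fusion rules the abstract lists."""
    _, s_vis = structural_similarity(bg_vis, frame_vis, full=True, data_range=255)
    _, s_ir = structural_similarity(bg_ir, frame_ir, full=True, data_range=255)
    # low similarity -> likely moving object in either modality
    return ((s_vis < thresh) | (s_ir < thresh)).astype(np.uint8) * 255
```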
Long-term tracking remains challenging, especially in the presence of occlusion. We propose an enhanced occlusion handling and multipeak redetection method for long-term object tracking. First, our appearance model is constructed from two complementary cues; each model is trained independently, and the two are combined by adaptive merging that considers the reliability of each representation to provide a preliminary estimation. Then, we present an occlusion detection scheme relying on response variation to activate a redetection module in case of tracking failure. Finally, we introduce an adaptive model update strategy that uses the most confident tracking predictions to retain reliable memories. The redetection module is designed based on the multipeak property of the merged response, and the model is updated adaptively based on the reliability of each representation and the occlusion detection result, which allows the proposed method to handle heavy occlusions effectively. Extensive experiments are conducted on two public benchmark datasets with 100 challenging sequences. The experimental results demonstrate that the proposed method performs favorably against 17 state-of-the-art trackers while running efficiently in real time.
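One simple way to realize response-variation-based occlusion detection is to compare the current response peak with its recent running average, as in the sketch below; the drop ratio is an illustrative threshold, not the paper's criterion.

```python
import numpy as np

def occlusion_detected(response, peak_history, drop_ratio=0.5):
    """Flag likely occlusion (and hence trigger redetection) when the current
    correlation-response peak falls well below its recent running average."""
    peak = float(np.max(response))
    baseline = float(np.mean(peak_history)) if peak_history else peak
    peak_history.append(peak)  # keep a history of recent peaks
    return peak < drop_ratio * baseline
```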
Image deblocking is a postprocessing method that aims to suppress compression artifacts without changing the existing JPEG coding standard. We propose an image deblocking method based on deep convolutional neural networks. The proposed method takes full advantage of the characteristics of the wavelet and pixel domains to restore the high-frequency information of compressed images and to maintain the low-frequency information, respectively. In addition, a fusion layer is employed to combine the merits of the two domains. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art deblocking methods in both subjective vision and objective evaluation.
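The dual-domain input preparation can be sketched as follows, using PyWavelets for the wavelet stream; the CNN branches and the fusion layer themselves are not reproduced here.

```python
import pywt

def dual_domain_inputs(decoded):
    """Prepare the two streams the abstract describes: the pixel-domain image
    (low-frequency content to preserve) and its wavelet subbands (high
    frequencies for the network to restore)."""
    ll, (lh, hl, hh) = pywt.dwt2(decoded.astype(float), "haar")
    return decoded, (ll, lh, hl, hh)
```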
Image processing methods, used in many areas of life from identification to traffic flow control, are increasingly deployed on mobile devices. However, the use of optical mark recognition systems (OMRS) on mobile devices is not yet as widespread as other image processing applications. A mobile OMRS is developed using image processing methods. In this system, image enhancement, edge detection, and tilt-shifting operations are performed on images of optical answer sheets captured with mobile devices. The aim of this study is to achieve the maximum success of image processing operations on a mobile device with limited hardware resources. The evaluation of optical answer sheets captured with mobile devices is performed using the image processing methods and algorithms developed within the scope of this study. The developed system is tested on different types of optical answer sheets. According to the test results, the developed system provides flexible and efficient use and partly eliminates the constraints of previous work. Based on the test results, the success rate of the developed system is 93.54%.
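The enhancement and edge-detection front end can be sketched with standard OpenCV calls; the thresholds are illustrative, and tilt correction and mark reading would follow on the detected sheet contour.

```python
import cv2

def preprocess_sheet(image_bgr):
    """Enhance a captured answer-sheet photo and extract an edge map for
    locating the sheet; a sketch, not the developed system's exact pipeline."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)      # contrast enhancement
    return cv2.Canny(gray, 50, 150)    # edge map for sheet localization
```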
The performance of modern automated pattern recognition (PR) systems is heavily influenced by the accuracy of their feature extraction algorithms. Many papers have demonstrated uses of deep learning techniques in PR, but there is little evidence on using them as feature extractors. Our goal is to contribute to this field by performing a comparative study between classical feature extraction methods and deep learning techniques. To that end, a biometric recognition system, which is a PR application, is developed and evaluated using a proposed evaluation metric called expected risk probability. In our study, two deeply learned features, based on the PCANet and DCTNet deep learning techniques, are used with two biometric modalities, palmprint and palm-vein. The efficiency of these techniques is then compared with various classical feature extraction methods. From the obtained results, we conclude that deep learning techniques have a very positive impact on the overall recognition rate and significantly outperform the classical techniques.
As a representative deep learning model, convolutional neural networks (CNNs) have achieved great success in image classification and object detection. However, CNNs require input images to be resized to a fixed size, which may affect the representations of objects. To overcome this limitation, we replace the last pooling layer with a topic model and call the result a topic network. For input images of arbitrary sizes and aspect ratios, the outputs of the topic network are fixed-size features, owing to the topic model in the topic layer, and they can reflect the global or regional characteristics of images at different scales. Two topic models, latent Dirichlet allocation (LDA) and Markov topic random fields (MTRF), are applied to the topic layer, yielding the latent Dirichlet allocation topic network and the Markov topic random fields topic network, respectively. Both perform well in image classification with original-size inputs. More importantly, as a framework, any topic model can easily be applied to the topic layer of a topic network, which makes the approach flexible and extensible.
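The topic-layer idea can be roughly sketched by quantizing the last conv layer's spatial activations into "visual words" and summarizing the image by its topic mixture, which is fixed-size for any input size. In practice the quantizer and the topic model would be fit over the whole training set rather than a single image; this sketch is an illustration, not the paper's formulation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

def topic_pool(feature_maps, n_words=64, n_topics=16):
    """Map (H, W, C) conv activations to a fixed-size topic vector."""
    descriptors = feature_maps.reshape(-1, feature_maps.shape[-1])
    words = KMeans(n_clusters=n_words, n_init=4).fit_predict(descriptors)
    counts = np.bincount(words, minlength=n_words)[None, :]  # word histogram
    lda = LatentDirichletAllocation(n_components=n_topics)
    return lda.fit_transform(counts)[0]  # fixed-size regardless of H and W
```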
Time series recognition problems, such as human action recognition and radar behavior recognition, are important yet challenging tasks, and acquiring enough labeled training samples remains time-consuming. The adaptive multiclass correlation filters (AMCF) method is proposed to exploit different kinds of features and information in a unified framework for recognition. Theoretical investigation of AMCF shows that it obtains a closed-form subsolution that constrains the optimization objective, simplifying the entire inference mechanism in multiclass classification. AMCF is capable of exploiting different kinds of features to solve time series recognition problems. With this new correlation filters-based method, we extend the original signals and handle insufficient training sets effectively. Experiments are conducted on depth image-based action recognition and radar behavior recognition with small numbers of training examples, using the MSRAction3D, MSRGesture3D, UTD-MHAD, and radar behavior datasets. In particular, we demonstrate that the proposed action recognition system, based on completed local binary patterns and AMCF, achieves performance superior to the state of the art.
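As background for the correlation-filter machinery, the sketch below trains a classic single-class filter in closed form in the Fourier domain (MOSSE-style); it is a simplified stand-in for AMCF's multiclass formulation, which is not reproduced here.

```python
import numpy as np

def train_correlation_filter(samples, desired):
    """Closed-form filter H = sum(G . conj(F)) / sum(F . conj(F)), where F and
    G are the FFTs of a training sample and the desired response."""
    G = np.fft.fft2(desired)
    num, den = 0.0, 1e-8  # epsilon avoids division by zero
    for x in samples:
        F = np.fft.fft2(x)
        num = num + G * np.conj(F)
        den = den + F * np.conj(F)
    return num / den  # apply via np.fft.ifft2(H * np.fft.fft2(test)).real
```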
Although the flexible quad-tree structure in high efficiency video coding (HEVC) achieves significant performance improvement over H.264/AVC, it remarkably increases computational complexity owing to the search for the optimal coding unit (CU) size in a very large space. We propose a perceptually motivated fast CU decision for HEVC intracoding based on visual regularity. Since visual regularity represents the visual masking effect of the human visual system (HVS) on structural information, i.e., the irregularity concealment effect, we use it for early CU termination. We predict irregular CUs by exploiting information from neighboring CUs, which introduces little distortion; moreover, such distortion cannot be perceived by the HVS owing to the visual masking effect in irregular regions. The experimental results demonstrate that the proposed method saves 24% of the computational cost on average compared with the anchor, with only a 0.52% increase in Bjøntegaard delta bit rate and a 0.02-dB loss in Bjøntegaard delta PSNR.
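The early-termination logic can be sketched as below, with a simple gradient-orientation-spread proxy standing in for the paper's visual-regularity measure; the threshold is illustrative.

```python
import numpy as np

def irregularity(block):
    """Texture-irregularity proxy: spread of local gradient orientations."""
    gy, gx = np.gradient(block.astype(float))
    return float(np.std(np.arctan2(gy, gx)))

def terminate_cu_early(neighbor_blocks, threshold=1.2):
    """Predict the current CU as irregular from already-coded neighbors and
    stop further quad-tree splitting, relying on HVS masking there."""
    return np.mean([irregularity(b) for b in neighbor_blocks]) > threshold
```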
Human emotions are known to always have four phases in the temporal domain: neutral, onset, apex, and offset, which has been demonstrated to be of great benefit for emotion recognition. Therefore, temporal segmentation has attracted considerable research interest. Although state-of-the-art techniques use recurrent neural networks to greatly improve performance, they ignore the relevance of each frame (time step) of a video and do not consider the changing contributions of different features when fusing them. We propose a framework called the dual-level attention-aware bidirectional gated recurrent unit, which integrates ideas from attention models to discover the most important frames and features for improving temporal segmentation. Specifically, it applies attention mechanisms at two levels: frame and feature. A significant advantage is that the two-level attention weights provide a meaningful value depicting the importance of each frame and feature. The experiments demonstrate that the proposed framework outperforms state-of-the-art methods.
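In outline, the dual-level attention amounts to scoring every frame and every feature dimension and pooling the weighted sequence, as in this numpy sketch; w_frame and w_feat stand in for the learned projections of the paper's model.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_level_attention(gru_outputs, w_frame, w_feat):
    """gru_outputs: (T, D) bidirectional-GRU states for T time steps;
    w_frame: (D, 1) and w_feat: (D, D) hypothetical attention projections."""
    alpha = softmax(gru_outputs @ w_frame, axis=0)   # (T, 1) frame weights
    beta = softmax(gru_outputs @ w_feat, axis=1)     # (T, D) feature weights
    return (alpha * beta * gru_outputs).sum(axis=0)  # attended representation
```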
Remote sensing image pansharpening fuses multispectral (MS) images with a panchromatic (PAN) image to produce an image with both high spatial and high spectral resolution. We propose an improved pansharpening algorithm based on deep learning, in which a four-layer residual network serves as the reconstruction model to enable accurate estimation of high-frequency details. We consider two priors to take advantage of MS information. The first prior encodes interspectral similarity, wherein the relationship between high- and low-resolution PAN images is used to estimate high-resolution MS images. The second prior provides the locations of edges and textures according to the gradient of the PAN image. Consistency in the spectral characteristics is used as the basis for creating a pretrained model to accelerate convergence. Multiple evaluation metrics were applied to simulated and real images to compare the efficacy of the proposed method with that of state-of-the-art image fusion methods.
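The interspectral-similarity prior can be illustrated roughly as follows: the high-frequency residual of the PAN image approximates the detail missing from the upsampled MS bands. This assumes PAN dimensions are exact multiples of the MS dimensions and is a simplified sketch, not the paper's residual network.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def inject_pan_detail(pan, ms_bands, scale=4):
    """Add the PAN high-frequency residual to bilinearly upsampled MS bands."""
    detail = pan.astype(float) - gaussian_filter(pan.astype(float), sigma=scale)
    ms_up = np.stack([zoom(b.astype(float), scale, order=1) for b in ms_bands])
    return ms_up + detail  # detail broadcasts into each band
```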
The checkerboard is a frequently used pattern in camera calibration, an essential process for obtaining the intrinsic parameters needed to extract accurate information from images. An automatic checkerboard detection method that can detect multiple checkerboards in a single image is proposed. It comprises a corner extraction approach using self-correlation and a structure recovery solution using constraints related to adjacent corners and checkerboard block edges. The method exploits the central symmetry of checkerboard crossings as well as the spatial relationships of neighboring checkerboard corners and the grayscale distributions of their neighboring pixels. Five public datasets are used in the experiments to evaluate the method. The results show high detection rates and a short average runtime for the proposed method. In addition, the camera calibration accuracy also demonstrates the effectiveness of the proposed detection method, with reprojection errors smaller than 0.5 pixels.
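The central-symmetry property the method exploits can be checked per candidate corner as in the sketch below: a patch centered on a true crossing closely matches its own 180-degree rotation.

```python
import numpy as np

def symmetry_score(patch):
    """Normalized correlation between a patch and its 180-degree rotation;
    scores near 1.0 indicate a centrally symmetric checkerboard crossing."""
    rotated = np.rot90(patch, 2)
    a = patch - patch.mean()
    b = rotated - rotated.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum()) + 1e-12
    return float((a * b).sum() / denom)
```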
Recent years have witnessed the rapid development of mobile video services. The diversity of mobile devices and video services can result in differences in users' perceived video quality, which presents a significant challenge for video quality evaluation. An objective assessment model is proposed that considers both video and screen characteristics, where the video characteristics include the video resolution and video coding quality (VCQ), and the screen characteristics include the screen size and screen resolution. By analyzing the relationships between the perceived video quality (PVQ) and the video and screen characteristics, an influential parameter, effective video pixels per inch (EV-PPI), is proposed that combines the two sets of characteristics. The PVQ is then estimated from the VCQ and EV-PPI. Experimental results confirm that the proposed method can effectively estimate users' PVQ, and it can serve as a guideline for both video service providers and mobile device manufacturers to improve the performance of their services and devices.
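One plausible reading of EV-PPI, for illustration only, is the pixel density the video actually delivers on a given screen, i.e., the screen PPI capped by the video resolution; the paper's exact definition may differ.

```python
import math

def effective_video_ppi(video_w, video_h, screen_w, screen_h, diagonal_in):
    """EV-PPI under the assumed reading: min of screen and video pixel
    densities over the screen's diagonal (in inches)."""
    screen_ppi = math.hypot(screen_w, screen_h) / diagonal_in
    video_ppi = math.hypot(video_w, video_h) / diagonal_in
    return min(screen_ppi, video_ppi)
```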
This work proposes an improved ranking model for learning local feature descriptors for matching image patches by introducing a variance shrinkage constraint. Previous ranking losses, such as the triplet ranking loss and quadruplet ranking loss, have proven powerful in separating corresponding patch pairs from noncorresponding ones. However, they are unable to restrict intraclass variation, since they are only designed to keep noncorresponding pairs away from corresponding ones. Consequently, scattered pairs become mixed near the separating hyperplane, where they are difficult to discriminate and may degrade performance. To resolve this problem, we introduce a variance shrinkage constraint that reduces the variance of patch pairs in the same class and forces them close to each other. The combination of ranking losses and the variance shrinkage constraint efficiently reduces overlap between patch pairs of different classes, which is confirmed by our experiments. Experiments also show that our model achieves a significant performance improvement over the original ranking models and other recent methods.
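The combined objective can be sketched as a triplet ranking loss plus a term penalizing the variance of same-class (corresponding) pair distances; the weighting gamma and the margin are illustrative.

```python
import numpy as np

def ranking_loss_with_shrinkage(d_pos, d_neg, margin=1.0, gamma=0.1):
    """d_pos / d_neg: distances of corresponding / noncorresponding pairs.
    The variance term pulls same-class distances toward their mean,
    shrinking intraclass variation as the abstract describes."""
    triplet = np.maximum(0.0, margin + d_pos - d_neg).mean()
    return triplet + gamma * d_pos.var()
```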
Shifted superimposition is a resolution-enhancement method that has gained popularity in the projector industry over the last couple of years. The method shifts every other projected frame spatially with subpixel precision, thereby creating a new pixel grid on the projection surface with a smaller effective pixel pitch. It remains an open question how well this technique performs in comparison with native-resolution projection, and how large the effective resolution gain really is. To investigate these questions, we developed a framework for simulating different superimposition methods over different image contents and evaluated the results using several image quality metrics (IQMs). We also performed a subjective experiment in which observers rated the simulated image content, and we calculated the correlation between the subjective results and the IQMs. We found that the visual information fidelity metric is the most suitable for evaluating natural superimposed images when agreement with subjective scores is desired. However, this metric does not detect the distortion in synthetic images; the multiscale structural similarity metric, which is based on the analysis of image structure, is better at detecting it.
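The superimposition simulation can be sketched on a finer grid: every other frame is displaced diagonally by a subpixel amount and the two frames are averaged, emulating the image formed on the projection surface; the shift and upscale factor are illustrative.

```python
import numpy as np

def simulate_superimposition(frame_a, frame_b, shift=0.5, upscale=4):
    """Average frame_a with a subpixel-shifted frame_b on an upscale-times
    finer grid (pixel replication emulates the projector's square pixels)."""
    def expand(img):
        return np.kron(img.astype(float), np.ones((upscale, upscale)))
    a, b = expand(frame_a), expand(frame_b)
    s = int(round(shift * upscale))  # subpixel shift in fine-grid units
    b = np.roll(np.roll(b, s, axis=0), s, axis=1)
    return 0.5 * (a + b)
```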