In this paper, we are concerned with unsupervised natural image matting. Due to the under-constrained nature of the problem, image matting algorithms are usually provided with user interactions, such as scribbles or trimaps. Supplying these is a tedious task and may even become impractical for some applications. For unsupervised matte calculation, we can either adopt a technique that supports an unsupervised mode for alpha map calculation, or we may automate the process of acquiring the user interactions provided to a matting algorithm. Our proposed technique contributes to both approaches and is based on spectral matting. The latter is the only technique in the literature that supports automatic matting, but it suffers from critical limitations, among which is its unreliable unsupervised operation. Owing to that drawback, spectral matting may produce erroneous mattes in the absence of guiding scribbles or trimaps. Using the Gestalt laws of grouping, we propose a method that automatically produces more faithful mattes than spectral matting. In addition, it can be used to generate trimaps, eliminating the required user interactions and making it possible to harness the power of matting techniques that outperform spectral matting but do not support unsupervised operation. The main contribution of this research is the introduction of the Gestalt laws of grouping to the matting problem.
This paper is concerned with the problem of image completion, where the goal is to fill large missing parts (holes) in an image, video or scene in a visually plausible and computationally efficient manner. Recently, the literature on hole filling has been dominated by exemplar-based (patch-based) filling techniques with a two-stage unified pipeline that starts by building a bag of significant patches (BoSP) and then uses that bag to fill the hole. In this paper, we propose a new framework which addresses the inherent limitations of the state-of-the-art techniques. Our method capitalizes on a newly developed technique for image skimming, followed by a novel procedure to propagate the constructed skim into the hole. Experimental results show that our method compares favourably with the state-of-the-art.
Digital cameras capture images through a Color Filter Array (CFA) and then reconstruct the full-color image. Each CFA pixel captures only one primary color component; the other primary components must be estimated using information from neighboring pixels. Demosaicking is the process of estimating the two unknown color components at each pixel location. Most demosaicking algorithms assume the RGB Bayer CFA pattern with red, green and blue filters. The least-squares luma-chroma demultiplexing method is a state-of-the-art demosaicking method for the Bayer CFA. In this paper, we develop a new demosaicking algorithm for the Kodak-RGBW CFA. This particular CFA reduces noise and improves the quality of the reconstructed images by adding white pixels. We have applied non-adaptive and adaptive demosaicking methods using the Kodak-RGBW CFA to the standard Kodak image dataset, and the results have been compared with previous work.
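As a rough illustration of the estimation step that any demosaicking algorithm performs (this is plain bilinear interpolation, not the least-squares luma-chroma method or the Kodak-RGBW algorithm of the paper; all function names are ours), the sketch below mosaics an image through an RGGB Bayer pattern and fills in the two missing components at each pixel from the nearest available samples:

```python
import numpy as np

def bayer_mosaic(rgb):
    """Sample an RGB image through an RGGB Bayer pattern (one color per pixel)."""
    h, w, _ = rgb.shape
    mosaic = np.zeros((h, w))
    mosaic[0::2, 0::2] = rgb[0::2, 0::2, 0]  # red sites
    mosaic[0::2, 1::2] = rgb[0::2, 1::2, 1]  # green sites
    mosaic[1::2, 0::2] = rgb[1::2, 0::2, 1]  # green sites
    mosaic[1::2, 1::2] = rgb[1::2, 1::2, 2]  # blue sites
    return mosaic

def demosaic_bilinear(mosaic):
    """Estimate the missing components at each pixel by averaging the nearest
    available samples of each channel (normalized 3x3 box averaging)."""
    h, w = mosaic.shape
    masks = np.zeros((3, h, w), dtype=bool)
    masks[0, 0::2, 0::2] = True  # where red was measured
    masks[1, 0::2, 1::2] = True  # where green was measured
    masks[1, 1::2, 0::2] = True
    masks[2, 1::2, 1::2] = True  # where blue was measured
    out = np.zeros((h, w, 3))
    for c in range(3):
        vals = np.pad(np.where(masks[c], mosaic, 0.0), 1, mode='reflect')
        hits = np.pad(masks[c].astype(float), 1, mode='reflect')
        num = np.zeros((h, w))
        den = np.zeros((h, w))
        for dy in (0, 1, 2):
            for dx in (0, 1, 2):
                num += vals[dy:dy + h, dx:dx + w]
                den += hits[dy:dy + h, dx:dx + w]
        out[:, :, c] = num / np.maximum(den, 1e-12)
    return out
```

On a constant-color image this pipeline reconstructs the input exactly; real algorithms are judged on how well they handle edges and texture.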
This paper addresses the problem of natural image matting, in which the goal is to softly segment a foreground from a background. Given an input image and some known foreground (FG) and background (BG) pixels, an alpha value indicating partial foreground coverage is calculated for every other pixel in the image. The proposed algorithm belongs to the sampling-based matting techniques, where the alpha of every unknown pixel is calculated using some FG/BG pairs that are sampled according to certain criteria. Current sampling-based matting techniques suffer from critical disadvantages, leaving the problem open for further development. By adopting a novel FG/BG pair-selection strategy, we propose a technique that overcomes critical pitfalls in the state-of-the-art methods with a performance that is comparable (and superior in certain cases) to them. Our results were evaluated on the online matting benchmark.
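For intuition only, here is a minimal sketch of the core computation shared by sampling-based matting methods: given a candidate FG/BG pair, alpha is the normalized projection of the pixel color onto the FG-BG color line, and the pair is scored by how well the resulting compositing model fits the pixel. This is the generic textbook estimator, not the pair-selection strategy proposed in the paper; the function names are ours.

```python
import numpy as np

def alpha_from_pair(I, F, B, eps=1e-8):
    """Project pixel color I onto the line through background B and foreground F;
    the normalized projection is the estimated foreground coverage (alpha)."""
    I, F, B = (np.asarray(v, float) for v in (I, F, B))
    d = F - B
    a = np.dot(I - B, d) / max(np.dot(d, d), eps)
    return float(np.clip(a, 0.0, 1.0))

def pair_fitness(I, F, B):
    """Chroma distortion: distance from I to the compositing model
    a*F + (1-a)*B. Sampling-based methods score candidate pairs with
    criteria like this one."""
    a = alpha_from_pair(I, F, B)
    model = a * np.asarray(F, float) + (1 - a) * np.asarray(B, float)
    return float(np.linalg.norm(np.asarray(I, float) - model))
```

A pixel lying exactly halfway between the sampled FG and BG colors yields alpha 0.5 with zero distortion; real pair-selection strategies differ in how candidates are gathered and ranked.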
In recent years, the problem of acquiring omnidirectional stereoscopic imagery of dynamic scenes has gained commercial interest, and consequently, new techniques have been proposed to address this problem. The goal of many of these new panoramic methods is to provide practical solutions for acquiring real-time omnidirectional stereoscopic imagery for human viewing. However, there are problems related to mosaicking partially overlapped stereoscopic snapshots of the scene that need to be addressed. Among these issues are the conditions to provide a consistent depth illusion over the whole scene and the appearance of undesired vertical disparities. We develop an acquisition model capable of describing a variety of omnistereoscopic imaging systems and suitable to study the design constraints of these systems. Based on this acquisition model, we compare different acquisition approaches based on mosaicking partial stereoscopic views of the scene in terms of their depth continuity constraints and the appearance of vertical disparities. This work complements and extends our previous work in omnistereoscopic imaging systems by proposing a mathematical framework to contrast different acquisition strategies to create stereoscopic panoramas using a small number of stereoscopic images.
Different camera configurations to capture panoramic images and videos are commercially available today. However, capturing omnistereoscopic snapshots and videos of dynamic scenes is still an open problem. Several methods to produce stereoscopic panoramas have been proposed in the last decade, some of which were conceived in the realm of robot navigation and three-dimensional (3-D) structure acquisition. Even though some of these methods can estimate omnidirectional depth in real time, they were not conceived to render panoramic images for binocular human viewing. Alternatively, sequential acquisition methods, such as rotating image sensors, can produce remarkable stereoscopic panoramas, but they are unable to capture real-time events. Hence, there is a need for a panoramic camera to enable the consistent and correct stereoscopic rendering of the scene in every direction. Potential uses for a stereo panoramic camera with such characteristics are free-viewpoint 3-D TV and image-based stereoscopic telepresence, among others. A comparative study of the different cameras and methods to create stereoscopic panoramas of a scene, highlighting those that can be used for the real-time acquisition of imagery and video, is presented.
There are different panoramic techniques to produce outstanding stereoscopic panoramas of static scenes. However, a camera configuration capable of capturing omnidirectional stereoscopic snapshots and videos of dynamic scenes is still a subject of research. In this paper, two multiple-camera configurations capable of producing high-quality stereoscopic panoramas in real time are presented. Unlike existing methods, the proposed multiple-camera systems acquire all the information necessary to render stereoscopic panoramas at once. The first configuration exploits micro-stereopsis arising from a narrow baseline to produce omni-stereoscopic images. The second panoramic camera uses an extended baseline to produce poly-centric panoramas and to extract additional depth information, e.g., disparity and occlusion maps, which are used to synthesize stereoscopic views in arbitrary viewing directions. The results of emulating both cameras and the pros and cons of each set-up are presented in this paper.
In this paper, we address the problem of disparity estimation required for free navigation in acquired cubic-panorama image datasets. A client-server scheme is assumed, in which a remote user seeks information at each navigation step. This work addresses both the initial compression of such image datasets for storage and the transmission of the required data. Regarding compression for storage, a fast method that exploits properties of the epipolar geometry together with the cubic format of panoramas is used to estimate disparity vectors efficiently. Assuming the use of B pictures, the concept of forward and backward prediction is addressed. Regarding the transmission stage, a new disparity-vector transcoding-like scheme is introduced and a frame-conversion scenario is addressed. Details on how to pick the best vector among candidate disparity vectors are explained. In all the above-mentioned cases, results are compared both visually, through error images, and objectively, using Peak Signal-to-Noise Ratio (PSNR) versus time.
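The objective measure mentioned above can be sketched as follows; this is the standard PSNR definition, with an assumed 8-bit peak value of 255.

```python
import numpy as np

def psnr(reference, test, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB between a reference image and a
    reconstruction; higher is better, infinite for identical images."""
    reference = np.asarray(reference, float)
    test = np.asarray(test, float)
    mse = np.mean((reference - test) ** 2)
    if mse == 0:
        return float('inf')
    return 10.0 * np.log10(peak ** 2 / mse)
```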
A key problem in telepresence systems is how to effectively emulate the subjective experience of being there delivered
by our visual system. A step toward visual realism can be achieved by using high-quality panoramic snapshots instead of
computer-based models of the scene. Furthermore, a better immersive illusion can be created by enabling the free viewpoint
stereoscopic navigation of the scene, i.e. using omnistereoscopic imaging. However, commonly found implementation
constraints of telepresence systems such as acquisition time, rendering complexity, and storage capacity, make the idea of
using stereoscopic panoramas challenging. Having these constraints in mind, we developed a technique for the efficient
acquisition and rendering of omnistereoscopic images based on sampling the scene with clusters of three panoramic images
arranged in a controlled geometric pattern. Our technique can be implemented with any off-the-shelf panoramic cameras.
Furthermore, it requires neither the acquisition of additional depth information of the scene nor the estimation of camera parameters. The low computational complexity and reduced data overhead of our rendering process make it attractive for large-scale stereoscopic sampling in a variety of scenarios.
In this paper, we address the problem of cubic panorama image dataset compression. Two state-of-the-art approaches, namely the H.264/MPEG-4 AVC and Dirac video codecs, are used and compared for the application of virtual navigation in image-based representations of real-world environments. Different prediction structures and
Group Of Pictures (GOP) sizes are investigated and compared on this new type of visual data. Based on the
obtained results, as well as the requirements of the system, an efficient prediction structure and bitstream syntax
are proposed. The concept of epipolar geometry is introduced, and a method to facilitate efficient disparity estimation is suggested.
The recovery of a full resolution color image from a color filter array like the Bayer pattern is commonly regarded
as an interpolation problem for the missing color components. But it may equivalently be viewed as the problem
of channel separation from a frequency multiplex of the color components. By using linear band-pass filters in
a locally adaptive manner, this latter view has previously been approached successfully, providing state-of-the-art performance in demosaicking. In this paper, we address the remaining shortcomings of this frequency-domain method and discuss a locally adaptive restoration filter. By implementing restoration as an extension of the bilateral filter, the method retains reasonable complexity while improving the resulting image quality by more than 1 dB in the best cases.
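For context, here is a brute-force sketch of the plain bilateral filter that the restoration stage extends (the paper's adaptive restoration filter itself is not reproduced here; the parameter names and values are ours):

```python
import numpy as np

def bilateral_filter(img, radius=2, sigma_s=2.0, sigma_r=0.1):
    """Edge-preserving smoothing: each output pixel is a weighted average of
    its neighbors, with weights falling off with both spatial distance and
    intensity difference. Brute-force reference implementation."""
    img = np.asarray(img, float)
    h, w = img.shape
    p = np.pad(img, radius, mode='edge')
    num = np.zeros_like(img)
    den = np.zeros_like(img)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            shifted = p[radius + dy:radius + dy + h, radius + dx:radius + dx + w]
            w_s = np.exp(-(dy * dy + dx * dx) / (2 * sigma_s ** 2))  # spatial weight
            w_r = np.exp(-((shifted - img) ** 2) / (2 * sigma_r ** 2))  # range weight
            num += w_s * w_r * shifted
            den += w_s * w_r
    return num / den
```

The range weight is what keeps a sharp step edge sharp: neighbors on the other side of the edge contribute almost nothing.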
This paper presents a disparity estimation algorithm that combines three different techniques: the Gabor transform, variational refinement and region-based affine parameter estimation. The Gabor transform is implemented using a set of quadrature-pair filters to estimate the two-dimensional correspondences between the two images without calibration information, and the estimated coarse disparity maps are passed to a variational refinement process which involves solving a set of partial differential equations (PDEs). Then the refined disparity values are used together with image segmentation information, so that the parameters of the affine transform for the correspondence of each region can be calculated by singular value decomposition (SVD); these affine parameters are applied in turn to produce further refined disparity maps.
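As a minimal illustration of the region-based affine stage, the sketch below fits an affine disparity model d(x, y) = a·x + b·y + c to samples from one region via an explicit SVD-based least-squares solve. It is a simplification of the per-region estimation described above, and the names are ours.

```python
import numpy as np

def fit_affine_disparity(xs, ys, ds):
    """Fit a per-region affine disparity model d(x, y) = a*x + b*y + c to
    sparse disparity samples, using the SVD pseudoinverse."""
    A = np.column_stack([xs, ys, np.ones(len(xs))])
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    # least-squares solution: params = V * S^-1 * U^T * d
    params = Vt.T @ ((U.T @ np.asarray(ds, float)) / S)
    return params  # (a, b, c)
```

Fitting a model like this regularizes noisy per-pixel disparities: every pixel of the region inherits a disparity consistent with a single planar surface.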
In this paper, a simplified implementation of the Concentric Mosaics image-based rendering technique is proposed. The greatest difficulty for an ordinary user in obtaining the pre-captured images for use in the Concentric Mosaics technique is the precise control of the rotation of a long beam. In the proposed Simplified Concentric Mosaics technique, the camera positions at which the pre-captured images are taken are not precisely controlled but are estimated from the pre-captured images themselves. We use a stereo technique for the estimation of camera positions in this special scenario instead of the traditional camera pose estimation methods of computer vision. The estimation errors have been analyzed, and a closed-loop constraint is used to achieve better rotation angle estimation. In addition, a ratio fitting technique is proposed to select good matching features in the special tri-view matching detection scenario, which further improves rotation angle estimation. Another contribution of the paper is a pre-processing step to eliminate or reduce possible vertical offsets and other distortions in the pre-captured images, which are caused by camera motions that deviate from the ideal one. In a column-based view synthesis technique such as the proposed method and the conventional Concentric Mosaics technique, these vertical offsets and distortions in the pre-captured images lower the quality of the synthesized images; our pre-processing can therefore be applied to both the proposed method and the ordinary Concentric Mosaics technique. The pre-captured image data structures of both Concentric Mosaics and the proposed method are illustrated and compared. The proposed technique has a similar data structure, and thus a similar rendering algorithm, to the conventional Concentric Mosaics technique. As a result, it meets our objective that an ordinary user can obtain Concentric Mosaics-type image data and plug it into a common Concentric Mosaics rendering framework. Simulation results show that the proposed method achieves good rendering results.
This paper presents a method for view morphing and interpolation based on triangulation, with view morphing treated as a basic tool for view interpolation. The feature points in each source image are first detected. Based on these feature points, each source image is segmented into a set of triangular regions, and local affine transformations are then used to map the texture of each triangle from the source image to the destination image. This is called triangulation-based texture mapping. However, one of the significant problems associated with this approach is texture discontinuity between adjacent triangles. In order to solve this problem, the triangular patches that might cause these adjacent discontinuities are first detected, and optimal affine transformations for these triangles are then applied. In the subsequent view interpolation step, all source images are transferred to the novel view through view morphing, and the final novel view is the combination of all these candidate novel views. The major improvement over the traditional approach is a feedback-based method proposed to determine the weights for the texture combination from different views. Simulation results show that our method can reduce the discontinuities in triangle-based view morphing and significantly improve the quality of the interpolated views.
The steadily increasing need for video content accessibility necessitates the development of stable systems to represent video sequences based on their high-level (semantic) content. The core of such systems is the automatic extraction of video content. In this paper, a computational layered framework to effectively extract multiple high-level features of a video shot is presented. The objective of this framework is to extract rich high-level video descriptions of real-world scenes. In our framework, high-level descriptions are related to moving objects, which are represented by their spatio-temporal low-level features; high-level features are represented by generic object-level features such as events.
To achieve higher applicability, descriptions are extracted independently of the video context. Our framework is based on four interacting video processing layers: enhancement to estimate and reduce noise, stabilization to compensate for global changes, analysis to extract meaningful objects, and interpretation to extract context-independent semantic features. The effectiveness and real-time response of our framework are demonstrated by extensive experimentation on indoor and outdoor video shots in the presence of multi-object occlusion, noise, and artifacts.
The discrete wavelet transform (DWT) is a tool extensively used in image processing algorithms. It can be used to decorrelate information from the original image, which can help in compressing the data for storage, transmission or other post-processing purposes. However, the finite extent of such images gives rise to edge artifacts in the reconstructed data. A commonly used technique to overcome this problem is a symmetric extension of the image, which preserves zeroth-order continuity in the data. This still produces undesirable edge artifacts in derivatives and subsampled versions of the image. In this paper, we present an extension to Williams and Amaratunga's work that extrapolates the image data using a polynomial extrapolation technique before performing the forward or inverse DWT for biorthogonal wavelets. Comparative results of reconstructed data, with individual subband reconstruction as well as using the embedded zerotree coding (EZC) scheme, are presented for both of the aforementioned techniques.
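A one-dimensional sketch of the two boundary-handling strategies discussed above, under the assumption of whole-sample symmetry and degree-1 extrapolation (the paper's polynomial degree may differ):

```python
import numpy as np

def symmetric_extend(x, n):
    """Whole-sample symmetric extension: mirror the signal about its endpoints.
    Preserves zeroth-order continuity but kinks the first derivative."""
    x = np.asarray(x, float)
    left = x[1:n + 1][::-1]
    right = x[-n - 1:-1][::-1]
    return np.concatenate([left, x, right])

def linear_extrapolate(x, n):
    """Degree-1 polynomial extrapolation of the boundary: continues the local
    slope, so a smooth ramp extends without an artificial edge."""
    x = np.asarray(x, float)
    left = x[0] + (x[0] - x[1]) * np.arange(n, 0, -1)
    right = x[-1] + (x[-1] - x[-2]) * np.arange(1, n + 1)
    return np.concatenate([left, x, right])
```

On a ramp signal, symmetric extension folds the slope back (creating a crease the wavelet filters see as an edge), while linear extrapolation continues it, illustrating why extrapolation can suppress edge artifacts in derivatives.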
It is likely that block-matching techniques for motion estimation will continue to be used in many applications. In this paper, a novel object-based approach for the enhancement of motion fields generated by block matching is proposed. Block matching is first applied in parallel with a fast spatial image segmentation. Then, a rule-based object post-processing strategy is used in which each object is partitioned into sub-objects and each sub-object motion histogram is first analyzed separately. The sub-object treatment is particularly useful when image segmentation errors occur. Then, using plausibility histogram tests, object motions are classified as translational or non-translational. For non-translational motion, a single motion vector per sub-object is first assigned. The motion vectors of the sub-objects are then examined according to plausibility criteria and adjusted in order to create smooth motion inside the whole object. As a result, blocking artifacts are reduced and a more accurate estimation is achieved. Another interesting result is that motion vectors are implicitly assigned to pixels of covered/exposed areas. In the paper, a performance comparison of the new approach and block-matching methods is given. Furthermore, a fast unsupervised image segmentation method of reduced complexity aimed at separating objects is proposed. This method is based on a binarization method and morphological edge detection. The binarization combines local and global texture-homogeneity tests based on special homogeneity masks which implicitly take possible edges into account for object separation. The paper also contributes a novel formulation of binary morphological erosion, dilation and binary edge detection. The presented segmentation uses few parameters, which are automatically adjusted to the amount of noise in the image and to the local standard deviation.
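For reference, the block-matching front end that the object-based post-processing enhances can be sketched as a full-search SAD minimization; the sketch below is the generic algorithm, not the paper's implementation.

```python
import numpy as np

def block_match(ref, cur, bsize=8, search=4):
    """Full-search block matching: for each block of the current frame, find
    the displacement within +/-search pixels that minimizes the sum of
    absolute differences (SAD) against the reference frame."""
    h, w = cur.shape
    vectors = {}
    for by in range(0, h - bsize + 1, bsize):
        for bx in range(0, w - bsize + 1, bsize):
            block = cur[by:by + bsize, bx:bx + bsize]
            best, best_sad = (0, 0), np.inf
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = by + dy, bx + dx
                    if y < 0 or x < 0 or y + bsize > h or x + bsize > w:
                        continue  # candidate falls outside the reference frame
                    sad = np.abs(block - ref[y:y + bsize, x:x + bsize]).sum()
                    if sad < best_sad:
                        best_sad, best = sad, (dy, dx)
            vectors[(by, bx)] = best
    return vectors
```

The resulting one-vector-per-block field is exactly the kind of blocky motion field whose object-based smoothing the paper proposes.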
In this paper, novel techniques for image segmentation and explicit object-matching-based motion estimation are presented. The principal aims of this work are to reconstruct motion-compensated images without introducing significant artifacts and to introduce an explicit object-matching and noise-robust segmentation technique with low computational costs and regular operations. A main feature of the new motion estimation technique is its tolerance of image segmentation errors such as the fusion or separation of objects. In addition, motion types inside recognized objects are detected. Depending on the detected object motion types, either 'object/unique motion-vector' relations or 'object/several motion-vectors' relations are established. For example, in the case of translation and rotation, objects are divided into different regions and a 'region/one motion-vector' relation is achieved using interpolation techniques. Further, the suitability (computational cost) of the proposed methods for online applications (e.g. image interpolation) is shown. Experimental results are used to evaluate the performance of the proposed methods and to compare them with block-based motion estimation techniques. At this stage of our work, the segmentation part is based on intensity and contour information (scalar segmentation). For further stabilization of the segmentation, and hence of the estimation process, the integration of other statistical properties of objects, e.g. texture (vector segmentation), is the subject of our current research.
This paper describes a technique for representing motion information in a video coder. We present a novel way of representing motion, based on a dictionary of motion models, as well as related estimation techniques. Motion fields are represented by low-order polynomial-based models and a discrete label field. We develop an adaptive context-based entropy coding technique for the label field. In the paper, we address issues relating to rate-distortion-optimal coding. Simulations based on a software implementation of the technique are compared to similar results for classical block-based motion compensation and coding techniques.
This paper presents a new method for coding the chromatic component of a color image that exploits the piecewise-constant nature of chromatic information. The image is first transformed to a color space in which chromatic information is nearly piecewise constant. The chromatic component is then represented by entries from a codebook of 2D chromatic vectors adapted to the given image. Both memoryless quantization and quantization with spatial memory are considered. Finally, the field of labels is coded using a suitable lossless code with memory; we have used a context-dependent arithmetic code. Experimental results showing the rate-distortion performance of the method under various conditions are presented.
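A minimal sketch of the memoryless-quantization variant: a plain Lloyd (k-means) iteration learns an image-adapted codebook of 2-D chromatic vectors, and the resulting label field is what would subsequently be entropy-coded. The deterministic initialization is our simplification, not the paper's training procedure.

```python
import numpy as np

def train_chroma_codebook(chroma, k, iters=20):
    """Memoryless vector quantization of the chromatic plane: learn k
    representative 2-D chroma vectors with a Lloyd (k-means) iteration,
    then code each pixel by the label of its nearest codeword."""
    pts = chroma.reshape(-1, 2).astype(float)
    # simple deterministic init: spread initial codewords across the data
    codebook = pts[np.linspace(0, len(pts) - 1, k).astype(int)].copy()
    labels = np.zeros(len(pts), dtype=int)
    for _ in range(iters):
        d = ((pts[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)                 # nearest-codeword assignment
        for j in range(k):
            members = pts[labels == j]
            if len(members):
                codebook[j] = members.mean(axis=0)  # centroid update
    return codebook, labels.reshape(chroma.shape[:-1])
```

Decoding is a pure table lookup, `codebook[labels]`, which is why only the small codebook plus the losslessly coded label field needs to be transmitted.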
In this article, we present a new NTSC system based on multidimensional crosstalk-free transmultiplexer theory. The system is truly crosstalk-free and is compatible with existing television sets. The new encoded NTSC composite signal can be demodulated with slight degradation by a conventional television receiver, with some improvement by a comb-filter-equipped television, and with best performance by the new decoder. We use new sampling structures for the luminance and chrominance signals in order to obtain near-perfect reconstruction. The detailed structure of the proposed crosstalk-free NTSC system is presented. The NTSC encoder is composed of a decimation stage followed by a near-perfect-reconstruction transmultiplexer encoder, and the NTSC decoder is composed of the transmultiplexer's decoder followed by an interpolation stage. We show structures of multidimensional two-channel FIR filter banks which allow near-perfect reconstruction. Such structures lead to exactly zero crosstalk between the luminance and chrominance signals. Special attention is given to the design of the filters in the system since they need to maintain compatibility with existing receivers. A design example and a comparison with existing NTSC systems are presented.
In video coding at high compression rates, e.g., in very low bit rate coding, every transmitted bit carries a significant amount of information that is related either to motion parameters or to the intensity residual. As demonstrated in the SIM-3 coding scheme, a more precise motion model leads to improved quality of coded images when compared with the H.261 coding standard. In this paper, we present some of our recent results on the modeling and estimation of motion for the compression and post-processing of typical videophone ('head-and-shoulders') image sequences. We describe a block-based motion estimation that attempts to optimize the overall bit budget for intensity residual, motion and overhead information. We compare simulation results for this scheme with full-search block matching in the context of H.261 coding. Then, we discuss a region-based motion estimation that exploits segmentation maps obtained from an MDL-based (minimum description length) algorithm. We compare experimentally several algorithms for the compression of such maps. Finally, we describe motion-compensated interpolation that takes pixel acceleration into account. We show experimentally a major performance improvement of the constant-acceleration model over the usual constant-velocity models. This is a very promising technique for post-processing in the receiver to improve the reconstruction of frames dropped in the transmitter.
The ability to flexibly access coded video data at different resolutions or bit rates is referred to as scalability. We are concerned here with the class of methods referred to as pyramidal embedded coding, in which specific subsets of the binary data can be used to decode lower-resolution versions of the video sequence. Two key techniques in such a pyramidal coder are the scan-conversion operations of down-conversion and up-conversion. Down-conversion is required to produce the smaller, lower-resolution versions of the image sequence. Up-conversion is used to perform conditional coding, whereby the coded lower-resolution image is interpolated to the same resolution as the next higher level and used to assist in the encoding of that level. The coding efficiency depends on the accuracy of this up-conversion process. In this paper, techniques for down-conversion and up-conversion are addressed in the context of a two-level pyramidal representation. We first present the pyramidal technique for spatial scalability and review the methods used in MPEG-2. We then discuss some enhanced methods for down- and up-conversion, and evaluate their performance in the context of the two-level scalable system.
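A toy sketch of the two scan-conversion operations in a two-level pyramid, using 2x2 averaging for down-conversion and pixel replication for up-conversion (practical coders use better interpolation filters; these simple choices are ours for brevity):

```python
import numpy as np

def down_convert(img):
    """Down-conversion for the lower pyramid level: 2x2 block averaging."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def up_convert(low):
    """Up-conversion by pixel replication; the coded lower level predicts
    the higher level."""
    return np.repeat(np.repeat(low, 2, axis=0), 2, axis=1)

def pyramid_code(img):
    """Two-level pyramid: the base layer plus the enhancement residual."""
    low = down_convert(img)
    residual = img - up_convert(low)  # what the enhancement layer codes
    return low, residual
```

The better the up-conversion predicts the full-resolution image, the smaller the residual, which is precisely why up-conversion accuracy governs coding efficiency.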
This paper presents a two-dimensional code excited linear prediction (CELP) method for image coding.
This method is a two-dimensional extension of the CELP systems commonly used for speech coding. The
decoder is identical to a conventional DPCM decoder. However, at the encoder, the input images are first
decomposed into disjoint blocks. A single codeword from a table of N codewords is used to represent the
vector of quantized residuals for each block. The encoder selects the appropriate codeword by reconstructing N
versions of the current block, using each of the N vectors of the codebook. The index of the codeword giving
the least distortion is then transmitted. In designing the codebook, while the LBG method of clustering failed
to converge, we have succeeded in finding a deterministic codebook based on a training set using the method
of successive clustering. The system has been extended by using adaptive prediction, where one of K possible
prediction filters is used for each block; the encoder chooses the prediction filter that results in the least mean
squared prediction error. An index is transmitted to the decoder indicating which prediction filter has been
used. With no additional overhead, K different codebooks can be used, corresponding to each of the prediction
filters. We have tested this system using five predictors. The five predictors were initially selected to give
good performance on different types of image material, e.g. edges of different orientation, and then refined by
minimizing the mean squared prediction error on those pixels for which the initial predictor gave the lowest mean squared prediction error.
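The analysis-by-synthesis codebook search described above can be sketched as follows, assuming the prediction for the current block has already been formed; the function and variable names are ours.

```python
import numpy as np

def celp_encode_block(block, predicted, codebook):
    """Analysis-by-synthesis search: reconstruct the block with every
    residual codeword added to the prediction and keep the index of the
    codeword giving the least squared distortion."""
    errors = [np.sum((predicted + cw - block) ** 2) for cw in codebook]
    best = int(np.argmin(errors))
    return best, predicted + codebook[best]
```

Only the winning index is transmitted; the decoder, which holds the same codebook, adds the indexed codeword to its own (identical) prediction, which is why the decoder can remain a conventional DPCM decoder.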