Images have characteristic statistics that can be described in terms of the responses of wavelet or Gabor-like filters.
There has been a great deal of interest in the fact that images have sparse (kurtotic) statistics in the wavelet domain, with
implications for efficient image encoding in biological and artificial systems. If we set aside the issue of efficiency, we
are still left with the problem of seeing. We have been studying the ways in which filter statistics can reveal useful information
about surfaces, including albedo, shading, and gloss. We find that odd-order statistics such as skewness are
quite useful in extracting information about reflectance and gloss, and we also find evidence that humans make use of
this information. It is straightforward to compute skewness with physiological mechanisms.
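The skewness statistic discussed above can be computed in a few lines. The sketch below is only an illustration of the statistic itself, not the authors' stimuli or pipeline; the "matte" and "glossy" patches are hypothetical synthetic data meant to show that sparse bright highlights push the skewness positive.

```python
import numpy as np

def skewness(x):
    """Third standardized moment: E[(x - mu)^3] / sigma^3."""
    x = np.asarray(x, dtype=float).ravel()
    mu, sigma = x.mean(), x.std()
    return ((x - mu) ** 3).mean() / sigma ** 3

# Hypothetical illustration: a roughly symmetric (matte-like) patch
# versus the same patch with sparse bright highlights added.
rng = np.random.default_rng(0)
matte = rng.normal(0.5, 0.05, size=(64, 64))
glossy = matte + (rng.random((64, 64)) > 0.98) * 0.4

matte_skew = skewness(matte)    # close to zero
glossy_skew = skewness(glossy)  # clearly positive
```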
We describe a set of techniques for mapping one image to another based on the statistics of a training set. We
apply these techniques to the problems of image denoising and superresolution, but they should also be useful
for many vision problems where training data are available. Given a local feature vector computed from an
input image patch, we learn to estimate a subband coefficient of the output image conditioned on the patch.
This entails approximating a multidimensional function, which we make tractable by nested binning and linear
regression within bins. This method performs as well as nearest-neighbor techniques but is much faster. After
obtaining this local (patch-based) estimate, we force the marginal subband histograms to match a set of target
histograms, in the style of Heeger and Bergen [1]. The target histograms are themselves estimated from the
training data. With the combined techniques, denoising performance is similar to that of state-of-the-art methods
in terms of PSNR, and is slightly superior in subjective quality. In the case of superresolution, our techniques
produce higher subjective quality than the competing methods, allowing us to attain large increases in apparent
resolution. Thus, for these two tasks, our method is very fast and very effective.
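The "binning plus linear regression within bins" idea can be sketched as follows. For clarity this reduces the paper's nested (hierarchical) binning to a single binning level on one feature dimension; the class name and all parameters are illustrative, not the authors' implementation.

```python
import numpy as np

class BinnedLinearRegressor:
    """Piecewise-linear function approximation: quantile-bin one
    feature dimension, then fit an ordinary least-squares plane
    within each bin. A one-level simplification of nested binning."""

    def __init__(self, n_bins=8, bin_dim=0):
        self.n_bins = n_bins
        self.bin_dim = bin_dim

    def _bin_index(self, X):
        v = X[:, self.bin_dim]
        return np.clip(np.searchsorted(self.edges, v, side='right') - 1,
                       0, self.n_bins - 1)

    def fit(self, X, y):
        v = X[:, self.bin_dim]
        # Quantile edges give roughly equal-population bins.
        self.edges = np.quantile(v, np.linspace(0, 1, self.n_bins + 1))
        idx = self._bin_index(X)
        self.coefs = []
        for b in range(self.n_bins):
            mask = idx == b
            A = np.hstack([X[mask], np.ones((mask.sum(), 1))])
            coef, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
            self.coefs.append(coef)
        return self

    def predict(self, X):
        idx = self._bin_index(X)
        A = np.hstack([X, np.ones((len(X), 1))])
        return np.array([A[i] @ self.coefs[idx[i]] for i in range(len(X))])

# Hypothetical usage: learn the nonlinear map y = x0^2 + 0.5*x1.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(2000, 2))
y = X[:, 0] ** 2 + 0.5 * X[:, 1]
model = BinnedLinearRegressor(n_bins=8).fit(X, y)
rmse = np.sqrt(np.mean((model.predict(X) - y) ** 2))
```

Because each bin only requires a small least-squares solve, training and prediction cost grow linearly with the data, which is the source of the speed advantage over nearest-neighbor search.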
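The histogram-matching step, in the Heeger and Bergen style, can be sketched with rank-order matching: each subband value is replaced by the target-histogram value of the same rank, so the output has exactly the target marginal while keeping the source's spatial structure. This is a generic sketch of the operation, not the paper's exact routine, and it assumes source and target have the same number of samples.

```python
import numpy as np

def match_histogram(source, target):
    """Replace each source value by the target value of equal rank,
    so the output's marginal histogram equals the target's while the
    source's rank (spatial) structure is preserved."""
    s = np.asarray(source, dtype=float)
    ranks = np.argsort(np.argsort(s.ravel()))  # rank of each sample
    matched = np.sort(np.asarray(target, dtype=float).ravel())[ranks]
    return matched.reshape(s.shape)

# Hypothetical subbands: a Gaussian-looking band forced to take on
# a heavier-tailed (Laplacian-like) target histogram.
band = np.random.default_rng(2).normal(size=(32, 32))
target = np.random.default_rng(3).laplace(size=(32, 32))
out = match_histogram(band, target)
```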
Physical surfaces such as metal, plastic, and paper possess different optical qualities that lead to different characteristics in images. We have found that humans can effectively estimate certain surface reflectance properties from a single image without knowledge of illumination. We develop a machine vision system to perform similar reflectance estimation tasks automatically. The problem of estimating reflectance from single images under unknown, complex illumination proves highly under-constrained due to the variety of potential reflectances and illuminations. Our solution relies on statistical regularities in the spatial structure of real-world illumination. These regularities translate into predictable relationships between surface reflectance and certain statistical features of the image. We determine these relationships using machine learning techniques. Our algorithms do not depend on color or polarization; they apply even to monochromatic imagery. An ability to estimate reflectance under uncontrolled illumination will further efforts to recognize materials and surface properties, to capture computer graphics models from photographs, and to generalize classical motion and stereo algorithms so that they can handle non-Lambertian surfaces.
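The kind of "statistical features of the image" such a system might feed to its learning stage can be illustrated as below. The specific features here (moments of the log-luminance distribution and of a crude high-pass band) are an assumption for illustration only; the paper's actual feature set and learning stage differ.

```python
import numpy as np

def reflectance_features(img):
    """Hypothetical low-dimensional feature vector: spread, skewness,
    and kurtosis of the log-luminance distribution and of a simple
    high-pass (Laplacian-like) band. Illustrative only."""
    x = np.log(np.clip(np.asarray(img, dtype=float), 1e-6, None))
    # Crude high-pass band: each pixel minus its 4-neighbor average.
    hp = x[1:-1, 1:-1] - 0.25 * (x[:-2, 1:-1] + x[2:, 1:-1] +
                                 x[1:-1, :-2] + x[1:-1, 2:])
    def moments(v):
        v = v.ravel()
        mu, sd = v.mean(), v.std()
        return [sd,
                ((v - mu) ** 3).mean() / sd ** 3,   # skewness
                ((v - mu) ** 4).mean() / sd ** 4]   # kurtosis
    return np.array(moments(x) + moments(hp))

# Hypothetical monochromatic image patch.
img = np.random.default_rng(4).uniform(0.1, 1.0, size=(64, 64))
feats = reflectance_features(img)
```

Note that nothing here uses color or polarization: the features are functions of a single luminance channel, consistent with the claim that the approach applies to monochromatic imagery.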
The perception of objects is a well-developed field, but the perception of materials has been studied rather little. This is surprising given how important materials are for humans, and how important they must become for intelligent robots. We may learn something by looking at other fields in which material appearance is recognized as important. Classical artists were highly skilled at generating convincing materials. The simulation of material appearance is a topic of great importance in 3D computer graphics. Some fields, such as mineralogy, use the concept of a 'habit' which is a combination of shape and texture, and which may be used for characterizing certain objects or materials. We have recently taken steps toward material recognition by machines, using techniques derived from the domain of texture analysis.
Large regions of many images are filled with visual texture, in which a viewer is not concerned with the exact pixel values. In image coding, it is advantageous to describe such regions in terms of their boundaries and textural properties. A textural description can be much more compact than a precise description of pixel values. For a coding system to work, it is necessary to have an automated method for generating compact texture descriptions; the synthesized textures must appear satisfying to the human viewer. We have adapted the Heeger and Bergen algorithm to the coding problem. The algorithm decomposes an image into subbands with a steerable pyramid, and characterizes the texture in terms of the subband histograms and the pixel histogram. Since the subband histograms all have a similar form, we can describe each one with a low-order parametric model. The resulting textural descriptor is quite compact. We show examples with both still images and video sequences.
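One standard choice for the low-order parametric model of a subband histogram is the two-parameter generalized Gaussian; the abstract does not name the model, so this choice is an assumption made for illustration. The sketch below fits it by moment matching (solving for the shape parameter from the ratio of mean absolute value to RMS).

```python
import numpy as np
from math import gamma, sqrt

def fit_generalized_gaussian(coeffs):
    """Fit f(x) ~ exp(-|x/alpha|^beta) to subband coefficients by
    matching the ratio E|x| / sqrt(E[x^2]), which is a monotonically
    increasing function of the shape parameter beta."""
    x = np.asarray(coeffs, dtype=float).ravel()
    target = np.abs(x).mean() / sqrt((x ** 2).mean())

    def ratio(beta):
        return gamma(2 / beta) / sqrt(gamma(1 / beta) * gamma(3 / beta))

    lo, hi = 0.2, 5.0                       # bracket for bisection
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if ratio(mid) < target:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    # Scale from the second moment: E[x^2] = alpha^2 G(3/b)/G(1/b).
    alpha = sqrt((x ** 2).mean() * gamma(1 / beta) / gamma(3 / beta))
    return alpha, beta

# Sanity check on known distributions: beta ~ 2 for Gaussian data,
# beta ~ 1 for Laplacian data (typical of sparse subband statistics).
alpha_g, beta_g = fit_generalized_gaussian(
    np.random.default_rng(5).normal(size=20000))
alpha_l, beta_l = fit_generalized_gaussian(
    np.random.default_rng(6).laplace(size=20000))
```

Two numbers per subband, plus a handful for the pixel histogram, is what makes the resulting textural descriptor so compact.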
Most image coding systems rely on signal processing concepts such as transforms, VQ, and motion compensation. In order to achieve significantly lower bit rates, it will be necessary to devise encoding schemes that involve mid-level and high-level computer vision. Model-based systems have been described, but these are usually restricted to some special class of images such as head-and-shoulders sequences. We propose to use mid-level vision concepts to achieve a decomposition that can be applied to a wider domain of image material. In particular, we describe a coding scheme based on a set of overlapping layers. The layers, which are ordered in depth and move over one another, are composited in a manner similar to traditional 'cel' animation. The decomposition (the vision problem) is challenging, but we have attained promising results on simple sequences. Once the decomposition has been achieved, the synthesis is straightforward.
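The straightforward synthesis step — compositing depth-ordered layers in the manner of cel animation — amounts to the standard "over" operator applied back to front. The sketch below uses a hypothetical two-layer scene; the layer contents and sizes are illustrative, not from the paper.

```python
import numpy as np

def composite_layers(layers, background=0.0):
    """Composite depth-ordered layers back to front with the 'over'
    operator, as in cel animation. Each layer is a (color, alpha)
    pair; layers are listed from farthest to nearest."""
    out = np.zeros_like(layers[0][0], dtype=float) + background
    for color, alpha in layers:
        out = alpha * color + (1.0 - alpha) * out
    return out

# Hypothetical two-layer scene: an opaque background with a
# foreground square whose alpha mask moves from frame to frame.
h, w = 16, 16
bg = np.full((h, w), 0.3)
fg = np.full((h, w), 0.9)
mask = np.zeros((h, w))
mask[4:10, 4:10] = 1.0
frame = composite_layers([(bg, np.ones((h, w))), (fg, mask)])
```

Because each new frame only requires re-compositing the same layers under new motions, the decoder's work stays cheap once the layers have been transmitted.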
Image segmentation provides a powerful semantic description of video imagery, essential for image understanding and for efficient manipulation of image data. In particular, segmentation based on image motion defines regions undergoing similar motion, allowing an image coding system to represent video sequences more efficiently. This paper describes a general iterative framework for segmentation of video data. The objective of our spatiotemporal segmentation is to produce a layered image representation of the video for image coding applications, whereby the video data are simply described as a set of moving layers.