The pattern of light that falls on the retina is a conflation of real-world sources such as illumination and reflectance. Human observers often contend with the inherent ambiguity of the underlying sources by making assumptions about which real-world sources are most likely. Here we examine whether the visual system’s assumptions about illumination match the statistical regularities of the real world. We used a custom-built multidirectional photometer to capture lighting relevant to the shading of Lambertian surfaces in hundreds of real-world scenes. We quantified the diffuseness of these lighting measurements and compared it to previously reported biases in human visual perception. We find that (1) natural lighting diffuseness falls over the same range as previous psychophysical estimates of the visual system’s assumptions about diffuseness, and (2) natural lighting almost always provides lighting-direction cues strong enough to override the human visual system’s well-known assumption that light tends to come from above. A consequence of these findings is that what seem to be errors in visual perception are often actually byproducts of the visual system knowing about, and exploiting, reliable properties of real-world lighting when contending with ambiguous retinal images.
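The diffuseness quantification above can be illustrated with a toy decomposition (our own illustrative formulation, not necessarily the paper’s metric): fit the Lambertian shading model L(n) = a + b·max(0, n·d) to multidirectional radiance samples and take a/(a+b) as a diffuseness index, where d is a dominant lighting direction. All function names and the specific index are hypothetical.

```python
import numpy as np

def diffuseness(directions, radiances):
    """Illustrative diffuseness index for multidirectional lighting.

    Fits the Lambertian shading model  L(n) = a + b * max(0, n . d)
    by least squares, with d taken as the radiance-weighted mean
    lighting direction.  Returns a/(a+b): 1.0 for fully diffuse light,
    0.0 for a single collimated source.  (Hypothetical metric for
    illustration; the paper's exact diffuseness measure may differ.)
    """
    directions = np.asarray(directions, float)
    radiances = np.asarray(radiances, float)
    # Dominant direction: radiance-weighted mean of the sample directions.
    d = (radiances[:, None] * directions).sum(axis=0)
    nd = np.linalg.norm(d)
    if nd < 1e-12:          # no net direction at all: fully diffuse
        return 1.0
    d /= nd
    # Predicted Lambertian shading for normals along each sampled direction.
    cosines = np.clip(directions @ d, 0.0, None)
    # Least-squares fit of radiances = a + b * cosines.
    A = np.column_stack([np.ones_like(cosines), cosines])
    (a, b), *_ = np.linalg.lstsq(A, radiances, rcond=None)
    a, b = max(a, 0.0), max(b, 0.0)
    return a / (a + b) if (a + b) > 0 else 1.0

# Six cardinal sample directions (a minimal stand-in for the photometer).
dirs = np.array([[0, 0, 1], [0, 0, -1], [1, 0, 0],
                 [-1, 0, 0], [0, 1, 0], [0, -1, 0]], float)
point = diffuseness(dirs, np.array([1.0, 0, 0, 0, 0, 0]))   # single source
diffuse = diffuseness(dirs, np.ones(6))                      # uniform light
```

A single collimated source yields an index near 0, while equal radiance from all directions yields 1; real scenes fall in between.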
Recently, we developed a method for optimally estimating focus error given a set of natural scenes, a wave-optics
model of the lens system, a sensor array, and a specification of measurement noise. The method is
based on first principles and can be tailored to any vision system for which these properties can be
characterized. Here, the method is used to estimate defocus in local areas of images (64x64 pixels) formed
in a Nikon D700 digital camera fitted with a 50mm Sigma prime lens. Performance is excellent. Defocus
magnitude and sign can be estimated with high precision and accuracy over a wide range. The method
takes an integrative approach that accounts for natural scene statistics and capitalizes on (but does not depend
exclusively on) chromatic aberrations. Although chromatic aberrations are greatly reduced in achromatic
lenses, we show that there are sufficient residual chromatic aberrations in a high-quality prime lens for our
method to achieve good performance. Our method has the advantages of both phase-detection and contrast-measurement
autofocus techniques, without their disadvantages. Like phase detection, the method provides
point estimates of defocus (magnitude and sign), but unlike phase detection, it does not require specialized
hardware. Like contrast measurement, the method is image-based and can operate in "Live View" mode,
but unlike contrast measurement, it does not require an iterative search for best focus. The proposed
approach could be used to develop improved autofocus algorithms for digital imaging and video systems.
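The chromatic cue the method exploits can be illustrated with a toy simulation (our own sketch, not the paper’s Bayesian estimator): longitudinal chromatic aberration defocuses the colour channels unequally, so the sign of defocus can be read from which channel retains more high-frequency energy. The sign convention and function names here are hypothetical.

```python
import numpy as np

def blur1d(x, sigma):
    """Gaussian blur of a 1-D signal via an explicit kernel (toy optics)."""
    if sigma <= 0:
        return x.copy()
    r = int(4 * sigma) + 1
    t = np.arange(-r, r + 1)
    k = np.exp(-0.5 * (t / sigma) ** 2)
    k /= k.sum()
    return np.convolve(x, k, mode="same")

def hf_energy(x):
    """High-frequency energy: variance of the first difference."""
    return np.var(np.diff(x))

def defocus_sign(red, blue):
    """Sign of defocus from relative channel sharpness.

    With longitudinal chromatic aberration, long wavelengths focus
    behind short ones, so a sharper red channel suggests the focal
    plane lies on one side of the sensor (here: +1) and a sharper
    blue channel the other (-1).  Toy heuristic for illustration;
    the actual estimator is Bayesian and uses natural-scene statistics.
    """
    return 1 if hf_energy(red) > hf_energy(blue) else -1

rng = np.random.default_rng(0)
scene = rng.standard_normal(512)          # 1-D stand-in for an image patch
red, blue = blur1d(scene, 1.0), blur1d(scene, 2.0)  # unequal channel blur
```

With the red channel rendered sharper, `defocus_sign(red, blue)` returns +1; swapping the channels flips the sign, which is the information contrast-measurement autofocus cannot recover from a single frame.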
We tracked the points-of-gaze of human observers as they viewed videos drawn from foreign films while engaged
in two different tasks: (1) Quality Assessment and (2) Summarization. Each video was subjected to three
distortion severities - no compression (pristine), low compression, and high compression - using the H.264
compression standard. We have analyzed these eye-movement locations in detail. We extracted local statistical
features around points-of-gaze and used them to answer the following questions: (1) Are there statistical differences
in variances of points-of-gaze across videos between the two tasks? (2) Does the variance in eye movements
indicate a change in viewing strategy with change in distortion severity? (3) Are statistics at points-of-gaze different
from those at random locations? (4) How do local low-level statistics vary across tasks? (5) How do
point-of-gaze statistics vary across distortion severities within each task?
The environments we live in and the tasks we perform in those environments have shaped the design of our visual
systems through evolution and experience. This is an obvious statement, but it implies three fundamental components
of research we must have if we are going to gain a deep understanding of biological vision systems: (a) a rigorous
science devoted to understanding natural environments and tasks, (b) mathematical and computational analysis of how
to use such knowledge of the environment to perform natural tasks, and (c) experiments that allow rigorous
measurement of behavioral and neural responses, either in natural tasks or in artificial tasks that capture the essence of
natural tasks. This approach is illustrated with two example studies that combine measurements of natural scene
statistics, derivation of Bayesian ideal observers that exploit those statistics, and psychophysical experiments that
compare human and ideal performance in naturalistic tasks.
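The Bayesian ideal-observer component of this approach can be sketched in miniature. The example below uses assumed scalar Gaussian likelihoods for two scene categories; the observers described in the text are instead derived from measured natural-scene statistics, so this is only a stand-in for the general recipe.

```python
import numpy as np

def ideal_observer(x, mu_a, mu_b, sigma, prior_a=0.5):
    """Minimal Bayesian ideal observer for two scene categories.

    Given a scalar measurement x with Gaussian likelihoods
    N(mu_a, sigma) and N(mu_b, sigma), choose the category with the
    higher posterior probability.  A toy stand-in: real natural-scene
    ideal observers use likelihoods estimated from scene statistics,
    not assumed Gaussians.
    """
    log_post_a = -0.5 * ((x - mu_a) / sigma) ** 2 + np.log(prior_a)
    log_post_b = -0.5 * ((x - mu_b) / sigma) ** 2 + np.log(1 - prior_a)
    return "a" if log_post_a >= log_post_b else "b"

# With equal priors the rule reduces to choosing the nearer mean;
# unequal priors shift the criterion toward the more probable category.
```

Human efficiency is then measured by comparing psychophysical performance against this upper bound.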
Motion coding in the brain undoubtedly reflects the statistics of retinal image motion occurring in the natural
environment. To characterize these statistics it is useful to measure motion in artificial movies derived from simulated
environments where the "ground truth" is known precisely. Here we consider the problem of coding retinal image
motion when an observer moves through an environment. Simulated environments were created by combining the
range statistics of natural scenes with the spatial statistics of natural images. Artificial movies were then created by
moving along a known trajectory at a constant speed through the simulated environments. We find that across a range
of environments the optimal integration area of local motion sensors increases logarithmically with the speed to which
the sensor is tuned. This result makes predictions for cortical neurons involved in heading perception and may find use
in robotics applications.
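The logarithmic relation reported above can be summarized as area(v) = a + b·ln(v/v0); the constants below are hypothetical placeholders, not fitted values from the study.

```python
import math

def integration_area(speed, a=1.0, b=0.5, v0=0.125):
    """Optimal integration area (arbitrary units) of a local motion
    sensor tuned to `speed`, under the logarithmic growth described in
    the text.  Constants a, b, v0 are hypothetical placeholders."""
    return a + b * math.log(speed / v0)

# Logarithmic growth means each doubling of tuned speed adds the same
# fixed increment (b * ln 2) to the optimal integration area.
areas = [integration_area(v) for v in (1, 2, 4, 8)]
increments = [areas[i + 1] - areas[i] for i in range(3)]
```

The constant increment per octave of speed is the signature one would look for in cortical neurons involved in heading perception.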
We describe an algorithm and software for creating variable resolution displays in real time, contingent upon the direction of gaze. The algorithm takes as input a video sequence and an arbitrary, real-valued, two-dimensional map that specifies a desired amount of filtering (blur) at each pixel location relative to direction of gaze. For each input video image the following operations are performed: (1) the image is coded as a multi-resolution pyramid, (2) the gaze direction is measured, (3) the resolution map is shifted to the gaze direction, (4) the desired filtering at each pixel location is achieved by interpolating between levels of the pyramid using the resolution map, and (5) the interpolated image is displayed. The transfer function associated with each level of the pyramid is calibrated beforehand so that the interpolation produces exactly the desired amount of filtering at each pixel. This algorithm produces precise, artifact-free displays in 8-bit grayscale or 24-bit color. The software can process live or prerecorded video at over 60 frames per second on ordinary personal computers without special hardware. Direction of gaze for each processed video frame may be taken from an eye-tracker, from a sequence of directions saved on disk, or from another pointing device (such as a mouse). The software is demonstrated by simulating the visual fields of normal observers and of patients with low vision. We are currently using the software to precisely control retinal stimulation during complex tasks such as extended visual search.
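Steps (1)-(4) of this algorithm can be sketched in miniature. The sketch below uses box-average downsampling, nearest-neighbour upsampling, and a wrap-around map shift as simplifications; the described system instead calibrates per-level transfer functions so the interpolation hits the requested blur exactly.

```python
import numpy as np

def downsample(img):
    """Halve each dimension by 2x2 averaging (one pyramid level down)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return (img[0::2, 0::2] + img[1::2, 0::2] +
            img[0::2, 1::2] + img[1::2, 1::2]) / 4.0

def upsample_to(img, shape):
    """Nearest-neighbour upsample back to full frame size (toy choice)."""
    ys = (np.arange(shape[0]) * img.shape[0] // shape[0]).clip(0, img.shape[0] - 1)
    xs = (np.arange(shape[1]) * img.shape[1] // shape[1]).clip(0, img.shape[1] - 1)
    return img[np.ix_(ys, xs)]

def foveate(frame, res_map, gaze, levels=4):
    """Steps (1)-(4) in miniature: `res_map` holds the desired pyramid
    level (0 = full resolution) at each pixel relative to gaze; it is
    shifted so its centre lands on the gaze position, then used to
    interpolate between pyramid levels (wrap-around at the edges is a
    toy simplification)."""
    # (1) multi-resolution pyramid, all levels expanded to frame size
    pyr = [frame.astype(float)]
    for _ in range(levels - 1):
        pyr.append(downsample(pyr[-1]))
    expanded = [upsample_to(p, frame.shape) for p in pyr]
    # (2)-(3) shift the resolution map so its centre sits at the gaze point
    cy, cx = np.array(res_map.shape) // 2
    shifted = np.roll(res_map, (gaze[0] - cy, gaze[1] - cx), axis=(0, 1))
    shifted = np.clip(shifted, 0, levels - 1)
    # (4) per-pixel linear interpolation between the two bracketing levels
    lo = np.floor(shifted).astype(int)
    hi = np.minimum(lo + 1, levels - 1)
    frac = shifted - lo
    rows, cols = np.indices(frame.shape)
    stack = np.stack(expanded)                      # (levels, H, W)
    return (1 - frac) * stack[lo, rows, cols] + frac * stack[hi, rows, cols]

rng = np.random.default_rng(1)
frame = rng.random((64, 64))
yy, xx = np.indices((64, 64))
res_map = np.hypot(yy - 32, xx - 32) / 10.0   # sharpest at the map centre
out = foveate(frame, res_map, gaze=(32, 32))
```

At the gaze position the map requests level 0, so the output reproduces the input pixel exactly; blur grows smoothly with distance from gaze.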
Current digital image/video storage, transmission and display technologies use uniformly sampled images. On the other hand, the human retina has a nonuniform sampling density that decreases dramatically as the solid angle from the visual fixation axis increases. Therefore, there is a sampling mismatch between uniformly sampled digital images and the retina. This paper introduces Retinally Reconstructed Images (RRIs), a novel representation of digital images, that enables a resolution match with the human retina. To create an RRI, the size of the input image, the viewing distance and the fixation point should be known. In the RRI coding phase, we compute the 'Retinal Codes', which consist of the retinal sampling locations onto which the input image projects, together with the retinal outputs at these locations. In the decoding phase, we use the backprojection of the Retinal Codes onto the input image grid as B-spline control coefficients, in order to construct a 3D B-spline surface with nonuniform resolution properties. An RRI is then created by mapping the B-spline surface onto a uniform grid, using triangulation. Transmitting or storing the 'Retinal Codes' instead of the full resolution images enables up to two orders of magnitude data compression, depending on the resolution of the input image, the size of the input image and the viewing distance. The data reduction capability of Retinal Codes and RRI is promising for digital video storage and transmission applications. However, the computational burden can be substantial in the decoding phase.
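The order-of-magnitude data reduction follows from integrating a falling sampling density over the visual field. The sketch below uses a cortical-magnification-style falloff with hypothetical constants, not the actual retinal model used for the Retinal Codes, to show why the sample count drops so sharply relative to a uniform grid.

```python
import math

def retinal_samples(field_deg, e2=2.0, foveal_density=120.0):
    """Approximate sample count over a square field, assuming linear
    sampling density (samples/deg) falls as d(e) = foveal_density /
    (1 + e/e2) with eccentricity e -- a cortical-magnification-style
    falloff with hypothetical constants.  Areal density is d(e)**2;
    integrate numerically over the field."""
    n, total = 200, 0.0
    step = field_deg / n
    for i in range(n):
        for j in range(n):
            x = (i + 0.5) * step - field_deg / 2
            y = (j + 0.5) * step - field_deg / 2
            e = math.hypot(x, y)
            d = foveal_density / (1 + e / e2)
            total += (d ** 2) * step * step
    return total

field = 40.0                                 # field of view in degrees
uniform = (field * 120.0) ** 2               # uniform grid at foveal density
ratio = uniform / retinal_samples(field)     # achievable data reduction
```

Even with these placeholder constants, the nonuniform sampling needs tens of times fewer samples than the uniform grid, and the advantage grows with field size and image resolution.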
Foveated imaging exploits the fact that the spatial resolution of the human visual system decreases dramatically away from the point of gaze. Because of this fact, large bandwidth savings are obtained by matching the resolution of the transmitted image to the fall-off in resolution of the human visual system. We have developed a foveated multiresolution pyramid video coder/decoder which runs in real-time on a general purpose computer (i.e., a Pentium with the Windows 95/NT OS). The current system uses a foveated multiresolution pyramid to code each image into 5 or 6 regions of varying resolution. The user-controlled foveation point is obtained from a pointing device (e.g., a mouse or an eyetracker). Spatial edge artifacts between the regions created by the foveation are eliminated by raised-cosine blending across levels of the pyramid, and by 'foveation point interpolation' within levels of the pyramid. Each level of the pyramid is then motion compensated, multiresolution pyramid coded, and thresholded/quantized based upon human contrast sensitivity as a function of spatial frequency and retinal eccentricity. The final lossless coding includes zero-tree coding. Optimal use of foveated imaging requires eye tracking; however, there are many useful applications which do not require eye tracking.
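The raised-cosine blending used to remove edge artifacts between resolution regions can be sketched as a smooth weight on eccentricity; the boundary eccentricities below are hypothetical, and the real coder applies such weights across pyramid levels rather than to single pixels in isolation.

```python
import math

def raised_cosine_blend(x, x0, x1):
    """Blend weight that goes smoothly from 1 (inside the higher-
    resolution region, x <= x0) to 0 (x >= x1) with a raised-cosine
    profile, removing the hard edge between adjacent regions."""
    if x <= x0:
        return 1.0
    if x >= x1:
        return 0.0
    t = (x - x0) / (x1 - x0)
    return 0.5 * (1.0 + math.cos(math.pi * t))

# Across a boundary band between eccentricities 4 and 6 (hypothetical
# units), neighbouring levels are combined as w*fine + (1-w)*coarse.
w_mid = raised_cosine_blend(5.0, 4.0, 6.0)
```

The cosine profile has zero slope at both ends of the band, so neither the weight nor its derivative jumps at a region boundary.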
We have developed a preliminary version of a foveated imaging system, implemented on a general purpose computer, which greatly reduces the transmission bandwidth of images. The system is based on the fact that the spatial resolution of the human eye is space variant, decreasing with increasing eccentricity from the point of gaze. By taking advantage of this fact, it is possible to create an image that is almost perceptually indistinguishable from a constant resolution image, but requires substantially less information to code it. This is accomplished by degrading the resolution of the image so that it matches the space-variant degradation in the resolution of the human eye. Eye movements are recorded so that the high resolution region of the image can be kept aligned with the high resolution region of the human visual system. This system has demonstrated that significant reductions in bandwidth can be achieved while still maintaining access to high detail at any point in an image. The system has been tested using 256 by 256 8 bit gray scale images with a 20 degree field-of-view and eye-movement update rates of 30 Hz (display refresh was 60 Hz). Users of the system have reported minimal perceptual artifacts at bandwidth reductions of up to 94.7% (a factor of 18.8). Bandwidth reduction factors of over 100 are expected once lossless compression techniques are added to the system.
A model of human visual detection performance has been developed, based on available anatomical and physiological data for the primate visual system. The inhomogeneous retino-cortical (IRC) model computes detection thresholds by comparing simulated neural responses to target patterns with responses to a uniform background of the same luminance. The model incorporates human ganglion cell sampling distributions; macaque monkey ganglion cell receptive field properties; macaque cortical cell contrast nonlinearities; and an optimal decision rule based on ideal observer theory. Spatial receptive field properties of cortical neurons were not included. Two parameters were allowed to vary while minimizing the squared error between predicted and observed thresholds. One parameter was decision efficiency, the other was the relative strength of the ganglion-cell center and surround. The latter was only allowed to vary within a small range consistent with known physiology. Contrast sensitivity was measured for sinewave gratings as a function of spatial frequency, target size and eccentricity. Contrast sensitivity was also measured for an airplane target as a function of target size, with and without artificial scotomas. The results of these experiments, as well as contrast sensitivity data from the literature, were compared to predictions of the IRC model. Predictions were reasonably good for grating and airplane targets.
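The ideal-observer decision stage of such a model can be sketched as follows: for independent Gaussian response noise, the optimal rule sums squared normalized response differences across neurons, and a decision-efficiency parameter scales the resulting d' downward. The response values below are placeholders, not IRC model outputs.

```python
import math

def dprime(target_resp, bg_resp, noise_sd, efficiency=0.5):
    """Detectability (d') of a target from simulated neural responses.

    Compares responses to the target pattern against responses to a
    uniform background of the same luminance, pooling with the optimal
    (ideal-observer) rule for independent Gaussian noise: d'^2 sums
    over neurons.  `efficiency` scales the ideal d' downward, in the
    spirit of the model's decision-efficiency parameter.  Response
    values here are placeholders, not model outputs."""
    d2 = sum(((t - b) / noise_sd) ** 2 for t, b in zip(target_resp, bg_resp))
    return efficiency * math.sqrt(d2)

# Threshold is the contrast at which d' reaches a criterion (say 1.0);
# with linear response scaling, doubling contrast doubles d'.
d1 = dprime([1.0, 2.0, 0.5], [0.0, 0.0, 0.0], noise_sd=1.0)
```

Because d' is linear in the response differences here, predicted thresholds scale inversely with efficiency, which is why efficiency can be fit as a single free parameter.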
The contrast response functions of cat and monkey visual cortex neurons reveal two important nonlinearities: expansive response exponents and contrast gain control. These two nonlinearities (when combined with a linear spatiotemporal receptive field) can have beneficial consequences for stimulus selectivity. Expansive response exponents enhance stimulus selectivity introduced by previous neural interactions, thereby relaxing the structural requirements for establishing highly selective neurons. Contrast gain control maintains stimulus selectivity, over a wide range of contrasts, in spite of the limited dynamic response range and the steep slopes of the contrast response function.
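Both nonlinearities are commonly captured by the standard descriptive contrast-response function R(c) = R_max · c^n / (c^n + c50^n): the exponent n > 1 gives the expansive acceleration at low contrast, and the c50 normalization term implements gain control and saturation. The parameter values below are illustrative, not fits to the cat/monkey data discussed in the text.

```python
def contrast_response(c, r_max=1.0, n=2.5, c50=0.2):
    """Standard descriptive contrast-response function combining an
    expansive exponent (n > 1) with contrast gain control (the c50
    normalization term).  Parameter values are illustrative only."""
    return r_max * c ** n / (c ** n + c50 ** n)

# Expansive at low contrast (response accelerates), half-maximal at
# c = c50, and saturating at high contrast as gain control limits the
# response despite the limited dynamic range.
low, mid, high = (contrast_response(c) for c in (0.05, 0.2, 0.8))
```

Because the same normalization divides all stimuli, response *ratios* across stimuli (the neuron's selectivity) are approximately preserved as contrast varies, which is the selectivity-maintenance property described above.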