The first Modelfest group publication appeared in the SPIE Human Vision and Electronic Imaging conference
proceedings in 1999. "One of the group's goals is to develop a public database of test images with threshold data from
multiple laboratories for designing and testing HVS (human visual system) models." After extended discussions the group
selected a set of 45 static images thought to best meet that goal and collected psychophysical detection data, which is
available on the Web and was presented in the 2000 SPIE conference proceedings. Several groups have used these datasets
to test spatial modeling ideas. Further discussions led to a preliminary stimulus specification for extending the database
into the temporal domain, which was published in the 2002 conference proceedings.
After a hiatus of 12 years, some of us have collected spatio-temporal thresholds on an expanded stimulus set of
41 video clips; the original specification included 35 clips. The principal change was the addition of one spatial
pattern beyond the three originally specified. The stimuli consisted of four spatial patterns: a Gaussian blob, a 4 c/d Gabor
patch, an 11.3 c/d Gabor patch, and a 2D white noise patch. Across conditions the patterns were temporally modulated over
a range of approximately 0-25 Hz, with additional temporal edge and pulse modulation conditions. The display and data
collection specifications followed those set out by the Modelfest group in the 2002 conference proceedings.
To date seven subjects have participated in this phase of the data collection effort, one of whom also
participated in the first phase of Modelfest. Three of the spatio-temporal stimuli were identical to conditions in the
original static dataset. Small differences in the thresholds were evident and may point to a stimulus limitation. The
temporal CSF peaked between 4 and 8 Hz for the 0 c/d (Gaussian blob) and 4 c/d patterns. The temporal CSFs of the 4 c/d
and 11.3 c/d Gabor patches were low pass, while that of the 0 c/d pattern was band pass.
This preliminary expansion of the Modelfest dataset needs the participation of additional laboratories to
evaluate the impact of different methods on threshold estimates and to increase the subject base. We eagerly await the
addition of new data from interested researchers. It remains to be seen how accurately general HVS models will predict
thresholds across both Modelfest datasets.
Assorted technologies such as EEG, MEG, fMRI, BEM, MRI, TMS, and BCI are being integrated to understand how
human visual cortical areas interact during controlled laboratory and natural viewing conditions. Our focus is on the
problem of separating signals from the spatially close early visual areas. The solution takes advantage of
known functional anatomy to guide stimulus selection and employs principles of spatial and temporal response
properties that simplify analysis. The method also unifies MEG and EEG recordings and provides a means for improving
existing boundary element head models. Going beyond carefully controlled stimuli, to natural viewing with scanning
eye movements, makes assessing brain states with BCI a most challenging task. Frequent eye movements contribute artifacts
to the recordings. A linear regression method is introduced that is shown to effectively characterize these frequent
artifacts and could be used to remove them. In free viewing, saccadic landings initiate visual processing epochs and
could be used to trigger strictly time-based analysis methods. However, temporal instabilities indicate that frequency-based
analysis would be an important adjunct. A class of Cauchy filter functions is introduced whose narrow time and
frequency properties are well matched to the EEG/MEG spectrum, avoiding channel leakage.
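The artifact-characterization step can be sketched with a toy regression; the reference channels, mixing weights, and signal sizes below are hypothetical stand-ins for real EOG/EEG recordings, not the study's data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 5000
# Hypothetical reference channels (e.g., EOG) that record the eye-movement artifact
eog = rng.standard_normal((n_samples, 2))
true_mix = np.array([[0.8, -0.3],
                     [0.1,  0.5]])        # assumed artifact propagation into 2 EEG channels
neural = 0.5 * rng.standard_normal((n_samples, 2))
eeg = eog @ true_mix + neural             # recorded EEG = artifact + neural signal

# Least-squares regression of each EEG channel on the reference channels
weights, *_ = np.linalg.lstsq(eog, eeg, rcond=None)
cleaned = eeg - eog @ weights             # subtract the characterized artifact
```

Because the regression is linear, the estimated weights can be applied continuously during free viewing once they have been characterized.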
Proc. SPIE. 6806, Human Vision and Electronic Imaging XIII
KEYWORDS: Signal to noise ratio, Visual process modeling, Visualization, Spatial frequencies, Electrodes, Magnetic resonance imaging, Process control, Electroencephalography, Functional magnetic resonance imaging, Magnetoencephalography
The human brain has well over 30 cortical areas devoted to visual processing. Classical neuro-anatomical as well as
fMRI studies have demonstrated that early visual areas have a retinotopic organization whereby adjacent locations in
visual space are represented in adjacent areas of cortex within a visual area. At the 2006 Electronic Imaging meeting we
presented a method using sprite graphics to obtain high resolution retinotopic visual evoked potential responses using
multi-focal m-sequence technology (mfVEP). We have used this method to record mfVEPs from up to 192 non-overlapping
checkerboard stimulus patches, scaled such that each patch activates about 12 mm² of cortex in area V1 and
even less in V2. This dense coverage enables us to incorporate cortical folding constraints, given by anatomical MRI
and fMRI results from the same subject, to isolate the V1 and V2 temporal responses. Moreover, the method offers a
simple means of validating the accuracy of the extracted V1 and V2 time functions by comparing the results between
left and right hemispheres that have unique folding patterns and are processed independently. Previous VEP studies
have been contradictory as to which area responds first to visual stimuli. This new method accurately separates the
signals from the two areas and demonstrates that both respond with essentially the same latency. A new method is
introduced that better isolates cortical areas using an empirically determined forward model. The
method includes a novel steady-state mfVEP and complex SVD techniques. In addition, this evolving technology is put
to use examining how stimulus attributes differentially impact the response in different cortical areas, in particular how
fast nonlinear contrast processing occurs. This question is examined using both state triggered kernel estimation (STKE)
and m-sequence "conditioned kernels". The analysis indicates different contrast gain control processes in areas V1 and
V2. Finally we show that our m-sequence multi-focal stimuli have advantages for integrating EEG and MEG for
improved dipole localization.
The pupil dilation reflex is mediated by inhibition of the parasympathetic Edinger-Westphal oculomotor complex and by
sympathetic activity. It has long been documented that emotional and sensory events elicit a pupillary reflex dilation. Is
the pupil response a reliable marker of a visual detection event? In two experiments where viewers were asked to report
the presence of a visual target during rapid serial visual presentation (RSVP), pupil dilation was significantly associated
with target detection. The amplitude of the dilation depended on the frequency of targets and the time of the detection.
Larger dilations were associated with trials having fewer targets and with targets viewed earlier during the trial. We also
found that dilation was strongly influenced by the visual task.
The typical multifocal stimulus used in visual evoked potential (VEP) studies consists of about 60 checkerboard stimulus patches, each independently contrast reversed according to an m-sequence. Cross correlation of the response (EEG, MEG, ERG, or fMRI) with the m-sequence results in a series of response kernels for each response channel and each stimulus patch. In the past the number and complexity of stimulus patches has been constrained by graphics hardware, namely the use of look-up-table (LUT) animation methods. To avoid such limitations we replaced the LUTs with true color graphic sprites to present arbitrary spatial patterns. To demonstrate the utility of the method we have recorded simultaneously from 192 cortically scaled stimulus patches, each of which activates about 12 mm² of cortex in area V1. Because of the sparseness of cortical folding, very small stimulus patches and robust estimation of dipole source orientation, the method opens a new window on precise spatio-temporal mapping of early visual areas. The use of sprites also enables multiplexing stimuli such that multiple stimuli can be presented at each patch location. We have presented patterns with different orientations (or spatial frequencies) at the same patch locations but independently temporally modulated, effectively doubling the number of stimulus patches, to explore cell population interactions at the same cortical locus. We have also measured nonlinear responses to adjacent pairs of patches, thereby obtaining an edge response that doubles the spatial sampling density to about 1.8 mm on cortex.
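The cross-correlation machinery can be illustrated with a minimal single-patch simulation; the feedback taps, kernel shape, and noise-free response below are illustrative assumptions, not the actual mfVEP pipeline.

```python
import numpy as np

def m_sequence(taps, degree):
    """Binary maximal-length sequence from a Fibonacci LFSR."""
    state = [1] * degree
    seq = []
    for _ in range(2 ** degree - 1):
        seq.append(state[-1])
        feedback = 0
        for t in taps:
            feedback ^= state[t - 1]
        state = [feedback] + state[:-1]
    return np.array(seq)

# Degree-7 sequence (feedback taps 7 and 6), length 127; {0,1} mapped to {+1,-1}
m = 1 - 2 * m_sequence([7, 6], 7)
kernel = np.array([0.0, 1.0, 2.0, 1.0, 0.25])   # hypothetical response kernel
# Simulated response: circular convolution of the stimulus sequence with the kernel
resp = np.real(np.fft.ifft(np.fft.fft(m) * np.fft.fft(kernel, len(m))))
# Cross-correlation with the m-sequence recovers the kernel (up to a small DC bias)
xcorr = np.real(np.fft.ifft(np.fft.fft(resp) * np.conj(np.fft.fft(m)))) / len(m)
```

The near-ideal autocorrelation of the m-sequence (N at zero lag, -1 elsewhere) is what makes the kernel estimates for many simultaneously modulated patches separable.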
Most corneal topographers are slope-based instruments, measuring corneal slope from light reflected by the cornea acting as a mirror. This mirror method limits corneal coverage to about 9 mm in diameter. Both refractive surgery and contact lens fitting actually require larger coverage than is obtainable with slope-based instruments. Height-based instruments should be able to measure a cornea/sclera area that is twice the size (four times the area) of slope-based topographers with an accuracy of a few microns. We have been testing a prototype of a new height-based topographer manufactured by Euclid Systems. We find that single shots can produce corneal coverage of up to 16 mm vertically and 20 mm horizontally. The heights and slopes in the corneal region have good replicability. Although the scleral region is noisier, this is the only available topographer able to measure scleral topography, which is critically important to contact lens fitting. A number of improvements to the Euclid software and hardware would enable it to fill an important niche in eye care and eye research.
Models that predict human performance on narrow classes of visual stimuli abound in the vision science literature. However, the vision and the applied imaging communities need robust general-purpose, rather than narrow, computational human visual system (HVS) models to evaluate image fidelity and quality and ultimately improve imaging algorithms. Of the general-purpose early HVS models that currently exist, direct model comparisons on the same data sets are rarely made. The Modelfest group was formed several years ago to solve these and other vision modeling issues. The group has developed a database of static spatial test images with threshold data that is posted on the WEB for modellers to use in HVS model design and testing. The first phase of data collection was limited to detection thresholds for static gray scale 2D images. The current effort will extend the database to include thresholds for selected grayscale 2D spatio-temporal image sequences. In future years, the database will be extended to include discrimination (masking) for dynamic, color and gray scale image sequences. The purpose of this presentation is to invite the Electronic Imaging community to participate in this effort and to inform them of the developing data set, which is available to all interested researchers. This paper presents the display specifications, psychophysical methods and stimulus definitions for the second phase of the project, spatio-temporal detection. The threshold data will be collected by each of the authors over the next year and presented on the WEB along with the stimuli.
We quantitatively evaluated a technique for combining multiple videokeratograph views of different areas of cornea. To achieve this we first simulated target reflection from analytic descriptions of various shapes believed to mimic common corneal topographies. The splicing algorithm used the simulated reflections to achieve a good quality estimation of the shapes. Actual imagery was then acquired of manufactured models of the same shapes and the splicing algorithm was found to achieve a less perfect estimation. The cause was thought mainly to be image blur due to defocus. To investigate this, blur was introduced into the reflection simulation, and the results of the splicing algorithm compared to those found from the actual imagery.
Proc. SPIE. 3959, Human Vision and Electronic Imaging V
KEYWORDS: Image compression, Visual process modeling, Data modeling, Visualization, Spatial frequencies, Databases, Composites, Video compression, Human vision and color perception, Performance modeling
A robust model of the human visual system (HVS) would have a major practical impact on the difficult technological problems of transmitting and storing digital images. Although most HVS models exhibit similarities, they may have significant differences in predicting performance. Different HVS models are rarely compared using the same set of psychophysical measurements, so their relative efficacy is unclear. The Modelfest organization was formed to solve this problem and accelerate the development of robust new models of human vision. Members of Modelfest have gathered psychophysical threshold data on the year one stimuli described at last year's SPIE meeting. Modelfest is an exciting new approach to modeling involving the sharing of resources, learning from each other's modeling successes and providing a method to cross-validate proposed HVS models. The purpose of this presentation is to invite the Electronic Imaging community to participate in this effort and inform them of the developing database, which is available to all researchers interested in modeling human vision. In future years, the database will be extended to other domains such as visual masking, and temporal processing. This Modelfest progress report summarizes the stimulus definitions and data collection methods used, but focuses on the results of the phase one data collection effort. Each of the authors has provided at least one dataset from their respective laboratories. These data and data collected subsequent to the submission of this paper are posted on the WWW for further analysis and future modeling efforts.
Videokeratography is a common method used by clinicians and researchers to estimate the surface topography of the human cornea. It is based on the object-to-image relationship of concentric rings reflected off the surface of the cornea. This technique works reliably in most cases for the central cornea. However, the accuracy of corneal topography is reduced for the peripheral cornea because of shadows cast by brows and nose and occlusions caused by eyelids. To achieve broader coverage of the peripheral cornea, images of off-centered gaze in four directions can be combined. One difficulty associated with this approach is that the image rings in the peripheral cornea become very irregular, z-shaped, due to abrupt changes in surface topography near the limbus. These irregularities cause complications for current algorithms that estimate the location of edges along each image ring. Many current algorithms make assumptions about the shape and relative positions of image rings to distinguish between different rings. These assumptions no longer hold with off-centered images since the image rings can deviate dramatically from an ellipsoid. Our algorithm overcomes this problem by using fewer assumptions combined with a robust segmentation algorithm to distinguish between image rings.
Successful extended contact lens wear requires lens motion that provides adequate tear mixing to remove ocular debris. Proper lens motion of rigid contact lenses is also important for proper fitting. Moreover, a factor in final lens comfort and optical quality for contact lens fitting is lens centration. Calculation of the post lens volume of rigid contact lenses at different corneal surface locations can be used to produce a volume map. Such maps often reveal channels of minimum volume in which lenses may be expected to move, or local minima, where lenses may be expected to settle. To evaluate the utility of our volume map technology and evaluate other models of contact lens performance we have developed an automated video-based lens tracking system that provides detailed information about lens translation and rotation. The system uses standard video capture technology with a CCD camera attached to an ophthalmic slit lamp biomicroscope. The subject wears a specially marked contact lens for tracking purposes. Several seconds of video data are collected in real-time as the patient blinks naturally. The data are processed off-line, with the experimenter providing initial location estimates of the pupil and lens marks. The technique provides a fast and accurate method of quantifying lens motion. With better contact lens motion information we will gain a better understanding of the relationships between corneal shapes, lens design parameters, tear mixing, and patient comfort.
Packet transmissions over the Internet incur delay jitter that requires data buffering for resynchronization, which is unfavorable for interactive applications. Last year we reported results of formal subjective quality evaluation experiments on delay cognizant video coding (DCVC), which introduces temporal jitter into the video stream. Measures such as MSE and MPQM indicate the introduction of jitter should degrade video quality. However, most observers actually preferred compressed video sequences with delay to sequences without. One reason for this puzzling observation is that the delay introduced by DCVC suppresses the dynamic noise artifacts introduced by compression, thereby improving quality. This observation demonstrates the possibility of reducing bit rate and improving perceived quality at the same time. We have been characterizing conditions in which dynamic quantization noise suppression might improve video quality. A new battery of video test sequences using simple stimuli was developed to avoid the complexity of natural scenes. These sequences are cases where quantization noise produces bothersome temporal flickering artifacts. We found that the significance of the artifacts depends strongly on the local image content. Pseudo code is provided for generating these test stimuli in the hope that they will lead to future video compression algorithms that improve quality by damping temporal artifacts.
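The pseudo code referred to is the authors' own; as an independent, hedged illustration, the sketch below builds one such simple stimulus. A static ramp quantized with fresh dither every frame flickers, while freezing the dither (the effect of delaying noise updates) suppresses the flicker; the ramp pattern, quantizer step, and dither model are invented stand-ins for real codec noise.

```python
import numpy as np

rng = np.random.default_rng(1)
frames, h, w = 30, 64, 64
ramp = np.tile(np.linspace(0.0, 1.0, w), (h, 1))   # static luminance ramp pattern
step = 1.0 / 8.0                                    # coarse quantizer step

# Per-frame dither: the quantization error changes every frame -> temporal flicker
flickery = np.stack([
    np.round((ramp + rng.uniform(-step / 2, step / 2, (h, w))) / step) * step
    for _ in range(frames)
])
# Frozen dither: same quantization error on every frame -> flicker suppressed
d0 = rng.uniform(-step / 2, step / 2, (h, w))
stable = np.stack([np.round((ramp + d0) / step) * step for _ in range(frames)])

def flicker_energy(video):
    """Mean absolute frame-to-frame luminance change."""
    return float(np.abs(np.diff(video, axis=0)).mean())
```

Both sequences have identical per-frame quantization error statistics; only the temporal behavior of the error differs, which is the property under study.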
Proc. SPIE. 3644, Human Vision and Electronic Imaging IV
KEYWORDS: Image compression, Visual process modeling, Data modeling, Visualization, Spatial frequencies, Databases, Linear filtering, Image quality, Human vision and color perception, Performance modeling
Models that predict human performance on narrow classes of visual stimuli abound in the vision science literature. However, the vision and the applied imaging communities need robust general-purpose, rather than narrow, computational human visual system models to evaluate image fidelity and quality and ultimately improve imaging algorithms. Psychophysical measures of image quality are too costly and time consuming to gather to evaluate the impact each algorithm modification might have on image quality.
Prior work on statistical multiplexing of variable-bit-rate network video shows that higher video capacity (more video connections) can be supported if connections have smoother traffic profiles. For delay-critical applications like videoconferencing, smoothing a compressed bit stream indiscriminately is not an option because excess delay would be introduced. In this paper, we present an application of delay cognizant video coding (DCVC) that expands network video capacity by performing traffic smoothing discriminatively. DCVC segments the raw video data and generates two compressed video flows with differential delay requirements, a delay-critical flow and a delay-relaxed flow. The delay-critical flow carries less video information and is thus less bursty. The delay-relaxed flow complements the first flow, and the magnitude of its bursts can be reduced by traffic smoothing. We demonstrate that at equal visual quality measured in PSNR, the network video capacity can be increased by as much as 50 percent through two-flow discriminative traffic smoothing.
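The discriminative-smoothing idea can be sketched with a toy traffic trace; the per-frame bit counts and the averaging window below are invented for illustration and do not model a real codec.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical per-frame bit counts for the two DCVC flows
critical = rng.integers(8, 12, 300).astype(float)   # low-information, less bursty flow
relaxed = rng.integers(0, 60, 300).astype(float)    # bursty complement flow

# Discriminative smoothing: average the delay-relaxed flow over a window,
# exploiting its looser delay requirement; the delay-critical flow is untouched
window = 10
smoothed = np.convolve(relaxed, np.ones(window) / window, mode="same")

peak_before = (critical + relaxed).max()
peak_after = (critical + smoothed).max()
```

The reduced peak of the aggregate trace is what allows more connections to share a fixed-capacity link.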
The conventional synchronous model of digital video, in which video is reconstructed synchronously at the decoder on a frame-by-frame basis, assumes its transport is delay-jitter-free. This assumption is inappropriate for modern integrated-service packet networks such as the Internet, where network delay jitter varies widely. Furthermore, multiframe buffering is not a viable solution in interactive applications such as video conferencing. We have proposed a 'delay cognizant' model of video coding (DCVC) that segments an incoming video into two video flows with different delay attributes. The DCVC decoder operates in an asynchronous reconstruction mode that attempts to maintain image quality in the presence of network delay jitter. Our goal is to maximize the allowable delay of one flow relative to that of the other with minimal effect on image quality, since an increase in the delay offset reflects more tolerance to transmission delay jitter. Subjective quality evaluations indicate that for highly compressed sequences, differences in video quality between reconstructed sequences with large delay offsets and those with zero delay offset are small. Moreover, in some cases asynchronously reconstructed video sequences look better than the zero delay case. DCVC is a promising solution to transport delay jitter in low-bandwidth video conferencing with minimal impact on video quality.
Basic vision science research has reached the point that many investigators are now designing quantitative models of human visual function in areas such as pattern discrimination, motion detection, optical flow, color discrimination, adaptation and stereopsis. These models have practical significance in their application to image compression technologies and as tools for evaluating image quality. We have been working on a vision modeling environment, called Mindseye, that is designed to simplify the implementation and testing of general-purpose spatio-temporal models of human vision. Mindseye is an evolving general-purpose vision-modeling environment that embodies the general structures of the visual system and provides a set of modular tools within a flexible platform tailored to the needs of researchers. The environment employs a user-friendly graphics interface with on-line documentation that describes the functionality of the individual modules. Mindseye, while functional, is still research in progress. We are seeking input from the image compression and evaluation community as well as from the vision science community as to the potential utility of Mindseye, and how it might be enhanced to meet future needs.
Seven types of masking are discussed: multi-component contrast gain control, one-component transducer saturation, two-component phase inhibition, multiplicative noise, high spatial frequency phase locked interference, stimulus uncertainty, and noise intrusion. In the present vision research community, multi-component contrast gain control is gaining in popularity while the one- and two-component masking models are losing adherents. In this paper we take the presently unpopular stance and argue against multi-component gain control models. We have a two-pronged approach. First, we discuss examples where high contrast maskers that overlap the test stimulus in both position and spatial frequency nevertheless produce little masking. Second, we show that alternatives to gain control are still viable, as long as uncertainty and noise intrusion effects are included. Finally, a classification is offered for different types of uncertainty effects that can produce large masking behavior.
The luminance of a given display pixel depends not only on the present input voltage but also on the input voltages for the preceding pixel or pixels along the display raster. This effect, which we refer to as the adjacent pixel nonlinearity, is never compensated for when 2D stimulus patterns are presented on standard display monitors. To compensate for the adjacent pixel nonlinearity, we summarize in this paper the methods for generating a 2D lookup table that corrects for the nonlinearity over most of the display's luminance range. This table works even if the current pixel luminance depends on more than one preceding pixel. The creation of a 2D lookup table involves making a series of calibration measurements and a least squares data fitting procedure to determine the parameters for a model of the adjacent pixel nonlinearity proposed by Mulligan and Stone. Once the parameters are determined for a particular display, the 2D lookup table is created. To increase the available mean luminance we have evaluated the utility of 2D lookup tables when multiple color guns are in use.
One area of applied research in which vision scientists can have a significant impact is in improving image compression technologies by developing a model of human vision which can be used as an image fidelity metric. Scene cuts and other transient events in a video sequence have significant impact on digital video transmission bandwidth. We have therefore been studying masking at transient edge boundaries where bit rate savings might be achieved. Using Crawford temporal and Westheimer spatial masking techniques, we find unexpected stimulus polarity dependent effects. At normal video luminance levels there is a greater than fourfold increase in narrow line detection thresholds near the temporal onset of luminance pedestals. The largest elevations occur for pedestal widths in the range of 2 - 10 min. When the luminance polarity of the test line matches that of the pedestal polarity the masking is much greater than when the test and pedestal have opposite polarities. We believe at least two masking processes are involved: (1) a rapid response saturation in on- or off-center visual mechanisms and (2) a process based on a stimulus ambiguity when the test and pedestal are about the same size. The fact that masking is greatest for local spatial configurations gives one hope for its practical implementation in compression algorithms.
Standard 1D gamma-correcting lookup tables do not compensate for adjacent pixel spatial nonlinearities along the direction of the display raster. These nonlinearities can alter the local mean luminance and contrast of the displayed image. Five steps are described for generating a 2D lookup table (LUT) that compensates for the nonlinearity. By adjusting the 2D LUT so it takes into account the inherent blur at light to dark transitions of the display system, the usable luminance range of the LUT can be extended while reducing the ringing artifact associated with luminance compensation. Use of the blur-compensated 2D LUT incurs no additional computational effort over an uncompensated 2D LUT. Matlab programs are included that can be used to generate a 2D LUT for a user's particular display system.
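The Matlab programs mentioned are the authors' own; the following independent Python sketch illustrates the same construction under a hypothetical forward model (a crude stand-in for calibration measurements, not the Mulligan and Stone model): for every pair of target level and preceding code, tabulate the drive code whose modeled luminance best matches the ideal gamma-corrected target.

```python
import numpy as np

levels = 256
v = np.arange(levels) / (levels - 1)           # normalized drive codes

# Hypothetical forward model of the adjacent pixel nonlinearity: a rising
# transition cannot reach full drive within one pixel (stand-in for measurements)
alpha, gamma = 0.15, 2.2
def lum(cur, prev):
    return ((1 - alpha) * cur + alpha * np.minimum(cur, prev)) ** gamma

target = v ** gamma                             # ideal gamma-corrected luminances

# 2D LUT: lut[t, p] = code to send for target level t when the preceding code was p
lut = np.zeros((levels, levels), dtype=np.uint8)
for p in range(levels):
    candidates = lum(v, v[p])                   # luminance of each code following code p
    lut[:, p] = np.abs(candidates[None, :] - target[:, None]).argmin(axis=1)
```

To display a raster line the table is applied sequentially, each pixel indexed by the code actually emitted for its predecessor: out[i] = lut[row[i], out[i-1]].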
This paper asks how the vision community can contribute to the goal of achieving perceptually lossless image fidelity with maximum compression. In order to maintain a sharp focus the discussion is restricted to the JPEG-DCT image compression standard. The numerous problems that confront vision researchers entering the field of image compression are discussed. Special attention is paid to the connection between the contrast sensitivity function and the JPEG quantization matrix.
Several topics connecting basic vision research to image compression and image quality are discussed: (1) A battery of about 7 specially chosen simple stimuli should be used to tease apart the multiplicity of factors affecting image quality. (2) A 'perfect' static display must be capable of presenting about 135 bits/min2. This value is based on the need for 3 pixels/min and 15 bits/pixel. (3) Image compression allows the reduction from 135 to about 20 bits/min2 for perfect image quality. 20 bit/min2 is the information capacity of human vision. (4) A presumed weakness of the JPEG standard is that it does not allow for Weber's Law nonuniform quantization. We argue that this is an advantage rather than a weakness. (5) It is suggested that all compression studies should report two numbers separately: the amount of compression achieved from quantization and the amount from redundancy coding. (6) The DCT, wavelet and viewprint representations are compared. (7) Problems with extending perceptual losslessness to moving stimuli are discussed. Our approach of working with a 'perfect' image on a 'perfect' display with 'perfect' compression is not directly relevant to the present situation with severely limited channel capacity. Rather than studying perceptually lossless compression we must carry out research to determine what types of lossy transformations are least disturbing to the human observer. Transmission of 'perfect', lossless images will not be practical for many years.
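The arithmetic behind the display-capacity figures in points (2) and (3) can be checked directly; 3 pixels/min is a linear density, so it is squared before multiplying by the bits per pixel.

```python
pixels_per_min = 3                 # linear resolution requirement (pixels per arcmin)
bits_per_pixel = 15
bits_per_min2 = pixels_per_min ** 2 * bits_per_pixel
print(bits_per_min2)               # 135 bits/min^2, the 'perfect display' figure
compression_ratio = bits_per_min2 / 20   # vs. the 20 bits/min^2 capacity of vision
```

The implied compression factor for perceptually lossless coding is thus about 6.75:1 before any redundancy coding.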