Proc. SPIE. 9014, Human Vision and Electronic Imaging XIX
KEYWORDS: Signal to noise ratio, Facial recognition systems, Principal component analysis, Image compression, Visualization, Databases, Nickel, Robotics, Object recognition, Human vision and color perception
We present a novel method, Manifold Sensing, for the adaptive sampling of the visual world based on manifolds of increasing but low dimensionality that have been learned with representative data. Because the data set is adapted during sampling, every new measurement (sample) depends on the previously acquired measurements. This leads to an efficient sampling strategy that requires a low total number of measurements. We apply Manifold Sensing to object recognition on UMIST, Robotics Laboratory, and ALOI benchmarks. For face recognition, with only 30 measurements - this corresponds to a compression ratio greater than 2000 - an unknown face can be localized such that its nearest neighbor in the low-dimensional manifold is almost always the actual nearest image. Moreover, the recognition rate obtained by assigning the class of the nearest neighbor is 100%. For a different benchmark with everyday objects, with only 38 measurements - in this case a compression ratio greater than 700 - we obtain similar localization results and, again, a 100% recognition rate.
In this paper, we present Adaptive Hierarchical Sensing (AHS), a novel adaptive algorithm for sensing sparse signals. For a given but unknown signal with a sparse representation in an orthogonal basis, the sensing task is to identify its non-zero transform coefficients by performing only a few measurements. A measurement is simply the inner product of the signal and a particular measurement vector. During sensing, AHS partially traverses a binary tree and performs one measurement per visited node. AHS is adaptive in the sense that after each measurement a decision is made whether the entire subtree of the current node is either further traversed or omitted depending on the measurement value. In order to acquire an N-dimensional signal that is K-sparse, AHS performs O(K log(N/K)) measurements. With AHS, the signal is easily reconstructed by a basis transform without the need to solve an optimization problem. When sensing full-size images, AHS can compete with a state-of-the-art compressed sensing approach in terms of reconstruction performance versus number of measurements. Additionally, we simulate the sensing of image patches by AHS and investigate the impact of the choice of the sparse coding basis as well as the impact of the tree composition.
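The tree traversal described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the measurement vector of a node is the sum of the basis vectors in its subtree, that traversal starts at the root, and that a fixed threshold decides whether a subtree is omitted. The names `ahs_sense`, `reconstruct`, and `threshold` are hypothetical. Note that coefficients of opposite sign inside one subtree can cancel, which this sketch does not handle.

```python
def ahs_sense(signal, basis, threshold=1e-9):
    """Sense `signal` by partially traversing a binary tree over basis indices.

    Returns (coefficients, number_of_measurements).
    """
    n = len(signal)
    coeffs = [0.0] * n
    count = 0

    def measure(lo, hi):
        # Measurement vector: sum of the basis vectors lo..hi-1 (assumed rule).
        vec = [sum(basis[k][i] for k in range(lo, hi)) for i in range(n)]
        # A measurement is the inner product of signal and measurement vector.
        return sum(s * v for s, v in zip(signal, vec))

    def visit(lo, hi):
        nonlocal count
        value = measure(lo, hi)
        count += 1
        if abs(value) <= threshold:
            return                  # omit the entire subtree
        if hi - lo == 1:
            coeffs[lo] = value      # leaf: the measurement is the coefficient
            return
        mid = (lo + hi) // 2
        visit(lo, mid)
        visit(mid, hi)

    visit(0, n)
    return coeffs, count


def reconstruct(coeffs, basis):
    # Reconstruction is a plain basis transform; no optimization is needed.
    n = len(basis[0])
    return [sum(c * b[i] for c, b in zip(coeffs, basis)) for i in range(n)]


# Toy example: a 2-sparse signal in the standard basis (an assumption).
n = 64
identity = [[1.0 if i == k else 0.0 for i in range(n)] for k in range(n)]
signal = [0.0] * n
signal[5], signal[40] = 2.0, -1.5
coeffs, count = ahs_sense(signal, identity)
```

In this toy setting the number of measurements stays well below the signal dimension because every subtree whose summed coefficient is zero is skipped after a single measurement.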
Looking at the right place at the right time is a critical component of driving skill. Therefore, gaze guidance has the potential to become a valuable driving assistance system. In previous work, we have already shown that complex gaze-contingent stimuli can guide attention and reduce the number of accidents in a simple driving simulator. We here set out to investigate whether cues that are simple enough to be implemented in a real car can also capture gaze during a more realistic driving task in a high-fidelity driving simulator. We used a state-of-the-art, wide-field-of-view driving simulator with an integrated eye tracker. Gaze-contingent warnings were implemented using two arrays of light-emitting diodes horizontally fitted below and above the simulated windshield. Thirteen volunteering subjects drove along predetermined routes in a simulated environment populated with autonomous traffic. Warnings were triggered during the approach to half of the intersections, cueing either towards the right or to the left. The remaining intersections were not cued, and served as controls.
The analysis of the recorded gaze data revealed that the gaze-contingent cues did indeed have a gaze guiding effect, triggering a significant shift in gaze position towards the highlighted direction. This gaze shift was not accompanied by changes in driving behaviour, suggesting that the cues do not interfere with the driving task itself.
Interdisciplinary research in human vision and electronic imaging has greatly contributed to the current state of the art in imaging technologies. Image compression and image quality are prominent examples and the progress made in these areas relies on a better understanding of what natural images are and how they are perceived by the human visual system. A key research question has been: given the (statistical) properties of natural images, what are the most efficient and perceptually relevant image representations, what are the most prominent and descriptive features of images and videos?
We give an overview of how these topics have evolved over the 25 years of HVEI conferences and how they have influenced the current state of the art. There are a number of striking parallels between human vision and electronic imaging. The retina does lateral inhibition, one of the early coders was using a Laplacian pyramid; primary visual cortical areas have orientation- and frequency-selective neurons, the current JPEG standard defines similar wavelet transforms; the brain uses a sparse code, engineers are currently excited about sparse coding and compressed sensing. Some of this has indeed happened at the HVEI conferences and we would like to distill that.
Touch-free gesture technology is beginning to become more popular with consumers and may have a significant future impact on interfaces for digital photography. However, almost every commercial software framework for gesture and pose detection is aimed at either desktop PCs or high-powered GPUs, making mobile implementations for gesture recognition an attractive area for research and development. In this paper we present an algorithm for hand skeleton tracking and gesture recognition that runs on an ARM-based platform (Pandaboard ES, OMAP 4460 architecture). The algorithm uses self-organizing maps to fit a given topology (skeleton) into a 3D point cloud. This is a novel way of approaching the problem of pose recognition as it does not employ complex optimization techniques or data-based learning. After an initial background segmentation step, the algorithm is run in parallel with heuristics, which detect and correct artifacts arising from insufficient or erroneous input data. We then optimize the algorithm for the ARM platform using fixed-point computation and the NEON SIMD architecture the OMAP4460 provides. We tested the algorithm with two different depth-sensing devices (Microsoft Kinect, PMD Camboard). For both input devices we were able to accurately track the skeleton at the native framerate of the cameras.
The imaging properties of small cameras in mobile devices exclude restricted depth-of-field and range-dependent
blur that may provide a sensation of depth. Algorithmic solutions to this problem usually fail because high-
quality, dense range maps are hard to obtain, especially with a mobile device. However, methods like stereo,
shape from focus stacks, and the use of flashlights may yield coarse and sparse range maps. A standard procedure
is to regularize such range maps to make them dense and more accurate. In most cases, regularization leads to
insufficient localization, and sharp edges in depth cannot be handled well. In a wavelet basis, an image is defined
by its significant wavelet coefficients; only these need to be encoded. If we wish to perform range-dependent
image processing, we only need to know the range for the significant wavelet coefficients. We therefore propose
a method that determines a sparse range map only for significant wavelet coefficients, then weights the wavelet
coefficients depending on the associated range information. The image reconstructed from the resulting wavelet
representation exhibits space-variant, range-dependent blur. We present results based on images and range maps
obtained with a consumer stereo camera and a stereo mobile phone.
We here model peripheral vision in a compressed sensing framework as a strategy of optimally guessing what
stimulus corresponds to a sparsely encoded peripheral representation, and find that typical letter-crowding effects
naturally arise from this strategy. The model is simple as it consists of only two convergence stages. We apply
the model to the problem of crowding effects in reading. First, we show a few instructive examples of letter
images that were reconstructed from encodings with different convergence rates. Then, we present an initial
analysis of how the choice of model parameters affects the distortion of isolated and flanked letters.
The present work aims at improving the image quality of low-cost cameras based on multiple exposures, machine
learning, and a perceptual quality measure. The particular implementation consists of two cameras, one being
a high-quality DSLR, the other part of a cell phone. The cameras are connected via USB. Since the system is
designed to take many exposures of the same scene, a stable mechanical coupling of the cameras and the use of
a tripod are required. Details on the following issues are presented: design aspects of the mechanical coupling of
the cameras, camera control via FCam and the Picture Transfer Protocol (PTP), further aspects of the design of
the control software, and post processing of the exposures from both cameras. The cell phone images are taken
with different exposure times and different focus settings and are simultaneously fused. By using the DSLR
image as a reference, the parameters of the fusion scheme are learned from examples and can be used to optimize
the design of the cell phone. First results show that the depth of field can be extended, the dynamic range can
be improved and the noise can be reduced.
Sparse coding learns its basis non-linearly, but the basis elements are still linearly combined to form an image.
Is this linear combination of basis elements a good model for natural images? We here use a non-linear synthesis
rule, such that at each location in the image the point-wise maximum over all basis elements is used to synthesize
the image. We present algorithms for image approximation and basis learning using this synthesis rule. With
these algorithms we explore the pixel-wise maximum over the basis elements as an alternative image model
and thus contribute to the problem of finding a proper representation of natural images.
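The difference between the two synthesis rules can be made concrete with a small sketch. It assumes that each basis element is scaled by a coefficient before the maximum is taken; the abstract does not spell out this detail, and the function names are hypothetical.

```python
def synthesize_linear(basis, coeffs):
    # Standard sparse-coding synthesis: weighted sum of basis elements.
    n = len(basis[0])
    return [sum(a * b[i] for a, b in zip(coeffs, basis)) for i in range(n)]


def synthesize_max(basis, coeffs):
    # Non-linear synthesis: point-wise maximum over the scaled basis elements
    # (the per-element scaling is an assumption of this sketch).
    n = len(basis[0])
    return [max(a * b[i] for a, b in zip(coeffs, basis)) for i in range(n)]


# Two overlapping one-dimensional "basis elements" on three pixels.
basis = [[1.0, 0.5, 0.0],
         [0.0, 0.5, 1.0]]
coeffs = [1.0, 1.0]
linear_image = synthesize_linear(basis, coeffs)
max_image = synthesize_max(basis, coeffs)
```

Where the two elements overlap (the middle pixel), the linear rule adds their contributions, whereas the maximum rule keeps only the larger one.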
The detection of abnormalities is a very challenging problem in computer vision, especially if these abnormalities
must be detected in images of textured surfaces such as textile, stone, or wood. We propose a novel, non-parametric
approach for defect detection in textures that only employs two features. We compute the two
parameters of a Weibull fit for the distribution of image gradients in local regions. Then, we perform a simple
novelty detection algorithm in order to detect arbitrary deviations from the reference texture. To this end, we evaluate
the Euclidean distances of all local patches to a reference point in the Weibull space, where the reference point
is determined for each texture image individually. Thus, our approach becomes independent of the particular
texture type and also independent of a certain defect type.
For performance evaluation we use the highly challenging database provided by Bosch for a contest on
industrial optical inspection with different classes of textures and different defect types. By using the Weibull
parameters we can detect local deviations of texture images in an unsupervised manner with high accuracy.
Compared to existing approaches such as Gabor filters or grey level statistics, our approach is not only powerful,
but also very efficient such that it can also be applied for real-time applications.
The problem of circular object detection and localisation arises quite often in machine vision applications, for
example in semiconductor component inspection. We propose two novel approaches for the precise centre
localisation of circular objects, e.g. p-electrodes of light-emitting diodes. The first approach is based on image
gradients, for which we provide an objective function that is solely based on dot products and can be maximised
by gradient ascent. The second approach is inspired by the concept of isophotes, for which we derive an objective
function that is based on the definition of radial symmetry. We evaluate our algorithms on synthetic images with
several kinds of noise and on images of semiconductor components and we show that they perform better and
are faster than state-of-the-art approaches such as the Hough transform. The radial symmetry approach proved
to be the most robust one, especially for low-contrast images and strong noise, with a mean error of 0.86 pixels
for synthetic images and 0.98 pixels for real-world images. The gradient approach yields more accurate results
for almost all images (mean error of 4 pixels) compared to the Hough transform (8 pixels). Concerning runtime,
the gradient-based approach significantly outperforms the other approaches being 5 times faster than the Hough
transform; the radial symmetry approach is 12% faster.
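A dot-product objective of the kind mentioned for the gradient-based approach can be sketched as follows. The exact form is an assumption: for a candidate centre c, the sketch averages the squared dot products between the unit displacement vectors from c to each edge point and the unit image gradients at those points. For simplicity the maximum is found by exhaustive search over a grid instead of gradient ascent, and the input is an ideal synthetic circle. All names are hypothetical.

```python
import math


def centre_objective(c, points, grads):
    """Mean squared dot product of unit displacements and unit gradients."""
    total = 0.0
    for (x, y), (gx, gy) in zip(points, grads):
        dx, dy = x - c[0], y - c[1]
        norm = math.hypot(dx, dy)
        if norm == 0.0:
            continue  # candidate coincides with the edge point
        total += ((dx * gx + dy * gy) / norm) ** 2
    return total / len(points)


# Synthetic circle: 16 edge points with radially oriented unit gradients.
true_centre = (10.0, 12.0)
angles = [2.0 * math.pi * k / 16 for k in range(16)]
points = [(true_centre[0] + 5.0 * math.cos(a),
           true_centre[1] + 5.0 * math.sin(a)) for a in angles]
grads = [(math.cos(a), math.sin(a)) for a in angles]

# Exhaustive search over integer candidates (stand-in for gradient ascent).
candidates = [(float(x), float(y)) for x in range(21) for y in range(21)]
best = max(candidates, key=lambda c: centre_objective(c, points, grads))
```

At the true centre every displacement vector is parallel to its gradient, so each squared dot product reaches its maximum of one; any other candidate scores lower.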
The saliency of an image or video region indicates how likely it is that the viewer of the image or video fixates
that region due to its conspicuity. An intriguing question is how we can change the video region to make it more
or less salient. Here, we address this problem by using a machine learning framework to learn from a large set
of eye movements collected on real-world dynamic scenes how to alter the saliency level of the video locally. We
derive saliency transformation rules by performing spatio-temporal contrast manipulations (on a spatio-temporal
Laplacian pyramid) on the particular video region. Our goal is to improve visual communication by designing
gaze-contingent interactive displays that change, in real time, the saliency distribution of the scene.
Region-based active contours are a variational framework for image segmentation. The framework involves estimating the
probability distributions of observed features within each image region. Subsequently, these so-called region
descriptors are used to generate forces to move the contour toward real image boundaries. In this paper region
descriptors are computed from samples within windows centered on contour pixels and they are named local
region descriptors (LRDs). With these descriptors we introduce an equation for contour motion with two terms:
growing and competing. This equation yields a novel type of active contour that can adjust the behavior of contour pieces to
image patches and to the presence of other contours. The quality of the proposed motion model is demonstrated
on complex images.
The optimal coding hypothesis proposes that the human visual system has adapted to the statistical properties
of the environment by the use of relatively simple optimality criteria.
We here (i) discuss how the properties of different models of image coding, i.e. sparseness, decorrelation,
and statistical independence, are related to each other, (ii) propose to evaluate the different models by verifiable
performance measures, and (iii) analyse the classification performance on images of handwritten digits (MNIST
database). We first employ the SPARSENET algorithm (Olshausen, 1998) to derive a local filter basis (on 13 × 13
pixel windows). We then filter the images in the database (28 × 28 pixel images of digits) and reduce the
dimensionality of the resulting feature space by selecting the locally maximal filter responses. We then train a
support vector machine on a training set to classify the digits and report results obtained on a separate test
set. Currently, the best state-of-the-art result on the MNIST database has an error rate of 0.4%. This result,
however, has been obtained by using explicit knowledge that is specific to the data (elastic distortion model
for digits). We here obtain an error rate of 0.55%, which is second best but does not use explicit data-specific
knowledge. In particular, it outperforms by far all methods that do not use data-specific knowledge.
Larry Stark has emphasised that what we visually perceive is very much determined by the scanpath, i.e. the pattern of eye movements. Inspired by his view, we have studied the implications of the scanpath for visual communication and came up with the idea to not only sense and analyse eye movements, but also guide them by using a special kind of gaze-contingent information display. Our goal is to integrate gaze into visual communication systems by measuring and guiding eye movements. For guidance, we first predict a set of about 10 salient locations. We then change the probability for one of these candidates to be attended: for one candidate the probability is increased, for the others it is decreased. To increase saliency, for example, we add red dots that are displayed very briefly such that they are hardly perceived consciously. To decrease the probability, for example, we locally reduce the temporal frequency content. Again, if performed in a gaze-contingent fashion with low latencies, these manipulations remain unnoticed. Overall, the goal is to find the real-time video transformation minimising the difference between the actual and the desired scanpath without being obtrusive. Applications are in the area of vision-based communication (better control of what information is conveyed) and augmented vision and learning (guide a person's gaze by the gaze of an expert or a computer-vision system). We believe that our research is very much in the spirit of Larry Stark's views on visual perception and the close link between vision research and engineering.
We first review theoretical results for the problem of estimating single and multiple transparent motions. For N motions
we obtain an M×M generalized structure tensor J_N with M = 3 for one, M = 6 for two, and M = 10 for three motions.
The analysis of motion patterns is based on the ranks of J_N and is thus not only conceptual but provides computable
confidence measures for the different types of motions. To resolve the correspondence between the ranks of the tensors
and the motion patterns, we introduce the projective plane as a new way of describing motion patterns. In the projective
plane, intrinsically 2D spatial patterns (e.g. corners and line ends) that move correspond to points that represent the only
admissible velocity, and 1D spatial patterns (e.g. straight edges) that move correspond to lines that represent, as a set
of points, the set of admissible velocities. We then show a few examples for how the projective plane can be used to
generate novel motion patterns and explain the perception of these patterns. We believe that our results will be useful
for designing new stimuli for visual psychophysics and neuroscience and thereby contribute to the understanding of the
dynamical properties of human vision.
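The tensor sizes quoted above (3, 6, and 10) coincide with the number of distinct monomials of degree N in three variables, M = (N+1)(N+2)/2. Reading M this way is an assumption made here for illustration; the abstract itself only states the three values. The function name `tensor_size` is hypothetical.

```python
from math import comb


def tensor_size(num_motions):
    # Number of distinct degree-N monomials in three variables:
    # C(N + 2, 2) = (N + 1)(N + 2) / 2.
    return comb(num_motions + 2, 2)


sizes = [tensor_size(n) for n in (1, 2, 3)]
```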
One advantage of flat-panel X-ray detectors is the immediate availability
of the acquired images for display. Current limitations in large-area
active-matrix manufacturing technology, however, require that the images
read out from such detectors be processed to correct for inactive pixels.
In static radiographs, these defects can only be interpolated by spatial
filtering. Moving X-ray image modalities, such as fluoroscopy or cine-angiography,
permit the use of temporal information as well. This paper describes interframe
defect interpolation algorithms based on motion compensation and filtering.
Assuming the locations of the defects to be known, we fill in the defective
areas from past frames, where the missing information was visible due to motion.
The motion estimator is based on regularized block matching, with speedup obtained
by successive elimination and related measures. To avoid the motion
estimator locking on to static defects, these are cut out of each block
during matching. Once motion is estimated, three methods are available for
defect interpolation: direct filling-in by the motion-compensated predecessor,
filling-in by a 3D-multilevel median filtered value, and spatiotemporal
mean filtering. Results are shown for noisy fluoroscopy sequences acquired in
clinical routine with varying amounts of motion and simulated defects up to
six lines wide. They show that the 3D-multilevel median filter appears to be the
method of choice since it causes the least blur of the interpolated data, is robust
with respect to motion estimation errors and works even in non-moving
This paper deals with the problem of estimating multiple motions at points where these motions are overlaid. We present a new approach that is based on block-matching and can deal with both transparent motions and occlusions. We derive a block-matching constraint for an arbitrary number of moving layers. We use this constraint to design a hierarchical algorithm that can distinguish between the occurrence of single, transparent, and occluded motions and can thus select the appropriate local motion model. The algorithm adapts to the amount of noise in the image sequence by use of a statistical confidence test. The algorithm is further extended to deal with very noisy images by using a regularization based on Markov Random Fields. Performance is demonstrated on image sequences synthesized from natural textures with high levels of additive dynamic noise.
We present a model that predicts saccadic eye-movements and can be tuned to a particular human observer who is viewing a dynamic sequence of images. Our work is motivated by applications that involve gaze-contingent interactive displays on which information is displayed as a function of gaze direction. The approach therefore differs from standard approaches in two ways: (1) we deal with dynamic scenes, and (2) we provide means of adapting the model to a particular observer. As an indicator for the degree of saliency we evaluate the intrinsic dimension of the image sequence within a geometric approach implemented by using the structure tensor. Out of these candidate saliency-based locations, the currently attended location is selected according to a strategy found by supervised learning. The data are obtained with an eye-tracker and subjects who view video sequences. The selection algorithm receives candidate locations of current and past frames and a limited history of locations attended in the past. We use a linear mapping that is obtained by minimizing the quadratic difference between the predicted and the actually attended location by gradient descent. Being linear, the learned mapping can be quickly adapted to the individual observer.
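The learning step described above, a linear mapping fitted by gradient descent on the quadratic error, can be sketched in a few lines. The sketch predicts a single scalar output (for example one coordinate of the attended location) from a generic feature vector; the feature layout, learning rate, and number of epochs are assumptions, and the toy data are synthetic, not eye-tracking data.

```python
def fit_linear_map(inputs, targets, rate=0.1, epochs=500):
    """Fit weights w minimizing the mean squared error of w . x versus t."""
    dim = len(inputs[0])
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for x, t in zip(inputs, targets):
            err = sum(wi * xi for wi, xi in zip(w, x)) - t
            for j in range(dim):
                # Gradient of the mean squared error with respect to w[j].
                grad[j] += 2.0 * err * x[j] / len(inputs)
        # Plain batch gradient-descent update.
        w = [wi - rate * g for wi, g in zip(w, grad)]
    return w


# Synthetic data generated by the mapping t = 2 * x0 - 1 * x1.
inputs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
targets = [2.0, -1.0, 1.0, 3.0]
weights = fit_linear_map(inputs, targets)
```

Because the model is linear, a previously learned weight vector can serve as the starting point when adapting to a new observer, so only a few additional update steps may be needed.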
This paper deals with the problem of estimating multiple transparent
motions that can occur in computer vision applications, e.g. in the
case of semi-transparencies and occlusions, and also in medical
imaging when different layers of tissue move independently. Methods
based on the known optical-flow equation for two motions are
extended in three ways. Firstly, we include a regularization term
to cope with sparse flow fields. We obtain an Euler-Lagrange system
of differential equations that becomes linear due to the use of the
mixed motion parameters. The system of equations is solved for the
mixed-motion parameters in analogy to the case of only one motion.
To extract the motion parameters, the velocity vectors are treated
as complex numbers and are obtained as the roots of a complex
polynomial of a degree that is equal to the number of overlaid
motions. Secondly, we extend a Fourier-Transform based method
proposed by Vernon so as to obtain analytic solutions for more
than two motions. Thirdly, we not only solve for the overlaid
motions but also separate the moving layers. Performance is
demonstrated by using synthetic and real sequences.
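The extraction of the velocities as roots of a complex polynomial can be illustrated for two motions. The sketch assumes the following definition of the mixed-motion parameters for velocities u = (ux, uy) and v = (vx, vy): cxx = ux·vx, cxy = ux·vy + uy·vx, cyy = uy·vy, cxt = ux + vx, cyt = uy + vy. Parameter names and sign conventions are assumptions and may differ from those used in the paper. Treating u and v as complex numbers, their sum is cxt + i·cyt and their product is (cxx − cyy) + i·cxy, so both are roots of a quadratic.

```python
import cmath


def velocities_from_mixed(cxx, cxy, cyy, cxt, cyt):
    """Recover two velocities (as complex numbers) from mixed parameters."""
    s = complex(cxt, cyt)            # u + v
    p = complex(cxx - cyy, cxy)      # u * v
    # u and v are the roots of z^2 - s z + p = 0.
    disc = cmath.sqrt(s * s - 4.0 * p)
    return (s + disc) / 2.0, (s - disc) / 2.0


# Toy check with known velocities u = (1, 0.5) and v = (-2, 1).
ux, uy, vx, vy = 1.0, 0.5, -2.0, 1.0
mixed = (ux * vx, ux * vy + uy * vx, uy * vy, ux + vx, uy + vy)
roots = sorted(velocities_from_mixed(*mixed), key=lambda z: z.real)
```

For more than two motions the same idea applies with a polynomial whose degree equals the number of overlaid motions, as stated in the abstract; the sketch covers only the quadratic case.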
Nonlinear contributions to pattern classification by humans are analyzed by using previously obtained data on discrimination between aligned lines and offset lines. We show that the optimal linear model can be rejected even when the parameters of the model are estimated individually for each observer. We use a new measure of agreement to reject the linear model and to test simple nonlinear operators. The first nonlinearity is position uncertainty. The linear kernels are shrunk to different extents and convolved with the input images. A Gaussian window weights the results of the convolutions and the maximum in that window is selected as the internal variable. The size of the window is chosen so as to maintain a constant total amount of spatial filtering, i.e., the smaller kernels have a larger position uncertainty. The results for two observers indicate that the best agreement is obtained at a moderate degree of position uncertainty, plus or minus one minute of arc. Finally, we analyze the effect of orientation uncertainty and show that agreement can be further improved in some cases.
In this paper we analyze the properties of a repeated isotropic center-surround inhibition which includes single nonlinearities like half-wave rectification and saturation. Our simulation results show that such operations, here implemented as iterated nonlinear differences and ratios of Gaussians (INDOG and INROG), lead to endstopping. The benefits of the approach are twofold. Firstly, the INDOG can be used to design simple endstopped operators, e.g., corner detectors. Secondly, the results can explain how endstopping might arise in a neural network with purely isotropic characteristics. The iteration can be implemented as cascades by feeding the output of one NDOG to a next stage of NDOG. Alternatively, the INDOG mechanism can be activated in a feedback loop. In the latter case, the resulting spatio-temporal response properties are not separable and the response becomes spatially endstopped if the input is transient. Finally, we show that ON- and OFF-type INDOG outputs can be integrated spatially to result in quasi-topological image features like open versus closed and the number of components.
Proc. SPIE. 2031, Geometric Methods in Computer Vision II
KEYWORDS: Visual process modeling, Visualization, Computer vision technology, Signal processing, Machine vision, Image filtering, Fractal analysis, Visual system, Human vision and color perception, Neurons
Basic properties of 2-D-nonlinear scale-space representations of images are considered. First, local-energy filters are used to estimate the Hausdorff dimension, DH, of images. A new fractal dimension, DN, defined as a property of 2-D-curvature representations on multiple scales, is introduced as a natural extension of traditional fractal dimensions, and it is shown that the two types of fractal dimensions can give a less ambiguous description of fractal image structure. Since fractal analysis is just one (limited) aspect of scale-space analysis, some more general properties of curvature representations on multiple scales are considered. Simulations are used to analyze the stability of curvature maxima across scale and to illustrate that spurious resolution can be avoided by extracting 2-D-curvature features.
This paper considers how basic geometrical properties like curvature, rigidity, and possible embeddings can be related to efficient image encoding and the statistical concept of redundancy. In particular, the redundancy of planar and parabolic patches of images as surfaces is revealed by reconstructing the original image from curvature measures that are zero for non-elliptic regions. This approach also gives a new perspective on encoding principles in biological vision.
Intrinsic signal dimensionality, a property closely related to Gaussian curvature, is shown to be an important conceptual tool in multi-dimensional image processing for both biological and engineering sciences. Intrinsic dimensionality can reveal the relationship between recent theoretical developments in the definition of optic flow and the basic neurophysiological concept of 'end-stopping' of visual cortical cells. It is further shown how the concept may help to avoid certain problems typically arising from the common belief that an explicit computation of a flow field has to be the essential first step in the processing of spatio-temporal image sequences. Signals which cause difficulties in the computation of optic flow, mainly the discontinuities of the motion vector field, are shown to be detectable directly in the spatio-temporal input by evaluation of its three-dimensional curvature. The relevance of the suggested concept is supported by the fact that fast and efficient detection of such signals is of vital importance for ambulant observers in both the biological and the technical domain.
Proc. SPIE. 1249, Human Vision and Electronic Imaging: Models, Methods, and Applications
KEYWORDS: Optical filters, Visual process modeling, Sensors, Signal processing, Human vision and color perception, Electronic filtering, Nonlinear filtering, Electronic imaging, Signal detection, Filtering (signal processing)
Empirical evidence from both psychology and physiology stresses the importance of inherently
two-dimensional signals and corresponding operations in vision. Examples of this are the existence of
"bug-detectors", hypercomplex and dot-responsive cells, the occurrence of contour illusions, and interactions of
patterns with clearly separated orientations. These phenomena cannot be described, and have been largely
ignored, by common theories of size and orientation selective channels. The reason for this is shown to be
located at the heart of the theory of linear systems: their one-dimensional eigenfunctions and the "or"-like
character of the superposition principle. Consequently, a nonlinear theory is needed. We present a first
approach towards a general framework for the description of 2D-signals and 2D-cells in biological vision.