Proc. SPIE. 9014, Human Vision and Electronic Imaging XIX
KEYWORDS: Signal to noise ratio, Facial recognition systems, Principal component analysis, Image compression, Visualization, Databases, Nickel, Robotics, Object recognition, Human vision and color perception
We present a novel method, Manifold Sensing, for the adaptive sampling of the visual world based on manifolds of increasing but low dimensionality that have been learned with representative data. Because the data set is adapted during sampling, every new measurement (sample) depends on the previously acquired measurements. This leads to an efficient sampling strategy that requires a low total number of measurements. We apply Manifold Sensing to object recognition on the UMIST, Robotics Laboratory, and ALOI benchmarks. For face recognition, with only 30 measurements (a compression ratio greater than 2000), an unknown face can be localized such that its nearest neighbor in the low-dimensional manifold is almost always the actual nearest image. Moreover, the recognition rate obtained by assigning the class of the nearest neighbor is 100%. For a different benchmark with everyday objects, with only 38 measurements (in this case a compression ratio greater than 700), we obtain similar localization results and, again, a 100% recognition rate.
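As a highly simplified, non-adaptive illustration of the idea, the sketch below learns a low-dimensional linear manifold by PCA, takes each measurement as an inner product with one learned direction, and localizes a query by its nearest neighbor in the resulting coefficient space. The function names and toy sizes are our own, and the adaptivity of the actual method (each measurement depending on the previous ones) is not modeled here.

```python
import numpy as np

def learn_manifold(data, dim):
    """PCA sketch: learn a `dim`-dimensional linear manifold from data rows."""
    mean = data.mean(axis=0)
    # principal directions = top right singular vectors of the centred data
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:dim]                # (mean, dim x N basis)

def sense(signal, mean, basis):
    """Each 'measurement' is one inner product with a learned direction."""
    return basis @ (signal - mean)       # dim measurements in total

def nearest_neighbor(coeffs, db_coeffs):
    """Index of the database item closest to the sensed coefficients."""
    return int(np.argmin(((db_coeffs - coeffs) ** 2).sum(axis=1)))

# toy usage: 50 'images' of dimension 400, sensed with only 5 measurements
rng = np.random.default_rng(0)
db = rng.normal(size=(50, 400))
mean, basis = learn_manifold(db, dim=5)
db_coeffs = np.array([sense(x, mean, basis) for x in db])
query = db[17] + 0.01 * rng.normal(size=400)   # noisy copy of item 17
print(nearest_neighbor(sense(query, mean, basis), db_coeffs))
```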
In this paper, we present Adaptive Hierarchical Sensing (AHS), a novel adaptive hierarchical sensing algorithm for sparse signals. For a given but unknown signal with a sparse representation in an orthogonal basis, the sensing task is to identify its non-zero transform coefficients by performing only a few measurements. A measurement is simply the inner product of the signal and a particular measurement vector. During sensing, AHS partially traverses a binary tree and performs one measurement per visited node. AHS is adaptive in the sense that after each measurement a decision is made, depending on the measurement value, whether the entire subtree of the current node is further traversed or omitted. In order to acquire an N-dimensional signal that is K-sparse, AHS performs O(K log N/K) measurements. With AHS, the signal is easily reconstructed by a basis transform without the need to solve an optimization problem. When sensing full-size images, AHS can compete with a state-of-the-art compressed sensing approach in terms of reconstruction performance versus number of measurements. Additionally, we simulate the sensing of image patches by AHS and investigate the impact of the choice of the sparse coding basis as well as the impact of the tree composition.
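The tree traversal can be sketched for the simplest case: a nonnegative signal that is sparse in the canonical basis, so that each node's measurement vector is just the sum of the basis vectors below it. This is our own simplification for illustration, not the paper's general orthogonal-basis setting.

```python
import numpy as np

def ahs(measure, lo, hi, threshold, coeffs, counter):
    """Traverse the binary tree over index range [lo, hi); one measurement per node."""
    counter[0] += 1
    value = measure(lo, hi)               # inner product with the node's vector
    if value <= threshold:
        return                            # prune the entire subtree
    if hi - lo == 1:
        coeffs[lo] = value                # leaf: recovered transform coefficient
        return
    mid = (lo + hi) // 2
    ahs(measure, lo, mid, threshold, coeffs, counter)
    ahs(measure, mid, hi, threshold, coeffs, counter)

# toy K-sparse, nonnegative signal in the canonical basis (N = 64, K = 3)
N = 64
x = np.zeros(N)
x[[5, 22, 40]] = [1.0, 2.0, 0.5]

coeffs = np.zeros(N)
counter = [0]
# node measurement vector = sum of the basis vectors below the node,
# so the measurement is simply a partial sum of the signal
ahs(lambda lo, hi: x[lo:hi].sum(), 0, N, 0.1, coeffs, counter)
print(counter[0], np.allclose(coeffs, x))   # far fewer than N measurements
```

Subtrees that contain no significant coefficient are pruned after a single measurement, which is where the O(K log N/K) measurement count comes from.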
Touch-free gesture technology is beginning to become more popular with consumers and may have a significant future impact on interfaces for digital photography. However, almost every commercial software framework for gesture and pose detection is aimed at either desktop PCs or high-powered GPUs, making mobile implementations for gesture recognition an attractive area for research and development. In this paper we present an algorithm for hand skeleton tracking and gesture recognition that runs on an ARM-based platform (Pandaboard ES, OMAP4460 architecture). The algorithm uses self-organizing maps to fit a given topology (skeleton) into a 3D point cloud. This is a novel way of approaching the problem of pose recognition, as it does not employ complex optimization techniques or data-based learning. After an initial background segmentation step, the algorithm is run in parallel with heuristics that detect and correct artifacts arising from insufficient or erroneous input data. We then optimize the algorithm for the ARM platform using fixed-point computation and the NEON SIMD architecture the OMAP4460 provides. We tested the algorithm with two different depth-sensing devices (Microsoft Kinect, PMD Camboard). For both input devices we were able to accurately track the skeleton at the native framerate of the cameras.
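A minimal sketch of the underlying idea, fitting a chain topology into a point cloud with a self-organizing map, might look as follows. The chain topology, parameters, and 2D toy data are our own stand-ins for the actual hand skeleton and 3D depth input.

```python
import numpy as np

def fit_som_chain(points, n_nodes=8, epochs=30, lr=0.5, radius=2.0):
    """Fit a 1-D chain of SOM nodes into a point cloud (rows of `points`)."""
    rng = np.random.default_rng(1)
    nodes = points[rng.choice(len(points), n_nodes)].astype(float)
    for epoch in range(epochs):
        # shrink learning rate and neighborhood radius over time
        a = lr * (1 - epoch / epochs)
        r = max(radius * (1 - epoch / epochs), 0.5)
        for p in points[rng.permutation(len(points))]:
            best = np.argmin(((nodes - p) ** 2).sum(axis=1))   # winning node
            # pull the winner and its topological neighbors toward the sample
            d = np.abs(np.arange(n_nodes) - best)              # chain distance
            h = np.exp(-(d ** 2) / (2 * r ** 2))
            nodes += a * h[:, None] * (p - nodes)
    return nodes

# toy usage: points sampled along a line; the chain should spread along it
t = np.linspace(0, 1, 200)
cloud = np.stack([t, 0.5 * t], axis=1)
cloud += 0.01 * np.random.default_rng(2).normal(size=cloud.shape)
skeleton = fit_som_chain(cloud)
```

Because the update rule is just a neighborhood-weighted pull toward each sample, it involves no matrix solves or training data, which is what makes it attractive for fixed-point SIMD implementations.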
The imaging properties of small cameras in mobile devices preclude a restricted depth of field and the range-dependent blur that may provide a sensation of depth. Algorithmic solutions to this problem usually fail because high-quality, dense range maps are hard to obtain, especially with a mobile device. However, methods like stereo, shape from focus stacks, and the use of flashlights may yield coarse and sparse range maps. A standard procedure is to regularize such range maps to make them dense and more accurate. In most cases, however, regularization leads to insufficient localization, and sharp edges in depth cannot be handled well. In a wavelet basis, an image is defined by its significant wavelet coefficients, and only these need to be encoded. If we wish to perform range-dependent image processing, we only need to know the range for the significant wavelet coefficients. We therefore propose a method that determines a sparse range map only for significant wavelet coefficients and then weights the wavelet coefficients depending on the associated range information. The image reconstructed from the resulting wavelet representation exhibits space-variant, range-dependent blur. We present results based on images and range maps obtained with a consumer stereo camera and a stereo mobile phone.
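The weighting step can be sketched with a single-level 2-D Haar transform: detail coefficients are attenuated as a function of how far the local range deviates from an in-focus range. The transform level, the Gaussian weighting function, and the parameter values below are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def haar2(img):
    """Single-level 2-D Haar transform (image size must be even)."""
    a = (img[0::2] + img[1::2]) / 2          # row averages
    d = (img[0::2] - img[1::2]) / 2          # row differences
    ll, lh = (a[:, 0::2] + a[:, 1::2]) / 2, (a[:, 0::2] - a[:, 1::2]) / 2
    hl, hh = (d[:, 0::2] + d[:, 1::2]) / 2, (d[:, 0::2] - d[:, 1::2]) / 2
    return ll, lh, hl, hh

def ihaar2(ll, lh, hl, hh):
    """Inverse of haar2."""
    a = np.empty((ll.shape[0], 2 * ll.shape[1]))
    d = np.empty_like(a)
    a[:, 0::2], a[:, 1::2] = ll + lh, ll - lh
    d[:, 0::2], d[:, 1::2] = hl + hh, hl - hh
    out = np.empty((2 * a.shape[0], a.shape[1]))
    out[0::2], out[1::2] = a + d, a - d
    return out

def range_dependent_blur(img, range_map, focus):
    """Attenuate detail coefficients where the range deviates from `focus`."""
    ll, lh, hl, hh = haar2(img)
    # coarse range per coefficient; weight -> 0 far from the focus range
    r = range_map[0::2, 0::2]
    w = np.exp(-((r - focus) ** 2) / 0.1)
    return ihaar2(ll, lh * w, hl * w, hh * w)

# toy usage: an in-focus range (w = 1) leaves the image unchanged
img = np.arange(64.0).reshape(8, 8)
out = range_dependent_blur(img, np.full((8, 8), 1.0), focus=1.0)
```

Where the range is far from the focus, the detail coefficients vanish and the reconstruction degenerates to local averages, i.e. space-variant blur.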
The present work aims at improving the image quality of low-cost cameras based on multiple exposures, machine learning, and a perceptual quality measure. The particular implementation consists of two cameras, one being a high-quality DSLR, the other being part of a cell phone. The cameras are connected via USB. Since the system is designed to take many exposures of the same scene, a stable mechanical coupling of the cameras and the use of a tripod are required. Details on the following issues are presented: design aspects of the mechanical coupling of the cameras, camera control via FCam and the Picture Transfer Protocol (PTP), further aspects of the design of the control software, and post-processing of the exposures from both cameras. The cell phone images are taken with different exposure times and different focus settings and are jointly fused. By using the DSLR image as a reference, the parameters of the fusion scheme are learned from examples and can be used to optimize the design of the cell phone. First results show that the depth of field can be extended, the dynamic range can be improved, and the noise can be reduced.
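The learning idea can be sketched in its simplest possible form: fuse the exposures with per-exposure weights fitted by least squares against the reference image. This is a stand-in for the actual fusion scheme, whose structure and parameters are not specified here.

```python
import numpy as np

def learn_fusion_weights(exposures, reference):
    """Fit weights w minimizing ||sum_i w_i * I_i - reference||^2."""
    A = np.stack([e.ravel() for e in exposures], axis=1)   # pixels x exposures
    w, *_ = np.linalg.lstsq(A, reference.ravel(), rcond=None)
    return w

def fuse(exposures, w):
    """Apply the learned per-exposure weights."""
    return sum(wi * e for wi, e in zip(w, exposures))

# toy usage: the 'reference' is a known mixture of two exposures
rng = np.random.default_rng(3)
e1, e2 = rng.random((16, 16)), rng.random((16, 16))
ref = 0.7 * e1 + 0.3 * e2
w = learn_fusion_weights([e1, e2], ref)
print(np.round(w, 3))   # recovers the mixture weights
```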
Sparse coding learns its basis non-linearly, but the basis elements are still linearly combined to form an image.
Is this linear combination of basis elements a good model for natural images? We here use a non-linear synthesis
rule, such that at each location in the image the point-wise maximum over all basis elements is used to synthesize
the image. We present algorithms for image approximation and basis learning using this synthesis rule. With
these algorithms we explore the pixel-wise maximum over the basis elements as an alternative image model
and thus contribute to the problem of finding a proper representation of natural images.
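A minimal sketch of the two synthesis rules, with a hypothetical two-element basis of overlapping "blobs":

```python
import numpy as np

def max_synthesize(coeffs, basis):
    """Non-linear synthesis: pixel-wise maximum over scaled basis elements."""
    return np.max(coeffs[:, None, None] * basis, axis=0)

def linear_synthesize(coeffs, basis):
    """Standard linear synthesis (weighted sum) for comparison."""
    return np.tensordot(coeffs, basis, axes=1)

# toy basis: two overlapping rectangular 'blobs' on a 4x4 grid
basis = np.zeros((2, 4, 4))
basis[0, :2, :] = 1.0       # top half
basis[1, :, :2] = 1.0       # left half
c = np.array([0.5, 0.8])
img_max = max_synthesize(c, basis)     # overlap keeps max(0.5, 0.8) = 0.8
img_sum = linear_synthesize(c, basis)  # overlap sums to 0.5 + 0.8
```

In the overlap region the max rule keeps the strongest contribution instead of accumulating them, which is the behavioral difference the abstract investigates.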
The optimal coding hypothesis proposes that the human visual system has adapted to the statistical properties
of the environment by the use of relatively simple optimality criteria.
We here (i) discuss how the properties of different models of image coding, i.e. sparseness, decorrelation, and statistical independence, are related to each other; (ii) propose to evaluate the different models by verifiable performance measures; and (iii) analyse the classification performance on images of handwritten digits (MNIST database). We first employ the SPARSENET algorithm (Olshausen, 1998) to derive a local filter basis (on 13 × 13 pixel windows). We then filter the images in the database (28 × 28 pixel images of digits) and reduce the dimensionality of the resulting feature space by selecting the locally maximal filter responses. We then train a support vector machine on a training set to classify the digits and report results obtained on a separate test set. Currently, the best state-of-the-art result on the MNIST database has an error rate of 0.4%. This result, however, has been obtained by using explicit knowledge that is specific to the data (an elastic distortion model for digits). We here obtain an error rate of 0.55%, which is second best but does not use explicit data-specific knowledge. In particular, it outperforms by far all methods that do not use data-specific knowledge.
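The feature-extraction step (filtering followed by selection of locally maximal responses) can be sketched as follows. The random filters are stand-ins for the learned SPARSENET basis, and the pooling size is our own assumption.

```python
import numpy as np

def filter_responses(img, filters):
    """Valid-mode correlation of `img` with each filter in the bank."""
    fh, fw = filters.shape[1:]
    h, w = img.shape[0] - fh + 1, img.shape[1] - fw + 1
    # all fh x fw patches of the image, flattened: (h*w) x (fh*fw)
    patches = np.lib.stride_tricks.sliding_window_view(img, (fh, fw))
    patches = patches.reshape(h * w, fh * fw)
    return (patches @ filters.reshape(len(filters), -1).T).reshape(h, w, -1)

def local_max_features(responses, pool=4):
    """Keep only the locally maximal responses (max pooling per filter)."""
    h, w, k = responses.shape
    h2, w2 = h // pool, w // pool
    r = responses[:h2 * pool, :w2 * pool]
    r = r.reshape(h2, pool, w2, pool, k)
    return r.max(axis=(1, 3)).ravel()

# toy usage: a 28x28 'digit' and 8 random 13x13 stand-in filters
rng = np.random.default_rng(4)
img = rng.random((28, 28))
filters = rng.normal(size=(8, 13, 13))
feats = local_max_features(filter_responses(img, filters))
print(feats.shape)   # much smaller than the full response maps
```

The pooled feature vector (here 128 values instead of 8 full 16 × 16 response maps) would then be the input to the support vector machine.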
Larry Stark has emphasised that what we visually perceive is very much determined by the scanpath, i.e. the pattern of eye movements. Inspired by his view, we have studied the implications of the scanpath for visual communication and came up with the idea to not only sense and analyse eye movements, but also guide them by using a special kind of gaze-contingent information display. Our goal is to integrate gaze into visual communication systems by measuring and guiding eye movements. For guidance, we first predict a set of about 10 salient locations. We then change the probability for one of these candidates to be attended: for one candidate the probability is increased, for the others it is decreased. To increase saliency, for example, we add red dots that are displayed very briefly such that they are hardly perceived consciously. To decrease the probability, for example, we locally reduce the temporal frequency content. Again, if performed in a gaze-contingent fashion with low latencies, these manipulations remain unnoticed. Overall, the goal is to find the real-time video transformation minimising the difference between the actual and the desired scanpath without being obtrusive. Applications are in the area of vision-based communication (better control of what information is conveyed) and augmented vision and learning (guide a person's gaze by the gaze of an expert or a computer-vision system). We believe that our research is very much in the spirit of Larry Stark's views on visual perception and the close link between vision research and engineering.
We present a model that predicts saccadic eye movements and can be tuned to a particular human observer who is viewing a dynamic sequence of images. Our work is motivated by applications that involve gaze-contingent interactive displays on which information is displayed as a function of gaze direction. The approach therefore differs from standard approaches in two ways: (1) we deal with dynamic scenes, and (2) we provide means of adapting the model to a particular observer. As an indicator of the degree of saliency, we evaluate the intrinsic dimension of the image sequence within a geometric approach implemented by using the structure tensor. Out of these saliency-based candidate locations, the currently attended location is selected according to a strategy found by supervised learning. The data are obtained with an eye tracker and subjects who view video sequences. The selection algorithm receives candidate locations of current and past frames and a limited history of locations attended in the past. We use a linear mapping that is obtained by gradient descent, minimizing the quadratic difference between the predicted and the actually attended location. Being linear, the learned mapping can be quickly adapted to the individual observer.
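A minimal sketch of the learning step, with hypothetical feature vectors: a linear mapping W from candidate-location features to the next attended location (x, y) is fitted by gradient descent on the quadratic prediction error.

```python
import numpy as np

def train_linear_predictor(X, Y, lr=0.01, steps=2000):
    """Minimize ||X W - Y||^2 by gradient descent; W maps features -> (x, y)."""
    W = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(steps):
        err = X @ W - Y                   # prediction error per sample
        W -= lr * X.T @ err / len(X)      # gradient of the quadratic loss
    return W

# toy usage: 6-dimensional features (e.g. candidate and past locations)
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 6))
true_W = rng.normal(size=(6, 2))
Y = X @ true_W                            # gaze targets generated linearly
W = train_linear_predictor(X, Y)
print(np.abs(W - true_W).max())           # near zero after training
```

Because the model is linear, a few further gradient steps on a new observer's data suffice to re-adapt W, which is the point made in the abstract.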