Face analysis in a real-world environment is a complex task, as it must deal with challenging problems such as pose variations,
illumination changes and complex backgrounds. The use of active appearance models for facial feature detection
is often successful in restricted environments, but performance decreases when they are applied in unconstrained environments.
Therefore, in this paper, we introduce a novel method that integrates the knowledge of a face detector inside the shape and
the appearance models by using what we call a 'virtual structuring element' (VSE). In this way the possible settings of the
active appearance models are constrained in an appearance-driven manner. The use of a virtual structuring element in an
active appearance model provides increased performance in both accuracy and robustness over standard active appearance
models applied to different environments.
The choice of a colour space is of great importance for many computer vision algorithms (e.g. edge detection and object recognition), as it induces the equivalence classes available to the actual algorithm. Since many colour spaces are available, the problem is how to automatically select the weighting that integrates the colour spaces so as to produce the best result for a particular task. In this paper we propose a method to learn these weights, exploiting the non-perfect correlation between features computed in different colour spaces through the principle of diversification. As a result, an optimal trade-off is achieved between repeatability and distinctiveness, and the resulting weighting scheme ensures maximal feature discrimination.
The method is experimentally verified for three feature detection tasks: skin colour detection, edge detection and corner detection. In all three tasks the method achieved an optimal trade-off between (colour) invariance (repeatability) and discriminative power (distinctiveness).
Recent technological advances have enabled human users to interact with computers in ways previously unimaginable. Beyond the confines of the keyboard and mouse, new modalities for human-computer interaction such as voice, gesture, and force-feedback are emerging. Despite important advances, one necessary ingredient for natural
interaction is still missing: emotions. Emotions play an important role in human-to-human communication and interaction, allowing people to express themselves beyond the verbal domain. The ability to understand human emotions is desirable for the computer in several applications. This paper explores new ways of human-computer
interaction that enable the computer to be more aware of the user's emotional and attentional expressions. We present the basic research in the field and recent advances in emotion recognition from facial, voice, and physiological signals, where the different modalities are treated independently. We then describe the challenging problem of multimodal emotion recognition and advocate the use of probabilistic graphical models for fusing the different modalities. We also discuss the difficult issues of obtaining reliable affective data, obtaining ground truth for emotion recognition, and the use of unlabeled data.
We consider the well-known problem of segmenting a color image into foreground and background pixels. Such a result can be obtained by segmenting the red, green and blue channels directly. Alternatively, it may be obtained by transforming the color image into other color spaces, such as HSV or normalized colors. The problem then is how to select the color space or color channel that produces the best segmentation result. Furthermore, if more than one channel is an equally good candidate, the next problem is how to combine the results. In this article, we investigate whether the principles of the formal diversification model of Markowitz (1952) can be applied to solve the problem. We verify, in theory and in practice, that the proposed diversification model can be applied effectively to determine the most appropriate combination of color spaces for the application at hand.
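The diversification idea can be sketched in a few lines. The code below is an illustrative toy, not the paper's formulation: it treats each channel's per-pixel foreground score as an "asset", uses the mean class separation as the expected return, the score covariance as the risk, and takes mean-variance weights w ∝ Σ⁻¹μ. All names and the synthetic data are assumptions.

```python
import numpy as np

def markowitz_weights(scores, labels):
    """Mean-variance weighting of per-channel segmentation scores.

    scores: (n_pixels, n_channels) foreground scores, one column per channel.
    labels: (n_pixels,) ground-truth {0,1} foreground mask (training data).
    Returns weights summing to 1 that favour channels with a large mean
    separation and penalise correlated, high-variance channels.
    """
    # "Return" of each channel: how well it separates fg from bg on average.
    mu = scores[labels == 1].mean(axis=0) - scores[labels == 0].mean(axis=0)
    # Covariance of the channel scores models their (imperfect) correlation.
    sigma = np.cov(scores, rowvar=False)
    w = np.linalg.solve(sigma + 1e-6 * np.eye(len(mu)), mu)
    return w / w.sum()

# Toy example: channel 0 separates well; 1 and 2 are noisier measurements.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 500)
good = labels + rng.normal(0, 0.3, 500)
scores = np.column_stack([good,
                          labels + rng.normal(0, 1.0, 500),
                          labels + rng.normal(0, 1.0, 500)])
w = markowitz_weights(scores, labels)
print(w)  # channel 0 receives by far the largest weight
```

The combined segmentation score is then simply `scores @ w`, thresholded as usual.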
In Content-based Image Retrieval the comparison of a query image and each of the database images is defined by a similarity distance obtained from the two feature vectors involved. These feature vectors can be seen as sets of noisy indexes. Unlike text matching (which is exact), image matching is only approximate, leading to ranking methods. Only images at the top ranks (within the scope) are returned as retrieval results. Image retrieval performance characterization has mainly been based on measures available from probabilistic text retrieval in the form of Precision-Recall or Precision-Scope graphs. However, these graphs offer an incomplete overview of the image retrieval system under study. Essential information about how the success of the query is influenced by the size and type of irrelevant images is missing. Due to the inexactness of the visual matching process, the effect of the irrelevant embedding, represented in the additional performance measure generality, plays an important role.
In general, a performance graph will therefore be three-dimensional: a Generality-Recall-Precision Graph. By choosing appropriate scope values, a new two-dimensional performance graph is proposed to replace the commonly used Precision-Recall Graph as the better choice for total recall studies.
Many current video analysis systems fail to fully acknowledge the process that resulted in the acquisition of the video data, i.e. they do not consider the complete multimedia system that encompasses the several physical processes leading to the captured video data. This multimedia system includes the physical process that created the appearance of the captured objects, the capturing of the data by the sensor (camera), and a model of the domain the video data belongs to. By modelling this complete multimedia system, a much more robust and theoretically sound approach to video analysis can be taken. In this paper we describe such a system for the detection, recognition and tracking of objects in videos. We introduce an extension of the mean shift tracking process, based on a detailed model of the video capturing process. This system is used for two applications in the soccer video domain: billboard recognition and tracking, and player tracking.
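For reference, the basic mean shift iteration that such an extension builds on can be sketched as follows. This is the standard flat-window version operating on a per-pixel likelihood map (e.g. a colour-histogram back-projection), not the paper's extended model of the capturing process; all names and the toy data are illustrative.

```python
import numpy as np

def mean_shift(likelihood, start, win=15, iters=20, eps=0.5):
    """Basic mean shift mode seeking on a per-pixel likelihood map.

    likelihood: 2-D array, e.g. a colour-histogram back-projection.
    start: (row, col) initial window centre.
    Repeatedly moves the window to the likelihood-weighted centroid of
    the pixels it covers, converging to a nearby local mode.
    """
    cy, cx = float(start[0]), float(start[1])
    h, w = likelihood.shape
    for _ in range(iters):
        y0, y1 = max(0, int(cy) - win), min(h, int(cy) + win + 1)
        x0, x1 = max(0, int(cx) - win), min(w, int(cx) + win + 1)
        patch = likelihood[y0:y1, x0:x1]
        total = patch.sum()
        if total == 0:
            break
        ys, xs = np.mgrid[y0:y1, x0:x1]
        ny, nx = (ys * patch).sum() / total, (xs * patch).sum() / total
        done = np.hypot(ny - cy, nx - cx) < eps
        cy, cx = ny, nx
        if done:
            break
    return cy, cx

# Toy likelihood: a Gaussian blob centred at (40, 60).
ys, xs = np.mgrid[0:100, 0:100]
blob = np.exp(-((ys - 40) ** 2 + (xs - 60) ** 2) / (2 * 5.0 ** 2))
peak = mean_shift(blob, start=(30, 50))
print(peak)  # converges near (40, 60)
```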
In this paper we present a system for the localisation and tracking of
billboards in streamed soccer matches. The application area for this research is the delivery of customised content to end users. When international soccer matches are broadcast, the diversity of the audience is very large, and advertisers would like to be able to adapt the billboards to the different audiences. This can be achieved by replacing the billboards in the video stream. In order to build a more robust system, photometric invariant features are used. These colour features are less susceptible to changes in illumination. Sensor noise is dealt with through variable kernel density estimation.
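A minimal sketch of the two ingredients just mentioned, under the assumption that the photometric invariant is normalised rg chromaticity and that "variable" means a per-sample kernel bandwidth chosen from nearest-neighbour distances; the paper's exact estimator may differ, and the "billboard" pixels below are synthetic.

```python
import numpy as np

def rg_chromaticity(rgb):
    """Normalised rg chromaticities: invariant to intensity changes
    (shading, shadows) under a white light source."""
    s = rgb.sum(axis=-1, keepdims=True).astype(float)
    s[s == 0] = 1.0
    return (rgb / s)[..., :2]

def variable_kde(samples, query, k=5):
    """Sample-point variable kernel density estimate: each training
    sample gets its own Gaussian bandwidth, set to its distance to the
    k-th nearest neighbour, so sparse regions are smoothed more."""
    d = np.linalg.norm(samples[:, None, :] - samples[None, :, :], axis=-1)
    bw = np.sort(d, axis=1)[:, k] + 1e-6          # per-sample bandwidth
    q = np.linalg.norm(query[None, :] - samples, axis=-1)
    return np.mean(np.exp(-0.5 * (q / bw) ** 2) / (2 * np.pi * bw ** 2))

rng = np.random.default_rng(1)
# Hypothetical billboard pixels: reddish colours at varying brightness.
pixels = np.stack([rng.uniform(150, 255, 200),
                   rng.uniform(20, 60, 200),
                   rng.uniform(20, 60, 200)], axis=-1)
chroma = rg_chromaticity(pixels)
inside = variable_kde(chroma, np.array([0.75, 0.12]))   # red chromaticity
outside = variable_kde(chroma, np.array([0.33, 0.33]))  # grey chromaticity
print(inside > outside)  # True
```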
We propose a new image feature that merges color and shape information. This global feature, which we call color shape context, is a histogram that combines the spatial (shape) and color information of the image in one compact representation. This histogram codes the locality of color transitions in an image. Illumination invariant derivatives are first computed and provide the edges of the image, which is the shape information of our feature. These edges are used to obtain similarity (rigid) invariant shape descriptors. The color transitions that take place on the edges are coded in an illumination invariant way and are used as the color information. The color and shape information are combined in one multidimensional vector. The matching function of this feature is a metric and allows for existing indexing methods such as R-trees to be used for fast and efficient retrieval.
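A toy version of such a joint shape/colour histogram can be sketched as follows. It deliberately simplifies the abstract: plain RGB gradients stand in for illumination invariant derivatives, the angular position of edge pixels around the edge centroid is the shape part, and the dominant gradient channel is the colour part; only the metric (L1) matching mirrors the abstract directly.

```python
import numpy as np

def color_shape_context(img, n_ang=8, n_col=3):
    """Toy joint shape/colour histogram over edge pixels.

    Shape part: angular bin of each edge pixel w.r.t. the edge centroid.
    Colour part: which colour channel has the strongest gradient there.
    (A simplification of the paper's invariant descriptors.)
    """
    gray = img.mean(axis=-1)
    gy, gx = np.gradient(gray)
    mag = np.hypot(gx, gy)
    edges = mag > mag.mean() + mag.std()
    ys, xs = np.nonzero(edges)
    if len(ys) == 0:
        return np.zeros(n_ang * n_col)
    ang = np.arctan2(ys - ys.mean(), xs - xs.mean())        # shape part
    a_bin = ((ang + np.pi) / (2 * np.pi) * n_ang).astype(int) % n_ang
    cg = np.stack([np.hypot(*np.gradient(img[..., c])) for c in range(3)])
    c_bin = cg[:, ys, xs].argmax(axis=0)                    # colour part
    hist = np.zeros((n_ang, n_col))
    np.add.at(hist, (a_bin, c_bin), 1)
    return (hist / hist.sum()).ravel()

def l1_distance(h1, h2):
    """L1 distance between normalised histograms: a true metric, so
    index structures such as R-trees can be used for retrieval."""
    return np.abs(h1 - h2).sum()

img = np.zeros((64, 64, 3))
img[16:48, 16:48, 0] = 1.0                                  # red square
d_self = l1_distance(color_shape_context(img),
                     color_shape_context(np.roll(img, 5, axis=1)))
print(d_self)  # 0.0: a pure translation leaves the descriptor unchanged
```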
In this paper, we address the content-based retrieval of nonuniformly textured objects from natural scenes under varying illumination and viewing conditions. Nonuniformly textured objects are objects containing irregular texture elements, such as trees, animals (e.g. lions), walls, and grass. To cope with irregular texture content, the texture measure compares feature distributions using the multidimensional histogram intersection of color ratio derivatives. It is shown that color ratio derivatives are robust to a change in illumination, camera viewpoint, and pose of the textured object. Color ratio derivatives are computed from the RGB color channels of a CCD color camera as well as from spectral data obtained by a spectrograph. To cope with object cluttering, a region-based texture segmentation is applied to the target images in the image database prior to the actual image retrieval process. The region-based segmentation algorithm computes regions or blobs having roughly the same texture content as the query image. After segmenting the target images into blobs, the retrieval process is based on computing the histogram intersection of the color ratio derivatives derived from the query image and the target blobs. Experiments have been conducted on images taken from colored, textured objects. Different light sources have been used to illuminate the objects in the scene. From the theoretical and experimental results, it is concluded that color constant texture matching in image libraries provides high retrieval accuracy and is robust to varying illumination and viewing conditions.
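The core invariance claim is easy to illustrate. The sketch below uses a simplified Gevers-style colour ratio between horizontally neighbouring pixels, which cancels a spatially smooth illumination scaling; the paper's actual derivative operators may differ.

```python
import numpy as np

def color_ratio_hist(img, bins=16):
    """Histogram of colour ratio 'derivatives': cross-ratios of
    neighbouring pixel values across two colour channels. A per-pixel
    intensity factor s(x) cancels: (s1*R1 * s2*G2)/(s2*R2 * s1*G1)
    equals (R1*G2)/(R2*G1)."""
    eps = 1e-6
    r, g = img[..., 0] + eps, img[..., 1] + eps
    m = (r[:, :-1] * g[:, 1:]) / (r[:, 1:] * g[:, :-1])
    h, _ = np.histogram(np.log(m), bins=bins, range=(-2, 2))
    return h / h.sum()

def intersection(h1, h2):
    """Histogram intersection similarity in [0, 1]."""
    return np.minimum(h1, h2).sum()

rng = np.random.default_rng(2)
img = rng.uniform(0.2, 1.0, (32, 32, 3))
# Smooth illumination ramp across the image (shading-like change).
shaded = img * np.linspace(0.4, 1.0, 32)[None, :, None]
sim = intersection(color_ratio_hist(img), color_ratio_hist(shaded))
print(sim)  # close to 1: the ratio feature ignores the intensity change
```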
It is known that the transformation from RGB color space to normalized color space is invariant to changes in the scene geometry. The transformation to the hue color space is additionally invariant to highlights. However, due to sensor noise, the transforms become unstable at many RGB values. This effect is usually overcome by ad hoc thresholding; for example, if the RGB coordinates are located near the achromatic axis, the corresponding hue value is rejected. To arrive at a principled way of dealing with the instabilities that result from these color space transforms, the contribution of this report is as follows. Uncertainties in the measured RGB values are caused by photon noise, which arises from the statistical nature of photon production. Using a theoretical camera model, we determine the number of photons required to cause a color value transition. Based on the associated uncertainty according to the Poisson distribution, we then derive theoretical models that propagate this uncertainty to the uncertainty in the transformed color coordinates. We propose a histogram construction method based on Parzen estimators that incorporates this theoretical reliability. As a result, we overcome the need for thresholding of the transformed color values.
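A simplified stand-in for such a histogram construction: each pixel votes for hue bins with a Gaussian whose width comes from propagating Poisson (photon) noise, so pixels near the achromatic axis spread their vote over many bins instead of being thresholded away. The opponent-style hue definition and the noise propagation below are illustrative assumptions, not the report's derivation.

```python
import numpy as np

def soft_hue_histogram(rgb, bins=32):
    """Parzen-style hue histogram with a per-pixel bandwidth.

    Photon noise gives sigma ~ sqrt(counts) per pixel, and the hue
    uncertainty grows as the pixel approaches the achromatic axis
    (small chromatic magnitude), so grey pixels vote diffusely.
    """
    r, g, b = rgb[..., 0].ravel(), rgb[..., 1].ravel(), rgb[..., 2].ravel()
    hue = np.arctan2(np.sqrt(3) * (g - b), 2 * r - g - b)   # opponent hue
    chroma = np.hypot(np.sqrt(3) * (g - b), 2 * r - g - b)  # dist. to grey axis
    sigma_ch = np.sqrt(r + g + b + 1.0)                     # Poisson noise
    sigma_hue = sigma_ch / (chroma + 1e-6)                  # propagated to hue
    centers = np.linspace(-np.pi, np.pi, bins, endpoint=False) + np.pi / bins
    # Circular distance of every pixel's hue to every bin centre.
    d = np.angle(np.exp(1j * (hue[:, None] - centers[None, :])))
    weights = np.exp(-0.5 * (d / sigma_hue[:, None]) ** 2)
    weights /= weights.sum(axis=1, keepdims=True)           # one vote per pixel
    return weights.sum(axis=0)

rng = np.random.default_rng(3)
saturated = np.stack([rng.normal(200, 5, 100),
                      rng.normal(30, 5, 100),
                      rng.normal(30, 5, 100)], axis=-1)[None]  # red pixels
grey = rng.normal(100, 5, (1, 100, 3))                         # grey pixels
h_sat = soft_hue_histogram(saturated)
h_grey = soft_hue_histogram(grey)
print(h_sat.max(), h_grey.max())  # grey votes are spread far more evenly
```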
Comparison in the RGB domain is not suitable for precise color matching, due to the strong dependency of this domain on factors such as the spectral power distribution of the light source and the object geometry. We have studied the use of multispectral or hyperspectral images for color matching, since it can be proven that hyperspectral images can be made independent of the light source and object geometry. Hyperspectral images have the disadvantage that they are large compared to regular RGB images, which makes it infeasible to use them for image matching across the Internet. For red roses, it is possible to reduce the large number of bands of the spectral images to only three bands, the same number as in an RGB image, using Principal Component Analysis, while maintaining 99 percent of the original variation. The obtained PCA images of the roses can be matched using, for example, histogram cross correlation. From the principal coordinates plot, obtained from the histogram similarity matrices of twenty images of red roses, the discriminating power seems to be better for normalized spectral images than for color constant spectral images and RGB images, the latter being recorded under highly optimized standard conditions.
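The band-reduction step can be sketched directly as PCA on the unfolded cube. The synthetic cube below merely mimics a spectrally low-dimensional scene; it is not the rose data set.

```python
import numpy as np

def pca_bands(cube, n=3):
    """Reduce a hyperspectral cube (H, W, B) to n principal-component
    bands, reporting the fraction of variance retained."""
    h, w, b = cube.shape
    x = cube.reshape(-1, b)
    x = x - x.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(x, rowvar=False))
    order = np.argsort(vals)[::-1]                # largest variance first
    vals, vecs = vals[order], vecs[:, order]
    retained = vals[:n].sum() / vals.sum()
    return (x @ vecs[:, :n]).reshape(h, w, n), retained

# Synthetic 20-band cube driven by three latent spectra, plus noise.
rng = np.random.default_rng(4)
latent = rng.uniform(0, 1, (32, 32, 3))
mixing = rng.uniform(0, 1, (3, 20))
cube = latent @ mixing + rng.normal(0, 0.01, (32, 32, 20))
img3, retained = pca_bands(cube)
print(retained)  # close to 1.0, echoing the 99 percent figure for the roses
```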
In this paper, we study computational models and techniques to combine textual and image features for the classification of images on the Internet. A framework is given to index images on the basis of textual, pictorial and composite information. The scheme makes use of weighted document terms and color invariant image features to obtain a high-dimensional similarity descriptor to be used as an index. Based on supervised learning, the k-nearest neighbor classifier is used to organize images into semantically meaningful groups of Internet images. Internet images are first classified into photographic and synthetic images. Photographic images are then further classified into portraits and non-portraits, and synthetic images into button and non-button images.
The problem of color constancy, i.e. discounting the illuminant color to obtain the apparent color of the object, has been the topic of much research in computer vision. Assuming neutral interface reflection and dichromatic reflection with highlights (i.e. highlights have the same color as the illuminant), various methods have been proposed aiming at recovering the illuminant color from color highlight analysis. In general, these methods are based on three color stimuli to approximate color. In this contribution, we estimate the spectral distribution from surface reflection using spectral information obtained by a spectrograph. The imaging spectrograph provides a spectral range at each pixel covering the visible wavelengths. Our method differs from existing methods by using a robust clustering technique to obtain the body and surface components in a multispectral space. These components determine the direction of the illumination spectral color. We then recover the illumination spectral power distribution by using principal component analysis over all wavelengths. To obtain the most reliable estimate of the spectral power distribution of the illuminant, all possible combinations of wavelengths are used to generate the optimal averaged estimate of the spectral power distribution of the scene illuminant. Our method is restricted to images containing a substantial amount of body reflection and highlights.
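Under the dichromatic model the PCA step has a compact illustration: highlight pixels share one body component, so their mean-subtracted spectra vary along the illuminant direction. The sketch below omits the paper's robust clustering and wavelength-combination averaging, and uses synthetic spectra.

```python
import numpy as np

def illuminant_direction(highlight_spectra):
    """Estimate the illuminant spectral direction from highlight pixels.

    Dichromatic model: spectrum = body + m_s * illuminant. With a shared
    body term, mean-subtracted highlight spectra vary along the
    illuminant, which the first principal direction recovers.
    """
    x = highlight_spectra - highlight_spectra.mean(axis=0)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    d = vt[0]
    return d * np.sign(d.sum())  # fix the sign to a nonnegative spectrum

rng = np.random.default_rng(5)
wl = np.linspace(400, 700, 31)
body = np.exp(-((wl - 600) / 60.0) ** 2)       # reddish body reflection
illum = 0.5 + 0.5 * (wl - 400) / 300           # bluish-to-white illuminant
m_s = rng.uniform(0, 2, (200, 1))              # varying highlight strength
spectra = body[None, :] + m_s * illum[None, :] + rng.normal(0, 0.01, (200, 31))
est = illuminant_direction(spectra)
cos = est @ illum / np.linalg.norm(illum)
print(abs(cos))  # close to 1: the recovered direction matches the illuminant
```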
This paper proposes an efficient method to recognize 3-D rigid, solid objects from 2-D projective images in the presence of object overlap and occlusion; the method is robust to noise and location inaccuracy, and able to deal with multiple instances of a model in a scene. The task of the recognition method is to find instances of known object models in projective images. Projective invariant shape descriptors of 3-D solid objects are generated which are invariant to a change in the point of view. To obtain these projective invariants, classical projective geometry has been taken as a starting point; it is classically known that the cross ratio is projectively invariant. The invariants describing a model are used as 'keys' to index the model into a hash table. Because these keys remain the same under projective transformation, they are computed for objects found in an image and used to determine which object model is present. Experiments show excellent performance, which, together with the inherent parallelism of the recognition method, makes the method a promising one.
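The cross ratio mentioned above is easy to verify numerically: for four collinear points it is unchanged by any homography. A small check, with an arbitrary, hypothetical homography:

```python
import numpy as np

def cross_ratio(p1, p2, p3, p4):
    """Cross ratio of four collinear points: the classical projective
    invariant used as a hashing key (absolute-distance form)."""
    d = lambda a, b: np.linalg.norm(a - b)
    return (d(p1, p3) * d(p2, p4)) / (d(p1, p4) * d(p2, p3))

def project(H, p):
    """Apply a homography H to a 2-D point via homogeneous coordinates."""
    q = H @ np.array([p[0], p[1], 1.0])
    return q[:2] / q[2]

# Four collinear points and an arbitrary projective transformation.
pts = [np.array([t, 2 * t]) for t in (0.0, 1.0, 3.0, 7.0)]
H = np.array([[2.0, 0.3, 1.0],
              [0.1, 1.5, -2.0],
              [0.01, 0.02, 1.0]])
before = cross_ratio(*pts)
after = cross_ratio(*[project(H, p) for p in pts])
print(before, after)  # equal: the cross ratio survives the projection
```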
This paper presents a novel design of an image database system which supports storage, indexing, and retrieval of images by content. The image retrieval methodology is based on the observation that images can be discriminated by the presence of image objects and their spatial relations. Images in the database are first automatically segmented into sets of image object descriptions. Then, transformation invariant quantities of image object descriptions are derived and used as keys to index images, as demonstrated in experiments for a simple application.
The growing capacity of computers, the abundance of digital cameras, and the increased connectivity of the world all point to large digital multimedia archives. They include images and videos from the World Wide Web, museum objects, flowers, trademarks, and views from everyday life. The faster these archives grow, the more prominent becomes the need for efficient access to the content of the images and videos.
In this short course, we will give a survey of the most recent developments on image and video search engines. First, the important step of feature extraction will be discussed in detail including color, shape, and texture information, with particular attention to discriminatory power and invariance. We will then focus on the concepts of indexing and genre classification as an intermediate step to sort the data. We will pay attention to interactive ways to perform browsing and retrieval by means of information visualization and relevance feedback. Methods will be discussed to localize the retrieved objects in their images.