We adopt genetic programming (GP) to define a measure that can predict complexity perception of texture images. We perform psychophysical experiments on three different datasets to collect data on the perceived complexity. The subjective data are used for training, validation, and testing of the proposed measure. These data are also used to evaluate several candidate measures of texture complexity related to both low-level and high-level image features. We select four of them (namely roughness, number of regions, chroma variance, and memorability) to be combined in a GP framework. This approach allows a nonlinear combination of the measures and could give hints on how the related image features interact in complexity perception. The proposed complexity measure MGP exhibits Pearson correlation coefficients of 0.890 on the training set, 0.728 on the validation set, and 0.724 on the test set. MGP outperforms every single measure considered. From the statistical analysis of different GP candidate solutions, we found that the roughness measure evaluated on the gray-level image is the most dominant one, followed by memorability, the number of regions, and finally the chroma variance.
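The Pearson correlation coefficients reported above compare the values of a measure with the subjective scores. As a minimal sketch of how such a coefficient can be computed, the following function uses only the standard library; the sample values in `predicted` and `perceived` are hypothetical and are not data from the experiments.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Numerator: sum of products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: product of the root sums of squared deviations.
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (norm_x * norm_y)

# Hypothetical values: output of a complexity measure vs. mean subjective scores.
predicted = [0.2, 0.4, 0.5, 0.9]
perceived = [1.0, 2.1, 2.4, 4.6]
r = pearson(predicted, perceived)
```

A value of `r` close to 1 indicates that the measure grows almost linearly with the perceived complexity.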
Automatic action recognition in videos is a challenging computer vision task that has become an active research area in recent years. Existing strategies usually use kernel-based learning algorithms that consider a simple combination of different features, completely disregarding how such features should be integrated to fit the given problem. Since a given feature is most suitable to describe a given image/video property, the adaptive weighting of such features can improve the performance of the learning algorithm. In this paper, we investigate the use of the Multiple Kernel Learning (MKL) algorithm to adaptively search for the best linear relation among the considered features. MKL is an extension of support vector machines (SVMs) that works with a weighted linear combination of several single kernels. This approach allows the simultaneous estimation of the weights for the multiple-kernel combination and of the underlying SVM parameters. To prove the validity of the MKL approach, we considered a descriptor composed of multiple features aligned with dense trajectories. We tested our approach on a database containing 36 cooking actions. Results confirm that the use of MKL improves the classification performance.
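The weighted linear combination of single kernels on which MKL operates can be sketched as follows. This is an illustration only: the channel names `feat_a` and `feat_b`, the choice of a linear and an RBF kernel, and the fixed weights are assumptions. In MKL the weights are learned jointly with the SVM parameters, a step that is not shown here.

```python
import math

def linear_kernel(a, b):
    """Dot product between two feature vectors."""
    return sum(p * q for p, q in zip(a, b))

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian (RBF) kernel; gamma is a hypothetical bandwidth."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))

def combined_kernel(x, y, channel_kernels, weights):
    """K(x, y) = sum over channels m of weights[m] * K_m(x[m], y[m])."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("kernel weights must be non-negative")
    return sum(weights[name] * kernel(x[name], y[name])
               for name, kernel in channel_kernels.items())

def gram_matrix(samples, channel_kernels, weights):
    """Combined kernel evaluated on every pair of samples."""
    return [[combined_kernel(a, b, channel_kernels, weights) for b in samples]
            for a in samples]

# Hypothetical feature channels, one kernel per channel, fixed weights.
channel_kernels = {"feat_a": linear_kernel, "feat_b": rbf_kernel}
weights = {"feat_a": 0.3, "feat_b": 0.7}
samples = [
    {"feat_a": [1.0, 0.0], "feat_b": [0.0, 2.0]},
    {"feat_a": [0.5, 0.5], "feat_b": [1.0, 1.0]},
]
gram = gram_matrix(samples, channel_kernels, weights)
```

The resulting Gram matrix could then be passed to any SVM solver that accepts a precomputed kernel.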
In security applications the human face plays a fundamental role; however, we have to assume non-collaborative subjects. A face can be partially visible or occluded due to common-use accessories such as sunglasses, hats, scarves and so on. The posture of the head also influences the face recognizability. Given an input video sequence, the proposed system is able to establish whether a face is depicted in a frame, and to determine its degree of recognizability in terms of clearly visible facial features. The system implements a feature filtering scheme combined with a skin-based face detection to improve its robustness to false positives and cartoon-like faces. Moreover, the system takes into account the recognizability trend over a customizable sliding time window to allow a high-level analysis of the subject's behaviour. The recognizability criteria can be tuned for each specific application. We evaluate our system in both qualitative and quantitative terms, using a data set of manually annotated videos. Experimental results confirm the effectiveness of the proposed system.
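The recognizability trend over a sliding time window can be illustrated with a simple moving average of per-frame scores. This is only a sketch under stated assumptions: the per-frame scores in [0, 1], the use of the mean as trend statistic, and the function name `recognizability_trend` are all hypothetical.

```python
from collections import deque

def recognizability_trend(scores, window=5):
    """Mean of the most recent `window` per-frame scores, one value per frame."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)  # automatically discards the oldest score
    trend = []
    for score in scores:
        recent.append(score)
        trend.append(sum(recent) / len(recent))
    return trend

# Hypothetical per-frame scores: the face becomes occluded halfway through.
frame_scores = [1, 1, 0, 0]
trend = recognizability_trend(frame_scores, window=2)
```

The `window` parameter plays the role of the customizable window length: larger values smooth out short occlusions, smaller values react faster.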
The aim of this work is to detect the events in video sequences that are salient with respect to the audio signal.
In particular, we focus on the audio analysis of a video, with the goal of identifying which features are significant
for detecting audio-salient events. In our work we have extracted the audio tracks from videos of different sport
events. For each video, we have manually labeled the salient audio events using binary markings. On each
frame, features in both time and frequency domains have been considered. These features have been used to
train different classifiers: Classification and Regression Trees, Support Vector Machine, and k-Nearest Neighbor.
The classification performances are reported in terms of confusion matrices.
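Since the manual labels are binary, each classifier can be summarized by a 2x2 confusion matrix. A minimal sketch follows; the label sequences are hypothetical and the layout (rows are true labels, columns are predicted labels) is a chosen convention.

```python
def confusion_matrix(true_labels, predicted_labels):
    """Return [[tn, fp], [fn, tp]] for binary labels (1 = salient, 0 = not salient)."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    matrix = [[0, 0], [0, 0]]
    for truth, prediction in zip(true_labels, predicted_labels):
        matrix[truth][prediction] += 1  # row: true label, column: predicted label
    return matrix

# Hypothetical per-frame labels.
manual = [0, 0, 1, 1, 1, 0]
predicted = [0, 1, 1, 0, 1, 0]
cm = confusion_matrix(manual, predicted)
```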
In order to create a cooking assistant application that guides users in the preparation of dishes relevant to their dietary profiles and food preferences, it is necessary to accurately annotate the video recipes, identifying and tracking the foods of the cook. These videos present particular annotation challenges such as frequent occlusions, food appearance changes, etc. Manually annotating the videos is a time-consuming, tedious and error-prone task. Fully automatic tools that integrate computer vision algorithms to extract and identify the elements of interest are not error free, and false positive and false negative detections need to be corrected in a post-processing stage. We present an interactive, semi-automatic tool for the annotation of cooking videos that integrates computer vision techniques under the supervision of the user. The annotation accuracy is increased with respect to completely automatic tools and the human effort is reduced with respect to completely manual ones. The performance and usability of the proposed tool are evaluated on the basis of the time and effort required to annotate the same video sequences.
We present here the results obtained by including a new image descriptor, which we call the prosemantic feature
vector, within the framework of the QuickLook<sup>2</sup> image retrieval system. By coupling the prosemantic features and
the relevance feedback mechanism provided by QuickLook<sup>2</sup>, the user can move in a more rapid and precise way
through the feature space toward the intended goal. The prosemantic features are obtained by a two-step feature
extraction process. At the first step, low level features related to image structure and color distribution are
extracted from the images. At the second step, these features are used as input to a bank of classifiers, each
one trained to recognize a given semantic category, to produce score vectors. We evaluated the efficacy of the
prosemantic features under search tasks on a dataset provided by Fratelli Alinari Photo Archive.
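The two-step extraction described above can be sketched as a bank of classifiers applied to a low-level feature vector, with the classifier scores collected into the prosemantic vector. The linear scorers, their weights, and the two-element feature vector are hypothetical stand-ins; the actual low-level features and classifiers are not specified here.

```python
def linear_scorer(weights, bias):
    """Build a hypothetical classifier that returns a real-valued score."""
    def score(features):
        return sum(w * f for w, f in zip(weights, features)) + bias
    return score

def prosemantic_vector(low_level_features, classifier_bank):
    """Second step: one score per semantic category, collected in a vector."""
    return [classifier(low_level_features) for classifier in classifier_bank]

# First step (assumed already done): a low-level feature vector for one image.
low_level = [2.0, 3.0]
# Bank of two hypothetical category classifiers.
bank = [linear_scorer([1.0, 0.0], 0.0), linear_scorer([0.0, 1.0], -1.0)]
vector = prosemantic_vector(low_level, bank)
```

The length of the resulting vector equals the number of classifiers in the bank, which keeps the descriptor compact regardless of the size of the low-level features.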
In this work we present a system which visualizes the results obtained from image search engines in such a way
that users can conveniently browse the retrieved images. The way in which search results are presented allows
the user to grasp the composition of the set of images "at a glance". To do so, images are grouped and positioned
according to their distribution in a prosemantic feature space which encodes information about their content at
an abstraction level that can be placed between visual and semantic information. The compactness of the feature
space allows a fast analysis of the image distribution so that all the computation can be performed in real time.
Correct image orientation is often assumed by common imaging applications such as enhancement, browsing, and
retrieval. However, the information provided by camera metadata is often missing or incorrect. In these cases
manual correction is required, otherwise the images cannot be correctly processed and displayed. In this work
we propose a system which automatically detects the correct orientation of digital photographs. The system
exploits the information provided by a face detector and a set of low-level features related to the distributions of
color and edges in the image. To prove the effectiveness of the proposed approach we evaluated it on two datasets
of consumer photographs.
We have designed a new self-adaptive image cropping algorithm that is able to detect several relevant regions in the
image. These regions can then be sequentially proposed to the user as thumbnails, according to their relevance order,
thus allowing the viewer to visualize the relevant image content and possibly to display or print only those regions in
which they are most interested. The algorithm exploits both visual and semantic information. Visual information is
obtained by a visual attention model, while semantic information relates to the detection and recognition of particularly
significant objects. In this work we concentrate our attention on two objects commonly found in personal photos,
namely face and skin regions. Examples are shown to illustrate the effectiveness of the proposed method.
In the framework of multimedia applications, image quality may have different meanings and interpretations. In this paper, considering the quality of an image as the degree of adequacy to its function/goal within a specific application field, we provide an organized overview of image quality assessment methods, highlighting their applicability and limitations in different application domains. Three scenarios have been chosen, representing three typical applications with different degrees of constraints in their image workflow chains and requiring different image quality assessment methodologies.
Although traditional content-based retrieval systems have been successfully employed in many multimedia applications, the need for explicit association of higher concepts to images has been a pressing demand from users. Much research has focused on reducing the semantic gap between visual features and the semantics of the image content. In this paper we present a mechanism that combines broad high-level concepts and low-level visual features within the framework of the QuickLook content-based image retrieval system. This system also implements a relevance feedback algorithm to learn the user's intended query from positive and negative image examples. With the relevance feedback mechanism, the retrieval process can be efficiently guided toward the semantic or pictorial contents of the images by providing the system with suitable examples. The qualitative experiments performed on a database of more than 46,000 photos downloaded from the Web show that the combination of semantic and low-level features, coupled with a relevance feedback algorithm, effectively improves the accuracy of the image retrieval sessions.
This paper focuses on full-reference image quality assessment and presents different computational strategies aimed at
improving the robustness and accuracy of some well-known and widely used state-of-the-art models, namely the Structural
Similarity approach (SSIM) by Wang and Bovik and the S-CIELAB spatial-color model by Zhang and Wandell. We
investigate the hypothesis that combining error images with a visual attention model could allow a better fit of the
psycho-visual data of the LIVE Image Quality Assessment Database Release 2. We show that the proposed quality
assessment metric better correlates with the experimental data.
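One possible way of combining an error image with the output of a visual attention model is saliency-weighted pooling, sketched below. The pooling rule, the function name `saliency_weighted_error`, and the toy maps are assumptions made for illustration; the computational strategies actually investigated may differ.

```python
def saliency_weighted_error(error_map, saliency_map):
    """Pool a per-pixel error map using saliency values as weights."""
    weighted_sum = 0.0
    weight_total = 0.0
    count = 0
    plain_sum = 0.0
    for error_row, saliency_row in zip(error_map, saliency_map):
        for error, saliency in zip(error_row, saliency_row):
            weighted_sum += error * saliency
            weight_total += saliency
            plain_sum += error
            count += 1
    if weight_total == 0:
        # No salient pixel: fall back to the unweighted mean error.
        return plain_sum / count
    return weighted_sum / weight_total

# Toy 2x2 maps: errors fall exactly where saliency is zero.
errors = [[0.0, 1.0], [1.0, 0.0]]
saliency = [[1.0, 0.0], [0.0, 1.0]]
score = saliency_weighted_error(errors, saliency)
```

With uniform saliency the function reduces to the ordinary mean error, so the attention model only changes the score when it distinguishes some regions from others.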
We propose an innovative approach to the selection of representative frames of a video shot for video summarization. By analyzing the differences between two consecutive frames of a video sequence, the algorithm determines the complexity of the sequence in terms of visual content changes. Three descriptors are used to express the frame’s visual content: a color histogram, wavelet statistics and an edge direction histogram. Similarity measures are computed for each descriptor and combined to form a frame difference measure. The use of multiple descriptors provides a more precise representation, capturing even small variations in the frame sequence. This method can dynamically and rapidly select a variable number of key frames within each shot, and does not exhibit the complexity of existing methods based on clustering strategies. The method has been tested on various video segments of different genres (trailers, news, animation, etc.) and preliminary results show that the algorithm is able to effectively summarize the shots, capturing the most salient events in the sequences.
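The combination of per-descriptor similarities into a single frame difference can be sketched as follows. The specific similarity functions (histogram intersection for the two histograms, an inverse-distance similarity for the wavelet statistics) and the averaging rule are assumptions chosen for illustration, as the text above does not specify them; the key frame selection step is not shown.

```python
import math

def histogram_intersection(h1, h2):
    """Similarity in [0, 1] between two histograms normalized to sum to 1."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def statistics_similarity(s1, s2):
    """Similarity in (0, 1] that decreases with the Euclidean distance."""
    return 1.0 / (1.0 + math.dist(s1, s2))

# One (assumed) similarity function per descriptor.
SIMILARITIES = {
    "color_histogram": histogram_intersection,
    "wavelet_statistics": statistics_similarity,
    "edge_direction_histogram": histogram_intersection,
}

def frame_difference(frame_a, frame_b):
    """1 minus the mean of the per-descriptor similarities (assumed rule)."""
    sims = [similarity(frame_a[name], frame_b[name])
            for name, similarity in SIMILARITIES.items()]
    return 1.0 - sum(sims) / len(sims)

# Hypothetical descriptors of two consecutive frames.
frame_1 = {"color_histogram": [0.5, 0.5],
           "wavelet_statistics": [1.0, 2.0],
           "edge_direction_histogram": [0.25, 0.75]}
frame_2 = dict(frame_1, color_histogram=[1.0, 0.0])
difference = frame_difference(frame_1, frame_2)
```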
The paper describes an innovative image annotation tool for classifying image regions into one of seven classes - sky, skin, vegetation, snow, water, ground, and buildings - or as unknown. This tool could be productively applied in the management of large image and video databases where a considerable volume of images/frames must be automatically indexed. The annotation is performed by a classification system based on a multi-class Support Vector Machine. Experimental results on a test set of 200 images are reported and discussed.
Image retrieval is a two-step process: 1) indexing, in which a set or a vector of features summarizing the properties of each image in the database is computed and stored; and 2) retrieval, in which the features of the query image are extracted and compared with the others in the database. The database images are then ranked in order of their similarity. We introduce an innovative image retrieval strategy, the Dynamic Spatial Chromatic Histogram, which makes it possible to take into account spatial information in a flexible way without greatly adding to computation costs. Our preliminary results on a database of about 3000 images show that the proposed indexing and retrieval strategy is a powerful approach.
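The two steps of indexing and retrieval can be sketched in a few lines. Note that this example uses a plain gray-level histogram compared with the L1 distance, which is an assumption made to keep the sketch short; it is not the Dynamic Spatial Chromatic Histogram, and the image identifiers and pixel values are hypothetical.

```python
def extract_features(pixels, bins=4, max_value=256):
    """Normalized gray-level histogram of a flat list of pixel values."""
    histogram = [0] * bins
    for value in pixels:
        histogram[value * bins // max_value] += 1
    return [count / len(pixels) for count in histogram]

def build_index(database):
    """Step 1, indexing: compute and store the features of every image."""
    return {image_id: extract_features(pixels)
            for image_id, pixels in database.items()}

def retrieve(query_pixels, index):
    """Step 2, retrieval: rank image ids from most to least similar."""
    query = extract_features(query_pixels)
    def l1_distance(image_id):
        return sum(abs(q - f) for q, f in zip(query, index[image_id]))
    return sorted(index, key=l1_distance)

# Hypothetical database of three tiny images.
database = {
    "dark": [0, 10, 20, 30],
    "bright": [200, 220, 240, 250],
    "mixed": [0, 100, 150, 250],
}
index = build_index(database)
ranking = retrieve([5, 15, 25, 35], index)
```

Because the index is computed once and stored, only the query features need to be extracted at retrieval time.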
The paper addresses the problem of distinguishing between pornographic and non-pornographic photographs, for the design of semantic filters for the web. Both decision forests of trees built according to the CART (Classification And Regression Trees) methodology and Support Vector Machines (SVMs) have been used to perform the classification. The photographs are described by a set of low-level features that can be automatically computed from the gray-level and color representations of the image. The database used in our experiments contained 1500 photographs, 750 of which were labeled as pornographic on the basis of the independent judgement of several viewers.
The paper addresses the problem of annotating photographs with broad semantic labels. To cope with the great variety of photos available on the Web, we have designed a hierarchical classification strategy which first classifies images as pornographic or non-pornographic. Non-pornographic images are then classified as indoor, outdoor, or close-up. On a database of over 9000 images, mostly downloaded from the Web, our method achieves an average accuracy of close to 90%.
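The hierarchical strategy amounts to a two-stage cascade, which can be sketched as below. The two classifier callables and the dictionary-based stand-in for an image are hypothetical placeholders; in practice they would be trained models operating on image features.

```python
def classify_hierarchically(image, first_stage, second_stage):
    """Apply the first-stage filter, then the scene classifier if needed."""
    if first_stage(image):
        return "pornographic"
    label = second_stage(image)
    if label not in ("indoor", "outdoor", "close-up"):
        raise ValueError("unexpected second-stage label: %r" % (label,))
    return label

# Hypothetical stand-ins: each "image" is a dict of precomputed decisions.
def toy_first_stage(image):
    return image["flagged"]

def toy_second_stage(image):
    return image["scene"]

label = classify_hierarchically({"flagged": False, "scene": "outdoor"},
                                toy_first_stage, toy_second_stage)
```

The second-stage classifier is never invoked on images rejected by the first stage, so each stage can be trained and tuned separately.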
We present here a web-based prototype for the interactive search of items in quality electronic catalogues. The system, based on a multimedia information retrieval architecture, allows the user to query a multimedia database according to several retrieval strategies, and to progressively refine the system's response by indicating the relevance or non-relevance of the items retrieved. Once a subset of images meeting the user's information needs has been identified, these images can be displayed in a virtual exhibition that can be visited interactively by the user exploiting VRML technology.
The need to retrieve visual information from large image and video collections is shared by many application domains. This paper describes the main features of Quicklook, a system that combines in a single framework the alphanumeric relational query, the content-based image query exploiting automatically computed low-level image features, and the textual similarity query exploiting any text attached to image database items.
We have examined the performance of various color-based retrieval strategies when coupled with a pre-filtering Retinex algorithm to see whether, and to what degree, Retinex improved the effectiveness of the retrieval, regardless of the strategy adopted. The retrieval strategies implemented included color and spatial-chromatic histogram matching, color coherence vector matching, and the weighted sum of the absolute differences between the first three moments of each color channel. The experimental results are reported and discussed.
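The last strategy, the weighted sum of the absolute differences between the first three moments of each color channel, can be sketched as follows. Taking the three moments to be the mean, the standard deviation, and the cube root of the third central moment is an assumption (a common choice for color moments), as are the unit default weights and the toy images.

```python
import math

def channel_moments(values):
    """Mean, standard deviation, and cube root of the third central moment."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    third = sum((v - mean) ** 3 for v in values) / n
    # Signed cube root, so that negative skew is preserved.
    skew = math.copysign(abs(third) ** (1.0 / 3.0), third)
    return (mean, math.sqrt(variance), skew)

def moment_distance(image_a, image_b, weights=None):
    """Weighted sum of absolute moment differences over all channels.

    Each image is a list of channels; each channel is a flat list of values.
    weights[c][k] is the weight of moment k in channel c (default: all 1).
    """
    total = 0.0
    for c, (channel_a, channel_b) in enumerate(zip(image_a, image_b)):
        moments_a = channel_moments(channel_a)
        moments_b = channel_moments(channel_b)
        for k in range(3):
            weight = 1.0 if weights is None else weights[c][k]
            total += weight * abs(moments_a[k] - moments_b[k])
    return total

# Hypothetical two-pixel, three-channel images.
black = [[0, 0], [0, 0], [0, 0]]
gray = [[2, 2], [2, 2], [2, 2]]
distance = moment_distance(black, gray)
```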