We present a method for illuminant estimation that exploits a generative adversarial network architecture to generate a spatially-varying illuminant map. This map is then transformed by consensus into a global illuminant estimate, in the form of a single RGB triplet. To this end, different consensus strategies are designed and compared in this paper. The best solution won second place in the 2nd International Illumination Estimation Challenge, specifically for the indoor track.
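As an illustration of what "consensus" could mean here, the following minimal sketch collapses a per-pixel illuminant map into a single RGB triplet. The function name `consensus_illuminant` and the two strategies (mean and median) are assumptions made for illustration; the abstract does not say which strategies were designed or which one performed best.

```python
import numpy as np

def consensus_illuminant(illum_map, strategy="mean"):
    """Collapse an HxWx3 illuminant map into one unit-norm RGB triplet.

    Hypothetical helper: the strategies below are generic examples,
    not the ones evaluated in the paper.
    """
    pixels = np.asarray(illum_map, dtype=np.float64).reshape(-1, 3)
    if strategy == "mean":
        rgb = pixels.mean(axis=0)        # average vote over all pixels
    elif strategy == "median":
        rgb = np.median(pixels, axis=0)  # per-channel median, robust to outliers
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return rgb / np.linalg.norm(rgb)     # only the direction of the illuminant matters

# Usage: a constant map yields the same (normalized) illuminant everywhere.
demo_map = np.tile(np.array([2.0, 1.0, 1.0]), (4, 5, 1))
demo_estimate = consensus_illuminant(demo_map, "median")
```

The median variant is included because a spatially-varying map may contain locally unreliable predictions, which a per-channel median tends to suppress.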
We focus on saliency estimation in digital images. We describe why it is important to adopt a data-driven model for such an ill-posed problem, allowing for a universal concept of “saliency” to naturally emerge from data that are typically annotated with drastically heterogeneous criteria. Our learning-based method also involves an explicit analysis of the input at multiple scales, in order to take into account images of different resolutions, depicting subjects of different sizes. Furthermore, despite training our model on binary ground truths only, we are able to output a continuous-valued confidence map, which represents the probability of each image pixel being salient. Every contribution of our method for saliency estimation is individually tested according to a standard evaluation benchmark, and our final proposal proves to be very effective in a comparison with the state-of-the-art.
We present a method for the automatic restoration of images subjected to the application of photographic filters, such as those made popular by photo-sharing services. The method uses a convolutional neural network (CNN) for the prediction of the coefficients of local polynomial transformations that are applied to the input image. The experiments we conducted on a subset of the Places-205 dataset show that the quality of the restoration performed by our method is clearly superior to that of traditional color balancing and restoration procedures, and to that of recent CNN architectures for image-to-image translation.
Hyperspectral cameras provide additional information in terms of multiple sampling of the visible spectrum, holding information that could be potentially useful for biometric applications. This paper investigates whether the performance of hyperspectral face recognition algorithms can be improved by considering single and multiple one-dimensional (1-D) projections of the whole spectral data along the spectral dimension. Three different projections are investigated and found by optimization: single-spectral band selection, nonnegative spectral band combination, and unbounded spectral band combination. Since 1-D projections can be performed directly on the imaging device with color filters, projections are also restricted to be physically plausible. The experiments are performed on a standard hyperspectral dataset and the obtained results outperform eight existing hyperspectral face recognition algorithms.
We present a fully automated approach for smile detection. Faces are detected using a multiview face detector and aligned and scaled using automatically detected eye locations. Then, we use a convolutional neural network (CNN) to determine whether it is a smiling face or not. To this end, we investigate different shallow CNN architectures that can be trained even when the amount of learning data is limited. We evaluate our complete processing pipeline on the largest publicly available image database for smile detection in an uncontrolled scenario. We investigate the robustness of the method to different kinds of geometric transformations (rotation, translation, and scaling) due to imprecise face localization, and to several kinds of distortions (compression, noise, and blur). To the best of our knowledge, this is the first time that this type of investigation has been performed for smile detection. Experimental results show that our proposal outperforms state-of-the-art methods on both high- and low-quality images.
Automatic action recognition in videos is a challenging computer vision task that has become an active research area in recent years. Existing strategies usually use kernel-based learning algorithms that consider a simple combination of different features, completely disregarding how such features should be integrated to fit the given problem. Since each feature is best suited to describing a particular image/video property, adaptively weighting such features can improve the performance of the learning algorithm. In this paper, we investigate the use of the Multiple Kernel Learning (MKL) algorithm to adaptively search for the best linear relation among the considered features. MKL is an extension of support vector machines (SVMs) that works with a weighted linear combination of several single kernels. This approach makes it possible to simultaneously estimate the weights of the kernel combination and the underlying SVM parameters. To assess the validity of the MKL approach, we consider a descriptor composed of multiple features aligned with dense trajectories. We evaluate our approach on a database containing 36 cooking actions. Results confirm that the use of MKL improves the classification performance.
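The weighted linear combination of kernels at the core of MKL can be sketched as follows. This is only an illustration of the combined-kernel form: the weights are fixed by hand here, whereas MKL estimates them jointly with the SVM parameters. The helper names `rbf_kernel` and `combine_kernels`, and the choice of an RBF and a linear kernel, are assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Y."""
    sq_dist = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))  # clamp tiny negative round-off

def combine_kernels(kernels, weights):
    """Return sum_k w_k * K_k with non-negative weights normalized to sum to 1."""
    w = np.asarray(weights, dtype=np.float64)
    if np.any(w < 0) or w.sum() == 0:
        raise ValueError("weights must be non-negative and not all zero")
    w = w / w.sum()
    return sum(wk * K for wk, K in zip(w, kernels))

# Usage: one kernel per feature type (random stand-ins for real descriptors).
rng = np.random.default_rng(0)
feat_a = rng.normal(size=(6, 4))   # hypothetical feature type A
feat_b = rng.normal(size=(6, 8))   # hypothetical feature type B
K_combined = combine_kernels(
    [rbf_kernel(feat_a, feat_a, gamma=0.5), feat_b @ feat_b.T],
    weights=[1.0, 3.0],
)
```

A non-negative combination of valid kernels is itself a valid kernel, so the combined matrix can be passed to any SVM solver that accepts precomputed kernels.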
A simple but effective technique for absolute colorimetric camera characterization is proposed. It offers a large dynamic range while requiring just a single, off-the-shelf target and a commonly available controllable light source for the characterization. The characterization task is broken down into two modules, devoted respectively to absolute luminance estimation and to colorimetric characterization matrix estimation. The characterized camera can be effectively used as a tele-colorimeter, giving an absolute estimation of the XYZ data in cd/m². The user is only required to vary the f-number of the camera lens or the exposure time t, to better exploit the sensor dynamic range. The estimated absolute tristimulus values closely match the values measured by a professional spectro-radiometer.
In security applications the human face plays a fundamental role; however, we have to assume non-collaborative subjects. A face can be partially visible or occluded due to common-use accessories such as sunglasses, hats, scarves and so on. The posture of the head also influences the recognizability of the face. Given a video sequence as input, the proposed system is able to establish whether a face is depicted in a frame, and to determine its degree of recognizability in terms of clearly visible facial features. The system implements a feature filtering scheme combined with skin-based face detection to improve its robustness to false positives and cartoon-like faces. Moreover, the system takes into account the recognizability trend over a customizable sliding time window to allow a high-level analysis of the subject's behaviour. The recognizability criteria can be tuned for each specific application. We evaluate our system in both qualitative and quantitative terms, using a data set of manually annotated videos. Experimental results confirm the effectiveness of the proposed system.
The processing pipeline of a digital camera converts the RAW image acquired by the sensor to a representation of the original scene that should be as faithful as possible. There are mainly two modules responsible for the color-rendering accuracy of a digital camera: the former is the illuminant estimation and correction module, and the latter is the color matrix transformation aimed at adapting the color response of the sensor to a standard color space. These two modules together form what may be called the color correction pipeline. We design and test new color correction pipelines that exploit different illuminant estimation and correction algorithms that are tuned and automatically selected on the basis of the image content. Since illuminant estimation is an ill-posed problem, illuminant correction is not error-free. An adaptive color matrix transformation module is optimized, taking into account the behavior of the first module in order to alleviate the amplification of color errors. The proposed pipelines are tested on a publicly available dataset of RAW images. Experimental results show that exploiting the cross-talks between the modules of the pipeline can lead to a higher color-rendition accuracy.
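The two-module structure described above can be sketched as a diagonal illuminant correction followed by a 3x3 color matrix. This is a generic textbook form, assuming a von Kries-style per-channel gain normalized to the green channel; the function name `color_correct` is hypothetical, and the adaptive, content-based selection proposed in the paper is not reproduced.

```python
import numpy as np

def color_correct(raw_rgb, illuminant, color_matrix):
    """Apply illuminant correction, then a 3x3 color matrix, to Nx3 pixels.

    Assumed generic pipeline, not the optimized one from the paper.
    """
    illuminant = np.asarray(illuminant, dtype=np.float64)
    gains = illuminant[1] / illuminant            # per-channel gains, green gain = 1
    balanced = np.asarray(raw_rgb, dtype=np.float64) * gains
    return balanced @ np.asarray(color_matrix, dtype=np.float64).T

# Usage: a pixel with the color of the illuminant becomes neutral gray.
pixel = np.array([[0.4, 0.5, 0.25]])
corrected = color_correct(pixel, illuminant=[0.4, 0.5, 0.25], color_matrix=np.eye(3))
```

Because the matrix multiplies the already-balanced values, any error in the estimated illuminant is propagated and possibly amplified by the second module, which is the interaction the abstract refers to.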
This work aims at automatically recognizing sequences of complex karate movements and giving a measure of the quality of the movements performed. Since this is a problem which intrinsically needs a 3D model, in this work we propose a solution taking as input sequences of skeletal motions that can derive from either motion capture hardware or consumer-level, off-the-shelf depth sensing systems. The proposed system consists of four different modules: skeleton representation, pose classification, temporal alignment, and scoring. The proposed system is tested on a set of different punch, kick and defense karate moves executed starting from the simplest case, i.e. fixed static stances (heiko dachi), up to sequences in which the starting stance is different from the ending one. The dataset has been recorded using a single Microsoft Kinect. The dataset includes the recordings of both male and female athletes with different skill levels, ranging from novices to masters.
Search and retrieval in huge archives of multimedia data is a challenging task. A classification step is often used to
reduce the number of entries on which to perform the subsequent search. In particular, when new entries of the database
are continuously added, a fast classification based on simple threshold evaluation is desirable.
In this work we present a CART-based (Classification And Regression Tree) classification framework for audio
streams belonging to multimedia databases. The database considered is the Archive of Ethnography and Social History
(AESS), which is mainly composed of popular songs and other audio records describing the popular traditions
handed down from generation to generation, such as traditional fairs and customs.
The peculiarities of this database are that it is continuously updated; the audio recordings are acquired in unconstrained
environments; and it is difficult for non-expert users to create the ground truth labels.
In our experiments, half of the available audio files were randomly selected and used as the training set. The
remaining ones were used as the test set. The classifier was trained to distinguish among three classes:
speech, music, and song. All the audio files in the dataset were manually labeled beforehand by domain experts
into these three classes.
In order to create a cooking assistant application that guides users in the preparation of dishes matching their dietary profiles and food preferences, it is necessary to accurately annotate the video recipes, identifying and tracking the foods handled by the cook. These videos present particular annotation challenges such as frequent occlusions, food appearance changes, etc. Manually annotating the videos is a time-consuming, tedious and error-prone task. Fully automatic tools that integrate computer vision algorithms to extract and identify the elements of interest are not error-free, and false positive and false negative detections need to be corrected in a post-processing stage. We present an interactive, semi-automatic tool for the annotation of cooking videos that integrates computer vision techniques under the supervision of the user. The annotation accuracy is increased with respect to completely automatic tools and the human effort is reduced with respect to completely manual ones. The performance and usability of the proposed tool are evaluated on the basis of the time and effort required to annotate the same video sequences.
Pattern matching, also known as template matching, is a computationally intensive problem aimed at localizing the instances of a given template within a query image. In this work we present a fast technique for template matching, able to use histogram-based similarity measures on complex descriptors. In particular, we focus on Color Histograms (CH), Histograms of Oriented Gradients (HOG), and Bag of visual Words histograms (BOW). The image is compared with the template via histogram matching that exploits integral histograms. In order to introduce spatial information, template and candidates are divided into sub-regions, and multiple descriptor sizes are computed. The proposed solution is compared with the Full-Search-equivalent Incremental Dissimilarity Approximations, a state-of-the-art approach, in terms of both accuracy and execution time on different standard datasets.
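The integral histogram mentioned above is what makes exhaustive histogram comparison affordable: once it is built, the histogram of any rectangular candidate window is obtained with four lookups per bin. The sketch below shows the generic data structure only, assuming each pixel has already been quantized to a bin index; the function names are hypothetical and the sub-region scheme and similarity measures of the paper are not included.

```python
import numpy as np

def integral_histogram(bin_idx, n_bins):
    """Build an (H+1)x(W+1)xB integral histogram from an HxW map of bin indices."""
    h, w = bin_idx.shape
    one_hot = np.eye(n_bins)[bin_idx]                  # HxWxB indicator per pixel
    ih = np.zeros((h + 1, w + 1, n_bins))
    ih[1:, 1:] = one_hot.cumsum(axis=0).cumsum(axis=1)  # 2D prefix sums per bin
    return ih

def region_histogram(ih, top, left, bottom, right):
    """Histogram of rows top..bottom-1 and columns left..right-1 in O(B) time."""
    return ih[bottom, right] - ih[top, right] - ih[bottom, left] + ih[top, left]

# Usage: histogram of the 2x2 window covering columns 1..2 of a tiny image.
bins = np.array([[0, 1, 1],
                 [2, 0, 1]])
ih = integral_histogram(bins, n_bins=3)
window_hist = region_histogram(ih, top=0, left=1, bottom=2, right=3)
```

The extra leading row and column of zeros avoid special cases for windows that touch the image border.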
In this work we address the problem of optimal sensor placement for a given region and task. An important
issue in designing sensor arrays is the appropriate placement of the sensors such that they achieve a predefined
goal. There are many problems that could be considered in the placement of multiple sensors. In this work
we focus on the four problems identified by Hörster and Lienhart. To solve these problems, we propose an
algorithm based on Direct Search, which is able to approach the global optimal solution within reasonable time
and memory consumption. The algorithm is experimentally evaluated and the results are presented on two real
floorplans. The experimental results show that our DS algorithm is able to improve the results given by the
best-performing heuristic introduced in. The algorithm is then extended to work also on continuous solution
spaces and 3D problems.
This paper focuses on full-reference image quality assessment and presents different computational strategies aimed at
improving the robustness and accuracy of some well-known and widely used state-of-the-art models, namely the Structural
Similarity approach (SSIM) by Wang and Bovik and the S-CIELAB spatial-color model by Zhang and Wandell. We
investigate the hypothesis that combining error images with a visual attention model could allow a better fit of the
psycho-visual data of the LIVE Image Quality Assessment Database Release 2. We show that the proposed quality
assessment metric better correlates with the experimental data.
We present different computational strategies for colorimetric characterization of scanners using multidimensional polynomials. The designed strategies allow us to determine the coefficients of an a priori fixed polynomial, taking into account different color error statistics. Moreover, since there is no clear relationship between the polynomial chosen for the characterization and the intrinsic characteristics of the scanner, we show how genetic programming could be used to generate the best polynomial. Experimental results on different devices are reported to confirm the effectiveness of our methods with respect to others in the state of the art.
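For a fixed polynomial, determining the coefficients can be sketched as a linear least-squares problem. The example below assumes a second-order polynomial in the scanner RGB values and minimizes the mean squared error in the target space, which is only one possible error statistic; the helper names and the choice of terms are assumptions, and the genetic programming search for the polynomial itself is not shown.

```python
import numpy as np

def poly_terms(rgb):
    """Expand Nx3 RGB values into the 10 terms of a second-order polynomial."""
    r, g, b = np.asarray(rgb, dtype=np.float64).T
    return np.stack(
        [np.ones_like(r), r, g, b, r * g, r * b, g * b, r ** 2, g ** 2, b ** 2],
        axis=1,
    )

def fit_characterization(rgb, target):
    """Least-squares coefficients (10x3) mapping scanner RGB to target values."""
    coeffs, *_ = np.linalg.lstsq(poly_terms(rgb), target, rcond=None)
    return coeffs

def apply_characterization(rgb, coeffs):
    """Map scanner RGB values to the target color space."""
    return poly_terms(rgb) @ coeffs

# Usage with synthetic data standing in for measured color patches.
rng = np.random.default_rng(1)
scanner_rgb = rng.uniform(size=(40, 3))
true_coeffs = rng.normal(size=(10, 3))          # hypothetical ground truth
measured = poly_terms(scanner_rgb) @ true_coeffs
fitted = fit_characterization(scanner_rgb, measured)
```

Optimizing a different error statistic, for instance a perceptual color difference, would generally require an iterative optimizer instead of the closed-form least-squares solution.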
Several algorithms have been proposed in the literature to recover
the illuminant chromaticity of the original scene. These algorithms
work well only when prior assumptions are satisfied, and the
best and the worst algorithms may be different for different scenes.
We investigate the idea of not relying on a single method but instead
considering a consensus decision that takes into account the responses
of several algorithms and adaptively chooses the algorithms
to be combined. We investigate different combining strategies
of state-of-the-art algorithms to improve the results in the
illuminant chromaticity estimation. Single algorithms and combined
ones are evaluated for both synthetic and real image databases
using the angular error between the RGB triplets of the measured
illuminant and the estimated one. Since we are interested in comparing the
performance of the methods over large data sets, experimental results
are also evaluated using the Wilcoxon signed rank test. Our
experiments confirm that no single algorithm is the best or the worst
among the state-of-the-art ones and show that simple
combining strategies improve the illuminant estimation.
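The angular error used as the evaluation measure is the angle between the estimated and the measured illuminant, treated as vectors in RGB space. A minimal sketch, with the hypothetical function name `angular_error_deg` and the result expressed in degrees:

```python
import numpy as np

def angular_error_deg(estimated, measured):
    """Angle in degrees between two RGB triplets, independent of their magnitude."""
    e = np.asarray(estimated, dtype=np.float64)
    m = np.asarray(measured, dtype=np.float64)
    cos_angle = e @ m / (np.linalg.norm(e) * np.linalg.norm(m))
    # Clip to guard against round-off pushing the cosine slightly outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

# Usage: compare an estimate against a measured illuminant.
error = angular_error_deg([0.9, 1.0, 0.8], [1.0, 1.0, 1.0])
```

Since the measure depends only on direction, scaling either triplet leaves the error unchanged, which is why illuminant estimates are usually compared in terms of chromaticity rather than intensity.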
According to the LeBaron effect, serial correlation is low when volatility is high and vice versa. We show that
this is true only for the predictable part of the volatility, while volatility which cannot be forecast is positively
linked to serial correlation. Since the mechanism of price formation can be very different in small and large
markets, we investigate the effect of volatility on intraday serial correlation in Italy (a small market) and the U.S. (a
large market). We find substantial differences in the impact of volatility in the two markets.
In this paper we investigate the relationship between matrixing methods, the number of filters adopted and the
size of the color gamut of a digital camera. The color gamut is estimated using a method based on the inversion of
the processing pipeline of the imaging device. Different matrixing methods are considered, including an original
method developed by the authors. For the selection of a hypothetical fourth filter, three different quality measures
have been implemented. Experimental results are reported and compared.
In this work we consider six methods for automatic white balance available in the literature. The idea investigated
does not rely on a single method, but instead considers a consensus decision that takes into account
the compendium of the responses of all the considered algorithms. Combining strategies are then proposed and
tested on both synthetic and multispectral images extracted from well-known databases. The multispectral
images are processed using a digital camera simulator developed by Stanford University. All the results are
evaluated using the Wilcoxon Sign Test.
We study the serial correlation of high-frequency intraday returns on the Italian stock index futures (FIB30) in the period 2000-2002. We adopt three different methods of analysis: the spectral density via Fast Fourier Transform, Detrended Fluctuation Analysis (DFA) and the Variance Ratio test. We find that intraday autocorrelation is mostly negative for time scales shorter than 20 minutes, but our results nonetheless support the efficiency of the Italian futures market.
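Of the three methods, the variance ratio is the simplest to illustrate: under uncorrelated returns, the variance of q-period returns equals q times the variance of one-period returns, so a ratio below 1 points to negative serial correlation. The sketch below computes only the plain ratio from overlapping q-period returns, without the bias corrections or the test statistic of the full Variance Ratio test; the function name is hypothetical.

```python
import numpy as np

def variance_ratio(returns, q):
    """Var(q-period returns) / (q * Var(1-period returns)), overlapping windows."""
    r = np.asarray(returns, dtype=np.float64)
    if q < 1 or r.size <= q:
        raise ValueError("q must satisfy 1 <= q < number of returns")
    r = r - r.mean()                                   # remove the average drift
    var_1 = np.mean(r ** 2)
    r_q = np.convolve(r, np.ones(q), mode="valid")     # rolling sums of q returns
    var_q = np.mean(r_q ** 2)
    return var_q / (q * var_1)

# Usage: perfectly alternating returns are strongly negatively correlated.
alternating = np.array([1.0, -1.0] * 50)
vr_2 = variance_ratio(alternating, q=2)
```

On real intraday data, the ratio would be computed for several values of q to see at which time scales it departs from 1.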