Image retrieval is a human-centered task: images are created by people and are ultimately accessed and used by people for human-related activities. In designing image retrieval systems and algorithms, or in measuring their performance, it is therefore imperative to consider the conditions that surround both the indexing of image content and its retrieval. This includes examining the different levels of interpretation for retrieval, possible search strategies, and image uses. Furthermore, we must consider different levels of similarity and the role of human factors such as culture, memory, and personal context. This paper takes a human-centered perspective in outlining the levels of description, types of users, search strategies, image uses, and human factors (memory, context, and subjectivity) that affect the construction and evaluation of automatic content-based retrieval systems.
Many recent efforts have been made to automatically index multimedia content with the aim of bridging the semantic gap between syntax and semantics. In this paper, we propose a novel framework that uses context to automatically index video for video understanding. First, we discuss the notion of context and how it relates to video understanding. Then we present the framework we are constructing, which is modeled as an expert system that uses a rule-based engine, domain knowledge, visual detectors (for objects and scenes), and the different data sources available with the video (metadata, text from automatic speech recognition, etc.). We also describe our approach to aligning text from speech recognition with video segments, and present experiments using a simple implementation of our framework. Our experiments show that context can be used to improve the performance of visual detectors.
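The abstract does not specify how the rule-based engine combines contextual evidence with detector output; the sketch below is only an illustrative guess at one such fusion scheme. The function names, the multiplicative boost rule, and the example terms are all assumptions, not the authors' implementation.

```python
# Hypothetical sketch: re-weighting a visual detector's confidence
# with rule-based context. The fusion rule and all names below are
# illustrative assumptions, not the paper's implementation.

def contextual_score(detector_score, context_terms, rules):
    """Adjust a detector's raw confidence using contextual triggers.

    detector_score: raw confidence in [0, 1] from a visual detector.
    context_terms:  words drawn from metadata / ASR transcripts.
    rules:          trigger word -> multiplicative boost (>1 supports
                    the detection, <1 weakens it).
    """
    boost = 1.0
    for term, weight in rules.items():
        if term in context_terms:
            boost *= weight
    return min(1.0, detector_score * boost)

# Example: a hypothetical "anchor person" detector in news video.
rules = {"headline": 1.3, "reporting": 1.2, "music": 0.8}
asr_terms = {"tonight", "headline", "reporting"}
print(contextual_score(0.55, asr_terms, rules))  # 0.55 * 1.3 * 1.2 ~= 0.86
```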
We explore the way in which people look at images of different semantic categories and directly relate those results to computational approaches for automatic image classification. Our hypothesis is that the eye movements of human observers differ for images of different semantic categories, and that this information can be effectively used in automatic content-based classifiers. First, we present eye tracking experiments that show the variation in eye movements across different individuals for images of five different categories: handshakes, crowds, landscapes, main object in an uncluttered background, and miscellaneous. The eye tracking results suggest that similar viewing patterns occur when different subjects view different images in the same semantic category. Using these results, we examine how empirical data obtained from eye tracking experiments across different semantic categories can be integrated with existing computational frameworks, or used to construct new ones. In particular, we examine the Visual Apprentice, a system in which image classifiers are learned from user input as the user defines a multiple-level object-definition hierarchy based on an object and its parts, and labels examples for specific classes. The resulting classifiers are applied to automatically classify new images. Although many eye tracking experiments have been performed, to our knowledge this is the first study that specifically compares eye movements across categories, and that links category-specific eye tracking results to automatic image classification techniques.
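As a purely illustrative companion to the abstract above (the study's actual analysis is not reproduced here), the following sketch shows one simple way category-specific viewing patterns could be summarized and compared: fixations binned into spatial histograms, with an L1 distance between them. All names and data are hypothetical.

```python
# Illustrative only: summarizing fixations as spatial histograms and
# comparing them across semantic categories. Data are hypothetical.
import numpy as np

def fixation_histogram(fixations, grid=(4, 4)):
    """Bin (x, y) fixation points, normalized to [0, 1], into a grid."""
    hist, _, _ = np.histogram2d(
        [f[0] for f in fixations], [f[1] for f in fixations],
        bins=grid, range=[[0, 1], [0, 1]])
    total = hist.sum()
    return hist / total if total else hist

def pattern_distance(fix_a, fix_b):
    """L1 distance between two normalized fixation histograms."""
    return np.abs(fixation_histogram(fix_a) - fixation_histogram(fix_b)).sum()

# Hypothetical fixations: handshakes draw gaze toward the image center,
# landscapes spread it out; a large distance separates the two patterns.
handshake = [(0.50, 0.50), (0.55, 0.48), (0.52, 0.51)]
landscape = [(0.10, 0.20), (0.90, 0.30), (0.50, 0.80)]
print(pattern_distance(handshake, landscape))
```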
In this paper, we propose a dynamic approach to feature and classifier selection, in which visual features and classifiers are selected automatically based on their performance. In earlier work, we presented the Visual Apprentice, in which users can define visual object models via a multiple-level object-definition hierarchy. As the user provides examples from images or video, visual features are extracted and multiple classifiers are learned for each node of the hierarchy, using various learning algorithms, to produce Visual Object Detectors. In this paper, features and classifiers are selected automatically at each node, depending on their performance over the training set. We introduce the concept of Recurrent Visual Semantics and show how it can be used to identify domains in which performance-based learning techniques such as the one presented can be applied. We then show experimental results in detecting baseball video shots, images that contain handshakes, and images that contain skies. These results demonstrate the importance, feasibility, and usefulness of dynamic feature/classifier selection for the classification of visual information, and the performance benefits of using multiple learning algorithms to build classifiers. Based on our experiments, we also discuss some of the issues that arise when applying learning techniques in real-world content-based applications.
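The abstract does not name a specific selection procedure, so the following is a minimal sketch of performance-based classifier selection using scikit-learn (which postdates the paper): per node, train several candidate learners and keep whichever scores best under cross-validation. The candidate set and scoring choices are assumptions.

```python
# A minimal sketch of performance-based selection, assuming scikit-learn:
# per node of the hierarchy, keep whichever candidate learner attains
# the best cross-validated accuracy on the training set.
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier   # a lazy learner
from sklearn.tree import DecisionTreeClassifier

def select_classifier(X, y, candidates, folds=5):
    """Return (fitted best classifier, its mean CV accuracy)."""
    scored = [(cross_val_score(clf, X, y, cv=folds).mean(), clf)
              for clf in candidates]
    best_score, best_clf = max(scored, key=lambda pair: pair[0])
    return best_clf.fit(X, y), best_score

# One selection would be run per node (region, object-part, object, ...).
candidates = [KNeighborsClassifier(n_neighbors=3),
              DecisionTreeClassifier(max_depth=5)]
```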
In this paper, we present a conceptual framework for indexing different aspects of visual information. Our framework unifies concepts from the literature in diverse fields such as cognitive psychology, library science, art, and the more recent content-based retrieval. We present multiple-level structures for visual and non-visual information. The ten-level visual structure presented provides a systematic way of indexing images based on syntax and semantics, and includes distinctions between general concepts and visual concepts. We define different types of relations at different levels of the visual structure, and also use a semantic information table to summarize important aspects related to an image. While the focus is on the development of a conceptual indexing structure, our aim is also to bring together the knowledge from various fields, unifying the issues that should be considered when building a digital image library. Our analysis stresses the limitations of state-of-the-art content-based retrieval systems and suggests areas in which improvements are necessary.
The convergence of inexpensive digital cameras and cheap hardware for displaying stereoscopic images has created the right conditions for the proliferation of stereoscopic imaging applications. One application, which is of growing importance to museums and cultural institutions, consists of capturing and displaying 3D images of objects at multiple orientations. In this paper, we present our stereoscopic imaging system and methodology for semi-automatically capturing multiple-orientation stereo views of objects in a studio setting, and demonstrate the superiority of using a high-resolution, high-fidelity digital color camera for stereoscopic object photography. We show the superior performance achieved with the IBM TDI-Pro 3000 digital camera developed at IBM Research. We examine various choices related to the camera parameters and image capture geometry, and suggest a range of optimum values that work well in practice. We also examine the effect of scene composition and background selection on the quality of the stereoscopic image display. We demonstrate our technique with turntable views of objects from the IBM Corporate Archive.
One of the major challenges in scanning and printing documents in a digital library is preserving the quality of the documents, and in particular of the images they contain. When photographs are offset-printed, a process of screening usually takes place: during screening, a continuous-tone image is converted into a bi-level image by applying a screen to replace each color in the original image. When high-resolution scanning of screened images is performed, it is very common to observe, in the digital version of the document, the screen patterns used during the original printing. In addition, when printing the digital document, moiré effects tend to appear because printing requires halftoning. In order to automatically suppress these moiré patterns, it is necessary to detect the image areas of the document and remove the screen pattern present in those areas. In this paper, we present efficient and robust techniques to segment a grayscale document into halftone image areas, detect the presence and frequency of screen patterns in halftone areas, and suppress the detected screens. We present novel techniques to perform fast segmentation based on α-crossings, detection of screen frequencies using a fast accumulator function, and suppression of detected screens by low-pass filtering.
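The paper's α-crossing segmentation and accumulator function are not reproduced here; as a rough sketch of the general pipeline the abstract describes (estimate the screen's dominant frequency, then low-pass filter just enough to remove it), consider the following. The FFT peak-picking and the sigma heuristic are assumptions, not the authors' technique.

```python
# Rough sketch of the pipeline's spirit: estimate the screen's dominant
# spatial frequency from the spectrum, then low-pass filter just enough
# to remove it. The peak-picking and sigma heuristic are assumptions,
# not the paper's alpha-crossing / accumulator-function techniques.
import numpy as np
from scipy import ndimage

def dominant_screen_frequency(patch):
    """Return the radial frequency of the strongest non-DC spectral peak."""
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(patch)))
    cy, cx = spectrum.shape[0] // 2, spectrum.shape[1] // 2
    spectrum[cy - 2:cy + 3, cx - 2:cx + 3] = 0        # mask the DC region
    py, px = np.unravel_index(np.argmax(spectrum), spectrum.shape)
    return np.hypot(py - cy, px - cx)

def suppress_screen(patch, radial_freq):
    """Gaussian low-pass sized to the detected screen period (heuristic)."""
    period = patch.shape[0] / radial_freq             # screen pitch, pixels
    return ndimage.gaussian_filter(patch, sigma=max(1.0, period / 2.0))
```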
Most existing approaches to content-based retrieval rely on query by example or user sketches based on low-level features. However, these are not suitable for semantic (object-level) distinctions. In other approaches, information is classified according to a predefined set of classes, and classification is performed either manually or by using class-specific algorithms. Most of these systems lack flexibility: the user does not have the ability to define or change the classes, and new classification schemes require the implementation of new class-specific algorithms and/or the input of an expert. In this paper, we present a different approach to content-based retrieval and a novel framework for the classification of visual information, in which (1) users define their own visual classes and classifiers are learned automatically, and (2) multiple fuzzy classifiers and machine learning techniques are combined for automatic classification at multiple levels (region, perceptual, object-part, object, and scene). We present the Visual Apprentice, an implementation of our framework for still images and video that uses a combination of lazy learning, decision trees, and evolution programs for classification and grouping. Our system is flexible in that models can be changed by users over time, different types of classifiers are combined, and user-model definitions can be applied to object and scene structure classification. Special emphasis is placed on the difference between semantic and visual classes, and between classification and detection. Examples and results are presented to demonstrate the applicability of our approach to perform visual classification and detection.
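As an illustration of the multiple-level classification idea (not the Visual Apprentice's actual algorithms), the sketch below scores an object node by fusing its own classifier's confidence with the average of its parts' scores. The node names, feature keys, and averaging rule are all hypothetical.

```python
# Hypothetical sketch of multiple-level classification: an object node's
# score fuses its own classifier with the average of its parts' scores.
# Node names, feature keys, and the fusion rule are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    classify: callable                 # features -> confidence in [0, 1]
    children: list = field(default_factory=list)

    def score(self, features_by_node):
        own = self.classify(features_by_node[self.name])
        if not self.children:
            return own
        part_avg = sum(child.score(features_by_node)
                       for child in self.children) / len(self.children)
        return 0.5 * (own + part_avg)  # simple averaging fusion (assumed)

# Example: a "handshake" object defined over two "hand" parts.
hand = lambda f: f["skin_ratio"]
scene = lambda f: f["edge_density"]
tree = Node("handshake", scene,
            [Node("hand_left", hand), Node("hand_right", hand)])
features = {"handshake": {"edge_density": 0.4},
            "hand_left": {"skin_ratio": 0.8},
            "hand_right": {"skin_ratio": 0.7}}
print(tree.score(features))            # 0.5 * (0.4 + 0.75) = 0.575
```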