This paper studies the relation between images and text in image databases. An analysis of this relation results in the definition of three distinct query modalities: (1) linguistic scenario: images are part of a whole that includes a self-contained linguistic discourse, and their meaning derives from their interaction with that discourse. A typical case of this scenario is constituted by images on the World Wide Web; (2) closed world scenario: images are defined in a limited domain, and their meaning is anchored by the conventions and norms of that domain; (3) user scenario: the linguistic discourse is provided by the user. This is the case of highly interactive systems with relevance feedback. This paper deals with image databases of the first type. It shows how the relation between images and text can be inferred and exploited for search. The paper develops a similarity model in which the similarity between two images is given by both their visual similarity and the similarity of the attached words. Both the visual and the textual similarity can be manipulated by the user through the two windows of the interface.
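The abstract does not specify how the visual and textual components are combined. A minimal sketch of one plausible reading, assuming cosine similarity over image feature vectors, Jaccard overlap over attached word sets, and a user-adjustable convex weight `alpha` (all illustrative choices, not the paper's actual model):

```python
from math import sqrt

def visual_similarity(f1, f2):
    """Cosine similarity between two image feature vectors (assumed representation)."""
    dot = sum(a * b for a, b in zip(f1, f2))
    n1 = sqrt(sum(a * a for a in f1))
    n2 = sqrt(sum(b * b for b in f2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def textual_similarity(words1, words2):
    """Jaccard overlap between the word sets attached to two images (assumed measure)."""
    s1, s2 = set(words1), set(words2)
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

def image_similarity(f1, words1, f2, words2, alpha=0.5):
    """Convex combination of visual and textual similarity.

    alpha plays the role of the user-controlled balance the abstract
    attributes to the two windows of the interface: alpha=1 is purely
    visual, alpha=0 purely textual.
    """
    return (alpha * visual_similarity(f1, f2)
            + (1 - alpha) * textual_similarity(words1, words2))
```

For example, with `alpha=1.0` two images with identical feature vectors score 1.0 regardless of their attached words, while lowering `alpha` lets shared vocabulary dominate the ranking.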