Many image retrieval systems, and the evaluation methodologies of these systems, make use of either visual or textual information only. Only a few combine textual and visual features for retrieval and evaluation. When text is used, it often relies on a standardised and complete annotation schema for the entire collection. This, in combination with high-level semantic queries, makes combining visual and textual features almost useless, as the information need can often be satisfied using textual features alone. In reality, many collections do have some form of annotation, but it is often heterogeneous and incomplete. Web-based image repositories such as Flickr even allow collective as well as multilingual annotation of multimedia objects. This article describes an image retrieval evaluation campaign called ImageCLEF. Unlike previous evaluations, we offer a range of realistic tasks and image collections in which combining text and visual features is likely to obtain the best results. In particular, we offer a medical retrieval task that models exactly this situation of heterogeneous annotation by combining four collections with annotations of varying quality, structure, extent and language. Two of the collections have an annotation per case rather than per image, which is common in the medical domain, making it difficult to relate parts of the accompanying text to the corresponding images. This is also typical of image retrieval from the web, where adjacent text does not always describe an image. The ImageCLEF benchmark shows the need for realistic and standardised datasets, search tasks and ground truths for visual information retrieval evaluation.