We address the image retrieval problem of finding those images in a large corpus that contain objects or scenes similar to a given query image. In the last decade, research on large-scale systems has shifted from local feature-based approaches such as the Bag-of-Words model to global aggregation methods that represent every image with a short, fixed-length vector. Examples of such methods include Fisher Vectors and the Vector of Locally Aggregated Descriptors (VLAD), both of which combine a variable number of local features into a single global vector. Moreover, global approaches that pool visual information from features based on convolutional neural networks (CNNs) have become increasingly popular for retrieval. In fact, fine-tuning CNNs or even learning the retrieval task end-to-end shows impressive performance for a specific targeted object class. We argue that this is reasonable for established public retrieval datasets, which typically show one large object (a building, sight, or scene) in the middle of the image. However, it often fails in real-world forensic scenarios in which one wants to find small objects against cluttered backgrounds. We therefore propose to adapt public datasets to generate novel evaluation setups, yielding tasks that are closer to the problem of small object retrieval. In experiments comparing global with local features, we show that the new evaluation setup makes it easier to focus on specific characteristics, such as object size, during evaluation.
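To make the aggregation idea concrete, the following is a minimal sketch of VLAD-style pooling: each local descriptor is assigned to its nearest visual-word centroid, the residuals to that centroid are summed, and the result is normalized into one fixed-length vector. This is an illustrative NumPy implementation, not the exact pipeline of any particular system; the centroids are assumed to come from a codebook learned beforehand (e.g. with k-means).

```python
import numpy as np

def vlad(descriptors, centroids):
    """Aggregate a variable number of local descriptors into one VLAD vector.

    descriptors: (N, D) array of local features extracted from one image.
    centroids:   (K, D) visual-word centroids of a pre-trained codebook.
    Returns a flat (K*D,) L2-normalized VLAD vector.
    """
    # Hard-assign each descriptor to its nearest centroid.
    dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    assignments = np.argmin(dists, axis=1)

    K, D = centroids.shape
    v = np.zeros((K, D))
    for k in range(K):
        members = descriptors[assignments == k]
        if len(members) > 0:
            # Accumulate residuals of the assigned descriptors to centroid k.
            v[k] = (members - centroids[k]).sum(axis=0)

    v = v.reshape(-1)
    # Signed square-root (power-law) normalization followed by L2 normalization,
    # as commonly used with VLAD to reduce the influence of bursty features.
    v = np.sign(v) * np.sqrt(np.abs(v))
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```

The key property for retrieval is that the output length K*D is independent of the number of local features N, so images can be compared with a single dot product.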