Multimodal imaging systems have recently been drawing attention in fields such as medical imaging, remote sensing, and video surveillance. In such systems, depth estimation has become feasible thanks to the promising progress of multimodal matching techniques. We perform a systematic performance evaluation of similarity measures frequently used in the literature for dense multimodal stereovision. The evaluated measures include mutual information (MI), the sum of squared differences, normalized cross-correlation, the census transform, and local self-similarity (LSS), as well as descriptors adapted to multimodal settings, such as the scale-invariant feature transform (SIFT), speeded-up robust features (SURF), the histogram of oriented gradients (HOG), binary robust independent elementary features (BRIEF), and the fast retina keypoint (FREAK). We evaluate the measures over datasets that we generated, compiled, and provide as a benchmark, and we compare performances using the Winner-Takes-All (WTA) method. The datasets are (1) synthetically modified versions of four popular pairs from the Middlebury Stereo Dataset (namely, Tsukuba, Venus, Cones, and Teddy) and (2) our own multimodal image pairs acquired using the infrared and electro-optical cameras of a Kinect device. The results show that MI and HOG provide promising results for multimodal imagery, while FREAK, SURF, SIFT, and LSS can be considered alternatives depending on the multimodality level and the computational complexity requirements of the intended application.
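To make the evaluation pipeline concrete, the following is a minimal sketch of one of the measures above (the census transform, whose matching cost is a Hamming distance) combined with Winner-Takes-All disparity selection. This is an illustrative simplification, not the paper's implementation: function names are mine, borders wrap around via `np.roll`, and no multimodal-specific handling is included.

```python
import numpy as np

def census_transform(img, window=3):
    """Census transform: encode each pixel's neighborhood as a bit string
    of center-vs-neighbor intensity comparisons (3x3 -> 8 bits)."""
    r = window // 2
    out = np.zeros(img.shape, dtype=np.uint64)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dy == 0 and dx == 0:
                continue
            # Shift the image so each pixel sees its (dy, dx) neighbor.
            neighbor = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
            out = (out << np.uint64(1)) | (neighbor < img).astype(np.uint64)
    return out

def wta_disparity(left, right, max_disp=16):
    """Winner-Takes-All: for every left-image pixel, pick the disparity
    whose census matching cost (Hamming distance) is lowest."""
    cl, cr = census_transform(left), census_transform(right)
    costs = np.empty((max_disp,) + left.shape)
    for d in range(max_disp):
        # A left pixel at x matches the right pixel at x - d.
        xor = cl ^ np.roll(cr, d, axis=1)
        # Popcount of the XOR gives the Hamming distance per pixel.
        ham = np.zeros(left.shape)
        for _ in range(64):
            ham += (xor & np.uint64(1)).astype(float)
            xor >>= np.uint64(1)
        costs[d] = ham
    return costs.argmin(axis=0)  # WTA: minimum-cost disparity wins
```

In the benchmark setting, each similarity measure above simply replaces the census/Hamming cost inside the same WTA loop, which is what makes the comparison across measures uniform.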