We aim to develop a better understanding of perception of similarity in focal computed tomography (CT) liver images to determine the feasibility of techniques for developing reference sets for training and validating content-based image retrieval systems. In an observer study, four radiologists and six nonradiologists assessed overall similarity and similarity in 5 image features in 136 pairs of focal CT liver lesions. We computed intra- and inter-reader agreements in these similarity ratings and viewed the distributions of the ratings. The readers’ ratings of overall similarity and similarity in each feature primarily appeared to be bimodally distributed. Median Kappa scores for intra-reader agreement ranged from 0.57 to 0.86 in the five features and from 0.72 to 0.82 for overall similarity. Median Kappa scores for inter-reader agreement ranged from 0.24 to 0.58 in the five features and were 0.39 for overall similarity. There was no significant difference in agreement for radiologists and nonradiologists. Our results show that developing perceptual similarity reference standards is a complex task. Moderate to high inter-reader variability precludes ease of dividing up the workload of rating perceptual similarity among many readers, while low intra-reader variability may make it possible to acquire large volumes of data by asking readers to view image pairs over many sessions.