Differences in the size distributions of malignant and benign pulmonary nodules in the databases used to train and test characterization systems have a significant impact on measured performance. This paper explores the magnitude of this effect and methods for reporting more relevant performance results. Two- and three-dimensional features, both including and excluding size, and two classifiers, logistic regression and distance-weighted nearest-neighbors (dwNN), were evaluated on a database of 178 pulmonary nodules. For the full database, the area under the ROC curve (AUC) of the logistic regression classifier was 0.721 with size and 0.614 without size for 2D features, and 0.773 with size and 0.737 without size for 3D features. In comparison, a simple size-threshold classifier achieved an AUC of 0.675. In the second part of the study, performance was measured on a subset of 46 nodules, selected from the full database so that the malignant and benign nodules had similar size distributions. For this subset, the AUC of the size-threshold classifier was 0.504. For logistic regression, the AUC was 0.578 with size and 0.478 without size for 2D features, and 0.671 with size and 0.767 without size for 3D features. Across all datasets, logistic regression performed better with 3D features than with 2D features. This study suggests that in nodule classification systems, size accounts for a large part of the reported performance. To address this, system performance should be reported relative to the performance of a size-threshold classifier.
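The recommendation to report performance relative to a size-threshold classifier can be illustrated with a minimal sketch. It relies on the standard identity that the AUC obtained by sweeping a threshold over a single score equals the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as one half. The function names (`auc`, `size_baseline_report`) and the toy values are hypothetical and are not taken from the study; the paper's own evaluation code is not described here.

```python
from itertools import product


def auc(pos_scores, neg_scores):
    """AUC as the probability that a positive case outscores a negative one.

    Ties contribute 0.5, which matches the area under the empirical ROC curve.
    """
    wins = 0.0
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def size_baseline_report(sys_malignant, sys_benign, size_malignant, size_benign):
    """Report a system's AUC next to the AUC of using nodule size alone."""
    system_auc = auc(sys_malignant, sys_benign)
    # Using size itself as the score is equivalent to sweeping a size threshold.
    size_auc = auc(size_malignant, size_benign)
    return {
        "system_auc": system_auc,
        "size_auc": size_auc,
        "gain_over_size": system_auc - size_auc,
    }


# Hypothetical toy data for illustration only (sizes in mm, scores in [0, 1]).
malignant_sizes = [12.0, 15.0, 9.0, 20.0]
benign_sizes = [5.0, 7.0, 9.0, 11.0]
malignant_scores = [0.8, 0.9, 0.6, 0.7]
benign_scores = [0.2, 0.4, 0.65, 0.3]

report = size_baseline_report(
    malignant_scores, benign_scores, malignant_sizes, benign_sizes
)
print(report)
```

In this toy example the classifier looks strong in isolation, but most of its AUC is already achieved by size alone, so the gain over the size baseline is small. Reporting both numbers side by side makes that dependence visible.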