Computer-aided diagnosis (CAD) tools using MR images have been widely developed for disease burden quantification, patient diagnosis and follow-up. Newer CAD tools, based on machine learning techniques, often require large and heterogeneous data-sets to provide accurate and generalizable results, and multi-center MR imaging data-sets are commonly used for this purpose. Collecting these data-sets typically requires adherence to an appropriate experimental protocol to ensure that findings are due to pathology and not to variability in image quality or acquisition parameters across scanners and/or imaging centers. We compared different experimental training protocols used with a representative CAD tool (in this work, designed to distinguish Alzheimer’s disease (AD) patients from normal control (NC) subjects) using public multi-center data-sets. We examined: 1) subsets of the data-set that were acquired on the same scanner (simulating a single-site homogeneous data-set), 2) a traditional cross-validation framework (i.e., randomly splitting the data-set into training and testing sets irrespective of center), and 3) a site-wise cross-validation framework, in which training and testing data were separated by center using a leave-one-center-out method (one center held out per iteration). Accuracy differed across the homogeneous data-set, traditional cross-validation and site-wise cross-validation (p = 0.0005): 100.0% (i.e., no misclassifications), 99.6% and 97.3%, respectively, even though the same image data-set, features and classifier were used. The lowest accuracy was observed with site-wise cross-validation, the only protocol with no site-wise contamination between training and testing samples.
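The difference between the traditional and site-wise protocols comes down to how the train/test indices are generated. The following is a minimal sketch of that idea only, not the pipeline used in this work: the function names, the toy center labels and the fold count are all hypothetical, and no features or classifier are involved.

```python
import random
from collections import defaultdict


def random_kfold_splits(n_samples, k, seed=0):
    """Traditional CV: shuffle indices and cut them into k folds, ignoring center."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def leave_one_center_out_splits(centers):
    """Site-wise CV: each iteration holds out every sample from one center."""
    by_center = defaultdict(list)
    for idx, center in enumerate(centers):
        by_center[center].append(idx)
    for held_out in sorted(by_center):
        test = by_center[held_out]
        train = [idx for c, idxs in by_center.items() if c != held_out for idx in idxs]
        yield train, test


def shares_center(train, test, centers):
    """True if at least one center contributes samples to both train and test."""
    return bool({centers[i] for i in train} & {centers[i] for i in test})


# Hypothetical toy data: 12 scans from 3 centers (4 scans each).
centers = ["A"] * 4 + ["B"] * 4 + ["C"] * 4

site_wise = list(leave_one_center_out_splits(centers))
# Assumption: k=4 is an arbitrary choice for illustration.
random_cv = list(random_kfold_splits(len(centers), k=4))

site_wise_contaminated = any(shares_center(tr, te, centers) for tr, te in site_wise)
random_contaminated = any(shares_center(tr, te, centers) for tr, te in random_cv)

print("site-wise CV mixes centers:", site_wise_contaminated)
print("random CV mixes centers:", random_contaminated)
```

In this toy setup each random test fold holds 3 scans while each center has 4, so every random split necessarily leaves scans from a test center in the training set. The leave-one-center-out splits never do, which is the property the site-wise protocol relies on.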