In this work, we validated a task-based performance figure of merit (FOM) by investigating ranking inconsistencies caused by lurking variables. We applied a falsifiable search assessment theory to the assessment of digital breast tomosynthesis (DBT) image quality, using a scanning channelized Hotelling observer (CHO) on a simulated DBT dataset. We compared the performance of five reconstruction algorithms: filtered back projection (FBP), maximum likelihood (ML), simultaneous algebraic reconstruction technique (SART), and a total-variation regularized least-squares estimator (TVLS) with strong and with mild regularization settings. The results showed that location-known-exactly (LKE) detection performance was almost identical for the five reconstruction algorithms. However, the search characteristics, as described by the effective set size (M*) and the search AUC value, ranked them differently. To falsify or corroborate our evaluations of search characteristics and performance, we conducted an image-size test. This test demonstrated agreement between theoretical predictions and empirically measured observer performance in absolute performance levels, except for the ML algorithm. We concluded that the evidence corroborated our evaluations for all algorithms except ML, for which our evaluation was wrong. Further investigation of the ML case revealed a lurking variable that affected the ranking of system performance in search when the AUC value was used as the FOM. This further confirmed that our evaluation of the ML algorithm, in its current form, was indeed wrong. We also noted that ranking inconsistencies exist even when the AUC value is used as the FOM, and that the falsifiable nature of M* allowed such inconsistencies to be identified.