We aimed to compare two different deep-learning-based software solutions against two senior and one junior radiologist in a recall-based model for mammography. We performed a retrospective, monocentric, multi-reader study at the Centre Hospitalier de Valenciennes in the north of France. A set of examinations from daily practice, comprising both screening and diagnostic studies, was interpreted by the three radiologists and by the two AI-based algorithms. The dataset was enriched with BI-RADS 4 and 5 cases to obtain a number of cancer cases sufficient for statistically significant results. In total, 140 examinations were included in the final dataset. Sensitivity (true positive rate, TPR), false positive rate (FPR), and recall rate per BI-RADS category were the endpoints for each reader. To compute these metrics, all included cases were considered positive if the initial BI-RADS was 3 or higher and negative if the initial BI-RADS was 1 or 2. Additional analyses were carried out using the biopsy report (when available) as the ground truth. While both algorithms and the radiologists achieved good and comparable sensitivity and FPR, the analysis by BI-RADS category (i.e., the number of cancers per BI-RADS category) showed heterogeneous results, with poor performance for one of the tested software solutions at the extreme BI-RADS scores. We concluded that one of the analysed software solutions cannot be used in current clinical practice without further improvement; the second shows promising results, but additional studies are needed for robust external validation before it can be used in daily practice.
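For clarity, under the dichotomization described above (initial BI-RADS 3 or higher counted as positive, BI-RADS 1 or 2 as negative), the two main endpoints reduce to their standard definitions; the following is a conventional restatement, not additional methodology from the study protocol:

\[
\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN},
\]

where, for a given reader or algorithm, \(TP\) and \(FN\) are the positive cases that were recalled and not recalled, respectively, and \(FP\) and \(TN\) are the negative cases that were recalled and correctly not recalled.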