We propose a convolutional neural network (CNN)-based anthropomorphic model observer to predict human observer detection performance for breast cone-beam CT images. We generated breast backgrounds with a 50% volume glandular fraction and inserted a 2 mm diameter spherical signal near the center. Projection data were acquired using a forward projection algorithm and reconstructed using the Feldkamp-Davis-Kress (FDK) algorithm. To generate different noise structures, the projection data were filtered with Hanning, Shepp-Logan, and Ram-Lak filters, each with and without Fourier interpolation, resulting in six noise structures. To investigate the benefit of nonlinearity in the CNN, we used two network architectures: a linear CNN (Li-CNN) without any activation function and a multi-layer CNN (ML-CNN) with leaky rectified linear unit activations. For comparison, we also used a nonprewhitening observer with an eye filter (NPWE) whose peak is at a frequency of 7 cyc/deg, based on our previous work. We trained the CNNs to minimize the mean squared error using 12,000 pairs of signal-present and signal-absent images labeled with the decision variables of NPWE. For labeling, the eye filter parameter of NPWE was fine-tuned separately for each noise structure so that its percent correct matched that of human observers. Note that we trained a single network across the different noise structures, whereas the NPWE template was estimated separately for each noise structure. We conducted four-alternative forced-choice (4AFC) detection tasks and compared the percent correct of human and model observers. The results show that the proposed ML-CNN predicts the detection performance of human observers better than NPWE and Li-CNN.
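As an illustration of the NPWE baseline and the 4AFC scoring described above, the following is a minimal NumPy sketch, not the implementation used in this work. It assumes the commonly used eye-filter form E(f) = f^n exp(-c f^2) with n = 1.3, where c is chosen so that the filter peaks at 7 cyc/deg. The function names, the `deg_per_pixel` viewing-geometry parameter, the Gaussian stand-in signal, and the white-noise backgrounds are all hypothetical choices made for the example.

```python
import numpy as np


def eye_filter(freq, peak=7.0, n=1.3):
    """Assumed eye-filter form E(f) = f^n * exp(-c f^2); c places the peak at `peak` cyc/deg."""
    c = n / (2.0 * peak ** 2)  # from dE/df = 0  ->  f_peak^2 = n / (2c)
    return freq ** n * np.exp(-c * freq ** 2)


def npwe_template(signal, deg_per_pixel, peak=7.0):
    """NPWE template: the expected signal filtered by E^2 in the Fourier domain."""
    fy = np.fft.fftfreq(signal.shape[0], d=deg_per_pixel)  # cycles per degree
    fx = np.fft.fftfreq(signal.shape[1], d=deg_per_pixel)
    radial = np.hypot(*np.meshgrid(fy, fx, indexing="ij"))
    e = eye_filter(radial, peak=peak)
    return np.real(np.fft.ifft2(e ** 2 * np.fft.fft2(signal)))


def npwe_response(images, template):
    """Decision variable: inner product of each image (last two axes) with the template."""
    return np.sum(images * template, axis=(-2, -1))


def percent_correct_4afc(dv_present, dv_absent):
    """A trial is correct when the signal-present response exceeds all three signal-absent ones.

    dv_present has shape (trials,), dv_absent has shape (trials, 3).
    """
    return float(np.mean(dv_present > dv_absent.max(axis=1)))


# Toy demonstration with a Gaussian blob and white noise (stand-ins, not breast CT data).
rng = np.random.default_rng(0)
size, trials = 32, 200
yy, xx = np.mgrid[:size, :size] - size // 2
signal = 0.5 * np.exp(-(xx ** 2 + yy ** 2) / (2.0 * 2.0 ** 2))

template = npwe_template(signal, deg_per_pixel=0.02)
present = signal + rng.normal(size=(trials, size, size))
absent = rng.normal(size=(trials, 3, size, size))

pc = percent_correct_4afc(npwe_response(present, template),
                          npwe_response(absent, template))
print(f"NPWE 4AFC percent correct on toy data: {100 * pc:.1f}%")
```

In this sketch, the decision variables returned by `npwe_response` play the role of the regression labels that the CNNs are trained to reproduce under a mean squared error loss. Fine-tuning the eye filter per noise structure would correspond to adjusting `peak` (or an equivalent parameter) until the percent correct matches the human value.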