Large deformation diffeomorphic metric mapping (LDDMM) is a state-of-the-art deformable image registration algorithm that has shown superior performance, especially for brain images. LDDMM was originally proposed for matching intra-modality images, with the sum of squared differences (SSD) as the matching cost function. Extensions of LDDMM to other matching cost functions have been very limited. In this paper, we systematically evaluated three matching cost functions, namely SSD, mutual information (MI), and cross correlation (CC), in the LDDMM-image setting, based on 14 subcortical and ventricular structures in a total of 120 pairs of brain images. In addition, we proposed an efficient implementation of these three LDDMM-image settings via GPU-based parallel computing and quantitatively compared it with the standard open-source implementation of LDDMM-SSD in terms of both registration accuracy and computational time. The proposed parallelization and optimization approach yielded a 28-fold acceleration relative to the standard open-source implementation on a 4-core machine with a GTX970 card (29.67 min versus 828.35 min on average), without sacrificing registration accuracy. Among the three matching cost functions, LDDMM-CC performed best in terms of registration accuracy, obtaining Dice overlaps larger than 0.853 for a majority of the structures of interest.
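
For readers unfamiliar with the quantities named above, the following is a minimal NumPy sketch of the three matching cost functions and the Dice overlap. It is illustrative only and is not the GPU implementation evaluated in this paper. The function names, the global (rather than windowed) form of CC, the joint-histogram estimate of MI, and the choice of 32 bins are all assumptions made for this example.

```python
import numpy as np

def ssd(a, b):
    """Sum of squared differences; 0 when the two images are identical."""
    d = a.astype(np.float64) - b.astype(np.float64)
    return float(np.sum(d * d))

def cross_correlation(a, b):
    """Global normalized cross correlation in [-1, 1] (assumed global form)."""
    a = a.astype(np.float64).ravel()
    b = b.astype(np.float64).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt(np.sum(a * a) * np.sum(b * b))
    return 0.0 if denom == 0 else float(np.sum(a * b) / denom)

def mutual_information(a, b, bins=32):
    """MI in nats from a joint intensity histogram (bin count is an assumption)."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of a, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of b, shape (1, bins)
    nz = pxy > 0                          # skip empty bins to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def dice(seg_a, seg_b):
    """Dice overlap 2|A ∩ B| / (|A| + |B|) between two binary masks."""
    a = seg_a.astype(bool)
    b = seg_b.astype(bool)
    total = a.sum() + b.sum()
    return 1.0 if total == 0 else float(2.0 * np.logical_and(a, b).sum() / total)

# Synthetic example: an image, its contrast-inverted copy, and unrelated noise.
rng = np.random.default_rng(0)
fixed = rng.random((16, 16, 16))
inverted = 1.0 - fixed
noise = rng.random((16, 16, 16))

ssd_same = ssd(fixed, fixed)
cc_inverted = cross_correlation(fixed, inverted)
mi_inverted = mutual_information(fixed, inverted)
mi_noise = mutual_information(fixed, noise)

# Two 8-voxel masks sharing 4 voxels.
mask_a = np.zeros((4, 4), dtype=bool)
mask_b = np.zeros((4, 4), dtype=bool)
mask_a[:2, :] = True
mask_b[1:3, :] = True
dice_ab = dice(mask_a, mask_b)  # 2 * 4 / (8 + 8) = 0.5
```

In this toy setting, SSD is large for the contrast-inverted pair even though the two images are perfectly aligned, whereas MI remains high; this is the usual motivation for considering cost functions other than SSD when image intensities are not directly comparable.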