A good image feature representation is crucial for image classification tasks. Many traditional approaches design single-modal features for image classification; however, such features may fail to capture sufficient information, leading to misclassifications across categories. Recently, researchers have focused on designing multimodal features, which have been successfully employed in many situations. However, several problems remain in this research area, including selecting efficient features for each modality, transforming them into the subspace feature domain, and removing the heterogeneity among modalities. We propose an end-to-end multimodal deep neural network (MDNN) framework to automate the feature selection and transformation procedures for image classification. Furthermore, inspired by Fisher’s theory of linear discriminant analysis, we improve the proposed MDNN by further proposing a multimodal multitask deep neural network (M2DNN) model. The motivation behind M2DNN is to improve classification performance by incorporating an auxiliary discriminative constraint into the subspace representation. Experimental results on five representative datasets (NUS-WIDE, Scene-15, Texture-25, Indoor-67, and Caltech-101) demonstrate the effectiveness of the proposed MDNN and M2DNN models. In addition, experimental comparisons under the Fisher score criterion show that M2DNN is more robust and has better discriminative power than other approaches.
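The Fisher-inspired discriminative constraint mentioned above can be illustrated with a small sketch. The snippet below is not the paper's implementation; it is a minimal, assumed formulation of the classical Fisher criterion (between-class scatter divided by within-class scatter) applied to subspace feature vectors, of the kind M2DNN's auxiliary objective encourages to be large. The function name `fisher_score` and the synthetic data are purely illustrative.

```python
import numpy as np

def fisher_score(features, labels):
    """Classical Fisher criterion on a set of feature vectors.

    features: (n_samples, d) array of subspace representations
    labels:   (n_samples,) array of integer class labels
    Returns the ratio of between-class scatter to within-class scatter;
    larger values indicate more discriminative representations.
    """
    overall_mean = features.mean(axis=0)
    s_b = 0.0  # between-class scatter (class means vs. overall mean)
    s_w = 0.0  # within-class scatter (samples vs. their class mean)
    for c in np.unique(labels):
        cls = features[labels == c]
        mean_c = cls.mean(axis=0)
        s_b += len(cls) * np.sum((mean_c - overall_mean) ** 2)
        s_w += np.sum((cls - mean_c) ** 2)
    return s_b / s_w

# Well-separated classes score high; overlapping classes score low.
rng = np.random.default_rng(0)
separated = np.vstack([rng.normal(0, 0.1, (50, 8)),
                       rng.normal(5, 0.1, (50, 8))])
overlapping = np.vstack([rng.normal(0, 2.0, (50, 8)),
                         rng.normal(0.5, 2.0, (50, 8))])
y = np.array([0] * 50 + [1] * 50)
assert fisher_score(separated, y) > fisher_score(overlapping, y)
```

In a multitask setting such as the one described, a term of this form (or its negative, as a loss) would be added to the classification objective so that the learned subspace representation is pushed toward small within-class and large between-class scatter.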