With new depth sensing technologies such as the Kinect providing high-quality synchronized RGB and depth images (RGB-D data), efficiently learning rich representations plays an important role in multi-modal recognition tasks and is crucial for achieving high generalization performance. To address this problem, in this paper we propose an effective multi-modal convolutional extreme learning machine with kernel (MMC-KELM) structure, which combines the representational power of CNNs with the fast training of ELMs. In this model, a CNN with multiple alternating convolution and stochastic pooling layers abstracts high-level features from each modality (RGB and depth) separately, without parameter tuning. A shared layer is then formed by combining the features from the two modalities. Finally, the fused features are fed to a kernel extreme learning machine (KELM), which yields better generalization performance with faster learning. Experimental results on the Washington RGB-D Object Dataset show that the proposed multi-modality fusion method achieves state-of-the-art performance with much lower complexity.
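The KELM output stage mentioned above admits a closed-form solution, which is the source of the fast training the abstract refers to. The following is a minimal NumPy sketch of that idea; the function names, the choice of an RBF kernel, and the values of the regularization parameter `C` and kernel width `gamma` are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.1):
    # Pairwise RBF kernel between the rows of A and the rows of B.
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def kelm_train(X, y, C=100.0, gamma=0.1):
    # Closed-form KELM training: solve (K + I/C) beta = T for the
    # output weights beta, where T is the one-hot target matrix.
    n_classes = int(y.max()) + 1
    T = np.eye(n_classes)[y]
    K = rbf_kernel(X, X, gamma)
    beta = np.linalg.solve(K + np.eye(len(X)) / C, T)
    return beta

def kelm_predict(X_train, beta, X_test, gamma=0.1):
    # Classify by the largest output among the per-class scores.
    K = rbf_kernel(X_test, X_train, gamma)
    return np.argmax(K @ beta, axis=1)
```

Here `X` would hold the fused CNN features from the shared layer, one row per RGB-D sample. Because training reduces to a single regularized linear solve, there is no iterative back-propagation through the output stage.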