Typical human actions last several seconds and exhibit characteristic spatio-temporal structure. The challenge for action recognition is to capture and fuse the multi-dimensional information contained in video data. To account for these characteristics simultaneously, we present a novel method that fuses features from multiple dimensions, such as chromatic images, depth maps, and optical flow fields. We build our model on multi-stream deep convolutional networks, guided by temporal segment networks, and extract discriminative spatial and temporal features by fusing the ConvNet towers across these dimensions, assigning a different weight to each feature stream to take full advantage of the multi-dimensional information. Our architecture is trained and evaluated on the NTU RGB-D dataset, currently the largest and most challenging benchmark for this task. Experiments demonstrate that our method outperforms state-of-the-art methods.
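To make the weighted multi-stream fusion concrete, the following is a minimal sketch of the idea in PyTorch: each modality (chromatic images, depth, optical flow) has its own ConvNet tower producing per-segment class scores, segment scores are averaged in the style of temporal segment networks, and the streams are combined with per-stream weights. The module name, the per-stream weight values, and the tower definitions are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class WeightedMultiStreamFusion(nn.Module):
    """Sketch of TSN-style multi-stream fusion with per-stream weights.

    Assumption: each tower maps a snippet (C, H, W) to class scores;
    the actual towers and weights in the paper may differ.
    """

    def __init__(self, streams: dict, weights: dict):
        super().__init__()
        self.streams = nn.ModuleDict(streams)
        # e.g. weights = {"rgb": 1.0, "depth": 0.5, "flow": 1.5} (hypothetical values)
        self.weights = weights

    def forward(self, inputs: dict) -> torch.Tensor:
        # inputs[name]: (batch, num_segments, C, H, W) snippets for each modality
        fused = 0.0
        for name, tower in self.streams.items():
            b, s = inputs[name].shape[:2]
            snippets = inputs[name].flatten(0, 1)     # (b * s, C, H, W)
            scores = tower(snippets).view(b, s, -1)   # per-snippet class scores
            consensus = scores.mean(dim=1)            # segmental consensus (average over segments)
            fused = fused + self.weights[name] * consensus
        return fused  # weighted sum of stream-level class scores
```

In this sketch, the fusion weights let the more discriminative streams (e.g., optical flow for motion-dominated actions) contribute more to the final prediction, which is the role the per-stream feature weights play in the proposed architecture.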