Video-based human action recognition is a challenging task in computer vision. In recent years, convolutional neural networks (CNNs) and their extensions have shown promising results for video action recognition. However, most existing methods cannot handle global motion information effectively, especially long-term motion, which is crucial for representing complex non-periodic actions. To address this issue, a stacked trajectory energy image (STEI) is proposed, constructed by extracting trajectories from motion saliency regions and stacking them onto one grayscale image. The resulting STEI has a discriminative texture feature that effectively characterizes the global motion across multiple consecutive frames. Then, a three-stream CNN framework is proposed to simultaneously capture spatial, temporal, and global motion information of the action from RGB frames, optical flow, and the STEI, respectively. Moreover, a trajectory-aware convolution strategy is introduced that incorporates local and long-term motion information so as to learn motion features directly and effectively from three complementary action-related regions. Finally, the learned features are aggregated and classified by a linear support vector machine. Experimental results on two challenging datasets (i.e., HMDB51 and UCF101) demonstrate that our approach statistically outperforms a number of state-of-the-art methods.
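
As an illustration of the stacking step only, the following minimal sketch rasterizes a set of point trajectories onto one grayscale image. It is not the implementation used in this work: the function name `stack_trajectories`, the brightness-by-time encoding, and the max-based handling of overlaps are all assumptions made for the example, and trajectory extraction from motion saliency regions is assumed to have been done beforehand.

```python
import numpy as np


def stack_trajectories(trajectories, height, width):
    """Rasterize point trajectories onto a single grayscale image.

    trajectories: iterable of (T, 2) sequences of (x, y) pixel coordinates.
    Assumption (not from the source): later segments are drawn brighter,
    so pixel intensity encodes temporal order along each trajectory.
    """
    image = np.zeros((height, width), dtype=np.uint8)
    for traj in trajectories:
        traj = np.asarray(traj, dtype=np.float64)
        n_segments = len(traj) - 1  # a single point yields no segments
        for i in range(n_segments):
            (x0, y0), (x1, y1) = traj[i], traj[i + 1]
            # Sample densely enough along the segment to leave no pixel gaps.
            steps = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
            xs = np.rint(np.linspace(x0, x1, steps + 1)).astype(int)
            ys = np.rint(np.linspace(y0, y1, steps + 1)).astype(int)
            # Discard samples that fall outside the image bounds.
            inside = (xs >= 0) & (xs < width) & (ys >= 0) & (ys < height)
            xs, ys = xs[inside], ys[inside]
            value = np.uint8(round(255 * (i + 1) / n_segments))
            # Keep the brightest value where trajectories overlap.
            image[ys, xs] = np.maximum(image[ys, xs], value)
    return image


# Hypothetical usage: two short trajectories on a 64x64 canvas.
example = stack_trajectories(
    [[(5, 5), (20, 10), (40, 30)], [(60, 60), (50, 40)]], 64, 64
)
```

Under these assumptions, the returned array can be fed to an image CNN in the same way as any single-channel frame.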