Human emotions are commonly modeled as unfolding in four temporal phases: neutral, onset, apex, and offset, and this structure has proven beneficial for emotion recognition. Temporal segmentation has therefore attracted considerable research interest. Although state-of-the-art techniques use recurrent neural networks to substantially improve performance, they ignore the varying relevance of individual frames (time steps) in a video and do not account for the changing contributions of different features during fusion. We propose a framework called dual-level attention-aware bidirectional gated recurrent unit, which integrates ideas from attention models to identify the most informative frames and features for temporal segmentation. Specifically, it applies attention mechanisms at two levels: frame and feature. A significant advantage is that the two-level attention weights offer interpretable values that quantify the importance of each frame and feature. Experiments demonstrate that the proposed framework outperforms state-of-the-art methods.
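The two-level idea described above can be illustrated with a minimal NumPy sketch of feature-wise and frame-wise soft attention applied to recurrent hidden states. The shapes, scoring functions, and variable names below are illustrative assumptions for a generic formulation, not the paper's exact architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, D = 6, 4                        # toy sizes: T frames, D feature dims
h = rng.normal(size=(T, D))        # stand-in for BiGRU hidden states

# Feature-level attention: weight each feature dimension within a frame
w_feat = rng.normal(size=(D,))     # hypothetical learned scoring vector
alpha_feat = softmax(h * w_feat, axis=1)   # (T, D), rows sum to 1
h_weighted = alpha_feat * h                # features reweighted per frame

# Frame-level attention: weight each frame (time step) of the sequence
w_frame = rng.normal(size=(D,))    # hypothetical learned scoring vector
alpha_frame = softmax(h_weighted @ w_frame)  # (T,), sums to 1 over frames
context = alpha_frame @ h_weighted           # (D,) attended sequence summary
```

Because both `alpha_feat` and `alpha_frame` are softmax-normalized, they can be read directly as the per-feature and per-frame importance values the abstract refers to.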