18 April 2018 Deep hierarchical attention network for video description
Pairing video with a natural language description remains a challenge in computer vision and machine translation. Inspired by image description, which uses an encoder–decoder model to reduce a visual scene to a single sentence, we propose a deep hierarchical attention network for video description. The proposed model uses a convolutional neural network (CNN) and a bidirectional LSTM network as encoders, and a hierarchical attention network as the decoder. Compared with the encoder–decoder models used in video description, the bidirectional LSTM network can capture the temporal structure among video frames. Moreover, the hierarchical attention network has an advantage over a single-layer attention network in modeling global context. To make a fair comparison with other methods, we evaluate the proposed architecture with different types of CNN structures and decoders. Experimental results on the standard datasets show that our model outperforms state-of-the-art techniques.
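The contrast drawn above between single-layer and hierarchical attention can be illustrated with a minimal sketch. It is not the authors' implementation: the dot-product scoring, the fixed-length segmentation of frames, and the names `attend`, `hierarchical_attend`, and `segment_len` are all assumptions made for illustration, and the frame features are toy vectors standing in for encoder outputs.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, features):
    """Single-layer soft attention (assumed dot-product scoring).

    Returns the attention weights and the weighted sum of the features.
    """
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    context = [sum(w * feat[d] for w, feat in zip(weights, features))
               for d in range(dim)]
    return weights, context

def hierarchical_attend(query, frame_features, segment_len):
    """Two-level attention (hypothetical layout, not taken from the paper).

    Lower level: attend over the frames inside each fixed-length segment.
    Upper level: attend over the resulting segment summaries.
    """
    segments = [frame_features[i:i + segment_len]
                for i in range(0, len(frame_features), segment_len)]
    segment_contexts = [attend(query, seg)[1] for seg in segments]
    segment_weights, video_context = attend(query, segment_contexts)
    return segment_weights, video_context

# Toy per-frame features standing in for encoder outputs (assumed values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
query = [1.0, 0.0]  # stands in for the decoder state at one time step

flat_weights, flat_context = attend(query, frames)
segment_weights, video_context = hierarchical_attend(query, frames, segment_len=2)
```

In this sketch the single-layer version weighs every frame directly, while the two-level version first summarizes each segment and then weighs the summaries, which is one simple way a decoder could form a broader view of the video before emitting a word.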
© 2018 SPIE and IS&T
Shuohao Li, Min Tang, and Jun Zhang, "Deep hierarchical attention network for video description," Journal of Electronic Imaging 27(2), 023027 (18 April 2018). https://doi.org/10.1117/1.JEI.27.2.023027
Received: 3 January 2018; Accepted: 30 March 2018; Published: 18 April 2018
