14 August 2019 Video question answering by frame attention
Proceedings Volume 11179, Eleventh International Conference on Digital Image Processing (ICDIP 2019); 111793B (2019) https://doi.org/10.1117/12.2539615
Event: Eleventh International Conference on Digital Image Processing (ICDIP 2019), 2019, Guangzhou, China
Abstract
In recent years, Visual Question Answering (VisualQA) has become a research hotspot in video understanding, but most work focuses on Image Question Answering (ImageQA), and far less addresses Video Question Answering (VideoQA). Inspired by ImageQA models, we propose a model that uses videos and questions to generate answers. We also redesign and simplify the Joint Sequence Fusion (JSFusion) model to build our soft-attention mechanism, called Frame Attention, which refines its attention on frame objects with the help of the question. Frame Attention first fuses the multi-modal features by the Hadamard product and then generates attention probabilities by encoding. In addition, we propose a new training strategy for the ZJL dataset that takes full advantage of all the answers to each question during training. Experiments show the advantages of our model, which achieves an accuracy of 0.509.
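The two steps named in the abstract (Hadamard-product fusion, then attention probabilities from an encoding) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the encoder, so the `tanh` nonlinearity, the single scoring vector `w`, the feature dimensions, and the function name `frame_attention` are all assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def frame_attention(frames, question, w):
    """Hypothetical frame-level soft attention.

    frames:   (T, d) array, one feature vector per video frame
    question: (d,) array, question embedding in the same space
    w:        (d,) array, assumed scoring vector standing in for the encoder
    """
    # Step 1: fuse the two modalities with the Hadamard (element-wise) product.
    fused = frames * question              # broadcasts to shape (T, d)
    # Step 2: "encode" the fused features into one score per frame.
    # The tanh + linear scorer is an assumption, not taken from the paper.
    scores = np.tanh(fused) @ w            # shape (T,)
    # Normalize the scores into an attention probability over frames.
    probs = softmax(scores)                # shape (T,)
    # Question-guided video summary: attention-weighted sum of frame features.
    attended = probs @ frames              # shape (d,)
    return probs, attended


# Toy usage with random features (8 frames, 16-dimensional features).
rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 16))
question = rng.normal(size=16)
w = rng.normal(size=16)
probs, attended = frame_attention(frames, question, w)
```

Here `probs` sums to 1 across the 8 frames, and `attended` is a single 16-dimensional vector that a downstream answer classifier could consume.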
© (2019) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Jiannan Fang, Lingling Sun, and Yaqi Wang "Video question answering by frame attention", Proc. SPIE 11179, Eleventh International Conference on Digital Image Processing (ICDIP 2019), 111793B (14 August 2019); https://doi.org/10.1117/12.2539615
PROCEEDINGS
6 PAGES

