In recent years, Visual Question Answering (VisualQA) has become a research hotspot in video understanding. Most work, however, focuses on Image Question Answering (ImageQA), and far less attention has been paid to Video Question Answering (VideoQA). Inspired by ImageQA models, we propose a model that takes a video and a question as input and generates an answer. We also redesign and simplify the Joint Sequence Fusion (JSFusion) model to build a soft-attention mechanism called Frame Attention, which refines its attention on frame objects with the help of the question. Frame Attention first fuses the multi-modal features with the Hadamard product and then encodes the fused features to generate attention probabilities. In addition, we propose a new training strategy for the ZJL dataset that takes full advantage of all the answers to the questions during training. Experiments demonstrate the advantages of our model, which achieves an accuracy of 0.509.
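The two steps attributed to Frame Attention (Hadamard-product fusion, then encoding into attention probabilities) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that each frame and the question are already embedded as vectors of the same dimension, that the "encoding" is a single linear projection `w` to one score per frame, and that the scores are normalized with a softmax over frames. The names `frame_attention`, `frames`, `question`, and `w` are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def frame_attention(frames, question, w):
    """Illustrative soft attention over video frames.

    frames:   (T, d) array, one feature vector per frame (assumed shape)
    question: (d,) array, question embedding (assumed shape)
    w:        (d,) array, hypothetical linear scoring weights
    """
    # Step 1: fuse the two modalities with the Hadamard (element-wise) product.
    fused = frames * question          # (T, d), question broadcast over frames

    # Step 2: encode each fused vector to a scalar score (assumed linear layer).
    scores = fused @ w                 # (T,)

    # Normalize the scores into attention probabilities over frames.
    probs = softmax(scores)            # (T,), sums to 1

    # Attention-weighted summary of the frame features.
    attended = probs @ frames          # (d,)
    return probs, attended


# Toy inputs: 8 frames with 16-dimensional features.
rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 16))
question = rng.normal(size=16)
w = rng.normal(size=16)

probs, attended = frame_attention(frames, question, w)
```

In this sketch `probs` holds one weight per frame and `attended` is the resulting question-conditioned video feature; a learned model would train `w` (or a deeper encoder) jointly with the rest of the network.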