In recent years, Visual Question Answering (VisualQA) has gradually become one of the research hotspots in visual understanding, but most research focuses on Image Question Answering (ImageQA), while comparatively little attention has been paid to Video Question Answering (VideoQA). Inspired by ImageQA models, we propose a model that takes videos and questions as input and generates answers. We also redesign and simplify the Joint Sequence Fusion (JSFusion) model into a soft-attention mechanism called Frame Attention, which refines its attention over frame objects with the guidance of the question. Frame Attention first fuses the multi-modal features with the Hadamard product and then encodes the fused representation to produce attention probabilities. In addition, we propose a new training strategy for the ZJL dataset that takes full advantage of all the answers to each question during training. Experiments demonstrate the advantages of our model, which achieves an accuracy of 0.509.
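A minimal sketch of the Frame Attention idea described above, written in PyTorch. The module names, dimensions, and projection layers are illustrative assumptions rather than the authors' exact architecture: per-frame features and a question encoding are fused with a Hadamard (element-wise) product, the fused representation is encoded into scalar scores, and a softmax over frames yields the attention probabilities.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrameAttention(nn.Module):
    """Question-guided soft attention over video frames (illustrative sketch)."""

    def __init__(self, frame_dim, question_dim, hidden_dim=512):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, hidden_dim)
        self.question_proj = nn.Linear(question_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)  # encodes the fused features into a scalar score

    def forward(self, frame_feats, question_feat):
        # frame_feats: (batch, num_frames, frame_dim); question_feat: (batch, question_dim)
        f = torch.tanh(self.frame_proj(frame_feats))                      # (B, T, H)
        q = torch.tanh(self.question_proj(question_feat)).unsqueeze(1)    # (B, 1, H)
        fused = f * q                                                     # Hadamard-product fusion
        logits = self.score(fused).squeeze(-1)                            # (B, T)
        attn = F.softmax(logits, dim=-1)                                  # attention probability over frames
        attended = torch.bmm(attn.unsqueeze(1), frame_feats).squeeze(1)   # question-conditioned video feature
        return attended, attn
```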
This paper presents the method behind our submission to the ISBI 2019 Challenge on classifying normal versus malignant cells in B-ALL white blood cancer microscopic images. We aimed to combine convolutional neural networks with several state-of-the-art techniques. Specifically, we fine-tuned pretrained deep networks, including ResNet and DenseNet, for this task. Overfitting is one of the major problems in this challenge. We mitigate it with gradient norm clipping and a cosine annealing learning rate schedule with restarts, both of which have a significant impact on the performance of our deep neural networks. More importantly, an adaptive pooling layer is used in our models, allowing them to handle images of any size. An ensemble of deep models achieved a weighted-F1 score of 0.8570 on the preliminary test set, as reported by the test server.
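The following PyTorch sketch illustrates the techniques listed above, not the exact submission code: a pretrained ResNet whose pooling layer is replaced by adaptive average pooling (so inputs of any size are accepted), gradient norm clipping, and a cosine annealing schedule with warm restarts. The choice of ResNet-50, the two-class head, and all hyperparameters are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

# Fine-tune a pretrained backbone; adaptive pooling lets the model accept arbitrary image sizes.
model = models.resnet50(pretrained=True)
model.avgpool = nn.AdaptiveAvgPool2d(1)            # adaptive pooling layer
model.fc = nn.Linear(model.fc.in_features, 2)      # normal vs. malignant

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10)
criterion = nn.CrossEntropyLoss()


def train_one_epoch(loader):
    model.train()
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient norm clipping
        optimizer.step()
    scheduler.step()  # cosine annealing with warm restarts, stepped once per epoch
```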