Automatically generating the description of an image is a task that connects computer vision and natural language processing, and it has gained increasing attention in the field of artificial intelligence. In this paper, we present a model that generates descriptions for images based on an RNN (recurrent neural network), representing images with multiple features weighted by object attention. We use an LSTM (long short-term memory) network, an RNN variant, to translate the multiple image features into text. Most existing methods use a single CNN (convolutional neural network) trained on ImageNet to extract image features, which mainly focus on the objects in images. However, the context of the scene is also informative for image captioning, so we incorporate a scene feature extracted with a CNN trained on Places205. We evaluate our model on the MSCOCO dataset using standard metrics. Experiments show that the multi-feature representation performs better than a single feature. In addition, the saliency weighting on images promotes the salient objects in an image to the subject of its description. The results show that our model performs better than several state-of-the-art methods on image captioning.
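As an illustration of the multi-feature idea described above, the following minimal PyTorch sketch concatenates an object feature (from an ImageNet-trained CNN) with a scene feature (from a Places205-trained CNN) and uses the fused embedding to initialize an LSTM caption decoder. All module names and dimensions here are illustrative assumptions, not the paper's exact design.

import torch
import torch.nn as nn

class MultiFeatureCaptioner(nn.Module):
    def __init__(self, obj_dim=4096, scene_dim=4096, embed_dim=512,
                 hidden_dim=512, vocab_size=10000):
        super().__init__()
        # Fuse object and scene features into a single image embedding.
        self.fuse = nn.Linear(obj_dim + scene_dim, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, obj_feat, scene_feat, captions):
        # obj_feat:   (B, obj_dim) from a CNN trained on ImageNet
        # scene_feat: (B, scene_dim) from a CNN trained on Places205
        img = self.fuse(torch.cat([obj_feat, scene_feat], dim=1))  # (B, E)
        # Prepend the fused image embedding as the first "word" of the sequence.
        words = self.embed(captions)                               # (B, T, E)
        seq = torch.cat([img.unsqueeze(1), words], dim=1)          # (B, T+1, E)
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                                    # word logits

# Toy usage with random tensors standing in for real CNN activations.
model = MultiFeatureCaptioner()
logits = model(torch.randn(2, 4096), torch.randn(2, 4096),
               torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 13, 10000])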
In this work, we propose a novel encoding-decoding image captioning framework that improves performance by jointly exploiting visual object-context features together with generic and testing-specific semantic priors. In the RNN encoding stage, we first extract semantic attributes as well as object-related and scene-related image features, and then feed them sequentially to the RNN encoder, which yields a rich representation combining general semantics with visual object-context information. To incorporate testing-specific semantic priors into the RNN decoding stage, we apply cross-modal retrieval to find the captions most similar to the test image in the visual-semantic embedding space of VSE++. BLEU-4 is then used to measure the similarity between each generated sentence and the retrieved captions, which injects sentence-making priors from these testing-specific reference captions. Evaluation on the benchmark Microsoft COCO dataset shows the superiority of our algorithm over state-of-the-art approaches on standard evaluation metrics.
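The reranking step above can be sketched concretely: candidate captions from the decoder are scored by BLEU-4 against captions retrieved for the test image (e.g., nearest neighbors in a VSE++ embedding space), and the best-scoring candidate is kept. The retrieval itself is assumed to happen elsewhere; the function below and its names are hypothetical, written with NLTK's standard BLEU implementation.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def rerank_by_bleu4(candidates, retrieved_captions):
    """Pick the candidate most similar (BLEU-4) to the retrieved captions."""
    smooth = SmoothingFunction().method1  # avoid zero scores on short sentences
    references = [c.lower().split() for c in retrieved_captions]
    def score(candidate):
        return sentence_bleu(references, candidate.lower().split(),
                             weights=(0.25, 0.25, 0.25, 0.25),
                             smoothing_function=smooth)
    return max(candidates, key=score)

# Toy usage: the retrieved captions act as testing-specific references.
candidates = ["a man riding a horse", "a person on an animal outdoors"]
retrieved = ["a man rides a brown horse", "a man riding a horse in a field"]
print(rerank_by_bleu4(candidates, retrieved))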
Generating a description for an image can be regarded as visual understanding, a task that spans artificial intelligence, machine learning, natural language processing, and many other areas. In this paper, we present a model that generates descriptions for images based on an RNN (recurrent neural network) with object attention and multiple image features. Deep recurrent neural networks have shown excellent performance in machine translation, so we use them to generate natural-sentence descriptions for images. The common approach uses a single CNN (convolutional neural network) trained on ImageNet to extract image features, but we argue that this cannot adequately capture the content of an image, since it may focus only on the object regions. We therefore add scene information to the image feature using a CNN trained on Places205. Experiments show that the model with multiple features extracted by the two CNNs performs better than the one with a single feature. In addition, we apply saliency weights to images to emphasize the salient objects. We evaluate our model on MSCOCO using public metrics, and the results show that it performs better than several state-of-the-art methods.
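The saliency weighting mentioned above can be sketched as follows: a saliency map re-weights a CNN feature map so that salient object regions dominate the pooled image feature. How the saliency map itself is produced is not specified in the abstract; this PyTorch snippet is a hedged illustration under that assumption.

import torch
import torch.nn.functional as F

def saliency_weighted_pool(feature_map, saliency):
    # feature_map: (B, C, H, W) activations from a convolutional layer
    # saliency:    (B, 1, h, w) saliency map, at any spatial resolution
    sal = F.interpolate(saliency, size=feature_map.shape[2:],
                        mode="bilinear", align_corners=False)
    sal = sal / (sal.sum(dim=(2, 3), keepdim=True) + 1e-8)  # normalize weights
    # Weighted average pooling: salient locations contribute more.
    return (feature_map * sal).sum(dim=(2, 3))              # (B, C)

# Toy usage with random tensors standing in for real activations and saliency.
pooled = saliency_weighted_pool(torch.randn(2, 512, 14, 14),
                                torch.rand(2, 1, 28, 28))
print(pooled.shape)  # torch.Size([2, 512])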
Visual tracking is a challenging problem, especially when using a single model. In this paper, we propose a discriminative correlation filter (DCF) based tracking approach, named LSTDCF, that exploits both long-term and short-term information about the target to improve tracking performance. In addition to a long-term filter learned over the whole sequence, a short-term filter is trained using only features extracted from the most recent frames. The long-term filter tends to capture more of the target's semantics because more frames are used for training. However, since the target may undergo large appearance changes, features extracted around the target in non-recent frames prevent the long-term filter from locating the target accurately in the current frame. In contrast, the short-term filter learns more spatial details of the target from recent frames but overfits easily, making it less robust to cluttered backgrounds and prone to drift. We take advantage of both filters and fuse their response maps to make the final estimation. We evaluate our approach on a widely used benchmark of 100 image sequences and achieve state-of-the-art results.
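A minimal NumPy sketch of the fusion step reads as follows: the response maps from the long-term and short-term filters are combined, and the peak of the fused map gives the estimated target location. The fusion weight and the normalization scheme are assumptions of this sketch; the abstract does not fix them.

import numpy as np

def fuse_responses(long_term_resp, short_term_resp, alpha=0.5):
    """Weighted fusion of two correlation-filter response maps."""
    # Normalize each map so neither filter dominates purely by scale.
    def norm(r):
        r = r - r.min()
        return r / (r.max() + 1e-8)
    fused = alpha * norm(long_term_resp) + (1 - alpha) * norm(short_term_resp)
    # The peak of the fused response is the estimated target position.
    return np.unravel_index(np.argmax(fused), fused.shape), fused

# Toy usage with random maps standing in for real filter responses.
(row, col), fused = fuse_responses(np.random.rand(50, 50),
                                   np.random.rand(50, 50))
print(row, col)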