In this work, we propose a novel encoder-decoder based image captioning framework that improves performance by jointly exploiting visual object-context features together with generic and specific semantic priors. In the RNN encoding stage, we first extract semantic attributes and object- and scene-related image features, and then feed them sequentially into the RNN encoder, which thus captures both the rich generic semantics and the visual object-context representation of the image. To incorporate test-specific semantic priors into the RNN decoding stage, we apply cross-modal retrieval to find the captions most similar to the test image in the visual-semantic embedding space of VSE++. BLEU-4 similarity is then used to evaluate the agreement between the generated sentence and the retrieved captions, which injects sentence-making priors drawn from the test-specific reference captions. Evaluation on the Microsoft COCO benchmark shows that our algorithm outperforms state-of-the-art approaches on standard evaluation metrics.
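As a minimal sketch of the sequential feeding step, the following PyTorch snippet stacks the attribute, object, and scene features as a three-step input sequence for an RNN encoder; the class name, feature dimensions, and the choice of an LSTM cell are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

class SequentialFeatureEncoder(nn.Module):
    """Feed attribute, object-related, and scene-related features to the
    RNN encoder one step at a time (illustrative; dimensions assumed)."""
    def __init__(self, feat_dim=512, hidden_dim=512):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def forward(self, attr_feat, obj_feat, scene_feat):
        # Stack the three (batch, feat_dim) vectors into a length-3 sequence:
        # semantic attributes first, then object-, then scene-related features.
        seq = torch.stack([attr_feat, obj_feat, scene_feat], dim=1)
        _, (h, c) = self.rnn(seq)
        return h, c  # final state can initialize the caption decoder
```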
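The cross-modal retrieval step can be sketched as a nearest-neighbor search over L2-normalized embeddings, assuming a trained VSE++ model has already produced the image and caption embeddings; the function names and top-k interface here are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Project embeddings onto the unit sphere, as VSE++ does before scoring."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def retrieve_similar_captions(image_emb, caption_embs, captions, k=5):
    """Return the k captions whose embeddings are closest (by cosine
    similarity) to the test image embedding in the joint space."""
    img = l2_normalize(image_emb.reshape(1, -1))
    caps = l2_normalize(caption_embs)
    scores = caps @ img.T                      # cosine similarity after normalization
    top_k = np.argsort(-scores.ravel())[:k]    # indices of the k best matches
    return [captions[i] for i in top_k]
```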
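The BLEU-4 similarity between a generated sentence and the retrieved captions can be computed with NLTK's `sentence_bleu`, treating the retrieved captions as reference sentences; the smoothing choice below is our assumption to avoid zero scores on short hypotheses.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu4_similarity(candidate, retrieved_captions):
    """Score a generated sentence against the retrieved captions,
    treating them as BLEU-4 references."""
    references = [c.lower().split() for c in retrieved_captions]
    hypothesis = candidate.lower().split()
    # Equal weights over 1- to 4-grams gives the standard BLEU-4 score.
    return sentence_bleu(references, hypothesis,
                         weights=(0.25, 0.25, 0.25, 0.25),
                         smoothing_function=SmoothingFunction().method1)
```

In practice, such a score could be used to re-rank candidate sentences (e.g., beam-search hypotheses) during decoding, so that the final output agrees with the test-specific reference captions.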