High discriminative feature representation is key in remote sensing scene classification. Existing mid-level feature methods for solving the classification show poor performance. The reason includes two aspects. First, the discrimination power of the feature generated by the feature coding method is limited. Second, semantic information hidden in the scene images are not utilized. These essentially prevent them from achieving better performance. To solve these issues, we propose a hierarchical feature coding model with two stacked feature encoding layers. Specifically, in the first coding layer, semantic information from convolutional layers of deep models and complementary structure and spectral features are extracted and encoded into bag of visual word (BOVW) histogram features. Then in the second layer, Dirichlet-based Gaussians mixture model Fisher kernel is adopted to transform the BOVW histogram features to the more discriminative and effective feature vectors. Thus, through feeding the output of the first layer into the second layer, the complex feature representation is refined. Finally, the concatenated feature vectors are put into support vector machine classifier for classification. Experiments on two public high-resolution remote sensing scene datasets demonstrate that the performance of our hierarchical coding method is comparable to the previous state-of-the-art methods, including most multifeature fusion methods and convolutional neural network-based methods.