Image classification is crucial in the interpretation of remote sensing images with high spatial resolution (HSR).1 The availability of HSR remote sensing imagery obtained from satellites (e.g., WorldView-2, IKONOS, QuickBird, ZY-3C, GF-1, and GF-2) increases the possibility of accurate Earth observations. Such HSR imagery provides highly valuable geometric and detailed information, which is important for various applications, such as precision agriculture, security applications, and damage assessment for environmental disasters and land use.2 In these applications, mapping a high-resolution image for land use and land cover (LULC) is particularly relevant.
In terms of LULC classification using remote sensing images, medium-resolution Landsat series satellite imagery is important in regional LULC and land use/cover change studies.3–7 For processing high-resolution remote sensing images, numerous classification algorithms have been developed, such as the object-oriented approach,8–10 support vector machine (SVM)-based classification,11–13 and Markov random fields (MRF).14–18
Local features19–23 have been successfully applied to image retrieval, semantic segmentation, and scene understanding. These features gained popularity in the remote sensing community because of their robustness to rotation, scale changes, and occlusion. Sparse coding is one of the most effective approaches to group local features and performs well in object categorization, scene-level land use classification, etc.24–36 The sparse coding method combined with max-pooling and spatial pyramid matching (SPM) can be used to learn midlevel features. In this approach, a class type is represented by the distribution of a set of visual words, which are usually obtained by unsupervised k-means clustering of a set of low-level feature descriptors. However, because the visual words are learned in an unsupervised manner, the resulting midlevel features are less discriminative, which reduces the classification accuracy. Moreover, several conventional low-level features, such as spectral features, are neglected in the building of midlevel features. Some studies have addressed this drawback and effectively incorporated spectral and local features.33,34 Hu et al.37 developed a method that combines convolutional neural networks (CNN) and sparse coding to learn discriminative features for scene-level land use classification and obtained impressive results, with a total accuracy of about 96%. However, this method is limited by the lack of information on the LULC class type, because the parameters of the CNN model are estimated from the ImageNet dataset.38
In addition to feature learning, the selection of a classifier is particularly important for LULC classification based on high-resolution remote sensing images. Many classification methods, such as maximum likelihood, MRF, and SVM models, have been developed. The SVM classifier is widely used for various computer vision tasks and LULC classification because it performs well in high-dimensional feature spaces. MRF39 and conditional random field (CRF)40 are structured output models that consider interactions of random variables. These approaches have been successfully developed in the remote sensing14–17,41 and computer vision communities.42–49 Moser et al.14 proposed an LULC classification for high-resolution remote sensing images based on the MRF model. However, the results of this model always exhibit an oversmoothed appearance.9,48 Another drawback of the MRF is its difficulty in handling high-dimensional feature spaces. The CRF model overcomes these drawbacks and shows advantages in image classification and semantic segmentation.
Thus, we establish an LULC classification framework for HSR remote sensing images by exploiting labeled data based on midlevel feature learning and the SVM classifier to achieve multifeature soft-probability feature descriptors, and we employ a CRF classification method to jointly model the unary and pairwise costs.
In this paper, a multifeature soft-probability cascading and CRF (MFSC-CRF) classification model is designed to learn discriminative midlevel features in a supervised manner. First, we extract the spectral, gray-level co-occurrence matrix (GLCM), and dense scale-invariant feature transform (DSIFT) features as low-level feature descriptors. Three types of midlevel feature descriptors are obtained by adopting sparse coding, superpixel segmentation, and max-pooling methods. Then, the probabilities of belonging to each LULC class are calculated for each type of midlevel feature using labeled samples. The three probability values are cascaded to construct the feature descriptors for each superpixel. Finally, the CRF model is introduced to generate the LULC classification.
The supervised learned feature descriptors can be obtained using the SVM classifier with training samples. This classifier has been demonstrated to effectively incorporate low-level features. Using the CRF classifier, the local spatial relationship between the neighboring superpixels is considered by combining the learned feature descriptor. Thus, the proposed method achieves better classification results than traditional methods.
The rest of this paper is structured as follows. In Sec. 2, the proposed method for midlevel feature learning and soft-probability cascading and CRF classification is presented. In Sec. 3, the experiments on the rural residential area dataset of Wuhan are discussed. Conclusions are drawn in Sec. 4.
MFSC-CRF Classification Framework
An HSR remote sensing image classification framework for LULC classification is proposed. This method is based on midlevel feature learning and integrates sparse coding with the CRF method to utilize spectral, structural, and spatial contextual information. Three kinds of typical features, namely, GLCM, DSIFT, and spectral features, are selected to construct the low-level features. The whole pipeline of the MFSC-CRF classification framework consists of two main steps, namely, feature learning and CRF classification (Fig. 1).
During the feature learning step, midlevel feature descriptors are obtained from the three features by combining sparse coding, SPM, and max-pooling. The class probabilities are then calculated by the SVM classifier using the training samples. The resulting probability values form the new discriminative feature descriptors.
During CRF classification, the CRF model is introduced to classify the superpixels according to the land cover class types. The probability feature descriptor from the first step is considered in this step, and an SVM classifier is adopted to construct the unary potentials. The pairwise potentials can be acquired by calculating the distance between neighboring superpixels. The graph-cut-based α-expansion algorithm is executed to obtain the classification result of the CRF models.
Midlevel Feature Descriptors
As discussed above, three typical features are adopted for the low-level feature descriptors, and the details are described as follows.
1. Spectral features: Features on the Earth reflect, absorb, transmit, and emit electromagnetic energy from the sun. A measurement of energy commonly used in remote sensing of the Earth is reflected energy (e.g., visible light, near-infrared, etc.) coming from land and water surfaces. The amount of energy reflected from these surfaces is usually expressed as a percentage of the amount of energy striking the objects. The band values of remote sensing images are used as the spectral features in this article.
2. GLCM: GLCM is a texture measure used in many image analyses. In this article, GLCM features are extracted with the ENVI software. Eight features are obtained, including mean, variance, homogeneity, contrast, dissimilarity, and entropy. They are normalized to form feature vectors.
3. DSIFT: DSIFT descriptors are computed at points on a regular grid. At each grid point, the descriptors are computed over four circular support patches with different radii, and, consequently, each point is represented by four SIFT descriptors. Multiple descriptors are computed to allow for scale variation between images.50
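To make the GLCM item in the list above more concrete, the following is a minimal NumPy sketch of a co-occurrence matrix and six of the texture measures named there. It is not the ENVI implementation used in this article: the function names `glcm` and `glcm_features`, the number of gray levels, the single pixel offset, and the choice to compute one matrix for a whole patch (rather than in a moving window) are all illustrative assumptions.

```python
import numpy as np

def glcm(gray, levels=8, dx=1, dy=0):
    """Symmetric, normalized co-occurrence matrix for one non-negative offset.
    `levels`, `dx`, and `dy` are illustrative choices, not values from the paper."""
    gray = np.asarray(gray, dtype=float)
    span = gray.max() - gray.min()
    if span == 0:
        q = np.zeros(gray.shape, dtype=int)
    else:
        # Quantize intensities into bins 0 .. levels-1.
        q = np.minimum(((gray - gray.min()) / span * levels).astype(int), levels - 1)
    h, w = q.shape
    a = q[:h - dy, :w - dx].ravel()   # reference pixels
    b = q[dy:, dx:].ravel()           # neighbors at offset (dy, dx)
    m = np.zeros((levels, levels))
    np.add.at(m, (a, b), 1.0)         # accumulate pair counts
    m = m + m.T                       # count every pair in both directions
    return m / m.sum()

def glcm_features(p):
    """Common texture measures of a normalized co-occurrence matrix p."""
    i, j = np.indices(p.shape)
    mean = (i * p).sum()
    nz = p[p > 0]                     # skip zero entries in the entropy term
    return {
        "mean": mean,
        "variance": (((i - mean) ** 2) * p).sum(),
        "homogeneity": (p / (1.0 + (i - j) ** 2)).sum(),
        "contrast": (((i - j) ** 2) * p).sum(),
        "dissimilarity": (np.abs(i - j) * p).sum(),
        "entropy": float(-(nz * np.log(nz)).sum()),
    }
```

For a patch of constant intensity, all co-occurrences fall on the diagonal, so contrast and entropy are zero and homogeneity is one; strongly alternating textures push contrast and dissimilarity up.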
The low-level feature descriptors are extracted from images, and each feature descriptor has size . The visual dictionary of visual words obtained by the unsupervised k-means clustering algorithm can be defined as follows:23. If is a descriptor vector, its coding vector corresponding to dictionary is given as follows:
Given a dictionary and a set of segmented superpixel regions over an image, we represent the image by spatial max-pooling. For each superpixel region, of image , where represents the number of superpixels extracted from the image, let be a descriptor vector extracted from region , where indexes the image pixels extracted from region . Thus, given a dictionary , region can be encoded using max spatial pooling, as follows:
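The encoding and pooling steps described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the sparse codes are computed with a plain iterative soft-thresholding (ISTA) solver for an l1-regularized least-squares objective, pooling takes the maximum absolute code value within each superpixel, and the names `sparse_codes`, `max_pool`, `lam`, and `segments` are hypothetical.

```python
import numpy as np

def sparse_codes(X, D, lam=0.1, n_iter=200):
    """ISTA for min_a 0.5*||x - D a||^2 + lam*||a||_1, one code per row of X.
    X: (n, d) descriptors; D: (d, K) dictionary with visual words as columns.
    The objective and solver are assumptions made for this sketch."""
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)   # 1 / Lipschitz constant
    A = np.zeros((X.shape[0], D.shape[1]))
    for _ in range(n_iter):
        grad = (A @ D.T - X) @ D               # gradient of the quadratic term
        Z = A - step * grad
        # Soft-thresholding is the proximal step for the l1 penalty.
        A = np.sign(Z) * np.maximum(np.abs(Z) - step * lam, 0.0)
    return A

def max_pool(A, segments, n_segments):
    """Max-pool absolute code values over the descriptors of each superpixel.
    segments[i] is the superpixel index of descriptor i."""
    pooled = np.zeros((n_segments, A.shape[1]))
    np.maximum.at(pooled, segments, np.abs(A))
    return pooled
```

With `X` holding the descriptors of one feature type and `segments` holding the superpixel index of each descriptor, `max_pool(sparse_codes(X, D), segments, n_segments)` yields one midlevel vector per superpixel, whose length equals the dictionary size.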
Probability Feature Descriptors
Let be the midlevel feature vector of an image. This feature represents a vector in a -dimensional space with a dictionary . If three different types of features (DSIFT, spectral band, and GLCM) are used in the sparse coding phase, then an image can be represented by three different corresponding vectors. That is, each image can be represented by the following vectors:
The probability vectors of the different midlevel feature descriptors can be represented as follows:
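As a rough illustration of the cascading step, the sketch below concatenates the per-class probability vectors obtained from the three midlevel feature types into one descriptor per superpixel. The article obtains these probabilities from SVM classifiers; here a softmax over classifier scores is used purely as a stand-in, and the names `softmax` and `cascade` are hypothetical.

```python
import numpy as np

def softmax(scores):
    """Turn per-class scores into probabilities (a stand-in for the
    SVM probability outputs used in the article)."""
    z = scores - scores.max(axis=1, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cascade(prob_blocks):
    """Concatenate (n_superpixels, n_classes) probability matrices, one per
    feature type, into a single (n_superpixels, 3 * n_classes) descriptor."""
    return np.hstack(prob_blocks)
```

With six LULC classes and three feature types (spectral, DSIFT, and GLCM), each superpixel is then described by an 18-dimensional vector, which is far more compact than the sparse-coded midlevel vectors it is derived from.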
CRF Classification Model
The CRF model for the final classification of high-resolution remote sensing images is proposed. The CRF is defined over a set of superpixels extracted from the image . Each superpixel is associated with a class label . The labeling of the image is denoted by the vector . The interaction among various superpixels of the CRF is captured by the set of edges , where each edge corresponds to a pair of superpixels that share a boundary.
The CRF energy, which consists of unary and pairwise costs, can be formulated as follows:
The unary potential, which is expressed as in Eq. (8), models the cost of assigning a class label to superpixel in image . This potential is defined as the score of a kernel SVM classifier for class applied to an MFSC feature vector of superpixel described in Eq. (7). The classifier for class is trained using the MFSC feature vector extracted from the superpixels in the training set. This vector is labeled as . The radial basis function (RBF)- kernel is adopted for SVM classification.
The pairwise potential, , models the cost of assigning labels and to the neighboring superpixels and , respectively. When a CRF formulation is used for classification, the pairwise potentials are usually used to ensure the smoothness of the label assignments. A contrast sensitive cost is used as follows:49. The classification result of the CRF models could be achieved by solving Eq. (8).
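The structure of such an energy, a sum of unary costs plus contrast-sensitive pairwise costs over neighboring superpixels, can be sketched as below. Several assumptions are made: the pairwise term is a Potts penalty weighted by `lam * exp(-beta * squared feature distance)`, which is one common contrast-sensitive form and not necessarily the exact one used here, and the minimizer is iterated conditional modes (ICM), a simple local-search substitute for the graph-cut α-expansion algorithm used in the article. All function and parameter names are hypothetical.

```python
import numpy as np

def contrast_weights(feats, edges, lam=1.0, beta=1.0):
    """Edge weights that shrink as neighboring superpixels look more different."""
    d2 = ((feats[edges[:, 0]] - feats[edges[:, 1]]) ** 2).sum(axis=1)
    return lam * np.exp(-beta * d2)

def crf_energy(labels, unary, edges, weights):
    """Unary cost of the chosen labels plus a weighted Potts cost on edges."""
    u = unary[np.arange(len(labels)), labels].sum()
    disagree = labels[edges[:, 0]] != labels[edges[:, 1]]
    return u + (weights * disagree).sum()

def icm(unary, edges, weights, n_sweeps=10):
    """Iterated conditional modes: greedy per-node updates (NOT alpha-expansion)."""
    n, c = unary.shape
    labels = unary.argmin(axis=1)
    nbrs = [[] for _ in range(n)]
    for (i, j), w in zip(edges, weights):
        nbrs[i].append((j, w))
        nbrs[j].append((i, w))
    classes = np.arange(c)
    for _ in range(n_sweeps):
        changed = False
        for s in range(n):
            cost = unary[s].copy()
            for t, w in nbrs[s]:
                cost += w * (classes != labels[t])   # penalize disagreement
            best = int(cost.argmin())
            if best != labels[s]:
                labels[s] = best
                changed = True
        if not changed:
            break
    return labels
```

ICM only reaches a local minimum, so it generally gives a higher energy than α-expansion; it is used here only because it fits in a few lines and shows how the pairwise term smooths isolated labels.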
We conduct experiments using high-resolution aerial images to evaluate the effectiveness of the proposed MFSC-CRF framework for LULC classification. Following the work of Jain et al.,49 comparative experiments are conducted by combining feature descriptors and classification methods. We compare the different methods using single-object class accuracy and total accuracy. The low-level feature, midlevel feature, and classifier associated with SF-SVM, U-SVM, GLCM-SVM, MFSC-SVM, SF-CRF, U-CRF, GLCM-CRF, and MFSC-CRF are reported in Table 1. The details are described as follows.
1. SF-SVM: This method uses only the unary segmentation cost. Spectral features are considered low-level features in this technique. After midlevel feature learning, the SVM method is adopted to achieve classification results. This method is very similar to the simultaneous orthogonal matching pursuit method proposed by Chen et al.51
2. U-SVM: This method is similar to SF-SVM, but they differ in the selection of low-level features. As described in Ref. 26, the DSIFT feature is considered as the low-level feature, and the SVM classifier is used for superpixel level classification.
3. GLCM-SVM: The GLCM feature is considered as the low-level feature in this method, and the SVM classifier is used for superpixel level classification.
4. MFSC-SVM: Multifeature soft-probability is used for the feature vector in this method, and SVM is adopted for LULC classification.
5. SF-CRF: Spectral feature is considered as the low-level feature in this method, which is combined with sparse coding and CRF to achieve the classification results.
6. U-CRF: Sparse coding and the CRF model are used in this technique, and DSIFT is considered as the low-level feature, as described in Ref. 48.
7. GLCM-CRF: GLCM is considered as the low-level feature descriptor in this model, in which CRF is adopted for classification.
8. MFSC-CRF: Probabilities are considered as feature descriptors in this proposed method, in which CRF is adopted for supervised classification.
Information of different classification methods.
| Method | Low-level feature | Midlevel feature | Classifier |
| --- | --- | --- | --- |
| SF-SVM | Spectral features | Sparse coding and max-pooling [Eq. (4)] | SVM |
| U-SVM | DSIFT | Sparse coding and max-pooling [Eq. (4)] | SVM |
| GLCM-SVM | GLCM | Sparse coding and max-pooling [Eq. (4)] | SVM |
| MFSC-SVM | Spectral features, DSIFT, and GLCM | MFSC [Eq. (7)] | SVM |
| SF-CRF | Spectral features | Sparse coding and max-pooling [Eq. (4)] | CRF |
| U-CRF | DSIFT | Sparse coding and max-pooling [Eq. (4)] | CRF |
| GLCM-CRF | GLCM | Sparse coding and max-pooling [Eq. (4)] | CRF |
| MFSC-CRF | Spectral features, DSIFT, and GLCM | MFSC [Eq. (7)] | CRF |
The experimental results are evaluated using three kinds of accuracies, namely, the accuracy of each class, overall accuracy (OA), and kappa coefficient (Kappa). The accuracy of each class is the fraction of correctly classified pixels among all pixels of that ground-truth class, whereas OA is the fraction of correctly classified pixels among all evaluated pixels. For a fair comparison, the classification results with the highest OA are selected for all classification algorithms. The effect of the number of training samples is further investigated in relation to the MFSC-CRF model.
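The three measures can be computed from a confusion matrix as in the following sketch, which uses the standard definitions of per-class accuracy, OA, and Cohen's kappa. The function name `accuracy_report` and the row/column convention (rows are ground-truth classes, columns are predicted classes) are assumptions made for illustration.

```python
import numpy as np

def accuracy_report(conf):
    """conf[i, j] = number of samples of true class i predicted as class j."""
    conf = np.asarray(conf, dtype=float)
    n = conf.sum()
    per_class = np.diag(conf) / conf.sum(axis=1)   # correct / ground-truth total
    oa = np.trace(conf) / n                        # correct / all samples
    # Agreement expected by chance, from the row and column marginals.
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / n ** 2
    kappa = (oa - pe) / (1.0 - pe)
    return per_class, oa, kappa
```

For a two-class matrix `[[40, 10], [20, 30]]`, the per-class accuracies are 0.8 and 0.6, OA is 0.7, and the chance agreement is 0.5, giving a kappa of 0.4. Kappa is lower than OA because it discounts the agreement that would occur by chance.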
Experimental Data Description
Experimental datasets (testing site 1)
The first test image was captured over a rural residential area in Wuhan city, Hubei Province, China, by unmanned aerial vehicle photography and contains three spectral bands (red, green, and blue). The image is of , with spatial resolution of 0.2 m and three multispectral channels. An overview of this dataset is shown in Fig. 2(a). The corresponding ground truth is shown in Fig. 2(b). The testing image was segmented into 52,654 superpixels using the simple linear iterative clustering method. Six classes of interest, namely, low vegetation, homestead, farmland, waterbody, road, and woodland, are considered and listed in Table 2. Rural homestead is the main type of rural residential land and is more scattered. This class contains various houses, walls, and other facilities with spatial correlation and semantic structure characteristics. The other five class types are mainly land cover types. A total of 100 training samples for each LULC class type are taken from the reference ground-truth data, and the remaining samples are used to evaluate the accuracy. The results are shown in Table 2.
Class information of Wuhan rural residential area dataset of testing site 1.
|Class name||Training samples||Testing samples|
Experimental datasets (testing site 2)
This testing image was also captured over a rural residential area in Wuhan city, Hubei Province, China. The image is of , with spatial resolution of 0.2 m and three multispectral channels. Compared with testing site 1, testing site 2 is larger and has a more complex scene. More trees surround the homesteads in this rural residential area, and the shadow effect is more obvious. This image therefore poses a challenging task for LULC classification. The ground-truth image corresponding to the high resolution image (HRI) has been classified manually into the six most common LULC classes. The classification data (label images) are shown in Fig. 3(b). The testing image was segmented into 92,441 superpixels. Similar to testing site 1, six classes of interest are considered and described in Table 3, which also shows the number of the training and testing samples for each class. The training samples are randomly chosen from the reference ground-truth data and are shown in Table 3. The dictionary size is set to 500, and 20,000 pixels are randomly selected for training the dictionary via the k-means clustering method. A total of 500 training samples per LULC class are randomly selected to estimate the classifier parameters (Table 3).
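The dictionary training described above can be illustrated with a plain Lloyd's k-means in NumPy, where each cluster center becomes one visual word. This is a sketch under assumptions and not the authors' code: the function name `kmeans_dictionary`, the random initialization from data points, the fixed iteration count, and the seed are all illustrative.

```python
import numpy as np

def kmeans_dictionary(X, k, n_iter=20, seed=0):
    """Lloyd's k-means; returns a (k, d) array of centers (visual words).
    X: (n, d) array of low-level descriptors sampled from the image."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize the centers with k distinct descriptors.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every descriptor to every center.
        d2 = ((X ** 2).sum(axis=1)[:, None]
              - 2.0 * X @ centers.T
              + (centers ** 2).sum(axis=1)[None, :])
        assign = d2.argmin(axis=1)
        for c in range(k):
            members = X[assign == c]
            if len(members):            # keep the old center if a cluster empties
                centers[c] = members.mean(axis=0)
    return centers
```

With the settings reported here, `X` would hold the descriptors of the 20,000 randomly selected pixels and `k` would be 500. Because k-means converges to a local optimum that depends on the initialization, practical implementations usually run several restarts or use a smarter seeding scheme.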
Class information of Wuhan rural residential area dataset of testing site 2.
|Class name||Training samples||Testing samples|
Experimental Results and Analysis for Testing Site 1
The experimental results for testing site 1 are reported to validate the effectiveness of the proposed MFSC-CRF for LULC classification. The classification accuracies of the various midlevel feature learning methods, namely, SF-SVM, GLCM-SVM, U-SVM, MFSC-SVM, GLCM-CRF, and U-CRF, which are different combinations of low-level feature descriptors and classifier, are compared. The SVM classifier with RBF kernel has been proven to be successful in supervised classification of high-dimensional HRI data. Among the SVM-based methods, MFSC-SVM achieves better classification results than the other three methods [Figs. 4(c)–4(f)]. However, because the SVM algorithm does not consider neighborhood spatial contextual information, its results contain a large amount of isolated salt-and-pepper classification noise.
For the MFSC-CRF algorithm, which combines different effective features, the oversmoothing is less serious, as shown in the red boxes of Figs. 4(e) and 4(h). Moreover, the boundaries of homestead are better preserved. By contrast, SF-SVM focuses on spectral information, so its classification depends far less on structural information, which probably explains the misclassification of U-CRF.
The quantitative performances with the highest classification accuracies obtained by SF-SVM, U-SVM, GLCM-SVM, MFSC-SVM, SF-CRF, U-CRF, GLCM-CRF, and MFSC-CRF are reported in Table 4. The best result of each column is in bold. The results show that the algorithms in which spatial contextual information is considered significantly outperform the SVM classification in classification accuracy. Moreover, the accuracy of MFSC-CRF is higher than that of the three other CRF-based classification methods (i.e., SF-CRF, U-CRF, and GLCM-CRF), indicating that MFSC-CRF can adaptively incorporate different low-level feature descriptors. With GLCM as the low-level feature descriptor, the GLCM-CRF method achieves much higher accuracy than SF-SVM, SF-CRF, U-SVM, and U-CRF. This result shows that GLCM can be very effective for LULC classification. For the dataset of testing site 1 of the Wuhan rural residential area (Table 4), the reported quantitative performance of MFSC-CRF exhibits an improvement in OA. Additionally, the accuracy of MFSC-CRF is 21.4 percentage points higher than that of U-SVM (86.3% versus 64.9%), which shows that MFSC-CRF focuses more on spatial contextual information. Thus, spatial contextual information and other effective feature descriptors should be considered. Finally, MFSC-CRF obtains the highest accuracy.
Classification accuracy for Wuhan rural residential area using dataset of testing site 1 with different classifiers.
|Methods||Accuracy (%)||OA (%)||Kappa|
Figure 5 shows the confusion matrices of different classification methods with various feature descriptors and classifiers. The methods that use only spectral features as low-level feature descriptors (SF-SVM and SF-CRF) misclassify 14% of homestead as road. The reason is that the two LULC types have similar spectral characteristics, and both are impervious surfaces. This confusion is less serious for the GLCM-based (GLCM-SVM and GLCM-CRF) and MFSC-based (MFSC-SVM and MFSC-CRF) methods than for the SF-based methods. The MFSC-CRF method incorporates different low-level feature descriptors and results in 89% accuracy for homestead.
Experimental Results and Analysis for Testing Site 2
The classification maps for this testing image are shown in Figs. 6(a)–6(h). The quantitative classification results of the different classification methods are shown in Table 5 (the best result of each column is in bold) and Figs. 7(a)–7(h). The proposed MFSC-CRF method achieves higher OA and Kappa than SF-SVM, U-SVM, GLCM-SVM, MFSC-SVM, SF-CRF, U-CRF, and GLCM-CRF. Compared with SF-SVM and U-SVM, the MFSC-SVM method achieves remarkably enhanced OA and homestead accuracy. Compared with GLCM-SVM, the classification accuracy of the MFSC-SVM method shows improvement for each LULC class. By considering neighborhood spatial contextual information, MFSC-CRF improves the accuracy by 0.1 percentage points (from 87.4% to 87.5%) compared with the MFSC-SVM method.
Classification accuracy for Wuhan rural residential area dataset of testing site 2 with different classifiers.
|Methods||Accuracy (%)||OA (%)||Kappa|
Parameter Sensitivity Analysis
The performance of the proposed MFSC-CRF method is further evaluated using different numbers of training samples. Testing image 1 is selected for parameter sensitivity analysis, and the effects of training sample numbers on the MFSC-CRF algorithms are examined. Different sizes ranging from 100 to 1000 are tested with an interval of 100 for each LULC class.
As shown in Fig. 8, the classification accuracy of MFSC-CRF initially increases as the number of training samples per class increases (from 85.6% to 93.2%). On the Wuhan rural residential area dataset of testing site 1, the classification accuracy of MFSC-CRF is slightly higher than that of the GLCM-CRF (from 84.0% to 92.0%) and MFSC-SVM (from 85.0% to 92.8%) classification approaches. The accuracy then levels off, with a slight decrease, once the number of training samples per class reaches 900. Moreover, the classification accuracy of the proposed method remains higher than that of the other seven methods for each number of training samples. The training samples are randomly selected from the overall ground truth, and the remaining samples are used to evaluate the classification accuracies. The experiments show that the classification accuracies of the methods incorporating spatial contextual information (i.e., SF-CRF, U-CRF, GLCM-CRF, and the proposed MFSC-CRF) are all better than those of the SVM-based classification methods. Moreover, the MFSC-CRF method is more robust than the other classification methods with respect to the number of training samples.
A classification method for HSR remote sensing images based on MFSC and CRF models is proposed. The proposed MFSC-CRF method can effectively incorporate spectral, structural, and textural features, as well as spatial contextual information. Midlevel feature learning based on sparse coding is very important in image classification, and the proposed feature combination method can significantly improve the classification accuracy by effectively combining three complementary features, namely, DSIFT, spectral bands, and GLCM. Experiments on the Wuhan residential area datasets also show that the GLCM features can achieve more promising results than the original spectral features. The method is an open model in which different features can be conveniently cascaded to improve the accuracy of image classification. Recently, the convolutional neural network has been widely used in image classification and has achieved good results. However, the convolutional neural network model requires a large number of training samples to train its parameters. Therefore, our next step is to use a small number of training samples to fine-tune the convolutional neural network model so that it can be effectively applied to remote sensing image classification applications.
Bin Zhang received his BS, MS, and PhD degrees from the School of Electronic Information, Wuhan University, in 2007, 2009, and 2013, respectively. He is currently working at China University of Geosciences. His research interests include image classification, scene-level land use classification, and deep learning.