Land use and land cover classification for rural residential areas in China using soft-probability cascading of multifeatures

Abstract. A multifeature soft-probability cascading scheme is proposed to solve the problem of land use and land cover (LULC) classification of high-spatial-resolution images for mapping rural residential areas in China. The proposed method builds midlevel LULC features. Local features are frequently used as low-level feature descriptors in midlevel feature learning methods; however, spectral and textural features, which are very effective low-level features, are often neglected. Moreover, the dictionary of sparse coding is learned in an unsupervised manner, which reduces the discriminative power of the midlevel features. Thus, we propose to learn supervised features based on sparse coding, a support vector machine (SVM) classifier, and a conditional random field (CRF) model to exploit the different effective low-level features and to improve the discriminability of the midlevel feature descriptors. First, three kinds of typical low-level features, namely, dense scale-invariant feature transform, gray-level co-occurrence matrix, and spectral features, are extracted separately. Second, combined with sparse coding and the SVM classifier, the probabilities of the different LULC classes are inferred to build supervised feature descriptors. Finally, the CRF model, which consists of unary and pairwise potentials, is employed to construct the LULC classification map. Experimental results show that the proposed classification scheme achieves impressive performance, with a total accuracy of about 87%.


Introduction
Image classification is crucial in the interpretation of remote sensing images with high spatial resolution (HSR). 1 The availability of HSR remote sensing imagery obtained from satellites (e.g., WorldView-2, IKONOS, QuickBird, ZY-3C, GF-1, and GF-2) increases the possibility of accurate Earth observations. Such HSR imagery provides highly valuable geometric and detailed information, which is important for various applications, such as precision agriculture, security applications, and damage assessment for environmental disasters and land use. 2 In these applications, mapping a high-resolution image for land use and land cover (LULC) is particularly relevant.
Local features [19][20][21][22][23] have been successfully applied to image retrieval, semantic segmentation, and scene understanding. These features gained popularity in the remote sensing community because of their robustness to rotation, scale changes, and occlusion. Sparse coding is one of the most effective approaches for grouping local features and performs well in object categorization, scene-level land use classification, etc. [24][25][26][27][28][29][30][31][32][33][34][35][36] The sparse coding method, combined with max-pooling and spatial pyramid matching (SPM), can be used to learn midlevel features. In this approach, a class type is represented by the distribution of a set of visual words, which are usually obtained by unsupervised K-means clustering of a set of low-level feature descriptors. However, because the visual words are learned in an unsupervised manner, the resulting midlevel features are less discriminative, which reduces the classification accuracy. Moreover, several conventional low-level features, such as spectral features, are neglected in the building of midlevel features. Some studies have addressed this drawback and effectively incorporated spectral and local features. 33,34 Hu et al. 37 developed a method that combines convolutional neural networks (CNN) and sparse coding to learn discriminative features for scene-level land use classification, with impressive results and a total accuracy of about 96%. However, this method is limited by the lack of information on the LULC class types, because the parameters of the CNN model are estimated on the ImageNet dataset. 38 In addition to feature learning, the selection of a classifier is particularly important for LULC classification based on high-resolution remote sensing images. Many classification methods, such as the maximum likelihood, Markov random field (MRF), and support vector machine (SVM) models, have been developed.
The SVM classifier is widely used for various computer vision tasks and for LULC classification because it has shown advantages in high-dimensional feature spaces. The MRF 39 and conditional random field (CRF) 40 are structured output models that consider the interactions of random variables. These approaches have been successfully developed in the remote sensing [14][15][16][17]41 and computer vision communities. [42][43][44][45][46][47][48][49] Moser et al. 14 proposed an LULC classification method for high-resolution remote sensing images based on the MRF model. However, the results of this model always exhibit an oversmoothed appearance. 9,48 Another drawback of the MRF is its difficulty in handling high-dimensional feature spaces. The CRF model overcomes these drawbacks and shows advantages in image classification and semantic segmentation.
Thus, we establish an LULC classification framework for HSR remote sensing images by exploiting labeled data based on midlevel feature learning and the SVM classifier to achieve multifeature soft-probability feature descriptors, and we employ a CRF classification method to jointly model the unary and pairwise costs.
In this paper, a multifeature soft-probability cascading and CRF (MFSC-CRF) classification model is designed to learn discriminative midlevel features in a supervised manner. First, the spectral, gray-level co-occurrence matrix (GLCM), and dense scale-invariant feature transform (DSIFT) features are extracted as low-level feature descriptors. Three types of midlevel feature descriptors are then obtained by adopting sparse coding, superpixel segmentation, and max-pooling. Next, the probability that each superpixel belongs to each LULC class is calculated from the labeled samples, and the three probability vectors are cascaded to construct the feature descriptor of each superpixel. Finally, the CRF model is introduced to generate the LULC classification.
The supervised feature descriptors are obtained using the SVM classifier with the training samples; this classifier has been demonstrated to effectively incorporate low-level features. With the CRF classifier, the local spatial relationship between neighboring superpixels is considered in combination with the learned feature descriptors. Thus, the proposed method achieves better classification results than the traditional methods.
The rest of this paper is structured as follows. In Sec. 2, the proposed method for midlevel feature learning and soft-probability cascading and CRF classification is presented. In Sec. 3, the experiments on the rural residential area dataset of Wuhan are discussed. Conclusions are drawn in Sec. 4.

MFSC-CRF Classification Framework
An HSR remote sensing image classification framework for LULC is proposed. The method is based on midlevel feature learning, integrating sparse coding and the CRF method to utilize spectral, structural, and spatial contextual information. Three kinds of typical features, namely, GLCM, DSIFT, and spectral features, are selected to construct the low-level features.
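As an illustration, two of the three low-level descriptors (spectral statistics and a GLCM texture statistic) can be sketched in a few lines of NumPy. The function names and the reduced 8-level quantization are hypothetical simplifications, and the DSIFT descriptor, which is usually computed with a dedicated library, is omitted here:

```python
import numpy as np

def spectral_feature(patch):
    """Per-band mean and standard deviation of an (H, W, B) image patch:
    a simple spectral descriptor."""
    return np.concatenate([patch.mean(axis=(0, 1)), patch.std(axis=(0, 1))])

def glcm_contrast(gray, levels=8):
    """Toy GLCM texture statistic: quantize an (H, W) gray patch to a few
    levels, count horizontal neighbor co-occurrences, and return the
    contrast measure of the normalized co-occurrence matrix."""
    q = np.clip((gray.astype(float) / 256.0 * levels).astype(int), 0, levels - 1)
    glcm = np.zeros((levels, levels))
    for a, b in zip(q[:, :-1].ravel(), q[:, 1:].ravel()):
        glcm[a, b] += 1.0
    glcm /= glcm.sum()
    i, j = np.indices((levels, levels))
    return float(((i - j) ** 2 * glcm).sum())
```

A full GLCM descriptor would average several statistics over multiple offsets and directions; the single horizontal-contrast case above only shows the mechanics.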
The whole pipeline of the MFSC-CRF classification framework consists of two main steps, namely, feature learning and CRF classification (Fig. 1).
Midlevel feature descriptors are obtained in the feature learning step from the three features by combining sparse coding, SPM, and max-pooling. The class probabilities are then calculated by the SVM classifier using the training samples, and the resulting probability values form the new discriminative feature descriptors.
During CRF classification, the CRF model is introduced to classify the superpixels according to the land cover class types. The probability feature descriptors from the first step are used in this step, and an SVM classifier is adopted to construct the unary potentials. The pairwise potentials are obtained by calculating the distance between neighboring superpixels. The graph-cut-based α-expansion algorithm is then executed to obtain the classification result of the CRF model.

Midlevel Feature Descriptors
As discussed above, three typical features are adopted as the low-level feature descriptors; the details are described as follows. The low-level feature descriptors are extracted from the images, and each descriptor has dimension T. The visual dictionary D of K visual words, obtained by the unsupervised K-means clustering algorithm, is defined as

D = [d_1, d_2, ⋯, d_K] ∈ R^(T×K), (1)

where each visual word d_k is represented as a linear classifier and is given by

d_k = [D_{k,1}, D_{k,2}, ⋯, D_{k,T}]^T ∈ R^T. (2)

An encoding scheme based on the classification score obtained by each dictionary word is used, instead of sparse coding, to encode each descriptor, as suggested in Ref. 23. If α_l^i is a descriptor vector, its coding vector f_D(α_l^i) corresponding to dictionary D is given by

f_{d_k}(α_l^i) = max(0, d_k^T α_l^i), k = 1, ⋯, K. (3)

Intuitively, the descriptor α should be similar to only a few words in the dictionary if the visual words of dictionary D are sufficiently discriminative. Therefore, the vector f_D(α_l^i) is expected to have only a few values that are greater than zero. Given a dictionary D and a set of segmented superpixel regions over an image, we represent the image by spatial max-pooling. For each superpixel region l ∈ [1, ⋯, N_S] of image i, where N_S represents the number of superpixels extracted from the image, let α_j^l be a descriptor vector extracted from region l, where j ∈ [1, ⋯, N_l] indexes the N_l pixels of region l. Given the dictionary D, region l is then encoded by max spatial pooling as

x_{l,D}^i = [max_j f_{d_1}(α_j^l), ⋯, max_j f_{d_K}(α_j^l)] ∈ R^K, (4)

where x_{l,D}^i represents the midlevel feature descriptor of superpixel l.
Stacking the descriptors of all superpixels gives x_D(i), the midlevel feature descriptor of image i. If the midlevel features of the pixels in a segmented region are similar to only a few of the visual words, these words can represent the characteristics of the region, and the similarity is measured over the whole region.
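The encoding and pooling steps can be sketched as follows. This is a minimal reading of Eqs. (3) and (4), assuming a rectified dot-product word score; the function name is hypothetical:

```python
import numpy as np

def encode_and_pool(descriptors, D):
    """Encode each low-level descriptor by its rectified score against
    every dictionary word, then max-pool over the superpixel region.
    descriptors: (N_l, T) array of low-level vectors from one region.
    D:           (T, K) dictionary of K visual words."""
    scores = np.maximum(0.0, descriptors @ D)  # (N_l, K) word scores, Eq. (3)
    return scores.max(axis=0)                  # (K,) midlevel descriptor, Eq. (4)
```

Max-pooling over a region is monotone: adding descriptors to the region can only raise (never lower) each component of the pooled vector, which is why the pooled vector summarizes the strongest word responses of the whole region.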

Probability Feature Descriptors
Let x_D be the midlevel feature vector of an image, i.e., a vector in a K-dimensional space defined by a dictionary D. If three different types of features (DSIFT, spectral bands, and GLCM) are used in the sparse coding phase, an image can be represented by three corresponding vectors. That is, each image i can be represented by

x_{D1}(i) ∈ R^(K1×N_S), x_{D2}(i) ∈ R^(K2×N_S), x_{D3}(i) ∈ R^(K3×N_S), (5)

where D1, D2, and D3 are the dictionaries learned from the DSIFT, spectral, and GLCM features, respectively; l indexes the superpixels; and K1, K2, and K3 are the dictionary sizes. These three kinds of midlevel features, combined with the training samples, are used to estimate the SVM classifier parameters and to calculate the probability that each superpixel belongs to each LULC class. The probability vectors of the different midlevel feature descriptors can be represented as

P1, P2, P3 ∈ R^(KL×N_S), (6)

where KL represents the number of land cover classes. The MFSC feature descriptors for the final classification are given by

P = [P1; P2; P3] ∈ R^(KL3×N_S), (7)

where KL3, thrice the number of LULC classes, represents the size of the feature descriptors. The MFSC feature descriptors are thus much smaller than the midlevel feature descriptors of Eq. (5).
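A minimal sketch of the probability cascading is given below. The nearest-class-mean softmax is only a toy stand-in for the Platt-scaled SVM probability outputs used in the paper, and the function names are hypothetical; the point is the shape of the cascaded descriptor of Eq. (7):

```python
import numpy as np

def class_probs(x, class_means, tau=1.0):
    """Toy stand-in for per-class SVM probability outputs: a softmax over
    negative distances to per-class mean midlevel vectors."""
    d = np.linalg.norm(class_means - x, axis=1)
    e = np.exp(-d / tau)
    return e / e.sum()

def mfsc_descriptor(midlevel_feats, means_per_type):
    """Cascade the KL-dim probability vectors of the three feature types
    (DSIFT, spectral, GLCM) into one KL3-dim MFSC descriptor, Eq. (7)."""
    return np.concatenate([class_probs(x, m)
                           for x, m in zip(midlevel_feats, means_per_type)])
```

Each of the three blocks of the resulting descriptor is a proper probability distribution over the KL classes, so the cascaded vector has dimension 3·KL regardless of the dictionary sizes K1, K2, and K3.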

CRF Classification Model
The CRF model for the final classification of high-resolution remote sensing images is now presented. The CRF is defined over the set of superpixels ν extracted from the image I. Each superpixel i ∈ ν is associated with a class label x_i ∈ L = {1, ⋯, L}, and the labeling of the image is denoted by the vector x ∈ L^|ν|. The interactions among the superpixels are captured by the set of edges ε ⊆ ν × ν, where each edge e_ij ∈ ε corresponds to a pair of superpixels i, j ∈ ν that share a boundary. The CRF energy, which consists of unary and pairwise costs, is formulated as

E(x; I) = λ_U Σ_{i∈ν} ψ_i^U(x_i, I) + λ_P Σ_{e_ij∈ε} ψ_ij^P(x_i, x_j, I), (8)

where λ_U ≥ 0 and λ_P ≥ 0 are the relative weights of the unary and pairwise potentials, respectively. The unary potential ψ_i^U(x_i, I) in Eq. (8) models the cost of assigning a class label x_i ∈ L to superpixel i in image I. This potential is defined as the score of a kernel SVM classifier for class x_i applied to the MFSC feature vector of superpixel i described in Eq. (7). The classifier for class l is trained using the MFSC feature vectors extracted from the superpixels labeled l in the training set. The radial basis function (RBF)-χ2 kernel is adopted for the SVM classification.
The pairwise potential ψ_ij^P(x_i, x_j, I) models the cost of assigning labels x_i and x_j to the neighboring superpixels i and j, respectively. When a CRF formulation is used for classification, the pairwise potentials are usually used to ensure the smoothness of the label assignments. A contrast-sensitive cost is used:

ψ_ij^P(x_i, x_j, I) = L_ij exp[−β(Ī_i − Ī_j)²] if x_i ≠ x_j, and 0 otherwise, (9)

where L_ij is the length of the shared boundary between superpixels i and j, Ī_i and Ī_j are the mean gray values of superpixels i and j, respectively, and β is a contrast parameter. The weights λ_U and λ_P in Eq. (8) are estimated by the cutting-plane method, the details of which are described in Ref. 49. The classification result of the CRF model is obtained by minimizing Eq. (8).
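The energy of Eq. (8) can be evaluated directly for any candidate labeling, which makes the roles of the two potentials concrete. The sketch below assumes an illustrative contrast parameter beta and precomputed unary costs (in the paper these come from the SVM scores, and λ_U, λ_P are learned by the cutting-plane method); the minimizer is found here by brute-force enumeration on a tiny graph, whereas the paper uses graph-cut α-expansion:

```python
import numpy as np
from itertools import product

def crf_energy(labels, unary, edges, boundary_len, gray_mean,
               lam_u=1.0, lam_p=1.0, beta=0.05):
    """Evaluate the CRF energy of Eq. (8) for one labeling of the
    superpixel graph, with the contrast-sensitive pairwise cost of Eq. (9).
    unary: (N, L) cost of each label for each superpixel."""
    e = lam_u * unary[np.arange(len(labels)), labels].sum()  # unary term
    for (i, j), L_ij in zip(edges, boundary_len):
        if labels[i] != labels[j]:  # Potts-style penalty only across labels
            e += lam_p * L_ij * np.exp(-beta * (gray_mean[i] - gray_mean[j]) ** 2)
    return e

# Tiny 3-superpixel, 2-class example: superpixels 0 and 1 look alike
# (gray means 80 and 85), so a label change between them is expensive;
# superpixel 2 is very different (gray mean 200), so its boundary is cheap.
unary = np.array([[0.1, 0.9], [0.2, 0.8], [0.9, 0.1]])
edges, blen = [(0, 1), (1, 2)], [12.0, 7.0]
gmean = np.array([80.0, 85.0, 200.0])
best = min(product([0, 1], repeat=3),
           key=lambda x: crf_energy(np.array(x), unary, edges, blen, gmean))
```

The minimizer keeps the similar pair (0, 1) in one class while letting the high-contrast boundary to superpixel 2 carry the label change, which is exactly the smoothing behavior the contrast-sensitive cost is meant to produce.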

Experimental Results
We conduct experiments using high-resolution aerial images to evaluate the effectiveness of the proposed MFSC-CRF framework for LULC classification. The compared classification methods are listed in Table 1 and described as follows.
1. SF-SVM: This method uses only the unary segmentation cost. Spectral features are considered the low-level features. After midlevel feature learning, the SVM method is adopted to obtain the classification results. This method is very similar to the simultaneous orthogonal matching pursuit method proposed by Chen et al. 51
2. U-SVM: This method is similar to SF-SVM but differs in the selection of low-level features. As described in Ref. 26, the DSIFT feature is considered the low-level feature, and the SVM classifier is used for superpixel-level classification.
3. GLCM-SVM: The GLCM feature is considered the low-level feature, and the SVM classifier is used for superpixel-level classification.
4. MFSC-SVM: The multifeature soft-probability vector is used as the feature vector, and SVM is adopted for LULC classification.
5. SF-CRF: The spectral feature is considered the low-level feature and is combined with sparse coding and CRF to obtain the classification results.
6. U-CRF: Sparse coding and the CRF model are used, with DSIFT as the low-level feature, as described in Ref. 48.
7. GLCM-CRF: GLCM is considered the low-level feature descriptor, and CRF is adopted for classification.
8. MFSC-CRF: The cascaded probabilities are used as the feature descriptors in the proposed method, and CRF is adopted for supervised classification.
The experimental results are evaluated using three kinds of accuracy measures, namely, the accuracy of each class, the overall accuracy (OA), and the kappa coefficient (Kappa). OA is the fraction of correctly classified pixels over all test pixels. For a fair comparison, the classification results with the highest OA are selected for all classification algorithms. The effect of the number of training samples on the MFSC-CRF model is further investigated.
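For reference, OA and Kappa can be computed from a confusion matrix as follows (a standard computation, not specific to this paper):

```python
import numpy as np

def oa_and_kappa(conf):
    """Overall accuracy and Cohen's kappa from a confusion matrix whose
    rows are reference classes and columns are predicted classes."""
    conf = np.asarray(conf, dtype=float)
    n = conf.sum()
    po = np.trace(conf) / n                                     # overall accuracy
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / n ** 2   # chance agreement
    return po, (po - pe) / (1.0 - pe)
```

Kappa discounts the agreement expected by chance, so a classifier that predicts the majority class everywhere can have a high OA but a kappa near zero.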

Experimental datasets (testing site 1)
The first test image was captured over a rural residential area in Wuhan city, Hubei Province, China, through unmanned aerial vehicle photography. The image is 1024 × 1200 pixels, with a spatial resolution of 0.2 m and three multispectral channels (red, green, and blue). An overview of this dataset is shown in Fig. 2(a), and the corresponding ground truth is shown in Fig. 2(b). The testing image was segmented into 52,654 superpixels using the simple linear iterative clustering method. Six classes of interest, namely, low vegetation, homestead, farmland, waterbody, road, and woodland, are considered and listed in Table 2.
Rural homestead is the main type of rural residential land and is spatially scattered. This class contains various houses, walls, and other facilities with spatial correlation and semantic structure characteristics. The other five class types are mainly land cover types. A total of 100 training samples per LULC class is drawn from the reference ground-truth data, and the remaining samples are used to evaluate the accuracy. The results are shown in Table 2.

Experimental datasets (testing site 2)
This testing image was also captured over a rural residential area in Wuhan city, Hubei Province, China. The image is 1113 × 1777 pixels, with a spatial resolution of 0.2 m and three multispectral channels. Compared with testing site 1, testing site 2 is larger and presents a more complex scene: more trees surround the homesteads in this rural residential area, and the shadow effect is more obvious. This image therefore poses a challenging LULC classification task. The ground truth corresponding to the high-resolution image (HRI) was classified manually into the six most common LULC classes, and the label image is shown in Fig. 3(b). The testing image was segmented into 92,441 superpixels. Similar to testing site 1, six classes of interest are considered and described in Table 3, which also lists the numbers of training and testing samples for each class. The training samples are randomly chosen from the reference ground-truth data (Table 3). The dictionary size is set to 500, and 20,000 pixels are randomly selected for training the dictionary via the K-means clustering method. A total of 500 training samples per LULC class is randomly selected to estimate the classifier parameters (Table 3).
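The dictionary-learning step described above can be sketched with plain Lloyd's K-means; this toy version ignores the empty-cluster handling and initialization refinements of production implementations:

```python
import numpy as np

def kmeans_dictionary(X, K, iters=25, seed=0):
    """Learn a visual dictionary with Lloyd's K-means over sampled
    low-level descriptors (the paper uses K = 500 words learned from
    20,000 randomly sampled pixels).
    X: (N, T) descriptors; returns a (K, T) dictionary of word centers."""
    rng = np.random.default_rng(seed)
    D = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each descriptor to its nearest word, then recenter.
        assign = np.argmin(((X[:, None, :] - D[None]) ** 2).sum(-1), axis=1)
        for k in range(K):
            members = X[assign == k]
            if len(members):
                D[k] = members.mean(axis=0)
    return D
```

For dictionary sizes and sample counts of the order used in the paper, a mini-batch variant or a library implementation would be preferable; the loop above only shows the two alternating steps.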

Experimental Results and Analysis for Testing Site 1
The experimental results for testing site 1 are reported to validate the effectiveness of the proposed MFSC-CRF for LULC classification. For the MFSC-CRF algorithm, which combines different effective features, oversmoothing is less serious, as shown in the red boxes of Figs. 4(e) and 4(h). Moreover, the boundaries of homesteads are better preserved. By contrast, SF-SVM focuses more on the spectral information, and its classification thus depends remarkably less on the structural information, which probably explains the misclassification of U-CRF.
The quantitative performances with the highest classification accuracies obtained by SF-SVM, U-SVM, GLCM-SVM, MFSC-SVM, SF-CRF, U-CRF, GLCM-CRF, and MFSC-CRF are reported in Table 4; the best result of each column is shown in bold. The results show that the algorithms in which spatial contextual information is considered significantly outperform SVM classification in accuracy. Moreover, the accuracy of MFSC-CRF is higher than that of the three other CRF-based classification methods (i.e., SF-CRF, U-CRF, and GLCM-CRF), indicating that MFSC-CRF can adaptively incorporate different low-level feature descriptors. With GLCM as the low-level feature descriptor, the GLCM-CRF method achieves much higher accuracy than SF-SVM, SF-CRF, U-SVM, and U-CRF, which shows that GLCM can be very effective for LULC classification. On the testing site 1 dataset of the Wuhan rural residential area (Table 4), the reported quantitative performance of MFSC-CRF exhibits a clear improvement in OA. In particular, the approximately 21% higher accuracy of MFSC-CRF compared with U-SVM (from 64.9% to 86.3%) shows that MFSC-CRF exploits more spatial contextual information. Thus, spatial contextual information and other effective feature descriptors should be considered, and the MFSC-CRF obtains the highest accuracy. Figure 5 shows the confusion matrices of the different classification methods with various feature descriptors and classifiers. The methods that use only spectral features as low-level feature descriptors (SF-SVM and SF-CRF) misclassify 14% of homestead as road, because the two LULC types have similar spectral characteristics and both belong to impervious surfaces. The GLCM-based (GLCM-SVM and GLCM-CRF) and MFSC-based (MFSC-SVM and MFSC-CRF) methods suffer less from this confusion than the SF-based methods. The MFSC-CRF method, which incorporates different low-level feature descriptors, reaches 89% accuracy for homestead.

Experimental Results and Analysis for Testing Site 2
The resulting classification maps for this testing image are shown in Figs. 6(a)-6(h). The quantitative classification results of the different methods are shown in Table 5 (the best result of each column is in bold) and Figs. 7(a)-7(h). The proposed MFSC-CRF method achieves higher OA and Kappa than SF-SVM, U-SVM, GLCM-SVM, MFSC-SVM, SF-CRF, U-CRF, and GLCM-CRF. Compared with SF-SVM and U-SVM, the MFSC-SVM method achieves remarkably enhanced OA and homestead accuracy. Compared with GLCM-SVM, the classification accuracy of MFSC-SVM shows an ∼3% improvement for each LULC class. By considering neighborhood spatial contextual information, MFSC-CRF shows a 0.1% accuracy improvement (from 87.4% to 87.5%) over the MFSC-SVM method.

Parameter Sensitivity Analysis
The performance of the proposed MFSC-CRF method is further evaluated with different numbers of training samples. Testing image 1 is selected for the parameter sensitivity analysis, and the effect of the number of training samples on the MFSC-CRF algorithm is examined. Training set sizes ranging from 100 to 1000 samples per LULC class are tested at intervals of 100.
As shown in Fig. 8, the classification accuracy of MFSC-CRF initially increases with the number of training samples per class (from 85.6% to 93.2%) and is slightly higher than that of the GLCM-CRF (from 84.0% to 92.0%) and MFSC-SVM (from 85.0% to 92.8%) approaches on the Wuhan rural residential area dataset of testing site 1. The accuracy then remains roughly constant, and even slightly decreases, beyond 900 training samples per class. Moreover, the classification accuracy of the proposed method remains higher than that of the other seven methods at every training set size. The training samples are randomly selected from the overall ground truth, and the remaining samples are used to evaluate the classification accuracies. The experiments show that the methods incorporating spatial contextual information (i.e., SF-CRF, U-CRF, GLCM-CRF, and the proposed MFSC-CRF) all outperform the SVM-based classification methods. Moreover, the MFSC-CRF method is more robust than the other classification methods across different training set sizes.

Conclusion
A classification method for HSR remote sensing images based on the MFSC and CRF models is proposed. The proposed MFSC-CRF method effectively incorporates spectral, structural, and textural features, as well as spatial contextual information. Midlevel feature learning based on sparse coding is very important in image classification, and the proposed feature combination method significantly improves the classification accuracy by effectively combining three complementary features, namely, DSIFT, spectral bands, and GLCM. Experiments on the Wuhan residential area datasets also show that the GLCM features achieve more promising results than the original spectral features. The proposed method is an open model, in which different features can be conveniently cascaded to improve the classification accuracy. Recently, convolutional neural networks have been widely used in image classification and have achieved good results; however, they require a large number of training samples to estimate their parameters. Therefore, our next step is to fine-tune a convolutional neural network model with a small number of training samples so that it can be effectively applied to remote sensing image classification.