Urban built-up areas extraction by the multiscale stacked denoising autoencoder technique

Abstract. The stacked denoising autoencoder (SDAE) model has a strong feature learning ability and has shown great success in the classification of remote sensing images. However, built-up area (BUA) information is easily confused with broken rocks, bare land, and other land objects with similar spectral features, and SDAEs are vulnerable to such fragmented and spectrally similar features in the image. To overcome these problems, we propose a multiscale SDAE model that extracts BUA features at different scales and recognizes the type of land object from multiple scales. The model effectively improves the recognition rate of BUA. The experimental results show that our algorithm is robust to the disturbance information and that its classification accuracies are better than those of support vector machine, backpropagation, random forest, and SDAE. We then apply the classification results of our algorithm to an analysis of the Wuhan (China) metropolitan area. The range of the metropolitan area is the 1.5-h isochronous circle calculated from Tencent map big data, and it is divided into three layers: core metropolitan area, subcore metropolitan area, and daily metropolitan area. Finally, from the comprehensive statistical data and traffic data, we find that the Wuhan metropolitan area has a “target-shaped” distribution structure radiating outward from the core metropolitan area. It includes five metropolitan development corridors: Wuhan–Huanggang, Wuhan–Xiaogan–Suizhou, Wuhan–Ezhou–Huangshi, Wuhan–Xiantao–Tianmen, and Wuhan–Xianan–Chibi. These corridors are of great significance to the development of metropolitan areas.


Introduction
With its representative features of influence and radiation, the metropolitan area plays an increasingly important role in the field of city clusters. 1 The metropolitan area has the characteristics of an urban agglomeration to some extent. With the development of rapid transportation and the expansion of the commuter circle, the range of the metropolitan area has become relatively larger. 2 Domestic and foreign research on the metropolitan area mainly focuses on commuting distance and the spatial distribution of built-up area (BUA). 3,4 Distance factors are mostly used to define the range of the metropolitan area, but there are no strict standards for travel time and spatial distance. 5–8 Zakaria 9 took Philadelphia as an example, studied the relationship between regional public transport or car accessibility and land use growth rate, and found that the growth rate differed among parts of the urban area because of their different accessibility. Liu et al. 10 consulted the transportation planning of highway and waterway (2002 to 2020) and the skeleton highway network planning (2002 to 2020) of Hubei Province, proposed a time-distance accessibility model, and established a 1-h high-accessibility circle (0 to 75 km), a 2-h medium-accessibility circle (75 to 150 km), and a 3-h low-accessibility circle. With the development of big data analysis and collection methods, mobile phone signaling data and traffic data have been widely applied in commuting analysis. Such data reveal residents' travel needs and travel characteristics at the city microscale.
BUA is a relatively concentrated area that contains buildings, public facilities, and urban roads within the urban administrative area. 11 We define BUA as the area whose surface is predominantly impervious, including all nonvegetative, nonwater, nonsoil, human-constructed elements (e.g., roads and buildings). BUA is an objective reflection of urban construction and development in regional distribution and indicates the scale and size of construction land in different periods of urban development. 12 Spatial information of BUA is fundamental to a better understanding of the development direction of the metropolitan area, the inter-relationship between cities, and the radiation effect of core cities on surrounding cities. 13 It is of great significance for guiding and evaluating the development of the metropolitan area. 14 Many BUA extraction methods using remote sensing images have been proposed. 15,16 They can be divided into three main categories: pixel-based, object-based, and deep learning-based methods. The pixel-based method is popular for classifying remote sensing images, given its simplicity and high efficiency. However, its classification results display a "salt and pepper" effect. To overcome this problem, the object-based method has recently become a mainstream method in land-use/land-cover applications. However, the choice of segmentation scale is a key problem for the object-based method when it faces different data and application scenarios. Deep learning methods have been proposed to improve on both with their strong fitting ability. A deep neural network has strong expressive power to approximate various complex models, and thus it can be widely applied to land-use/land-cover classification.
Stacked denoising autoencoder (SDAE) is a typical deep learning method and works in much the same way as stacking restricted Boltzmann machines in deep belief networks or stacking ordinary autoencoders. 17,18 It learns to recover the corrupted data with the help of an unsupervised pretraining procedure that initializes the neural network. The entire neural network is then trained using supervised learning to recognize the target. 19,20,21 Zhang et al. 22 extracted the spectral, spatial, and texture features for each object, fed all features into a stacked autoencoder or SDAE network, and then obtained the parameters of the network. The classification result is better than that of the linear support vector machine (SVM) model and the radial basis function (RBF) SVM model. Han et al. 23 used SDAE to predict human eye fixations in two steps. They used a center patch and its surrounding patches to represent the features, developed a model to learn features from raw image data in an unsupervised manner, and then captured the intrinsic mutual patterns as the feature contrast and integrated them for the final saliency prediction. Li 24 used SDAE and a Softmax model to solve the problems of automatic feature extraction and dimension reduction in Braille recognition. The SDAE performs better than the traditional feature extraction algorithms, and Softmax performs better than multilayer perceptron and RBF when they are combined with SDAE. Zhang 25 constructed the detection vector of the center pixel from the center pixel and its neighboring pixels and used the SDAE model to classify the land cover based on GF-1 images. The classification result is better than that of traditional SVM and backpropagation (BP) networks. SDAE has also been widely applied to feature learning in many other fields, such as denoising and target recognition. 26–32
The spatial distribution of BUA is concentrated, with consistent types and structures. In the process of classification, BUA is easily confused with broken rocks, bare land, and other land objects with similar spectral features. The SDAE model has a strong feature learning ability, but it lacks spatial scale features. When it encounters the phenomenon of "different objects with the same spectral features and the same object with different spectral features" in remote sensing images, its classification ability is limited.
In this paper, we develop a new BUA extraction method based on SDAE and apply the BUA results extracted by this method to the analysis of the metropolitan area. First, the recognition of the same land object gives different results at different scales, which shows that the spatial pattern of land objects differs significantly across scales. 33 We analyze the spatial scale features of BUA, generate multiscale hierarchical structure features, and integrate them with the learning ability of the SDAE model. We then propose a multiscale stacked denoising autoencoder (MSDAE) model to learn the features and extract BUA from multiple scales, which improves the classification ability for BUA. Second, taking Wuhan city as an example, we divide the commuting isochronous circles into a 0.5-h isochronous circle (0.5 h), a 1-h isochronous circle (1 h), and a 1.5-h isochronous circle (1.5 h) based on Tencent map big data. We comprehensively analyze the clustering degree of BUA, population density, urban traffic, and corridors in this area.
The paper is organized as follows: Sec. 2 introduces the study region and data; Sec. 3 introduces the SDAE and describes the proposed method in detail; Sec. 4 presents the extraction result; Sec. 5 presents the precision evaluation; Sec. 6 gives the metropolitan area analysis based on the BUA extracted using the MSDAE method. Finally, the conclusion is drawn in Sec. 7.
Study Area and Data

Study Area
Wuhan is located in the east of the Jianghan plain and on the middle reaches of the Yangtze river, at 113°41′E–115°05′E, 29°58′N–31°22′N. The terrain is higher in the north than in the south and mostly flat in the middle. The average elevation of this area is 23.3 m, with a geomorphology of rolling hills and plains. Wuhan is the capital of Hubei Province, the only subprovincial city and megacity in the six central provinces, the central city of central China, and an important industrial base, scientific and educational base, and comprehensive transportation hub in China.
Taking a central city as the center, regional accessibility can well explain the radiation capacity of the central city and its degree of connection to the surrounding areas in different directions. 34 Based on the population heat map and real traffic flow from Tencent map big data, we locate the center of the city and calculate the distance from the center to the furthest point that can be reached in 0.5, 1, and 1.5 h, respectively (considering the complexity of road conditions during the day, we use the travel time under night traffic flow conditions). By overlaying the administrative zoning map with the furthest distance, we obtain the isochronous circles based on the administrative zoning map (see Fig. 1). In this paper, we define the 1.5-h isochronous circle as the boundary of the metropolitan area. The metropolitan area covers 42 districts and counties, involving 10 cities. Among them, the 0.5-h circle covers 11 districts and counties, the 1-h circle covers 12, and the 1.5-h circle covers 19, as shown in Table 1.
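As a rough illustration of how a district could be assigned to one of the three isochronous circles, the sketch below bins a travel time from the city center into the 0.5-, 1-, and 1.5-h layers. It assumes that a single representative travel time per district is already available; the function name `assign_ring` and the example travel times are hypothetical and do not come from the Tencent map data or from Table 1.

```python
from bisect import bisect_left

# Upper travel-time bounds (hours) of the three isochronous circles used in the text.
RING_BOUNDS_H = [0.5, 1.0, 1.5]
# Layer names follow the paper's three-layer division of the metropolitan area.
RING_NAMES = [
    "core metropolitan area",
    "subcore metropolitan area",
    "daily metropolitan area",
]


def assign_ring(travel_time_h):
    """Return the layer for a travel time from the city center.

    Returns None when the travel time exceeds 1.5 h, i.e. the location
    falls outside the metropolitan area as defined in the text.
    """
    if travel_time_h < 0:
        raise ValueError("travel time must be non-negative")
    # bisect_left keeps a time exactly on a bound inside the inner circle.
    idx = bisect_left(RING_BOUNDS_H, travel_time_h)
    return RING_NAMES[idx] if idx < len(RING_NAMES) else None


# Hypothetical travel times (hours) for four unnamed districts.
example_times = [0.3, 0.75, 1.5, 2.0]
example_rings = [assign_ring(t) for t in example_times]
```

The paper itself derives the circles by overlaying the furthest reachable distance on the administrative zoning map, so this per-district binning is only a simplified stand-in for that overlay step.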

Data and Preprocess
The metropolitan area of Wuhan is taken as the research area. All available GF-1 WFV images acquired in April 2018 with cloud cover of less than 10% and a spatial resolution of 16 m are selected for this study. The preprocessing includes geometric correction, mosaicking, and projection transformation. The result of the processing is shown in Fig. 2.

Multiscale Stacked Denoising Autoencoder Method
In this section, we elaborate on the proposed MSDAE to extract BUA. First, we introduce the related work, namely, SDAE algorithm and its characteristics. Next, the proposed method is described in detail.

Stacked Denoising Autoencoder
The architecture of SDAE is divided into two steps. The first is feature learning, which is a process of unsupervised learning. The second step is optimization of the network parameters, which is a process of supervised learning (Fig. 3). The basic building block of an SDAE is the denoising autoencoder (DAE), which is one variant of the standard autoencoder. 35,36 A DAE learns to recover the data from the corresponding corrupted input data. SDAE allows us to build a deep network that uses denoising as an unsupervised objective to guide the learning of useful higher level representations. 37 As mentioned in Vincent's research, the autoencoder framework comprises two parts: encoder and decoder. The DAE is trained to reconstruct a clean "repaired" input from a corrupted version of it. Before encoding, the initial input $x$ is corrupted into $\tilde{x}$ by means of a stochastic mapping $\tilde{x} \sim q_D(\tilde{x} \mid x)$, in which some elements of $x$ are randomly forced to be zero (masking noise). The encoder then applies a nonlinear mapping function $f_\theta(\tilde{x})$, which transforms the corrupted vector into a hidden representation by the following equation:

$$y_i = f_\theta(\tilde{x}_i) = \mathrm{sigm}(W \tilde{x}_i + b). \tag{1}$$

Its parameter is $\theta = \{W, b\}$, where $W$ is the weight matrix and $b$ is an offset vector. The activation function sigm is set to the sigmoid function, $\mathrm{sigm}(x) = 1/(1 + e^{-x})$. The decoder maps the hidden representation $y_i$ back to a reconstructed vector $z_i$ by a similar equation, with its own parameter $\theta' = \{W', b'\}$:

$$z_i = g_{\theta'}(y_i) = \mathrm{sigm}(W' y_i + b'). \tag{2}$$

To meet the criteria of feature representation, features in the data can be learned by minimizing the reconstruction error of the loss function. To emphasize the corrupted dimensions, the components of the input are weighted differently, and the squared loss yields:

$$L(x, z) = \alpha \sum_{j \in \kappa(\tilde{x})} (x_j - z_j)^2 + \beta \sum_{j \notin \kappa(\tilde{x})} (x_j - z_j)^2, \tag{3}$$

where $\kappa(\tilde{x})$ denotes the indices of the components of $x$ that were corrupted. The weight $\alpha$ applies to the reconstruction error on components that were corrupted, and $\beta$ applies to those that were left untouched; $\alpha$ and $\beta$ are considered hyperparameters.

Fig. 3 The architecture of SDAE.
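The corruption, encoding, decoding, and weighted squared loss described above can be sketched for a single DAE layer in NumPy. This is a minimal forward-pass illustration only, not the authors' implementation: the layer sizes, the corruption level of 0.3, the values of alpha and beta, and the tied initialization of the decoder weights are all assumptions chosen for the example, and no training loop is included.

```python
import numpy as np


def sigm(a):
    """Sigmoid activation, 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))


def mask_corrupt(x, corruption_level, rng):
    """Masking noise: force a random subset of components of x to zero.

    Returns the corrupted vector and a boolean mask that is True where
    a component was corrupted.
    """
    corrupted = rng.random(x.shape) < corruption_level
    x_tilde = np.where(corrupted, 0.0, x)
    return x_tilde, corrupted


def encode(x_tilde, W, b):
    """Hidden representation y = sigm(W x_tilde + b)."""
    return sigm(W @ x_tilde + b)


def decode(y, W_prime, b_prime):
    """Reconstruction z = sigm(W' y + b')."""
    return sigm(W_prime @ y + b_prime)


def emphasized_sq_loss(x, z, corrupted, alpha, beta):
    """Squared loss weighted by alpha on corrupted and beta on untouched components."""
    sq = (x - z) ** 2
    return alpha * sq[corrupted].sum() + beta * sq[~corrupted].sum()


rng = np.random.default_rng(0)
n_in, n_hidden = 36, 16           # assumed sizes, e.g. a 3 x 3 patch with 4 bands
x = rng.random(n_in)              # synthetic input scaled to [0, 1)
W = rng.normal(scale=0.1, size=(n_hidden, n_in))
b = np.zeros(n_hidden)
W_prime = W.T.copy()              # tied initialization is an assumption
b_prime = np.zeros(n_in)

x_tilde, corrupted = mask_corrupt(x, 0.3, rng)
y = encode(x_tilde, W, b)
z = decode(y, W_prime, b_prime)
loss = emphasized_sq_loss(x, z, corrupted, alpha=1.0, beta=0.5)
```

Setting alpha equal to beta recovers the ordinary squared reconstruction error, which is a convenient sanity check on the weighting.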
Finally, a feedforward neural network (FFNN) classifier can be added to the end of the deep neural network to form a complete SDAE model for image classification. The hidden layer structure of the FFNN is the same as that of the SDAE. During training, the network parameters obtained by SDAE pretraining are taken as the initialization parameters of the FFNN, and labeled samples are used to train the model. The FFNN uses an error propagation mechanism: according to the difference between the output and the label, the BP algorithm fine-tunes the network parameters until convergence. The parameters of all layers are tuned using the stochastic gradient descent algorithm. 23,38,39

Multiscale Stacked Denoising Autoencoder
Many studies have shown that scale is a critical factor in remote sensing image classification. Land objects in remote sensing images are complex and fragmented, so they need to be recognized at different scales. To tackle this problem, we propose a new method, called MSDAE, that recognizes the type of land object from multiple scales. The model contains three parts: multiscale training, multiscale classification, and multiscale results merging. The architecture of the model is shown in Fig. 4.
To accomplish this, we first collect two types of samples: built-up samples and nonbuilt-up samples. Then, we crop each sample to 3 × 3 pixels for scale 1, 7 × 7 pixels for scale 2, 15 × 15 pixels for scale 3, and 25 × 25 pixels for scale 4. The difference among the four scales is that scale 1 determines the land object type of the center pixel from the vector composed of the center pixel and its eight surrounding pixels, whereas scales 2, 3, and 4 determine the type of land object by patch. From this perspective, scale 1 is pixel-based classification and scales 2 to 4 are patch-based classification. Because the dimensionality of the input vectors differs among scales, the configuration employed for MSDAE at each scale is shown in Table 2. In the training stage, the main aim is to train the MSDAE model by layer-wise pretraining and supervised fine-tuning.
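The four-scale cropping around a sample pixel can be sketched as follows. The window sizes come from the text; everything else is an assumption made for illustration: the image is taken to be a NumPy array of shape (rows, cols, bands), borders are handled by reflect padding, and each window is flattened into one input vector. The function names are hypothetical, and the network configuration of Table 2 is not reproduced here.

```python
import numpy as np

# Window side length (pixels) for each scale, as given in the text.
SCALE_SIZES = {1: 3, 2: 7, 3: 15, 4: 25}


def crop_patch(image, row, col, size):
    """Crop a size x size window centered on (row, col).

    image has shape (rows, cols, bands). Reflect padding at the image
    border is an assumption; the paper does not state how edges are handled.
    Padding on every call is simple but slow; pad once for real use.
    """
    half = size // 2
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="reflect")
    # After padding, pixel (row, col) sits at (row + half, col + half),
    # so the window starts at (row, col) in padded coordinates.
    return padded[row:row + size, col:col + size, :]


def multiscale_vectors(image, row, col):
    """Return one flattened input vector per scale for the sample at (row, col)."""
    return {
        scale: crop_patch(image, row, col, size).ravel()
        for scale, size in SCALE_SIZES.items()
    }


# Synthetic 4-band image standing in for a GF-1 WFV tile.
demo_image = np.arange(40 * 40 * 4, dtype=float).reshape(40, 40, 4)
demo_vectors = multiscale_vectors(demo_image, 0, 5)
demo_lengths = {scale: vec.size for scale, vec in demo_vectors.items()}
```

With four bands, the flattened vectors have 36, 196, 900, and 2500 elements for scales 1 to 4, which is why each scale needs its own input layer size.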
For each input image to be classified, we test it in the four scales separately. In every scale, there is no overlap in each direction among patches. Then we can obtain the result of

Built-Up Area Extraction Result
The task in our experiments was to classify all pixels in the images into two categories, built-up and nonbuilt-up, using our MSDAE model. Experiments were conducted on GF-1 WFV image data that cover the metropolitan area of Wuhan [Fig. 5(a)]. The size of the data is 20373 × 16376 pixels, and the data contain four spectral bands, which represent blue, green, red, and near-infrared in order. Using the ROI tool from ENVI, 14,557 sample points of BUA and 72,206 sample points of non-BUA are selected on the images. About 60% of them were randomly selected as training samples and 40% as test samples. The samples of MSDAE are made by expanding each pixel sample into 3 × 3, 7 × 7, 15 × 15, and 25 × 25 slices with the pixel sample as the center. The model is trained and then used to classify the image. The classification results are shown in Fig. 5. The pixel-based SDAE result [Fig. 5(b)] clearly has more noise than the result of the MSDAE proposed in this paper. The noise is caused by unused land, ridges, rocks, and other similar land objects. As shown in Fig. 5, the detection of BUA from multiple scales can reduce the interference of other land objects.

Comparisons with Single-Scale Result
In this paper, five regions [Fig. 5(a): 1–5] are selected to compare the classification results of scale 1, scale 2, scale 3, scale 4, and the final results. As shown in Fig. 6, the classification result of scale 1 has good detail, but the salt-and-pepper noise is obvious. The classification result of scale 2 is "granulated," but salt-and-pepper noise still exists. In the classification result at scale 3, the main part of the target object is highlighted and the noise is reduced. At scale 4, target objects are more aggregated, but false alarms are amplified. As can be seen from the figure, the clustering degree of the final result is higher and the noise is lower. By confirming the type of land object through multiple scales, MSDAE can reduce the interference of similar features from other land objects. R1 shows that both sides of the river are misclassified as BUA in scales 1 and 2 but not in scales 3 and 4.

$$\mathrm{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn}, \qquad \mathrm{Missing\ alarm} = \frac{fn}{tp + fn},$$

where tp is true positive, fp is false positive, fn is false negative, and tn is true negative. These values can be calculated from confusion matrices. As shown in Table 3, the precision of the MSDAE result is 0.873, which is the best. The accuracy of scale 2 is 0.859, which is similar to the accuracy of MSDAE. The recall of scales 3 and 4 is higher, which is related to their larger detection windows: the larger the detection window, the higher the probability that it will hit the target. Conversely, the false alarm rate is also higher. To sum up, the MSDAE result is superior to the single-scale results.
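The accuracy and missing-alarm ratios defined above can be computed directly from the four confusion-matrix counts, as in the sketch below. Precision and recall are included using their standard definitions, which is an assumption because the text names them without giving formulas; the false alarm measure is left out because its definition is not stated. The counts in the example are invented for illustration and are not taken from Table 3.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute evaluation ratios from binary confusion-matrix counts.

    Assumes at least one actual positive (tp + fn > 0) and at least one
    predicted positive (tp + fp > 0); otherwise a ZeroDivisionError is raised.
    """
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,       # as defined in the text
        "missing_alarm": fn / (tp + fn),     # as defined in the text
        "precision": tp / (tp + fp),         # standard definition (assumption)
        "recall": tp / (tp + fn),            # standard definition (assumption)
    }


# Hypothetical counts, for illustration only.
demo_metrics = confusion_metrics(tp=80, fp=20, fn=10, tn=90)
```

Under these definitions recall and missing alarm always sum to one, so a larger detection window that raises recall lowers the missing-alarm ratio by the same amount.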

Comparisons with Other Methods
To further investigate and evaluate the performance of our framework, we compare its results with those of classifications based on SVM, BP, random forest (RF), and SDAE. These methods are robust and widely used for land cover classification. 14,15 In our evaluation, overall accuracy (OA), user's accuracy (UA), producer's accuracy (PA), F1 score, and intersection over union (IOU) are used to assess the quantitative performance of SVM, BP, RF, SDAE, and MSDAE. The F1 score is calculated by Eq. (6). IOU is the ratio of the intersection of the prediction and ground-truth regions to their union, as shown in Eq. (7). Note that all models were trained and tested on the same training dataset and test dataset. Table 3 compares the classification accuracies of the five models on the five evaluation indices. MSDAE consistently provided better results than the other models and reached high scores (89.20% OA, 87.32% UA, 90.96% PA, 89.10% F1 score, and 80.35% IOU), which indicates that the MSDAE performed well on BUA extraction. The MSDAE outperforms the SDAE by about 5% in OA, about 7% in UA, about 2% in PA, about 5% in F1 score, and about 7% in IOU. This shows that the multiscale model is superior to the single-scale model. Overall, the proposed MSDAE achieves better performance (Table 4).
$$F1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}, \tag{6}$$

$$\mathrm{IOU} = \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall} - \mathrm{precision} \times \mathrm{recall}}. \tag{7}$$
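As a consistency check on the reported scores, the short sketch below evaluates the F1 score and IOU from precision and recall. It takes F1 to be the usual harmonic mean of precision and recall and assumes that the reported UA and PA of MSDAE play the roles of precision and recall, respectively; the function names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def iou_from_pr(precision, recall):
    """IOU expressed through precision and recall, following Eq. (7)."""
    pr = precision * recall
    return pr / (precision + recall - pr)


# Reported MSDAE scores: UA = 87.32% (taken as precision), PA = 90.96% (taken as recall).
ua, pa = 0.8732, 0.9096
f1 = f1_score(ua, pa)
iou = iou_from_pr(ua, pa)
print(f"F1 = {f1:.4f}, IOU = {iou:.4f}")
```

Under these assumptions the values come out at about 0.891 and 0.803, in line with the reported 89.10% F1 score and 80.35% IOU. The two measures are also tied together by IOU = F1 / (2 − F1), so they always rank models in the same order.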

Metropolitan Area Analysis
To study the aggregation and spatial distribution of BUA in the metropolitan area of Wuhan, the extraction result of BUA is postprocessed by small patch removal, hole filling, and clipping. The processed result is shown in Fig. 7. The figure shows that the BUA concentrates in the center of the city and disperses outward. The Wuhan metropolitan area was divided into the core metropolitan area, subcore metropolitan area, and daily metropolitan area following the 0.5-, 1-, and 1.5-h isochronous circles. Taking the circles as the research objects, we compare and analyze the area of BUA and the population of the administrative regions among the circles. According to the statistics, 27.2714 million permanent residents lived in the 49,890.29 km² Wuhan metropolitan area in 2016. Our result reports 56,439 km² of BUA in the metropolitan area of Wuhan. The proportions of district area, BUA, and permanent resident population, together with the population density, are shown in Table 5. The core metropolitan area accounts for 13.63% of the whole metropolitan area but contains 44.02% of the BUA and 32.35% of the permanent residents, with a population density of 1295 people/km²; the proportions in the subcore metropolitan area are relatively balanced, with about 20.41% of the administrative area, about 27.49% of the BUA, about 24.28% of the permanent population, and a population density of 650 people/km². The largest proportion of administrative area and the smallest population density appear in the daily metropolitan area. On these four indicators, the three circles form a "target-shaped" distribution structure radiating outward from the core metropolitan area.
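The reported densities can be cross-checked from the totals and the percentage shares quoted in the text, as in the sketch below. The shares for the daily metropolitan area are not quoted, so they are derived here as the remainder, under the assumption that the three circles partition the metropolitan area; that derived density is an estimate and not a figure from Table 5.

```python
TOTAL_AREA_KM2 = 49_890.29     # metropolitan area, from the text
TOTAL_POPULATION = 27_271_400  # 27.2714 million permanent residents, from the text

# Area and population shares quoted in the text.
SHARES = {
    "core": {"area": 0.1363, "pop": 0.3235},
    "subcore": {"area": 0.2041, "pop": 0.2428},
}
# Assumption: the daily metropolitan area takes whatever share remains.
SHARES["daily"] = {
    "area": 1.0 - sum(s["area"] for s in SHARES.values()),
    "pop": 1.0 - sum(s["pop"] for s in SHARES.values()),
}


def density(area_share, pop_share):
    """Population density (people per km^2) implied by the two shares."""
    return (pop_share * TOTAL_POPULATION) / (area_share * TOTAL_AREA_KM2)


densities = {name: density(s["area"], s["pop"]) for name, s in SHARES.items()}
```

The core and subcore values come out near 1297 and 650 people/km², close to the reported 1295 and 650; the small gap for the core is what one would expect from rounding in the quoted percentages. The derived daily value is the lowest of the three, consistent with the statement that the smallest density appears in the daily metropolitan area.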
A corridor is a linear area that spans urban and rural regions with transportation infrastructure. 40 From the perspective of regional economics, a corridor is a linear system connected by transportation. It is a regional economic space system formed by a highly developed multimodal transportation network connecting two or more large and medium-sized cities or urban agglomerations. 41,42 Transport lines in the metropolitan area of Wuhan, such as major roads, railways, and rivers, are distributed outward with Wuhan as the core. They extend northward to Xiaogan and Suizhou and southward to Xianan, respectively, along the Jingguang and Handan railways; along the Huyu expressway and the Yangtze river, they extend eastward to Huanggang and Huangshi and westward to Xiantao and Tianmen. On this basis, five corridors are identified, as shown in Fig. 8: Wuhan to Huanggang, Wuhan to Xiaogan and then to Suizhou, Wuhan to Ezhou and then to Huangshi, Wuhan to Xiantao and then to Tianmen, and Wuhan to Xianan and then to Chibi. The spatial distributions of BUA and transportation infrastructure are highly consistent. Traffic corridors shorten the travel time between cities and give people a wider space for activities and development. At the same time, they promote the sharing of various resources between cities and the optimization of resources in each functional area. They carry the flow of people, material, and capital for the development of metropolitan areas and form the axes of urban development. The construction of traffic corridors will promote the process of urbanization and the outward expansion of BUA.

Conclusions
In this research, we proposed an MSDAE method for extracting BUA from GF-1 images, extracted the BUA of the metropolitan area of Wuhan, and analyzed the area of BUA, population, and population density based on the extraction result; we also identified five corridors from the distribution of BUA and transport lines. The conclusions are summarized as follows.
1. To handle the concentrated distribution of BUA together with its large number of fragmented features, the MSDAE model is built by taking full advantage of the significant difference in the spatial pattern of features at different scales. MSDAE learns the features of land objects at the four scales of 3 × 3, 7 × 7, 15 × 15, and 25 × 25, identifies the types of land objects at the four scales, and then determines the classification result by a logistic method. By using multiscale detection windows, building a multilevel feature structure, and optimizing the merge rules, MSDAE can reduce the interference of similar land objects without losing the detailed information of BUA and avoid the salt-and-pepper noise generated by traditional pixel classification. Therefore, MSDAE effectively improves the recognition rate of BUA. The MSDAE result is superior to the single-scale results, and MSDAE reaches higher scores than the other models compared. Although our proposed method performs well, several issues remain to be resolved in future work. BUA has distinctive features in color space, remote sensing indices, and other aspects. How to combine such multiple features with multiscale features to further improve the extraction accuracy of BUA and the robustness of the model needs further study.
2. Taking the metropolitan area of Wuhan as the study case, the range of the metropolitan area is the 1.5-h isochronous circle calculated from Tencent map big data. The BUA of the Wuhan metropolitan area is extracted by the MSDAE model from GF-1 WFV images. The metropolitan area of Wuhan is divided into three layers: core metropolitan area, subcore metropolitan area, and daily metropolitan area. By calculating the proportions of the administrative area, BUA, and permanent population, together with the population density, we find that the metropolitan area of Wuhan has a target-shaped distribution structure radiating outward from the core metropolitan area.
3. Through the overlay analysis of city corridors and the BUA of the metropolitan area of Wuhan, it can be seen that the spatial distribution of the BUA is consistent with the spatial distribution of traffic lines; based on this, five corridors are identified: Wuhan–Huanggang, Wuhan–Xiaogan–Suizhou, Wuhan–Ezhou–Huangshi, Wuhan–Xiantao–Tianmen, and Wuhan–Xianan–Chibi. Corridors are conducive to the optimization and sharing of resources among metropolitan areas, can promote urbanization, and are of great significance to the development of metropolitan areas.