3D autoencoder algorithm for lithological mapping using ZY-1 02D hyperspectral imagery: a case study of Liuyuan region

Abstract. A hyperspectral image (HSI) contains hundreds of spectral bands, which provide detailed spectral information, thus offering an inherent advantage in classification. The successful launch of the Gaofen-5 and ZY-1 02D hyperspectral satellites has promoted the need for large-scale geological applications, such as mineral and lithological mapping (LM). In recent years, following the success of computer vision, deep learning methods have shown their advantage in solving the problem of hyperspectral classification. However, the combination of deep learning and HSI to solve the problem of geological mapping is insufficient. We propose a new 3D convolutional autoencoder for LM. A pixel-based and cube-based 3D convolutional neural network architecture is designed to extract spatial–spectral features. Traditional and machine learning methods are employed as competing methods, trained on two real hyperspectral datasets, and evaluated according to the overall accuracy, F1 score, and other metrics. Results indicate that the proposed method can provide convincing results for LM applications on the basis of the hyperspectral data provided by the ZY-1 02D satellite. Compared with traditional methods, the combination of deep learning and hyperspectral can provide more efficient and highly accurate results. The proposed method has better robustness than supervised learning methods and shows great promise under small sample conditions. As far as we know, this work is the first attempt to apply unsupervised spatial–spectral feature learning technology in LM applications, which is of great significance for large-scale applications.

1 Introduction its greater advantage in expressing of different types of geological bodies. Although geological mapping based on airborne hyperspectral has been performed for many years, the high cost of data acquisition has made the application and promotion of this technology more difficult. Benefiting from the development of hyperspectral satellite technology, Gaofen-5, 6,7 ZY-1 02D 8 has been successfully launched, thus providing sufficient data guarantee for large-scale LM. 9 Both Gaofen-5 and ZY-1 02D have a width of 60 km, a spatial resolution of 30 m, and hundreds of spectral bands, which are helpful for developing accurate, efficient, and low-cost geological mapping applications.
In the past few decades, various methods based on HIS have been proposed for geological mapping. Traditional lithology and mineral mapping methods can be summarized into three categories: image enhancement methods, spectral feature analysis methods, and object-oriented-based methods. Image enhancement methods such as minimum noise fraction rotation (MNF), 10,11 principal component analysis (PCA), 12 and band ratio 13,14 method aim to enhance the relevant feature of litho-units through transformation processing such as dimensionality reduction. Such methods are simple and effective, but they are mainly used to enhance the expression of lithology-related features and cannot directly achieve classification. Spectral feature analysis methods can be further subdivided into two types, spectral feature extraction (SFE) methods [15][16][17] and spectral matching (SM) methods. 18,19 As a method with clear physical meaning, the SFE method is based on the analysis of the diagnostic spectral features of typical mineral or litho-units and artificially identifies rules to achieve LM. However, because the algorithm mainly uses manual identification rules to implement LM, it is less efficient when applied to large-scale, multiple targets, and complex scenarios. 20 In the definition of the SM methods, rock types are distinguished by comparing the consistency of the target spectrum with that of the reference spectrum. Typical SM methods, such as spectral angle mapper (SAM) 21,22 and spectral information divergence (SID) 23 are the most commonly used. The unmixing method can also be regarded as an extension of this type of method. 24 Compared with the SFE methods, it is easier to implement, but it is not sensitive enough to minor diagnostic spectral signatures, 25 and the choice of reference spectrum is also important to accuracy. Object-wise methods should apply superpixel segmentation, 26,27 such as simple linear iterative clustering to the target images. Although this kind of method overcomes the salt-and-pepper effect to some extent, the recognition accuracy is still affected by the initial super-pixel precision and the insufficient utilization of spectral features. The above-mentioned traditional methods have their own advantages, but the problems of insufficient spatial-spectral feature combination, weak feature extraction ability, and low efficiency are difficult to avoid.
With advances in machine learning technology, a series of learning-based methods have been proposed for LM, such as support vector machine 28,29 and random forest. 30 Compared with traditional methods, the learning-based methods can capture more effective features through supervised learning, thereby overcoming the problem of artificial threshold setting in complex scenes. In recent years, deep learning technology has achieved remarkable progress in hyperspectral application. 4,31,32 However, few application cases of LM using deep learning methods combined with HSI. As shown in Refs. 27 and 33, a convolutional neural network (CNN) was used to address LM problems based on HSI in the supervised scenario, thereby providing a good case for the application of deep learning in LM. Considering that the accuracy of supervision-based classification algorithms is constrained by the amount and representativeness of training samples and that applying locally trained models to large-scale and complex scenarios is difficult, we believe that using self-learning methods to solve LM problem is a better choice. Most recently, the autoencoder-like architecture has been developed in hyperspectral unmixing and became a new trend in self-learning methods. In Refs. 34-36, denoising and sparseness autoencoder are introduced to estimate the abundance of endmembers. In Refs. 37 and 38, the 3D-CNN autoencoders are further employed for hyperspectral unmixing to improve the classification accuracy. Inspired by these autoencoder applications, we introduce a novel end-to-end 3D convolutional autoencoder for LM. The encoder with pixel-based or cube-based CNNs is proposed to explore the spatial-spectral contextual features of HSI, and the decoder is designed to estimate the endmember of the litho-units. The novelty of this method lies in the use of a 3D autoencoder, which implements LM through self-learning and avoids the use of large training sets. In our experiments, SAM, SID, and a simple CNN network are used as competing methods on the same data set and trained using the same parameter settings as the proposed method. Overall accuracy, F1-score, and other metrics are employed to evaluate the performance. As far as we know, this work is the first attempt to apply unsupervised spatial-spectral feature learning technology in LM application based on HSI.
The remainder of this paper is organized as follows: The proposed method is described in Sec. 2. The working area and data are described in Sec. 3. Experiments and results are presented in Sec. 4, and the conclusion and discussions are presented in Sec. 5.

Problem Formulation
Geological bodies and formations formed in diverse geological environments show differences in mineral composition, weathering characteristics, and alteration, thereby leading to distinct spectral signatures in hyperspectral data. Reasonably, the LM problem can be considered as an unmixing problem, which aims to estimate the proportions of each spectral endmember that representing certain litho-units. The formula can be expressed as (1) in which M is indicative of a mixture pixel of reflectance, E represents the endmembers of lithounits, A denotes their proportions, N is the additive vector, and Ψ represents the implicit nonlinear function applied to the linear transform. The problem investigated in this paper is the estimation of the proportion matrix A by given the endmembers matrix E. It is worth noting that the proportion non-negative and sum-to-one constrain are two physical constraints 39 that restrict the model from estimating the correct result following the physical rules. The former demonstrates all elements of the estimation result, which indicates that the proportions of each litho-unit must be nonnegative, and the latter requires that the sum of proportion in each pixel equals one. Autoencoder is an unsupervised training network composed of an encoder and a decoder, which can learn the data patterns and reconstruct input information with a minimum reconstruction error. Under certain constraints, the decoder part can be regarded as a reconstruction of HSI based on pure litho-units matrix and their proportions matrix. Hence, given the spectral endmembers of typical litho-units as the weight of the reconstructed layer, the proportion matrix of each litho-unit can be estimated as the activations of the last hidden layer for each input spectrum.

3D-CNN Autoencoder
Compared with other remote sensing data, hyperspectral data contains richer information in channel dimension. Thus, we used 3D convolution in hyperspectral data processing, which can be expressed as E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 2 ; 1 1 6 ; 2 2 9 v xyz lf ¼ σ in which v xyz lf represents the value of a unit at position ðx; y; zÞ on the f'th feature map in the l'th layer; m indexes the sets of the feature map in the preceding layer ðl − 1Þ; H k , W k , C k denote the height, width, and channel of the kernel, respectively; w hwd lfm stands for the weight at position ðh; w; dÞ connected to the f'th feature map; and b and σ are the bias and the activation function, respectively.
Inspired by the application of autoencoder for spectral unmixing problems, we introduce a pixel-based 3D convolutional autoencoder (PBA) and a cube-based 3D convolutional autoencoder (CBA). The main architectures of PBA and CBA are the same, but the inputs are different. The PBA takes a spectrum vector at each pixel as input, whereas the CBA takes the hyperspectral cube ðS × S × CÞ as input data to obtain joint spatial-spectral information, where S denotes the spatial window size and C is the number of spectral bands. In theory, the size of S is not fixed, it can be set according to the size of the data. In this case, S is set to 5. Generally, as an end-to-end network, the input and output of the autoencoder are the same, as in PCA. However, the output of the CBA is the central pixel of the input hyperspectral cube, as shown in Fig. 1. Similarly, the number of convolutional layers of the model's encoder is also changeable. The focus of this paper is to use the autoencoder to solve the LM problem, which is why a general encoder architecture is adopted in this case. Considering that several high-computational-cost fully connected layers (FC layers) are present in the network, we adopted five 3D convolutional layers as the encoder in this case, which can maintain sufficient feature extraction capabilities without increasing the computational burden as much as possible.
As shown in Fig. 2, an encoder with five 3D convolutional layers is designed to extract the spatial and spectral information of HSI. For the decoder part, we first use two FC layers to increase the nonlinearity of the model, and then use another FC layer to reconstruct the input spectrum. To follow the non-negative and sum-to-one constrain mentioned earlier, the absolute transformation (Abs in Table 1) and softmax activation is added before the final FC layer of the decoder. The detailed parameters of CBA are shown in Table 1.

Comparison Model
SAM and SID, which are two traditional spectral feature analysis methods, are employed as comparison methods. A supervised model with a basic CNN architecture is also used to evaluate the performance of the proposed model. Through a comparison of the angle difference between the spectrum of each pixel and the pure litho-units, the closest one is selected as the classification result. For the supervised model of deep learning in this paper, we use a basic CNN structure to obtain the probability of each category. The architecture is shown in Table 2. The network is Cube HSI 1×1×3 1×1×3 1×1×3 f mainly composed of two convolution layers and two FC layers. The former is responsible for extracting the image feature and the latter is responsible for flattening the feature to 1D. Moreover, the dropout technique is utilized to prevent overfitting.

Loss Function
We use SID as the loss function of our unsupervised methods and use categorical cross-entropy as the supervised methods' loss function. SID measures the difference between the input spectrum vector and the reconstructed spectrum by the decoder. It is a measure of the similarity evaluation of two spectral curves by using the relative entropy of spectral information. The SID of input spectrum x and reconstructed spectrum y can be expressed as E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 3 ; 1 1 6 ; 3 5 3 SIDðx; yÞ ¼ DðxkyÞ þ DðykxÞ: Cross-entropy is used to evaluate the difference between the predicted result and the reference ground truth (GT) in the supervised methods. The formula can be expressed as E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 4 ; 1 1 6 ; 2 9 7 in which N represents the number of samples, M represents the number of categories, and y ic is an indicator variable (0 or 1), which is 1 if the category c is the same as the category of sample i and 0 otherwise. p ic denotes the predicted probability that the sample i belongs to category c.

Evaluation
The performance of the proposed model is quantitatively measured according to the agreements and differences between the predicted results and GTs. The most common metrics overall accuracy, recall, F1-score, and precision were used as the evaluation index to evaluate the compared methods. For reference, a general analysis of the accuracy metrics for classification tasks can be found in Ref. 40. These metrics are defined as follows: Overall Accuracy where TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative, respectively.

Working Area
The study area is located in Liuyuan town, northwestern Gansu Province, which is located in the middle east of the Dongtianshan-Beishan metallogenic belts, the southern margin of the Central Asian Orogenic Belt. In its long history of geological development, the Liuyuan area has experienced complex tectonic movements and magmatic activities and has excellent metallogenic conditions. The geological composition of the study area is relatively simple. According to the 1:50000 geological map, the main litho-units of the working area can be divided into five categories. Hercynian acidic intrusive rocks with different compositions occupy 50% of the study area. The Ordovician Huaniushan Formation is the main strata in the study area, which is mainly composed of plagioclase, basalt, and metamorphic sandstone. However, due to the small scale of the reference geological map, it cannot accurately reflect the edge of the geological bodies.
With the aid of the high spatial-resolution and high spectral-resolution of ZY-1 02D's images, the boundaries of different geological bodies can be observed in true-color images (Fig. 3). Further, in the HSI after MNF transformation, the difference between litho-units is enhanced and can be better distinguished through color and texture. Finally, in reference to the geological map, the final GT is labeled through manual interpretation based on MNF-transformed data. Five typical litho-unit's spectrum were collected as the endmember matrix for the extraction of the proportion matrix.

Data Acquisition
We apply our method to two hyperspectral datasets with GT to evaluate the performance of the proposed method. The Urban dataset is an airborne HSI obtained by HYDICE and is widely used in classification research. It contains 307 × 307 pixels and 210 bands in the range from 0.4 to 2.5 μm.
Forty-eight bad bands are removed, and the remaining 162 bands are used for classification. The data contain six endmembers, namely, asphalt, grass, tree, roof, metal, and dirt.
ZY-1 02D was launched on September 12, 2019, and is China's first self-built commercial hyperspectral satellite. ZY-1 02D will play an important role in large-scale monitoring and quantitative application by virtue of its wide spectrum range and high spatial and spectral resolution characteristics. ZY-1 02D carries a visible and near-infrared (VNIR) multi-spectral imager and a hyperspectral sensor. As shown in Table 3, it covers a range of 0.4 to 2.5 μm and has 166 spectral bands. The spectral resolution is 10 nm for VNIR and 20 nm for shortwave infrared (SWIR). The spatial resolution is 30 m and the swash width is 60 km.

Data Processing
The experimental ZY-1 02D images were obtained in the Liuyuan area of Gansu Province on February 7, 2020. Generally, before the mineral information extraction step, the original hyperspectral data needs to be processed into reflectance data. The preprocessing of ZY-1 02D mainly includes the following steps: 1. Removal of bad bands and overlapping bands. 2. Conversion of DN to radiance by using the absolute calibration coefficient. 3. Strip noise removal by using spectral moment matching methods. 4. Correction of radiance data to reflectance by using the FLAASH atmospheric correction model. 5. Geometric correction and orthorectification.

Experimental Setting
The Urban and the ZY-1 02D datasets with GT are used in this experiment to evaluate the classification performance of different methods. To test and verify the robustness of the proposed method, we randomly select 1/10, 1/50, and 1/100 of the original data to construct three datasets while considering the balance of the sample number of each category. In each dataset, 70% of  these data is used as training data, and 30% is used as validation data. The experiment was performed in a TensorFlow (1.13.1) framework on an NVIDIA Tesla V100 GPU and optimized by the adaptive moment estimation (Adam) algorithm (initial learning rate as 0.001). During the training process, the SID objective function is used in the CBA and PBA models, whereas the categorical cross-entropy objective function is used in the basic CNN model. Moreover, 30 epochs are sufficient for training, and the batch size was set to 32.

Urban
Evaluating proposed methods on public datasets provides more convincing results. The classification performance of different methods on the Urban datasets is shown in Fig. 4. Evidently, the performance of the learning-based methods is better than those of the traditional SID and SAM methods, which have obvious misclassification issues. Interestingly, SID, which is used as the loss function in learning-based methods, obtains the worst result among all methods [ Fig. 4(b)]. This finding illustrates that the importance of feature learning ability of convolutional layers for classification. Based on the visual evaluation, it is difficult to distinguish which is the best among CNN, PBA, and CBA. To quantitatively evaluate the performance of each model in the classification task, we adopted precision, recall, overall accuracy, and F1-score as the evaluation metrics. The data in Table 4 clearly show that the proposed PBA and CBA methods are better than the other comparison methods on all metrics, under all kinds of sampling ratio conditions. Moreover, as shown in Fig. 6, as the sampling ratio decreases, the F1-score of the CNN method drops significantly, whereas the performance of the proposed method is relatively stable. This finding illustrates the potential of unsupervised feature learning methods in classification applications based on a small sample.

ZY-1 02D
The ZY-1 02D HSI from Liuyuan is introduced to further evaluate the performance of comparison methods for the application of LM. Figure 5 shows the reference endmember spectrum of litho-units and endmember spectrum reconstructed by the proposed method. The overall shape of the reconstructed spectrum is consistent with that of the original spectrum. Although a slight difference in amplitude is observed after 2000 nm, the position of the absorption feature is relatively consistent. Correct reconstruction of the original data indicated that the autoencoder effectively extracted the spatial-spectral features from the HSI. We first compared the performance of CNN and autoencoder methods under different sampling ratios. In Fig. 6, we can see that, as the sampling ratio decreases, the F1-score of the CNN method presents the same decline pattern in the Urban dataset. When the sampling ratio is set to 1/100, the F1-score of CNN is lower than 0.55, whereas the F1-score of the proposed methods is stable at around 0.65 (Fig. 7).
As we mentioned earlier, the composition and formation environments are complex, resulting in variations in the spectrum. Therefore, obtaining enough representative samples to achieve LM through supervised learning methods is difficult. The application of supervised learning methods  to small data is more likely to lead to overfitting, thereby leading to difficulty in maintaining robustness in larger and more complex scenarios. Table 5 shows that when the sampling ratio is set to 1:50, the classification results of the proposed method are nearly equal to the results of the CNN method. However, CNN has obvious misclassification in the prediction of categories with a small volume of samples [ Fig. 8(d)]. Figure 9 shows that the prediction accuracy of the CNN method differs for each category, whereas the CBA model has relatively good robustness in the prediction of all categories.
In this case of LM, the performance of CBA is slightly better than that of PBA (Table 5, Fig. 9) because CBA takes the HSI cube as input to obtain more spatial information. We believe that CBA is more suitable for scenes with larger classification targets; otherwise, there is no guarantee that its performance will be better than that of PBA (Table 4). Although the proposed method still has room for improvement in classification accuracy due to the inaccuracy of manual labeling, it has better robustness than the supervised learning method. Combined with  the high-quality HSI of ZY-1 02D, the proposed method shows great potential in the large-scale applications under small sample conditions.

Conclusion
In this work, we present a new 3D convolutional autoencoder for LM. The pixel-based and cubebased 3D convolutional architecture is designed as an encoder to extract spatial-spectral features. An FC layer with non-negative and sum-to-one constrain is employed to extract the endmember of the litho-units. The traditional methods SAM and SID and machine learning methods CNN are employed as competing methods and trained on both airborne and spaceborne hyperspectral datasets. The experimental results indicate that the proposed method can provide convincing results for LM applications on the basis of hyperspectral data provided by the ZY-1 02D satellite. Compared with traditional methods, the combination of deep learning and hyperspectral data can provide more efficient and highly accurate results. The proposed method has better robustness than the supervised learning methods and shows great promise under small sample