Tackling the over-smoothing problem of CNN-based hyperspectral image classification

Abstract. Convolutional neural networks (CNNs) are among the most important deep neural networks for analyzing visual imagery. However, most CNN-based methods suffer from over-smoothing at boundaries, which is unfavorable for hyperspectral image classification. To address this problem, a spectral-spatial multiscale residual network (SSMRN), which fuses separately learned deep spectral features and deep spatial features, is proposed to significantly reduce over-smoothing and effectively learn the features of objects. In the implementation of the SSMRN, a multiscale residual convolutional neural network is proposed as the spatial feature extractor, and a band grouping-based bi-directional gated recurrent unit is utilized as the spectral feature extractor. Considering that the importance of spectral and spatial features may vary with the spatial resolution of images, we combine both features with two weighting factors whose initial values differ and which are adaptively adjusted during network training. To evaluate the effectiveness of the SSMRN, extensive experiments are conducted on public benchmark data sets. The proposed method retains the detailed boundaries of different objects and yields competitive results compared with several state-of-the-art methods.


Introduction
With the rapid development of remote sensing imaging spectroscopy technology, hyperspectral images (HSIs) have become increasingly important in Earth observation due to their rich spectral and spatial information. Classification is an important technique for HSI data exploitation. HSI classification (HSIC) is the task of assigning each pixel a proper land-cover label, 1 which is challenging because of the high dimensionality, spectral heterogeneity, and complex spatial distribution of the objects. 2 To alleviate these problems, traditional HSIC methods involve two steps: (1) feature selection and extraction, 3 which relies on feature engineering skills and domain expertise to design human-engineered features; and (2) classifier training, where a classifier is an algorithm that automatically categorizes data into one or more of a set of classes. However, traditional HSIC approaches use handcrafted features to train the classifier, and these features may generalize poorly to real data. It is therefore difficult to balance robustness and discriminability, as the optimal feature set varies considerably between data sets. 4 Deep neural networks (DNNs) can automatically learn features from data in a hierarchical manner, constructing a model with growing semantic layers until a suitable representation is achieved. 5 To overcome the issue of high intraclass variability and high interclass similarity in HSI, stacked autoencoders [6][7][8] and deep belief networks 9,10 were introduced as accurate unsupervised methods to extract layerwise trained deep features. However, their standard fully connected (FC) architecture imposes a feature flattening process before classification, leading to the loss of spatial-contextual information. 11 In contrast, convolutional neural networks (CNNs) can automatically extract spectral-spatial features from the raw input data.
Recurrent neural networks (RNNs) process the spectral information of HSI data as a time sequence, treating the spectral bands as time steps. There are three basic RNN models: (1) vanilla RNN, (2) long short-term memory (LSTM), and (3) gated recurrent unit (GRU). A large number of CNN- or RNN-based methods have been proposed for end-to-end modeling; they can handle HSI data in the spectral and spatial domains individually or in a coupled fashion. 12 For instance, Yang et al. 13 designed a CNN model with a two-branch architecture to learn spectral and spatial features jointly. Zhong et al. 14 proposed an end-to-end three-dimensional (3D) residual CNN architecture for spectral-spatial feature learning and classification. Motivated by the attention mechanism of the human visual system, a residual spectral-spatial attention network (RSSAN) 15 was proposed for HSI classification. To reduce computations, fully convolutional networks were proposed for HSIC. 16 To correctly discover the contextual relations among pixels, the graph convolutional network, originally designed for arbitrarily structured non-Euclidean data, was adopted for HSIC. 17 The morphological operations, i.e., erosion and dilation, are powerful nonlinear feature transformations. Inspired by these, an end-to-end morphological CNN (MorphCNN) 11 was introduced for HSIC by concatenating the outputs from spectral and spatial morphological blocks extracted in a dual-path fashion. To represent high-level semantic features well, a spectral-spatial feature tokenization transformer (SSFTT) method 18 was proposed to capture spectral-spatial features and high-level semantic features. Keeping in view the sequential property of HSI for determining class labels, an RNN-based HSIC framework with a novel activation function (parametric rectified tanh) and GRU was proposed. 19
The work 20 proposed a spectral-spatial LSTM-based network that learns spectral and spatial features of HSI by utilizing two separate LSTM networks followed by softmax layers for classification, while a decision fusion strategy is implemented to obtain joint spectral-spatial classification results. In the literature, several works have proposed a joint CNN-RNN architecture for HSIC. The spatial-spectral unified network (SSUN) 2 combined a spectral-dimensional band grouping-based LSTM model with a 2D CNN for spatial features and integrated the spectral feature extraction (FE), spatial FE, and classifier training into a unified neural network. In the spectral-spatial attention network (SSAN), 21 an RNN with attention learns inner spectral correlations within a continuous spectrum, while a CNN with attention is designed to focus on saliency features and spatial relevance between neighboring pixels in the spatial dimension. The work 22 integrated a CNN with a bidirectional convolutional LSTM (CLSTM), in which a 3D CNN model captures low-level spectral-spatial features and the CLSTM recurrently analyzes this low-level spectral-spatial information.
CNNs are commonly applied to analyze visual imagery. 23 Most of the above methods are based on the CNN backbone and its variants. However, most CNN-based methods suffer from over-smoothing at boundaries, which is unfavorable for HSIC. DNNs are also usually prone to overfitting 24 and sensitive to perturbations, 25 and deep learning methods usually require a large number of training samples. 26,27 To significantly reduce the over-smoothing effect and effectively learn the features of objects, a multi-task learning spectral-spatial multiscale residual network (SSMRN) is proposed for end-to-end HSIC. The contributions can be summarized as follows: 1. An end-to-end SSMRN is designed by fusing two separately learned deep spectral features and deep spatial features to extract spectral-spatial features for HSIC. The model yields competitive results under different training sample conditions. 2. The proposed framework takes the weight between spectral and spatial features into consideration, which increases the influence of the current pixel and reduces over-smoothing. Meanwhile, multi-task learning technology is integrated into the framework, improving the stability of the results.
The rest of the paper is organized as follows. First, Sec. 2 introduces the preliminary knowledge of CNNs, residual networks, and RNNs. The proposed architecture along with the design methodology is introduced in Sec. 3. Next, experimental data sets and results are given in Sec. 4. Then, the impact of the SSMRN architecture on classification results is analyzed in Sec. 5. Finally, Sec. 6 concludes the paper with a summary of the proposed method and the scope of future work.

Preliminary
In this section, we mainly recall the background information on CNN, residual networks, and RNN.

Convolutional Neural Network
A CNN 28 is a class of DNNs, most commonly applied to analyzing visual imagery. Three main types of layers are used to build CNN architectures: the convolutional layer, the pooling layer, and the FC layer. Compared with multilayer perceptron neural networks, CNNs are easier to train because of their parameter sharing scheme and local connectivity.
While CNN-based methods have achieved large improvements in HSIC, they usually suffer from severe over-smoothing at edge boundaries. There are two major reasons: (1) the scales of the supervised information and the spatial features do not match: the supervised information of HSIC is pixel-level, whereas the spatial features are extracted from the neighborhood of the current pixel; and (2) the parameter sharing scheme means the spatial features are extracted for the patch instead of the current pixel. Both reasons lead to an insufficient influence of the current pixel in classification. Attention mechanisms can counteract the effects of parameter sharing 15,21 but increase the amount of computation. A smaller patch size also decreases the possibility of over-smoothing 2 but results in insufficient extraction of spatial information and lower classification accuracy (CA). 29 Another approach is to utilize superpixel segmentation, 17 but the segmentation algorithm then affects the classification results.
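The pixel-versus-patch mismatch described above can be made concrete: a label supervises a single pixel, yet the spatial branch sees a k × k neighborhood around it. The sketch below extracts such a patch with NumPy (zero-padding at the image border is our assumption; the paper does not specify a border strategy):

```python
import numpy as np

def extract_patch(X, row, col, k):
    """Extract the k x k spatial neighborhood around pixel (row, col).

    X: (r, c, p) dimension-reduced HSI cube. Border pixels are handled
    by zero-padding, one common convention (an assumption here).
    """
    pad = k // 2
    Xp = np.pad(X, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    return Xp[row:row + k, col:col + k, :]

X = np.random.rand(10, 12, 4)      # toy cube: 10 x 12 pixels, 4 components
patch = extract_patch(X, 0, 0, 5)  # pixel-level label, patch-level features
print(patch.shape)                 # (5, 5, 4)
```

The label describes only the center pixel `X[0, 0]`, while all 5 × 5 × 4 values in `patch` influence the prediction, which is exactly the scale mismatch discussed above.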

Residual Networks
A residual network is an effective extension to CNNs that has been empirically shown to increase performance in ImageNet classification. It does this by utilizing skip connections to jump over some layers. As shown in Fig. 1, the typical residual block is implemented with double-layer skips that contain nonlinearities. The skip connections add the outputs from previous layers to the outputs of the stacked layers. One motivation for skipping over layers is to avoid the problem of vanishing gradients by reusing activations from a previous layer until the adjacent layer learns its weights. 30 Skipping effectively simplifies the network, using fewer layers in the initial training stages. The residual block is easy to understand and optimize, can be stacked to any depth, and can be embedded in any existing CNN.
Fig. 1 Architecture of a typical residual block.
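A minimal numeric sketch of such a double-layer residual block follows. It uses dense weight matrices for brevity (the networks in this paper use convolutional layers), but shows the defining property: the block computes F(x) + x via the identity skip connection:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, b1, W2, b2):
    """Double-layer residual block: relu(F(x) + x) with an identity skip.

    Because the input x is added back, the stacked layers only have to
    learn the residual F(x); if F collapses to zero, the block passes
    the activation through almost unchanged.
    """
    out = relu(W1 @ x + b1)   # first nonlinear layer
    out = W2 @ out + b2       # second layer (pre-activation)
    return relu(out + x)      # add skip connection, then final nonlinearity

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)
W1 = rng.normal(size=(d, d)) * 0.1
W2 = rng.normal(size=(d, d)) * 0.1
b1, b2 = np.zeros(d), np.zeros(d)
y = residual_block(x, W1, b1, W2, b2)
```

With all weights zeroed, the block reduces to relu(x): the identity path dominates, which is why gradients can still flow early in training.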

Recurrent Neural Network
RNNs allow us to operate over sequences of input, output, or both. This makes them applicable to challenging tasks involving sequential data, such as speech recognition and language modeling. LSTM and GRU 31 were introduced to learn long-term dependencies and alleviate the vanishing/exploding gradient problem. The two architectures have no fundamental differences from each other, but they use different functions to compute the hidden state. LSTM is strictly more expressive than GRU, as it can easily perform unbounded counting, whereas GRU has fewer parameters and has been shown to perform better on certain smaller and less frequent data sets. Bi-directional RNNs (Bi-RNNs) utilize a finite sequence to predict or label each element of the sequence based on its past and future contexts, as shown in Fig. 2. A Bi-RNN concatenates the outputs of two RNNs, one processing the sequence from left to right and the other from right to left.
Hyperspectral data usually have hundreds of bands, so pixel classification in HSI can be treated as a many-to-one task: given the sequence of bands of a pixel, predict the class of that pixel. A natural idea is to consider each band as a time step. However, the resulting long input sequence can lead to overfitting and consumes considerable computing and storage resources. In addition, the large number of spectral channels and limited training samples restrict the performance of HSIC. 26

Proposed Framework
The deep networks used for HSIC are divided into spectral-feature networks, spatial-feature networks, and spectral-spatial-feature networks. To effectively learn the features of objects, we utilize the spectral-spatial-feature networks to extract joint deep spectral-spatial features for HSIC. The joint deep spectral-spatial features are mainly obtained in one of the following three ways: 32 (1) mapping the low-level spectral-spatial features to high-level spectral-spatial features via deep networks; (2) directly extracting deep features from the original data or several principal components of the original data; and (3) fusing two separately learned deep spectral features and deep spatial features. Considering that the importance of spectral and spatial features may vary with the spatial resolution of images, we adopt the third way, fusing two separate deep features, to conveniently adjust the influence of different features on the classification results.
Three components play crucial roles in our methodology: a multiscale residual CNN (MRCNN)-based spatial feature learner, a bi-directional GRU (Bi-GRU)-based spectral feature learner, and a multi-task learning model that combines both features with two weighting factors.

Multiscale Residual CNN for Spatial Classification
The proposed MRCNN architecture is shown in Fig. 3. Let X ∈ R^{r×c×b} be the original HSI data, where r, c, and b are the row number, column number, and band number, respectively. First, to suppress noise and reduce computational costs, principal component analysis is applied to the original HSI data, and only the first p principal components are retained. Denote the dimension-reduced data by X_p ∈ R^{r×c×p}. Around each pixel, a neighbor region of size k × k × p is extracted as the input of the spatial branch. Considering the complex environment of the HSI, where different objects tend to have different scales, we propose to extract both shallow and deep features by applying a convolution layer with rectified linear unit (ReLU) activation and two residual blocks in the classification. Local max pooling layers are adopted in the residual blocks. We add a flatten layer and an FC layer with the same number of neurons after each scale output. Then, these FC layers are merged into a new FC layer. Let h^(j) = f(W^(j) x^(j) + b^(j)), j = 1, 2, 3, denote the jth FC layer, where x^(j) is the flattened feature vector from the jth flatten layer, and W^(j) and b^(j) are the corresponding weight matrix and bias term, respectively. The fourth FC layer h^(4) is calculated as h^(4) = concat[h^(1), h^(2), h^(3)]. In this way, features from different layers are taken into consideration during the classification stage, and the network possesses a multiscale property.
The cross-entropy loss function of MRCNN can be expressed as

L_MRCNN = -(1/M) Σ_{m=1}^{M} Σ_{n=1}^{N} y_m^n log(ŷ_m^n),

where y_m^n and ŷ_m^n denote the true and predicted labels, respectively, M is the number of training samples, and N is the number of classes.
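The cross-entropy loss can be implemented directly from its definition; a minimal NumPy sketch (the eps term guards against log(0) and is a numerical-stability assumption on our part):

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean categorical cross-entropy over M samples and N classes.

    y_true: (M, N) one-hot ground-truth labels.
    y_pred: (M, N) predicted class probabilities (softmax outputs).
    """
    M = y_true.shape[0]
    return -np.sum(y_true * np.log(y_pred + eps)) / M

y_true = np.array([[1.0, 0.0], [0.0, 1.0]])
y_pred = np.array([[0.9, 0.1], [0.2, 0.8]])
loss = cross_entropy(y_true, y_pred)  # -(log 0.9 + log 0.8) / 2
```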

Bi-GRU for Spectral Classification
GRU has fewer parameters than LSTM for modeling various sequential problems, and Bi-GRU allows the sequential vectors to be fed into the architecture one by one to learn continuous features in both forward and backward directions. Therefore, we utilize Bi-GRU for spectral classification.
The complete spectral classification framework is shown in Fig. 4. To reduce computation, a suitable grouping strategy 2 is used in this paper. For each pixel x in the HSI, let x = (λ_1, λ_2, …, λ_j, …, λ_b)^T be the spectral vector, where λ_j is the reflectance of the jth band and b is the number of bands. Let r (≪ b) be the number of time steps (i.e., the number of groups). The transformed sequences can be denoted by (c_1, c_2, …, c_t, …, c_r), where c_t is the sequence at the tth time step. Specifically, the grouping strategy is

c_t = (λ_{(t-1)m+1}, λ_{(t-1)m+2}, …, λ_{tm})^T, t = 1, 2, …, r,

where m = floor(b/r) is the sequence length at each time step and the floor(·) function rounds numbers down. After grouping, the spectral vector x is transformed into the sequences (c_1, c_2, …, c_t, …, c_r).
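Assuming contiguous grouping (one natural reading of the strategy; the leftover b − rm bands are simply dropped in this sketch, which is our assumption), the band grouping can be written as:

```python
import numpy as np

def group_bands(x, r):
    """Split a b-band spectral vector into r sequences of length m = floor(b/r).

    Contiguous grouping is assumed: group t holds bands (t-1)*m .. t*m - 1
    (0-based). Any leftover bands beyond r*m are dropped here.
    """
    b = x.shape[0]
    m = b // r                                     # floor(b / r)
    return [x[t * m:(t + 1) * m] for t in range(r)]

x = np.arange(200, dtype=float)   # e.g., a 200-band spectral vector
groups = group_bands(x, 3)        # r = 3 time steps, each of length m = 66
```

Each `groups[t]` then serves as the input c_t at time step t of the Bi-GRU.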
The input to our model is the sequence (c_1, c_2, …, c_t, …, c_r), and the bi-directional hidden vectors are calculated as

Forward hidden state: h_t^(1) = tanh(W^(1) c_t + U^(1) h_{t-1}^(1)),
Backward hidden state: h_t^(2) = tanh(W^(2) c_t + U^(2) h_{t+1}^(2)),

where the coefficient matrices W^(1) and W^(2) act on the input at the present step, U^(1) acts on the hidden state h_{t-1}^(1) at the previous step, U^(2) acts on h_{t+1}^(2) at the succeeding step, and tanh is the hyperbolic tangent. The memory of the input, i.e., the output of this encoder, is

g_t = concat(h_t^(1), h_t^(2)),

where concat(·) concatenates the forward and backward hidden states. The grouping strategy uses the original HSI spectral vector as the feature of the new sequence, and the RNN uses the parameter sharing scheme, so a one-dimensional convolutional residual block is added to reassign the feature weights based on the channel attention mechanism. The predicted label y_i of pixel x_i is then computed as

y_i = V(F_1d(g_1, …, g_t, …, g_r) + (g_1, …, g_t, …, g_r)),

where F_1d(·) is a one-dimensional convolutional layer with stride one and V(·) indicates a series of operations as shown in Fig. 4, including a ReLU activation, a flatten function, an FC layer, and a Softmax activation function.
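The forward and backward recurrences and the concatenation can be sketched with plain tanh cells (the actual model uses GRU cells, which add gating on top of this basic recurrence):

```python
import numpy as np

def bi_rnn_states(seq, W1, U1, W2, U2):
    """Forward/backward hidden states of a bi-directional tanh RNN.

    seq: list of r input vectors c_1..c_r. Returns g_t = concat(h1_t, h2_t)
    for each time step, with zero initial hidden states (an assumption).
    """
    r = len(seq)
    h_dim = W1.shape[0]
    fwd = [np.zeros(h_dim)]
    for t in range(r):                        # left-to-right pass
        fwd.append(np.tanh(W1 @ seq[t] + U1 @ fwd[-1]))
    bwd = [np.zeros(h_dim)]
    for t in reversed(range(r)):              # right-to-left pass
        bwd.append(np.tanh(W2 @ seq[t] + U2 @ bwd[-1]))
    bwd = bwd[1:][::-1]                       # realign to time order
    return [np.concatenate([f, b]) for f, b in zip(fwd[1:], bwd)]

rng = np.random.default_rng(1)
m, h = 4, 6                                   # group length, hidden size
seq = [rng.normal(size=m) for _ in range(3)]  # r = 3 time steps
W1, W2 = rng.normal(size=(h, m)), rng.normal(size=(h, m))
U1, U2 = rng.normal(size=(h, h)), rng.normal(size=(h, h))
g = bi_rnn_states(seq, W1, U1, W2, U2)        # each g_t has 2h dimensions
```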

SSMRN
The proposed SSMRN framework is shown in Fig. 5. It starts with two branches that learn the spatial and spectral features, respectively. Then, the two branches are concatenated into a joint layer, where λ_spatial and λ_spectral are the corresponding weighting factors.
To better train the whole network, two auxiliary tasks are added to the framework. 2 The proposed SSMRN is thus a triple-task framework, including one main task (classification based on spectral-spatial information) and two auxiliary tasks (classification based on spectral information and classification based on spatial information). The complete cross-entropy loss function of the SSMRN is defined as

L = L_joint + L_spectral + L_spatial
  = -(1/M) Σ_{m=1}^{M} Σ_{n=1}^{N} [y_m^n log(ŷ_m^{n,joint}) + y_m^n log(ŷ_m^{n,spectral}) + y_m^n log(ŷ_m^{n,spatial})],

where L_joint is the main loss function, L_spectral and L_spatial are the two auxiliary loss functions, ŷ_m^{n,joint}, ŷ_m^{n,spectral}, and ŷ_m^{n,spatial} are the corresponding predicted labels, y_m^n is the true label, M is the number of training samples, and N is the number of classes. The whole network is trained in an end-to-end manner, where all parameters are optimized simultaneously by the batch stochastic gradient descent algorithm. In this way, the complete loss function balances the convergence of both the whole network and the subnetworks.
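A minimal sketch of the triple-task loss, assuming equal weighting of the three cross-entropy terms (the text does not state per-task weights, so equal weights are our assumption):

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean categorical cross-entropy; eps guards against log(0)."""
    return -np.sum(y_true * np.log(y_pred + eps)) / y_true.shape[0]

def ssmrn_loss(y_true, y_joint, y_spectral, y_spatial):
    """Main-task loss plus the two auxiliary-task losses.

    y_joint, y_spectral, y_spatial: softmax outputs of the joint head
    and of the spectral/spatial auxiliary heads, respectively.
    """
    return (cross_entropy(y_true, y_joint)
            + cross_entropy(y_true, y_spectral)
            + cross_entropy(y_true, y_spatial))

y = np.array([[1.0, 0.0]])
total = ssmrn_loss(y, np.array([[0.9, 0.1]]),
                   np.array([[0.7, 0.3]]),
                   np.array([[0.8, 0.2]]))
```

Because all three heads share the two feature branches, minimizing this sum pushes each branch to be discriminative on its own while the joint head learns the fusion.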

Experiment
In this section, we introduce three public data sets used in our experiment and the configuration of the proposed SSMRN. In addition, classification performance based on the proposed method and other comparative methods is presented.

Experimental Data
Three publicly available hyperspectral data sets are utilized to evaluate the performance of the proposed method, i.e., Indian Pines (IP) from the airborne visible/infrared imaging spectrometer (AVIRIS) sensor, Pavia University (PU) from the reflective optics systems imaging spectrometer (ROSIS) sensor, and Salinas (SA) from the AVIRIS sensor. Details of the data sets are shown in Table 1.

Evaluation indicators
To quantitatively analyze the effectiveness of the proposed method and the methods used for comparison, three quantitative evaluation indicators are introduced: class-specific classification accuracy (CA), overall classification accuracy (OA), and the Kappa coefficient (Kappa). A larger value of each indicator represents better classification performance.
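Both OA and Kappa can be computed from the confusion matrix of predicted versus true labels; a compact NumPy sketch:

```python
import numpy as np

def oa_and_kappa(y_true, y_pred, n_classes):
    """Overall accuracy and Cohen's kappa from label lists.

    OA is the observed agreement (trace of the confusion matrix over the
    total); kappa corrects OA for the agreement expected by chance.
    """
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    total = cm.sum()
    po = np.trace(cm) / total                  # observed agreement (OA)
    pe = (cm.sum(0) @ cm.sum(1)) / total ** 2  # chance agreement
    return po, (po - pe) / (1 - pe)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 2, 1]
oa, kappa = oa_and_kappa(y_true, y_pred, 3)    # OA = 5/6, kappa = 0.75
```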

Configuration
All the experiments are implemented on an Intel(R) Xeon(R) Silver 4210 CPU @ 2.20 GHz with 64 GB of RAM and an NVIDIA RTX 2080 graphics card, using TensorFlow 2.3.1 and Keras 2.4.3 with Python 3.7.6. We use the Adam optimizer to train the networks with a learning rate of 0.001. The gradient of each weight is individually clipped so that its norm is no higher than 1. The training epochs are set as 1500 with batch size 1048.

Parameter setting
All the experiments in this paper are randomly repeated 30 times. In each repetition, we first randomly generate the training set from the whole data set with the same number of samples per labeled class. Then, the remaining samples make up the test set. Details are given in Tables 2-4. For the proposed MRCNN, the input is a 24 × 24 × 4 patch, where 4 is the number of reserved principal components. All convolutional layers have 64 filters. The kernel size of the first left convolutional layer is 1 × 1, and the other kernel sizes are 3 × 3. The size of the max pooling layers is 2 × 2. The three FC layers after each scale output each have 64 units. For the proposed Bi-GRU, the number of time steps is set to 3. The hidden size in the GRU is 64, so the one-dimensional convolutional layers have 128 filters because of the bi-directional structure. For the proposed SSMRN, the input is the same as for the Bi-GRU and MRCNN. The number of neurons of the FC layer in each of the spectral and spatial branches is 192, so the number of neurons in the joint FC layer is 384.
In our study, we adopt the way of fusing two separately learned deep spectral features and deep spatial features. Since the importance of spectral and spatial features may vary with spatial resolution, we weight these two parts and need to specify the initial values of these hyperparameters. The principle is that the higher the spatial resolution and the smaller the influence of the mixed pixel effect, the greater the initial spectral weight should be. We assume the sum of the two weights is 1 and initialize them close to each other. Owing to the proposed strategy, the weights for the spectral and spatial parts can then be adjusted adaptively during training. The initial values of the weighting factors λ_spatial and λ_spectral are given in Table 5.
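One way to keep the two weights summing to 1 while both remain trainable is to renormalize them before fusing the branch features. This renormalization is our assumption, since the paper only states that the weights are adjusted adaptively during training:

```python
import numpy as np

def fuse_features(f_spectral, f_spatial, w_spectral, w_spatial):
    """Weighted concatenation of the two branch feature vectors.

    The raw weights are renormalized so that they always sum to 1,
    one possible way to maintain the constraint while both weights
    are updated by gradient descent (an assumption in this sketch).
    """
    s = w_spectral + w_spatial
    lam_spec, lam_spat = w_spectral / s, w_spatial / s
    return np.concatenate([lam_spec * f_spectral, lam_spat * f_spatial])

f_spec = np.ones(192)   # spectral-branch FC output (192 neurons)
f_spat = np.ones(192)   # spatial-branch FC output (192 neurons)
joint = fuse_features(f_spec, f_spat, 0.4, 0.6)  # joint FC input, 384-dim
```

For a low-resolution scene such as IP, λ_spectral would start below λ_spatial, consistent with the stated principle.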

Ablation study
In this section, we compare the SSMRN with the SSMRN without auxiliary tasks. As shown in Table 6, the SSMRN surpasses the SSMRN without auxiliary tasks, especially for small sample sizes on the IP data set. These results demonstrate that multi-task learning helps the network exploit the HSI data more effectively for feature learning.

Classification Results
To demonstrate the superiority and effectiveness of the proposed SSMRN model, it is compared with the proposed Bi-GRU, MRCNN, and advanced spectral-spatial DNN methods, i.e., SSUN, 2 SSAN, 21 RSSAN, 15 MorphCNN, 11 and SSFTT. 18 Like the proposed SSMRN, SSUN and SSAN learn deep spectral and spatial features in two separate branches, and the two kinds of features are fused to generate the joint deep spectral-spatial features. The difference is that SSMRN considers the weight relationship between the spectral and spatial branches depending on the spatial resolution of images and embeds multi-task learning technology at the same time. For SSUN, SSAN, and MorphCNN, the input is a 24 × 24 × 4 patch, where 4 is the number of reserved principal components. Limited by our computer configuration, we cannot run RSSAN properly with the original input size in the corresponding reference, so the input of RSSAN is a 24 × 24 × 8 patch, where 8 is the number of reserved principal components instead of the number of spectral bands. According to the reference, the input of SSFTT is a 13 × 13 × 30 patch. For SSUN, SSAN, RSSAN, MorphCNN, and SSFTT, all network settings are as described in their corresponding references. For a fair comparison, the training and test sample sets of all methods are randomly selected, as shown in Tables 2-4. Quantitative evaluation: Tables 7-9 report the CA, OA, and Kappa of all the mentioned methods for the IP, PU, and SA data sets, respectively. All algorithms are executed 30 times, and the average results with standard deviations are reported to reduce random selection effects. The optimal results are denoted in bold. The evaluation data clearly show that the proposed SSMRN method performs the best. The SSMRN obtains the highest OA and Kappa. SSMRN also yields most of the highest class-specific accuracies, with only a few classes having slightly lower precision than MRCNN, SSUN, and SSFTT.
Particularly on the IP data set, the results of SSMRN are higher than those of the other methods, which shows that SSMRN can effectively learn the features of objects, especially under the condition of a small number of samples. The CA, OA, and Kappa of Bi-GRU are lower than those of the other methods, particularly on the IP data set (Table 7). This is because Bi-GRU uses only spectral features, while the IP data set has lower spatial resolution and a stronger mixed-pixel effect. MRCNN's results are second only to SSMRN, which shows that good results can be obtained using spatial features and a proper deep network structure. Especially on the SA data set, the results of the MRCNN and SSMRN models are almost identical. The likely reason is that the ground objects of interest in the image are homogeneous, regular, and large in area, so the pixel-level supervised information can be well regarded as patch-level supervised information and the scales of the supervised information and spatial features match. The structures of SSUN and SSAN are similar to that of SSMRN, belonging to the way of fusing two separately learned deep spectral features and deep spatial features. However, the reason why the results of SSUN and SSAN are not as good as those of SSMRN may be that the network depth of their spectral and spatial FE is insufficient. The structures of RSSAN, MorphCNN, and SSFTT belong to the way of directly extracting deep features from the original data or several principal components of the original data. RSSAN and SSFTT are powerful methods, but their main limitation is that a certain number of samples is required, which may result in poor performance with small samples, as on the IP data set. The classification accuracy of MorphCNN is low and unstable in Tables 7 and 8, possibly because, compared with the objects in PU and SA, the morphological features contained in the patch are less obvious.
As shown in Tables 7-9, Bi-GRU, SSUN, and SSFTT generally cost less time than MRCNN and the other spectral-spatial feature methods. The reasons may be the grouping strategy of Bi-GRU, the grouping strategy and insufficient network depth of SSUN, and the transformer encoder module of SSFTT. The runtime of MorphCNN is the longest, because its network structure is more complex and deeper than the other networks. Visual evaluation: the classification maps obtained by all methods are compared to the ground-truth map. Due to the lack of spatial features, the classification maps of Bi-GRU suffer from pepper noise and misclassification inside objects. Compared with spectral FE methods, spatial FE methods make full use of the continuity of ground objects and yield cleaner classification maps. The main problem of MRCNN lies in the over-smoothing phenomenon. RSSAN, MorphCNN, and SSFTT show the over-smoothing phenomenon, too; they belong to the way of directly extracting joint deep spectral-spatial features from the original data or several principal components of the original data, and their spectral features come from the patch scale. Meanwhile, SSMRN, SSUN, and SSAN can better retain the detailed boundaries of different objects and acquire smoother and more homogeneous results, especially within the white dashed box. The most likely reason is that they have separate spatial and spectral FE branches, and their spectral features come from the pixel scale. However, SSUN and SSAN do not consider the weight relationship between the two branches depending on the spatial resolution of images. The proposed SSMRN takes the weight between spectral and spatial features into consideration and can further reduce over-smoothing.

Discussion
The experimental results on the three public data sets indicate that SSMRN has a more competitive performance in terms of the three measurements (CA, OA, and Kappa) and classification maps than all the compared methods. This is due to the following: 1. The SSMRN is designed with a spectral branch and a spatial branch to extract spectral-spatial features. These operations join spectral features and spatial information together sufficiently. 2. The proposed framework takes the weight between spectral and spatial features into consideration and can reduce over-smoothing. Meanwhile, the multi-task learning technology is integrated into the framework, improving the stability of the results.

Conclusion
To significantly reduce the over-smoothing effect and effectively learn the features of objects, a multi-task learning SSMRN has been proposed to extract spectral-spatial features. The experimental results on three public data sets demonstrate that the method not only mitigates the over-smoothing phenomenon but also performs better than the other methods in terms of CA, OA, and Kappa. Our method significantly outperforms the other methods under different training sample conditions. Although we utilize the proposed band grouping-based Bi-GRU and MRCNN as the spectral and spatial feature extractors in the implementation of the proposed SSMRN, other deep networks can also be introduced into our model, especially as spectral extractors. This deserves investigation in future work.