Image-based neural architecture automatic search method for hyperspectral image classification

Abstract. Convolutional neural networks (CNNs) have shown excellent performance for hyperspectral image (HSI) classification due to their local connectivity and weight sharing. Nevertheless, as the study of network architectures deepens, manual empirical design alone can no longer meet current needs. In addition, existing CNN-based frameworks are heavily affected by the redundant three-dimensional cubes of the input, resulting in inefficient descriptions of HSIs. We propose an image-based neural architecture automatic search framework (I-NAS) as an alternative to manually designed CNNs. First, to alleviate the redundant spectral–spatial distribution, I-NAS feeds a full image into the framework via label masking. Second, an end-to-end cell-based search space is constructed to enrich the feature representation. Then, the optimal cells are determined by a gradient descent search algorithm. Finally, the well-trained CNN architecture is automatically constructed by stacking the optimal cells. Experimental results on two real HSI datasets indicate that our proposal provides competitive classification performance.

embedding hash features extracted from the spectrum into the CNN architecture. However, unlike common natural images, an HSI is essentially a third-order tensor containing two spatial dimensions and one spectral dimension. 18 It is very important for HSIC to integrate spatial and spectral information. [19][20][21] CNNs allow the use of spatial HSI patches as data input, providing a natural way to merge spatial contextual information through their local receptive fields to improve classification performance. 22 To effectively utilize spatial information, Li et al. 23 combined a CNN model with pixel pairs to learn discriminative features and used a majority voting strategy to obtain the final classification results. Zhao et al. 24 combined a CNN-based spatial feature extraction process with a spectral feature extraction process based on balanced local discriminative embedding, superimposing the obtained features before performing the final classification step. Although these methods incorporate different techniques on top of CNNs to extract spectral and spatial information separately, they do not consider the inherent continuity of the three-dimensional (3D) HSI cubes. In contrast, 3D CNN approaches take the neighborhood cube of the raw HSI as input and compute 3D convolutions over each pixel, its spatial neighborhood, and the corresponding spectral information. For example, Mei et al. 25 used a 3D convolutional autoencoder network to learn the spectral-spatial features of the HSI. Roy et al. 26 proposed a hybrid spectral CNN for HSIC, which consists of a spatial 2D-CNN and a spectral-spatial 3D-CNN for joint spectral-spatial feature representation. However, the classification accuracy of the above CNN models decreases with increasing network depth, and deeper network structures are prone to the Hughes phenomenon.
To alleviate the above problems, Zhong et al. 27 constructed spectral and spatial residual blocks that accelerate backpropagation through the network and prevent gradient explosion. Wang et al. 28 proposed an end-to-end fast, dense spectral-spatial convolution framework, which uses dynamic learning rates, parametric rectified linear units, batch normalization, and dropout layers to increase speed and prevent overfitting. In addition, to take full advantage of the positional relationships in the HSI pixel vector, Zhu et al. 29 proposed a CNN-based capsule network (CapsNet). The CapsNet architecture uses local connections and shared transform matrices, reducing the number of trainable parameters, which has the potential to alleviate overfitting when the number of available training samples is limited. Generative models can also produce high-quality samples to alleviate these problems. 30,31 Wang et al. 32 designed an adaptive dropblock and embedded it into a generative adversarial network (ADGAN) to alleviate problems such as training data imbalance and mode collapse in HSIC. However, the architectures of these models were designed manually by experts. Designing an appropriate neural architecture is a key aspect of neural network-based classifiers, requiring a large amount of prior knowledge in a time-consuming, trial-and-error process.
Recently, the neural architecture search (NAS) framework has attracted much attention because it can automatically search the structure of neural networks. To design a suitable CNN architecture, Chen et al. 33 proposed the first automated CNN approach for HSIC: they constructed a cell-based search space in which NAS searches CNN architectures, using gradient descent-based 1D auto-CNN and 3D auto-CNN as spectral and spectral-spatial HSI classifiers, respectively. Zhang et al. 34 applied the particle swarm optimization (PSO) method to CNN architecture search, which is able to obtain a globally optimal architecture; they designed a new direct encoding strategy to encode structures into particles and used the PSO algorithm to find the optimal deep architecture from the swarm. Compared with existing deep learning methods, NAS-based methods achieve better performance.
Although the above methods have made significant progress in HSIC tasks in recent years, when they are used for HSIC there is a large amount of redundant information between neighboring cubes due to the high data complexity of HSI [see Fig. 1(a)], and it is easy to see from Figs. 1(a) and 1(b) that the redundant information increases as the patch size increases. Therefore, training on neighborhood cubes consumes more training time and memory than training on the raw image, and it is difficult to design a patch-based classification model that fits arbitrary images. Meanwhile, using HSI neighborhood cubes as data input limits the range of spatial neighborhood information that can be used.
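As a rough illustration of this redundancy, the following back-of-envelope sketch compares the number of input elements touched by patch-based input with a single full-image input. The dimensions are Indian Pines-like, the patch size of 9 is our own illustrative choice, and the simplifying assumption that every pixel gets a patch is ours, not a measurement from the paper:

```python
# Hypothetical comparison: total elements read by patch-based input
# versus one full-image input for a single HSI cube.
def patch_input_elements(h, w, bands, patch):
    # One patch per pixel; assume every pixel of the scene is used.
    return h * w * patch * patch * bands

def image_input_elements(h, w, bands):
    # The full image is read exactly once.
    return h * w * bands

h, w, bands, patch = 145, 145, 200, 9  # Indian Pines-like dimensions
ratio = patch_input_elements(h, w, bands, patch) / image_input_elements(h, w, bands)
print(ratio)  # on average, each pixel is re-read patch * patch times
```

Under these assumptions the patch-based pipeline touches each pixel patch × patch = 81 times, which is the redundancy the image-based framework avoids.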
To alleviate the data redundancy problem of HSI patches, expand the application scope of spatial neighborhood information, and improve the processing efficiency of training and testing, Cui et al. 35 proposed an image-based classification framework that uses the image as data input [see Fig. 1(c)], along with a multiscale spatial-spectral CNN (HyMSCN) for HSI that integrates features fused from multiple receptive fields and multiscale spatial features at different levels. However, it does not enable automatic search of the neural network structure.
Although using images as data input can improve processing efficiency in training and testing, the pooling operation in a CNN downsamples the feature space to a manageable size, which inevitably causes information loss. To alleviate this problem, this paper proposes an end-to-end cell structure in which the input nodes are merged into the output node to enrich the feature information of the input images and ensure classification accuracy.
In this context, this paper proposes an image-based neural architecture automatic search (I-NAS) method to further improve the processing efficiency of training and testing and to ensure classification accuracy while realizing automatic neural architecture search. Specifically, I-NAS first uses the masked image containing the training samples as the input to the architecture search and classification model, while extracting the spatial location coordinates of the labeled pixels. Second, an end-to-end cell search space is constructed, with candidate operations including convolution and pooling. Then, a gradient descent-based search algorithm is used to find the cell structure with the best classification performance on the validation dataset. Finally, a CNN classification model is constructed by stacking the cells. In the testing phase, the masked image containing the test samples is used as the input to the CNN classification model, and the corresponding labels of all pixels are predicted. Experimental results on multiple datasets show that, compared with using neighboring cubes as the network input, the proposed method reduces the running time and achieves good classification performance.
The contributions of this paper are as follows:
1. An image-based neural architecture automatic search method (I-NAS) is proposed. The image-based framework increases the spatial receptive field, reduces data redundancy, and improves the efficiency and performance of classification.
2. An end-to-end cell structure is proposed. In this cell structure, the two input nodes are merged into the output node, reducing the loss of input-image feature information caused by convolution and pooling operations.
3. On two well-known HSI datasets, I-NAS takes significantly less time in the training and testing phases than other deep learning models, and significantly less time and memory in the architecture search phase than the patch-based neural architecture search algorithm (P-NAS).
The remainder of this paper is organized as follows. In Sec. 2, related works are briefly introduced. In Sec. 3, we introduce our algorithm, including the image-based neural architecture automatic search framework and the cell-based network structure search. Section 4 presents the experimental results of our method and its comparison with other HSIC methods, discusses the time and space complexity of the experiments, and describes the optimal cells. Finally, the advantages and disadvantages of this method are summarized in Sec. 5.

Figure 2 shows the process of designing an architecture using the NAS approach. The general process of NAS is to first construct a search space, which is a collection of candidate CNN architectures. Then, the best network architecture in the search space is found using a search strategy guided by classification performance evaluation, which assesses the candidate CNN architectures using evaluation metrics.
Many different search strategies have been used for NAS, including reinforcement learning (RL), evolutionary methods, and gradient-based methods. For instance, Zoph et al. 37 proposed an RL-based NAS approach that uses an RNN to generate a model description of a neural network and uses RL to train this network to maximize the expected accuracy of the generated structures on the validation set. 38 In search strategies based on evolutionary methods, 39,40 each neural network structure is encoded as a digital sequence. Each sequence is trained, and its performance on the validation set is used as its fitness; based on the fitness, new high-performance neural network structures are generated. However, both kinds of search methods require thousands of GPU hours during the architecture search phase to obtain the optimal architecture, which is too time-consuming. In contrast, for gradient-based search, Liu et al. 36 proposed a method that converts the discrete search space into a continuous one, enabling gradient-based search to obtain suitable neural structures and greatly reducing the training time required in the architecture search phase. Due to their simplicity and effectiveness, gradient-based search strategies have become a hot research topic in neural architecture search. In this paper, a gradient-based method is also used to search the network structure of cells.

Proposed Methodology
This section presents the framework of I-NAS. I-NAS first preprocesses the HSI dataset to obtain the masked training, validation, and test groups. Then, the training and validation groups are used as input for architecture search to obtain the optimal cell architecture to determine the final classification model. Finally, the test group is used as the input of the final classification model to obtain the classification map. The framework diagram of I-NAS is shown in Fig. 4.

Image Data Preprocessing
Taking the Indian Pines dataset as an example, Fig. 3 shows the preprocessing applied to the input HSI data. Suppose an HSI cube X ∈ R^{h×w×b} is the input, where h and w are the spatial sizes of X and b is the number of spectral bands. It is then transformed into an index set I over the labels of each land cover category, which consists of both pixel indexes and the corresponding position information. Each element of I can be described as (i, p), in which i is the index of a sample and p is the position information of the corresponding sample. The dataset is then randomly divided into three sets according to the labeled indexes of each class: the training index set I_1, the validation index set I_2, and the test index set I_3.
To preserve the raw input shape of the HSI cube X, we employ a masking operation that eliminates the interference of irrelevant pixels for each index set, according to the position information of the pixels selected in I_1, I_2, and I_3, respectively. The image groups X_1, X_2, and X_3 are thus generated from mutually independent pixel sets. Each group then contains a sparser data distribution, which makes it convenient to identify homogeneous spectral energy and efficient to extract highly correlated spatial contexts for classification.
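A minimal sketch of this masking step (the function name and array layout are our assumptions, not the paper's code): given a cube and the positions from an index set, every pixel outside the set is zeroed, so only the selected spectra remain in the image fed to the network:

```python
import numpy as np

def mask_image(X, positions):
    """Keep only the selected pixels of the HSI cube; zero the rest.

    X: (h, w, b) cube; positions: iterable of (row, col) pairs taken
    from an index set I_k. Spectra outside the split are suppressed so
    they cannot interfere during training.
    """
    masked = np.zeros_like(X)
    for r, c in positions:
        masked[r, c, :] = X[r, c, :]
    return masked
```

Applying `mask_image` with I_1, I_2, and I_3 in turn yields the mutually independent groups X_1, X_2, and X_3 described above.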

Image-Based NAS Framework
The image-based NAS (I-NAS) framework for HSI classification is shown in Fig. 4, in which the whole structure is divided into two steps: the architecture search phase and the model testing phase. Taking an HSI cube X as the input, the training group X_1 and validation group X_2 generated by the image data preprocessing are employed for training. Both X_1 and X_2 are fed into a candidate architecture D in the cell-based search space to extract discriminative features of the HSI through various deep neural operations. The architecture then outputs a transformation matrix T that consists, for each pixel, of a feature vector with the probability of each land cover category given by the softmax classifier. The corresponding predicted labels are selected according to the positions of the training and test pixels, and the predicted label vector Y* = [y*_1, y*_2, ..., y*_S] is obtained by the select operation. In addition, each labeled pixel has its corresponding annotation in Y = [y_1, y_2, ..., y_S]. Therefore, the parameters of D are updated by backpropagating the gradient of the cross-entropy objective function 41 in Eq. (1), which measures the difference between Y* and Y:

$E = \sum_{s=1}^{S} C(y_s, y_s^{*})$,  (1)

Fig. 3 HSI data preprocessing process, where X_1, X_2, and X_3 are the training, validation, and test datasets, respectively. I_1, I_2, and I_3 are the training, validation, and test index sets, respectively.
where S represents the number of land-cover categories, and C(·) represents the difference between the predicted label vector y* and the ground-truth label vector y. Finally, the optimal cells are acquired from the search space. The model testing phase constructs the well-trained CNN classification model by stacking the above optimal cells cell-wise, and this model is utilized for hyperspectral interpretation. In the model testing phase, the testing group X_3 containing the testing samples is used as the input of the final CNN classification model, and the corresponding labels of all pixels are predicted. The image-based classification framework can make full use of the graphics processing unit during testing to accelerate inference over nonredundant information. In addition, because there is no slicing operation, the testing process is straightforward: the result for the testing image is directly output in a single inference. More computing resources are conserved, and efficiency is improved.
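As an illustration of how such a masked objective can be evaluated over a full image, here is a NumPy sketch (the names, array shapes, and the 0 = unlabeled convention are our assumptions): only labeled pixels contribute to the cross-entropy, which is exactly what the label-masking input enables:

```python
import numpy as np

def masked_cross_entropy(logits, label_map):
    """Mean cross-entropy over labeled pixels only.

    logits: (C, h, w) class scores for the full image;
    label_map: (h, w) ints, 0 = masked/unlabeled, 1..C = class labels.
    """
    mask = label_map > 0
    z = logits[:, mask].T                       # (n_labeled, C)
    z = z - z.max(axis=1, keepdims=True)        # numerically stable softmax
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    target = label_map[mask] - 1                # shift to 0-based classes
    return -logp[np.arange(len(target)), target].mean()
```

In the real framework this loss is backpropagated through the candidate architecture D; the sketch only shows how the masking confines the objective to labeled positions.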

Cell Structure Search
To determine the optimal settings of the well-trained framework, neural cells are implemented and evaluated in the search space during the architecture search phase. We define a cell as an ordered sequence consisting of two input nodes, three intermediate nodes, and one output node. To demonstrate the cell search stream, we equip the cell structure search space with three independent neural cells, as shown in Fig. 5. Each intermediate node, with its two operation sets O, can be regarded as an element that aims to determine the optimal operations collaboratively. Thus, we consider various neural operations to construct the operation sets and learn the latent spectral-spatial distribution. Each operation in the set O participates in the neural computation to explore the variety of spectral signatures and predict the characteristic parameters for performing the operation in the optimal cell.
For each operation set O in a cell, we define eight operations to extract, in parallel, both homogeneous areas and neighboring contextual correlations of HSIs: separable convolutions with kernel sizes 3 × 3 and 5 × 5, dilated convolutions 36 with kernel sizes 3 × 3 and 5 × 5, an average pooling operation, a max pooling operation, an identity mapping, and a skip connection. Once each element of O has been applied and the corresponding feature maps of the HSI generated, the most suitable intermediate nodes that construct the optimal cell can be determined by selecting the high-utility operation sets whose feature maps have the highest weight parameters. Finally, the searched structure predicts the labels of the input training group X_1 in a concatenation fashion and optimizes the cell structure space via a gradient descent method.
In particular, a neural cell can be regarded as a directed acyclic graph that contains N sequential nodes. Each directed edge indicates a series of operations that transform an input node into an intermediate node. In this way, the potential representation of HSIs can be explored through various operations. If P^{(m)} and P^{(n)} represent an intermediate node and an input node of the cell, respectively, the structure of the neural cell can be formulated as

$P^{(m)} = \sum_{n<m} o^{(n,m)}(P^{(n)})$,  (2)

where o^{(n,m)} is the operation between P^{(n)} and P^{(m)}, such as convolution, pooling, or a skip connection. In addition, (n, m) indicates the directed edge between P^{(n)} and P^{(m)} and is associated with the operation o^{(n,m)} that transforms P^{(n)}. To visualize the cell structure search strategy, Fig. 6 shows the search process for a cell with one input node, two intermediate nodes, and an output node, in which we annotate the nodes with independent digital labels; P^{(n)} and P^{(m)} represent intermediate node 1 and intermediate node 2 of the cell, respectively. First, each pair of nodes in the cell is connected with a dotted line, in which all operations on the directed edges are "unknown." Second, to initialize the parameters of the cell, we employ the operation set O to activate each node of the cell for the data generalization of HSIs. Then, the operations that contribute poorly to feature extraction are eliminated while the weight parameters are updated, and the optimal operations between the intermediate nodes P^{(n)} and P^{(m)} are determined, as shown in Fig. 6. The operation on the directed edge between P^{(n)} and P^{(m)} drawn in a bright color indicates the optimal operation with effective feature descriptions of the HSI; the dark ones represent operations that have been abandoned during optimization. Finally, the optimal cell structure is built by concatenating all of the operated nodes that contain optimal operations.
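The continuous relaxation that makes this edge-wise selection differentiable can be sketched with toy stand-in operations (the real search space uses the convolutions and poolings listed above; the softmax-weighted mixture follows the DARTS-style relaxation, and the operation names here are illustrative):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Toy candidate operations on a feature map, standing in for the
# convolution/pooling choices of the search space:
OPS = {
    "identity": lambda x: x,
    "zero":     lambda x: np.zeros_like(x),
    "smooth":   lambda x: 0.5 * x,   # placeholder for a conv/pool op
}

def mixed_op(x, alpha):
    """One relaxed edge: softmax(alpha)-weighted sum of all candidate
    operations. Gradient descent on alpha then ranks the operations."""
    w = softmax(alpha)
    return sum(wi * op(x) for wi, op in zip(w, OPS.values()))
```

After training, the operation with the largest softmax weight on each edge is kept and the others are discarded, which is the bright-versus-dark distinction in Fig. 6.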
In addition, to reduce the loss of feature information and capture the complete characteristics of HSIs, we merge the two input nodes of the cell into the output of the cell to avoid inefficient description issues. This can be formulated as

$h = \mathrm{concat}(\{P^{(l)}\})$,  (3)

where P^{(l)} ranges over all nodes in the cell except the output node, including the nodes P^{(m)} and P^{(n)}. The function concat(·) concatenates tensors, and h is the output of the cell.

Optimization
To accelerate the search process, we introduce the gradient descent algorithm to update the hyperparameters of the cell-based search space. When the search converges, the most probable operation on each edge is selected as the final operation, and the cell structure is thereby determined.
As with hand-designed neural network structures, the performance on the validation dataset is used to guide the structure design. Stochastic gradient descent (SGD) is used to accelerate the search process and optimize the architecture variable B and network weights w in this paper. This is a bilevel optimization process, expressed as

$\min_{B} \; L_{\mathrm{val}}(w^{*}(B), B)$
$\text{s.t.} \; w^{*}(B) = \arg\min_{w} L_{\mathrm{train}}(w, B)$,

where L_train and L_val denote the training and validation loss, respectively; both depend on the architecture variable B and the network weights w. The goal of NAS is to find the variable B* corresponding to the minimum validation loss L_val(w*, B*) and the weight values w* = arg min_w L_train(w, B*) corresponding to the minimum training loss. B* and w* are then used for the architecture design.
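A minimal first-order sketch of this alternating scheme, using toy quadratic losses in place of L_train and L_val (the losses, step size, and step count are illustrative assumptions, not the paper's settings):

```python
def search(l_train_grad_w, l_val_grad_b, w, b, steps=100, lr=0.1):
    """Alternate a weight step on the training loss with an
    architecture step on the validation loss (first-order scheme)."""
    for _ in range(steps):
        w = w - lr * l_train_grad_w(w, b)   # inner step: weights w on L_train
        b = b - lr * l_val_grad_b(w, b)     # outer step: architecture B on L_val
    return w, b

# Toy losses: L_train = (w - b)^2, L_val = (w - 1)^2 + (b - 1)^2.
# Both w and b should converge toward 1.
w, b = search(lambda w, b: 2 * (w - b),
              lambda w, b: 2 * (b - 1),
              w=0.0, b=0.0)
```

The real method applies the same alternation with network weights and architecture variables in place of the scalars w and b.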
For the cell search strategy, SGD or a similar optimizer is introduced to find and evaluate the optimal architecture on the validation dataset. It can be applied as a predictive constraint during validation after training at each epoch and can be formulated as

$L_{\mathrm{val}}(\theta) = \sum_{i} C(h_{\theta}(x_i), y_i)$,

where θ denotes the parameter weights and biases to be optimized, and x_i and y_i are the validation samples and their corresponding annotations. h_θ(·) represents the cell operations that predict the probability vectors over the land cover categories. Finally, the gradients −∇_θ L_val are employed to update θ through SGD or similar optimization methods, learning the discriminative spectral-spatial distribution and constructing a well-trained CNN architecture for classification.
When the cell search process is finished, the output of each node is calculated based on only its two strongest predecessor nodes, where the variable B_o^{(n,m)} defines the strength of the connection between two nodes. Then, the architecture of I-NAS is determined (as shown in Fig. 5). We train the I-NAS architecture from scratch on the training dataset. Algorithm 1 shows the whole process of I-NAS for HSIC.
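This discretization step can be sketched as follows; the operation list, the exclusion of a "zero" operation, and keeping k = 2 predecessors are our assumptions based on common practice in cell-based NAS, not details confirmed by the paper:

```python
import numpy as np

def discretize(B, k=2):
    """Derive a node's discrete inputs from architecture variables.

    B[n] holds logits over candidate operations for the edge from
    predecessor n to this node. Keep the k predecessors whose best
    (non-'zero') operation carries the largest softmax weight.
    """
    op_names = ["identity", "zero", "conv3x3"]   # illustrative op list
    best = []
    for n, logits in enumerate(B):
        w = np.exp(logits) / np.exp(logits).sum()
        # strongest operation on this edge, ignoring the 'zero' op
        cand = [(w[i], op_names[i]) for i in range(len(op_names))
                if op_names[i] != "zero"]
        weight, name = max(cand)
        best.append((weight, n, name))
    best.sort(reverse=True)                      # strongest edges first
    return [(n, name) for _, n, name in best[:k]]
```

Each kept pair (predecessor, operation) becomes one incoming edge of the node in the final cell.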

Experimental Results and Analysis
This section describes the results of experiments conducted on two open HSI datasets, Indian Pines (IN) and University of Pavia (UP). The algorithms were evaluated on overall accuracy (OA), classification accuracy of each class (CA), average accuracy (AA), and the kappa coefficient (K × 100). All experiments were performed on an NVIDIA Tesla V100-SXM2 GPU with 32 GB of memory (driver version 418.67). The software environment consisted of Ubuntu 18.04.3 as the operating system, CUDA 10.1, the PyTorch deep learning library, and Python 3.6 as the programming language. All experimental results are given as the average ± standard deviation over 10 independent runs.

Experimental Dataset
The IN dataset was collected by the AVIRIS sensor, which has a spatial resolution of 20 m. IN has 224 spectral bands with wavelengths ranging from 400 to 2500 nm, and its spatial size is 145 × 145. Because some bands are affected by water absorption, the remaining 200 bands are generally studied. The dataset includes 16 land cover categories. The false-color image, ground-truth map, and color code are shown in Fig. 7.
The UP dataset was collected using the reflective optics spectrographic imaging system. The UP data contain 115 spectral bands with wavelengths ranging from 430 to 860 nm, among which 12 bands are eliminated because of noise, so the image formed by the remaining 103 spectral bands is generally used. The spatial resolution is 1.3 m, and the spatial size of the UP data is 610 × 340. The dataset includes nine land cover categories. Figure 8 shows the false-color image, ground-truth map, and color code.

Experimental Setup
The numbers of epochs for the cell architecture search and the final model training were 150 and 300, respectively. Considering the stability of model training, the learning rate of the CNN weights was initialized to 0.016 and 0.008 for the two phases, respectively, and the learning rate of the architecture variable was 0.0003. We used the Adam optimizer to optimize the loss function. 42 Because we selected the image containing the samples as the input for the architecture search, the batch size was set to 1, and the stride of each convolution kernel was set to 1 to avoid feature loss. In addition, to explore the diversity of the searched cells, we also searched for two types of cells, as shown in Figs. 9 and 10. For dataset sampling, 50 samples were randomly selected from each land cover category for training; if the number of labeled samples in a category was <50, half the samples of that category were randomly selected for training. The validation dataset was drawn with the same method, except that the number of randomly selected samples in each category was half that of the training dataset. All remaining labeled samples were used as the testing dataset.
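The sampling rule described above can be written as a small helper (a sketch; the function and variable names are ours):

```python
def split_counts(class_sizes, n_train=50):
    """Per-class (train, val) sample counts for the setup described
    above: 50 training samples per class, or half the class if fewer
    than 50 labeled samples exist; validation takes half the training
    count. The rest of each class is left for testing."""
    out = {}
    for c, n in class_sizes.items():
        t = n_train if n >= n_train else n // 2
        out[c] = (t, t // 2)
    return out
```

For example, a class with 200 labeled pixels yields 50 training and 25 validation samples, while a class with only 30 labeled pixels yields 15 and 7.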

Optimal Architecture
We show the optimal cell structures in IN (see Figs. 9 and 10) and UP (see Figs. 11 and 12).
In the architecture search phase, the architecture variable B is selected according to the classification performance on the validation dataset, and thus the optimal architecture is obtained. Finally, I-NAS is trained from scratch, and the classification accuracy of the network is evaluated on the test dataset. Taking IN and UP as examples, Figs. 9-12 show the detailed structure of each cell. As shown in the figures, the inputs of cell c_k are the outputs of the two previous cells, c_{k-2} and c_{k-1}; the cell has three intermediate nodes 0, 1, and 2; and the output of cell c_k is a concatenation of the two input nodes and the three intermediate nodes.

Classification Results on Hyperspectral Datasets
To evaluate the effectiveness of the I-NAS method, we used five representative HSI classification methods for qualitative comparison, including the support vector machine (SVM), SSRN, 27 ADGAN, 32 3DCNN, 43 and CapsNet. 29 In addition, to evaluate the feature characterization capability and computational efficiency of the proposed method in this paper, we also compared P-NAS to explore the difference between the neighborhood cube input and the image-based input. In Tables 1 and 2, the best experimental results are in bold.
The purpose of using the IN dataset is to verify the robustness of the algorithms when dealing with unbalanced samples. For the IN dataset, as shown in Table 1, although the SVM has good robustness in HSIC, training the classifier using only spectral features limits its feature generalization, resulting in the worst classification accuracy; for example, the fourth and ninth classes reached only 44.02% and 50.11%, respectively. Compared with the SVM, 3DCNN exhibits promising classification performance. In addition, SSRN, ADGAN, and CapsNet all achieve different degrees of classification performance due to their different uses of spatial information. Among them, CapsNet can achieve results that compete with I-NAS and P-NAS due to its use of pixel spatial location information. The proposed I-NAS possesses a more advanced classification performance, reaching 94.64% on OA. However, the OA difference between P-NAS and I-NAS is not significant, which may be because P-NAS benefits from the similar spectral information of neighboring pixels in homogeneous regions; I-NAS does not share this advantage, but it has a larger spatial receptive field and can extract abundant spatial texture information. Compared with the SVM, the OA, AA, and K of I-NAS are improved by 17.81%, 22.41%, and 20.14%; compared with 3DCNN, by 3.15%, 5.18%, and 4.43%; compared with SSRN, by 4.57%, 13.29%, and 5.19%; and compared with CapsNet, by 0.25%, 11.58%, and 0.36%, respectively. The OA of I-NAS is 6.67% higher than that of ADGAN and 0.39% higher than that of P-NAS. In I-NAS, the classification accuracy of the first, seventh, ninth, and sixteenth categories reached 100%. This may be because these classes have few samples, so the proportion of training samples within these homogeneous classes was larger than in other classes.

Fig. 11 The optimal structure of a cell based on the I-NAS method on UP.
This confirms that I-NAS can achieve a better classification effect when dealing with hyperspectral data with unbalanced labeled samples, indicating that I-NAS is a powerful HSIC framework. The UP dataset is used to examine the effectiveness of the algorithms in processing high-resolution data samples. As shown in Table 2, due to the finer spectral features of the UP dataset, the pixel-based approach (SVM) can provide relatively good experimental results on UP, especially for the ninth category. In addition, I-NAS and P-NAS possess more advanced classification performance than 3DCNN, SSRN, ADGAN, and CapsNet, reaching 97.45% and 96.71% on OA, respectively. From Table 2, the OA, AA, and K of I-NAS are 13%, 16.79%, and 16.75% higher than the SVM; 10.13%, 15.90%, and 13.06% higher than 3DCNN; 6.25%, 6.83%, and 8.45% higher than SSRN; 16.20%, 10.21%, and 9.09% higher than ADGAN; 2.01%, 7.10%, and 2.46% higher than CapsNet; and 0.74%, 0.99%, and 0.97% higher than P-NAS. The validity of the algorithm is thus verified. For the UP dataset, the classification accuracy of five land cover categories in I-NAS is higher than that of the other classification algorithms, among which the classification accuracy of the fifth, seventh, eighth, and ninth categories reached more than 99%. This indicates that I-NAS can achieve high classification accuracy when dealing with high-resolution data samples. Figures 13 and 14 show the classification maps of the various classifiers on the IN and UP datasets. It is evident from the resulting images that the SVM, using spectral features alone, always produces noisy scatter and depicts more errors than the spectral-spatial approaches. The more similar the spectra of different categories, the greater the difficulty of classification; due to the spectral similarities of categories 2, 10, and 11, misclassification easily occurs. However, I-NAS achieved good results on small-sample classes, such as categories 1, 4, 7, 9, and 16 [see Figs. 13(c)-13(h)].
By comparing the real ground reference with the classification map, I-NAS is shown to obtain more accurate classification results.
Similarly, due to the higher spatial resolution of the UP dataset, the SVM has better classification performance than on the IN dataset. Compared with the spatial-spectral-based algo-

Investigation of Number of Training Samples
We evaluated the sensitivity of the model to the number of samples by selecting different numbers of training samples and investigated the effect on OA. Figure 15(a) shows the OA of all classification methods using different numbers of training samples on the IN dataset, indicating that an increase in the number of training samples has a decisive effect on the classification performance of the seven classification methods. Although the classification accuracy of the SVM can be improved by selecting the best training samples to improve generalization performance, 44 it was observed in the experiment that the CA values remain high for classes 1, 7, 9, 13, and 16 no matter how many training samples are considered, which is because these five classes have relatively few samples and the intraclass variation in land coverage is negligible. Thus, although the number of training samples is relatively small, beneficial classification performance can be obtained. For the remaining classes, as the number of samples increases, their contribution to CA gradually increases and then stabilizes. Similar evidence is shown in Fig. 16(b), which shows the CA of each class for different numbers of training samples obtained by the I-NAS algorithm on the UP dataset: for classes 5, 7, and 9, the variability of the CA values is negligible as the number of training samples increases, whereas for the other classes, the CA values gradually stabilize as the number of samples increases. In summary, when modifying the training samples, not all classes contribute equally to OA.
The reason may be that the spectral signatures confront the challenge of spectral variability caused by illumination and atmospheric conditions: several classes have high intraclass differences and may need more training samples to characterize their class features, whereas classes that are not disturbed by spectral variability need only small numbers of training samples.

Time Complexity and Space Complexity Experiments
Time and space complexity are always important considerations in the study of HSIC. This section discusses the training and testing times required by the different algorithms and examines their numbers of training parameters; the time consumption and the number of parameters in the architecture search phase of the P-NAS and I-NAS algorithms were also studied. Tables 3 and 4 show the training and testing time consumption of all algorithms. For the IN dataset, compared with the SVM, the other deep learning methods needed more time to train the network. I-NAS is faster than the other deep learning methods because its input is a complete image rather than neighboring cubes, which reduces data redundancy and the computational burden. Of all the methods compared, SSRN and 3DCNN are the most time-consuming. The former requires more training time because it has many network layers and needs to extract both spectral and spatial features; for the latter, the large number of parameters in the 3D convolution kernels leads to extensive training time. Because CapsNet considers the spatial location information of pixels, it takes a relatively long time. The P-NAS method requires more training time than CapsNet because P-NAS has more network layers and concatenates the information of every node in each cell, which introduces more redundant information. Apart from I-NAS, ADGAN requires the least training time because ADGAN is a semisupervised classification method that uses less label information than the other methods. Somewhat similar results were obtained on the UP dataset. Regarding testing time, I-NAS again requires less time than the other classification algorithms for the reasons mentioned above, i.e., lower data redundancy and computational burden, and similar results were obtained on the UP dataset.
In this study, the size of the trainable parameters (in MB) was used to assess the extent to which the different algorithms occupy computer resources. The detailed results are listed in Table 5. Because the image-based CNN architecture is searched for automatically, the optimal architecture found by each search can differ (the experiment in this study was run 10 times), so we chose one optimal structure and calculated its number of trainable parameters.
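For context, parameter sizes reported in MB can be reproduced from layer shapes with simple arithmetic, assuming 32-bit floats (4 bytes per parameter). The helper functions below are illustrative sketches, not code from the paper:

```python
def conv2d_params(in_ch, out_ch, k, bias=True):
    """Number of trainable parameters in a k x k 2-D convolution layer."""
    return out_ch * in_ch * k * k + (out_ch if bias else 0)

def params_to_mb(n_params, bytes_per_param=4):
    """Convert a parameter count to megabytes (float32 by default)."""
    return n_params * bytes_per_param / 2**20
```

For example, a single 3 x 3 convolution mapping 3 channels to 16 has 3 * 16 * 9 + 16 = 448 parameters; summing such terms over all layers and converting with `params_to_mb` yields the MB figures of the kind listed in Table 5.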
As shown in Table 5, the number of trainable parameters of I-NAS is significantly smaller than that of other methods (such as SSRN, CapsNet, and P-NAS) for the IN dataset. The number of trainable parameters of I-NAS on UP is greater than that on the IN dataset, probably because I-NAS takes the entire image as input and the spatial size of UP is larger than that of IN. I-NAS can achieve better classification performance with fewer trainable parameters for two reasons. First, the combination of a limited number of training samples and a considerable number of trainable parameters makes a model prone to overfitting. Second, I-NAS has strong generalization ability because its architecture is designed automatically based on the training and validation samples.
For NAS, an architecture search is a very time-consuming step. Tables 6 and 7 show the time consumption and number of parameters of P-NAS and I-NAS in the architecture search phase on the two datasets, respectively. As shown in Tables 6 and 7, the time consumption and the number of required parameters in the architecture search phase of I-NAS are significantly lower than those of P-NAS.

Conclusions
This study proposes a new image-based NAS method. Compared with the 3DCNN and SSRN designed by human experts, I-NAS achieves better classification accuracy. I-NAS uses a masking operation to eliminate the interference of irrelevant pixels in each index set according to the position information of the selected pixels in I1, I2, and I3. The image groups X1, X2, and X3 are generated from mutually independent spectral pixel sets. Then, image groups X1 and X2 are used as the input for the architecture search, which determines the optimal cell structure in the search space via a gradient-based search strategy. To improve the classification performance, an end-to-end connection method was adopted in the cell to merge the input nodes into the output node, reducing the data loss caused by convolution and pooling. This paper also discusses the temporal and spatial complexity of the experiments, in which I-NAS spent less time (∼7 min) during the architecture search process on the IN dataset than P-NAS (∼51 min). In addition, the CNN architecture obtained by the I-NAS method can automatically match specific datasets and has good generalization ability. However, I-NAS also has limitations. It uses the entire image as the network input, which inevitably results in data loss, and it consumes considerable memory because the size of the input image directly affects the graphics card memory required. In future work, the image will be further processed to reduce the loss of feature information and the burden on graphics card memory, and more efficient search algorithms will be sought to further improve the classification performance.
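The label-masking idea summarized above, feeding the full image while letting only the pixels of a chosen index set contribute to the loss, can be sketched as follows. This is a minimal numpy illustration under our own assumptions: the function name, array shapes, and the cross-entropy loss are illustrative, not the paper's exact implementation.

```python
import numpy as np

def masked_pixel_loss(log_probs, labels, mask):
    """Cross-entropy over a full image, restricted to selected pixels.

    log_probs: (H, W, C) per-pixel log class probabilities
    labels:    (H, W) integer class labels
    mask:      (H, W) boolean, True only for pixels in the chosen index set
    """
    h, w = labels.shape
    # Pick the log probability of the true label at every pixel
    ll = log_probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    # Average the negative log likelihood over masked pixels only
    return -(ll * mask).sum() / mask.sum()
```

Because the gradient of this loss is zero at unmasked pixels, the network sees the full spatial context of the image while only the labeled training pixels drive parameter updates.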