
1. Introduction
The origins of the convolutional neural network (CNN) date back to the 1960s. In experiments on cells of the cat visual cortex, Hubel and Wiesel^{1} observed for the first time that neurons in the visual cortex are sensitive to moving edges and proposed the concept of the “receptive field.” They further discovered the hierarchical processing of information in visual cortical pathways, pointing out that simple cells detect location information and complex cells integrate the responses of simple cells. The concept of the receptive field proposed in Ref. 1 was later introduced into research on CNNs. In the 1980s, Fukushima and Miyake^{2} proposed the “neocognitron” based on the receptive field, which can be regarded as the first implementation of a CNN. The neocognitron decomposes a visual pattern into several subpatterns and processes them on hierarchically connected feature planes, so that recognition succeeds even if the object is displaced or slightly deformed. It was the first artificial neural network based on local connectivity and a hierarchical structure among neurons. For lack of a suitable learning algorithm at the time, however, the network relied on unsupervised learning and was mainly applied to handwritten digit recognition. Researchers subsequently tried to use the multilayer perceptron to learn features instead of designing them manually and trained the model with the backpropagation (BP) algorithm, which was first proposed by Werbos.^{3} BP gained recognition through the work of Rumelhart et al.^{4} LeCun et al.^{5} applied BP networks to handwritten digit recognition, showing that large BP networks can be applied to real image-recognition problems without a large, complex preprocessing stage requiring detailed engineering.
LeCun et al.^{6} summarized the end-to-end training principle of modular systems and proposed a CNN architecture called “LeNet-5,” which at that time outperformed all other techniques on a standard handwritten digit recognition task. However, several shallow machine learning models^{7,8} were proposed in the same period, and the traditional BP neural network encountered problems such as local optima, overfitting, and vanishing gradients^{9} as the number of layers increased, so research on deep neural network models was shelved. Hinton et al.^{10,11} found that artificial neural networks with multiple hidden layers have excellent feature learning ability. The learned features characterize the data more fundamentally, which benefits visualization and classification, and the vanishing-gradient problem in neural network training can be alleviated through normalized initialization.^{12} Since then, deep learning has attracted increasing attention. The CNN model AlexNet, presented by Krizhevsky et al.^{13} at the ILSVRC-2012 image classification competition,^{14,15} achieved a top-5 test error rate of 15.3%, almost halving the 26.2% achieved by the second-best entry. CNNs have proved effective in various fields of visual recognition^{13,16–18} and have drawn growing interest from deep learning researchers. LeCun et al.^{19} published a review article in Nature titled “Deep learning,” which sheds light on the basic principles and core strengths of deep learning. This paper first introduces the history of CNNs and then analyzes the development of CNN architectures for image classification. The advantages and disadvantages of the various architectures are then compared, and the future development of CNNs is discussed.
2. Deep Convolutional Neural Network
Since AlexNet^{13} achieved remarkable results in the ILSVRC-2012 image classification competition, more and more research has focused on improving CNN architectures. The visual geometry group (VGG) network^{20} and the inception module of GoogLeNet^{21,22} demonstrated the benefits of increasing network depth and width. ResNets^{23,24} constructed the residual learning block through identity-mapping shortcut connections, allowing neural network models to break through the barrier of hundreds or even thousands of layers. DenseNet^{25} and others^{26} confirmed that reformulating the connections between network layers can further improve the learning and representational properties of deep networks. In this section, we first introduce the basic composition and characteristics of CNNs through the LeNet-5 model proposed by LeCun et al.^{6} The classical deep CNN architectures of recent years are then analyzed in turn.
2.1. LeNet
LeCun et al.^{6} proposed a CNN named LeNet-5. The network model of LeNet-5 is shown in Fig. 1. According to Fig. 1, a CNN architecture is generally composed of convolution layers, subsampling (pooling) layers, and fully connected layers. These three components are explained in turn below.
2.1.1. Convolution layer
The convolution layer consists of multiple feature maps, which are obtained by convolving convolution kernels with the input signal. Each convolution kernel is a weight matrix, which can be a $3\times 3$ or $5\times 5$ matrix for a single-channel two-dimensional (2D) image. Figure 2 illustrates an example of 2D convolution. The convolution operation provides a way to process variable-size inputs, and different input features are extracted through the convolution operations in a convolution layer. The first layer extracts lower-level features such as edges, end points, and corners.
Higher layers then extract more complex, higher-level features by processing the lower-level features. The convolution layer is mainly characterized by sparse interactions and weight sharing.
Sparse interactions
Traditional neural networks use matrix multiplication to build connections between inputs and outputs, so each output unit interacts with each input unit. When an input image contains thousands of pixels, this form of connection increases both the storage requirements of the model and the amount of computation. In contrast, the convolution network has sparse interactions (also known as sparse connectivity), which are achieved by making the convolution kernel far smaller than the input. A graphical interpretation of sparse connections is shown in Fig. 3, in which the input unit ${X}_{3}$ and the output units affected by ${X}_{3}$ are highlighted. If there are $m$ inputs and $n$ outputs, the fully connected form of the model requires $m\times n$ parameters, and the complexity of the corresponding algorithm is $O(m\times n)$. With sparse connections, each output has only $k$ connections ($k\ll m$), so the layer needs only $k\times n$ parameters, and the complexity of the corresponding algorithm is $O(k\times n)$. The sparse interactions of the convolution layer thus reduce the storage requirements of the model and the computation needed to obtain the output, improving the efficiency of the model.
Weight sharing
The convolution layer also has the characteristic of weight sharing, which is realized by the convolution kernel. Convolution kernels are used to control the number of parameters and to impose a spatially restricted weighting that can handle variable-size inputs. Weight sharing means that the units in a layer use the same weights and biases.
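The parameter counts discussed above can be checked with a short sketch. The sizes `m`, `n`, and `k` below are hypothetical values chosen only for illustration, and the one-dimensional `conv1d_valid` helper is an assumption of this sketch rather than part of any network described in this paper.

```python
def dense_params(m: int, n: int) -> int:
    """Weights in a fully connected layer: every output sees every input."""
    return m * n


def sparse_params(k: int, n: int) -> int:
    """Weights when each of the n outputs connects to only k inputs."""
    return k * n


def shared_params(k: int) -> int:
    """Weights when all outputs reuse one kernel of k values (weight sharing)."""
    return k


def conv1d_valid(x, kernel):
    """Slide one shared kernel over x (no padding, stride 1).

    As in most deep learning libraries, the kernel is not flipped,
    so strictly speaking this is a cross-correlation.
    """
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(len(x) - k + 1)]


m, k = 1000, 3          # hypothetical input size and kernel size
n = m - k + 1           # number of outputs of a "valid" convolution
print(dense_params(m, n), sparse_params(k, n), shared_params(k))
print(conv1d_valid([1, 2, 3, 4, 5], [1, 0, -1]))
```

With weight sharing, the number of weights no longer depends on the input size at all, which is why the same kernel can be applied to inputs of different sizes.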
For example, the C1 layer of LeNet-5 is a convolution layer computed with six convolution kernels, and each kernel keeps fixed weights while it is convolved with the previous layer. When the input is a single-channel signal, the C1 layer contains six convolution kernels of size $1\times 5\times 5$. If the biases are taken into account, the C1 layer contains a total of $6\times 1\times 5\times 5+6=156$ parameters. Compared with the fully connected architecture, weight sharing greatly reduces the number of trainable parameters, which effectively prevents the overfitting caused by a large number of parameters and improves the efficiency of the network.
2.1.2. Subsampling layer
A subsampling (pooling) layer is usually inserted periodically between convolution layers. Its function is to gradually reduce the spatial size of the data, thereby reducing the number of parameters in the network and the consumption of computing resources. The pooling layer can also learn some invariant features of the input. Commonly used pooling methods are global average pooling^{27} and max pooling. The input to a pooling layer is generally a feature map obtained from a convolution operation. The most commonly used max pooling layer is shown in Fig. 4. A max pooling unit is sensitive only to the maximum value in its neighborhood, not to its exact location. Therefore, by pooling the obtained features, the network can learn some invariant features of the input. In LeNet-5, the max pooling layer slides a $2\times 2$ spatial window with a stride of 2 over the feature map, and the maximum value in each window is taken as the output.
2.1.3. Fully connected layer
After a series of convolution and pooling layers, the feature maps of the image have been extracted, and all the neurons in the feature maps are flattened and fed into a fully connected layer.
Finally, the output can be classified by a softmax layer. The function of the fully connected layer is to integrate the class-discriminative local information from the convolution and pooling layers^{28} so as to improve the performance of the whole CNN. LeNet-5 is a classical CNN architecture. The combination of convolution, pooling, and fully connected layers remains the basic building block of modern deep CNNs, and LeNet-5 was therefore groundbreaking for their development.
2.2. AlexNet
Because of insufficient computing hardware and data, LeNet-5 did not attract enough attention after it was proposed. With the development of computer hardware and the growth of the data available for training, the AlexNet network^{13} won the ILSVRC-2012 image classification competition^{15} with a far lower error rate than the runner-up. Since then, deep neural networks have attracted widespread attention. The structure of AlexNet is shown in Fig. 5. Compared with LeNet-5, the improvements of the AlexNet architecture are as follows:
AlexNet is a milestone in the development of deep CNNs and triggered a new wave of neural network research. Its success depended mainly on the development of computer hardware and the growth of data sets.
2.3. ZFNet
After AlexNet achieved excellent results in the ImageNet image classification competition, researchers began to study CNNs in greater depth. However, there was no clear theoretical explanation of why a CNN model performs well. Zeiler and Fergus^{31} proposed a visualization technique for understanding CNNs and, based on it, proposed ZFNet. The network makes only minor improvements to AlexNet; the main contribution of Ref. 31 is to explain, to a certain extent, why CNNs are effective and how network performance can be improved. The main contributions are detailed as follows:
ZFNet is shown in Fig. 6. It changes the size of the convolution kernel in AlexNet’s first layer from $11\times 11$ to $7\times 7$ and the stride of that kernel from 4 to 2. Compared with the single AlexNet model, ZFNet reduces the top-5 error rate by 1.7%,^{31} which confirms the value of this change.
2.4. VGG16/19
Shallow neural network models have certain limitations in large-scale image recognition tasks. To further explore the performance of deeper network models, Simonyan and Zisserman^{20} proposed VGG. The main contribution of VGG is a thorough evaluation of networks of increasing depth using an architecture with very small ($3\times 3$) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16 to 19 weight layers. Simonyan and Zisserman^{20} described six different network configurations and compared them on the ImageNet dataset. The configurations are listed in Table 1, and the performance of the corresponding models is shown in Table 2.
Table 1 ConvNet configuration.^{20}
Table 2 ConvNet performance at a single test scale.^{20}
Unlike AlexNet and ZFNet, VGG uses small $3\times 3$ convolution kernels throughout the network and builds deep networks by stacking them. To keep the computational complexity at each feature layer roughly constant, the number of convolution kernels in the next layer is doubled whenever the max pooling layer halves the size of the feature map. The configurations in Table 1 have nearly the same number of parameters, and Table 2 shows the results of the various VGG networks in a single-scale test. The VGG19 model achieved the best result, with an error rate of 8.0%, which confirms that increasing network depth helps to improve the accuracy of image classification. The result of A-LRN in Table 2 is worse than that of A, which shows that local response normalization (LRN) does not benefit the classification results. LRN has since been replaced by batch normalization (BN).^{32} The innovation of VGG lies mainly in its use of small $3\times 3$ convolution kernels. The receptive field of two stacked $3\times 3$ convolutions is equivalent to that of a $5\times 5$ convolution (as shown in Fig. 7), and the receptive field of three stacked $3\times 3$ convolutions is equivalent to that of a $7\times 7$ convolution. The network uses three $3\times 3$ convolutions instead of one $7\times 7$ convolution for two main reasons. First, the stack contains three ReLU layers instead of one, making the decision function more discriminative. Second, it reduces the number of parameters. For example, if the input and output both have $C$ channels, three $3\times 3$ convolution layers require $3\times (3\times 3\times C\times C)=27\times C\times C$ parameters, whereas one $7\times 7$ convolution layer requires $7\times 7\times C\times C=49\times C\times C$ parameters.
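The comparison between stacked $3\times 3$ layers and a single $7\times 7$ layer can be verified numerically. The following is a minimal sketch that assumes stride-1 convolutions without bias terms; the channel count `C = 256` is an arbitrary illustrative value, and the helper names are hypothetical.

```python
def conv_weights(kernel: int, c_in: int, c_out: int) -> int:
    """Number of weights in one kernel x kernel convolution layer (no bias)."""
    return kernel * kernel * c_in * c_out


def stacked_receptive_field(kernel: int, layers: int) -> int:
    """Receptive field of `layers` stacked stride-1 convolutions.

    Each additional layer enlarges the receptive field by (kernel - 1).
    """
    return 1 + layers * (kernel - 1)


C = 256  # illustrative channel count, same for input and output

three_3x3 = 3 * conv_weights(3, C, C)   # 27 * C * C
one_7x7 = conv_weights(7, C, C)         # 49 * C * C

print(stacked_receptive_field(3, 2))    # two 3x3 layers cover a 5x5 window
print(stacked_receptive_field(3, 3))    # three 3x3 layers cover a 7x7 window
print(three_3x3, one_7x7, three_3x3 / one_7x7)
```

For any $C$, the stacked design needs $27/49$, or about 55%, of the weights of the single large kernel while covering the same receptive field.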
Before VGG, An et al.^{33} also experimented with small convolution kernels, but their network was not as deep as VGG and was not tested on the large-scale ImageNet dataset. By using small convolution kernels, VGG allows the CNN to reach a depth of 19 layers. In the ILSVRC-2014 image classification competition, VGG took second place with a top-5 error rate of 7.3% (Ref. 20), which again confirms the benefit of depth for classification results.
2.5. GoogLeNet/Inception v1 to v3
GoogLeNet and VGG were the winner and runner-up, respectively, of the ILSVRC-2014 image classification competition. VGG built a deeper network model from small convolution kernels, whereas GoogLeNet was inspired by network in network^{27} to broaden the network structure and proposed the inception module.^{21} A network built from inception modules can describe the input data better while further increasing the depth and width of the model. The inception module has been updated and improved continually since it was proposed. The different versions are described as follows.
2.5.1. Inception v1
The main highlight of inception v1 is the introduction of the $1\times 1$ convolution kernel, inspired by network in network.^{27} The structure of inception v1 is shown in Fig. 8. As noted in Sec. 2.1, a convolution layer can reduce or increase the dimension of its output by changing its number of channels (filters). In inception v1, the dimension is reduced mainly by $1\times 1$ convolution kernels, which reduces the number of network parameters and feature maps. Convolving the input feature maps with a $1\times 1$ kernel recombines the channels at each spatial position while leaving the spatial size unchanged, which can greatly improve the accuracy of image classification.
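The effect of $1\times 1$ dimension reduction on the parameter count can be illustrated with a small sketch. The channel numbers below (192 input channels reduced to 16 before a $5\times 5$ convolution with 32 filters) are assumed values chosen for illustration, biases are ignored, and the helper names are hypothetical.

```python
def conv_weights(kernel: int, c_in: int, c_out: int) -> int:
    """Number of weights in one kernel x kernel convolution layer (no bias)."""
    return kernel * kernel * c_in * c_out


def pointwise_conv(pixel, weights):
    """Apply a 1x1 convolution at one spatial position.

    `pixel` holds the channel values at that position; each row of
    `weights` produces one output channel as a weighted sum of channels.
    """
    return [sum(w * v for w, v in zip(row, pixel)) for row in weights]


# Direct 5x5 convolution versus 1x1 reduction followed by 5x5 convolution.
direct = conv_weights(5, 192, 32)
reduced = conv_weights(1, 192, 16) + conv_weights(5, 16, 32)
print(direct, reduced)

# A 1x1 convolution mapping 4 channels to 2 channels at a single pixel.
print(pointwise_conv([1, 2, 3, 4], [[1, 0, 0, 0], [0, 1, 1, 0]]))
```

Under these assumed sizes, the reduced branch needs roughly one tenth of the weights of the direct $5\times 5$ convolution.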
Inception v1 also uses convolution kernels of $1\times 1$, $3\times 3$, and $5\times 5$ in parallel, which increases the adaptability of the network to changes in the scale of the input image. The GoogLeNet constructed from inception v1 modules is shown in Fig. 9. GoogLeNet has 22 layers and is deeper and wider than VGG. GoogLeNet took first place in the ILSVRC-2014 image classification competition with a top-5 error rate of 6.7% (Ref. 21).
2.5.2. Inception v2
The architecture of inception v2, shown in Fig. 10, updates inception v1 mainly in the following aspects:
The inception v2 architecture yielded a top-5 error rate of 4.9% on the ImageNet test data set,^{22} which was lower than the 4.94% top-5 error rate of the PReLU network proposed by He et al.^{34} at around the same time. The PReLU network’s top-5 error rate of 4.94% was the first to surpass human-level performance (5.1%)^{15} on the visual recognition challenge.
2.5.3. Inception v3
The architecture of inception v3^{22} is shown in Fig. 11. It updates inception v2 mainly as follows:
The inception v3 module obtained a top-5 error rate of 3.58% (Ref. 21) on the ImageNet test set.
2.6. ResNets
2.6.1. ResNet
The development of the CNN models described above shows that increasing the depth and width of a neural network can improve its performance. For example, VGG greatly improves on AlexNet by adding depth. For plain networks such as VGG, however, simply increasing the depth leads to vanishing or exploding gradients. He et al.^{23} pointed out that the problem of vanishing gradients has been largely addressed by normalized initialization^{12} and intermediate normalization layers. Although networks with dozens of layers can be trained in this way, another problem arises: degradation. As shown in Fig. 13, when the number of network layers increases, the accuracy on the training set saturates or even decreases. This cannot be interpreted as overfitting, because an overfitted model would perform better on the training set. The degradation problem shows that deep networks are not easy to optimize well. He et al.^{23} proposed ResNet to solve this problem. The main contribution of ResNet is to remove the side effect (degradation) of increasing network depth, so that network performance can be improved simply by adding layers. A ResNet constructed from residual learning blocks can break through the 100-layer barrier and even reach 1000 layers. ResNet is mainly composed of the residual learning block shown in Fig. 14. In this block, if the original function to be learned is $H(x)$, the stacked layers instead learn the residual $F(x)=H(x)-x$, and the original function becomes $F(x)+x$. The two formulations can represent the same function, but the difficulty of optimization is different.
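The residual formulation $y=F(x)+x$ can be sketched in a few lines. This is a simplified illustration that assumes small fully connected layers in place of convolutions and omits the ReLU that ResNet applies after the addition; the function names and weight matrices are hypothetical.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def linear(v, weights):
    """Matrix-vector product; each row of `weights` gives one output."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """Compute y = F(x) + x with F(x) = w2 * relu(w1 * x).

    The shortcut adds the input back unchanged, so it introduces
    no additional parameters.
    """
    f = linear(relu(linear(x, w1)), w2)
    return [fi + xi for fi, xi in zip(f, x)]


zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [1.0, -2.0]

# With all residual weights at zero, F(x) = 0 and the block is an identity map.
print(residual_block(x, zeros, zeros))
```

The block reduces to the identity whenever the residual branch outputs zero, whereas a plain stack of nonlinear layers would have to fit the identity mapping explicitly.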
In the extreme case, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.^{23} The identity shortcut connection adds no extra parameters or computation to the network but can greatly increase the training speed of the model and improve the training result. Two kinds of residual block are used in the ResNet structure for ImageNet. One stacks two $3\times 3$ convolution layers as a residual block, shown in Fig. 15 (left), and the other connects $1\times 1$, $3\times 3$, and $1\times 1$ convolution layers as a “bottleneck” building block, shown in Fig. 15 (right). In the bottleneck building block, the first $1\times 1$ convolution reduces the dimension of the feature map from 256 to 64. The $3\times 3$ convolution is then applied, and finally the second $1\times 1$ convolution restores the dimension to 256. ResNet, constructed from residual learning blocks, won first place in the ILSVRC-2015 image classification competition with a top-5 error rate of 3.57%.^{23} As the number of layers increases, ResNet handles the degradation problem well, as shown in Figs. 16 and 17.
2.6.2. Improvement of ResNet
Soon after ResNet was proposed, He et al.^{24} further studied the identity mapping in the residual learning block and improved it. The method was compared with the original ResNet and with Highway networks.^{23,35} The results further confirm the importance of identity mapping. The main contributions of Ref. 24 are as follows:
Among the various types of shortcut connections shown in Fig. 18, the network built from the original shortcut connections achieved an error rate of 6.61% (Ref. 24) on the CIFAR-10 data set, better than the other connections. This confirms the importance of the identity mapping. In the same experiments,^{24} among the various placements of the activation, the best classification results were obtained with the full pre-activation connection [Fig. 19(e)].
2.6.3. Other residual networks
As residual networks become deeper, diminishing feature reuse makes training very slow.^{36} To reduce the impact of this “feature disappearance,” Zagoruyko and Komodakis^{37} proposed a wide-dropout block, as shown in Fig. 20(d). This block improves the original residual network by increasing the network width, and their experiments demonstrate its feasibility. ResNeXt, proposed by Xie et al.,^{38} introduces the concept of cardinality as a dimension beyond depth and width and shows that increasing cardinality is more effective than increasing depth or width. The residual learning block of ResNeXt is shown in Fig. 21. ResNeXt secured second place in the ILSVRC-2016 image classification competition with a top-5 error rate of 3.03%.^{38} In addition, Szegedy et al.^{39} proposed inception v4 by combining the inception module with the residual learning block and constructed the Inception-ResNet-v2 network. The network achieved good results in the ILSVRC-2016 image classification competition, with a top-5 error rate of 3.08%.^{39}
2.7. DenseNet
Since ResNet was proposed, many networks have been developed from it, each with its own characteristics and improved performance. As CNNs become deeper, information about the input or the gradient must pass through many layers and can vanish and “wash out” by the time it reaches the end (or beginning) of the network.^{25} This has prompted a rethinking of network structure.
Before the dense block was proposed, Huang et al.^{40} achieved good results by training deep networks with stochastic depth. This shows that some layers in a residual network carry information that is unnecessary for the classification result and can be dropped during training. Building on Ref. 40 and on the idea of creating short paths from early layers to later layers, Huang et al.^{25} proposed DenseNet, which is mainly composed of dense blocks, as shown in Fig. 22. A traditional $L$-layer neural network has $L$ connections, whereas a dense block of $L$ layers has $\frac{L\times (L+1)}{2}$ connections. The growth rate $k$, a hyperparameter of the network, is the number of input channels added each time the data pass through a layer. For example, if ${K}_{0}$ is the number of input feature maps and each nonlinear transformation $H$ outputs $k$ feature maps, then the input of the $i$’th layer has ${K}_{0}+(i-1)\times k$ feature maps. One major difference between DenseNet and the networks mentioned previously is that each DenseNet layer can output far fewer feature maps. DenseNet is constructed mainly from dense blocks, as shown in Fig. 23. Within a dense block, all feature maps must have the same spatial size. Transition layers are placed between dense blocks to perform downsampling. The main advantage of DenseNet is that features extracted by earlier layers can still be used directly by deeper layers through the dense connections. By choosing the growth rate $k$, DenseNet can adjust the number of feature maps and thus effectively reduce the number of parameters. DenseNet outperformed ResNet on the CIFAR-10 dataset, and on the ImageNet dataset, DenseNet was able to converge faster as the number of layers increased.^{25} DenseNet remains a widely used neural network model today.
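The two counting formulas for a dense block, $\frac{L\times (L+1)}{2}$ connections and ${K}_{0}+(i-1)\times k$ input feature maps at layer $i$, can be tabulated with a short sketch. The values `K0 = 64`, `k = 32`, and the block depths are assumptions chosen for illustration, and the helper names are hypothetical.

```python
def dense_block_connections(num_layers: int) -> int:
    """Direct connections in a dense block with `num_layers` layers."""
    return num_layers * (num_layers + 1) // 2


def input_feature_maps(i: int, k0: int, k: int) -> int:
    """Feature maps entering layer i (1-indexed).

    Layer i receives the k0 block inputs plus the k outputs
    of each of the (i - 1) preceding layers.
    """
    return k0 + (i - 1) * k


K0, k = 64, 32  # illustrative block input width and growth rate

print(dense_block_connections(5))
print([input_feature_maps(i, K0, k) for i in range(1, 7)])
```

The input width grows linearly with depth, so a small growth rate keeps each layer narrow even though every layer sees all earlier feature maps.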
3. Auxiliary Methods and Strategies
This section introduces some auxiliary methods and strategies in the development of CNNs, including improvements in activation functions, normalization, and other strategies.
3.1. Activation Function
Before the ReLU activation function, traditional neural networks mostly used the sigmoid as the activation function. In general, sigmoid functions can be divided into the logistic sigmoid and the tanh sigmoid. The sigmoid function in this paper refers to the former, as shown in Fig. 24. The output of the sigmoid function lies between 0 and 1, which is consistent with the definition of a probability output. The nonlinear sigmoid function is widely used as an activation function because of its large signal gain in the central region and small signal gain on both sides, similar to the excitation and suppression states of neurons. However, as the number of network layers increases, the sigmoid gradient gradually becomes smaller, learning becomes very slow, and the gradient may even vanish. Networks therefore could not be made much deeper until the ReLU function was introduced. ReLU is the activation function used by many current network models. Compared with the sigmoid, ReLU has the following advantages: unilateral inhibition, relatively wide excitation boundaries, sparse activation, and alleviation of the vanishing-gradient problem. Leaky ReLU^{41} modified the negative half axis of the ReLU function to avoid a zero gradient, but the experimental results were not greatly improved. He et al.^{34} proposed the PReLU function on this basis, as shown in Fig. 25. PReLU adds a learnable parameter $a$. When $a=0$, PReLU becomes the ReLU function, and when $a=0.01$, PReLU becomes the leaky ReLU. Experiments show that this adaptive activation function can improve the classification results of the network.
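The activation functions discussed above can be written down directly. The sketch below uses scalar inputs and hypothetical function names; the ten-layer product at the end is only a rough illustration of how repeated sigmoid derivatives shrink a gradient, not a full backpropagation.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, with output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid; its maximum is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def relu(x: float) -> float:
    return max(0.0, x)


def prelu(x: float, a: float) -> float:
    """PReLU: identity for positive inputs, slope `a` for negative inputs.

    a = 0 gives ReLU and a = 0.01 gives the leaky ReLU; in PReLU the
    slope `a` is learned during training.
    """
    return x if x > 0 else a * x


# Even in the best case, ten stacked sigmoid derivatives scale the
# gradient by at most 0.25 ** 10, i.e., less than one millionth.
print(sigmoid_grad(0.0) ** 10)

print(relu(-2.0) == prelu(-2.0, 0.0))   # PReLU with a = 0 matches ReLU
print(prelu(-2.0, 0.01), prelu(3.0, 0.01))
```

For positive inputs, the ReLU family has a derivative of exactly 1, which is why it does not shrink the gradient in the way the sigmoid does.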
3.2. Normalization
During training, when the input distribution of a hidden layer in a deep neural network shifts, the overall distribution gradually approaches the upper and lower bounds of the value range of the nonlinear function, which slows training convergence. The network therefore needs to be normalized. The introduction of BN^{32} was a milestone in the field of deep learning. BN operates on each minibatch and transforms the input distribution of the nonlinear function into a standard normal distribution with a mean of 0 and a variance of 1, so that the inputs to the activation function fall in the region where the nonlinear function is sensitive. BN increases the speed of training, accelerates convergence, and improves the classification results. Moreover, BN can be seen as a regularization technique that prevents overfitting, similar to dropout.^{30} Adding BN to a network also makes the network less demanding with respect to initialization. The disadvantage of BN is that the network depends on the minibatch size, and a change in batch size affects the classification results. When the batch size is small, a network using BN layers performs noticeably worse, as shown in Fig. 26. This is a limitation for networks that require a batch size of 1 or 2 for other visual recognition tasks.^{16,42,43} To alleviate this problem, layer normalization,^{44} instance normalization,^{45} group normalization,^{46} and other normalization methods have been proposed. Figure 26 confirms that the accuracy of group normalization is more stable than that of BN when the batch size changes. Figure 27 is a schematic diagram of the various normalization methods. Normalization is an indispensable part of modern convolution network architectures and has made a vital contribution to the development of CNNs.
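The difference between normalizing across the batch and normalizing within a sample can be made concrete with a simplified sketch. It assumes that each sample is a flat list of channel values with no spatial dimensions, uses training-mode batch statistics only, and omits the learnable scale and shift in the group normalization helper; all function names are hypothetical.

```python
import math


def normalize(values, eps=1e-5):
    """Shift and scale a list of values to mean 0 and variance close to 1."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def batch_norm(batch, gamma=1.0, beta=0.0):
    """Normalize each feature across the samples of a minibatch.

    `batch` is a list of samples, each a list of features. The statistics
    depend on which samples share the minibatch.
    """
    columns = [normalize(col) for col in zip(*batch)]
    return [[gamma * v + beta for v in row] for row in zip(*columns)]


def group_norm(sample, num_groups):
    """Normalize groups of channels within one sample.

    Assumes len(sample) is divisible by num_groups. The result does not
    depend on the batch size, because no other sample is involved.
    """
    size = len(sample) // num_groups
    out = []
    for g in range(num_groups):
        out.extend(normalize(sample[g * size:(g + 1) * size]))
    return out


print(batch_norm([[1.0, 10.0], [3.0, 30.0]]))
print(batch_norm([[3.0, 5.0]]))            # batch size 1: all features become 0
print(group_norm([1.0, 3.0, 10.0, 30.0], num_groups=2))
```

With a batch size of 1, the batch statistics carry no information and every normalized feature collapses to zero, whereas the group-normalized output of a sample is the same for any batch size.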
3.3. Other Strategies
In the development of CNNs for image classification, improvements in network initialization methods have also played a positive role. The purpose of network initialization is to ensure that, when the network is initialized, the activation values of each layer are neither saturated nor zero. Sutskever et al.^{47} proposed a Xavier initialization method to solve the network initialization problem. AlexNet^{13} used random initialization for network training, and VGGNet^{20} initialized the deep network by first training a shallow model and then applying its parameters to the deeper model. Ioffe and Szegedy^{32} proposed BN, and He et al.^{34} proposed the Microsoft Research Asia (MSRA) initialization method. Both took the nonlinear ReLU function into account, and the initialization problem of deep neural networks was solved more effectively. In addition to network initialization, innovations in optimization methods have also promoted the development of CNNs. Optimization algorithms developed from stochastic gradient descent (SGD) to gradient descent with momentum,^{47} and then to Adam with an adaptive learning rate,^{48} which is widely used today. In more recent work, Reddi et al.^{49} used a simple convex optimization problem to explain how the exponential moving average used in Adam can lead to nonconvergence and proposed a modified algorithm that goes beyond Adam.
4. Comparison of Various Image Classification Methods
The analysis and comparison of various image classification methods are shown in Table 3. The main comparative factors include model name, publication year, test data set, evaluation index, number of model parameters, experimental results, algorithm characteristics, and notes (such as achievements and whether multiscale training is needed).
Table 3 Experimental results of various image classification methods.
Table 3 compares the performance of various image classification algorithms on the ImageNet or CIFAR-10 data set and summarizes the characteristics of each algorithm.
5. Summary
From the initial appearance of AlexNet to the gradually increasing number of layers in VGGNet, these networks all show the potential of neural network depth. The ingenious design of the inception module also shows the appeal of neural network architecture design. ResNets further explore the effect of network depth, which plays a crucial role in the development of today’s networks. DenseNet, on the other hand, improves the representation learning of CNNs from the perspective of feature reuse, which provides a new direction for the development of network architectures. In the following, the development trends of CNNs in image classification are discussed from several aspects.
5.1. Application of Transfer Learning
When a deep neural network is applied to a large amount of data, training the model and optimizing its parameters take a great deal of computation and time. If a model that has already been trained at great cost can solve problems of the same kind, its cost-effectiveness is greatly improved, which encourages the use of transferable models for similar problems. Zeiler and Fergus^{31} pretrained a CNN on the ImageNet data set and then transferred the network to the Caltech-101 and Caltech-256 image classification data sets for training and testing. The accuracy of image classification was improved by about 40%. Through transfer learning, a well-trained model can be applied to similar problems with small adjustments and still achieve good results. A transferable model also makes it possible to solve problems effectively when little original data is available.
Using transfer learning, network models for image classification can be further applied to semantic segmentation, object detection, and other fields. In recent years, many researchers have devoted themselves to the field of transfer learning.
5.2. Introduction of Visual Attention Mechanism
In recent years, attention mechanisms have been adopted in the field of deep learning. The visual attention mechanism is a special signal processing mechanism of the human visual system. By rapidly scanning the whole image, human vision identifies the target area that deserves attention, devotes more attention resources to that area to obtain more detailed information about the target, and suppresses other irrelevant information. Hu et al.^{52} introduced an attention mechanism to construct the squeeze-and-excitation module, which recalibrates the feature channels by explicitly modeling the relationships between them. SENet won the ILSVRC 2017 image classification championship with a top-5 test set error rate of 2.251%. In the future, CNN design could introduce and strengthen attention mechanisms in different layers to bring computer vision closer to human visual ability.
5.3. Study on the Stability of CNN
A CNN has a large number of parameters, so reproduced experiments often fail to match the results reported in the corresponding papers. At present, parameter settings for training CNNs are mostly based on experience and practice. The analysis of parameter optimization and the study of system stability remain open problems.
5.4. Hardware Development and Data Set Building
The development of deep learning is inseparable from innovation in hardware and the expansion of data sets. With the support of better hardware and larger data sets, CNNs will be able to further address the cognitive defects of current network structures.
References
1. D. H. Hubel and T. N. Wiesel, “Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex,” J. Physiol., 160(1), 106–154 (1962). https://doi.org/10.1113/jphysiol.1962.sp006837
2. K. Fukushima and S. Miyake, “Neocognitron: a new algorithm for pattern recognition tolerant of deformations and shifts in position,” Pattern Recognit., 15(6), 455–469 (1982). https://doi.org/10.1016/0031-3203(82)90024-3
3. P. J. Werbos, “Beyond regression: new tools for prediction and analysis in the behavioral sciences,” Harvard University (1974).
4. D. E. Rumelhart, G. E. Hinton and R. J. Williams, “Learning representations by back-propagating errors,” Nature, 323(6088), 533–536 (1986). https://doi.org/10.1038/323533a0
5. Y. LeCun et al., “Handwritten digit recognition with a back-propagation network,” Adv. Neural Inf. Process. Syst., 2(2), 396–404 (1990).
6. Y. LeCun et al., “Gradient-based learning applied to document recognition,” Proc. IEEE, 86(11), 2278–2324 (1998). https://doi.org/10.1109/5.726791
7. C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn., 20(3), 273–297 (1995). https://doi.org/10.1023/A:1022627411411
8. Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” in Eur. Conf. Comput. Learn. Theory, 23–37 (1995).
9. F. Girosi, M. Jones and T. Poggio, “Regularization theory and neural networks architectures,” Neural Comput., 7(2), 219–269 (1995). https://doi.org/10.1162/neco.1995.7.2.219
10. G. E. Hinton, S. Osindero and Y. W. Teh, “A fast learning algorithm for deep belief nets,” Neural Comput., 18(7), 1527–1554 (2006). https://doi.org/10.1162/neco.2006.18.7.1527
11. G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, 313(5786), 504–507 (2006). https://doi.org/10.1126/science.1127647
12. X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” J. Mach. Learn. Res., 9, 249–256 (2010).
13. A. Krizhevsky, I. Sutskever and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Int. Conf. Neural Inf. Process. Syst., 1097–1105 (2012).
14. J. Deng et al., “ImageNet: a large-scale hierarchical image database,” in IEEE Conf. Comput. Vision and Pattern Recognit., 248–255 (2009). https://doi.org/10.1109/CVPR.2009.5206848
15. O. Russakovsky et al., “ImageNet large scale visual recognition challenge,” Int. J. Comput. Vision, 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
16. J. Long, E. Shelhamer and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE Conf. Comput. Vision and Pattern Recognit. (2015).
17. S. Ren et al., “Faster R-CNN: towards real-time object detection with region proposal networks,” in Int. Conf. Neural Inf. Process. Syst., 91–99 (2015).
18. A. Toshev and C. Szegedy, “DeepPose: human pose estimation via deep neural networks,” in IEEE Conf. Comput. Vision and Pattern Recognit., 1653–1660 (2014). https://doi.org/10.1109/CVPR.2014.214
19. Y. LeCun, Y. Bengio and G. E. Hinton, “Deep learning,” Nature, 521(7553), 436–444 (2015). https://doi.org/10.1038/nature14539
20. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Int. Conf. Learn. Represent. (2015).
21. C. Szegedy et al., “Going deeper with convolutions,” in IEEE Conf. Comput. Vision and Pattern Recognit. (2015). https://doi.org/10.1109/CVPR.2015.7298594
22. C. Szegedy et al., “Rethinking the inception architecture for computer vision,” in IEEE Conf. Comput. Vision and Pattern Recognit., 2818–2826 (2016). https://doi.org/10.1109/CVPR.2016.308
23. K. He et al., “Deep residual learning for image recognition,” in IEEE Conf. Comput. Vision and Pattern Recognit. (2016). https://doi.org/10.1109/CVPR.2016.90
24. K. He et al., “Identity mappings in deep residual networks,” in Eur. Conf. Comput. Vision, 630–645 (2016).
25. G. Huang et al., “Densely connected convolutional networks,” in IEEE Conf. Comput. Vision and Pattern Recognit., 2261–2269 (2017). https://doi.org/10.1109/CVPR.2017.243
26. Y. Chen et al., “Dual path networks,” in Int. Conf. Neural Inf. Process. Syst. (2017).
27. M. Lin, Q. Chen and S. Yan, “Network in network,” in Int. Conf. Learn. Represent. (2014).
28. T. N. Sainath et al., “Deep convolutional neural networks for LVCSR,” in IEEE Int. Conf. Acoust. Speech and Signal Process., 8614–8618 (2013). https://doi.org/10.1109/ICASSP.2013.6639347
29. V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Int. Conf. Mach. Learn., 807–814 (2010).
30. N. Srivastava et al., “Dropout: a simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res., 15(1), 1929–1958 (2014).
31. M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Eur. Conf. Comput. Vision, 818–833 (2014).
32. S. Ioffe and C. Szegedy, “Batch normalization: accelerating deep network training by reducing internal covariate shift,” in Int. Conf. Mach. Learn., 448–456 (2015).
33. D. C. Cireşan et al., “Flexible, high performance convolutional neural networks for image classification,” in Int. Joint Conf. Artif. Intell., 1237–1242 (2011).
34. K. He et al., “Delving deep into rectifiers: surpassing human-level performance on ImageNet classification,” in IEEE Int. Conf. Comput. Vision, 1026–1034 (2015). https://doi.org/10.1109/ICCV.2015.123
35. R. K. Srivastava, K. Greff and J. Schmidhuber, “Highway networks,” in Int. Conf. Mach. Learn. Workshop (2015).
36. R. K. Srivastava, K. Greff and J. Schmidhuber, “Training very deep networks,” in Conf. and Workshop Neural Inf. Process. Syst. (2015).
37. S. Zagoruyko and N. Komodakis, “Wide residual networks,” in Br. Mach. Vision Conf. (2016).
38. S. Xie et al., “Aggregated residual transformations for deep neural networks,” in IEEE Conf. Comput. Vision and Pattern Recognit., 5987–5995 (2017). https://doi.org/10.1109/CVPR.2017.634
39. C. Szegedy et al., “Inception-v4, Inception-ResNet and the impact of residual connections on learning,” in Workshop Track Int. Conf. Learn. Represent. (2016).
40. G. Huang et al., “Deep networks with stochastic depth,” in Eur. Conf. Comput. Vision, 646–661 (2016).
41. A. L. Maas, A. Y. Hannun and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Int. Conf. Mach. Learn. (2013).
42. R. Girshick, “Fast R-CNN,” in IEEE Int. Conf. Comput. Vision, 1440–1448 (2015).
43. K. He et al., “Mask R-CNN,” in IEEE Int. Conf. Comput. Vision, 2980–2988 (2017).
44. J. L. Ba, J. R. Kiros and G. E. Hinton, “Layer normalization,” (2016).
45. D. Ulyanov, A. Vedaldi and V. Lempitsky, “Instance normalization: the missing ingredient for fast stylization,” (2016).
46. Y. Wu and K. He, “Group normalization,” in IEEE Conf. Comput. Vision and Pattern Recognit. (2018).
47. I. Sutskever et al., “On the importance of initialization and momentum in deep learning,” in Int. Conf. Mach. Learn., 1139–1147 (2013).
48. D. P. Kingma and J. Ba, “Adam: a method for stochastic optimization,” (2014).
49. S. J. Reddi, S. Kale and S. Kumar, “On the convergence of Adam and beyond,” in Int. Conf. Learn. Represent. (2018).
50. A. Berg, J. Deng and L. Fei-Fei, “Large scale visual recognition challenge 2010,” (2012). http://www.image-net.org/challenges
51. K. He et al., “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Trans. Pattern Anal. Mach. Intell., 37(9), 1904–1916 (2015). https://doi.org/10.1109/TPAMI.2015.2389824
52. J. Hu, L. Shen and G. Sun, “Squeeze-and-excitation networks,” in IEEE Conf. Comput. Vision and Pattern Recognit. (2018).
Biography
Wei Wang received his BS, MS, and PhD degrees in information and communication engineering from the National University of Defense Technology, China, in 1997, 2003, and 2010, respectively. He is currently a professor at the College of Computer and Communication, Changsha University of Science and Technology, China. His research interests include signal processing, computer vision, and pattern recognition.
Yujing Yang received his BS degree in communication engineering from Changsha University of Science and Technology, China, in 2017. He is currently a postgraduate at Changsha University of Science and Technology, China. His research interests include computer vision and pattern recognition.
Xin Wang received her BS and MS degrees in information and communication engineering from Wuhan University of Technology, China, in 1998 and 2006, respectively. She is currently a lecturer at the College of Computer and Communication, Changsha University of Science and Technology, China. Her research interests include signal processing, computer vision, and pattern recognition.
Weizheng Wang received his BS degree in applied mathematics from Hunan University in 2005 and his PhD in technology of computer application from Hunan University in 2011. He is currently a lecturer at the College of Computer and Communication Engineering of Changsha University of Science and Technology. His research interests include built-in self-test, design for testability, low-power testing, and test generation.
Ji Li received his BS degree from Beijing Information Science and Technology University, China, in 2002, and his PhD from Wuhan University, China, in 2010. He is currently a lecturer at the College of Computer and Communication, Changsha University of Science and Technology, China. His research interests include signal processing, computer vision, and pattern recognition.