With the rapid development of remote sensing technology, high-resolution satellite image understanding has attracted increasing research interest, and advanced approaches are in urgent demand. As one of the most crucial tasks, aircraft detection and recognition have shown increasing significance in both the military and civilian communities. However, these two tasks remain challenging for three reasons. First, compared to natural images, satellite images have larger size, more complicated backgrounds, and lower resolution. Second, different kinds of aircraft may be highly similar in shape and color, whereas the same kind of aircraft may look dissimilar in different scenarios. Finally, rotation invariance is required owing to the overhead viewpoint of the remote imaging sensor.
Conventional works on aircraft detection and recognition can be categorized into two groups, namely part-based and shape-based approaches. Typically, Zhang et al.1 proposed a rotation-invariant part-based model. They found that the dominant orientation of the HOG feature in the detection region is highly related to the direction of the plane. Accordingly, they encoded a new HOG feature to achieve rotation invariance in a way similar to SIFT. The recognition process is similar to the successful discriminatively trained part-based models.1 In addition to the part-based methods, Liu et al.2 used a coarse-to-fine shape prior combined with template matching to detect aircraft. Part-based and shape-based methods have also been explored in Refs. 3–6. However, these methods are developed under the assumption that the object proposals or contours can be well acquired, which is difficult to satisfy in practice.
Recently, a great amount of attention has been paid to deep convolutional neural networks (CNNs). CNNs achieve state-of-the-art performance in many computer vision tasks, such as object detection7 and semantic segmentation.8 In comparison with designing low-level hand-crafted features, CNNs directly capture high-level representations of objects, which proves to largely benefit vision tasks. CNNs usually have a huge number of parameters to learn; thus, they require a large amount of labeled data for training to avoid overfitting. However, the insufficient number of labeled remote sensing images has prevented CNNs from being widely applied to aircraft detection and recognition. Even so, there are some pioneering works that apply CNNs to aircraft detection.9,10 However, these works focus only on detection rather than recognition; that is, they can provide the locations of the planes but are unable to tell their categories. Moreover, in these works, object proposals are generated by traditional methods that are decoupled from the end-to-end learning of the CNN frameworks. The underlying drawback is that such step-by-step approaches do not exploit the high-level features in CNNs for proposal generation, and errors in the proposals are hard to correct.
To address the above-mentioned issues, we propose a CNN-based model, called DeepPlane, which unifies aircraft detection and recognition into a single framework. Instead of using sliding windows or objectness for detection followed by a separate classification step, we train DeepPlane in an end-to-end manner, which leads to low computation cost and high accuracy. Moreover, DeepPlane can be generalized to other object recognition tasks in remote sensing images without any modification.
For clarity, we highlight our contributions as follows. First, we propose a unified DeepPlane network, motivated by faster R-CNN,11 for aircraft detection and recognition. To the best of our knowledge, none of the present approaches adopt CNNs for aircraft recognition. We also show that this network can be modified to perform pose estimation simultaneously with minor additional computation cost. Second, we tackle the problem of training data insufficiency by leveraging the strengths of the “inception module,”12 which helps to largely reduce the number of parameters in our model. To further avoid overfitting, several regularization methods are added, such as dropout, dataset augmentation, and gradient noise addition. Last but not least, we collect a large aircraft satellite image dataset for aircraft detection, recognition, and pose estimation. More details can be found in Sec. 3.
Figure 1 shows the architecture of the proposed DeepPlane model. As shown, this model consists of two subnetworks: the first generates object proposals, and the second performs classification and bounding-box regression. The two subnetworks share a common part for feature extraction and use the same feature maps for their subsequent procedures. Taking the whole image as input, we obtain feature maps via several convolutional (conv) and max pooling layers. The feature maps are then used by the first subnetwork to obtain object proposals. In the second subnetwork, a region feature vector is extracted by the region of interest (RoI) pooling layer for each object proposal. After passing through two fully connected (FC) layers, these region feature vectors finally branch into three output layers: one produces softmax probabilities of the proposals belonging to each category, one outputs four values encoding the refined bounding-box positions, and the last one produces softmax probabilities of the proposals belonging to each pose. The details are described in the subsequent sections.
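For concreteness, the three sibling output heads can be sketched in a few lines of NumPy. This is an illustrative sketch, not the actual Caffe implementation: the 4096-d feature width and the random weights are assumptions, while the head sizes follow the paper (six aircraft classes plus background, eight pose bins, four box offsets per class).

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def detection_heads(region_feature, w_cls, w_box, w_dir):
    """Apply the three sibling output layers to one RoI feature vector."""
    cls_probs = softmax(region_feature @ w_cls)  # (7,) category probabilities
    box_offsets = region_feature @ w_box         # 4 offsets per class: (24,)
    dir_probs = softmax(region_feature @ w_dir)  # (8,) pose probabilities
    return cls_probs, box_offsets, dir_probs

# Toy sizes: 4096-d FC feature, 6 aircraft classes + background, 8 pose bins.
rng = np.random.default_rng(0)
feat = rng.standard_normal(4096)
cls_p, box_t, dir_p = detection_heads(
    feat,
    rng.standard_normal((4096, 7)) * 0.01,
    rng.standard_normal((4096, 4 * 6)) * 0.01,
    rng.standard_normal((4096, 8)) * 0.01,
)
```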
Deep CNN: Generating High-Level Features
High-level features, which are shared between the two subnetworks, are critical for the performance of the whole architecture. Compared to natural images, remote sensing images have more complex backgrounds, which calls for a more complicated model, and the complexity of a model generally grows with the number of parameters in the neural network. However, the remote sensing community has no large datasets, such as ImageNet,13 from which to learn the huge number of parameters in such a complicated model. Therefore, directly adopting models from the deep learning community may not be reasonable. For instance, VGG-16,14 which consists of 16 layers, is employed in several famous architectures to generate convolutional feature maps. Unfortunately, this model fails in aircraft detection due to overfitting.
A new feature extraction network is required to narrow the gap between a complicated model and scarce labeled data. In the proposed method, we build a 29-layer CNN architecture to generate high-level features. This architecture contains, in order, a convolutional layer, a max pooling layer, another convolutional layer, another max pooling layer, and three “inception modules,” as shown in Figs. 1 and 2(a). Within the inception modules, 1×1 convolutional layers reduce the number of feature map channels, which is beneficial for dimension reduction: they greatly reduce the number of parameters before the expensive 3×3 and 5×5 convolutions. The main advantage of this deep CNN architecture is that it generates high-level features with fewer parameters. Concretely, the storage size of our final model is 232 MB, whereas that of faster R-CNN is 607 MB.
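The parameter savings from 1×1 reduction layers are easy to verify by counting weights. The channel sizes below are illustrative, not the paper's exact configuration; the point is that a cheap 1×1 layer before a 3×3 convolution cuts the weight count severalfold.

```python
def conv_params(in_ch, out_ch, k):
    # Weight count of a k x k convolution (biases ignored for simplicity).
    return in_ch * out_ch * k * k

# A 3x3 convolution applied directly to a 512-channel feature map:
direct = conv_params(512, 256, 3)
# The same 3x3 output after a 1x1 layer first shrinks 512 -> 128 channels:
reduced = conv_params(512, 128, 1) + conv_params(128, 256, 3)
print(direct, reduced)  # 1179648 vs 360448 weights
```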
Region Proposal Network: Producing Object Proposals
The region proposal network (RPN)11 is adopted as the first subnetwork to produce object proposals. As shown in Fig. 2(b), RPN uses fully convolutional networks to predict proposal locations and confidences. On top of the network, two sibling convolutional layers are used for regressing box locations and predicting box confidences, respectively. The box regression is performed relative to a set of default bounding boxes (called anchors). RPN is well suited for aircraft recognition. Traditional object proposal methods are generally built on low-level features, such as superpixels15 or edges.16 Since high-spatial-resolution satellite images contain large amounts of complicated background, these methods produce many false-positive proposals, e.g., buildings. In contrast, RPN can achieve high precision and recall rates, since its positive training samples contain only aircraft. This is quite beneficial for improving the accuracy of the subsequent recognition.
However, RPN cannot be applied directly to aircraft detection, considering the different object sizes in the two tasks. In our framework, the aspect ratios of the anchors are set to [0.7, 1, 1.3], and the anchor scales are chosen in a data-driven way: we compute the scales and aspect ratios of all aircraft in our dataset and take the cluster centers as the anchor scales and aspect ratios. Each combination of scale and aspect ratio yields one kind of anchor. In addition, the top-ranked 300 proposals are selected for the second subnetwork.
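Anchor generation is the cross product of the scale and ratio sets. The sketch below uses the aspect ratios [0.7, 1, 1.3] from the text; the scale values are placeholders, since the paper derives them from cluster centers of the dataset rather than fixed numbers.

```python
import numpy as np

def make_anchors(base_scales, aspect_ratios):
    """Return (w, h) pairs, one per scale/ratio combination.

    aspect_ratio is interpreted as h / w; each anchor keeps an
    area of roughly scale**2 while changing its shape.
    """
    anchors = []
    for s in base_scales:
        for r in aspect_ratios:
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            anchors.append((w, h))
    return anchors

# Placeholder scales; the paper takes them from aircraft-size cluster centers.
anchors = make_anchors([64, 128, 256], [0.7, 1.0, 1.3])
```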
Multitask Loss: Performing Multitask Learning
In the second subnetwork (recognition network), a convolutional layer is first added for dimension reduction. Then, for each RoI, we extract a fixed-length feature vector using RoI pooling. After passing through two FC layers, these feature vectors are fed into three sibling output layers. As shown in Fig. 1, the first softmax layer outputs a discrete probability distribution representing the likelihood of the proposal belonging to each category (the aircraft classes plus background). The second softmax layer also outputs a discrete probability distribution, representing the likelihood of the proposal belonging to each of the eight poses. We convert the aircraft pose estimation problem into a simple direction classification problem: the direction of the aircraft is discretized into eight categories, as shown in Fig. 3 (left). Given an aircraft, the direction label is set to the number of the bin where the nose appears. After pose estimation, all aircraft can be rotated to a canonical direction, so rotation-invariant features can be extracted to further determine the aircraft type. The final regression layer predicts the bounding-box regression offsets (t_x, t_y, t_w, t_h) for each aircraft class: the scale-invariant translation relative to an object proposal is represented by (t_x, t_y), while the log-space height/width shift is represented by (t_w, t_h). This is the parameterization given in Ref. 17.
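The box parameterization of Ref. 17 and the eight-bin direction label can both be written out directly. This is a sketch of the standard R-CNN-style encoding under the assumption that boxes are given in center/size format; the variable names are ours.

```python
import math

def encode_box(proposal, gt):
    """R-CNN-style regression targets (parameterization of Ref. 17).

    Boxes are (cx, cy, w, h). t_x, t_y are scale-invariant
    translations; t_w, t_h are log-space size shifts.
    """
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))

def direction_bin(nose_angle_deg, num_bins=8):
    # Discretize the nose direction into one of eight 45-deg bins.
    return int(nose_angle_deg % 360 // (360 / num_bins))
```

A proposal that already matches the ground truth encodes to all-zero targets, which is what makes the regression well conditioned.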
In the training stage, each RoI has three ground-truth values: class u, direction d, and bounding-box regression target v. We use a multitask loss to jointly train the classification, pose estimation, and bounding-box regression, namely11

L = L_cls(p, u) + κ L_dir(q, d) + λ [u ≥ 1] L_reg(t^u, v), (1)

where p and q are the predicted category and pose distributions, the indicator [u ≥ 1] disables box regression for background RoIs, and κ and λ balance the three terms.
The bounding-box regression targets are normalized to have zero mean and unit variance.
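This whitening step is a one-liner per statistic; a minimal sketch, assuming the statistics are computed over the training set and kept around to un-normalize predictions at test time.

```python
import numpy as np

def normalize_targets(targets):
    """Whiten regression targets to zero mean and unit variance.

    targets: (N, 4) array of (t_x, t_y, t_w, t_h). The returned
    mean/std must be stored to un-normalize predictions later.
    """
    mean = targets.mean(axis=0)
    std = targets.std(axis=0) + 1e-8  # avoid division by zero
    return (targets - mean) / std, mean, std
```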
The most effective way to help a CNN converge is to train it with more data, but the number of labeled aircraft images available is limited, so we augment the data to address this problem. We first sample smaller images from the original high-spatial-resolution satellite images to fit within GPU memory. Meanwhile, the same aircraft surrounded by different backgrounds is treated as a different sample (Fig. 3, right-top). Then, we rotate (by 90 deg) and horizontally reflect the training images, while the testing images remain unchanged.
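The fourfold expansion reported later (330 to 1320 training images) follows from combining the 90-deg rotation with the horizontal reflection. A NumPy sketch, assuming images are arrays; the exact pipeline is ours for illustration.

```python
import numpy as np

def augment(image):
    """Return the four variants used to expand the training set:
    original, 90-deg rotation, horizontal reflection, and both."""
    rotated = np.rot90(image)
    return [image, rotated, np.fliplr(image), np.fliplr(rotated)]

# 330 training images -> 1320 after augmentation.
images = [np.zeros((600, 600, 3))] * 330
augmented = [v for img in images for v in augment(img)]
```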
Gradient noise addition
Motivated by Ref. 18, we add time-dependent Gaussian noise to the gradient at every training step t. To a certain extent, the added noise enlarges the search scope of the weights and helps the model escape local minima. To the best of our knowledge, we are the first to adopt this strategy in an object detection neural network. It can be implemented by

g_t ← g_t + N(0, σ_t²), with σ_t² = η / (1 + t)^γ, (4)

where g_t is the gradient at step t, η controls the noise magnitude, and γ is the decay exponent of Ref. 18.
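A sketch of this annealed-noise update in Python, assuming the decaying variance schedule of Ref. 18 with exponent γ = 0.55 from that reference; the function signature is ours.

```python
import numpy as np

def noisy_gradient(grad, step, eta, gamma=0.55, rng=None):
    """Add zero-mean Gaussian noise with annealed variance
    sigma_t^2 = eta / (1 + t)**gamma to a gradient (after Ref. 18)."""
    rng = rng or np.random.default_rng()
    sigma = np.sqrt(eta / (1.0 + step) ** gamma)
    return grad + rng.normal(0.0, sigma, size=np.shape(grad))
```

As training progresses the noise variance decays toward zero, so exploration is strongest early on.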
To share the deep CNN between the two subnetworks, the training process contains four steps, as shown in Algorithm 1. Stochastic gradient descent is adopted to optimize the model. We use an initial learning rate of 0.01 and drop it by a factor of 10 every 3K iterations. Momentum and weight decay are set to 0.9 and 0.0005, respectively. The maximum number of iterations is 10K, and the dropout ratio is 0.5. Images are resized so that their shortest sides are 600 pixels in both training and testing. We implement our model in MATLAB® and C++ (Caffe19) on a GPU (GTX 970). The total training time is about 6 h, and the required GPU memory is less than 2.5 GB. The average testing time per image is less than 200 ms.
|1: Train network 1 initialized with GoogLeNet model.|
|2: Train network 2 initialized with GoogLeNet model using step-1 RPN proposals.|
|3: Initialize network 1 with step-2 model, then train network 1 with deep CNN layers fixed.|
|4: Initialize network 2 with step-2 model, then train network 2 with deep CNN layers fixed using step-3 RPN proposals.|
In this section, we evaluate our model on our dataset HRAD2016, which is collected from Google Earth20 at high spatial resolution. (We show some samples and results on the site provided in Ref. 21. The full dataset will be released soon.) We sample 495 images from the original high-spatial-resolution airport satellite images. These images contain 2520 military planes, which can be classified into six categories according to their usages and shapes, as shown in Fig. 3 (right-bottom). From left to right, the numbers of samples per class are [727, 627, 719, 64, 248, 135], respectively. We split HRAD2016 into two parts for training the model and testing the performance. However, the number of AEM planes is only 64; thus, the training set might contain few AEM planes if we split the dataset randomly. To remove the influence of such an imbalanced partition, we first split the AEM samples into three equal-sized subsets, and the remaining classes are processed in the same way. The final training set has 330 images and the testing set has 165 images. After dataset augmentation (rotation and horizontal reflection, as described in Sec. 2.4), the training dataset is expanded four times (from 330 to 1320 images). Figure 4 shows some qualitative results of our method.
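The imbalance-aware split described above can be sketched as a per-class partition. This is an illustrative sketch of stratified splitting, not the authors' exact procedure; the labels and the 2:1 train/test fraction match the 330/165 split in the text.

```python
import random
from collections import defaultdict

def stratified_split(samples, train_fraction=2 / 3, seed=0):
    """Split (item, label) pairs so every class keeps the same
    train/test proportion, protecting rare classes such as AEM."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in samples:
        by_class[label].append(item)
    train, test = [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        cut = round(len(items) * train_fraction)
        train += [(i, label) for i in items[:cut]]
        test += [(i, label) for i in items[cut:]]
    return train, test
```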
In this section, we discuss some performance issues of the aircraft detection, including its sensitivity to different anchors, as well as the detection accuracy in comparison with related works. Detection rate (namely, the object recall given 300 proposals per image) is used to evaluate the performance. First, Fig. 5(a) shows the detection rates when varying the intersection over union (IoU) threshold for different anchor scales, with the aspect ratios fixed. Specifically, we achieve a 97.70% detection rate with the cluster-center scales, which is the best among the four kinds of scales. Second, the effects of the aspect ratios, with fixed scales, are shown in Fig. 5(b). The best performance is again achieved when the cluster center is used. Notice that the default scales and aspect ratios of faster R-CNN are included in this experiment; the sizes and ratios chosen from the cluster centers achieve better performance. In addition, we conduct experiments to evaluate the detection accuracy in comparison with three state-of-the-art alternatives: BING,22 EdgeBox,16 and selective search.15 As shown in Fig. 5(c), our method achieves better detection performance than the others.
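Detection rate is simply the recall of ground-truth boxes by the proposal set at a given IoU threshold. A minimal sketch with boxes in corner format (x1, y1, x2, y2); the helper names are ours.

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def detection_rate(gt_boxes, proposals, thresh=0.5):
    """Fraction of ground-truth boxes covered by at least one proposal."""
    hit = sum(any(iou(g, p) >= thresh for p in proposals) for g in gt_boxes)
    return hit / len(gt_boxes)
```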
Comparisons with related methods
We use the result of faster R-CNN on our dataset as the baseline. As shown in Table 1, the proposed method outperforms the baseline by a significant margin in both precision and run time. The difference mainly results from overfitting: the aircraft dataset is too small to fit the huge number of parameters of faster R-CNN. HOG + SVM has frequently been used for aircraft recognition in recent years, but it performs poorly on our dataset. It should be noted that HOG + SVM is only developed for binary (plane or background) classification; the mAP drops to 27.2% if we use HOG + SVM for six-class classification. SSD7 is the state-of-the-art method on VOC 2012, but its performance is 5.7% lower than that of our method. However, SSD has the lowest run time because it unifies proposal generation and detection into one stage. In addition, it can be seen that data augmentation plays an important role in our method (about a 4% increase in classification mAP).
Quantitative comparisons between our method and other related methods. Here, cls stands for the mAP of classification, while dir is the precision of pose estimation. HOG + SVM means that we extract the HOG features of the proposals and use an SVM as the classifier. The input size of SSD is 500×500. Ours1 is the version of our method without data augmentation, while Ours2 is the version with data augmentation.
|Method||Baseline||HOG + SVM||SSD7||Ours1||Ours2|
The method with the best performance is highlighted in boldface.
An asterisk represents a result that cannot be reported.
To avoid bias, we perform threefold cross-validation, repeat each experiment three times, and adopt the mean value as the result. The final mAPs of classification are [91.5%, 91.9%, 92.6%], and the precisions of pose estimation are [98.4%, 98.6%, 98.7%]. Thus, the previously reported performance is not biased. Owing to the time-consuming training process of deep learning, we perform the remaining experiments without cross-validation.
Performance in different situations
In Eq. (1), we use κ and λ to balance the three losses. The relationship between the mAP and the two parameters is shown in Table 2. We obtain the best result with classification plus box regression (C + B) and the worst with classification plus pose estimation (C + O). This means that bounding-box regression is more relevant to classification than pose estimation is: box regression makes the locations of the proposals more accurate, which is helpful for classification. However, the performance decreases when pose estimation is enabled (C versus C + O and C + B versus C + B + O). On one hand, this may be caused by a nonoptimal value of κ, as shown in Table 3; on the other hand, it may result from inadequate training images. Therefore, we choose these settings of κ and λ for the remaining experiments.
The mAP of classification when setting different κ and λ in Eq. (1). C: κ=λ=0 (only classification). C + B: κ=0,λ=1 (only classification and box regression). C + O: κ=1,λ=0 (only classification and pose estimation). C + B + O: κ=λ=1 (all are included).
|C + B (%)||91.9||90.4||91.9||89.1||98.7||90.3||90.9|
|C + O (%)||88.7||90.2||86.8||86.3||95.7||89.6||83.8|
|C + B + O (%)||90.0||90.3||89.1||88.8||96.2||89.7||86.1|
The method with the best performance is highlighted in boldface.
The mAP of pose estimation and classification with different κ in Eq. (1). We set λ to 1.
In Fig. 1, the 1×1 convolutional layer before the RoI pooling layer is used to reduce dimensionality before the expensive FC layers; it reduces the number of channels from 512 to 256. This removes many parameters and yields better generalization, as shown in Table 4. In addition to reducing parameters, the added convolutional layer is also helpful for transfer learning. In our framework, the deep CNN is initialized with a model pretrained on natural images, whereas the subsequent layers are trained from scratch for aircraft recognition on satellite images. Therefore, the added convolutional layer can be viewed as an intermediate layer between the feature extractor and the aircraft detector that enables transfer learning.
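The saving on the FC side is easy to quantify. The 7×7 RoI output size and 4096-wide FC layer below are assumptions (typical Fast R-CNN values, not stated in the text); only the 512-to-256 channel reduction comes from the paper.

```python
def fc_params(in_features, out_features):
    # Weight count of a fully connected layer (biases ignored).
    return in_features * out_features

ROI_POOL = 7     # assumed RoI pooling output size (7x7, as in Fast R-CNN)
FC_WIDTH = 4096  # assumed width of the first FC layer
without_reduction = fc_params(ROI_POOL * ROI_POOL * 512, FC_WIDTH)
with_reduction = fc_params(ROI_POOL * ROI_POOL * 256, FC_WIDTH)
# Halving the channels halves the first FC layer's weights.
```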
The mAP of classification with/without 1×1 convolutional layer before the RoI pooling layer in Fig. 1.
The method with the best performance is highlighted in boldface.
Adding gradient noise is also beneficial to training our model. As shown in Table 5, we achieve the best result when η = 0.01. The result with η = 0.3 is even worse than that with no gradient noise; this is possibly caused by adding too much noise, which leads to numerical instability.
The mAP of classification when setting η=(0,0.01,0.3) in Eq. (4). We do not add gradient noise when setting η=0.
The method with the best performance is highlighted in boldface.
In this paper, we have introduced a unified DeepPlane network for aircraft detection and recognition. The main contributions of our work are twofold. First, a joint DeepPlane network consisting of the RPN and the multitask learning network is proposed. The proposed DeepPlane network has three properties: it is end-to-end, multitask, and easy to generalize. Second, an “inception” module is introduced to bridge the gap between the large number of CNN parameters and the insufficient training aircraft images. Several regularization methods, such as dropout, dataset augmentation, and gradient noise addition, are also employed to prevent overfitting. Experiments on our collected aircraft dataset demonstrate that integrating detection with classification yields higher accuracy. Practically, the CNN takes a large region as input and encodes it into a fixed-length feature vector; thus, the loss of detailed information is inevitable in the proposed model. Our future work will concentrate on taking features from different levels so that we can effectively integrate detailed and contextual information to further improve performance.
This work was supported by the National Natural Science Foundation of China under Grants Nos. 91646207, 91438105, 91338202, 61403376, and 61375024.
Hongzhen Wang received his BS degree from the Department of Automation, Institute of Engineering, Ocean University of China, Qingdao, Shandong, China, in 2013. He is currently pursuing his PhD degree at the Institute of Automation, Chinese Academy of Sciences, Beijing. His research interests include object detection, semantic segmentation, and deep learning.
Yongchao Gong received his BS degree in automation from the School of Information Science and Technology, University of Science and Technology of China, Hefei, Anhui, China, in 2012. He is currently pursuing his PhD at the Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include pattern recognition, machine learning, and image processing.
Ying Wang received his BS and MS degrees from Nanjing University of Information Science and Technology, China, in 2005 and 2008, respectively, and received his PhD from the Institute of Automation, Chinese Academy of Sciences, China, in 2012. He is currently an associate professor at the Institute of Automation of Chinese Academy of Sciences. His research interests include computer vision, pattern recognition, and remote sensing.
Lingfeng Wang received his BS degree in computer science from Wuhan University, Wuhan, China, in 2007. He is currently an associate professor at the National Laboratory of Pattern Recognition of Institute of Automation, Chinese Academy of Sciences. His research interests include computer vision and image processing.
Chunhong Pan received his BS degree from Tsinghua University, Beijing, China, in 1987, his MS degree from Shanghai Institute of Optics and Fine Mechanics, Chinese Academy of Sciences, Beijing, in 1990, and his PhD from the Institute of Automation, Chinese Academy of Sciences, in 2000, where he is currently a professor at the National Laboratory of Pattern Recognition. His research interests include computer vision, image processing, computer graphics, and remote sensing.