DeepPlane: a unified deep model for aircraft detection and recognition in remote sensing images
J. of Applied Remote Sensing, 11(4), 042606 (2017). doi:10.1117/1.JRS.11.042606
Abstract
Deep convolutional neural networks (CNNs) have shown outstanding performance in object recognition from natural images. In contrast, object recognition from remote sensing images is more challenging due to the complex background and the inadequate data available for training a deep network with a huge number of parameters. We propose a unified deep CNN, called DeepPlane, to simultaneously detect the position and classify the category of aircraft in remote sensing images. This model consists of two correlative deep networks: the first is designed to generate object proposals as well as feature maps, and the second is cascaded upon the first to perform classification and box regression in one shot. The “inception module” is introduced to tackle the insufficient training data problem, one of the most challenging obstacles to detection in remote sensing images. Extensive experiments demonstrate the effectiveness of the proposed DeepPlane model. Specifically, DeepPlane models detection and classification jointly and achieves 91.9% mAP across six categories of aircraft, which advances the state of the art, sometimes considerably, for both tasks.

1. Introduction

With the rapid development of remote sensing technology, high-resolution satellite image understanding has attracted increasing research interest, and advanced approaches are in urgent demand. As one of the most crucial tasks, aircraft detection and recognition have shown increasing significance in both the military and civilian communities. However, these two tasks are still challenging for three reasons. First, compared to natural images, satellite images have larger size, more complicated background, and lower resolution. Second, different kinds of aircraft can be highly similar in shape and color, whereas the same kind of aircraft may appear quite different in different scenarios. Finally, rotation invariance is required owing to the viewpoint of the remote imaging sensor.

Conventional works on aircraft detection and recognition can be categorized into two groups, namely part-based and shape-based approaches. Typically, Zhang et al.1 proposed a rotation-invariant part-based model. They found that the dominant orientation of the HOG feature in the detection region is highly correlated with the direction of the plane. Accordingly, they encoded a new HOG feature to achieve rotation invariance in a way similar to SIFT. The recognition process is similar to the successful discriminatively trained part-based models.1 In addition to the part-based methods, Liu et al.2 used a coarse-to-fine shape prior combined with template matching to detect aircraft. Part-based and shape-based methods have also been explored in Refs. 3–6. However, these methods are developed under the assumption that object proposals or contours can be well acquired, which is difficult to satisfy in practice.

Recently, a great deal of attention has been paid to deep convolutional neural networks (CNNs). CNNs achieve state-of-the-art performance in many computer vision tasks, such as object detection7 and semantic segmentation.8 In contrast to designing low-level hand-crafted features, CNNs can directly capture high-level representations of objects, which has proven to benefit vision tasks greatly. CNNs usually have a huge number of parameters to learn and therefore require a large amount of labeled data for training to avoid overfitting. However, the insufficient number of labeled remote sensing images prevents CNNs from being widely applied to aircraft detection and recognition. Even so, there are some pioneering works that apply CNNs to aircraft detection.9,10 However, these works focus on detection rather than recognition; that is, they can only provide the locations of the planes but are unable to tell their categories. Moreover, in these works, object proposals are generated by traditional methods that are decoupled from the end-to-end learning of the CNN frameworks. The main drawbacks of such step-by-step approaches are that the high-level features in CNNs are not exploited for proposal generation and that errors in the proposals are hard to correct.

To address the above-mentioned issues, we propose a CNN-based model, called DeepPlane, which unifies aircraft detection and recognition into a single framework. Instead of using sliding windows or objectness measures for detection followed by a separate classification step, we train DeepPlane in an end-to-end manner, which leads to low computational cost and high accuracy. Moreover, DeepPlane can be generalized to other object recognition tasks in remote sensing images without any modification.

For clarity, we highlight our contributions as follows. First, we propose a unified DeepPlane network, motivated by faster R-CNN,11 for aircraft detection and recognition. To the best of our knowledge, none of the existing approaches adopt CNNs for aircraft recognition. We also show that this network can be extended to perform pose estimation simultaneously with minor additional computational cost. Second, we tackle the problem of insufficient training data by leveraging the strengths of the “inception module,”12 which helps to largely reduce the number of parameters in our model. To further avoid overfitting, several regularization methods are added, such as dropout, dataset augmentation, and gradient noise addition. Last but not least, we collect a large aircraft satellite image dataset for aircraft detection, recognition, and pose estimation. More details can be found in Sec. 3.

The remainder of this paper is organized as follows. The proposed method is introduced in Sec. 2. Then, we present the experimental results in Sec. 3. Finally, conclusions are made in Sec. 4.

2. Proposed Method

Figure 1 shows the architecture of the proposed DeepPlane model. As shown, the model consists of two subnetworks. The first generates object proposals, and the second performs classification and bounding-box regression. The two subnetworks share a common part for feature extraction and use the same feature maps for their subsequent procedures. Taking the whole image as input, we obtain feature maps via several convolutional (conv) and max pooling layers. The feature maps are then used in the first subnetwork to obtain object proposals. In the second subnetwork, a region feature vector is extracted by the region of interest (RoI) pooling layer for each object proposal. After passing through two fully connected (FC) layers, these region feature vectors finally branch into three output layers: one produces softmax probabilities of the proposal belonging to each category, one outputs four values encoding the refined bounding-box position, and the last produces softmax probabilities of the proposal belonging to each pose. The details are described in the subsequent sections.

Fig. 1

Architecture of the proposed DeepPlane model. There are three parts: the deep CNN extracts features, the RPN outputs proposals, and the detection network (DN) classifies the proposals into aircraft and background. These three parts form two networks, as shown in the figure, and the deep CNN is shared between them.


2.1. Deep CNN: Generating High-Level Features

High-level features, which are shared between the two subnetworks, are critical for the performance of the whole architecture. Compared to natural images, remote sensing images have a more complex background, which calls for a more complicated model. The complexity of a model generally increases with the number of parameters in the neural network. However, there is no large dataset, such as ImageNet,13 in the remote sensing community from which to learn the huge number of parameters of such a complicated model. Therefore, adopting models directly from the deep learning community may not be reasonable. For instance, VGG-16,14 which consists of 16 layers, is employed in several well-known architectures to generate the convolutional feature maps. Unfortunately, this model fails in aircraft detection due to overfitting.

A new feature extraction network is required to narrow the gap between a complicated model and scarce labeled data. In the proposed method, we build a 29-layer CNN architecture to generate high-level features. This architecture consists, in order, of a 7×7 convolutional layer, a 3×3 max pooling layer, a 3×3 convolutional layer, another 3×3 max pooling layer, and three “inception module” blocks, as shown in Figs. 1 and 2(a). A 1×1 convolutional layer reduces the number of feature map channels, which is beneficial for dimension reduction. This subnetwork contains several 1×1 convolutional layers that greatly reduce the number of parameters before the expensive 3×3 and 5×5 convolutional layers. The main advantage of this deep CNN architecture is that it generates high-level features with fewer parameters. Concretely, the storage size of our final model is 232 MB, whereas that of faster R-CNN is 607 MB.
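For concreteness, the following is a minimal PyTorch-style sketch of a GoogLeNet-type inception module of the kind described above; the paper's implementation is in Caffe, and the channel widths shown here are illustrative assumptions rather than the exact DeepPlane configuration.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """GoogLeNet-style inception block: parallel 1x1, 3x3, 5x5 convolutions and
    a pooling branch, concatenated along the channel axis. The 1x1 convolutions
    before the 3x3/5x5 branches reduce channel depth, which is where most of
    the parameter savings come from."""

    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU(inplace=True))
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c3_red, c3, 3, padding=1), nn.ReLU(inplace=True))
        self.b5 = nn.Sequential(
            nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c5_red, c5, 5, padding=2), nn.ReLU(inplace=True))
        self.bp = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

# Channel widths below follow GoogLeNet's inception (3a) block for illustration,
# not the exact configuration used in DeepPlane.
block = InceptionModule(192, 64, 96, 128, 16, 32, 32)
feat = block(torch.randn(1, 192, 38, 38))   # -> shape (1, 256, 38, 38)
```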

Fig. 2

(a) Inception module. (b) RPN.


2.2. Region Proposal Network: Producing Object Proposals

The region proposal network (RPN)11 is adopted in the first subnetwork to produce object proposals. As shown in Fig. 2(b), RPN uses fully convolutional networks to predict proposal locations and confidences. On top of the network, two sibling 1×1 convolutional layers are used for regressing box locations and predicting box confidences, respectively. The box regression is computed relative to a set of default bounding boxes (called anchors). RPN is well suited to aircraft recognition. Traditional object proposal methods are generally built on low-level features, such as superpixels15 or edges.16 Since high-spatial-resolution satellite images contain a great deal of complicated background, these methods produce many false-positive proposals, e.g., buildings. In contrast, RPN can achieve high precision and recall, since its positive training samples contain only aircraft. This is quite beneficial for improving the accuracy of the subsequent recognition.

However, RPN cannot be applied directly to aircraft detection, because the object sizes differ considerably between the two tasks. In our framework, the scales and aspect ratios of anchors are [80×80, 112×112, 144×144] and [0.7, 1, 1.3], respectively, which yields k=9 kinds of anchors. We collect the scales and aspect ratios of all aircraft in our dataset and take the k-means cluster centers as the anchor scales and aspect ratios. In addition, the top-ranked 300 proposals are passed to the second subnetwork.
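The anchor statistics can be obtained, for example, by clustering the ground-truth box sizes. The following sketch uses scikit-learn k-means under the assumption that boxes are given as width/height pairs in pixels and that scale is taken as the square root of the box area; the paper does not specify these preprocessing details.

```python
import numpy as np
from sklearn.cluster import KMeans

def anchor_params_from_boxes(boxes_wh, k=3):
    """boxes_wh: (N, 2) array of ground-truth box widths and heights in pixels.
    Returns k representative scales (sqrt of box area) and k aspect ratios
    (height/width), each obtained by 1-D k-means clustering."""
    w, h = boxes_wh[:, 0], boxes_wh[:, 1]
    scales = KMeans(n_clusters=k, n_init=10).fit(
        np.sqrt(w * h).reshape(-1, 1)).cluster_centers_.ravel()
    ratios = KMeans(n_clusters=k, n_init=10).fit(
        (h / w).reshape(-1, 1)).cluster_centers_.ravel()
    return np.sort(scales), np.sort(ratios)

# Example with synthetic box sizes; on HRAD2016 the centers reported in the
# paper come out near [80, 112, 144] pixels and [0.7, 1.0, 1.3].
rng = np.random.default_rng(0)
boxes = rng.normal(loc=[110, 110], scale=[25, 25], size=(500, 2)).clip(40, 200)
print(anchor_params_from_boxes(boxes))
```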

2.3. Multitask Loss: Performing Multitask Learning

In the second subnetwork (the recognition network), a 1×1 convolutional layer is first added for dimension reduction. Then, for each RoI, we extract a fixed-length feature vector using RoI pooling. After passing through two FC layers, these feature vectors are fed into three sibling output layers. As shown in Fig. 1, the first softmax layer outputs a discrete probability distribution $p=(p_0,\ldots,p_Z)$, representing the likelihood of the proposal belonging to each of the $Z+1$ categories ($Z$ kinds of aircraft plus background). The second softmax layer outputs a discrete probability distribution $d=(d_0,\ldots,d_E)$, representing the likelihood of the proposal belonging to the $E$ poses. We convert the aircraft pose estimation problem into a simple direction classification problem: the direction of the aircraft is discretized into eight categories, as shown in Fig. 3 (left). Given an aircraft, the direction label is set to the number of the bin in which the nose appears. After pose estimation, all aircraft can be rotated to a common direction, so rotation-invariant features can be extracted to further determine the aircraft type. The final regression layer predicts bounding-box regression offsets $o^z=(o_x^z,o_y^z,o_w^z,o_h^z)$. For each aircraft class $z$, the scale-invariant translation relative to an object proposal is represented by $(o_x^z,o_y^z)$, while the log-space height/width shift is represented by $(o_w^z,o_h^z)$. This is the parameterization given in Ref. 17.
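As an illustration of the direction discretization, the following minimal sketch maps a nose angle to one of eight 45-deg bins; the reference axis and bin ordering are assumptions, since the paper defines the bins only pictorially in Fig. 3.

```python
import numpy as np

def direction_label(nose_angle_deg, num_bins=8):
    """Map an aircraft's nose orientation (degrees, measured counter-clockwise
    from a fixed reference axis) to one of `num_bins` discrete direction labels.
    The bin width here is 360/8 = 45 degrees; the reference axis and bin
    ordering are assumptions."""
    return int((nose_angle_deg % 360.0) // (360.0 / num_bins))

print([direction_label(a) for a in (0, 44, 45, 170, 359)])  # -> [0, 0, 1, 3, 7]
```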

Fig. 3

(a) Left: an example of the aircraft direction and the eight directions used in pose estimation. Right: sampling two small images from the original image. (b) Six categories of aircraft. Both bomber_1 and bomber_2 contain only one model of plane: bomber_1 is the B-1 Lancer and bomber_2 is the B-52 Stratofortress. The other four types consist of several models. The difference between airfreighter_1 and airfreighter_2 is whether the wing is perpendicular to the body.


In the training stage, each RoI has three ground-truth values: class $z$, direction $q$, and bounding-box regression target $r$. We use a multitask loss $L$ to jointly train classification, pose estimation, and bounding-box regression, namely

$$L(p,z,d,q,o^z,r) = L_{\mathrm{cls}}(p,z) + \kappa\,[z\ge 1]\,L_{\mathrm{dir}}(d,q) + \lambda\,[z\ge 1]\,L_{\mathrm{loc}}(o^z,r), \tag{1}$$
in which $\kappa$ and $\lambda$ balance the three losses. $L_{\mathrm{cls}}(p,z) = -\log p_z$ is the log loss for the true class $z$, and $L_{\mathrm{dir}}$ is defined analogously for the true direction $q$. $L_{\mathrm{loc}}$ measures the discrepancy between the predicted offsets $o^z$ and the ground-truth offsets $r$. The indicator function $[z\ge 1]$ equals 1 if $z\ge 1$ and 0 otherwise, since $L_{\mathrm{dir}}$ and $L_{\mathrm{loc}}$ are computed only for positive samples. We use the smooth $L_1$ (Huber) loss for $L_{\mathrm{loc}}$, as in Ref. 11:

$$L_{\mathrm{loc}}(o^z,r) = \sum_{i\in\{x,y,w,h\}} \mathrm{smooth}_{L_1}(o_i^z - r_i), \tag{2}$$
in which

$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5x^2 & \text{if } |x|<1, \\ |x|-0.5 & \text{otherwise}. \end{cases} \tag{3}$$

The targets $r_i$ are normalized to have zero mean and unit variance.
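A minimal PyTorch-style sketch of the multitask loss in Eq. (1) is given below; the head shapes and the choice of reduction are assumptions for illustration, not the exact Caffe implementation used in the paper.

```python
import torch
import torch.nn.functional as F

def deepplane_loss(cls_logits, dir_logits, box_pred, z, q, r, kappa=1.0, lam=1.0):
    """Multitask loss of Eq. (1) for a batch of RoIs.
    cls_logits: (N, Z+1) class scores, dir_logits: (N, E) pose scores,
    box_pred: (N, (Z+1)*4) per-class box offsets, z: (N,) class labels
    (0 = background), q: (N,) pose labels, r: (N, 4) regression targets."""
    L_cls = F.cross_entropy(cls_logits, z)            # -log p_z
    pos = z > 0                                       # indicator [z >= 1]
    if pos.any():
        L_dir = F.cross_entropy(dir_logits[pos], q[pos])
        n_pos = int(pos.sum())
        # Pick the 4 offsets predicted for each positive RoI's true class.
        o_z = box_pred[pos].view(n_pos, -1, 4)[torch.arange(n_pos), z[pos]]
        L_loc = F.smooth_l1_loss(o_z, r[pos], reduction='mean')
    else:
        L_dir = L_loc = cls_logits.new_zeros(())
    return L_cls + kappa * L_dir + lam * L_loc
```

Setting kappa=0 in this sketch corresponds to the C + B variant used in the experiments of Sec. 3.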

2.4. Implementation Details

2.4.1. Dataset augmentation

The most direct way to make a CNN converge well is to train it with more data, but the number of labeled aircraft available to us is limited, so we synthesize additional training data. We first crop small images from the original high-spatial-resolution satellite images to stay within GPU memory. In doing so, the same aircraft surrounded by different backgrounds is treated as a different sample (Fig. 3, right-top). Then, we rotate the training images by 90 deg and reflect them horizontally, while the testing images remain unchanged.
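A minimal sketch of the rotation/flip augmentation is shown below; the bounding-box bookkeeping and the exact set of generated copies are assumptions, since the paper only states that the training set is expanded fourfold.

```python
import numpy as np

def augment(image, boxes):
    """Generate augmented copies of a training crop: a 90-degree rotation and a
    horizontal flip, with bounding boxes adjusted to match. `image` is an
    (H, W, C) array; `boxes` is (N, 4) in (x1, y1, x2, y2) pixel coordinates.
    Whether rotated copies are also flipped is left open here."""
    h, w = image.shape[:2]
    out = [(image, boxes)]

    # Horizontal flip: x -> w - 1 - x.
    flipped = image[:, ::-1].copy()
    fb = boxes.copy()
    fb[:, [0, 2]] = w - 1 - boxes[:, [2, 0]]
    out.append((flipped, fb))

    # 90-degree counter-clockwise rotation: (x, y) -> (y, w - 1 - x).
    rotated = np.rot90(image).copy()
    rb = np.stack([boxes[:, 1], w - 1 - boxes[:, 2],
                   boxes[:, 3], w - 1 - boxes[:, 0]], axis=1)
    out.append((rotated, rb))
    return out
```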

2.4.2. Gradient noise addition

Motivated by Ref. 18, we add time-dependent Gaussian noise to the gradient $g$ at every training step $s$. To a certain extent, the added noise enlarges the search scope of the weights and helps the model escape local minima. To the best of our knowledge, we are the first to adopt this strategy in an object detection network. It is implemented as

$$g_s \leftarrow g_s + N(0,\sigma_s^2), \qquad \sigma_s^2 = \frac{\eta}{(1+s)^{\alpha}}, \tag{4}$$
where $\eta$ is selected from {0.01, 0.3} and $\alpha=0.55$. The noise level decreases as the training step $s$ increases.
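The following sketch illustrates Eq. (4) in a generic PyTorch-style training loop; where exactly the noise is injected in the paper's Caffe solver is not specified, so the hook point shown here is an assumption.

```python
import math
import torch

def add_gradient_noise(parameters, step, eta=0.01, alpha=0.55):
    """Add annealed Gaussian noise to every parameter gradient before the
    optimizer update, following Eq. (4): sigma_s^2 = eta / (1 + s)^alpha."""
    sigma = math.sqrt(eta / (1.0 + step) ** alpha)
    with torch.no_grad():
        for p in parameters:
            if p.grad is not None:
                p.grad.add_(torch.randn_like(p.grad) * sigma)

# Typical usage inside a training loop (hypothetical names):
#   loss.backward()
#   add_gradient_noise(model.parameters(), step, eta=0.01)
#   optimizer.step()
```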

2.4.3. Training process

To make the deep CNN shared between the two subnetworks, the training process consists of four steps, as shown in Algorithm 1; a sketch of the corresponding solver settings is given after the algorithm. Stochastic gradient descent is adopted to optimize the model. We use an initial learning rate of 0.01 and drop it by a factor of 10 every 3K iterations. Momentum and weight decay are set to 0.9 and 0.0005, respectively. The maximum number of iterations is 10K, and the dropout ratio is 0.5. Images are resized so that their shorter sides are 600 pixels in both training and testing. We implement our model in MATLAB® and C++ (Caffe19) on a GPU (GTX 970). The total training time is about 6 h, and the required GPU memory is less than 2.5 GB. The average testing time per image is less than 200 ms.

Algorithm 1

Training process.

1: Train network 1 initialized with GoogLeNet model.
2: Train network 2 initialized with GoogLeNet model using step-1 RPN proposals.
3: Initialize network 1 with step-2 model, then train network 1 with deep CNN layers fixed.
4: Initialize network 2 with step-2 model, then train network 2 with deep CNN layers fixed using step-3 RPN proposals.
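For reference, a sketch of the solver settings reported above (initial learning rate 0.01 dropped by 10× every 3K iterations, momentum 0.9, weight decay 0.0005, 10K iterations) is given below in PyTorch form; the placeholder model and loss are hypothetical stand-ins for one step of the alternating schedule in Algorithm 1, which the paper trains in Caffe.

```python
import torch

# Placeholder one-layer "network" standing in for one step of Algorithm 1.
model = torch.nn.Linear(256, 7)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0005)
# Drop the learning rate by a factor of 10 every 3K iterations.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3000, gamma=0.1)

for it in range(10000):                                   # 10K iterations
    loss = model(torch.randn(8, 256)).pow(2).mean()       # stand-in for Eq. (1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```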

3. Experiments

In this section, we evaluate our model on our dataset HRAD2016, which is collected from Google Earth20 with a spatial resolution of 1 m/pixel. (We show some samples and results on the site provided in Ref. 21. The full dataset will be released soon.) We sample 495 images from the original high-spatial-resolution airport satellite images. The size of each image is about 600×1000. These images contain 2520 military planes, which can be classified into six categories according to their usages and shapes, as shown in Fig. 3 (right-bottom). From left to right, the numbers of samples in each class are [727, 627, 719, 64, 248, 135], respectively. We split HRAD2016 into two parts for training the model and testing its performance. However, the number of AEW planes is only 64, so the training set might contain few AEW planes if we split the dataset randomly. To remove the influence of an imbalanced partition, we first split the AEW samples into three equal-sized subsets, and the remaining classes are processed in the same way. The final training set has 330 images and the testing set has 165 images. After dataset augmentation (rotation and horizontal reflection as described in Sec. 2.4), the training set is expanded fourfold (from 330 to 1320 images). Figure 4 shows some qualitative results of our method.
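A sketch of the class-balanced threefold partition described above is given below; the image-level grouping used in the paper is simplified to per-sample labels here, which is an assumption for illustration.

```python
import numpy as np

def stratified_threefold_split(labels, seed=0):
    """Split sample indices into three folds per class, so that rare classes
    (e.g., AEW, 64 samples) appear in every fold. Returns a list of three
    index arrays."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    folds = [[], [], []]
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        for f, part in enumerate(np.array_split(idx, 3)):
            folds[f].extend(part.tolist())
    return [np.array(sorted(f)) for f in folds]

# Two folds (about two-thirds of the images) form the training set and the
# remaining fold forms the test set.
```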

Fig. 4

Some examples of our aircraft detection results.


3.1. Detection Performance

In this section, we discuss aircraft detection performance, including its sensitivity to different anchors and its accuracy in comparison with related works. The detection rate (namely, the object recall given N proposals per image) is used to evaluate performance. First, Fig. 5(a) shows the detection rates when varying the intersection over union (IoU) for different anchor scales, with the aspect ratios fixed. Specifically, we achieve a 97.70% detection rate with scales of [80², 112², 144²], which is the best among the four sets of scales. Second, the effects of the aspect ratios, with fixed scales, are shown in Fig. 5(b). The best performance is again achieved when the cluster centers (ratio = [0.7, 1, 1.3]) are used. Note that the default values used in faster R-CNN are included in this experiment, i.e., size = [128², 256², 512²] and ratio = [0.5, 1, 2]. The sizes and ratios chosen from the cluster centers achieve better performance. In addition, we evaluate the detection accuracy by comparing with three state-of-the-art alternatives: BING,22 EdgeBox,16 and selective search.15 As shown in Fig. 5(c), our method achieves better detection performance than the others.

Fig. 5

(a) The detection rate with different anchor scales when fixing the ratios. (b) Results with different anchor ratios when fixing the scales. (c) Quantitative comparisons in detection performance between our method and other related methods, i.e., BING,22 EdgeBox,16 and selective search.15


3.2. Recognition Performance

3.2.1. Comparisons with related methods

We use the result of faster R-CNN on our dataset as the baseline. As shown in Table 1, the proposed method outperforms the baseline by a significant margin in both precision and run time. The difference mainly results from overfitting: the aircraft dataset is too small to fit the huge number of parameters of faster R-CNN. HOG + SVM has frequently been used for aircraft recognition in recent years, but it performs poorly on our dataset. It should be noted that HOG + SVM is only developed for binary (plane or background) classification; the classification mAP drops to 27.2% if we use HOG + SVM for six-class classification. SSD7 is the state-of-the-art method on VOC 2012, but its classification mAP is 5.7% lower than that of our method. However, SSD has the lowest run time because it unifies proposal generation and detection into one stage. In addition, it can be seen that data augmentation plays an important role in our method (about a 4% increase in classification mAP).

Table 1

Quantitative comparisons between our method and other related methods. Here, Cls stands for the mAP of classification, and Dir is the precision of pose estimation. HOG + SVM means that we extract HOG features from the proposals and use an SVM as the classifier. The input size of SSD is 500×500. Ours1 is our method without data augmentation, and Ours2 is our method with data augmentation.

Method    | Baseline | HOG + SVM | SSD  | Ours1 | Ours2
Cls (%)   | 84.3     | 68.9      | 86.2 | 88.0  | 91.9
Dir (%)   | 90.4     | 44.7      | *    | 95.2  | 98.6
Time (ms) | 254      | *         | 30   | 200   | 200

The method with the best performance is highlighted in boldface.

An asterisk denotes a result that cannot be reported.

To avoid bias, we perform threefold cross validation, repeat each experiment three times, and adopt the mean value as the result. The resulting classification mAPs are [91.5%, 91.9%, 92.6%], and the pose estimation precisions are [98.4%, 98.6%, 98.7%]. Thus, the previously reported performance is not biased. Because training deep networks is time-consuming, we perform the remaining experiments without cross validation.

3.2.2. Performance in different situations

In Eq. (1), we use $\kappa$ and $\lambda$ to balance the three losses. The relationship between the mAP and the two parameters is shown in Table 2. We obtain the best result when $\kappa=0$ and $\lambda=1$ and the worst when $\kappa=1$ and $\lambda=0$. This means that bounding-box regression is more beneficial to classification than pose estimation is. Naturally, bounding-box regression makes the locations of the proposals more accurate, which is helpful for classification. However, the performance decreases when we set $\kappa=1$ (C versus C + O and C + B versus C + B + O). On the one hand, this may be caused by a nonoptimal value of $\kappa$, as shown in Table 3; on the other hand, it may result from inadequate training images. Therefore, we set $\kappa=0$ and $\lambda=1$ for the remaining experiments.

Table 2

The mAP of classification when setting different κ and λ in Eq. (1). C: κ=λ=0 (only classification). C + B: κ=0,λ=1 (only classification and box regression). C + O: κ=1,λ=0 (only classification and pose estimation). C + B + O: κ=λ=1 (all are included).

Method        | Average | Airfreighter_1 | Bomber_1 | Fighter | AEW  | Airfreighter_2 | Bomber_2
C (%)         | 89.7    | 90.3           | 88.4     | 88.6    | 90.7 | 89.5           | 90.6
C + B (%)     | 91.9    | 90.4           | 91.9     | 89.1    | 98.7 | 90.3           | 90.9
C + O (%)     | 88.7    | 90.2           | 86.8     | 86.3    | 95.7 | 89.6           | 83.8
C + B + O (%) | 90.0    | 90.3           | 89.1     | 88.8    | 96.2 | 89.7           | 86.1

The method with the best performance is highlighted in boldface.

Table 3

The mAP of pose estimation and classification with different κ in Eq. (1). We set λ to 1.

κ       | 1    | 0.5  | 0.3  | 0.1
Dir (%) | 98.6 | 98.4 | 97.5 | 91.3
Cls (%) | 90.0 | 90.1 | 90.6 | 90.8

In Fig. 1, the 1×1 convolutional layer before the RoI pooling layer is used to reduce dimensionality before the expensive FC layers. It reduces the number of channels from 512 to 256. This removes a large number of parameters and yields better generalization, as shown in Table 4. In addition to reducing parameters, the added convolutional layer is also helpful for transfer learning. In our framework, the deep CNN is initialized with a model pretrained on natural images, whereas the subsequent layers are initialized from scratch for aircraft recognition on satellite images. Therefore, the added convolutional layer can be viewed as an intermediate layer between the feature extractor and the aircraft detector that enables transfer learning.

Table 4

The mAP of classification with/without 1×1 convolutional layer before the RoI pooling layer in Fig. 1.

Conv    | Average | Airfreighter_1 | Bomber_1 | Fighter | AEW  | Airfreighter_2 | Bomber_2
Yes (%) | 91.9    | 90.4           | 91.9     | 89.1    | 98.7 | 90.3           | 90.9
No (%)  | 91.4    | 90.3           | 89.1     | 89.5    | 98.4 | 90.5           | 90.5

The method with the best performance is highlighted in boldface.

Adding gradient noise is also very beneficial for training our model. As shown in Table 5, we achieve the best result when η=0.01. The result with η=0.3 is even worse than without gradient noise, possibly because the added noise is too large, which leads to numerical instability.

Table 5

The mAP of classification when setting η=(0,0.01,0.3) in Eq. (4). We do not add gradient noise when setting η=0.

η    | Average | Airfreighter_1 | Bomber_1 | Fighter | AEW  | Airfreighter_2 | Bomber_2
0    | 91.1    | 90.3           | 86.9     | 90.1    | 99.0 | 90.3           | 90.2
0.01 | 91.9    | 90.4           | 91.9     | 89.1    | 98.7 | 90.3           | 90.9
0.3  | 81.5    | 88.8           | 83.1     | 83.3    | 69.8 | 86.7           | 77.4

The method with the best performance is highlighted in boldface.

4. Conclusion

In this paper, we have introduced a unified DeepPlane network for aircraft detection and recognition. The main contributions of our work are twofold. First, a joint DeepPlane network consisting of the RPN and the multitask learning network is proposed. The proposed DeepPlane network has three properties: end-to-end training, multitask learning, and easy generalization. Second, an “inception” module is introduced to bridge the gap between the large number of CNN parameters and the insufficient number of aircraft training images. Several regularization methods, such as dropout, dataset augmentation, and gradient noise addition, are also employed to prevent overfitting. Experiments on our collected aircraft dataset demonstrate that integrating detection with classification yields higher accuracy. In practice, the CNN takes a large region as input and encodes it into a fixed-length feature vector, so some loss of detailed information is inevitable in the proposed model. Our future work will concentrate on taking features from different levels so that we can integrate detailed and contextual information effectively to further improve performance.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grants Nos. 91646207, 91438105, 91338202, 61403376, and 61375024.

References

1. P. F. Felzenszwalb et al., “Object detection with discriminatively trained part-based models,” IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1627–1645 (2010).

2. G. Liu et al., “Aircraft recognition in high-resolution satellite images using coarse-to-fine shape prior,” IEEE Geosci. Remote Sens. Lett. 10(3), 573–577 (2013). http://dx.doi.org/10.1109/LGRS.2012.2214022

3. G. Cheng et al., “Object detection in remote sensing imagery using a discriminatively trained mixture model,” ISPRS J. Photogramm. Remote Sens. 85, 32–43 (2013). http://dx.doi.org/10.1016/j.isprsjprs.2013.08.001

4. N. Yokoya and A. Iwasaki, “Object detection based on sparse representation and Hough voting for optical remote sensing imagery,” IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 8(5), 2053–2062 (2015). http://dx.doi.org/10.1109/JSTARS.2015.2404578

5. G. J. Scott et al., “Entropy-balanced bitmap tree for shape-based object retrieval from large-scale satellite imagery databases,” IEEE Trans. Geosci. Remote Sens. 49(5), 1603–1616 (2011). http://dx.doi.org/10.1109/TGRS.2010.2088404

6. L. Zhang et al., “A multifeature tensor for remote-sensing target recognition,” IEEE Geosci. Remote Sens. Lett. 8(2), 374–378 (2011). http://dx.doi.org/10.1109/LGRS.2010.2077272

7. W. Liu et al., “SSD: single shot multibox detector,” in European Conf. on Computer Vision, pp. 21–37 (2016).

8. J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015). http://dx.doi.org/10.1109/CVPR.2015.7298965

9. X. Chen et al., “Aircraft detection by deep belief nets,” in IEEE 2nd IAPR Asian Conf. on Pattern Recognition, pp. 54–58 (2013). http://dx.doi.org/10.1109/ACPR.2013.5

10. H. Wu et al., “Fast aircraft detection in satellite images based on convolutional neural networks,” in IEEE Int. Conf. on Image Processing, pp. 4210–4214 (2015). http://dx.doi.org/10.1109/ICIP.2015.7351599

11. S. Ren et al., “Faster R-CNN: towards real-time object detection with region proposal networks,” IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2017). http://dx.doi.org/10.1109/TPAMI.2016.2577031

12. C. Szegedy et al., “Going deeper with convolutions,” in IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1–9 (2015). http://dx.doi.org/10.1109/CVPR.2015.7298594

13. J. Deng et al., “ImageNet: a large-scale hierarchical image database,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR ’09), pp. 248–255 (2009). http://dx.doi.org/10.1109/CVPR.2009.5206848

14. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556 (2014).

15. J. R. Uijlings et al., “Selective search for object recognition,” Int. J. Comput. Vision 104(2), 154–171 (2013). http://dx.doi.org/10.1007/s11263-013-0620-5

16. C. L. Zitnick and P. Dollár, “Edge boxes: locating object proposals from edges,” in European Conf. on Computer Vision, pp. 391–405 (2014).

17. R. Girshick, “Fast R-CNN,” in Int. Conf. on Computer Vision (2015). http://dx.doi.org/10.1109/ICCV.2015.169

18. A. Neelakantan et al., “Adding gradient noise improves learning for very deep networks,” arXiv:1511.06807 (2015).

19. Y. Jia et al., “Caffe: convolutional architecture for fast feature embedding,” in 22nd ACM Conf. on Multimedia, pp. 675–678 (2014).

20. Google, “Google Earth,” http://www.google.cn/intl/zh-CN/earth/ (2015).

21. H. Wang et al., “High resolution aircraft dataset 2016 (HRAD 2016),” https://sites.google.com/site/hrad2016whz/ (2017).

22. M.-M. Cheng et al., “BING: binarized normed gradients for objectness estimation at 300 fps,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, pp. 3286–3293 (2014).

Biography

Hongzhen Wang received his BS degree from the Department of Automation, Institute of Engineering, Ocean University of China, Qingdao, Shandong, China, in 2013. He is currently pursuing his PhD degree at the Institute of Automation, Chinese Academy of Sciences, Beijing. His research interests include object detection, semantic segmentation, and deep learning.

Yongchao Gong received his BS degree in automation from the School of Information Science and Technology, University of Science and Technology of China, Hefei, Anhui, China, in 2012. He is currently pursuing his PhD at the Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include pattern recognition, machine learning, and image processing.

Ying Wang received his BS and MS degrees from Nanjing University of Information Science and Technology, China, in 2005 and 2008, respectively, and received his PhD from the Institute of Automation, Chinese Academy of Sciences, China, in 2012. He is currently an associate professor at the Institute of Automation of Chinese Academy of Sciences. His research interests include computer vision, pattern recognition, and remote sensing.

Lingfeng Wang received his BS degree in computer science from Wuhan University, Wuhan, China, in 2007. He is currently an associate professor at the National Laboratory of Pattern Recognition of Institute of Automation, Chinese Academy of Sciences. His research interests include computer vision and image processing.

Chunhong Pan received his BS degree from Tsinghua University, Beijing, China, in 1987, his MS degree from Shanghai Institute of Optics and Fine Mechanics, Chinese Academy of Sciences, Beijing, in 1990, and his PhD from the Institute of Automation, Chinese Academy of Sciences, in 2000, where he is currently a professor at the National Laboratory of Pattern Recognition. His research interests include computer vision, image processing, computer graphics, and remote sensing.

© The Authors. Published by SPIE under a Creative Commons Attribution 3.0 Unported License. Distribution or reproduction of this work in whole or in part requires full attribution of the original publication, including its DOI.
Hongzhen Wang, Yongchao Gong, Ying Wang, Lingfeng Wang, Chunhong Pan, "DeepPlane: a unified deep model for aircraft detection and recognition in remote sensing images," Journal of Applied Remote Sensing 11(4), 042606 (1 September 2017). http://dx.doi.org/10.1117/1.JRS.11.042606
Submission: Received 21 February 2017; Accepted 3 August 2017