Using convolutional neural network to identify irregular segmentation objects from very high-resolution remote sensing imagery

Abstract. Convolutional neural networks (CNNs) have shown great success in computer vision tasks, but their application to land-use type classification within the context of object-based image analysis has rarely been explored, especially for the identification of irregular segmentation objects. Thus, a blocks-based object-based image classification (BOBIC) method was proposed to carry out end-to-end classification of segmentation objects using a CNN. Specifically, BOBIC takes advantage of the CNN to automatically extract complex features from the original image data, thereby avoiding the uncertainty caused by the manual extraction of features in object-based image classification (OBIC). Additionally, OBIC compensates for a shortcoming of CNNs, namely that it is difficult to delineate clear boundaries for ground objects at the pixel level. Using three high-resolution test images, the proposed BOBIC was compared with support vector machine (SVM) and random forest (RF) classifiers, and then the effects of image blocks and mixed objects on classification accuracy were evaluated for the proposed BOBIC. Compared with conventional SVM and RF classifiers, the inclusion of the CNN improved the OBIC classification performance substantially (5% to 10% increases in overall accuracy), and it also alleviated the effect of mixed objects.


Introduction
Object identification in very high-resolution (VHR) remote sensing imagery has always been a fundamental but challenging issue. In the past few decades, various methods for the identification of different types of objects have been proposed, including the template matching-based method, [1][2][3] knowledge-based method, [4][5][6] object-based image analysis (OBIA) method, [7][8][9] and machine learning-based method. 10,11 Among them, the OBIA method can be easily combined with geographical information system (GIS) techniques, which allows for more complete mapping of land-use types for GIS analyses. 12[13][14] The first step in OBIA is to segment the images into relatively homogeneous regions (segmentation objects), 15 and then the statistical information for the segmentation objects is employed for image analyses (e.g., object-based image classification, hereafter OBIC). Compared with pixels, the segmented objects not only exhibit rich spectral and textural features but also provide shape and contextual information, 16 which can improve the classification performance for various types of objects.
However, the sharp increase in the number of features for each segmentation object renders the determination of optimal features an uncertain or subjective process. For example, Weston et al. 17 and Guyon and Elisseeff 18 pointed out that reducing feature dimensions could improve support vector machine (SVM) classification accuracy, whereas Melgani and Bruzzone 19 and Pal and Mather 20 deemed that SVM was insensitive to the number of data dimensions. Likewise, Duro et al. 21 found that feature selection could improve the classification performance of the random forest (RF) classifier, 22 whereas Ma et al. 23 deemed that RF was a relatively stable classification model, as they found no significant differences among its classification accuracies irrespective of the use of feature selection. Presently, the feature selection process always introduces uncertainty into OBIC with traditional classification models. Emerging deep learning 24 methods are known for their ability to carry out automatic feature extraction on raw data, and therefore such methods could potentially be used to optimize the process of feature extraction and selection in OBIC. However, deep learning methods have not been extensively tested in land-use type classifications, especially within the framework of OBIA.
Since deep learning was proposed, 24 it has received extensive attention from many scholars because it can automatically generate complex and abstract high-level features in a hierarchical manner. 25 High-level features have proven to be highly effective in representing complex objects (e.g., high-resolution images). 26 The convolutional neural network (CNN) is one of the most rapidly developing algorithms in deep learning and was specially designed for image classification tasks. 27,28 Images serve as the input at the lowest layer in the CNN's hierarchical structure, and each layer obtains the features of the upper layer through a convolution filter. 29 Moreover, with increased hierarchical depth, features become more and more robust and complex. This allows for salient features of translation-, scaling-, and rotation-invariant data to be obtained. 30 However, a major drawback is that the input of the CNN framework must be image blocks of a fixed size. This poses a certain challenge in terms of combining CNN with object-based remote sensing image classification because the minimum processing unit of OBIA is usually an irregular segmentation object.
Despite the above problems, the continuing success of CNN in the field of image recognition 31,32 has motivated researchers in the remote sensing community to investigate its potential for OBIA. Guirado et al. 33 compared state-of-the-art OBIA methods with CNN-based methods for the detection of plant species of conservation concern and reasoned that adopting the CNN-based methods could further improve OBIA methods. Zhao et al. 34 proposed a two-step OBIC framework using a combination of handcrafted and deep CNN features. In their work, however, CNN only served as a feature descriptor of segmentation objects, which leaves the process of feature selection in OBIA uncertain. Liu et al. 35 implemented end-to-end classifications of wetland land cover under the OBIA framework and tested the classification performance of the model using different training samples. However, their work did not systematically assess the geometric relationship of the irregular segmentation objects to the input image blocks of the CNN; it only focused on the identification of wetland land cover. All of the above studies show that the CNN can effectively improve the OBIC classification performance in specific contexts, so work is urgently needed to systematically evaluate the feasibility of classifying irregular segmentation objects using CNN.
In a similar way, this paper considers that including CNN in an OBIA framework could take advantage of the benefits of both methods, e.g., OBIA segmentation to delineate homogeneous areas and CNN for classification. Hence, a blocks-based object-based image classification (BOBIC) method is proposed to combine OBIA with CNN. In this work, the multiresolution segmentation (MRS) algorithm was employed to generate highly irregular segmentation objects. 36 Image blocks were subsequently generated according to the center of gravity (CG) of the segmentation objects, thereby combining irregular objects with the CNN. Furthermore, the differences between this method and conventional classifiers were compared systematically at three study sites, and the effects of segmentation object shape and mixed objects on the classification accuracy were also analyzed. The remaining parts of this paper are organized as follows: Sec. 2 introduces the three study sites that were used in the experiments. Section 3 elaborates on how to apply CNN in OBIA and the experimental procedures used in this paper. The experimental results are presented in Sec. 4, and Sec. 5 contains a discussion of the experimental results. Finally, Sec. 6 summarizes the entire paper.

Study Area
In this work, unmanned aerial vehicle (UAV) images and International Society for Photogrammetry and Remote Sensing (ISPRS) standard datasets corresponding to agricultural areas and urban areas, respectively, were employed for the experiments. Images for study site 1 were sourced from the high-resolution image acquisition project in Deyang City, Sichuan Province, China. 37 This project adopted a fixed-wing UAV equipped with a Canon EOS 5D Mark II digital camera. At 80% heading overlap and 60% side overlap and with an average flight altitude of 750 m, the UAV captured raw image data for the built-up area and suburban area of Deyang City with a total area of 400 km² in August 2011. A digital orthophoto map (DOM) with a resolution of 0.2 m was then obtained using digital photogrammetric techniques. In this work, a standard-sized UAV DOM (500 m × 500 m) [Fig. 1(a)] was randomly selected, where crop (41%), woodland (46%), buildings (6%), roads (2%), and bareland (5%) were distributed. Study sites 2 and 3 employed the Vaihingen and Potsdam datasets provided by the ISPRS Commission III, respectively. These datasets can be downloaded freely from the ISPRS website. 38
The Vaihingen dataset contains a total of 33 aerial images of varying sizes (average size of 2494 pixels × 2064 pixels), 16 of which also have visually interpreted reference (labeled) polygons, and the spatial resolution of each aerial image is 9 cm. In this work, one image (region 26) was randomly selected from the 16 visually interpreted images for study site 2 [Fig. 1(c)], where buildings (42%), woodland (29%), water (12%), cars (3%), and grass (14%) were distributed. The Potsdam dataset comprises a total of 38 aerial images (each 6000 pixels × 6000 pixels in size), 24 of which have visually interpreted reference polygons, and the spatial resolution of each aerial image is 5 cm. Likewise, one image (region 07_12) was randomly selected from the 24 visually interpreted images for study site 3 [Fig. 1(e)], where buildings (69%), woodland (9%), bareland (3%), cars (4%), and grass (15%) were distributed. Images of the three study sites and their corresponding visually interpreted polygon layers are shown in Fig. 1.

Methods
As mentioned in Sec. 1, traditional OBIA methods require a large number of image features to be empirically designed, which is time-consuming and often fails to lead to accurate representations. In contrast to traditional methods, the CNN can perform automatic feature extraction on raw images, and deep features extracted by the CNN are generally effective for complex image pattern descriptions. 31,32 However, CNNs often fail to capture the precise contours of real-world objects in the images and suffer from the "pepper and salt" effect because the output features of CNNs are highly abstract. Thus, it is natural to consider that including CNN in an OBIA framework can take advantage of the benefits of both methods, i.e., CNN for object classification and OBIA segmentation to provide accurate edge realizations. However, the CNN framework requires fixed-sized image blocks as input, which limits its development in the OBIA framework. In consideration of this issue, we propose a BOBIC method to classify irregular segmentation objects using CNN. Figure 2 summarizes the technical roadmaps of OBIC and the proposed BOBIC.
Fig. 2 Flowchart of the comparison between the OBIC and BOBIC methods.
As shown in Fig. 2, OBIA involves two steps, namely image segmentation and object classification. The proposed BOBIC method applies CNN to the object classification step so as to improve the OBIC method. Therefore, image segmentation is the common step of these two methods, and it is described in detail in Sec. 3.1. Object classification is divided into two parts, namely OBIC (Sec. 3.2) and BOBIC (Sec. 3.3). The object classification process of the traditional OBIC method mainly includes the following two steps: feature calculation and selection (Sec. 3.2.1) and classifier selection (Sec. 3.2.2). The proposed BOBIC method can automatically perform the feature calculation and selection of images using the CNN, but a unique image block must be generated for each segmentation object. The generation of image blocks for segmentation objects is elaborated on in Sec. 3.3.1, and Sec. 3.3.2 presents the structure of the CNN used in this paper. In addition, the sampling and accuracy assessment methods are described in Sec. 3.4.

Image Segmentation
MRS has been proven to be one of the more successful segmentation algorithms in OBIA. 42,43 In this paper, image segmentation was performed for the three study sites in a unified manner using MRS as implemented in eCognition 8.7 software (eCognition Software ® Definiens, 2011), 36 and irregular segmentation objects were subsequently generated. The following three parameters need to be set for the MRS: color/shape ratio, smoothness/compactness ratio, and segmentation scale parameter (SSP). The color/shape ratio defines what percentage of the homogeneity of spectral values is weighted against the homogeneity of shape. The smoothness/compactness ratio is used to determine the smoothness or compactness of each object. In this work, to give the spectral information a dominant role during segmentation, the color/shape ratio was set to 0.9/0.1. The smoothness/compactness ratio was set to 0.5/0.5 because we did not want to favor compact or noncompact segments.
The most important parameter for MRS is the SSP, which controls the internal heterogeneity of each object. Specifically, use of a small SSP results in smaller and more homogeneous objects, i.e., fewer pixels per object. However, using an overly small object size (i.e., over-segmentation) may affect the quality of the information extracted from each object 44 and increase the computational burden of the subsequent classification process. Conversely, an overly large SSP (i.e., under-segmentation) will produce objects containing multiple different classes (i.e., mixed objects 45). Automated identification/selection of the "appropriate" SSP(s) for segmentation (i.e., those which can minimize under- and over-segmentation) is still an active research topic. 16,46,47 In this research, two SSPs (50 and 110), selected based on visual analysis, were employed for image segmentation to enrich the experimental results. Additionally, if the area of the primary class encompassed by a segmentation object accounted for over 60% of the total area of this segmentation object, then the segmentation object was labeled with this class (otherwise the segmented object was left unlabeled). Here, the proportion of the primary class was set to 60% with reference to the research by Verbeeck et al. 48 and Ma et al. 23 The numbers of segmentation objects for various classes at the three study sites are shown in Table 1.
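The 60% labeling rule can be illustrated with a minimal Python sketch. It assumes that the reference class of every pixel inside one segmentation object is available as a flat list; the function name `label_object` and its arguments are hypothetical and are not taken from the original workflow, which relied on eCognition and the reference polygons.

```python
from collections import Counter


def label_object(pixel_classes, threshold=0.6):
    """Label one segmentation object from the reference classes of its pixels.

    Returns (label, proportion). The label is None (object left unlabeled)
    unless the primary class covers MORE than `threshold` of the object area.
    """
    counts = Counter(pixel_classes)
    if not counts:
        raise ValueError("the object contains no pixels")
    primary, n_primary = counts.most_common(1)[0]
    proportion = n_primary / sum(counts.values())
    label = primary if proportion > threshold else None
    return label, proportion


# An object with 70% crop pixels is labeled "crop"; one with exactly 60% is not.
print(label_object(["crop"] * 7 + ["road"] * 3))
print(label_object(["crop"] * 6 + ["road"] * 4))
```

The strict inequality mirrors the wording "over 60%"; whether the boundary case was included in the original experiments is not stated.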

Feature calculation
Features of segmentation objects need to be calculated to employ conventional OBIC algorithms (e.g., SVM or RF). In this paper, eCognition 8.7 software was adopted to calculate commonly used shape, textural, and spectral features. The shape features included the area, density, roundness, compactness, border index, shape index, main direction, elliptic fit, rectangular fit, and asymmetry; the textural features included the gray-level co-occurrence matrix (GLCM) entropy, GLCM std. dev., GLCM contrast, GLCM dissimilarity, GLCM homogeneity, GLCM mean, GLCM ang. 2nd moment, and GLCM correlation that were computed according to the GLCM 49,50 as well as the gray-level difference vector (GLDV) entropy, GLDV contrast, GLDV mean, and GLDV ang. 2nd moment that were derived from the GLDV; 51 the spectral features included the mean blue, mean green, mean red, max difference, standard deviation blue, standard deviation green, standard deviation red, and brightness. Considerable uncertainty exists concerning feature selection with regard to different classifiers. 52,53 Hence, feature selection was not performed for the above-mentioned features.
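To make the GLCM-based texture features more concrete, the following Python sketch computes four of them (contrast, homogeneity, angular second moment, and entropy) for a small quantized image. It is an illustration under stated assumptions, not a reproduction of the eCognition implementation: it uses a single horizontal offset of one pixel, a symmetric normalized matrix, and the natural logarithm for entropy, and the function name `glcm_features` is hypothetical.

```python
import math


def glcm_features(image, levels):
    """Texture features from a symmetric, normalized GLCM.

    image  : 2-D list of integer gray levels in range(levels)
    levels : number of gray levels after quantization
    Offset : one pixel to the right (assumption; other directions are possible).
    """
    glcm = [[0.0] * levels for _ in range(levels)]
    for row in image:
        for a, b in zip(row, row[1:]):
            glcm[a][b] += 1  # count the pair in both directions
            glcm[b][a] += 1  # so that the matrix is symmetric
    total = sum(map(sum, glcm))
    if total == 0:
        raise ValueError("image has no horizontal pixel pairs")

    contrast = homogeneity = asm = entropy = 0.0
    for i in range(levels):
        for j in range(levels):
            p = glcm[i][j] / total
            contrast += p * (i - j) ** 2
            homogeneity += p / (1 + (i - j) ** 2)
            asm += p * p
            if p > 0:
                entropy -= p * math.log(p)
    return {"contrast": contrast, "homogeneity": homogeneity,
            "asm": asm, "entropy": entropy}


# Two uniform rows: no gray-level change along the offset, so contrast is zero.
print(glcm_features([[0, 0], [1, 1]], levels=2))
```

In an object-based setting, the pixel pairs would be restricted to those inside one segmentation object rather than taken from a rectangular image.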

Selection of conventional classifiers
In this work, SVM and RF classifiers were utilized to classify the features extracted in Sec. 3.2.1. The SVM used the LIBSVM library developed by Chang and Lin, 58 and we employed the radial basis function (RBF) 59 as its kernel function. The RBF kernel involves a penalty parameter C and a kernel parameter γ. The cross-validation accuracy of each parameter combination was tested using the grid-search method, and the combination with the highest cross-validation accuracy was adopted as the penalty parameter and kernel parameter. The RF classifier used the "randomForest" package in the R language. Roughly speaking, constructing an RF classifier requires the following two parameters: (1) n, the number of features considered when each decision tree is constructed, and (2) k, the total number of decision trees. Based on the results obtained by Rodriguez-Galiano et al., 60 k was set to 479, and n was set to a single randomly selected split variable; the intent was to reduce the generalization error and the correlation between trees and to prevent over-fitting in the classification process as much as possible.

Generation of image blocks for segmentation objects
Image blocks of a fixed size have to be generated for each segmentation object to use CNN in OBIC. The size of an image block is constrained by the depth of the CNN and the capacity of computer memory. 61 In the subsequent experiments of this work, supervised classification tests were conducted mainly using a small sample size, so an ultra-large-scale CNN framework could not be adopted. Hence, 32 × 32 and 64 × 64 pixels were selected as the image block sizes. In addition, the CG of the segmentation object served as the center point of the image block. Each segmentation object corresponded to one unique image block, and each image block was assigned the class of its corresponding segmentation object. Figure 3 shows a schematic of the generation of image blocks for irregular segmentation objects, where black lines denote the boundaries of irregular segmentation objects, red cross-points represent the CGs of irregular segmentation objects, and red square boxes indicate the range of a sampled image block.
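The block generation step can be sketched in a few lines of Python. The sketch assumes a single-band image and a Boolean object mask stored as nested lists; the function names are hypothetical. The handling of objects near the image border (shifting the block so that it stays inside the image) is an assumption, since the text does not specify it.

```python
import math


def center_of_gravity(mask):
    """Mean (row, col) of the pixels that belong to the object."""
    rows = cols = count = 0
    for r, line in enumerate(mask):
        for c, inside in enumerate(line):
            if inside:
                rows, cols, count = rows + r, cols + c, count + 1
    if count == 0:
        raise ValueError("empty object mask")
    return rows / count, cols / count


def extract_block(image, mask, size):
    """Crop a size x size block centered on the object's CG."""
    height, width = len(image), len(image[0])
    if size > height or size > width:
        raise ValueError("block is larger than the image")
    cg_row, cg_col = center_of_gravity(mask)
    top = math.floor(cg_row + 0.5) - size // 2
    left = math.floor(cg_col + 0.5) - size // 2
    # Assumption: near the border, shift the block back inside the image.
    top = min(max(top, 0), height - size)
    left = min(max(left, 0), width - size)
    return [row[left:left + size] for row in image[top:top + size]]


image = [[r * 8 + c for c in range(8)] for r in range(8)]
mask = [[3 <= r <= 5 and 3 <= c <= 5 for c in range(8)] for r in range(8)]
block = extract_block(image, mask, size=4)
print(block[0][0], block[3][3])
```

For multiband imagery the same row and column slices would be applied to every band, with `size` set to 32 or 64.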
It can be seen from Fig. 3 that the CG of a convex polygon, in most cases, fell inside the polygon. However, for a nonconvex polygon, the CG exhibited a certain shift. This presented a challenge for the application of the proposed BOBIC method. Hence, we summarized in detail the geometric relationship of irregular segmentation objects to the input image blocks of the CNN. First, when the CG of a segmentation object fell within the segmentation object, there existed a total of the following three situations:
1. The CG fell within the segmentation object, where the image block entirely encompassed the segmentation object.
2. The CG fell within the segmentation object, where the segmentation object entirely encompassed the image block.
3. The CG fell within the segmentation object, where the image block encompassed a portion of the segmentation object.
Second, when the CG fell outside a segmentation object, it was impossible for the segmentation object to encompass the image block. In addition, the CG could fall within either a segmentation object of the same type or a segmentation object of a different type. In the former case, no difference existed from situations 1 and 3. This was because the center point of the image block still fell on land cover of the same type, and the class of the segmentation object that corresponded to the image block remained unchanged. Hence, this case was not listed separately, i.e., the case where the CG fell within land cover of the same type was included in situations 1 and 3 correspondingly. The remaining situations were as follows:
4. The CG fell outside the segmentation object and inside a segmentation object of a different type, where the image block entirely encompassed the segmentation object.
5. The CG fell outside the segmentation object and inside a segmentation object of a different type, where the image block encompassed a portion of the segmentation object.
The above five situations with different types of land covers are shown in Fig. 4.
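The geometric tests that distinguish these situations can be sketched as follows. For one object mask and one block size, the sketch reports whether the CG falls inside the object, whether the block contains the whole object, and whether the object contains the whole block. The function name `describe_geometry` is hypothetical, and deciding whether an outside CG lands on the same or a different land-cover type would additionally require a class map, which is not modeled here.

```python
import math


def describe_geometry(mask, size):
    """Relate an object (2-D Boolean mask) to its CG-centered size x size block."""
    pixels = [(r, c) for r, line in enumerate(mask)
              for c, inside in enumerate(line) if inside]
    if not pixels:
        raise ValueError("empty object mask")
    cg_row = math.floor(sum(r for r, _ in pixels) / len(pixels) + 0.5)
    cg_col = math.floor(sum(c for _, c in pixels) / len(pixels) + 0.5)
    top, left = cg_row - size // 2, cg_col - size // 2
    in_block = sum(1 for r, c in pixels
                   if top <= r < top + size and left <= c < left + size)
    return {
        "cg_inside_object": bool(mask[cg_row][cg_col]),
        "block_contains_object": in_block == len(pixels),
        "object_contains_block": in_block == size * size,
    }


# A ring-shaped (nonconvex) object: its CG falls in the hole, outside the object.
ring = [[r in (0, 4) or c in (0, 4) for c in range(5)] for r in range(5)]
print(describe_geometry(ring, size=4))
```

The ring example reproduces the shift of the CG described for nonconvex polygons: the block is centered on pixels that do not belong to the object at all.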

Convolutional neural network
The CNN consisted mainly of three different types of hierarchical structures, specifically, convolution layers, pooling layers, and fully connected layers. Convolution layers, also known as feature extraction layers, constitute the primary layers of the CNN architecture. The input of convolution layers comprises a set of two-dimensional (2-D) feature maps of a fixed size. In the convolution phase, a trainable filter W (convolution kernel) performs the convolution operation by using a sliding window technique. 62,63 Assume the convolution kernel is i × j in size; then the output feature map Y that corresponds to the input feature map X can be written as follows:
$$Y_{m,n} = f\left(\sum_{u=1}^{i}\sum_{v=1}^{j} W_{u,v}\,X_{m+u-1,\,n+v-1} + b\right), \tag{1}$$
where m, n denote the row and column number of a hidden neuron in the 2-D feature map, u, v index the rows and columns of the convolution kernel, b is a trainable bias parameter, and f represents the particular nonlinear activation function.
Pooling layers are down-sampling layers in the CNN architecture, which can enhance the spatial-invariance property of the convolutional architecture. 64 A down-sampling operation is normally performed for each 2-D feature map through max pooling. 65 The max pooling operation computes the maximum value of the neurons within a local region, which is expressed as
$$Y = \max_{1 \le m \le i,\; 1 \le n \le j} X_{m,n}, \tag{2}$$
where (i, j) denotes the size of the local region X, m, n represent the row and column number of a neuron inside the local region, and Y is the output of the max pooling operation. Fully connected layers generally constitute the last few layers of the CNN architecture, which accept all neurons in a 2-D feature map and connect them to one-dimensional neurons. With regard to a multiclass problem, the number of neurons in the last fully connected layer equals the number of classes for the final classification. In addition, the last fully connected layer is normally followed by the Softmax layer, 66 which can be used to obtain the discrimination probability for each class. The equation is given as
$$Y_i = \frac{\exp(X_i)}{\sum_{j=1}^{k} \exp(X_j)}, \tag{3}$$
where X_i denotes the output of class i in the last fully connected layer, k is the number of classes, and Y_i represents the discrimination probability for class i. In this work, the architecture of VGG-Net 67 was used as a reference. End-to-end training was performed for the image blocks of segmentation objects using the CNN architecture shown in Fig. 5.
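The three layer operations can be illustrated with a small, dependency-free Python sketch. It assumes the usual definitions: a weighted sum over the kernel window plus a bias passed through the activation f (implemented, as in most CNN libraries, without flipping the kernel), the maximum over non-overlapping local regions, and normalized exponentials for Softmax. The function names are illustrative only, and no padding is applied.

```python
import math


def relu(value):
    return max(0.0, value)


def conv2d(x, w, b, f=relu):
    """Slide kernel w over feature map x with stride 1 and no padding."""
    i, j = len(w), len(w[0])
    out_h, out_w = len(x) - i + 1, len(x[0]) - j + 1
    return [[f(sum(w[u][v] * x[m + u][n + v]
                   for u in range(i) for v in range(j)) + b)
             for n in range(out_w)]
            for m in range(out_h)]


def max_pool(x, i=2, j=2):
    """Maximum over non-overlapping i x j local regions."""
    return [[max(x[m + u][n + v] for u in range(i) for v in range(j))
             for n in range(0, len(x[0]) - j + 1, j)]
            for m in range(0, len(x) - i + 1, i)]


def softmax(x):
    """Class probabilities from the outputs of the last fully connected layer."""
    shift = max(x)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(conv2d(x, [[1, 0], [0, 1]], b=-7))
print(max_pool([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]))
print(softmax([0.0, 0.0]))
```

A real network applies many such kernels per layer and sums over all input feature maps; the sketch keeps a single input map and a single kernel to stay readable.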
The CNN architecture shown in Fig. 5 comprises four convolution layers (blue layers in Fig. 5). Each convolution layer used a 3 × 3 convolution kernel, and convolution operations were performed with stride 1 on the 2-D feature maps of the previous layer. The first two convolution layers produced 32-dimensional output, whereas the latter two generated 64-dimensional output. A rectified linear unit (ReLU) 25 can address the vanishing gradient problem well. 68,69 Therefore, ReLU was adopted as the activation function for each convolution layer. Every two convolution layers were followed by a 2 × 2 max pooling layer (red layers in Fig. 5). The first purple layer in Fig. 5 is a fully connected layer comprising 512 neurons, whereas the number of neurons in the last fully connected layer (the second purple layer in Fig. 5) was equal to the number of land-use types, which was 5 at each of the three study sites. Finally, the Softmax function was applied after the last fully connected layer, which generated the class output shown in green in Fig. 5.
Fig. 5 CNN architecture employed in this work. Image blocks that were generated in Sec. 3.3.1 served as the CNN input, and the output of the CNN comprised the classes of the segmentation objects that corresponded to the image blocks. Blue layers represent convolution layers (using ReLU as the activation function), red layers are pooling layers (using max pooling), purple layers denote fully connected layers, and green layers are Softmax layers.
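As a rough check on the size of this architecture, the following Python sketch traces the feature-map shapes and counts the trainable parameters for a given block size. Several details are assumptions because the text does not state them: three input bands, "same" padding in the convolution layers (switchable to no padding), and a pooling stride of 2. The function name `trace_shapes` is hypothetical.

```python
def trace_shapes(block_size, bands=3, n_classes=5, same_padding=True):
    """Return (layer trace, flattened length, trainable parameter count)."""
    size, channels, params = block_size, bands, 0
    trace = [("input", (size, size, channels))]
    for filters in (32, 64):           # two stages: 32 filters, then 64
        for _ in range(2):             # two 3 x 3 convolutions per stage
            if not same_padding:
                size -= 2              # an unpadded 3 x 3 kernel trims the border
            params += 3 * 3 * channels * filters + filters
            channels = filters
            trace.append(("conv3x3", (size, size, channels)))
        size //= 2                     # 2 x 2 max pooling, stride 2 (assumed)
        trace.append(("maxpool2x2", (size, size, channels)))
    flat = size * size * channels
    params += flat * 512 + 512         # first fully connected layer
    trace.append(("fc", (512,)))
    params += 512 * n_classes + n_classes
    trace.append(("fc+softmax", (n_classes,)))
    return trace, flat, params


for block in (32, 64):
    _, flat, params = trace_shapes(block)
    print(block, flat, params)
```

Under these assumptions most of the parameters sit in the first fully connected layer, and the 64 × 64 input roughly quadruples that layer's size relative to the 32 × 32 input.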
To avoid the risk of overfitting, 70 the following strategies were adopted in this work:
1. Employ the dropout technique after the pooling layers and fully connected layers. 71 The dropout technique aims to avoid co-adaptation of neurons during training. It randomly "shuts down" a given percentage of neurons during CNN training, thereby reducing the overfitting risk. In this work, the dropout percentage after the max pooling layer was set to 20%, and it was set to 50% after the fully connected layer.
2. Apply the early stopping technique, which monitors a certain value (normally the loss value). The CNN training stops when this value does not increase or decrease after multiple epochs. In this paper, the loss values of training samples were monitored. When these values were all <0.1 within 20 epochs, the CNN training was stopped.
3. Apply data augmentation, which can enlarge the training data without collecting additional training samples. The commonly used augmentation strategies include random image rotation, random image scaling, horizontal image shift, and noise injection. To maintain the high resolution of images, only random rotation was performed on the images.
In addition, all the weights in convolution layers and fully connected layers were initialized using the He normal distribution. 68 In this work, the CNN was trained from scratch using the end-to-end method.
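The early stopping criterion can be expressed as a short Python check. This is one possible reading of the rule, namely that training stops once the training loss has stayed below 0.1 for 20 consecutive epochs; the function name `should_stop` and its parameters are hypothetical.

```python
def should_stop(loss_history, threshold=0.1, patience=20):
    """True once the last `patience` training losses are all below `threshold`."""
    recent = loss_history[-patience:]
    return len(recent) == patience and all(loss < threshold for loss in recent)


# Five early epochs with high loss, then twenty epochs below the threshold.
history = [0.5] * 5 + [0.05] * 20
print(should_stop(history))
```

In a training loop, the check would be called at the end of every epoch with the list of training losses recorded so far.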

Sampling and Accuracy Evaluation
Regardless of whether the base unit of classification is a segmented object or an image block generated from the segmentation object, it makes no difference from the perspective of sampling. Hence, the random sampling method was adopted in the experiments. Proportions amounting to 10%, 20%, 30%, 40%, and 50% of the total number of segmentation objects in the three study sites were sampled as training sample sets, whereas the remaining samples served as test sample sets. The classification accuracy was derived by dividing the number of correctly classified segmentation objects in the test sample set by the total number of segmentation objects in the test sample set. Random sampling was repeated 20 times for each sampling ratio, and the classification accuracies of the 20 samplings were collected. Finally, the mean value and standard deviation of the classification accuracies were computed.
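The sampling and evaluation protocol can be sketched in Python as follows. The classifier is passed in as a callable so that the same loop serves SVM, RF, or CNN; `fit_predict` and the other names are hypothetical. The sketch reports the sample standard deviation, which is an assumption because the text does not say which estimator was used.

```python
import random
import statistics


def repeated_holdout(samples, labels, train_ratio, fit_predict,
                     repeats=20, seed=0):
    """Mean and standard deviation of test accuracy over random splits.

    fit_predict(train_x, train_y, test_x) must return predicted labels
    for test_x; it stands in for any of the classifiers.
    """
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    n_train = round(train_ratio * len(samples))
    accuracies = []
    for _ in range(repeats):
        rng.shuffle(indices)
        train, test = indices[:n_train], indices[n_train:]
        predicted = fit_predict([samples[i] for i in train],
                                [labels[i] for i in train],
                                [samples[i] for i in test])
        correct = sum(p == labels[i] for p, i in zip(predicted, test))
        accuracies.append(correct / len(test))
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def majority_class(train_x, train_y, test_x):
    """Toy classifier: always predict the most frequent training label."""
    winner = max(set(train_y), key=train_y.count)
    return [winner] * len(test_x)


samples = list(range(100))
labels = [0] * 80 + [1] * 20
print(repeated_holdout(samples, labels, 0.1, majority_class))
```

Running the loop once per sampling ratio (0.1 to 0.5) yields the kind of mean and standard deviation pairs reported in the accuracy tables.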
In addition, Welch's t-test 72 was used to test whether significant differences existed between two sets of data. Specifically, Welch's t-test was performed on the classification accuracies with respect to adjacent sampling ratios, thereby allowing us to assess whether significant differences existed in terms of the classification accuracies of adjacent sampling ratios. P-values were derived from Welch's t-test, and significant differences were deemed to exist between two sets of data when the p-value was <0.05.
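For reference, the Welch statistic and its approximate degrees of freedom (Welch-Satterthwaite) can be computed with the Python standard library as sketched below; the function name `welch_t` is illustrative. Converting the statistic to a p-value requires the cumulative distribution function of Student's t, which the standard library does not provide; `scipy.stats.ttest_ind(a, b, equal_var=False)` is one readily available option.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    var_a = statistics.variance(a) / len(a)  # squared standard error of mean(a)
    var_b = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (len(a) - 1)
                                 + var_b ** 2 / (len(b) - 1))
    return t, df


# Two small made-up samples with unequal variances.
t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(t, 4), round(df, 4))
```

In the setting described above, `a` and `b` would each hold the 20 accuracies obtained at two adjacent sampling ratios.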

Results
This section contains a complete description of the classification performance of the conventional OBIC method and the proposed BOBIC method. First, to test whether the BOBIC method could achieve higher land-use type classification accuracy than the OBIC methods, we compared the two methods at the three study sites using five sampling ratios and two different SSPs (results presented in Sec. 4.1). Second, as discussed in Sec. 3.3.1, the geometric relationships between the segmented objects and image blocks were complex. Often the image block did not entirely contain its corresponding segmented object, which presented a challenge during the application of the proposed method. Therefore, the classification error rates of the different geometric relationships were calculated in Sec. 4.2 to assess the influence of these geometric relationships on classification accuracy. In addition, mixed objects are a special but easily overlooked issue in the framework of OBIA. On the one hand, the classification accuracy of mixed objects tended to be lower because they often contained pixels belonging to several different land-use classes. On the other hand, the existence of mixed objects could not be avoided because of the limitations of current segmentation algorithms. Therefore, we calculated the classification accuracies of mixed and pure objects in Sec. 4.3 to evaluate the applicability of the proposed method to mixed objects.

Comparison of OBIC and BOBIC in Terms of the Classification Effect
Based on the sampling and accuracy evaluation methods described in Sec. 3.4, final classification results were obtained using the OBIC method and the BOBIC method, and these results are shown in Tables 2 and 3. The inputs to the SVM and RF classifiers were the features of the segmentation objects extracted in Sec. 3.2.1, which represents OBIC; the inputs to the CNN were the image blocks generated using the CGs of segmentation objects in Sec. 3.3.1, which represents the proposed BOBIC. Table 2 shows the mean value and standard deviation of the classification accuracies for 20 random samplings at five sampling ratios using four classification methods with a segmentation scale of 50. Table 3 shows the same statistics with a segmentation scale of 110.
According to the results shown in Tables 2 and 3, the following observations can be made. (1) The classification accuracies of the proposed BOBIC at the five sampling ratios were all superior to those of the OBIC method. (2) The classification accuracy of image blocks of 64 pixels × 64 pixels was obviously superior to that of image blocks of 32 pixels × 32 pixels. (3) The BOBIC method was characterized by better classification stability. The variance of its classification accuracies under corresponding sampling ratios remained less than that of the two conventional classifiers.
Based on the BOBIC experimental results presented in Tables 2 and 3, Welch's t-test was conducted for adjacent sampling ratios (Sec. 3.4), and these results are shown in Table 4. Reading down the columns of Tables 2 and 3 in combination with Table 4, when the sampling ratio increased from 10% to 20%, the classification accuracy of the BOBIC exhibited a marked increase (most of the p-values were <0.05). With regard to the remaining adjacent sampling ratios, the improvement in classification accuracy did not exhibit an obvious pattern.
Graphical representations of the classification performance for the three study sites were prepared from the best classification results of the 20 random samplings at a sampling ratio of 50% (Fig. 6). It can be observed from Fig. 6 that, compared with the OBIC method (SVM and RF), the classification performance of the proposed BOBIC was more "clear-cut," i.e., it overcame the so-called "pepper and salt" effect. Specifically, different land cover types were characterized by clearer boundaries, e.g., woodland, farmland, and barren land in study site 1; water bodies and buildings in study site 2; and buildings, woodland, barren land, and grassland in study site 3. In summary, the proposed BOBIC method improved the overall classification performance of the traditional OBIC.

Classification Effect of Different Geometric Relationships between Image Blocks and Segmented Objects
The geometric relationship between image blocks and segmentation objects forms an important part of the proposed BOBIC method. Therefore, this section provides a further statistical analysis of the five situations summarized in Sec. 3.3.1. Table 5 presents the number of segmentation objects in the three study sites under the different situations. With a sampling proportion of 50%, the classification error rates for each situation were calculated, as shown in Table 6.
The following could be clearly observed from Tables 5 and 6. (1) The probability of occurrence of situations 4 and 5 remained extremely low, but their error rates were very high. (2) The error rate of situation 2 remained very low; however, the number of training samples for situation 2 was very small. (3) Situations 1 and 3 accounted for the vast majority of the total number of segmentation objects, and the error rates of these two situations were close.

Effects of the BOBIC Method on the Classification of Mixed Objects
The effects of the BOBIC method on the classification of mixed objects are discussed in this section. The ratio of the area of the primary class in a segmentation object to the total area of the segmentation object [referred to as the primary class proportion (PCP)] was employed as an indicator of the degree of mixing of the segmentation objects. When the PCP was 100%, the segmentation object was a pure object. Lower PCP values reflect a greater degree of mixing. Statistics were then collected for the ratios of the sample sizes in the different PCP intervals to the total sample size, as shown in Fig. 7. Smaller SSP values were associated with more severe over-segmentation. Therefore, the number of pure objects with a segmentation scale of 50 was clearly larger than that with a scale of 110 in Fig. 7. In addition, as the degree of mixing decreased (i.e., as the PCP increased), the number of segmentation objects increased gradually.
We used the classification model with a sampling ratio of 10% described in Sec. 4.1 to classify all segmentation objects in the study area, and then computed the classification accuracies for the different PCP intervals. The sampling ratio of 10% was selected to minimize differences arising from the classifiers fitting the training samples to varying degrees. Figure 8 shows a combined line and column chart of the classification accuracies for the different PCP intervals at the three study sites.
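The PCP defined above and the per-interval accuracies can be sketched as follows. This is only an illustration under stated assumptions: each segmentation object is represented by the list of reference labels of its pixels, the interval edges (0.6, 0.8, 1.0) are hypothetical and not the ones used in Figs. 7 and 8, and both function names are introduced here.

```python
from bisect import bisect_right
from collections import Counter


def primary_class_proportion(pixel_labels):
    """Return (primary class, PCP) for one segmentation object.

    PCP is the share of the object's pixels that carry the most
    frequent reference label; a pure object has PCP == 1.0.
    """
    counts = Counter(pixel_labels)
    primary, n_primary = counts.most_common(1)[0]
    return primary, n_primary / len(pixel_labels)


def accuracy_by_pcp_interval(records, edges=(0.6, 0.8, 1.0)):
    """Accuracy per PCP interval from (pcp, is_correct) records.

    With the default (assumed) edges the bins are
    [0, 0.6), [0.6, 0.8), [0.8, 1.0) and pure objects (PCP == 1.0).
    Empty bins yield None.
    """
    hits = [0] * (len(edges) + 1)
    totals = [0] * (len(edges) + 1)
    for pcp, is_correct in records:
        i = bisect_right(edges, pcp)
        totals[i] += 1
        hits[i] += bool(is_correct)
    return [h / t if t else None for h, t in zip(hits, totals)]


# A mixed object: three road pixels and one building pixel.
print(primary_class_proportion(["road", "road", "road", "building"]))
```

Keeping pure objects in a bin of their own mirrors the distinction the text draws between pure and mixed objects.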
First, as observed from Fig. 8, the classification accuracies of the BOBIC method over the different PCP intervals were almost all superior to those of the SVM and RF classifiers, in particular for image blocks of 64 pixels × 64 pixels. Second, the proposed method improved the classification accuracy of mixed objects substantially. Moreover, as the degree of mixing increased (i.e., as the PCP decreased), the advantage of the BOBIC method became more obvious. Finally, the proposed method also performed better when classifying pure objects, in particular at a segmentation scale of 50.
The proposed BOBIC method exhibited better classification accuracy than the conventional OBIC method in the three study areas, and the results confirmed the feasibility of using the proposed method for land-use type classifications. We also found that the geometric relationship of image blocks to segmented objects was important for the proposed BOBIC method. This was because, in remote sensing images, segmented objects of different land cover types exhibit varying features. For example, individual segmented objects of vehicles normally had a small area, whereas segmentation objects of rural roads were generally strip-shaped. The irregular shapes of segmented objects resulted in situations where an image block of fixed size often encompassed only a portion of the segmented object, or was even enclosed by the segmented object. In our experimental results, situations 1, 2, and 3 accounted for the vast majority of the total number of segmented objects. Moreover, situations 2 and 3 did not exhibit a higher error rate than situation 1, which indicates that the classification accuracy of the CNN was not affected when the image block encompassed only a portion of the segmentation object. This finding further confirms the feasibility of using the proposed BOBIC method.
Furthermore, another key point of the proposed BOBIC method was that it improved the classification of mixed objects, which can be attributed to the way that it generates samples, i.e., by generating image blocks using the CGs of segmentation objects. First, the image block itself was a mixed object, which could substantially narrow the gap between mixed and pure segmentation objects. Second, because the CG was the center of mass of the object, the center point of the image block tended to fall on or approach the region of the primary class in the mixed object. Moreover, as the PCP became greater, this tendency became more pronounced. This advantage depends on the CNN: the complexity of VHR images can cause traditional human-dependent classification models to fail owing to the limited representation power of handcrafted features, 34 whereas the CNN can obtain class information from complex image blocks. It can be concluded that the proposed BOBIC method was successful at applying the CNN to OBIC, which also supports the hypothesis of Guirado et al. 33 that the inclusion of CNN models could further improve OBIA methods.
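The sample generation described above, i.e., a fixed-size image block centered on the CG of a segmentation object, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the object is assumed to be given as a binary mask (nested lists) of the same size as the image, the block size is assumed not to exceed the image size, and shifting the window to stay inside the image is one possible border treatment chosen here for simplicity.

```python
def center_of_gravity(mask):
    """Mean (row, col) of all pixels that belong to the object."""
    pts = [(r, c) for r, row in enumerate(mask)
           for c, v in enumerate(row) if v]
    n = len(pts)
    return sum(r for r, _ in pts) / n, sum(c for _, c in pts) / n


def block_window(mask, size):
    """Bounds (top, left, bottom, right) of a size x size block
    centered on the object's CG, shifted to stay inside the image."""
    rows, cols = len(mask), len(mask[0])
    cg_r, cg_c = center_of_gravity(mask)
    # Round the CG to the nearest pixel (coordinates are non-negative).
    r, c = int(cg_r + 0.5), int(cg_c + 0.5)
    top = min(max(r - size // 2, 0), rows - size)
    left = min(max(c - size // 2, 0), cols - size)
    return top, left, top + size, left + size


def crop(image, window):
    """Cut the block out of a 2-D image stored as nested lists."""
    top, left, bottom, right = window
    return [row[left:right] for row in image[top:bottom]]


# A small 2 x 3 object near the right edge of an 8 x 8 scene.
mask = [[0] * 8 for _ in range(8)]
for r in (2, 3):
    for c in (5, 6, 7):
        mask[r][c] = 1
print(block_window(mask, 4))
```

Because the window size is fixed, a small object is enclosed by its block while a large or elongated object is only partly covered, which corresponds to the different geometric situations discussed in the text.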
Finally, the proposed method has a disadvantage: in a few rare cases, the center point of an image block fell onto a different type of land cover (i.e., situations 4 and 5; in particular, for the road under situation 5, its image block represented not a road but a building). As discussed in Sec. 4.2, the error rates of situations 4 and 5 were very high, but these situations occurred very rarely. This was because the CG of the land cover on the outward side of a boundary line (in the direction opposite to the side where the center of curvature was located) fell within the land cover on the inward side of the boundary line (the side where the center of curvature was located) only if the boundary line between the two types of land cover exhibited a large curvature. Meanwhile, the CG of the land cover on the inward side of the boundary line still fell onto land cover of the same type. Even so, how to generate more appropriate image blocks for the segmented objects of situations 4 and 5 will be an important topic of our future work.
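The geometric cause of this drawback can be checked with a short sketch: for a strongly curved object, the CG may fall on a pixel that does not belong to the object. The binary-mask representation, the U-shaped toy object, and the function name `cg_inside_object` are assumptions made for illustration only.

```python
def cg_inside_object(mask):
    """True if the object's CG, rounded to the nearest pixel,
    falls on a pixel that belongs to the object itself."""
    pts = [(r, c) for r, row in enumerate(mask)
           for c, v in enumerate(row) if v]
    n = len(pts)
    cg_r = sum(r for r, _ in pts) / n
    cg_c = sum(c for _, c in pts) / n
    return bool(mask[int(cg_r + 0.5)][int(cg_c + 0.5)])


# U-shaped object (e.g., a road bending around another land cover):
# its CG lies in the enclosed area, which is not part of the object.
u_shape = [
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]
# Compact object: the CG falls on the object itself.
square = [[1, 1, 1] for _ in range(3)]

print(cg_inside_object(u_shape), cg_inside_object(square))
```

Such a check could be used to flag objects whose image blocks are likely to be centered on a different land cover type; how to handle the flagged objects is left open in the text.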

Conclusions
In this work, a blocks-based OBIC (BOBIC) method was proposed for applying a CNN to OBIC. Compared with traditional classification methods, the proposed method utilizes the ability of the CNN to automatically extract high-level features, thereby achieving end-to-end classification for irregular segmentation objects within the framework of OBIA. To evaluate the feasibility of the proposed BOBIC method, we systematically summarized the geometric relationships of segmented objects to image blocks and tested the method at three study sites using two segmentation scales and two image block sizes. Experimental results showed that the BOBIC method could substantially improve the OBIC classification performance and alleviate the effect of mixed objects. However, the proposed method has a drawback in that erroneous samples could be generated when the boundary line between two types of land cover exhibited a large curvature, which will be the focus of our future research. In summary, the proposed BOBIC exhibited better classification performance than the OBIC. Moreover, this approach reduced the uncertainty associated with OBIA during classification, which mainly comprises the uncertainty in feature selection and that arising from mixed objects.

Fig. 1 Images of the study sites in this work and their corresponding reference layers. (a), (c), and (e) The images of the three study sites; (b), (d), and (f) the corresponding reference (labeled) layers of the three study sites.

Fig. 3 Schematic generation of image blocks for irregular segmentation objects. (Black lines denote the segmentation boundaries of irregular segmentation objects, red cross points represent the CG of irregular segmentation objects, and red square boxes indicate the range of a sampled image block.)

Fig. 4 Five situations amongst image blocks and segmentation objects. ("-" indicates that this situation does not exist with respect to the current land cover class, image blocks are enveloped by bright blue dotted boxes, bright green solid boxes depict the range of segmentation objects, and red points are the CG of segmentation objects.)

Fig. 6 Graphical representations of the classification performance for different study sites using a sampling ratio of 50%. [For the different study sites, (a) is the vector graph of the fully correct classification, (b) is the vector graph classified by using the SVM classifier, (c) is the vector graph classified by using the RF classifier, (d) is the vector graph of the BOBIC with an image block size of 32 pixels × 32 pixels, and (e) is the vector graph of the BOBIC with an image block size of 64 pixels × 64 pixels.]

Fig. 7 The ratios of the segmentation object quantity for the different intervals of PCPs in the three study sites to the total quantity of segmentation objects. (The PCP represents the ratio of the area of the primary class in a segmentation object to the total area of the segmentation object.)

Table 1
The number of segmentation objects for various land-use types at three study sites; data were derived using segmentation scales of 50 and 110.

Table 2
The mean value and standard deviation of classification accuracies for 20-time random samplings based on different sampling ratios with a segmentation scale of 50 for three study sites.

Table 3
The mean value and standard deviation of classification accuracies for 20-time random samplings based on different sampling ratios with a segmentation scale of 110 for three study sites.

Table 4
Welch's t-test results for the BOBIC with respect to adjacent sampling ratios. A p-value <0.05 indicates that a significant difference exists between the two sets of data.

Fu et al.: Using convolutional neural network to identify irregular segmentation objects. . .

Table 5
Number of segmentation objects under different situations.

Table 6
Classification error rates of segmentation objects under different situations.