Image segmentation is a fundamental problem in the field of computer vision. So far, abundant research has been published on this topic;1–5 however, segmenting complete foreground objects that are not uniform in color or texture remains a challenging task. In addition to local low-level image features, such as color, texture, and spatial position, an increasing number of studies focus on segmenting images using high-level visual information.
Cosegmentation methods suggested by Refs. 6–11 employ foreground correspondence and jointly segment objects that have similar characteristics in a set of images. Rother et al.7 utilized histogram matching and a modified Markov random field (MRF) framework formed by the difference of foreground region histograms. Sun et al.8 constructed an MRF framework, which reflected camera flash illumination changes in order to extract the foreground from the background. Kim et al.9 proposed a hierarchical framework for dividing a large image set into multiple subsets in order to perform segmentation by cosegmenting each subset separately with interimage connections. Inspired by the characteristic of linear anisotropic heat diffusion, Kim et al.10 suggested a cosegmentation model, in which the finite heat sources of temperature maximization corresponded to the maximized segmentation confidence. In Ref. 11, segmentation was modeled by an energy-minimization function, which combined local appearance and spatial consistency. However, most existing studies on cosegmentation handled only multiple images with common objects, and different, irregularly appearing objects were rarely addressed.
In recent years, semantic segmentation, which aims at assigning a semantic label to each pixel of a given image,12–18 has become a subject of intense investigation in the field of computer vision. In particular, deep neural networks have recently played an important role in semantic segmentation. The segmentation accuracy has been greatly improved by applying deep learning techniques,19–22 provided that a large dataset is collected to train the network.
Semantic information, such as high-level visual information, can provide an important cue for the segmentation of a complete foreground from the image. In this study, inspired by semantic segmentation methods, we propose a segmentation mechanism for achieving a complete and accurate foreground boundary. Following nonparametric methods, the initial semantic labels were obtained by maximizing the normalized label likelihood score.23,24 Then, the foreground and background semantic descriptors were defined according to the initial semantic labels. With the aid of the two semantic descriptors, a subset of training images with a foreground similar to that of the input image was obtained. Subsequently, the semantic labels were further refined via object affinity and a semantic codebook. Finally, image segmentation was achieved by means of semantic labeling. For postprocessing, we adopted the Grab-Cut method25 and used it to merge separate regions.
The remainder of this paper is organized as follows: Sec. 2 describes initial semantic labeling using the nonparametric method; Sec. 3 describes our image segmentation scheme via foreground and background semantic descriptors; and the experimental results are presented in Sec. 4.
Initial Semantic Labels Acquisition
Inspired by the nonparametric method,23,24,26 the initial semantic labels of the input image can be acquired as follows: in the beginning, an image subset is obtained from the training set by applying the global GIST feature descriptor, such that the image subset contains the most scenes similar to the input image.
The GIST descriptor summarizes the gradient information for local regions of an image, which provides a rough description of the scene. A GIST descriptor of the scene refers to the meaningful information that an observer can identify from a glimpse at the scene.27 The GIST can be represented at both perceptual and conceptual levels because it includes all levels of visual information. It can be constructed from two-dimensional Gabor wavelets.28 A Gabor wavelet of a specific direction and scale can be considered a local bandpass filter for that direction and scale, whose response corresponds to the edges of that direction in the image. First, the image is divided into patches. For each patch of size , the cascading of its convolution in each channel is defined as
The average convolution of specific direction and scale for patch is , then the GIST descriptor can be expressed as
By detecting and combining the edge information among local patches, the GIST descriptor can describe the overall distribution of gradient information within the image.
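The construction described above can be illustrated with a minimal sketch: a small Gabor filter bank is applied to a grayscale image, and the response magnitude is averaged over a grid of patches. The function names, kernel size, grid size, wavelengths, and number of orientations are all illustrative assumptions and are not the settings used in this work.

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma):
    """Real part of a 2-D Gabor filter (illustrative parameterization)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    y_rot = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_rot ** 2 + y_rot ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * x_rot / wavelength)
    kernel = envelope * carrier
    return kernel - kernel.mean()  # zero mean, so flat regions give no response

def gist_like_descriptor(gray, grid=4, n_orient=4, wavelengths=(4.0, 8.0)):
    """Average Gabor response magnitude per grid cell, concatenated."""
    h, w = gray.shape
    spectrum = np.fft.fft2(gray)
    features = []
    for wavelength in wavelengths:
        for k in range(n_orient):
            theta = np.pi * k / n_orient
            kern = gabor_kernel(15, wavelength, theta, sigma=0.5 * wavelength)
            # Circular convolution via the FFT; the kernel is zero-padded to image size.
            response = np.abs(np.fft.ifft2(spectrum * np.fft.fft2(kern, s=(h, w))))
            for rows in np.array_split(np.arange(h), grid):
                for cols in np.array_split(np.arange(w), grid):
                    features.append(response[np.ix_(rows, cols)].mean())
    return np.asarray(features)
```

With these assumed defaults, the descriptor has 2 scales x 4 orientations x 16 cells = 128 entries, and an image of vertical stripes responds far more strongly to the filter whose carrier varies horizontally than to the perpendicular one.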
In this study, the initial semantic labels were assigned to each superpixel, instead of individual pixels, due to the spatial supports among pixels. Specifically, the superpixels, within both the input image and the images in , were obtained using the simple linear iterative clustering (SLIC) method.29 Then, we adopted three features, denoted as (): the scale-invariant feature transform (SIFT) descriptor,30 color mean in Lab color space, and central location of the superpixel, in order to describe each superpixel. denotes the set of superpixels of the input image, and denotes all the superpixels achieved from set . For each superpixel , its neighborhood was defined as a set of superpixels , which had the nearest Euclidean distance to in terms of the ’th feature . In this work, included its closest 15 superpixels.
Next, each superpixel was assigned a semantic label , where represented the set of semantic classes. The probability distribution of semantic labels was defined as the normalized label likelihood score. In this work, the normalized label likelihood score , of each superpixel , was expressed by nonparametric density estimates
Then, the initial semantic label for each superpixel was achieved by maximizing the normalized label likelihood score (Fig. 1).
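As an illustration of this labeling step, the sketch below scores each class for each superpixel from the labels of its nearest training superpixels and then takes the maximum. The likelihood-ratio form with a smoothing constant `eps` is an assumption borrowed from common nonparametric scene-parsing practice, not a reproduction of the expression used here; the function and argument names are hypothetical. Only the neighborhood size of 15 comes from the text.

```python
import numpy as np

def initial_labels(query_feats, train_feats, train_labels, n_classes, k=15, eps=1e-3):
    """query_feats/train_feats: lists holding one (n, d_f) array per feature type."""
    train_labels = np.asarray(train_labels)
    n_query = query_feats[0].shape[0]
    class_totals = np.bincount(train_labels, minlength=n_classes).astype(float)
    rest_totals = class_totals.sum() - class_totals
    log_score = np.zeros((n_query, n_classes))
    for q, t in zip(query_feats, train_feats):
        # Pairwise Euclidean distances, shape (n_query, n_train).
        dist = np.linalg.norm(q[:, None, :] - t[None, :, :], axis=2)
        neighbours = np.argsort(dist, axis=1)[:, :k]
        for i in range(n_query):
            counts = np.bincount(train_labels[neighbours[i]], minlength=n_classes)
            in_class = (counts + eps) / (class_totals + eps)
            out_class = (counts.sum() - counts + eps) / (rest_totals + eps)
            # Assumed form: product of per-feature likelihood ratios (sum of logs).
            log_score[i] += np.log(in_class) - np.log(out_class)
    # Normalize the scores of each superpixel so that they sum to one.
    score = np.exp(log_score - log_score.max(axis=1, keepdims=True))
    score /= score.sum(axis=1, keepdims=True)
    return score.argmax(axis=1), score
```

Each feature type contributes one neighborhood, so a class is favored when it is over-represented among the neighbors relative to its frequency in the whole retrieved subset.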
Image Segmentation via Semantic Descriptors
Intuitively, the segmentation of the input image would be guided effectively if there existed a subset of training images with a foreground similar to that of the input image. Therefore, this subset of training data, named the semantic retrieval set, had to be determined by utilizing the initial semantic labels prior to complete segmentation.
Semantic Retrieval Set Determination via Foreground and Background Semantic Descriptor
First, the image was divided into two segments according to its Lab color features by using the k-means method.31 Considering that peripheral regions often appear as background in images, we assigned the peripheral segment a background label “0”; a foreground label “1” was assigned to the other segment. Then, the foreground semantic descriptor and the background semantic descriptor were defined in order to obtain the semantic retrieval set , such that the images in the set would have the most similar foreground objects to the input image.
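The two-segment split can be sketched as follows. This is a minimal two-cluster k-means on Lab pixel values, written with NumPy only; the deterministic initialization, the iteration count, and the rule that the cluster dominating a thin image border is the "peripheral" segment are assumptions made for illustration.

```python
import numpy as np

def two_means_foreground(lab_image, n_iter=20, border=2):
    """Return a mask with 1 = foreground and 0 = background."""
    h, w, _ = lab_image.shape
    pixels = lab_image.reshape(-1, 3).astype(float)
    # Deterministic initialization: the darkest and brightest pixels along L.
    centers = pixels[[pixels[:, 0].argmin(), pixels[:, 0].argmax()]].copy()
    for _ in range(n_iter):
        dist = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        assign = dist.argmin(axis=1)
        for c in range(2):
            if np.any(assign == c):
                centers[c] = pixels[assign == c].mean(axis=0)
    assign = assign.reshape(h, w)
    # Assumption: the cluster that dominates the image border is the background.
    edge = np.concatenate([assign[:border].ravel(), assign[-border:].ravel(),
                           assign[:, :border].ravel(), assign[:, -border:].ravel()])
    background = np.bincount(edge, minlength=2).argmax()
    return (assign != background).astype(np.uint8)
```

The border-vote rule is only one way to encode the observation that peripheral regions tend to be background; it fails by construction when the object touches most of the image boundary.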
The semantic descriptors were defined in a spatial pyramid structure. The segment labeled “foreground,” in the image, would be divided into equal grids with respect to different levels in the spatial pyramid. At each level, the semantic histogram was calculated within each grid. For instance, four semantic histograms , , , and were obtained in the second level of the spatial pyramid, corresponding to the four equal grids, as shown in Fig. 2.
Then, the foreground semantic descriptor was defined as the concatenation of all the semantic histograms within each grid at each pyramid level
Similarly, we can also define the background semantic descriptor . Experiments showed that the spatial pyramid layer was a good compromise between capturing enough details and avoiding being sensitive to the noise.
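A minimal sketch of such a descriptor is given below, assuming that the region is gridded over its bounding box, that level 1 is the whole region and level 2 is a 2 x 2 grid, and that each histogram is normalized to sum to one. The function name, the normalization, and the default number of levels are illustrative assumptions.

```python
import numpy as np

def pyramid_descriptor(label_map, region_mask, n_classes, levels=2):
    """Concatenate per-cell label histograms over a spatial pyramid."""
    rows, cols = np.nonzero(region_mask)
    if rows.size == 0:
        n_cells = sum(4 ** level for level in range(levels))
        return np.zeros(n_cells * n_classes)
    top, bottom = rows.min(), rows.max() + 1
    left, right = cols.min(), cols.max() + 1
    histograms = []
    for level in range(levels):
        n = 2 ** level  # n x n equal cells at this level
        row_edges = np.linspace(top, bottom, n + 1).astype(int)
        col_edges = np.linspace(left, right, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                window = (slice(row_edges[i], row_edges[i + 1]),
                          slice(col_edges[j], col_edges[j + 1]))
                # Only labels of pixels that belong to the region are counted.
                labels = label_map[window][region_mask[window] > 0]
                hist = np.bincount(labels, minlength=n_classes).astype(float)
                if hist.sum() > 0:
                    hist /= hist.sum()
                histograms.append(hist)
    return np.concatenate(histograms)
```

Calling the same function with the background mask yields the background descriptor; with two levels and `n_classes` classes, each descriptor has 5 x `n_classes` entries.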
In order to obtain the semantic retrieval set, we calculated the global GIST feature , foreground semantic descriptor , and background semantic descriptor , throughout the input image and all the images in the training set. Then, the similarity of the input image and the training set was defined as the Euclidean distance between the features.
Algorithm 1 Semantic retrieval set Ψ selecting scheme.
1. Initialize , and compute ;
2. Search for a subset in an ascending order of distance ;
3. Set ;
4. while none of background label of subset is the same as the input image’s background && do
8. Search a new subset in an ascending order of distance ;
10. end if
11. end while
12. return, compute , and obtain final subset .
By applying Algorithm 1, the training images were arranged according to the ascending order of , which corresponded exactly to the similarity of the input image. Finally, the semantic retrieval set was obtained by selecting the images corresponding to the smallest . More images contained in set would provide more clues for labeling the input image, but they would also decrease the similarity to the input image; therefore, in this study, a maximum of four training images was selected for the formation of the semantic retrieval set .
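The final ranking step can be sketched as below. The sketch assumes that the overall distance is the plain sum of the Euclidean distances of the three descriptors, and it does not reproduce the background-label check inside the loop of Algorithm 1; the dictionary keys and the function name are hypothetical. Only the limit of four images comes from the text.

```python
import numpy as np

def retrieve_similar(query, training, max_images=4):
    """query: dict of 1-D arrays; training: list of such dicts with the same keys."""
    keys = ("gist", "foreground", "background")
    distances = np.array([
        # Assumed combination: unweighted sum of per-descriptor distances.
        sum(np.linalg.norm(query[k] - item[k]) for k in keys)
        for item in training
    ])
    order = np.argsort(distances, kind="stable")  # ascending distance
    return order[:max_images], distances
```

The returned indices identify the training images that would form the semantic retrieval set under these assumptions.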
Figure 3 shows the semantic retrieval set corresponding to the input “cow” image. Set also had “cows” appearing in the foreground, which meant that the coarse initial semantic labels were able to provide an effective cue on what semantic categories the foreground belonged to.
Semantic Labels Assignment via Object Affinity
Once set was obtained, the semantic labels would be reassigned to each superpixel of the input image via object affinity.
Suppose is a superpixel of the input image, and is a superpixel in the ’th image of ; that is, . We denoted the distance between and with respect to the ’th feature as . Then, the distance measure between and was defined as2. Subsequently, within image , the nearest neighborhood of in set was obtained via .
Obviously, the semantic labels of should have provided an important cue for assigning semantic labels to , due to their high feature similarity. Moreover, the labels of the superpixels neighboring to in the input image should have obeyed the smoothness constraint, which reflected the distribution of semantic labels in natural images. The above idea can be expressed as the concept of object affinity.
Thus, the semantic label likelihood of , determined by the labels of , was described as a Gaussian function
Considering that the distribution of semantic labels tended to be smooth throughout natural images, we adopted the agglomerative clustering method10 in order to cluster the superpixels with respect to the Lab color feature. The semantic label propagation of the neighboring superpixels was achieved via object affinity
Initial Semantic Labels Refinement via Semantic Codebook
Although the initial semantic labels were coarse, they still offered a strong cue about the distribution of semantic labels. Since the semantic retrieval set could also provide the probability of labels for each superpixel, we compared the similarity between the initial semantic labels and the semantic labels generated from the semantic retrieval set. Generally, the higher the similarity between the two sets of semantic labels, the higher the reliability of the semantic labels.
Hence, the initial semantic labels were refined according to a semantic codebook, which was constructed for measuring the similarity between initial semantic labels and the semantic labels in the semantic retrieval set . The semantic codebook of set was set as , where was defined as the feature descriptor of all the superpixels labeled in the ’th (, 2, 3) feature channel (mentioned in Sec. 2) for a specific semantic class ; represented the set of semantic classes in set . Moreover, for any particular superpixel labeled , its feature () formed a codeword in the codebook.
For any superpixel assigned a label initially, if its initial label was included in , such that , the similarity of label was calculated only with respect to in the semantic codebook. However, if the initial label , the similarity was determined by examining all the codewords in . Specifically, the similarity for superpixel , which was initially labeled , was defined as
Finally, for the initial semantic label of the superpixel , the semantic probability was refined as
From Semantic Labeling to Segmentation
In this section, we describe image segmentation by maximizing the linear combination of the object affinity and the refined probability of the initial semantic labels
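The labeling rule described here amounts to a per-superpixel maximization over classes. The sketch below assumes that both terms are available as score arrays of shape (superpixels, classes) and that they are mixed with a single hypothetical weight `weight`; the actual weighting used in this work is not reproduced.

```python
import numpy as np

def fuse_and_label(affinity, refined_prob, weight=0.5):
    """affinity, refined_prob: (n_superpixels, n_classes) score arrays."""
    affinity = np.asarray(affinity, dtype=float)
    refined_prob = np.asarray(refined_prob, dtype=float)
    # Assumed linear combination with one mixing weight in [0, 1].
    combined = weight * affinity + (1.0 - weight) * refined_prob
    return combined.argmax(axis=1), combined

def labels_to_pixels(superpixel_map, superpixel_labels):
    """Spread one label per superpixel onto the pixel grid."""
    return np.asarray(superpixel_labels)[np.asarray(superpixel_map)]
```

`labels_to_pixels` expects `superpixel_map` to hold, for every pixel, the index of the superpixel that contains it, which is the usual output format of superpixel methods such as SLIC.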
The segmentation results with the semantic labels of sample images, shown in Fig. 4(b), also provided a bounding box indicating the possible foreground area. Finally, the Grab-Cut method25 was adopted as a postprocessing procedure to portray the boundary of the foreground precisely. The Grab-Cut method25 is a segmentation technique based on graph cuts. Before it is performed, a rectangular region of interest must be placed manually to indicate the location of the foreground in the image. The more precisely the rectangle encircles the object of interest, the more accurate the segmentation result is. If the rectangular region of interest is not placed well, a good segmentation result cannot be obtained, as shown in Fig. 4.
In this work, the semantic information can provide a bounding box indicating the possible foreground area. The Grab-Cut method is performed to further merge the neighboring regions assigned to the same semantic label in the bounding box and, at the same time, to precisely portray the boundary of the foreground via color features in the Lab space. Moreover, after performing the Grab-Cut method, for the areas where the new labels are not consistent with the segmentation result obtained by using Eq. (12), the higher will force these areas to retain their previous labels. In the experiments, we set the threshold to 1.4. Figure 5 shows the experimental results obtained by using our method with Grab-Cut as the postprocessing procedure, and the results obtained by using the Grab-Cut method with a precisely placed manual rectangular region of interest. Compared with Fig. 5(c), our result with Grab-Cut as the postprocessing procedure is crisper, and the boundary of the foreground is portrayed more precisely, as shown in Fig. 5(d). Also in Fig. 5(d), the green leaves among the red flowers in the middle of the image are correctly labeled as background, whereas they are wrongly labeled as foreground when only the Grab-Cut method is applied, as shown in Fig. 5(e).
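Deriving the rectangle from the semantic labels, instead of placing it by hand, can be sketched as follows. The function name, the background label value, and the safety margin are illustrative assumptions. The returned (x, y, width, height) order matches the rectangle convention commonly used by Grab-Cut implementations, for example the `rect` argument of OpenCV's `cv2.grabCut` in rectangle-initialization mode.

```python
import numpy as np

def foreground_rect(label_map, background_label=0, margin=2):
    """Return (x, y, width, height) enclosing all non-background pixels."""
    label_map = np.asarray(label_map)
    rows, cols = np.nonzero(label_map != background_label)
    if rows.size == 0:
        return None  # no foreground label was assigned anywhere
    h, w = label_map.shape
    # Expand the tight box by a small margin and clip it to the image.
    top = max(int(rows.min()) - margin, 0)
    left = max(int(cols.min()) - margin, 0)
    bottom = min(int(rows.max()) + margin, h - 1)
    right = min(int(cols.max()) + margin, w - 1)
    return left, top, right - left + 1, bottom - top + 1
```

A small margin leaves some certain background inside the rectangle border region, which graph-cut initializations generally benefit from; the value 2 here is arbitrary.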
Experimental Results
To verify the effectiveness of the proposed method, we conducted experiments on the MSRC 21 dataset, which contained 21 different classes with 276 training images and 256 testing images. In the experiments using the MSRC 21 dataset, the image subset was allowed to include a maximum of 25 training images, such that enough scenes similar to the input image could be selected.
In the experiments, we tested all of the 14 image classes in the MSRC 21 dataset. The intersection-over-union score was adopted in order to evaluate the precision of our algorithm. Moreover, we compared our algorithm with other segmentation algorithms;2,6,10–12,18,22 the results are listed in Table 1. A higher intersection-over-union score corresponds to higher segmentation precision. In addition, we listed the precision of our semantic labeling, named “Semantic label,” and the segmentation results obtained by using the initial labels with Grab-Cut as postprocessing, named “Initial+Grab-Cut,” in Table 1. “Ours” represents the results of our final segmentation with Grab-Cut as postprocessing. The subscript represents the rank of segmentation accuracy among the different methods.
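For reference, the intersection-over-union score of a predicted foreground mask against a ground-truth mask can be computed as in the short sketch below. The function name and the convention for two empty masks are assumptions.

```python
import numpy as np

def intersection_over_union(pred_mask, true_mask):
    """Ratio of overlapping foreground pixels to the union of both foregrounds."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    union = np.logical_or(pred, true).sum()
    if union == 0:
        return 1.0  # both masks empty: treated here as perfect agreement
    return float(np.logical_and(pred, true).sum() / union)
```

The score ranges from 0 (no overlap) to 1 (identical masks), so it penalizes both missed foreground and background wrongly marked as foreground.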
Table 1 Segmentation results evaluated by intersection-over-union score.
|Class||Ours||Initial+Grab-Cut||Semantic label||Ref. 12||Ref. 11||Ref. 10||Ref. 6||Ref. 2||Ref. 18||Ref. 22|
We also compared our results with a deep neural network technique.22 In the experiment, we directly evaluated the released pretrained model, trained on the PASCAL-Context dataset, on the MSRC dataset and computed the segmentation accuracy of the nine overlapping classes between the two datasets. The quantitative results are listed in Table 1. Admittedly, the average accuracy of our results is lower than that of Ref. 22, which involved a large number of training samples. However, our results achieved the best average accuracy among the methods based on hand-designed features. In addition, the segmentation accuracies of several classes, such as “cow,” “dog,” and “sheep,” are comparable to those of Ref. 22. We also achieved a better result on the class “chair” compared with Ref. 22. Figure 7 shows segmentation results obtained by using our proposed algorithm. We also compared our results with Ref. 22 visually, as shown in Figs. 7(c) and 7(f). For the results obtained by using Ref. 22, only the overlapping classes between the MSRC and PASCAL-Context datasets are shown. For example, segmentation results obtained by using Ref. 22 are not shown from the first row to the fourth row in Fig. 7, as the PASCAL-Context dataset does not include the classes “face,” “flower,” “sign,” and “house,” which is also reflected in Table 1.
Some images in the experiments remain very challenging. Our mechanism of semantic labeling and foreground segmentation depends on color and SIFT features. Consequently, our method is likely to fail on images in which the color or texture distribution of the foreground is similar to that of the background, as shown in Fig. 8.
Conclusion and Future Work
In this paper, an image segmentation framework based on semantic information was proposed. Unlike traditional methods based on low-level features, we adopted semantic information in order to distinguish the foreground from the background. In our study, the initial semantic labels were obtained using the nonparametric method. By searching for similar images in the training data, the input image was segmented via the combination of object affinity and semantic labels. Experimental testing using the MSRC 21 dataset demonstrated that our method achieved the best average accuracy among the compared methods based on hand-designed features. In future work, segmentation of video data by means of semantic information will be investigated.
This work was supported by the National Natural Science Foundation of China (No. 61005031).
Ding Yuan received her PhD in mechanical and automation engineering from the Chinese University of Hong Kong and has worked in the field of computer vision for over 15 years. She is now an associate professor at the Image Processing Center, School of Astronautics, Beihang University.
Jingjing Qiang is a postgraduate student at the Image Processing Center, School of Astronautics, Beihang University.