The human-vision system automatically identifies objects in static and dynamic scenes; this fundamental capability allows individuals to concentrate on attractive and important targets in complex scenes. In the computer-vision community, the task of simulating the human visual-attention system is referred to as salient region detection;1 the aim of a saliency model is to create an intensity map in which each value represents the probability that the corresponding location belongs to an object (objectness). Because the estimated saliency is a higher level feature map, the model can be used for various image-processing and pattern-recognition applications, such as visual tracking,2 object segmentation,3,4 object recognition,5,6 image matching,7 and image/video compression.8,9,10,11
Although the study of salient region detection is extensive and diverse, a common feature among most existing studies12,13,14,15,16,17 is that the models depend on the contrast feature. Because the contrast feature reflects the tendency of the human-visual system to concentrate on uniqueness and rarity,1 it has been widely used for the detection of the salient region. To improve saliency map quality, recent saliency models have begun to employ simple spatial features such as boundary prior or background information as a secondary feature, leading to significantly better performance compared with that of previous models. However, a simple boundary prior is fragile, and its integration with the contrast feature is mostly heuristic.17 To address these issues, soft-segmentation (SS)-wise saliency detection models15,17,18 were proposed, and they have made significant progress compared with other saliency models. The point of SS models is that an object’s saliency is interpreted by considering a homogeneous-region-level spatial model, which is also called “boundary connectivity (BC)”; in these models, an undirected weighted graph is employed to construct spatial weights between super pixels. During the color contrast computation between patches (or super pixels), these spatial weights are used to weigh similar colors, and the weights on the constructed graph can be regarded as SS information. These models are more robust than prior models that consider only simple background clues because they consider cluster-level (or segmentation-level) background clues. Intuitively, the models are reasonable because the human-visual system does not use only pixel-level clues to identify objects.
However, the approach is still not enough to represent the human-visual system; because it relies on the contrast feature, limitations are commonly observed when a high dissimilarity between objects (pixel-inside features) exists. A detailed description of the drawbacks of the contrast feature is given in Ref. 19. Despite these limitations, the SS-wise models are designed to use the contrast as the main feature, and the BC model is used only to assist the contrast feature. Although their background model can be used directly with hard-segmentation clues, the authors proposed a “soft” approach because image segmentation itself is an unsolved problem.17
Aiming to solve this problem, hard-segmentation (HS)-wise saliency detection models have been presented in Refs. 19 and 20. These models have shown that spatial background clues based on hard-segmented regions can express objectness well without the contrast feature. In the HS-wise models, multilevel hard-segmentation maps are constructed, and spatial saliency is then computed on the segmented maps using the robust background measurement; as a result, the models are significantly more robust against the limitation of contrast feature usage. However, because of undesirable discontinuous artifacts, the HS-wise saliency model suffers from local noise. These difficulties are endemic to the field; a few examples are shown in Fig. 1. In the second example, the SS-based models tend to lose the foreground information of the object (left tree) because of their dependency on the contrast feature. In the third example, although object regions are well defined, many local noise blobs are observed in the HS-based model because of undesirable discontinuous artifacts.
In this paper, we propose a combination model reflecting both soft- and hard-segmentation techniques. The motivation behind this combination is to overcome the above-mentioned limitations of existing models. Our proposed model makes the following contributions: (1) a combination system that encompasses both hard and soft techniques is proposed here for the first time, and it outperforms existing models; and (2) to achieve reliable hard-segmentation results, an iterative reweighting process that decreases the influence of outlier segmentation maps is proposed here for the first time. In addition, SS-wise saliency clues are employed as prior knowledge to improve the quality of the segmentation maps.
This paper is organized as follows: in Sec. 2, related works are briefly described with their advantages and disadvantages; the details of the proposed model are described in Sec. 3; in Sec. 4, the proposed method is evaluated against state-of-the-art approaches on four benchmark datasets; and in Sec. 5, a conclusion and some future work are presented.
Over the previous decades, a considerable number of studies regarding the visual-saliency model have been proposed based on various mechanisms and extensive reviews can be found in Refs. 21 and 22. In this section, we briefly review the related works based on several viewpoints.
Although handcrafted-saliency-detection models have been quite successful, its heuristic rules still present a limitation for a variety of challenging cases. Aiming to overcome this limitation, deep-learning23,24,25,26 based saliency models have recently been proposed. The common mechanism of deep-learning based models is that a discriminative feature between the foreground and background is automatically extracted and interpreted during the deep-learning training phase, and then the trained-network model is employed to compute the visual saliency. Note that the convolution neural network (CNN), which is effective for an image analysis, was usually used as the deep-learning algorithm. The CNN-based models have achieved better performance than the handcrafted-saliency models in a variety of challenging cases; however, a sufficient training dataset, a high-quality GPU, and considerable time are required for the learning part, and a failure-cause analysis is very difficult.19
Most saliency approaches12,15,16,17,18,10 were designed to employ contrast value as a main feature. The contrast-based saliency models consist of the following two types: global- and local-contrast-based models. The main mechanism of the global-contrast models computes the object’s saliency through the computation of the color contrast between each of the pixels and the mean value of an entire image. Although the global-contrast models are effective to detect salient regions of simple pattern images, these models have a limitation in a poor global contrast and a complex pattern image. The local-contrast-based models have been proposed to overcome the drawbacks of the global-contrast models. These models compute a salient region by considering the local neighborhoods of the pixels. Although these models are useful to an object’s saliency, they suffer from local noises when computing complex pattern images. Moreover, the window (kernel) sizes for different objects at different scales must be modified to optimize final salient region.1 As mentioned previously, the contrast that reflects a human’s visual attention system has been commonly employed as a standard feature for the most saliency models, but its extreme dependency on the most-highlighted region causes drawbacks when the object dissimilarity is high.19
Recently, the SS-wise saliency models were proposed and have shown excellent performance among the handcrafted-saliency models.15,17,18 The undirected weighted graph was constructed to obtain weight values between super pixels, and a robust boundary measure was employed as the spatial prior. In the SS-wise saliency model, the constructed graph can be regarded as soft clustering information, and it has the similar effect of analyzing hard-segmentation results. A graph model was presented in Ref. 15; the model incorporates local and global contrast, and these clues are combined by exploiting a robust background measure. Unlike the above models, the method of Ref. 16 directly builds hierarchical hard-segmentation maps using hierarchical-clustering techniques with three-heuristic thresholds; the saliency scores are calculated using both the local contrast clues and consistent-inference methods.16 Although various techniques to compute visual saliency have been applied, the soft segmentation-based models are consequentially based on the contrast feature. For this reason, the models suffer from contrast limitation.
In contrast to the SS-wise models, the aim of the hard-segmentation-wise saliency models19,20 is to detect the salient regions without the contrast feature, based on multilevel hard-segmentation maps; in these models, only the spatial features that represent the pixel variation and the location clues are adopted for the saliency-score computation. For this reason, the models are robust against the contrast limitation. However, because heuristic or simple parameter selection techniques are adopted in the hard-segmentation phase, undesirable outlier segmentation maps are often generated, which leads to poor performance. In addition, these works19,20 use a simple and heuristic integration method to generate the final fused saliency map without robust optimization processing. Although an optimization process called “recursive processing” was used to optimize the saliency map in the RRFC,19 this model is very time consuming and suffers from local noise because its recursive process tends to reinforce noise when the initial saliency map contains relatively strong noise saliency.
The proposed salient-object detection model is summarized in Fig. 2 and is fully presented in this section. The proposed model consists of the following four phases: (1) preprocessing, (2) SS-wise saliency, (3) HS-wise saliency, and (4) saliency optimization. In the preprocessing phase, an input image is abstracted as a set of super pixels using the simple linear iterative clustering (SLIC) algorithm;27 for the resulting set of super pixels, we mainly employ two types of regional features, namely the average color (CIELAB) and the centroid coordinates of each super pixel patch. In the second and third phases, the SS- and HS-wise saliency clues are computed; in particular, to acquire reliable hard-segmented regions, the SS-wise saliency clues are employed as a priori knowledge, and an iterative reweighting process is implemented to favor good segmentation maps during the HS-wise saliency computation. In the last phase, the saliency clues are optimized using an objective function containing a robust background measure.
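As a rough illustration of the two regional features named above, the following sketch computes the mean CIELAB color and the centroid of each super pixel from a label map. It assumes the super pixel labels have already been produced (for example by an SLIC implementation); the function name `regional_features` and the toy inputs are hypothetical and are not taken from the original implementation.

```python
from collections import defaultdict


def regional_features(labels, lab_image):
    """Return {label: (mean_L, mean_a, mean_b, centroid_row, centroid_col)}.

    labels    : 2-D list of super pixel labels (assumed to be given)
    lab_image : 2-D list of (L, a, b) tuples with the same shape
    """
    sums = defaultdict(lambda: [0.0] * 5)
    counts = defaultdict(int)
    for r, row in enumerate(labels):
        for c, lab in enumerate(row):
            L, a, b = lab_image[r][c]
            acc = sums[lab]
            acc[0] += L
            acc[1] += a
            acc[2] += b
            acc[3] += r  # accumulate row index for the centroid
            acc[4] += c  # accumulate column index for the centroid
            counts[lab] += 1
    return {lab: tuple(v / counts[lab] for v in acc) for lab, acc in sums.items()}


# Toy 2x3 image with two super pixels (hypothetical values).
labels = [[0, 0, 1],
          [0, 0, 1]]
lab_image = [[(50.0, 0.0, 0.0), (50.0, 0.0, 0.0), (80.0, 10.0, -10.0)]
             for _ in range(2)]
features = regional_features(labels, lab_image)
```

Each super pixel is thus described by a five-dimensional vector (three color values and two coordinates).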
Background Prior Model
To compute the object saliency corresponding to each segmented region or image patch, the robust background-measurement model called BC, proposed in Ref. 17, is considered. The BC definition is more robust than other boundary prior-based models, which are heuristic, simple, and fragile. Following Ref. 17, the BC of a region $R$ can be written as follows:

$$\mathrm{BC}(R) = \frac{\left|\{p \mid p \in R,\ p \in \mathrm{Bnd}\}\right|}{\sqrt{\left|\{p \mid p \in R\}\right|}}, \tag{1}$$

where $p$ is an image patch and $\mathrm{Bnd}$ is the set of image boundary patches. Figure 3 illustrates the definition of the BC; the example has four clustering regions, and the foreground and background clusters can be easily identified. By the BC definition, the blue and red clusters have values of 0.83 and 0.63, respectively, and the white and gray clusters have values of 2.41 and 2.80, respectively. The model computes the cluster-based connection strength with the image boundary, and it returns higher values for background regions. In summary, the salient regions are much less connected to the image borders than the background elements.
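The BC measure can be illustrated with a minimal sketch that assumes the form used in Ref. 17, namely the number of a region's patches lying on the image border divided by the square root of the region's total number of patches. Here each grid cell plays the role of one patch; the function name and the toy grid are hypothetical.

```python
import math


def boundary_connectivity(labels):
    """BC per region: border patch count / sqrt(region patch count)."""
    h, w = len(labels), len(labels[0])
    area, border = {}, {}
    for r in range(h):
        for c in range(w):
            lab = labels[r][c]
            area[lab] = area.get(lab, 0) + 1
            # A patch is a boundary patch if it lies on the image border.
            if r in (0, h - 1) or c in (0, w - 1):
                border[lab] = border.get(lab, 0) + 1
    return {lab: border.get(lab, 0) / math.sqrt(area[lab]) for lab in area}


# Region 1 is an interior object; region 0 is the surrounding background.
grid = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
bc = boundary_connectivity(grid)
```

In this toy grid the interior region touches no border cell and receives a BC of 0, whereas the background region holds all 14 border cells out of its 16 cells and receives 14 / 4 = 3.5, which matches the intuition that background regions are strongly connected to the image border.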
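In this phase, the SS-wise saliency17 is computed using both undirected-weighted-graph theory and the BC definition. In the first stage, the undirected weighted graph is constructed by connecting all adjacent super pixels; following Ref. 17, the “spanning area” of each super pixel $p$ is defined by the following equation:28

$$\mathrm{Area}(p) = \sum_{i=1}^{N} \exp\left(-\frac{d_{geo}^{2}(p, p_i)}{2\sigma_{clr}^{2}}\right) = \sum_{i=1}^{N} S(p, p_i),$$

where $d_{geo}(p, p_i)$ is the geodesic distance between super pixels on the graph and $N$ is the number of super pixels. The length along the image boundary and the resulting BC of a super pixel are

$$\mathrm{Len}_{bnd}(p) = \sum_{i=1}^{N} S(p, p_i)\,\delta(p_i \in \mathrm{Bnd}), \qquad \mathrm{BC}(p) = \frac{\mathrm{Len}_{bnd}(p)}{\sqrt{\mathrm{Area}(p)}}.$$

Based on the above definition, the background weight can be written as

$$w_i^{bg} = 1 - \exp\left(-\frac{\mathrm{BC}^{2}(p_i)}{2\sigma_{BC}^{2}}\right).$$

When the BC is large, $w_i^{bg}$ is close to 1, and its value represents a background probability. The foreground weight, which is called the background-weighted contrast, is defined as

$$w_i^{fg} = \sum_{j=1}^{N} d_{app}(p_i, p_j)\, w_{spa}(p_i, p_j)\, w_j^{bg},$$

where $d_{app}$ is the color distance and $w_{spa}$ is a spatial weight that decreases with the distance between super pixel centers (see Fig. 4).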
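The background weight and the background-weighted contrast can be sketched as follows. The sketch assumes the functional forms of Ref. 17: the background weight is 1 - exp(-BC^2 / (2 sigma^2)), and the contrast of a super pixel is the sum over all super pixels of its color distance to them, scaled by a Gaussian spatial weight and by their background weight. The parameter values (`sigma_bc`, `sigma_spa`) and the toy data are assumptions chosen for illustration only.

```python
import math


def background_weight(bc, sigma_bc=1.0):
    """Map a BC value to a background probability in [0, 1)."""
    return 1.0 - math.exp(-bc * bc / (2.0 * sigma_bc ** 2))


def weighted_contrast(colors, centers, w_bg, sigma_spa=0.25):
    """Background-weighted contrast for every super pixel."""
    n = len(colors)
    result = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            d_app = math.dist(colors[i], colors[j])    # color distance
            d_spa = math.dist(centers[i], centers[j])  # center distance
            w_spa = math.exp(-d_spa ** 2 / (2.0 * sigma_spa ** 2))
            total += d_app * w_spa * w_bg[j]
        result.append(total)
    return result


# Two background super pixels (high BC) and one object super pixel (BC = 0).
colors = [(50.0, 0.0, 0.0), (52.0, 0.0, 0.0), (80.0, 20.0, 20.0)]
centers = [(0.1, 0.1), (0.9, 0.1), (0.5, 0.5)]  # normalized coordinates
w_bg = [background_weight(b) for b in (3.0, 3.0, 0.0)]
ctr = weighted_contrast(colors, centers, w_bg)
```

Because the contrast to likely background regions is emphasized, the object super pixel receives a much larger foreground weight than the two background super pixels.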
Zhu et al.17 noted that the BC is intuitive but difficult to compute directly because image segmentation itself is a challenging and unsolved problem (e.g., parameter selection). For this reason, their study17 does not use the definition directly but applies it as a weight in the color contrast computation. However, as mentioned previously, the color contrast feature has a limitation when a high interobject dissimilarity exists. To overcome this limitation, the HS-wise saliency models19,20 were proposed, and the process of these models usually consists of three phases: multilevel segmentation-region construction, spatial-saliency computation, and optimization.
In this study, we use the hierarchical-clustering algorithm for the multilevel segmented-region construction; in terms of computation time, this is more efficient than using the mean-shift29 algorithm with multilevel kernels.30 To construct reliable segmented regions, we consider the foreground weight as a sixth regional feature, and this improves segmentation quality. In the hierarchical-clustering process, threshold values T (the number of classes) must be defined for hard-segmented region construction, and we set these thresholds empirically in the experiment. After constructing the hard-segmentation maps, we compute the corresponding saliency maps using the BC. Unlike the SS-wise models, in which the input unit is the patch, the robust background model is calculated directly without the color contrast computation, and the input unit in the HS-wise process is the hard-segmented region. To apply the background model directly to our work, the super pixel and the clustering region are considered as the patch and the observed hard-segmented region, respectively,20 in Eq. (1), which yields an HS-wise initial saliency map for each segmentation level. The multilevel saliency maps (Fig. 5, second row) are then linearly integrated to acquire the HS-wise saliency map. The visual results are shown in Fig. 5, where we can see well-defined clustering maps regardless of changes in the parameter T.
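The multilevel hard-segmentation step can be sketched with a small agglomerative clustering routine that is cut at several cluster numbers T. The text does not state which linkage criterion is used, so the centroid-distance merging below is an assumption, and the two-dimensional toy features stand in for the six regional features of each super pixel.

```python
import math


def agglomerate(features, n_clusters):
    """Merge the closest pair of clusters (by centroid distance,
    an assumed criterion) until n_clusters remain; return one label
    per input feature vector."""
    clusters = [[i] for i in range(len(features))]
    dim = len(features[0])

    def centroid(members):
        return [sum(features[m][d] for m in members) / len(members)
                for d in range(dim)]

    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = math.dist(centroid(clusters[a]), centroid(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(features)
    for k, members in enumerate(clusters):
        for m in members:
            labels[m] = k
    return labels


# Five toy super pixels; one segmentation map per cluster number T.
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
multilevel = {t: agglomerate(features, t) for t in (2, 3)}
```

Each entry of `multilevel` is one hard-segmentation map; a BC value computed per cluster would then give one saliency map per level.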
For the optimization process, just as in the SS-wise saliency process, the results should be expressed as two maps representing the foreground and background weights. In the proposed method, sigmoid functions are employed to obtain continuous (soft) foreground and background weight maps from the HS-wise saliency map. By comparison, the existing HS-wise models19,20 require a processing time of about 2 to 4 s.
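The exact sigmoid used here is not recoverable from the text, so the following is only a sketch under stated assumptions: the saliency values are normalized to [0, 1], the foreground weight is a logistic function with a configurable gradient and an assumed midpoint of 0.5, and the background weight is its complement. The names `soft_weights`, `gradient`, and `midpoint` are hypothetical.

```python
import math


def soft_weights(saliency, gradient=10.0, midpoint=0.5):
    """Turn normalized saliency values into soft foreground and
    background weights with a logistic function (assumed form)."""
    fg = [1.0 / (1.0 + math.exp(-gradient * (s - midpoint))) for s in saliency]
    bg = [1.0 - f for f in fg]
    return fg, bg


fg, bg = soft_weights([0.0, 0.5, 1.0])
```

A larger gradient pushes the weights toward a binary map, whereas a smaller gradient keeps them closer to the original continuous values.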
Iterative Reweighting Process
In the HS-wise stream, performance is significantly controlled by the segmentation map’s quality; we, therefore, attempt to reduce influence of outlier segmentation elements, which cause performance degradation during the iterative processing. The pseudocode of the iterative processing is described in Fig. 6, and its process consists of the following phases.
1. Similarity scores between the fused HS-wise saliency map and each individual HS-wise saliency map are computed using the 2-D correlation coefficient, and these scores are regarded as the weight values of the individual maps.
2. Each HS-wise saliency map is multiplied by its corresponding weight, and the weighted maps are then fused to update the fused HS-wise saliency map (weighted fusion).
3. Steps 1 and 2 are repeated until no significant change remains between the current and previous fused maps.
The proposed iterative reweighting process encourages statistical consistency, leading to decrease in the influence of outlier segmentation maps during the fusion process. The goal is to weight good segmentation maps and reject irregular sources; thus, the proposed iterative processing is not compute-expensive compared with the existing model,19 in which the mean-shift algorithm is repeatedly executed to improve segmentation results. Figure 7 shows that the visual performance of our saliency maps is enhanced with an increasing number of iterations.
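The iterative reweighting steps can be sketched as follows. Several details are assumptions rather than the authors' implementation: the initial fused map is the plain average of the individual maps, negative correlations are clamped to zero so that outlier maps are rejected, the weighted fusion is normalized by the sum of the weights, and the stop condition is a fixed tolerance on the largest per-element change. Maps are stored as flattened lists, which gives the same correlation coefficient as the 2-D form.

```python
import math


def correlation(a, b):
    """Pearson correlation coefficient of two flattened maps."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0.0 or vb == 0.0:
        return 0.0
    return cov / math.sqrt(va * vb)


def iterative_reweighting(maps, tol=1e-4, max_iter=50):
    n = len(maps[0])
    # Assumed initial state: unweighted average of all maps.
    fused = [sum(m[i] for m in maps) / len(maps) for i in range(n)]
    weights = [1.0] * len(maps)
    for _ in range(max_iter):
        # Step 1: similarity to the fused map becomes the weight.
        weights = [max(correlation(fused, m), 0.0) for m in maps]
        total = sum(weights)
        if total == 0.0:
            break
        # Step 2: weighted fusion updates the fused map.
        updated = [sum(w * m[i] for w, m in zip(weights, maps)) / total
                   for i in range(n)]
        change = max(abs(u - f) for u, f in zip(updated, fused))
        fused = updated
        # Step 3: stop when the fused map no longer changes noticeably.
        if change < tol:
            break
    return fused, weights


# Two consistent maps and one outlier map (toy values).
maps = [
    [0.9, 0.8, 0.1, 0.0],
    [1.0, 0.7, 0.0, 0.1],
    [0.1, 0.2, 0.9, 0.8],
]
fused, weights = iterative_reweighting(maps)
```

In this toy case the outlier map is anticorrelated with the fused map, so its weight drops to zero and the fused result follows the two consistent maps.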
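In prior works, the saliency clues computed from the multilevel phases are combined heuristically using weighted summation or multiplication.19,20 In the proposed method, we employ a cost function based on the error-minimization technique to optimize the final saliency region. Given the foreground and background weight maps, the objective cost function is defined by the following equation:15,17,31

$$s^{*} = \arg\min_{s}\ \sum_{i=1}^{N} w_i^{bg}\, s_i^{2} + \sum_{i=1}^{N} w_i^{fg}\,(s_i - 1)^{2} + \sum_{i,j} w_{ij}\,(s_i - s_j)^{2},$$

where $s_i$ is the saliency value of super pixel $p_i$ and $N$ is the number of super pixels. A high foreground weight $w_i^{fg}$ encourages $s_i$ to take a value close to 1, and a high background weight $w_i^{bg}$ encourages $s_i$ to move close to 0. The last smoothness term encourages continuous saliency values, and it is effective in removing local noise in both the foreground and background regions. For every adjacent super pixel pair $(p_i, p_j)$, the smoothness weight is defined by the following equation:

$$w_{ij} = \exp\left(-\frac{d_{app}^{2}(p_i, p_j)}{2\sigma_{clr}^{2}}\right) + \mu,$$

where $d_{app}$ is the color distance between the two super pixels and $\mu$ is a small regularization constant.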
The results of the optimization process are shown in Fig. 8; as can be seen in the following figure, the overall salient region is enhanced after the optimization process is implemented, and a significant improvement exists when comparing with a simple integration result [Fig. 8(f)].
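A minimal sketch of such an optimization is given below. It assumes the quadratic cost of Ref. 17, that is, a background term w_bg * s^2, a foreground term w_fg * (s - 1)^2, and a smoothness term w_ij * (s_i - s_j)^2 over adjacent super pixels. Setting the derivative with respect to each s_i to zero gives a simple Gauss-Seidel update; the solver choice, the function name, and the toy weights are assumptions made for illustration.

```python
def optimize_saliency(w_bg, w_fg, edges, n_sweeps=200):
    """Minimize the assumed quadratic cost by Gauss-Seidel sweeps.

    edges : list of (i, j, w_ij) for adjacent super pixel pairs
    """
    n = len(w_bg)
    neighbors = [[] for _ in range(n)]
    for i, j, w in edges:
        neighbors[i].append((j, w))
        neighbors[j].append((i, w))
    s = list(w_fg)  # start from the foreground weights
    for _ in range(n_sweeps):
        for i in range(n):
            # Stationarity condition of the cost with respect to s[i].
            num = w_fg[i] + sum(w * s[j] for j, w in neighbors[i])
            den = w_bg[i] + w_fg[i] + sum(w for _, w in neighbors[i])
            if den > 0.0:
                s[i] = num / den
    return s


# Chain of four super pixels: 0 and 1 look alike (strong edge), as do 2 and 3.
w_fg = [0.9, 0.1, 0.0, 0.0]
w_bg = [0.1, 0.1, 1.0, 1.0]
edges = [(0, 1, 1.0), (1, 2, 0.05), (2, 3, 1.0)]
saliency = optimize_saliency(w_bg, w_fg, edges)
```

Super pixel 1 has only weak foreground evidence of its own, but the smoothness term pulls it toward its similar neighbor 0, whereas super pixels 2 and 3 stay close to 0.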
The experiments were conducted on an Intel(R) Core(TM) i5 4670 CPU at 3.40 GHz with 12 GB of memory. The proposed model was evaluated on four benchmarks: MSRA-ASD, MSRA10K, ECSSD, and SED2; its performance is compared with those of state-of-the-art methods, namely CHS,16 RC,12 SO,17 RFC,20 and RRFC.19 The competing models were selected based on their citations and high performance. The MSRA-ASD12 dataset includes 1000 single-object images with pixel-wise ground truth taken from the MSRA10K dataset; it is the most commonly used dataset for the evaluation of salient-detection performance. The ECSSD16 contains 1000 images with complex patterns in both the foreground and background. The SED232 is a multiple-salient-object benchmark, which consists of 100 images containing two objects with a high dissimilarity.
Setup and Evaluation Methods
The precision, recall, and F-measure ($F_\beta$), which are commonly used for a quantitative comparison of different models, were considered for the performance evaluation. For a reliable comparison of the various saliency-detection methods, the salient regions should be evaluated while varying a fixed threshold from 0 to 255; here, precision is the percentage of detected salient pixels that correspond to the ground truth, whereas recall is the ratio of correctly detected salient pixels to the total number of ground-truth salient pixels. As discussed in Refs. 19, 33, and 34, the true negative counts are not considered in either the precision or the recall measure, which means that these measures cannot be used to evaluate the nonsalient regions. For the quantitative comparison, we therefore used the F-measure curve, for which various thresholds are considered, and the AUC, which is the area under the F-measure curve, instead of the precision–recall curve; the F-measure is calculated using the following equation:

$$F_\beta = \frac{(1+\beta^{2})\cdot \mathrm{Precision}\cdot \mathrm{Recall}}{\beta^{2}\cdot \mathrm{Precision} + \mathrm{Recall}}.$$
The recall metric gives the percentage of true positive pixels in the saliency map relative to the total number of positive pixels in the ground truth, and the precision metric gives the percentage of detected true positives relative to the total number of positive pixels in the detected binary mask. Since $\beta^{2} = 0.3$, which weights precision more than recall, is set for most of the existing methods,12,16,17,19,20,34 $\beta^{2} = 0.3$ was also set for the quantitative-performance comparison with the state-of-the-art methods. The performances of the saliency models were also evaluated according to the average precision, recall, and $F_\beta$, which are commonly used in related areas; here, an image-dependent adaptive threshold35 computed as twice the mean saliency was used to binarize the saliency map. For a more comprehensive comparison, we also evaluated the saliency-detection models using the mean absolute error (MAE), which measures the difference between the continuous saliency map $S$ and the ground truth $GT$, both normalized to the range from 0 to 1. The MAE score is defined by the following equation:

$$\mathrm{MAE} = \frac{1}{W \times H}\sum_{x=1}^{W}\sum_{y=1}^{H}\left|S(x,y) - GT(x,y)\right|,$$

where $W$ and $H$ are the width and height of the saliency map.
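The evaluation measures described above can be sketched in a few lines. The sketch uses the standard F-measure with beta squared equal to 0.3, binarizes the map with the adaptive threshold of twice the mean saliency, and computes the MAE on values normalized to [0, 1]. Capping the threshold at 1.0 is an assumption added here so that very bright maps still yield a non-empty mask; the flattened toy inputs are hypothetical.

```python
def f_measure(precision, recall, beta2=0.3):
    """Weighted harmonic mean of precision and recall."""
    denom = beta2 * precision + recall
    if denom == 0.0:
        return 0.0
    return (1.0 + beta2) * precision * recall / denom


def evaluate(saliency, ground_truth):
    """Return (precision, recall, F-measure, MAE) for flattened maps
    with saliency in [0, 1] and ground truth in {0, 1}."""
    n = len(saliency)
    # Adaptive threshold: twice the mean saliency (capped at 1.0, assumed).
    threshold = min(2.0 * sum(saliency) / n, 1.0)
    binary = [s >= threshold for s in saliency]
    tp = sum(1 for b, g in zip(binary, ground_truth) if b and g)
    detected = sum(binary)
    positives = sum(ground_truth)
    precision = tp / detected if detected else 0.0
    recall = tp / positives if positives else 0.0
    mae = sum(abs(s - g) for s, g in zip(saliency, ground_truth)) / n
    return precision, recall, f_measure(precision, recall), mae


saliency_map = [0.9, 0.8, 0.2, 0.1, 0.0, 0.0]
ground_truth = [1, 1, 1, 0, 0, 0]
precision, recall, f_beta, mae = evaluate(saliency_map, ground_truth)
```

For this toy input the adaptive threshold is about 0.67, so two of the three object pixels are detected with no false positives; the precision is 1, the recall is 2/3, and the MAE is 0.2.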
Quantitative Performance Comparison
Our model was evaluated on four datasets, namely MSRA-ASD, MSRA10K, ECSSD, and SED2, and its performance is compared with those of the state-of-the-art methods CHS,16 RC,12 SO,17 RFC,20 and RRFC.19 The competing models were selected based on their performance. Note that the parameters of the compared saliency models were set according to the settings reported in the original manuscripts.12,16,17,19,20 The quantitative comparisons are presented in Fig. 9, whose rows correspond to the benchmarks (from top to bottom: MSRA-ASD, MSRA10K, ECSSD, and SED2) and whose columns correspond to the evaluation methods (from left to right: F-measure curve, $F_\beta$-measure, and MAE).
Because the saliency map consists of continuous intensity values, it should be evaluated while varying a fixed threshold from 0 to 255; we therefore used the F-measure curve, for which various thresholds are considered. The proposed method is evaluated on both the MSRA-ASD and MSRA10K datasets. The MSRA-ASD dataset includes 1000 single-object images with pixel-wise ground truth taken from the MSRA10K dataset. Notably, even though the MSRA-ASD is made up of images with simple foregrounds and backgrounds, it has been the most commonly used dataset for the evaluation of salient-detection performance in recent years. To obtain more extensive experimental results, the MSRA10K dataset, which is composed of 10,000 single-object images with pixel-level ground truth, is also used in this test. In Fig. 9 (MSRA-ASD and MSRA10K), in terms of the F-measure curve, our model and the RRFC outperform the other models; in the range between 0.3 and 0.8, our model is clearly the best. Our model also clearly outperforms the existing models in $F_\beta$ (second column, MSRA-ASD and MSRA10K). The RRFC and our model achieve favorable MAE scores, which means that these models produce a well-defined background.
To overcome the simplicity of the MSRA, the ECSSD, which contains 1000 images with complex patterns in both the foreground and background, was proposed in Ref. 16; although this dataset includes many semantically meaningful images, the images are structurally complex, which makes the evaluation challenging. In Fig. 9 (ECSSD, left), our curve is consistently higher than those of the existing models in the range between 0.2 and 0.9, and our model is also the best in terms of $F_\beta$. However, our model shows only ordinary performance on the ECSSD in terms of the MAE, which indicates that the proposed model tends to fail at background region detection in complex-pattern images.
The SED2 dataset32 consists of 100 images containing exactly two objects, and the pixel-wise ground truth is also provided. In particular, some of the images pose two challenges: first, the properties of the objects are radically different; and second, the objects are located at the image borders. The SED2 evaluations therefore allow for an immediate identification of the limitations of the existing approaches. In terms of the F-measure curves, our model clearly belongs to the high-performance group, but the ranking within this group is ambiguous. However, the proposed model outperforms the others in both the MAE and $F_\beta$.
In consideration of the performance, the proposed model and the RRFC have achieved outstanding performance compared with those of other models; there may be a debate, however, regarding which model is better. Note that the proposed model is highly competitive when compared with the RRFC; the RRFC is very computation-intensive because of its recursive process. Given a typical image, the RRFC takes 4 s for testing; in addition, the time consumption of the RRFC is significantly irregular because more than average processing time is often required to reach a convergence state according to image states (i.e., some cases take 8 to 15 s for testing). Unlike the RRFC, the proposed model consumes just 0.35 s per image, and its overall processing time is more regular than that of the RRFC. The processing time results regarding the saliency models are shown in Table 1. In particular, the performance of the proposed model is generally outstanding in terms of the F-measure, and this phenomenon means that the model successfully detects the foreground region and the respective spatial locations in the scenes.
Computation time comparisons.
The visual comparisons regarding the four benchmarks (MSRA-ASD, 10K, ECSSD, and SED2) are shown in Fig. 10. The results of the proposed model are relatively accurate compared with those of the existing model; in particular, a great improvement is evident when a comparison is made with its previous models (SO and RFC). In relation to the SED2 benchmark, Fig. 10 shows that our model correctly and uniformly highlights multiple salient objects regardless of both the higher object dissimilarity and the number of objects, whereas the object regions of the existing models are not uniformly highlighted. In terms of the complex-pattern image, the proposed method not only successfully detects the object region, but it also clearly eliminates the background. In summary, the proposed model is generally superior regardless of the benchmark type; in particular, the outstanding rates show that the foreground clues of the proposed model are well highlighted compared with those of the existing models.
Analysis of Proposed Model
In this section, we further analyze how the accuracy of the proposed model is affected by its parameters and by its component functions. First, the performance according to the number of super pixels is described in Table 2. As the number of super pixels increases, the processing time also increases, and the results show that our model achieves favorable performance with between 300 and 400 super pixels. In general, the number of super pixels does not have a major influence on the final results. The influence of the gradient value of the sigmoid function used for normalization is described in Table 3, which identifies the most suitable gradient for the benchmark and also includes the result obtained using harmonic-mean binary maps without the sigmoid function; as can be seen from the results, the continuous maps produced by the sigmoid functions are more advantageous for obtaining the foreground and background weights than the harmonic-mean binarization. As the gradient value decreases, our model tends to perform well at lower threshold values and poorly at higher threshold values, whereas the opposite effect occurs with larger gradient values. This finding means that, with small gradient values, the foreground saliency values are concentrated at low intensities. The performance comparisons according to the number of hierarchical clusters are described in Table 4, and these results show that the use of too many clusters causes performance degradation. The precision scores suggest that this degradation is caused by the over-segmented regions produced by higher threshold values. The results in Table 5 show that the combination of the two streams is advantageous over the independent use of each stream. 
In particular, the recall score is greatly improved after the combination process, and the HS-wise stream shows better performance than the SS-wise stream. These experimental results confirm the synergy effect of the proposed combination mechanism. As the number of iterations increases, the performance of our model improves rapidly at first and then converges, so that further iterations bring no additional gain (Fig. 11). A considerable performance enhancement was observed between the first and second iterations in terms of the F-measure curve and MAE. In addition, the result obtained with the proposed stop condition (adaptive threshold) of recursive processing clearly falls within the convergence domain.
Performance comparison according to super pixel numbers.
|Super pixels||Precision||Recall||F-measure||Runtime (s)|
Performance comparison according to gradient value of sigmoid function.
Performance comparison according to cluster numbers in the HS-wise process.
|Cluster numbers||Precision||Recall||F-measure|
|T = [2 to 4]||0.833||0.792||0.829|
|T = [2 to 8]||0.835||0.809||0.833|
|T = [2 to 12]||0.827||0.814||0.826|
|T = [2 to 16]||0.820||0.819||0.820|
|T = [2 to 20]||0.816||0.822||0.816|
Performance comparison according to combination process.
In this paper, we proposed a combination model reflecting both soft- and hard-segmentation techniques. In particular, in the HS-wise stream, an iterative reweighting process was proposed to decrease the influence of outlier segmentation maps, and the prior knowledge generated from the SS-wise stream was employed to enhance the quality of the segmentation maps; the proposed model provides favorable results compared with the existing models in terms of both performance and processing time. In addition, in the combination phase, a robust optimization function was used to fuse the results from the two streams, and the results show that the combination of the two streams outperforms the independent use of each stream. The experimental results demonstrate that our model achieves superior MAE and F-measure scores on benchmarks that contain simple, complex, and multiple objects. In terms of the limitations of the proposed model, the final result obtained using the iterative processing relies heavily on its initial state, and the weighted fusion method is very simple; furthermore, the hierarchical-clustering algorithm occasionally fails to detect optimal clusters when the feature distribution is insufficient and its gradient is unclear, leading to poor segmentation results. In future work, we plan to improve the performance of the proposed model using an adaptive fusion method36,37 and multiple clustering algorithms (ensemble technique); in addition, a theoretical analysis of the proposed model needs to be conducted.
This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2017-2015-0-00378) supervised by the IITP (Institute for Information & communications Technology Promotion).
Kanghan Oh received his BS degree in computer science from Honam University, Korea, in 2010 and his PhD in electronic and computer engineering from Chonnam National University, Korea, in 2017. Currently, he is a postdoctoral researcher of Division of Electronics and Computer Engineering at Chonbuk National University, Korea. His research interests are object detection, neuroimaging, and document image processing.
Kwanjong You graduated from Chosun University, Department of Mechanical Engineering, in February 1988. In February 1991, he graduated from Inha University with a master’s degree in energy engineering. In August 2005, he received his PhD in engineering from Mokpo National University Graduate School of Engineering. Currently, he teaches as a professor at Chosun University Future Society Convergence University.