Methods used in microscopic autofocus systems (MAFSs) are generally based on maximizing a focus measure obtained from a captured image; this measure is evaluated as a function of the lens position along its axis. Many works report algorithms that control the motion of the lens along the axis to efficiently find the best position: based on the evaluation of that focus measure, they obtain the single best-focused image.1
In the case of a MAFS applied to cytology observation, when working with high-magnification targets, one of the principal issues is that, in many cases, cells are in fact located at different levels in the slide, even corresponding to different ranges of the depth-of-field (DoF) of the lens. In these cases, the best-focused image selected by a classical autofocus method will include unfocused parts, hence missing important information. A solution is to combine a set of images captured with different DoFs to obtain a fully focused image.
Image fusion is the process of combining relevant visual information (i.e., important, complementary, and redundant) from multiple input images into a single resulting image. This should be achieved without introducing artifacts, in a way in which the resulting image contains more accurate, stable, and complete information2–4 than the input images, therefore making it more suitable for human perception and for later processing operations (e.g., segmentation and feature extraction). For the autofocus application, this resulting image might be obtained by applying multifocus image fusion (MFIF) techniques: a set of images is captured from a static scene at different focus levels, and the focused objects in this set of images are fused together to create a sharp image with all those relevant objects fully focused.1,3–5 While the reported MFIF methods are generally applied to fuse two images of a scene, the underlying techniques can be adapted to fuse a larger set of images, as we propose.
The context of our work is a project (see ACK section for details) to help early diagnosis of cervical intraepithelial neoplasia in the rural areas of the Coahuila State (Mexico). In these areas, for cultural reasons, women tend to refuse to travel to the capital until the symptoms of the disease are unbearable. The objective of the project is to enable the Papanicolaou test by automating it, so that tissue samples can be taken in the rural area and selected images of these samples can be sent telematically for diagnosis by specialists. This requires capturing hundreds of focused images per tissue sample (see our autofocus contributions in Ref. 6) and analyzing these images to identify and segment cervical nuclei (see our contributions in this area in Ref. 7). In this paper, we target the enhancement of the captured images via MFIF techniques.
MFIF techniques operate on a set of input images of a single scene, each with a different DoF. Overall, these images are first partitioned into generally homologous regions. Regions sharing the same location in the set of images are then evaluated, and the region with the highest focus measure is selected; finally, the selected regions, usually from different images, are fused to compose the final focused image. Many MFIF algorithms have been reported in the literature; a comprehensive review can be found in Refs. 8 and 9. MFIF techniques are broadly used in many application fields, such as microscopy,10 biology,11 and medical imaging.12
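The partition, evaluate, and select scheme described above can be sketched in a few lines. This is only an illustrative sketch, not any of the cited algorithms: it assumes square blocks as the homologous regions, uses the intensity variance of a block as a stand-in focus measure, and represents images as plain nested lists so that it needs nothing beyond the Python standard library. The names `block_variance` and `fuse_by_blocks` are hypothetical.

```python
from statistics import pvariance


def block_variance(img, r0, c0, size):
    """Assumed focus measure: intensity variance inside one square block."""
    values = [img[r][c]
              for r in range(r0, r0 + size)
              for c in range(c0, c0 + size)]
    return pvariance(values)


def fuse_by_blocks(images, size):
    """Copy every block from the input image in which it scores highest.

    Assumes all images share the same shape and that both dimensions
    are multiples of `size`.
    """
    rows, cols = len(images[0]), len(images[0][0])
    fused = [[0] * cols for _ in range(rows)]
    for r0 in range(0, rows, size):
        for c0 in range(0, cols, size):
            scores = [block_variance(img, r0, c0, size) for img in images]
            best = scores.index(max(scores))
            for r in range(r0, r0 + size):
                for c in range(c0, c0 + size):
                    fused[r][c] = images[best][r][c]
    return fused


# Two toy 2x4 images: each one has contrast (is "focused") in one half only.
left_sharp = [[0, 9, 5, 5],
              [9, 0, 5, 5]]
right_sharp = [[5, 5, 0, 9],
               [5, 5, 9, 0]]
fused = fuse_by_blocks([left_sharp, right_sharp], size=2)
print(fused)
```

In the toy example, the left block is taken from the first image and the right block from the second, so the fused image has contrast in both halves.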
In this paper, for a fair comparison, we analyze and compare against works that use focus measures like those of our base autofocus system (i.e., transformed-domain measures6). In this direction, the work in Ref. 1 applies a block discrete cosine transform (DCT) to two input images; then, it compares homologous blocks of both images using the variance of the coefficients and selects the block with the highest value; finally, it applies the inverse DCT to the image composed of the selected blocks. The authors also propose a variant that applies a consistency verification index for block selection to enhance the resulting image quality. The work in Ref. 4 is similar to that in Ref. 1 but uses a different measure to compare blocks. Block-based methods generally present artifacts, because parts of the focused cells at different levels of the DoF might belong to the same block. Kumar5 uses the discrete cosine harmonic wavelet transform (DCHWT), a multiscale technique (three levels, in this case). As these multiscale methods involve decimation, most pixels in the resulting image do not keep the original pixel values of any of the source images. Recently, some fusion techniques operating in the gradient domain have been reported, such as that in Ref. 3. This work, which also follows a multiscale approach, uses a focus measure based on the saliency structure of the image, and it is designed to operate on well-known images (i.e., flower, clock, pepsi, etc.) with just two objects (one focused and the other unfocused).
Some recent works present similar fusion techniques applied to combine different sources of information into a single image but not necessarily due to a multifocus situation. Liu et al.13 described an MFIF method that separates source images into “cartoon content” and “texture content” via an improved iterative reweighted decomposition algorithm; fusion rules are designed to separately fuse both types of content, and finally, the fused cartoon and texture components are combined. The technique naturally approximates the morphological structure of the scene. The work in Ref. 14 presents a medical application to diagnose vascular diseases; the authors use a type of wavelet transform combined with an averaging-based fusion model to fuse osseous and vascular information together, and they present a rapid MFIF algorithm that is less complex but still very effective, with very low memory requirements. With a similar objective, Dogra et al.15 propose an effective image fusion method that also works in the wavelet domain, along with a preprocessing of the source images with a selected sequence of spatial and transformed-domain techniques to create a highly informative fused image for osseous-vascular 2-D data.
In this paper, we propose an MFIF method that analyzes sequences of up to 15 microscopy input images corresponding to different levels of DoF of the same “slide-scene.” We propose (Sec. 2) an object-based approach, which dramatically reduces the visibility of fusion-generated artifacts while keeping the focused parts of the input images intact. To evaluate our results, we compare with five different existing techniques (Sec. 3) by testing over 50 realistic and practical Pap smear image sequences, and over the two blurred microscopic images provided by Ref. 11. Finally, conclusions are presented in Sec. 4.
Proposed Multifocus Microscope-Image Fusion Method
Figure 1 illustrates the general flow of the proposed MFIF method, which is further detailed in the following subsections. The starting point is a set of images (15 in our experiments, but 2 in the other works we compare with) captured with the lens at varying positions along the axis. The first step, which is not the topic of this paper, is the selection of the “best-focused image” of the set (see details in Ref. 6). The method therefore takes as input the set of images (Fig. 1), the best-focused image, and the index of the latter in the set. The best-focused image is first coarsely segmented to identify its main regions or objects, which are considered the main scene objects, each represented by a binary mask. Then, for each scene object or segmented region, its mask is applied to the set of input images, and a focus measure is obtained for that region in every image of the set. A preliminary image, which we name the “combined image,” is then generated by replacing in the best-focused image each segmented region with the corresponding best-focused region in the set. Finally, the artifacts generated in the contours of these combined regions are removed by a total variational-based filter16 to obtain the “final focused image.”
We use the mean-shift algorithm17 to obtain a coarse segmentation of the best-focused image (Fig. 1) into regions or clusters. Mean-shift is a nonparametric technique for analyzing multimodal data that has multiple applications in pattern analysis,18 including its use for image segmentation. We start from the observation that cells have a predetermined size and colors that are always much darker than the background. We characterize each image pixel by a five-dimensional or three-dimensional vector, depending on whether the input images are RGB or gray: it combines the pixel color (or gray level) and the pixel coordinates. We then run the mean-shift algorithm over this five-dimensional or three-dimensional distribution with a bandwidth selected so that cell regions and background are each segmented into more than one cluster; this is required for the subsequent assignment process to be effective. A proper selection of the bandwidth is somewhat application dependent: it should be larger than the smallest nonfocused region. However, if this requirement is met, its effect on the results is negligible. Note that block-based approaches also prefer unfocused regions to be larger than the block size, but there is usually no flexibility in the selection of this size.
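As an illustration of how mean-shift groups pixels described by intensity and position, the following minimal sketch runs a flat-kernel mean-shift on a tiny gray image. It is a sketch under stated assumptions rather than the implementation of Ref. 17: the feature vector is taken as (gray level, column, row) without any scaling between color and spatial components, the kernel is flat, and the bandwidth and merge tolerance are arbitrary toy values. The function names are hypothetical.

```python
import math


def mean_shift_modes(points, bandwidth, n_iter=30):
    """Flat-kernel mean-shift: move each point to the mean of its neighbours."""
    modes = [list(p) for p in points]
    for _ in range(n_iter):
        for i, m in enumerate(modes):
            near = [p for p in points if math.dist(p, m) <= bandwidth]
            if near:  # guard against an empty neighbourhood
                modes[i] = [sum(axis) / len(near) for axis in zip(*near)]
    return modes


def label_modes(modes, tol):
    """Give the same label to points whose modes ended up within `tol`."""
    centers, labels = [], []
    for m in modes:
        for k, c in enumerate(centers):
            if math.dist(m, c) <= tol:
                labels.append(k)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels


# Toy gray image: a dark "cell" on the left, bright background on the right.
image = [[20, 22, 21, 200, 202, 201],
         [21, 20, 22, 201, 200, 202]]
rows, cols = len(image), len(image[0])

# Assumed feature vector per pixel: (gray level, column, row).
features = [(image[r][c], c, r) for r in range(rows) for c in range(cols)]

modes = mean_shift_modes(features, bandwidth=30.0)
flat_labels = label_modes(modes, tol=1.0)
segmentation = [flat_labels[r * cols:(r + 1) * cols] for r in range(rows)]
print(segmentation)
```

With this large bandwidth the toy image splits into exactly two clusters; a smaller bandwidth would split the cell and the background into several clusters each, which is the over-segmented behaviour the text asks for.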
Preliminary Image Fusion Based on the DCT Focus Measure
The next step is to generate a “combined image” (follow this process in Fig. 2), which merges the best-focused parts of the set of input images. First, a focus measure is locally obtained for every image of the set, following the method described in Ref. 6: in brief, for every image of the input set, a block DCT is performed, the sum of the absolute values of the 32 lower-frequency AC coefficients of each block is calculated, and a same-size energy image is obtained by assigning each pixel the calculated energy of its corresponding block. This results in a set of DCT energy images [Fig. 2(a)].
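The block energy described above can be sketched as follows. Several details are assumptions made only for illustration, since they are not stated in this text: a block size of 8, an orthonormal DCT-II, and a JPEG-style zigzag scan to decide which 32 AC coefficients count as the lower-frequency ones. Images are nested lists whose dimensions are multiples of the block size, and all function names are hypothetical.

```python
import math

N = 8  # assumed block size


def dct2(block):
    """Orthonormal 2-D DCT-II of an N x N block (direct O(N^4) form)."""
    def alpha(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = alpha(u) * alpha(v) * s
    return out


# Zigzag scan: by anti-diagonal, alternating direction on each diagonal.
ZIGZAG = sorted(((u, v) for u in range(N) for v in range(N)),
                key=lambda p: (p[0] + p[1],
                               p[0] if (p[0] + p[1]) % 2 else -p[0]))
LOW_AC = ZIGZAG[1:33]  # skip the DC term, keep the next 32 coefficients


def block_energy(block):
    """Sum of absolute values of the 32 lower-frequency AC coefficients."""
    coeffs = dct2(block)
    return sum(abs(coeffs[u][v]) for u, v in LOW_AC)


def energy_image(img):
    """Same-size image in which every pixel holds the energy of its block."""
    rows, cols = len(img), len(img[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r0 in range(0, rows, N):
        for c0 in range(0, cols, N):
            block = [row[c0:c0 + N] for row in img[r0:r0 + N]]
            e = block_energy(block)
            for r in range(r0, r0 + N):
                for c in range(c0, c0 + N):
                    out[r][c] = e
    return out


sharp = [[0] * 4 + [255] * 4 for _ in range(N)]                # step edge
blurred = [[255 * y / 7 for y in range(N)] for _ in range(N)]  # smooth ramp
flat = [[100] * N for _ in range(N)]                           # no detail

# An 8x16 image whose left block is sharp and whose right block is flat.
energies = energy_image([s + f for s, f in zip(sharp, flat)])
print(block_energy(sharp) > block_energy(blurred) > block_energy(flat))
```

The comparison illustrates why this quantity behaves as a focus measure: a sharp edge spreads energy over more AC coefficients than a smooth transition of the same amplitude, and a featureless block has essentially none.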
The topology of the combined image is the same as that of the segmented image. Every cluster or region in the segmented image is used to generate a mask [Fig. 2(b)]. For every region, its corresponding mask is applied over every energy image, and the mean energy of the masked region is calculated for every such energy image. The index of the energy image showing maximum energy for that region is obtained; then, the corresponding region of the combined image is initialized with the pixel values of the homologous region of the input image with that index. In parallel, in an auxiliary image (see Fig. 2), we keep for every region the absolute difference between this index and the index of the best-focused image, which indicates the degree of out-of-focus of that region, or the object focus level, ranging from 0 (black level: the region is best focused in the best-focused image) to its maximum value (white level: the region is best focused in the image with the worst global focus measure).
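The region-by-region selection just described can be sketched as below. This is a minimal illustration under simplifying assumptions: images, energy images, and the segmentation are small nested lists, the segmentation is given as an integer label map, and the function name `combine_by_regions` is hypothetical.

```python
def combine_by_regions(images, energies, labels, best_index):
    """Build the combined image and the per-region focus-level image.

    images     : list of input images (nested lists of pixel values)
    energies   : list of energy images, one per input image
    labels     : label map from the segmentation of the best-focused image
    best_index : index of the best-focused image in `images`
    """
    rows, cols = len(labels), len(labels[0])
    combined = [row[:] for row in images[best_index]]
    focus_level = [[0] * cols for _ in range(rows)]
    for region in sorted({lab for row in labels for lab in row}):
        pixels = [(r, c) for r in range(rows) for c in range(cols)
                  if labels[r][c] == region]
        # Mean energy of the masked region in every energy image.
        means = [sum(e[r][c] for r, c in pixels) / len(pixels)
                 for e in energies]
        k = means.index(max(means))
        for r, c in pixels:
            combined[r][c] = images[k][r][c]
            focus_level[r][c] = abs(k - best_index)
    return combined, focus_level


# Toy data: three constant 2x4 images and two regions (left and right).
images = [[[10] * 4 for _ in range(2)],
          [[20] * 4 for _ in range(2)],
          [[30] * 4 for _ in range(2)]]
energies = [[[1, 1, 1, 1], [1, 1, 1, 1]],
            [[5, 5, 2, 2], [5, 5, 2, 2]],
            [[3, 3, 6, 6], [3, 3, 6, 6]]]
labels = [[0, 0, 1, 1],
          [0, 0, 1, 1]]

combined, focus_level = combine_by_regions(images, energies, labels,
                                           best_index=1)
print(combined)
print(focus_level)
```

In the toy data, the left region stays with the best-focused image (focus level 0), while the right region is copied from the third image (focus level 1).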
As opposed to other methods, such as Refs. 1 and 4, where the local focus comparison among the set of input images is performed block by block, we propose to compare region by region, using the segmentation of the best-focused image to define such regions. This prevents highly visible block artifacts from appearing anywhere in the image. Instead, contour artifacts might appear at the boundaries of the identified regions, where they are much less visible. The visibility of these contour artifacts depends on the aforementioned degree of out-of-focus of each combined region. In the next subsection, we propose to use a total variational-based filter to eliminate the contour artifacts of the combined image.
The next step is to generate the final focused image by attenuating the artifacts or false contours that may appear in the combined image due to merging regions from different input images. We propose to attenuate artifacts by applying a total variational-based diffusion method.16 This method is only applied in the artifact-prone areas, according to the information in the image of object focus levels, hence keeping most of the image pixels intact. Observe that the method aims to mitigate these false contours, not real object contours.
Generation of a mask of the artifact-prone areas
In Fig. 3, we show several examples of artifacts generated at the boundaries of the regions of three images. Our proposal is to process pixels only at the edges defined by the image of object focus levels, i.e., only at the boundaries between regions with different degrees of focus, in order to obtain an image without artifacts on these boundaries while keeping the original pixels in most of the resulting image. For this purpose, we first obtain an edge image from the image of object focus levels (see Fig. 4). Then, as the extent of the artifacts between adjacent regions is expected to be proportional to the difference between their degrees of focus, we perform an adaptive morphological dilation over the thresholded edge image, using a structuring element with a size proportional to the intensity of every edge. The resulting mask (see Fig. 4) defines where the following enhancement steps are applied.
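The mask generation can be illustrated with the sketch below. The specifics are assumptions chosen for simplicity and are not taken from the paper: the edge strength is the largest absolute difference to a 4-connected neighbour in the focus-level image, the structuring element is a square, and its radius grows by `radius_per_level` pixels per unit of edge strength. All names are hypothetical.

```python
def edge_strength(level):
    """Largest absolute difference to a 4-connected neighbour, per pixel."""
    rows, cols = len(level), len(level[0])
    edges = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    diff = abs(level[r][c] - level[rr][cc])
                    edges[r][c] = max(edges[r][c], diff)
    return edges


def adaptive_dilation(edges, threshold=1, radius_per_level=1):
    """Dilate every edge pixel with a square whose radius follows its strength."""
    rows, cols = len(edges), len(edges[0])
    mask = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if edges[r][c] < threshold:
                continue  # thresholding: weak edges are ignored
            rad = radius_per_level * edges[r][c]
            for rr in range(max(0, r - rad), min(rows, r + rad + 1)):
                for cc in range(max(0, c - rad), min(cols, c + rad + 1)):
                    mask[rr][cc] = True
    return mask


# One-row focus-level images with a boundary between columns 4 and 5.
small_jump = [[0, 0, 0, 0, 0, 1, 1, 1, 1, 1]]
large_jump = [[0, 0, 0, 0, 0, 2, 2, 2, 2, 2]]

mask_small = adaptive_dilation(edge_strength(small_jump))
mask_large = adaptive_dilation(edge_strength(large_jump))
print(sum(mask_small[0]), sum(mask_large[0]))
```

The boundary between regions that differ more in focus level yields a wider mask, so the later filtering touches more pixels exactly where larger artifacts are expected.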
A main contribution of our method is that artifact removal is only performed in the areas that may include artifacts, hence preserving the original pixels in most of the image, which is critical for medical imaging applications. The works in Refs. 1 and 4, as they perform image fusion over DCT blocks, are prone to generating block artifacts, which are not eliminated afterward. In the multiscale methods, such as Refs. 5 and 3, the original pixels are not usually preserved in the fused image: the resulting image in Ref. 5 does not present artifacts due to the nature of the method, which modifies pixel intensities via averaging, resulting in a smoother image; the method proposed in Ref. 3 eliminates artifacts just in the “unknown zone,” which is a predefined area in the boundary generated between the two considered source images with two different focus levels.
Artifacts removal via total-variation filtering
Let us consider that the combined image, which contains contour artifacts in the contours defined by the mask, is a noisy image, and let the final focused image be the desired sharp and clean image. We can then model the combined image as the sum of the final focused image and an additive noise term, which we assume to be concentrated in the pixels indicated by the mask. We obtain the final focused image from the combined image using a total variation filter. These filters were first suggested in Ref. 16 and are based on the minimization of an energy functional subject to a constraint; the quantities involved, for this work, are the gradient of the true image (the final focused image), the gradient of the observed image (the combined image), and the variance of the noise. The desired clean image is then obtained through an iterative equation that includes a regularization parameter, which we set so as to preserve the smallest structures.
To obtain the gradient of the observed image for the first iteration, we apply a Laplacian filter to the combined image (Fig. 5). To estimate the true image, we consider that its gradient equals that of the combined image except for the edge areas defined by the mask. In these areas, for every pixel, we assume that it equals the gradient of the source image showing maximum local variance around that pixel. The gradients of the source images are also obtained by applying a Laplacian filter, and the local variance is computed in local windows. Once we get the filtered image for this first iteration, we take it as the new observed image and repeat the process until it converges to the final focused image. Figure 5 shows an example of the evolution of the variance of the gradient difference and of the obtained image for every iteration.
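To give a feel for total-variation filtering restricted to a mask, the sketch below performs plain gradient descent on a smoothed total-variation energy with a quadratic fidelity term and updates only the masked pixels. It is not the iteration used in this work, which additionally relies on the gradient estimates described above; the step size `dt`, the fidelity weight `lam`, and the smoothing constant `eps` are hypothetical values chosen for the toy example.

```python
import math


def tv_step(u, u0, mask, lam, dt, eps=1e-3):
    """One masked descent step on  TV(u) + (lam / 2) * (u - u0)^2."""
    rows, cols = len(u), len(u[0])
    # Normalised forward-difference gradient (zero across the border).
    px = [[0.0] * cols for _ in range(rows)]
    py = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            gx = u[r][c + 1] - u[r][c] if c + 1 < cols else 0.0
            gy = u[r + 1][c] - u[r][c] if r + 1 < rows else 0.0
            norm = math.sqrt(gx * gx + gy * gy + eps * eps)
            px[r][c], py[r][c] = gx / norm, gy / norm
    out = [row[:] for row in u]
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue  # pixels outside the mask keep their value
            # Backward-difference divergence of the normalised gradient.
            div = px[r][c] - (px[r][c - 1] if c > 0 else 0.0)
            div += py[r][c] - (py[r - 1][c] if r > 0 else 0.0)
            out[r][c] = u[r][c] + dt * (div - lam * (u[r][c] - u0[r][c]))
    return out


def masked_tv_filter(u0, mask, lam=1.0, dt=0.5, n_iter=20):
    """Iterate the masked step starting from the observed image."""
    u = [[float(v) for v in row] for row in u0]
    for _ in range(n_iter):
        u = tv_step(u, u0, mask, lam, dt)
    return u


# Toy row with a false peak at column 3; only that pixel is in the mask.
observed = [[10, 10, 10, 14, 10, 10, 10]]
mask = [[False, False, False, True, False, False, False]]
filtered = masked_tv_filter(observed, mask)
print(filtered[0])
```

In the toy example, the masked peak is pulled toward its neighbours until the smoothing force balances the fidelity term, while every pixel outside the mask keeps its original value.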
To assess the potential of our approach, we compare the proposed method with the works reported in Refs. 1-1, 1-2, 5, 3, and 4 (see the captions of Figs. 6–8 and Tables 1 and 2). The code to run these reported algorithms was kindly provided by the authors: the software is available together with the papers. The set of images used for the experimental evaluation, hereinafter the dataset, consists of the microscopic image pair from the MFIF reported in Ref. 11 (see Fig. 6) and a set of 50 Pap smear image sequences (in RGB), each containing 15 images with different DoF and focused cells in several of them (see examples in Figs. 7 and 8).
The objective evaluation of a fused image is a difficult task because there is no universally accepted metric to evaluate an image fusion process.2 A frequent solution is the use of different metrics to test the fusion results from different viewpoints.19 Quality metrics for MFIF can be classified depending on the availability of the target image:20 metrics known as full-reference assume that a complete reference image (distortion-free) is available; however, in many practical applications, the reference image is not available; so, “no-reference” or “blind” quality metrics are used. As our dataset includes images captured in practical situations, we do not account for reference images. An alternative to these MFIF-based quality metrics is to evaluate focus metrics on the resulting fused image, as the aim in this scenario is to obtain a perfectly focused image. We describe below the metrics we have used.
No-reference metrics: the Petrovic metrics,21,22 based on gradient information, include three indicators: one that represents, in a normalized way, the total information transferred from the source images to the fused image; and two that evaluate its complement, i.e., the loss of information, but considering only the locations where the gradient of the source images is, respectively, greater or lower than that of the fused image. We have computed the first indicator as an overall quantitative measure of the fusion quality. For a set of images, it is obtained according to Refs. 23–26.
Experiments and Discussion
The first experiment is conducted over two microscopic images kindly provided by Ref. 11, each showing different focused parts of the same object [see Figs. 6(I) and 6(II)]. We have applied to these source images the aforementioned five fusion algorithms and our proposed method. Figure 6 shows the resulting images along with a detail of each, in order to visually or qualitatively assess the performance of each method. Table 1 includes data with the quantitative evaluation of this first experiment.
Table 1 Quantitative results for the first experiment: performance quality metrics for the final fused microscope image obtained by each method.
| Ref. 1-1, Fig. 6(a): Haghighat, 2011 (1) | Ref. 1-2, Fig. 6(b): Haghighat, 2011 (2) | Ref. 5, Fig. 6(c): Kumar, 2013 | Ref. 3, Fig. 6(d): Zhou, 2014 | Ref. 4, Fig. 6(e): Phamila, 2014 | Proposed (I_ff), Fig. 6(f) |
From a qualitative point of view, we observe that methods Refs. 1-1 and 4 [Figs. 6(a) and 6(e)] present highly visible block artifacts; these methods compare the DCT energy in homologous blocks, which generates comparison errors when the images contain nonsquare elements in different depths of field or when cervical cells are round. The enhancement proposed by Ref. 1-2 [Fig. 6(b)], based on a consistency verification index to decide which block is selected, removes block artifacts in this example but at the expense of a poor visual result. The multiscale approach proposed in Ref. 5 [Fig. 6(c)], which does not keep original pixel values, presents a noticeable contrast reduction. The method reported in Ref. 3 and the proposed method [Figs. 6(d) and 6(f)] yield acceptable visual results.
From a quantitative point of view (see Table 1), it is interesting to contrast each quality measure with the perceived visual result: the SD measure yields very good values for images with highly noticeable block artifacts [Figs. 6(a) and 6(e)], because these artifacts increase image variance, whereas the Petrovic measure seems to be more in line with the qualitative findings.
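The effect of block artifacts on a standard-deviation score can be checked with a toy example. The sketch assumes that SD is simply the population standard deviation of the pixel values, which may differ from the exact definition used for the tables, and the images are synthetic.

```python
from statistics import pstdev


def image_sd(img):
    """Assumed SD metric: population standard deviation of all pixels."""
    return pstdev([v for row in img for v in row])


# A uniform patch, and the same patch with 2x2 block offsets of +/-10.
clean = [[100] * 4 for _ in range(4)]
blocky = [[110, 110, 90, 90],
          [110, 110, 90, 90],
          [90, 90, 110, 110],
          [90, 90, 110, 110]]

sd_clean = image_sd(clean)
sd_blocky = image_sd(blocky)
print(sd_clean, sd_blocky)
```

The blocky patch scores higher even though it carries no additional scene detail, which is why a high SD alone does not indicate a better fusion result.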
Independently of these observations, Table 1 indicates that the proposed method behaves better in the light of both quality measures.
The second experiment targets the 50 sequences of Pap smear images obtained from the autofocus operation of a microscope. While reported works have focused on fusing two blurred images, many of them have applied their method in an iterative way to more than two input images, which is our practical context. Figures 7 and 8 show qualitative results for two of these sequences, and Table 2 and Fig. 9 compile the quantitative evaluation for the 50 sequences.
Table 2 Quantitative results for the second experiment: performance quality metrics (mean and deviation) for the final fused microscope images obtained by each method applied to the 50 image sequences.
| Ref. 1-1: Haghighat, 2011 (1) | Ref. 1-2: Haghighat, 2011 (2) | Ref. 5: Kumar, 2013 | Ref. 3: Zhou, 2014 | Ref. 4: Phamila, 2014 | Proposed (I_ff) |
From a qualitative point of view, we clearly observe in Figs. 7 and 8 that the methods based on DCT blocks [Figs. 7(a), 7(b), 7(e), 8(a), and 8(e)] cannot avoid generating block artifacts. We can also observe that the multiscale approaches [Figs. 7(c), 7(d), 8(c), and 8(d)], which process the original pixels so that their values are never directly transferred to the final image, suffer from a severe loss of definition when the technique is applied to a large number of source images (15 images, instead of 2, for this experiment): several of the objects of interest are averaged, resulting in a loss of information and even the loss of complete cells. This is the situation for the method proposed in Ref. 3 [Figs. 7(d) and 8(d)], which, while losing very little information and few objects of interest, sometimes loses full objects because it only compares two areas or regions in the image (focused and unfocused).
From a quantitative point of view, Table 2 indicates that the proposed method also behaves better for this part of the dataset, which includes 50 image sequences. Apart from the mean values of the quality indicators, Table 2 includes their standard deviation, which indicates that the results obtained by the proposed method are also the most stable. Finally, Fig. 9 further illustrates the stability of the tested methods that obtained better global results. We observe that the proposed method systematically outperforms the other approaches in the light of these quality indicators.
This paper presents an object-oriented approach to the problem of obtaining a single focused image from a set of microscopic images captured from a single slide that includes objects focused in different images of the set. The proposed MFIF method shows several specific advantages with respect to other state-of-the-art methods: first, it is driven by a region-based segmentation, which prevents the highly visible artifacts that may appear in block-based methods; second, it does not apply any kind of image transform, hence respecting the pixel values of all focused regions, which is crucial for medical imaging applications; and finally, it includes an artifact-removal technique, which only operates where required and adapts to the expected extent of the fusion-generated artifacts. Results, obtained over a representative dataset and compared with other published approaches, support the validity of our proposal.
Pap Smear Images Sequences Dataset for Multifocus Image Fusion
Extra Material for Download
The extra materials are available for download (Ref. 27) and contain the following: the entire dataset of 50 cervical cell image sequences; the proposed region-based MFIF method, for comparison with the other MFIF methods (as a MATLAB interface); and the complete fusion results.
The authors declare that there is no conflict of interests regarding the publication of this paper.
We want to thank Dr. Maria de la Paz Hernandez for the provided Pap-smear samples. Also special thanks to Dr. Fomuy Woo, Dr. Maura Huerta, Dr. Victor Campos, and specialist Cytotechnologist Laura Meraz for their assistance in the ground truth generation (Hospital ISSSTE, México) and thanks to Ing. Edgar Valdez for their work in the Pap smear image acquisition.
Santiago Tello-Mijares received his BS degree in electronic engineering in 2006 and his PhD degree in electrical engineering science in 2013 from Instituto Tecnológico de la Laguna, Torreón, México, and his PhD degree in telecommunications and informatics engineering in 2017 from Universidad Autónoma de Madrid, Madrid, Spain. He is currently a titular professor in the Postgraduate Department at Instituto Tecnológico Superior de Lerdo, Lerdo, Mexico. His research interests are biomedical imaging, artificial intelligence, and robotics.
Jesús Bescós received his BS degree in telecommunications engineering in 1993 and his PhD degree in the same field in 2001 from Universidad Politécnica de Madrid, Spain. He has been a professor at the Universidad Autónoma de Madrid since 2003, where he codirects the Video Processing and Understanding Lab. His research interests include the analysis of video sequences, content-based video indexing, and 2-D and 3-D machine vision.