## 1.

## Introduction

Image fusion is the process of combining several source images captured by multiple sensors, or by a single sensor at different times. The fused image contains more comprehensive and accurate information than any single source image. Image fusion is widely used in military applications, medical imaging, remote sensing, machine vision, and security surveillance.^{1}^{,}^{2}

In recent decades, many fusion algorithms have been proposed. Most of them fall into two categories: multiscale transform–based and sparse representation–based approaches. The basic idea of multiscale transform–based fusion is that the salient information of an image is closely related to its multiscale decomposition coefficients. These methods usually consist of three steps: decomposing the source images into multiscale coefficients, fusing these coefficients with a certain rule, and reconstructing the fused image with the inverse transform. Multiscale transform–based fusion methods include the gradient pyramid,^{3} Laplacian pyramid,^{4} discrete wavelet transform (DWT),^{5} stationary wavelet transform (SWT),^{6} and nonsubsampled contourlet transform (NSCT).^{7} These methods represent images at multiple scales and are fast to implement.
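The three steps of a multiscale transform–based method (decompose, fuse, reconstruct) can be illustrated with a minimal sketch. It assumes a one-level Haar transform as a stand-in for the transforms cited above, averaging for the low-pass band, and the max-abs rule for the detail bands; the function names are hypothetical, and images are assumed to have even height and width.

```python
import numpy as np

def haar_decompose(img):
    """One-level 2-D Haar analysis: returns (approx, (d1, d2, d3))."""
    a, b = img[0::2, 0::2], img[0::2, 1::2]
    c, d = img[1::2, 0::2], img[1::2, 1::2]
    approx = (a + b + c + d) / 2.0
    d1 = (a - b + c - d) / 2.0
    d2 = (a + b - c - d) / 2.0
    d3 = (a - b - c + d) / 2.0
    return approx, (d1, d2, d3)

def haar_reconstruct(approx, details):
    """Inverse of haar_decompose (perfect reconstruction)."""
    d1, d2, d3 = details
    h, w = approx.shape
    img = np.empty((2 * h, 2 * w))
    img[0::2, 0::2] = (approx + d1 + d2 + d3) / 2.0
    img[0::2, 1::2] = (approx - d1 + d2 - d3) / 2.0
    img[1::2, 0::2] = (approx + d1 - d2 - d3) / 2.0
    img[1::2, 1::2] = (approx - d1 - d2 + d3) / 2.0
    return img

def fuse_haar(img_a, img_b):
    """Step 1: decompose; step 2: fuse coefficients; step 3: invert."""
    low_a, det_a = haar_decompose(np.asarray(img_a, dtype=float))
    low_b, det_b = haar_decompose(np.asarray(img_b, dtype=float))
    low = (low_a + low_b) / 2.0  # assumed rule: average the low-pass bands
    # assumed rule: keep the detail coefficient with the larger magnitude
    high = tuple(np.where(np.abs(x) >= np.abs(y), x, y)
                 for x, y in zip(det_a, det_b))
    return haar_reconstruct(low, high)
```

Fusing an image with itself returns the image unchanged, which is a quick sanity check that the analysis and synthesis steps are consistent.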

Image fusion with sparse representation is based on the idea that image signals can be represented as a linear combination of a “few” atoms from a learned dictionary, and the sparse coefficients are treated as the salient features of the source images. The main steps are (1) dictionary learning, (2) sparse representation of the source images, (3) fusion of the sparse representations by a fusion rule, and (4) reconstruction of the fused image. Among them, steps (1) and (3) are the most critical for successful fusion. The fusion results obtained with an overcomplete discrete cosine transform (DCT) dictionary, a hybrid dictionary, and a trained dictionary are compared in Refs. 8 and 9; the results show that the trained dictionary provides the best performance. Fusion rules for sparse representation–based methods are studied in Refs. 10 and 11. The former pursues the sparse vector of the fused image by optimizing the Euclidean distances between the fused image and the source images. The latter represents the source images with common and innovation sparse coefficients and combines them according to the mean absolute values of the innovation coefficients. In Ref. 12, steps (1) and (3) are both studied: the dictionary is learned by joint sparse coding and singular value decomposition (SVD), and the new fusion rule combines the weighted average with the choose-max rule.

Both kinds of fusion methods have their own advantages and disadvantages. The multiscale transform–based methods represent images at multiple scales and are fast to implement. However, their low-pass subbands are far from sparse: very few low-pass coefficients are approximately zero, so these methods cannot express the low-frequency information of images sparsely, whereas sparse representation can effectively extract the underlying information of the source images.^{9} Because the low-frequency coefficients contain the main energy of the image, integrating them directly degrades the fused result.

In contrast, sparse representation–based methods obtain more meaningful representations of the source images from a learned dictionary, which is more finely fitted to the data,^{13} and thus perform better. However, because the number of atoms in a dictionary is limited, it is difficult to represent image details such as edges and textures accurately. Moreover, computational complexity constrains the atom size in the learned dictionary (a typical size is of the order of 64).^{14} This limitation is why patch-based processing is so often used with such dictionaries. To avoid blocking artifacts, the step size is usually 1.^{8}^{–}^{13} However, the number of image patches then grows in proportion to the number of pixels in the image, so large images require a great deal of computation.

In this paper, we attempt to combine the advantages of the above two kinds of methods and propose an image fusion method based on NSCT and sparse representation, namely NSCTSR. We decompose the source images by NSCT so that the nearly sparse high-pass subbands represent image details at multiple scales and in multiple directions. To address the nonsparseness of the low-frequency subband in the NSCT domain, we train a dictionary for the low-pass NSCT coefficients to obtain sparser and more salient features of the source images. The low-pass and high-pass subbands are then fused according to different fusion rules. Moreover, the proposed method can reduce the computational cost by using nonoverlapping blocks.

The rest of the paper is organized as follows: Sec. 2 reviews the theory of the NSCT in brief. Section 3 presents dictionary learning in NSCT domain. In Sec. 4, we propose the fusion scheme, whereas Sec. 5 contains experimental results obtained by using the proposed method and a comparison with the state-of-the-art methods. Section 6 concludes this paper.

## 2.

## Nonsubsampled Contourlet Transform

NSCT builds on the contourlet transform but discards the sampling steps in the image decomposition and reconstruction stages.^{15} By applying a nonsubsampled filter bank iteratively, NSCT offers shift invariance, multiresolution, and multidirectionality for image representation. When NSCT is used for image fusion, more information is available for fusion and the impact of misregistration on the fused results is reduced effectively. Therefore, NSCT is well suited to image fusion.^{16}

The structure of NSCT consists of two parts: the nonsubsampled pyramid (NSP) and the nonsubsampled directional filter bank (NSDFB).^{17} First, the image is decomposed by the NSP to obtain subband coefficients at different scales. These coefficients are then decomposed by the NSDFB, which yields subband coefficients at different scales and in different directions. Figure 1 shows the NSCT.

In NSCT, the multiscale property is obtained with two-channel nonsubsampled two-dimensional filter banks, which achieve a subband decomposition similar to that of the Laplacian pyramid. Figure 2 shows the NSP decomposition with $J=3$. Such an expansion is conceptually similar to the one-dimensional nonsubsampled wavelet transform computed with the à trous algorithm.^{17} The directional filter bank in NSCT is constructed by combining critically sampled two-channel fan filter banks and resampling operations, as ${H}_{0}(Z)$ and ${H}_{1}(Z)$ shown in Fig. 2. A shift-invariant directional expansion is obtained with an NSDFB, which is constructed by eliminating the downsamplers and upsamplers in the DFB.^{18} Figure 3 illustrates the four-channel decomposition. When an image is decomposed by NSCT, there is one low-pass subband and $\sum _{j=0}^{J-1}{2}^{{l}_{j}}$ high-pass subbands, where ${l}_{j}$ denotes the number of levels in the NSDFB at the $j$’th scale.
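The nonsubsampled, à trous-style pyramid idea can be sketched in a few lines. This is not the NSCT filter bank itself: the sketch assumes a separable B3-spline low-pass kernel, periodic boundaries, and detail bands formed as differences of successive smoothings, all chosen only for illustration, and the function names are hypothetical. It does show the two properties discussed above: every subband keeps the size of the input, and the decomposition is shift invariant.

```python
import numpy as np

# Assumed low-pass kernel (B3 spline); the NSCT itself uses other filters.
B3 = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0

def atrous_smooth(img, level):
    """Separable low-pass filter with 2**level - 1 zeros between taps."""
    hole = 2 ** level
    out = img
    for axis in (0, 1):
        # Circular shifts implement the "holey" filter without resampling.
        out = sum(w * np.roll(out, (k - 2) * hole, axis=axis)
                  for k, w in enumerate(B3))
    return out

def nsp_decompose(img, levels=3):
    """Return (low-pass band, list of band-pass bands), all of input size."""
    approx = np.asarray(img, dtype=float)
    details = []
    for j in range(levels):
        smoother = atrous_smooth(approx, j)
        details.append(approx - smoother)  # band-pass content at scale j
        approx = smoother
    return approx, details

def nsp_reconstruct(approx, details):
    """Perfect reconstruction: the bands simply add back up."""
    return approx + sum(details)
```

Because no downsampling is involved, shifting the input circularly shifts every subband by the same amount, which is the sense in which the expansion is shift invariant.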

## 3.

## Sparse Representation in NSCT Domain

## 3.1.

### Sparse Representation for Image Fusion

Sparse representation is based on the assumption that a signal can be expressed as a sparse combination of atoms from a dictionary. Formally, for a signal $y\in {R}^{n\times 1}$ and a dictionary $\mathbf{D}$, its sparse representation $x$ is obtained by solving the following optimization problem:

## (1)

$$\underset{x}{\mathrm{min}}{\Vert x\Vert}_{0}^{0}\quad \mathrm{s.t.}\quad {\Vert y-\mathbf{D}x\Vert}_{2}\le \epsilon .$$

Theoretically, the sparse representation expresses an image globally, but it cannot be applied to image fusion directly. On the one hand, computational complexity limits the atom size that can be learned;^{19} on the other hand, image fusion depends on the local information of the source images. Thus, patch-based processing is adopted for the sparse representation.^{20} A sliding window divides each source image, from top-left to bottom-right, into patches, which are then transformed into vectors via lexicographic ordering.
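Problem (1) is usually approximated greedily. The following is a minimal sketch of orthogonal matching pursuit (OMP), assuming a dictionary with unit-norm columns; the function name, the `max_atoms` cap, and the stopping details are illustrative assumptions rather than the exact solver used in the experiments.

```python
import numpy as np

def omp(D, y, eps=1e-3, max_atoms=None):
    """Greedy approximation to  min ||x||_0  s.t.  ||y - D x||_2 <= eps.

    D: (n, K) dictionary with unit-norm columns (assumed); y: (n,) signal.
    """
    n, K = D.shape
    max_atoms = n if max_atoms is None else max_atoms
    y = np.asarray(y, dtype=float)
    x = np.zeros(K)
    support = []
    coef = np.zeros(0)
    residual = y.copy()
    while np.linalg.norm(residual) > eps and len(support) < max_atoms:
        # Pick the atom most correlated with the current residual.
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support:  # no new atom helps; stop
            break
        support.append(k)
        # Least-squares fit of y on all atoms selected so far.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    if support:
        x[support] = coef
    return x
```

With an orthonormal dictionary the sketch recovers the signal's coefficients exactly; for an overcomplete dictionary it returns a sparse approximation whose residual is at most `eps` whenever enough atoms are allowed.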

## 3.2.

### Dictionary Learning with K-SVD in NSCT Domain

One of the fundamental questions in the sparse representation model is the choice of dictionary. The K-SVD algorithm has been widely used to learn such a dictionary by approximately solving the following problem:^{21}

## (2)

$$\underset{\mathbf{D},\mathbf{X}}{\mathrm{arg}\,\mathrm{min}}{\Vert \mathbf{Y}-\mathbf{DX}\Vert}_{F}^{2}\quad \mathrm{s.t.}\quad {\Vert {x}_{i}\Vert}_{0}^{0}\le T\quad \forall i.$$

Based on the theory above, we learn a low-pass overcomplete dictionary ${\mathbf{D}}_{l}$ in order to represent images sparsely in the NSCT domain. We begin our derivation with the following modification of Eq. (2):

## (3)

$$\underset{\mathbf{D},\mathbf{X}}{\mathrm{arg}\,\mathrm{min}}{\Vert {\mathbf{C}}_{S}-\mathbf{DX}\Vert}_{F}^{2}\quad \mathrm{s.t.}\quad {\Vert {x}_{i}\Vert}_{0}^{0}\le T\quad \forall i.$$

Here, ${\mathbf{C}}_{S}$ denotes the NSCT decomposition coefficients of the training image $\mathbf{I}$, that is, ${\mathbf{C}}_{S}={\mathbf{W}}_{S}\mathbf{I}$, where ${\mathbf{W}}_{S}$ is the NSCT analysis operator.

Substituting ${\mathbf{W}}_{S}\mathbf{I}={\mathbf{C}}_{S}$ into Eq. (3), we can equivalently write

## (4)

$$\underset{\mathbf{D},\mathbf{X}}{\mathrm{arg}\,\mathrm{min}}{\Vert {\mathbf{W}}_{S}\mathbf{I}-\mathbf{DX}\Vert}_{F}^{2}\quad \mathrm{s.t.}\quad {\Vert {x}_{i}\Vert}_{0}^{0}\le T\quad \forall i.$$

The above formulation suggests that we can learn our dictionary in the analysis domain. A natural way to view the NSCT analysis domain is not as a single vector of coefficients, but rather as a collection of coefficient images, or bands. Because the different subband images of NSCT contain information at different scales and orientations, we train subdictionaries separately for each band:

## (5)

$$\forall b,\quad \underset{{\mathbf{D}}_{b},{\mathbf{X}}_{b}}{\mathrm{arg}\,\mathrm{min}}{\Vert {({\mathbf{W}}_{S}\mathbf{I})}_{b}-{\mathbf{D}}_{b}{\mathbf{X}}_{b}\Vert}_{F}^{2}\quad \mathrm{s.t.}\quad {\Vert {x}_{i,b}\Vert}_{0}^{0}\le T\quad \forall i.$$^{22}

Because the high-pass subbands of NSCT are already nearly sparse, in this paper we learn a dictionary for the low-frequency subband only. The complete learning algorithm is as follows:

1. Decompose each of the training-set images using NSCT and extract one low-pass and $B-1$ high-pass subbands;

2. Initialize the low-pass dictionary ${\mathbf{D}}_{l}\in {R}^{n\times K}$;

3. Extract maximally overlapping patches of size $\sqrt{n}\times \sqrt{n}$ from the low-pass band ${\mathbf{L}}_{k}\{k=1,2,\dots ,K\}$ of every training image, and order each patch lexicographically as a vector. Then, collect all the vectors from image ${\mathbf{L}}_{k}$ into one matrix ${\mathbf{V}}_{k}$ and form $\underline{\mathbf{V}}=[{\mathbf{V}}_{1}{\mathbf{V}}_{2}\dots {\mathbf{V}}_{K}]$;

4. Train the overcomplete dictionary ${\mathbf{D}}_{l}$ by solving the following approximation problem:

## (6)

$$\underset{{\mathbf{D}}_{l},{\mathbf{X}}_{l}}{\mathrm{arg}\,\mathrm{min}}{\Vert \underline{\mathbf{V}}-{\mathbf{D}}_{l}{\mathbf{X}}_{l}\Vert}_{F}^{2}\quad \mathrm{s.t.}\quad {\Vert {x}_{i,l}\Vert}_{0}^{0}\le T\quad \forall i.$$

The above procedure is shown in Fig. 4.
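K-SVD approximates a problem of the form of Eq. (6) by alternating a sparse coding stage (for example, a pursuit algorithm applied to each column) with a dictionary update stage. The sketch below shows only the dictionary update for a single atom, under the usual assumption that the sparsity pattern is kept fixed during the update; the function name and argument layout are hypothetical.

```python
import numpy as np

def ksvd_update_atom(Y, D, X, k):
    """Update atom k of D and row k of X in place with a rank-1 SVD.

    Y: (n, N) training vectors; D: (n, K) dictionary; X: (K, N) sparse codes.
    Only the signals whose code already uses atom k are touched, so the
    sparsity pattern of X is preserved.
    """
    users = np.nonzero(X[k, :])[0]
    if users.size == 0:
        return  # unused atom: left unchanged in this sketch
    # Error of those signals when the contribution of atom k is removed.
    E = Y[:, users] - D @ X[:, users] + np.outer(D[:, k], X[k, users])
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    D[:, k] = U[:, 0]               # new unit-norm atom
    X[k, users] = s[0] * Vt[0, :]   # matching coefficients
```

Since the leading singular pair is the best rank-1 approximation of the restricted error matrix, each call can only decrease (or keep) the Frobenius error ${\Vert \mathbf{Y}-\mathbf{DX}\Vert}_{F}$; sweeping over all atoms and then re-running the sparse coding stage gives one full K-SVD iteration.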

## 4.

## Proposed Image Fusion Scheme

The low-frequency subband reflects the low-frequency information of an image and contains the main image energy. If the low-pass coefficients are integrated directly, the important information is not easy to extract because of the low sparsity of the low-pass subband, whereas the high-frequency subbands are approximately sparse. Consequently, we design different fusion rules for these subbands.

## 4.1.

### Low-Pass Subband Coefficients Fusion

The sparse vector of low-pass subband can be obtained by solving the following problem with ${\mathbf{D}}_{l}$, which was trained in Sec. 3.2:

## (7)

$$\underset{x}{\mathrm{min}}{\Vert {x}_{i,l}\Vert}_{0}^{0}\quad \mathrm{s.t.}\quad {\Vert \underline{\mathbf{V}}-{\mathbf{D}}_{l}{\mathbf{X}}_{l}\Vert}_{F}^{2}\le \epsilon .$$

The activity level of the $i$’th block in the low-pass subband is then ${\Vert {x}_{i,l}\Vert}_{1}$, which represents the salient features of the image. The purpose of image fusion is to transfer all the important information from the source images into the fused image, so we use the following fusion rule:

1. Using the sliding window technique, divide the low-pass subband ${\mathbf{L}}_{k}$ of each source image into $\sqrt{n}\times \sqrt{n}$ patches with $\mathrm{step}\in [1,\sqrt{n}]$. Then transform all the patches into vectors via lexicographic ordering to obtain ${\{{\mathbf{V}}_{i}^{k}\}}_{i=1}^{[(M-\sqrt{n})/\mathrm{step}+1][(N-\sqrt{n})/\mathrm{step}+1]}$.

2. Sparsely represent the vectors ${\mathbf{V}}_{i}^{k}$ of the different source images at each position $i$ using orthogonal matching pursuit (OMP) to obtain $\{{x}_{i,l}^{1},{x}_{i,l}^{2},\dots ,{x}_{i,l}^{K}\}$.

3. Combine the sparse coefficient vectors using the max-activity level rule.

4. Apply steps 2 and 3 to all the subband blocks to obtain the ensemble of fused coefficients ${\mathbf{X}}_{l}^{F}={\{{x}_{i,l}^{f}\}}_{i=1}^{[(M-\sqrt{n})/\mathrm{step}+1][(N-\sqrt{n})/\mathrm{step}+1]}$. The vectors of the low-pass subband of the fused image are then calculated as ${\mathbf{V}}_{l}^{F}={\mathbf{D}}_{l}\times {\mathbf{X}}_{l}^{F}$, where ${\mathbf{V}}_{l}^{F}\in {R}^{n\times \{[(M-\sqrt{n})/\mathrm{step}+1][(N-\sqrt{n})/\mathrm{step}+1]\}}$.

5. Reconstruct the low-pass subband of the fused image ${\mathbf{L}}^{F}$ from ${\mathbf{V}}_{l}^{F}$. Each vector ${v}_{i,l}^{F}$ in ${\mathbf{V}}_{l}^{F}$ is reshaped into a block of size $\sqrt{n}\times \sqrt{n}$, and the block is added to ${\mathbf{L}}^{F}$ at its corresponding position. Each pixel value is thus the sum of several block values; dividing it by the number of blocks added at that position gives the final reconstructed result.
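The patch handling and selection in the steps above can be sketched as follows. The sketch reads the max-activity level rule of step 3 as ${x}_{i,l}^{f}={x}_{i,l}^{{k}^{*}}$ with ${k}^{*}=\mathrm{arg}\,{\mathrm{max}}_{k}{\Vert {x}_{i,l}^{k}\Vert}_{1}$, which is our interpretation of the activity level defined above rather than a formula quoted from the text. Sparse coding itself (step 2) is left out: `fuse_codes` takes already computed codes, and all function names are hypothetical.

```python
import numpy as np

def extract_patches(img, p, step):
    """Step 1: slide a p x p window and vectorise each patch (row-major)."""
    M, N = img.shape
    corners = [(r, c) for r in range(0, M - p + 1, step)
                      for c in range(0, N - p + 1, step)]
    V = np.stack([img[r:r + p, c:c + p].reshape(-1) for r, c in corners],
                 axis=1)
    return V, corners  # V has shape (p*p, number of patches)

def fuse_codes(codes):
    """Step 3: per patch, keep the code with the largest L1 norm.

    codes: array of shape (K sources, atoms, patches).
    """
    activity = np.abs(codes).sum(axis=1)   # (K, patches)
    winner = activity.argmax(axis=0)       # (patches,)
    patches = np.arange(codes.shape[2])
    return codes[winner, :, patches].T     # (atoms, patches)

def reconstruct(V, corners, shape, p):
    """Step 5: add blocks back and divide by the per-pixel block count."""
    acc = np.zeros(shape)
    cnt = np.zeros(shape)
    for v, (r, c) in zip(V.T, corners):
        acc[r:r + p, c:c + p] += v.reshape(p, p)
        cnt[r:r + p, c:c + p] += 1
    return acc / np.maximum(cnt, 1)  # guard pixels no block covers
```

In a full pipeline, the fused codes would be multiplied by the dictionary (step 4) before `reconstruct` is called. With $\mathrm{step}=\sqrt{n}$ the blocks tile the image without overlap, so every count is 1 and the averaging has no effect.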

## 4.2.

### High-Pass Subband Coefficients Fusion

NSCT not only provides multiscale analysis of images but also captures fine features, such as edges, linear features, and regional boundaries, in the high-pass subbands of the source images. The high-frequency coefficients have several characteristics. First, they are nearly sparse: the detail components of the source image usually have large values in all directions at the same scale, while the coefficients of nondetail regions are practically zero. Second, the larger the absolute value of a subband coefficient, the more edge and texture information it carries, so the coefficients are useful for emphasizing and detecting salient features. In addition, strong edges have large coefficients in all directions at the same scale. Considering these factors, the high-pass subband coefficients are integrated by the following steps:

The information of source images in the directional subbands with ${2}^{-l}$ scale is defined by

Fuse the high-pass subband coefficients to generate ${\mathbf{H}}_{l,h}^{F}(n,m)$ according to the information of their directional subbands. The fused coefficient at scale ${2}^{-l}$ and pixel position $(n,m)$ is obtained as

## (10)

$${\mathbf{H}}_{l,h}^{F}(n,m)={\mathbf{H}}_{l,h}^{{k}^{*}}(n,m),\quad {k}^{*}=\underset{k=1,\dots ,K}{\mathrm{arg}\,\mathrm{max}}|{\mathbf{H}}_{l}^{k}(n,m)|.$$

## 4.3.

### Fusion Scheme

The proposed image fusion method is illustrated in Fig. 5, and the whole fusion scheme is as follows:

1. Learn the low-frequency dictionary ${\mathbf{D}}_{l}$ in the NSCT domain in accordance with Sec. 3.2.

2. Decompose each source image into one low-pass subband and a series of high-pass subbands.

3. Fuse the low-pass subbands by the process described in Sec. 4.1 with the dictionary trained in step 1 to obtain the low-pass subband coefficients of the fused image ${\mathbf{L}}^{F}$.

4. Select the fused NSCT coefficients for each high-pass subband from the source images according to Sec. 4.2, that is, ${\mathbf{H}}_{l,h}^{F},(l\in [1,J],h\in [1,{g}_{l}])$.

5. Reconstruct the fused image ${\mathbf{I}}^{F}$ from ${\mathbf{L}}^{F}$ and ${\mathbf{H}}_{l,h}^{F},(l\in [1,J],h\in [1,{g}_{l}])$ by taking the inverse NSCT.
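Step 4 of the scheme can be illustrated with a small selection routine for one scale. The activity measure used here, the sum of absolute coefficients over all directional subbands of that scale, is an assumption motivated by the observation in Sec. 4.2 that strong edges have large coefficients in all directions at the same scale; it is not a formula taken from the text. The function name and array layout are hypothetical.

```python
import numpy as np

def fuse_highpass_scale(subbands):
    """Select, per pixel, all directional coefficients of one source.

    subbands: array of shape (K sources, G directions, M, N) holding the
    high-pass coefficients of one scale. Returns an array (G, M, N).
    """
    K = subbands.shape[0]
    # ASSUMED activity measure: sum of magnitudes over the G directions.
    activity = np.abs(subbands).sum(axis=1)   # (K, M, N)
    winner = activity.argmax(axis=0)          # (M, N): index k* per pixel
    fused = np.zeros(subbands.shape[1:])
    for k in range(K):
        # The (M, N) mask broadcasts over the direction axis.
        fused = np.where(winner == k, subbands[k], fused)
    return fused
```

Choosing one source per pixel for all directions of a scale keeps the directional coefficients of an edge together instead of mixing them from different sources.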

## 5.

## Experiments

In this section, the proposed fusion algorithm is compared with four multiscale transform–based methods, namely DWT, SWT, NSCT,^{7} and LPSSIM (the method proposed in Ref. 4, which fuses the Laplacian pyramid coefficients of the source images by using the structural similarity metric (SSIM); we abbreviate it as LPSSIM for simplicity), and with four sparse representation–based methods, namely SR^{8} (traditional sparse representation), simultaneous orthogonal matching pursuit (SOMP),^{9} joint sparse representation (JSR),^{11} and the method of optimal directions for joint sparse representation (MODJSR).^{12} First, the parameters of the different methods and the evaluation metrics are presented. Second, the performance of the NSCTSR-based method is compared with that of the eight fusion algorithms. Then, with a view to reducing the computational cost of sparse representation–based methods, the sliding step of the sliding window is discussed. Finally, an experiment on a larger image set is presented to demonstrate the generality of the proposed method.

## 5.1.

### Experimental Setup

In this experiment, for the DWT- and SWT-based methods, the most popular setting is selected: the max-abs fusion rule and the “db4” wavelet basis with three decomposition levels. For NSCT, we use “9-7” and “c-d” as the pyramid filter and the directional filter,^{7} and the decomposition level is set to $\{{2}^{2},{2}^{2},{2}^{3},{2}^{4}\}$; the proposed method uses the same NSCT parameters. For the LPSSIM-based method, the parameter $\alpha =1$ and the Laplacian pyramid (LP) has three decomposition levels. For the four sparse representation–based methods, the training set for the learned dictionary consists of 100,000 patches randomly selected from 50 images from the Image Fusion Server;^{23} the patch size and dictionary size are set to $8\times 8$ and $64\times 256$, which are widely used in image fusion methods.^{8}^{–}^{12} We set the error tolerance $\epsilon =0.001$ for sparse coding and the sparsity $T=10$ for dictionary learning.
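As a worked check of these settings against the subband count given in Sec. 2, assume that the set $\{{2}^{2},{2}^{2},{2}^{3},{2}^{4}\}$ lists the number of directional subbands ${2}^{{l}_{j}}$ at each of four scales, i.e., ${l}_{j}=2,2,3,4$ (this reading is our assumption). The number of high-pass subbands would then be

$$\sum _{j=0}^{3}{2}^{{l}_{j}}={2}^{2}+{2}^{2}+{2}^{3}+{2}^{4}=32,$$

in addition to the single low-pass subband. Likewise, a $64\times 256$ dictionary holds 256 atoms of length $64=8\times 8$, which gives a redundancy factor of $256/64=4$.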

We use five evaluation criteria: the local importance quality index ${Q}_{0}$,^{24} the weighted fusion quality index ${Q}_{W}$,^{25} the edge-dependent fusion quality index ${Q}_{E}$,^{25} the local similarity quality index ${Q}_{G}$,^{4} and ${Q}_{AB/F}$,^{26} which evaluates how well the fusion algorithm transfers input gradient information into the fused result. For all of these criteria, values closer to 1 indicate better fusion. All the experiments were run on a PC with a 2.79-GHz Pentium dual-core CPU and 2 GB of RAM, under MATLAB R2012b.

## 5.2.

### Fusion Results

Image fusion experiments were carried out on different images. Figure 6 depicts a pair of medical images; the left image is a computed tomography (CT) image, and the right one is a magnetic resonance imaging (MRI) image. The CT image shows bone structures, while the MRI image shows soft tissue details. Figure 7 shows the images fused by the tested methods, and local magnifications of these results are shown in Fig. 8 for easier observation. Figures 7(a) and 8(a) reveal that the DWT-based method produces more artifacts. From the right image in each set of Fig. 8, we can see that, thanks to the multiscale transform, the SWT-, NSCT-, and LPSSIM-based methods preserve the details more completely than the SR-, SOMP-, and JSR-based methods. However, the left images show that the SR-, SOMP-, and JSR-based methods yield much clearer skeletal features than the SWT, NSCT, and LPSSIM fused images, because sparse representation can extract the salient features of the source images. Moreover, the NSCTSR fused image exhibits better visual quality, with much clearer soft tissues and bone structures, than those of the compared methods. The second best is the MODJSR fused image, which loses only some soft tissue details, as can be seen in the left image in Fig. 8(h); these details, however, are also important for diagnosis. Table 1 reports the objective evaluation of the various methods, with the best results indicated in bold. The NSCTSR-based method achieves the best results in three of the five evaluation metrics, i.e., ${Q}_{0}$, ${Q}_{W}$, and ${Q}_{G}$, and the second-best results in the other two: MODJSR is slightly better for ${Q}_{E}$, and SOMP is slightly better for ${Q}_{AB/F}$.

## Table 1

The objective evaluation of various methods for medical images.

Methods | ${Q}_{0}$ | ${Q}_{W}$ | ${Q}_{E}$ | ${Q}_{AB/F}$ | ${Q}_{G}$ |
---|---|---|---|---|---|
DWT | 0.5674 | 0.6933 | 0.4301 | 0.637 | 0.7198 |
SWT | 0.6257 | 0.7943 | 0.5159 | 0.7043 | 0.7557 |
NSCT | 0.6256 | 0.7701 | 0.5368 | 0.6861 | 0.7691 |
LPSSIM | 0.6352 | 0.8104 | 0.5565 | 0.6996 | 0.8141 |
SR | 0.6526 | 0.7829 | 0.5469 | 0.7255 | 0.8004 |
SOMP | 0.6676 | 0.7953 | 0.5486 | **0.7379** | 0.8140 |
JSR | 0.6043 | 0.7784 | 0.5128 | 0.6667 | 0.7662 |
MODJSR | 0.6681 | 0.8110 | **0.5606** | 0.7247 | 0.8117 |
NSCTSR | **0.6896** | **0.8209** | 0.5598 | 0.7298 | **0.8247** |

Note: DWT, discrete wavelet transform; SWT, stationary wavelet transform; NSCT, nonsubsampled contourlet transform.

Next, a pair of multisensor images is considered. The left image in Fig. 9 shows the buildings, while the right one shows the roads and chimney more saliently. The results of the different fusion methods are shown in Fig. 10, and local magnifications are given in Fig. 11, where the roofs, roads, lanes, chimney, and contrast of the fused images are easier to observe. Careful inspection of Figs. 10(a) and 11(a) shows that the DWT fused image suffers from the Gibbs effect to some degree. In Figs. 10(b) to 10(i) and 11(b) to 11(i), it can be seen that the NSCTSR fused image has better contrast than the NSCT fused image, is smoother than the SWT and LPSSIM fused images, and has clearer lanes and chimney edges than the SR, SOMP, JSR, and MODJSR fused images. Intuitively, the NSCTSR-based method transfers more detailed information and significant features of the source images into the fused image than the other methods. To assess this visual inspection objectively, the values of the five evaluation criteria are listed in Table 2. Our proposed method achieves the best value for all five criteria (tied with LPSSIM for ${Q}_{AB/F}$), which is consistent with the subjective evaluation.

## Table 2

The objective evaluation of various methods for “input094” multisensor images.

Methods | ${Q}_{0}$ | ${Q}_{W}$ | ${Q}_{E}$ | ${Q}_{AB/F}$ | ${Q}_{G}$ |
---|---|---|---|---|---|
DWT | 0.5742 | 0.7236 | 0.462 | 0.5155 | 0.7696 |
SWT | 0.6370 | 0.7733 | 0.5303 | 0.5654 | 0.7940 |
NSCT | 0.6546 | 0.7853 | 0.5621 | 0.5958 | 0.8189 |
LPSSIM | 0.6647 | 0.7838 | 0.5723 | **0.6067** | 0.7800 |
SR | 0.6568 | 0.7888 | 0.5701 | 0.6007 | 0.8208 |
SOMP | 0.6528 | 0.7923 | 0.5703 | 0.6059 | 0.8245 |
JSR | 0.6432 | 0.7715 | 0.5442 | 0.5883 | 0.7969 |
MODJSR | 0.6700 | 0.7967 | 0.5625 | 0.5962 | 0.8250 |
NSCTSR | **0.6707** | **0.7975** | **0.5739** | **0.6067** | **0.8279** |

Note: The bold values are the best results of individual evaluation criteria.

From the above subjective visual evaluation and objective indicators, we can see that NSCTSR renders image details more effectively than the sparse representation–based fusion methods, because NSCT extracts the high-frequency details of the source images at multiple scales and in multiple directions. At the same time, compared with multiscale transform–based image fusion, NSCTSR extracts the salient features of the source images more sparsely and effectively. Consequently, NSCTSR has better fusion performance.

## 5.3.

### Discussion on the Sliding Step

As already mentioned in Sec. 3.2, the fusion methods based on sparse representation with a trained dictionary all use a sliding window scheme. To avoid blocking artifacts, the sliding step is set to 1. If the source image is $256\times 256$ and the block is $8\times 8$, as usual, the number of patches for each source image is 62,001, and sparse coding of all these patches is time-consuming.^{9}^{,}^{20} Likewise, when the input image is $512\times 512$, the number of blocks is 255,025. If the step is increased, the number of blocks drops dramatically, which speeds up the fusion. For instance, when the image is tiled with nonoverlapping blocks (step 8), the number of patches is 1,024 for a $256\times 256$ image and 4,096 for a $512\times 512$ image, so the computational cost of the nonoverlapping scheme is only $\sim 1/60$ of that of the maximally overlapping scheme. Therefore, we discuss the sliding step for several sparse representation methods in this section.
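The patch counts quoted above follow directly from the expression $[(M-\sqrt{n})/\mathrm{step}+1][(N-\sqrt{n})/\mathrm{step}+1]$ of Sec. 4.1. A minimal sketch (the function name is hypothetical, and integer division is assumed when the step does not divide the image size evenly):

```python
def patch_count(M, N, p=8, step=1):
    """Number of p x p patches taken from an M x N image with the given step."""
    return ((M - p) // step + 1) * ((N - p) // step + 1)

# Maximally overlapping (step 1) versus nonoverlapping (step 8) windows.
overlapping_256 = patch_count(256, 256, step=1)     # 249 * 249
tiled_256 = patch_count(256, 256, step=8)           # 32 * 32
overlapping_512 = patch_count(512, 512, step=1)     # 505 * 505
tiled_512 = patch_count(512, 512, step=8)           # 64 * 64
ratio_256 = overlapping_256 / tiled_256             # roughly 60
```

For $8\times 8$ blocks, the ratio between the two schemes stays close to ${8}^{2}=64$ as the image grows, since both counts scale with the number of pixels.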

The DWT-, SWT-, NSCT-, and LPSSIM-based fusion methods do not need a sliding window, so only the results of the SR-, SOMP-, JSR-, MODJSR-, and NSCTSR-based methods with moving $\mathrm{step}=1,2,4,8$ are compared. Figures 12(c) to 12(k) show the fused outputs of the eight compared methods and the proposed method. The NSCTSR method gives much better visibility than the other methods, both in the overall visual effect and in fine image details (the building edges), which is consistent with the previous section. Because of limited space, Figs. 12(l) to 12(p) show the results of the sparse representation–based fusion methods only for the nonoverlapping case, i.e., a sliding step of 8, denoted by the suffix _S8 (e.g., NSCTSR_S8). The figures show that the results fused by the SR-, SOMP-, JSR-, and MODJSR-based methods have obvious blocking artifacts, while the proposed method shows no visible blocking effect, because the fused image is reconstructed by the inverse NSCT and the block effect in the low-pass subband is progressively weakened.

Table 3 gives the objective evaluation, with the two top results indicated in bold. We conclude that the methods based on sparse representation alone are usually better than the methods based on a multiscale transform alone, but the former perform best with the smallest moving step, which requires a large amount of computation. The quantitative assessments of the proposed method remain almost constant with nonoverlapping (distinct) blocks, which makes it more efficient than the traditional sparse representation–based methods.

## Table 3

The objective evaluation of various methods and some methods with the nonoverlapping block method. Two top results are indicated in bold.

Methods | ${Q}_{0}$ | ${Q}_{W}$ | ${Q}_{E}$ | ${Q}_{AB/F}$ | ${Q}_{G}$ |
---|---|---|---|---|---|
DWT | 0.6319 | 0.7300 | 0.4915 | 0.5323 | 0.7431 |
SWT | 0.6797 | 0.7623 | 0.5488 | 0.5838 | 0.7740 |
NSCT | 0.6915 | 0.7958 | 0.5977 | 0.6119 | **0.8152** |
LPSSIM | 0.6691 | 0.7903 | 0.585 | 0.6231 | 0.8032 |
SR | 0.7098 | 0.7961 | 0.6092 | 0.6112 | 0.7933 |
SOMP | 0.7049 | 0.7927 | 0.6047 | 0.6297 | 0.7941 |
JSR | 0.6861 | 0.7683 | 0.5538 | 0.5937 | 0.7629 |
MODJSR | 0.6915 | 0.7746 | 0.5475 | 0.6034 | 0.8073 |
NSCTSR | **0.7121** | **0.8079** | **0.6119** | **0.6373** | **0.8192** |
SR_S8 | 0.6818 | 0.7878 | 0.5817 | 0.5771 | 0.7839 |
SOMP_S8 | 0.6752 | 0.7649 | 0.5541 | 0.5948 | 0.7702 |
JSR_S8 | 0.5429 | 0.5889 | 0.233 | 0.4143 | 0.6003 |
MODJSR_S8 | 0.5526 | 0.609097 | 0.2449 | 0.4089 | 0.6209 |
NSCTSR_S8 | **0.7119** | **0.8078** | **0.6117** | **0.6371** | 0.8091 |

The quantitative assessments of several fusion methods with different sliding steps are shown in Fig. 13. The assessments of JSR and MODJSR are affected most by the sliding step, followed by those of SOMP and SR; the proposed method is almost unaffected and gives the best fusion result in terms of ${Q}_{0}$, ${Q}_{W}$, ${Q}_{E}$, and ${Q}_{AB/F}$. As for ${Q}_{G}$, the NSCT-based method is somewhat better than NSCTSR_S8.

Similar observations hold for the test case in Fig. 14. In this case, NSCTSR and NSCTSR_S8 again provide the most visually pleasing fusion results. Figures 14(g) to 14(i) show that it is difficult for the traditional fusion methods based on sparse representation alone to preserve detail features. The multiscale transform fusion results in Figs. 14(c) to 14(f) have reduced contrast and lack effective salient features. The image fused by NSCTSR preserves the details and lines completely and also highlights the significant information [Fig. 14(a) is bright and Fig. 14(b) is dark]. In the nonoverlapping block versions in Figs. 14(l) to 14(p), we also find that the proposed method is less affected by the block step than the other sparse representation methods. Table 4 shows that the proposed method is still the best in a comprehensive comparison.

## Table 4

The objective evaluation of various methods and some methods with the nonoverlapping block method.

Methods | ${Q}_{0}$ | ${Q}_{W}$ | ${Q}_{E}$ | ${Q}_{AB/F}$ | ${Q}_{G}$ |
---|---|---|---|---|---|
DWT | 0.4945 | 0.6004 | 0.5271 | 0.5015 | 0.6948 |
SWT | 0.5300 | 0.6473 | 0.5823 | 0.5325 | 0.7450 |
NSCT | 0.5780 | 0.7235 | 0.6344 | 0.5806 | 0.7778 |
LPSSIM | 0.5487 | 0.7106 | 0.6358 | 0.5614 | **0.7838** |
SR | 0.5558 | 0.6554 | 0.6052 | 0.5692 | 0.7596 |
SOMP | 0.5578 | 0.6622 | 0.6188 | 0.5677 | 0.7625 |
JSR | 0.5515 | 0.6070 | 0.5647 | 0.4824 | 0.6498 |
MODJSR | **0.5798** | 0.7056 | 0.5995 | 0.5760 | 0.7774 |
NSCTSR | **0.5870** | **0.7431** | **0.6721** | **0.5961** | **0.7860** |
SR_S8 | 0.5333 | 0.6310 | 0.5518 | 0.5307 | 0.7322 |
SOMP_S8 | 0.5291 | 0.6277 | 0.5793 | 0.5449 | 0.7596 |
JSR_S8 | 0.4101 | 0.3587 | 0.2941 | 0.3350 | 0.5198 |
MODJSR_S8 | 0.4201 | 0.4359 | 0.3844 | 0.3727 | 0.5103 |
NSCTSR_S8 | 0.5792 | **0.7327** | **0.6694** | **0.5897** | 0.7784 |

Note: The bold values are the two best results of individual evaluation criteria.

In addition, the complexity of dictionary training in NSCTSR is almost the same as in the SR, SOMP, and JSR fusion methods, because they all use the classical K-SVD dictionary learning method. Although the dictionary of NSCTSR is trained in the NSCT domain, the low-pass subband image (coefficients) in the NSCT domain has the same size as the source image, and the complexity of the NSCT decomposition is much lower than that of the K-SVD algorithm. Dictionary training in MODJSR has lower complexity owing to its joint sparse coding and dictionary update stages. The CPU times of K-SVD and of dictionary training in MODJSR and NSCTSR are 108.61, 74.59, and 124.27 s, respectively. However, the dictionary in sparse representation–based fusion methods is usually pretrained on a large number of samples, because the number of source images is limited.^{9} Therefore, the complexity of the fusion stage in Fig. 5 is of greater concern. The above experiments show that the NSCTSR fusion method with a nonoverlapping step does decrease the computational cost of the fusion stage.

## 5.4.

### More Results on 20 Pairs of Images

To confirm the effectiveness of the proposed method, an experiment on a larger image set is presented. Twenty pairs of multisensor images (001 to 020) from the Image Fusion Server are fused by the eight compared methods and NSCTSR, as shown in Fig. 15. Figure 16 illustrates the results fused by NSCTSR; in each set, the former image is obtained with a step of 1 and the latter with a step of 8, i.e., the nonoverlapping block approach. The two kinds of fused results are visually nearly the same. The objective evaluation of each pair is calculated, and the average results are shown in Table 5. From Table 5, we observe that the NSCTSR and NSCTSR_S8 methods outperform the other methods on all five criteria. These average values demonstrate the superiority of the proposed method.

## Table 5

Average of the metrics over 20 pairs of images.

Methods | Q0 | QW | QE | QAB/F | QG |
---|---|---|---|---|---|
DWT | 0.6975 | 0.7384 | 0.5627 | 0.5603 | 0.7535 |
SWT | 0.7462 | 0.7729 | 0.6131 | 0.6002 | 0.7839 |
NSCT | 0.7632 | 0.7874 | 0.6309 | 0.6284 | 0.7919 |
LPSSIM | 0.7764 | 0.7896 | 0.6249 | 0.6394 | 0.7875 |
SR | 0.7665 | 0.7904 | 0.6391 | 0.6390 | 0.7965 |
SOMP | 0.7587 | 0.7987 | 0.6205 | 0.6452 | 0.7939 |
JSR | 0.7509 | 0.7798 | 0.6204 | 0.6165 | 0.7891 |
MODJSR | 0.7790 | 0.8071 | 0.6382 | 0.6473 | 0.8092 |
NSCTSR | **0.7887** | **0.8105** | **0.6414** | **0.6531** | **0.8168** |
NSCTSR_S8 | **0.7882** | **0.8101** | **0.6411** | **0.6527** | **0.8139** |

Note: The bold values are the two best results of individual evaluation criteria.
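The selection rule stated in the note can be checked directly from the reported averages. The sketch below copies the values of Table 5 and lists the two best methods per metric; the names `TABLE5`, `METRICS`, and `top_two` are illustrative and not part of the original work.

```python
# Average metric values copied from Table 5 (columns: Q0, QW, QE, QAB/F, QG).
METRICS = ["Q0", "QW", "QE", "QAB/F", "QG"]
TABLE5 = {
    "DWT":       [0.6975, 0.7384, 0.5627, 0.5603, 0.7535],
    "SWT":       [0.7462, 0.7729, 0.6131, 0.6002, 0.7839],
    "NSCT":      [0.7632, 0.7874, 0.6309, 0.6284, 0.7919],
    "LPSSIM":    [0.7764, 0.7896, 0.6249, 0.6394, 0.7875],
    "SR":        [0.7665, 0.7904, 0.6391, 0.6390, 0.7965],
    "SOMP":      [0.7587, 0.7987, 0.6205, 0.6452, 0.7939],
    "JSR":       [0.7509, 0.7798, 0.6204, 0.6165, 0.7891],
    "MODJSR":    [0.7790, 0.8071, 0.6382, 0.6473, 0.8092],
    "NSCTSR":    [0.7887, 0.8105, 0.6414, 0.6531, 0.8168],
    "NSCTSR_S8": [0.7882, 0.8101, 0.6411, 0.6527, 0.8139],
}


def top_two(metric):
    """Return the two methods with the highest average for the given metric."""
    col = METRICS.index(metric)
    ranked = sorted(TABLE5, key=lambda method: TABLE5[method][col], reverse=True)
    return ranked[:2]


for metric in METRICS:
    print(metric, top_two(metric))
```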

Table 6 reports the average computation (CPU) time of the above methods. The average CPU times of NSCT and LPSSIM are longer than those of DWT and SWT. The sparse representation–based image fusion methods are much slower than the multiscale transform–based methods because the sliding window scheme with maximally overlapping blocks is time-consuming. However, NSCTSR with a nonoverlapping step takes only 15.91 s, without blocking artifacts, which is much faster than the traditional sparse representation–based methods. Although the proposed method takes more time than the multiscale transform–based methods, it gets better results, as described above.

## Table 6

Average of CPU time of various methods.

Methods | DWT | SWT | NSCT | LPSSIM | SR | SOMP | JSR | MODJSR | NSCTSR | NSCTSR_S8 |
---|---|---|---|---|---|---|---|---|---|---|
CPU time (s) | 0.53 | 0.79 | 13.79 | 19.62 | 85.96 | 73.64 | 94.15 | 98.72 | 87.34 | 15.91 |
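The gain from the nonoverlapping step can be quantified from Table 6 as a ratio of CPU times. The following sketch uses only the reported averages; the names `CPU_TIME`, `SR_BASED`, and `speedup` are illustrative assumptions. With these values, NSCTSR_S8 is roughly 4.6 to 6.2 times faster than the methods that use fully overlapping blocks.

```python
# Average CPU times in seconds, copied from Table 6.
CPU_TIME = {
    "DWT": 0.53, "SWT": 0.79, "NSCT": 13.79, "LPSSIM": 19.62, "SR": 85.96,
    "SOMP": 73.64, "JSR": 94.15, "MODJSR": 98.72, "NSCTSR": 87.34,
    "NSCTSR_S8": 15.91,
}

# Sparse representation-based methods that use fully overlapping blocks.
SR_BASED = ["SR", "SOMP", "JSR", "MODJSR", "NSCTSR"]


def speedup(method, reference="NSCTSR_S8"):
    """How many times longer `method` runs than `reference`."""
    return CPU_TIME[method] / CPU_TIME[reference]


for method in SR_BASED:
    print(f"{method}: {speedup(method):.2f}x the time of NSCTSR_S8")
```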

## 6.

## Conclusion

In this paper, we have proposed a fusion method (NSCTSR) based on NSCT and sparse representation. The major contributions of this paper are twofold. First, the salient features of the low-pass subband coefficients in NSCT can be effectively separated through a dictionary trained with K-SVD. Meanwhile, the property of multiscale analysis is introduced into the sparse representation–based fusion method to improve the integration of details. Second, the proposed method with a nonoverlapping step largely decreases the computational cost compared with traditional sparse representation methods, without blocking artifacts. The experimental results show that the proposed method performs better than both multiscale transform–based methods and sparse representation–based methods in terms of visual effects and quantitative fusion evaluation measures. Furthermore, NSCTSR can easily be extended to existing state-of-the-art NSCT-based image fusion algorithms.

## Acknowledgments

The authors would like to thank the anonymous reviewers and editors for their insightful comments and suggestions. This work is supported by the National Natural Science Foundation of China under Grants 61075014 and 61103062; the Aeronautical Science Fund of China (No. 2013ZD53056); the Research Fund for the Doctoral Program of Higher Education (20116102120031); the Aerospace Support Fund (2011XW080001C080001); the Doctorate Foundation of Northwestern Polytechnical University (CX201318); and the NPU Basic Research Foundation (JC201249).

## References

## Biography

**Jun Wang** received her BS degree in telecommunication engineering from North University of China, Taiyuan, China, in 2009 and her MS degree in circuits and systems from Northwestern Polytechnical University, Xi’an, China, in 2012. She is currently pursuing a PhD degree at the School of Electronics and Information, Northwestern Polytechnical University, Xi’an, China. Her research interests include image processing, sparse representation, and pattern recognition.

**Jinye Peng** received his MS degree in computer science from Northwestern University, Xi’an, China, in 1996 and his PhD degree from Northwestern Polytechnical University, Xi’an, in 2002. He has been a full-time professor at the School of Electronics and Information, Northwestern Polytechnical University, since 2006. His current research interests include image retrieval, face recognition, and machine learning.

**Xiaoyi Feng** received her MS degree in computer science from Northwestern University, Xi’an, China, in 1994 and her PhD degree from Northwestern Polytechnical University, Xi’an, in 2001. She has been a full-time professor at the School of Electronics and Information, Northwestern Polytechnical University, since 2009. Her current research interests include image retrieval, face recognition, and computer vision.

**Guiqing He** received her BS, MS, and PhD degrees in computer science from Northwestern University, Xi’an, China, in 2000, 2005, and 2009, respectively. She is an associate professor at the School of Electronics and Information, Northwestern Polytechnical University. Her current research interests include data fusion and the analysis and processing of remote sensing images.

**Jun Wu** received his BS degree in information engineering from Xi’an Jiaotong University in 2001 and his MSc and PhD degrees, both in computer science and technology, from Tsinghua University in 2004 and 2008, respectively. He is currently an associate professor in the School of Electronics and Information, Northwestern Polytechnical University. From 2008 to 2010, he was a research staff member in the Intelligent Systems Lab Amsterdam of the University of Amsterdam, the Netherlands. From 2003 to 2004, he was a visiting student at Microsoft Research Asia. From August to October 2005, he was a visiting scholar in the Department of Computer Science, University of Hamburg, Germany. His research interests are in machine learning, multimedia analysis, and multimedia information retrieval.

**Kun Yan** received his BS and MS degrees in circuits and systems from Northwestern Polytechnical University, Xi’an, China, in 2008 and 2011, respectively. He has been an engineer at the Institute of Remote Sensing and Data Transmission, China Academy of Space Technology, Xi’an, since 2011. His research interests include processing of remote sensing data, data transmission, and pattern recognition.