Learning synthetic aperture radar image despeckling without clean data

Abstract. Speckle noise degrades the quality of synthetic aperture radar (SAR) images and makes interpretation more difficult. Existing SAR image despeckling convolutional neural networks require large quantities of noisy–clean image pairs, but clean SAR images are very difficult to obtain. In addition, because continuous convolution and pooling operations lose many details while extracting the deep features of the SAR image, the quality of the recovered clean images deteriorates. Therefore, we propose a despeckling network called multiscale dilated residual U-Net (MDRU-Net). The MDRU-Net can be trained directly using noisy–noisy image pairs without clean data. To preserve more SAR image details, we design five multiscale dilated convolution modules that extract and fuse multiscale features. Considering that the deep and shallow features differ considerably when fused, we design different dilation residual skip connections, which make features at the same level undergo the same convolution operations. Finally, we present an L_hybrid loss function that improves the network stability and suppresses artifacts in the predicted clean SAR image. Compared with the state-of-the-art despeckling algorithms, the proposed MDRU-Net achieves a significant improvement in several key metrics.


Introduction
Synthetic aperture radar (SAR) 1 is an active Earth observation system deployed on aircraft, satellites, or other flight platforms. Compared with the optical and infrared systems, SAR can provide all-time, all-weather, high-resolution, and wide-swath observation. It also has a certain ability to penetrate the Earth's surface and discover underground targets. Therefore, the SAR has advantages in disaster monitoring, 2 environmental monitoring, 3 ocean surveillance, 4 resource exploration, 5 surveying, and military applications.
However, due to the imaging mechanism of SAR, a large amount of speckle noise exists in the observed SAR images. 6 Speckle noise 7 is a kind of random multiplicative noise, which is formed by the mutual interference of radar echo phases. It appears in SAR images as granular, black-and-white noise. The speckle noise in single-look SAR images follows a Gaussian distribution with zero mean, 8 while the speckle noise in multilook SAR images follows a gamma distribution with unit mean and variance $1/\sqrt{L}$, 9 where $L$ is the number of looks. Speckle noise reduces the resolution of the SAR image and masks the detailed structure of targets. Because the image details are masked, the accuracy of SAR image classification, 10 segmentation, 11 and change detection 12 is reduced. In addition, speckle noise makes phase unwrapping in SAR interferometry much more difficult, which degrades the accuracy of interferometry. 13 Therefore, removing speckle noise is an important research topic in the SAR imaging field.
*Address all correspondence to Gang Zhang, E-mail: gangzhang1989@126.com
To remove the speckle noise from SAR images, many approaches have been proposed. Spatial filters [14][15][16][17] were the first to be applied to SAR images, but they oversmooth the edges of the filtered image. To solve this problem, the spatial filters were improved in two ways. One is to use different filters for different scenes. 18,19 The other is to design adaptive sliding window filters. 20,21 The transform domain filters (TDFs) mainly include wavelet domain filters 22,23 and post-wavelet domain filters. [24][25][26] Although the despeckling performance of the TDFs is significantly higher than that of spatial filters, their complexity is very high. The filters based on the Markov random field model 27 can remove the speckle noise in the spatial and transform domains, but they require a lot of prior knowledge of SAR images and speckle noise. Owing to their simple idea and superior performance, nonlocal mean (NLM) filters [28][29][30][31] have been widely used to reduce speckle noise. However, the filtered images contain artificial textures because of the block effect.
With the development of convolutional neural networks (CNNs), some researchers [32][33][34][35] have tried to use CNNs for image despeckling. However, these CNN methods still have some problems. First, CNN-based despeckling methods require a large number of noisy-clean image pairs, where the clean images are used as labels, but clean SAR images are difficult to obtain. To construct the noisy-clean image pairs, these methods [32][33][34][35] usually add simulated speckle noise to optical images. As a result, the training data contain no real SAR images, the predicted clean SAR images carry optical interference, and the approach cannot be applied to practical work. Second, to preserve more image details, some methods 33,34 crop a large SAR image into many small patches, but the cropping operation destroys the structure, texture, and other information of the image. To address these problems, and inspired by the noisy-to-noisy paradigm, 36 we propose a network called multiscale dilated residual U-Net (MDRU-Net). The MDRU-Net is an improved version of U-Net 37 and can be trained directly using noisy-noisy image pairs. Unlike the previously mentioned despeckling CNN methods, [32][33][34][35] which use small patches (e.g., 40 × 40) as input, the input size of the MDRU-Net is 256 × 256.
Our main contributions are listed as follows:
• To solve the problem of the lack of clean SAR images, we put forward MDRU-Net. The network does not require clean SAR images during training, and its input is noisy-noisy SAR image pairs.
• We design a multiscale dilated convolution (MDC) module that uses multiple dilated convolutions to extract and fuse multiscale features in order to preserve more SAR image details.
• To reduce the difference between the shallow and deep features in fusion, we design five different dilation residual skip (DRS) connections.
• We propose an effective loss function, called the L_hybrid loss function, to suppress artifacts and improve the stability of the network.
With the gradual maturity of CNNs, intelligent applications of SAR have become possible. However, speckle noise is a major obstacle to the intelligent interpretation of SAR images, so removing speckle noise effectively and quickly with CNNs becomes the primary task of intelligent interpretation. Chierchia et al. 32 first proposed a despeckling CNN for SAR images (SAR-CNN). The SAR-CNN was inspired by the denoising CNNs, 38 which worked very well in reducing additive white Gaussian noise. However, the SAR-CNN adopted coupled logarithm and exponential transforms in the process of removing speckle noise, so it is not an end-to-end learning network. To solve this problem, Wang et al. 33 designed an image despeckling CNN (ID-CNN), which consisted of eight convolutions and a division residual layer. Zhang et al. 34 presented an SAR image despeckling network with a dilated residual structure (SAR-DRN). They adjusted the dilation rate of the dilated convolution to increase the network receptive field and capture more image details. Francesco et al. 35 utilized U-Net to remove the speckle noise of SAR images and demonstrated the benefit of the skip connection.

Dilated Convolution
In the image semantic segmentation task, to aggregate multiscale context information without losing image resolution, Yu and Koltun 39 developed a convolutional network module called dilated convolution. The dilated convolution can increase the network receptive field without increasing the number of weights. Figure 1 illustrates the operation of the dilated convolution on the feature map. The size of the feature map is 9 × 9, and the red dots represent the original weights of the kernel. The yellow blank blocks represent the expanded weights, whose value is 0. Liu et al. 40 designed a multibranch residual module with dilated convolutions to extract multiscale features for the classification and identification of spacecraft electronic load signals. Yang et al. 41 designed an end-to-end dilated inception network (DINet) to predict visual saliency maps. The dilated inception module of the DINet used dilated convolutions with different dilation rates in parallel, which not only significantly reduces the computational load but also enriches the diversity of the receptive fields in the features. Zhang et al. 42 presented a multiscale single-image super-resolution network with dilated convolutions, which effectively increased the receptive field of the network by adjusting the dilation rate. In the SAR image despeckling task, the SAR-DRN 34 used only seven dilated convolutions, yet its despeckling performance exceeds that of the SAR-CNN 32 with 17 traditional convolutions.
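The zero-filled expansion described for Fig. 1 can be sketched in a few lines of plain Python. The helper names `dilated_kernel_size` and `dilate_kernel` are illustrative only, and the size relation used is the one implied by the figure (3 × 3 kernels growing to 5 × 5 and 7 × 7 for rates 2 and 3).

```python
def dilated_kernel_size(k_o: int, rate: int) -> int:
    """Side length of a k_o x k_o kernel after dilation with the given rate."""
    return k_o + (k_o - 1) * (rate - 1)


def dilate_kernel(kernel, rate):
    """Spread the original weights `rate` cells apart and fill the gaps with zeros."""
    k_o = len(kernel)
    size = dilated_kernel_size(k_o, rate)
    dilated = [[0.0] * size for _ in range(size)]
    for i in range(k_o):
        for j in range(k_o):
            dilated[i * rate][j * rate] = kernel[i][j]
    return dilated


ones = [[1.0] * 3 for _ in range(3)]
for rate in (1, 2, 3):
    expanded = dilate_kernel(ones, rate)
    nonzero = sum(v != 0.0 for row in expanded for v in row)
    # The footprint grows with the rate while the number of weights stays at 9.
    print(rate, len(expanded), nonzero)
```

The count of nonzero entries stays fixed, which is why the receptive field grows without adding weights.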

Skip Connection
In CNNs, continuous convolution and pooling operations are used to extract deep features. As a result, much of the detail information of the image is lost. To solve this problem, many methods [32][33][34][35] obtain small patches through a cropping operation and use these patches as training data. However, the cropping operation can destroy the structure and texture information of the image. The skip connection 43 forces the network to reconsider low-level features, which would otherwise fade away as they are fed forward. Qi et al. 44 proposed a convolutional encoder-decoder network with skip connections to improve the predictive performance of the saliency maps. They used a skip connection between the encoder and the decoder to transfer the hierarchical features. Tong et al. 45 designed a dense skip connection in a very deep network. The dense skip connection not only alleviated the vanishing gradient problem but also improved the efficiency of super-resolution image reconstruction. Ronneberger et al. 37 presented a U-Net to obtain very accurate segmentation results. Francesco et al. 35 conducted a skip connection ablation experiment on SAR images. They found that the greater the degree of compression of the input image, the more important the skip connections are, and the more obvious the despeckling effect on the SAR image is.
Fig. 1 The operation of the dilated convolution on the feature map: (a) the dilation rate is 1 and the kernel size is 3 × 3; (b) the dilation rate is 2 and the kernel size is enlarged to 5 × 5; and (c) the dilation rate is 3 and the kernel size is broadened to 7 × 7.

Loss Function
The selection of the loss function affects the convergence speed and the degree of optimization. Zhao et al. 46 demonstrated the effect of the loss function in detail. In the SAR image despeckling task, most studies 32,34,35 use the mean squared error (MSE) as the loss function. The ID-CNN 33 used a mixture loss function, which includes the MSE and the total variation (TV) loss function. 47 The MSE loss function is a differentiable convex function, which makes it easy to optimize. But it has the following drawbacks. First, MSE penalizes noise outliers heavily, which can easily cause the exploding gradient problem in CNNs. Second, if MSE is used as the loss function, the predicted clean image will have artifacts. Assume that the noisy-clean image pairs are $\{x_i, y_i\}^N$, $i = 0, 1, 2, \dots, N-1$, where $N$ is the total number of training image pairs. The $x_i$ and $y_i$ are the noisy image and the clean image, respectively, and both have size $W \times H$. Here $x_i$ can be written as
$$x_i = y_i n_i, \tag{1}$$
where $n_i$ is the speckle noise. The predicted image of the despeckling CNN can be expressed as
$$\hat{x}_i = F(x_i; \Phi), \tag{2}$$
where $F(x_i; \Phi)$ is the despeckling CNN and $\Phi$ is the weight of the despeckling CNN. Therefore, the MSE loss function of the noisy-clean method can be written as follows:
$$L_{\mathrm{MSE}} = \frac{1}{N} \sum_{i=0}^{N-1} \frac{1}{WH} \sum_{w=1}^{W} \sum_{h=1}^{H} \left( \hat{x}_i^{w,h} - y_i^{w,h} \right)^2, \tag{3}$$
where $\hat{x}_i^{w,h}$ and $y_i^{w,h}$ represent the pixel values at position $(w, h)$ of the $i$'th predicted image $\hat{x}_i$ and the $i$'th clean image $y_i$, respectively. The mean absolute error (MAE) loss function is convex but not smooth, which makes its optimization harder. Compared with the MSE, it penalizes noise outliers less. The MAE loss function of the noisy-clean method can be given as
$$L_{\mathrm{MAE}} = \frac{1}{N} \sum_{i=0}^{N-1} \frac{1}{WH} \sum_{w=1}^{W} \sum_{h=1}^{H} \left| \hat{x}_i^{w,h} - y_i^{w,h} \right|. \tag{4}$$
It can be seen from Eq. (4) that the derivative of MAE at 0 is not unique, which can affect the stability of the network.
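The different sensitivity of the two losses to outliers can be illustrated with a small numerical sketch. The residual values below are made up for illustration, and `outlier_share` is a hypothetical helper that reports how much of the total loss comes from the single largest residual.

```python
def mse(residuals):
    """Mean of squared residuals."""
    return sum(r * r for r in residuals) / len(residuals)


def mae(residuals):
    """Mean of absolute residuals."""
    return sum(abs(r) for r in residuals) / len(residuals)


def outlier_share(residuals, term):
    """Fraction of the summed loss contributed by the largest single term."""
    terms = [term(r) for r in residuals]
    return max(terms) / sum(terms)


# Three small residuals and one outlier (illustrative values only).
residuals = [0.1, 0.1, 0.1, 5.0]
share_squared = outlier_share(residuals, lambda r: r * r)
share_absolute = outlier_share(residuals, abs)
print(mse(residuals), mae(residuals), share_squared, share_absolute)
```

Under the squared penalty the outlier accounts for almost all of the loss, whereas under the absolute penalty the small residuals still contribute a visible share.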
The TV loss function 47 is a regularization term that reduces the difference between adjacent pixels to ensure the smoothness of the image. It can be formulated as follows:
$$L_{\mathrm{TV}} = \sum_{w=1}^{W-1} \sum_{h=1}^{H-1} \sqrt{p^2 + q^2}, \tag{5}$$
where $p$ and $q$ are computed as
$$p = \hat{x}_i^{w+1,h} - \hat{x}_i^{w,h}, \qquad q = \hat{x}_i^{w,h+1} - \hat{x}_i^{w,h}. \tag{6}$$

Proposed Method

Noisy-to-Noisy Training
Lehtinen et al. 36 demonstrated that denoising networks can be learned by mapping a noisy image to another noisy image. The performance of a denoising network trained with noisy-noisy image pairs is similar to that of a network trained with noisy-clean image pairs. This result is significant for speckle noise suppression in SAR images.
The previous despeckling CNN methods use the MSE and MAE given in Eqs. (3) and (4), respectively. Noisy-to-noisy training requires the following loss functions:
$$L_{\mathrm{MSE}} = \frac{1}{N} \sum_{i=0}^{N-1} \frac{1}{WH} \sum_{w=1}^{W} \sum_{h=1}^{H} \left( F^{w,h}(y_i n_{i1}; \Phi) - (y_i n_{i2})^{w,h} \right)^2, \tag{7}$$
$$L_{\mathrm{MAE}} = \frac{1}{N} \sum_{i=0}^{N-1} \frac{1}{WH} \sum_{w=1}^{W} \sum_{h=1}^{H} \left| F^{w,h}(y_i n_{i1}; \Phi) - (y_i n_{i2})^{w,h} \right|, \tag{8}$$
where $n_{i1}$ and $n_{i2}$ are two independent noise samples. In both noisy-clean and noisy-noisy training, the optimization process minimizes the loss function.
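One way to see why a noisy target can stand in for a clean one under the squared loss: the constant that minimizes the squared error to a set of targets is their mean, and multiplicative noise with unit mean leaves that mean at the clean value. The sketch below checks this numerically for a single pixel. It assumes gamma-distributed speckle with shape L and scale 1/L (a common unit-mean model, chosen here only for illustration); the clean value 0.5 and the sample count are arbitrary.

```python
import random


def noisy_targets(clean_value, looks, count, seed=0):
    """Draw `count` speckled observations of one clean pixel value."""
    rng = random.Random(seed)
    # gammavariate(alpha, beta) has mean alpha * beta, so this noise has unit mean.
    return [clean_value * rng.gammavariate(looks, 1.0 / looks) for _ in range(count)]


def mse_minimizer(targets):
    """The constant prediction minimizing the mean squared error is the sample mean."""
    return sum(targets) / len(targets)


targets = noisy_targets(clean_value=0.5, looks=4, count=20000)
estimate = mse_minimizer(targets)
print(round(estimate, 3))
```

The estimate approaches the clean value as more noisy targets are averaged, even though no clean target is ever used.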
When the proposed MDRU-Net is trained, the input of the network is a pair of noisy-noisy SAR images. The first noisy image is a real SAR image, and the second noisy image is a corrupted SAR image, which is simulated by adding 4-look speckle noise to the real SAR image. When MDRU-Net is tested, only one real SAR image is needed (see Sec. 4.1 for detailed usage of our model on different datasets).
The features of the input and output of each layer are listed in Table 1, where w, h, and c represent the width, height, and channel of the feature map, respectively.

Multiscale Dilated Residual U-Net Architecture
The encoder of the MDRU-Net consists of two convolutions, five MDC modules, and five max-pooling layers. The two convolutions, Conv1_1 and Conv1_2, are 3 × 3 × 64 convolutions. The five max-pooling layers, p1-p5, downsample the SAR image and obtain hierarchical features step by step. The output of p5 is the deep semantic features of the SAR image. The kernel size of each max-pooling layer is 2 × 2 and the stride is 2. The five MDC modules, M1-M5, are instances of the proposed MDC module and extract and fuse multiscale semantic features. The output of the encoder is a set of 8 × 8 deep semantic feature maps.
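The 8 × 8 encoder output follows directly from the five 2 × 2, stride-2 pooling layers applied to a 256 × 256 input. A minimal sketch of that arithmetic, assuming the convolutions and MDC modules keep the spatial size unchanged (the function name `encoder_sizes` is illustrative):

```python
def encoder_sizes(input_size=256, num_pools=5, kernel=2, stride=2):
    """Spatial side length after each max-pooling layer (no padding)."""
    sizes = [input_size]
    for _ in range(num_pools):
        sizes.append((sizes[-1] - kernel) // stride + 1)
    return sizes


sizes = encoder_sizes()
print(sizes)  # [256, 128, 64, 32, 16, 8]
```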
The decoder of the MDRU-Net is composed of five upsampling layers, five concat layers, four MDC modules, and three convolutions. The upsampling layers, U1-U5, use bilinear interpolation to enlarge the feature map, each with a scaling factor of 2. The five concat layers, C1-C5, fuse the shallow and deep semantic features along the channel dimension. The shallow features come from the encoder and are passed to the decoder by the DRS connections. The deep features come from the output of the last layer (M5) in the encoder. The four MDC modules, M6-M9, blend the deep and shallow semantic features of SAR images. The three convolutions are Conv2_1, Conv2_2, and Conv2_3. Conv2_1 and Conv2_2 are traditional convolutions with a kernel size of 3 × 3. The last layer, Conv2_3, produces the output of the MDRU-Net, a predicted clean SAR image with a size of 256 × 256.
In MDRU-Net, five DRS connections are used. The DRS connection is used to reduce the difference between the different level features in the network and to copy the shallow features of the encoder to the decoder.
Note that M1-M9 are instances of the proposed MDC module, which is described in Sec. 3.3. The detailed description of the DRS connection is given in Sec. 3.4. Except for Conv2_3, the traditional convolutions and dilated convolutions in MDRU-Net are each followed by a rectified linear unit layer.

Multiscale Dilated Convolution Modules
Table 1 The features of input and output for each layer.
Many methods have been used to make full use of the image information or features to improve the performance, such as increasing the network depth, 48 increasing the network width, 49 or applying a new loss function. 46 However, they do not take into consideration that objects in the image are similar in different regions. Furthermore, most image denoising CNNs use traditional convolution to extract the semantic features of an image, so once the network structure is determined, the receptive field of the network is fixed. Therefore, we design the MDC module to address these problems. The MDC module consists of multiple dilated convolutions with different dilation rates and a sum-fusion layer. The multiple dilated convolutions extract multiscale features, and the sum-fusion layer fuses them. The fusion method splices on the channel dimension. It is worth noting that the MDC module enlarges the receptive field of the network without increasing the network parameters. At the same time, different dilation rates can be set: a large dilation rate allows the network to capture global features, and a small dilation rate is used to capture local features.
In the MDRU-Net, nine MDC modules are used, built from five different structures. The five structures of the MDC modules are displayed in Fig. 3. The configuration of the MDC modules is listed in Table 2, where m, r, and channels denote the number of dilated convolutions, the dilation rates, and the number of channels of the dilated convolution, respectively. In the MDC modules, as the SAR feature maps become smaller, smaller dilation rates are used to focus on the local features.
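The idea of parallel dilated branches followed by fusion can be sketched on a single-channel image in plain Python. This is only an illustration under stated assumptions: the rates (1, 2, 3) and the all-ones 3 × 3 kernels are placeholders rather than the values of Table 2, the branches are fused by element-wise summation, zero padding keeps the output size equal to the input size, and the convolution is the cross-correlation form used by deep learning frameworks.

```python
def dilated_conv2d(image, kernel, rate):
    """Dilated cross-correlation with zero padding and stride 1 (same output size)."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    c = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    yy = y + (i - c) * rate
                    xx = x + (j - c) * rate
                    if 0 <= yy < h and 0 <= xx < w:  # taps outside the image read as zero
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


def mdc_module(image, kernels, rates):
    """Run one dilated branch per rate and fuse the branches by summation."""
    branches = [dilated_conv2d(image, k, r) for k, r in zip(kernels, rates)]
    h, w = len(image), len(image[0])
    return [[sum(b[y][x] for b in branches) for x in range(w)] for y in range(h)]


# A single bright pixel in the middle of a 7 x 7 image.
image = [[0.0] * 7 for _ in range(7)]
image[3][3] = 1.0
ones = [[1.0] * 3 for _ in range(3)]
fused = mdc_module(image, [ones, ones, ones], rates=(1, 2, 3))
print(fused[3])
```

Each branch responds to the bright pixel at a different distance, so the fused row shows responses at offsets 1, 2, and 3 from the center, which is the multiscale behavior the module is meant to provide.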

Dilation Residual Skip Connections
The skip connection not only passes the detailed information of the image 44 but also speeds up the training. 45 Many works that use skip connections 37,[43][44][45]48 directly combine shallow and deep features and do not consider the difference between the two. To reduce the difference, we design a new skip connection structure called the DRS connection. The DRS connection consists of dilated convolutions and residual blocks, which together form dilated residual (DR) blocks. The DR block is composed of a 3 × 3 traditional convolution and a 3 × 3 dilated convolution. The dilation rate of the dilated convolution is α, a positive integer whose upper bound is set by the input feature size of the dilated convolution. For example, if the input feature size of the dilated convolution is 11 × 11 and the original kernel size is 3 × 3, the dilation rate ranges from 1 to 5. The dilation rate can be written as
$$\alpha = \mathrm{int}\left( \frac{K_d - 1}{K_o - 1} \right), \tag{9}$$
where int represents the round operation, $K_d$ is the dilated kernel size, and $K_o$ is the original kernel size. In this example, if α exceeds 5, the dilated kernel is larger than the feature map and the dilated convolution loses its effectiveness. When α is 1, the dilated convolution is the same as the traditional convolution, so it cannot increase the receptive field of the network. As α increases, the receptive field of the network gradually increases and the network can cover more image information. In this way, the network can pay more attention to the global features of the image, and its despeckling performance can increase.
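The 11 × 11 example can be checked with a short sketch. It assumes that a dilated 3 × 3 kernel with rate α spans K_o + (K_o − 1)(α − 1) pixels per side, which matches the sizes shown in Fig. 1, and that the rounding is toward zero; the function names are illustrative.

```python
def dilated_kernel_size(k_o, rate):
    """Side length covered by a k_o x k_o kernel dilated with the given rate."""
    return k_o + (k_o - 1) * (rate - 1)


def max_dilation_rate(feature_size, k_o=3):
    """Largest rate whose dilated kernel still fits inside the feature map."""
    if feature_size < k_o:
        raise ValueError("feature map is smaller than the original kernel")
    return (feature_size - 1) // (k_o - 1)


alpha_max = max_dilation_rate(11)
print(alpha_max, dilated_kernel_size(3, alpha_max), dilated_kernel_size(3, alpha_max + 1))
```

For an 11 × 11 input the largest usable rate is 5 (an 11 × 11 footprint); a rate of 6 would need a 13 × 13 footprint and no longer fits.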
There are five DRS connections in the MDRU-Net, which are called S1-S5. Each DRS connection contains one or more DR blocks. The detailed configuration of the five DRS connections is shown in Table 3, where Conn.1 is the input of the DRS connection, Conn.2 is the output of the DRS connection in the MDRU-Net, and Blocks is the number of DR blocks in the DRS connection. In all DR blocks, the number of convolutional channels is 32. Figure 4 displays the framework of the S1 connection in detail, where α is 1, 2, 3, 4, and 5 in the five DR blocks. The S1 connection is used between the encoder and the decoder in the MDRU-Net. By applying successive convolutions to the input noisy image, the S1 connection can effectively reduce the difference between the shallow features and the deep features.
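To give a feel for how much context the S1 connection covers, the sketch below computes the receptive field of a stack of stride-1 convolutions, where each layer adds (kernel − 1) × rate pixels. It assumes that the five DR blocks are applied one after another, each as a 3 × 3 traditional convolution followed by a 3 × 3 dilated convolution, and it ignores the residual paths, which do not enlarge the maximum receptive field.

```python
def receptive_field(layers):
    """Receptive field of stacked stride-1 convolutions given (kernel, rate) pairs."""
    rf = 1
    for kernel, rate in layers:
        rf += (kernel - 1) * rate
    return rf


def dr_block(alpha):
    """Assumed DR block layout: 3x3 traditional conv, then 3x3 conv with dilation alpha."""
    return [(3, 1), (3, alpha)]


s1_layers = [layer for alpha in (1, 2, 3, 4, 5) for layer in dr_block(alpha)]
print(receptive_field(s1_layers))  # 41
```

Under these assumptions S1 sees a 41 × 41 neighborhood, compared with 21 × 21 for ten plain 3 × 3 convolutions.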

L_hybrid Loss Function
We have discussed the MSE, MAE, and TV loss functions and their strengths and weaknesses. The two noisy SAR images are $x_{i1}$ and $x_{i2}$, the clean SAR image is $y_i$, and the predicted clean SAR image is $F(x_{i1}; \Phi)$. The proposed L_hybrid loss function is given in Eq. (10) and is built from the terms $C_{\mathrm{MSE}}$, $C_{\mathrm{MAE}}$, and $C_{\mathrm{TV}}$, which are defined in Eqs. (11) to (13); $L_{\mathrm{TV}}$ is given in Eq. (5). The $\eta$ is a variable whose value is set by Eq. (14), where $n$ represents the batch size and $x_{k1}$ and $x_{k2}$ are the $k$'th image pair in each batch. The L_hybrid loss function can improve the stability and generalization ability of the network.
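As a rough illustration of the three ingredients, the sketch below evaluates a squared-error term, an absolute-error term, and a TV term for a predicted image against a noisy target. It is not the L_hybrid loss itself: how the terms are combined and how η is chosen are not part of this sketch, the TV term uses the common isotropic form as an assumption, and all function names are hypothetical.

```python
import math


def c_mse(pred, target):
    """Mean squared error between two equally sized images (lists of rows)."""
    n = sum(len(row) for row in pred)
    return sum((p - t) ** 2 for rp, rt in zip(pred, target) for p, t in zip(rp, rt)) / n


def c_mae(pred, target):
    """Mean absolute error between two equally sized images."""
    n = sum(len(row) for row in pred)
    return sum(abs(p - t) for rp, rt in zip(pred, target) for p, t in zip(rp, rt)) / n


def total_variation(pred):
    """Isotropic total variation: sum of gradient magnitudes between neighbors."""
    rows, cols = len(pred), len(pred[0])
    total = 0.0
    for r in range(rows - 1):
        for c in range(cols - 1):
            p = pred[r + 1][c] - pred[r][c]
            q = pred[r][c + 1] - pred[r][c]
            total += math.sqrt(p * p + q * q)
    return total


def loss_terms(pred, noisy_target):
    """Return the three component values without combining them."""
    return c_mse(pred, noisy_target), c_mae(pred, noisy_target), total_variation(pred)


pred = [[0.0, 1.0], [0.0, 1.0]]
noisy_target = [[0.0, 0.0], [0.0, 1.0]]
print(loss_terms(pred, noisy_target))
```

Note that the error terms compare the prediction with the second noisy image, while the TV term depends on the prediction alone.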

Experimental Evaluation
In this paper, the experiments have been performed on a personal computer with Ubuntu 16.04. The hardware is an Intel Xeon(R) CPU E5-2620v3, an NVIDIA Quadro M6000 24GB GPU, and 48 GB of RAM. The software tool is PyCharm, the version of Python is Python 3.6, and the deep learning framework is TensorFlow 1.10.

Datasets
Three public datasets, UCML, 50 SEN-1, 51 and SEN-2 51 datasets, are used to demonstrate the performance of the proposed methods. The UCML dataset 50 is composed of the optical remote sensing images. UCML is released by the UC Merced computer vision laboratory. The dataset is obtained from the large-scale U.S. Geological Survey national map urban area imagery series, and the dataset contains 21 scene data for research purposes. Each scene has 100 images and the size of each image is 256 × 256 × 3.
The SEN1-2 dataset 51 consists of the SEN-1 and SEN-2 datasets and contains optical-SAR image pairs generated from the Sentinel-2 and Sentinel-1 satellites. It has 282,384 images of land scenes in spring, summer, autumn, and winter. We divide the SEN1-2 dataset into a real SAR image subdataset (SEN-1) and an optical image subdataset (SEN-2), each with 141,192 images. The image size of the SEN-1 is 256 × 256 × 1 and the image size of the SEN-2 is 256 × 256 × 3. In our experiments, to ensure the fairness of the experiment, 2100 images are randomly extracted from the SEN-1, named mini SEN-1 (mSEN-1). Meanwhile, 2100 images are randomly extracted from the SEN-2, named mini SEN-2 (mSEN-2).
Fig. 4 The detailed framework of the S1 connection. The gray 3 × 3 is a traditional convolution. The yellow 3 × 3 is the dilated convolution with rate = α.
Next, according to the noisy-noisy training method, the training image pairs are constructed.

Training data of the simulated synthetic aperture radar images
We use two datasets, UCML and mSEN-2, as the simulated SAR images to demonstrate the despeckling performance of the proposed methods. First, we convert the images of the UCML dataset into grayscale images. Then, we randomly divide the 2100 images of the UCML dataset into 1400 images as the training set, 200 images as the validation set, and 500 images as the testing set. Finally, we add two kinds of simulated speckle noise to each image of the training set and obtain the training image pairs $\{x_{i1}, x_{i2}\}^N$. The $x_{i1}$ and $x_{i2}$ are both noisy images. The $x_{i1}$ is the input image and the $x_{i2}$ is the ground-truth image (the noisy image). The mSEN-2 dataset is processed in the same way as the UCML dataset. An example of the processed training images for the UCML and mSEN-2 datasets is shown in Fig. 5.
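The construction of one simulated training pair can be sketched as follows. This is an illustration under assumptions: speckle is drawn from a unit-mean gamma distribution with shape L and scale 1/L, the two "kinds" of noise are taken to be two independent draws with hypothetical look numbers 1 and 4, and images are plain nested lists of grayscale values.

```python
import random


def add_speckle(image, looks, rng):
    """Multiply every pixel by an independent unit-mean gamma noise sample."""
    return [[px * rng.gammavariate(looks, 1.0 / looks) for px in row] for row in image]


def make_noisy_pair(clean, looks_input=1, looks_target=4, seed=0):
    """Return (x_i1, x_i2): two independently speckled versions of one clean image."""
    rng = random.Random(seed)
    x_i1 = add_speckle(clean, looks_input, rng)   # network input
    x_i2 = add_speckle(clean, looks_target, rng)  # noisy training target
    return x_i1, x_i2


clean = [[0.2, 0.4], [0.6, 0.8]]
x_i1, x_i2 = make_noisy_pair(clean)
print(x_i1, x_i2)
```

The clean image is used only to synthesize the two noisy versions; it never appears as a training target.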

Training data of the real synthetic aperture radar images
We use the mSEN-1 dataset as the real SAR images to verify the despeckling performance of the proposed methods. First, we randomly divide the 2100 images of the mSEN-1 dataset into 1400 images as the training set, 200 images as the validation set, and 500 images as the testing set. Then, we corrupt the training set and generate the noisy-noisy image pairs for training. The real SAR training image pairs are $\{x_{i1}, x_{i2}\}^N$. The $x_{i1}$ is the real SAR image and is used as the input image of the networks. The $x_{i2}$ is the corrupted image of the mSEN-1 dataset and is used as the ground-truth image (the noisy image) of the networks. In our experiments, the corruption is performed by adding simulated speckle noise to the real SAR images. An example of the processed mSEN-1 dataset is shown in Fig. 6. The reference image of the mSEN-1 is the grayscale image of the mSEN-2.
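The 1400/200/500 random split used for each dataset can be reproduced with a seeded shuffle. The sketch below is a generic illustration; the function name, the fixed seed, and the use of integer identifiers in place of image files are assumptions.

```python
import random


def split_dataset(items, n_train=1400, n_val=200, n_test=500, seed=0):
    """Shuffle the items once and cut them into train/validation/test subsets."""
    if len(items) != n_train + n_val + n_test:
        raise ValueError("subset sizes must add up to the dataset size")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


train, val, test = split_dataset(range(2100))
print(len(train), len(val), len(test))  # 1400 200 500
```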

Quality Assessment Criteria
To evaluate the despeckled SAR images, we choose the signal-to-noise ratio (SNR), the peak signal-to-noise ratio (PSNR), the structural similarity index (SSIM), 52 the despeckling gain (DG), 53 the equivalent number of looks (ENL) 34 and the edge preservation index (EPI) as assessment criteria.
The SNR is the ratio of signal strength to noise intensity. Assume that $x_i$ and $y_i$ are the noisy image and the clean (reference) image, respectively. The output of the despeckling network is $F(x_i; \Phi)$, and $\hat{x}_i = F(x_i; \Phi)$ denotes the despeckled image. The SNR is defined in Eq. (15) in terms of $C_{\mathrm{MSE}}(\cdot)$, which is given in Eq. (11), and $M$, the number of images in the testing set. The PSNR is the most widely used objective measure of image quality. It represents the ratio between the maximum signal power and the noise power and measures the similarity between the despeckled image and the reference image; it is given in Eq. (16). The SSIM over the testing set is
$$\mathrm{SSIM} = \frac{1}{M} \sum_{i=0}^{M-1} \mathrm{SSIM}_i, \tag{17}$$
where $\mathrm{SSIM}_i$ can be written as
$$\mathrm{SSIM}_i = \frac{\left( 2 \mu_{\hat{x}_i} \mu_{y_i} + C_1 \right) \left( 2 \sigma_{\hat{x}_i y_i} + C_2 \right)}{\left( \mu_{\hat{x}_i}^2 + \mu_{y_i}^2 + C_1 \right) \left( \sigma_{\hat{x}_i}^2 + \sigma_{y_i}^2 + C_2 \right)}, \tag{18}$$
where $\mu_{\hat{x}_i}$, $\sigma_{\hat{x}_i}$, $\mu_{y_i}$, and $\sigma_{y_i}$ represent the mean and standard deviation of the images $\hat{x}_i$ and $y_i$, respectively. The $\sigma_{\hat{x}_i y_i}$ is the covariance of the images $\hat{x}_i$ and $y_i$. The $C_1$ and $C_2$ are constants that avoid SSIM calculation errors when the mean and standard deviation of the image are both 0.
Fig. 6 An example of the training data after processing on the mSEN-1 dataset: (a) the reference image, (b) the real SAR image, and (c) the corrupted image, respectively.
The DG 53 is a new paradigm for the objective assessment of SAR despeckling methods, and its calculation requires a noisy image, a despeckled image, and a reference image. The DG can be given as
$$\mathrm{DG} = \frac{1}{M} \sum_{i=0}^{M-1} 10 \log_{10} \frac{C_{\mathrm{MSE}}(x_i, y_i)}{C_{\mathrm{MSE}}(\hat{x}_i, y_i)}. \tag{19}$$
All four of the above assessment criteria require a reference image. However, the mSEN-1 dataset is a real SAR image dataset and has no clean images to serve as references when calculating the indices. Therefore, to objectively evaluate the despeckling performance on the real SAR images, the clean grayscale images of mSEN-2 are used as the reference images for mSEN-1.
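For a single test image, the despeckling gain compares the error of the noisy image with the error of the despeckled image, both measured against the reference. A minimal sketch, assuming a plain mean squared error for the error term and using made-up 2 × 2 images (the function names are illustrative):

```python
import math


def c_mse(a, b):
    """Mean squared error between two equally sized images (lists of rows)."""
    n = sum(len(row) for row in a)
    return sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb)) / n


def despeckling_gain(noisy, despeckled, reference):
    """Gain in dB: positive when despeckling moves the image closer to the reference."""
    return 10.0 * math.log10(c_mse(noisy, reference) / c_mse(despeckled, reference))


reference = [[0.0, 0.0], [0.0, 0.0]]
noisy = [[2.0, 2.0], [2.0, 2.0]]
despeckled = [[1.0, 1.0], [1.0, 1.0]]
gain = despeckling_gain(noisy, despeckled, reference)
print(round(gain, 2))
```

Here the squared error drops by a factor of four, which corresponds to a gain of about 6 dB; averaging this quantity over the testing set gives the dataset-level score.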
The ENL 34 is a common indicator used to evaluate the speckle noise intensity of SAR images. The ENL can be defined as
$$\mathrm{ENL} = \frac{\mu^2}{\sigma^2}, \tag{20}$$
where $\mu$ and $\sigma$ are the mean and standard deviation of the evaluated image. The EPI is used to evaluate the edge preservation ability of the despeckled image in the horizontal or vertical direction. The value range of EPI is [0, 1]. The higher the EPI value, the stronger the edge preservation ability of the despeckling network is. The EPI is given in Eq. (21), where $|\cdot|$ represents the absolute value operation. The $\mathrm{DN}_{V1}$ and $\mathrm{DN}_{V2}$ are the pixel values of adjacent pixels in the vertical direction, and the $\mathrm{DN}_{H1}$ and $\mathrm{DN}_{H2}$ are the pixel values of adjacent pixels in the horizontal direction.
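Both indicators are easy to compute directly. The sketch below assumes the usual definitions: ENL as the squared mean over the variance of an image region, and EPI as the ratio of summed absolute differences between adjacent pixels in the despeckled image to the same sum in a baseline image. Which image serves as the baseline (here called `baseline`) is an assumption of this sketch, as are the function names.

```python
def enl(region):
    """Equivalent number of looks: squared mean divided by (population) variance."""
    values = [v for row in region for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean * mean / var


def edge_strength(image, direction):
    """Sum of absolute differences between adjacent pixels in one direction."""
    rows, cols = len(image), len(image[0])
    if direction == "horizontal":
        return sum(abs(image[r][c + 1] - image[r][c])
                   for r in range(rows) for c in range(cols - 1))
    if direction == "vertical":
        return sum(abs(image[r + 1][c] - image[r][c])
                   for r in range(rows - 1) for c in range(cols))
    raise ValueError("direction must be 'horizontal' or 'vertical'")


def epi(despeckled, baseline, direction="horizontal"):
    """Edge preservation index as a ratio of edge strengths."""
    return edge_strength(despeckled, direction) / edge_strength(baseline, direction)


print(enl([[1.0, 3.0]]))
print(epi([[0.0, 1.0], [0.0, 1.0]], [[0.0, 2.0], [0.0, 2.0]]))
```

A smoother region has a smaller variance and therefore a larger ENL, while an EPI close to 1 means that the edge contrast of the baseline image has been kept.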

Implementation Details
We use the prepared noisy-noisy image pairs to train the network and use the Adam algorithm 54 to update the network parameters. The Adam algorithm is a stochastic optimization method proposed by Kingma and Ba, 54 which is integrated in many deep learning platforms such as TensorFlow, Caffe, and PyTorch. Adam has three main parameters, β1, β2, and ϵ. In our experiments, we use the default values provided by the Adam algorithm, 54 namely β1 = 0.9, β2 = 0.999, and ϵ = $10^{-8}$. The learning rate is not fixed and decreases smoothly in our experiments. Assume that the maximum number of training iterations is I and the current number of iterations is t. The current learning rate (cur_lr) is given in Eq. (22), where lr is the initial learning rate. The ξ is a constant in the range [0, 1] that controls the position where the learning rate starts to decrease. In our experiments, I is 50,000, lr is set to 0.0001, ξ is set to 0.3, and the batch size (n) is 4. These values were chosen as follows. When we first trained the despeckling network, I was set to 100,000, and a model was saved every 5000 iterations. By testing each model, we found that the network had converged at 50,000 iterations, where the test results were optimal. Therefore, the maximum number of iterations is set to 50,000.
In the selection of lr, we test 0.01, 0.001, and 0.0001. When lr is 0.01, the loss value becomes NaN (Not a Number). When lr is 0.001, the loss value oscillates sharply. When lr is set to 0.0001, the loss value converges quickly.
As training proceeds, the optimization of the network requires a smaller learning rate. The ξ is the parameter that controls the position where the learning rate starts to decrease. Observing the loss curve, we find that the loss value begins to oscillate slightly around 15,000 iterations and no longer decreases. Therefore, we set ξ to 0.3 (15,000 of the 50,000 iterations), so that the learning rate starts to decrease after 15,000 iterations and the loss value starts to decrease again.
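A schedule with this behavior can be sketched as follows. The constant phase up to ξ·I follows the description above, but the shape of the decay is an assumption: a half-cosine ramp is used here purely for illustration and is not claimed to be the formula used in the experiments. The function name `current_lr` is likewise hypothetical.

```python
import math


def current_lr(t, total_iters=50_000, base_lr=1e-4, xi=0.3):
    """Hold base_lr until xi * total_iters, then decay smoothly to zero."""
    start = xi * total_iters
    if t <= start:
        return base_lr
    progress = (t - start) / (total_iters - start)  # 0 at the start of decay, 1 at the end
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # assumed half-cosine ramp


for t in (0, 15_000, 32_500, 50_000):
    print(t, current_lr(t))
```

With ξ = 0.3 and I = 50,000 the rate stays at 0.0001 for the first 15,000 iterations and then falls smoothly, reaching half of its initial value midway through the decay phase.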
The n is set based on the GPU memory size. When n is set to 5, there is insufficient memory during network training. When n is set to 4, the network can train normally. It is worth noting that the larger the n, the better the despeckling performance of the network is.

Despeckling performance of the U-Net
To verify that the U-Net 37 can use the noisy-noisy training method to remove speckle noise in the simulated and real SAR images, we first calculate the SNR, PSNR, SSIM, and ENL of the reference image and the input image. These values are reported as the model "No," which means that they are computed directly from the input image and the reference image without using any despeckling method. As shown in Eqs. (19) and (21), the calculation of the DG and EPI requires a despeckled image, so the values of DG and EPI are not given for the model "No." Then, to adapt to the three datasets, we make three major modifications to the original U-Net. 37 The first is to change the input image size of the original U-Net from 572 × 572 to 256 × 256. The second is to change all convolutions from unpadded to padded. The third is to remove the cropping operation. In the rest of the paper, U-Net refers to this modified U-Net. Finally, we use the constructed noisy-noisy image pairs $\{x_{i1}, x_{i2}\}^N$ to train the U-Net and obtain the despeckling models. The experimental results of the U-Net on the three testing sets are shown in Table 4, where ↑ means that a larger value indicates a stronger despeckling ability of the network. The results show that the U-Net trained with the noisy-noisy method can effectively remove the speckle noise and improve the image quality to some extent in the simulated and real SAR images.

Ablation experiment of the multiscale dilated convolution modules
We have demonstrated that the U-Net can indeed remove the speckle noise in simulated and real SAR images without clean SAR data. However, the despeckling performance of the U-Net is limited. To improve the despeckling performance of the U-Net and verify the proposed MDC module, we replace the convolutions in U-Net with the MDC modules. It is worth noting that the first and last convolutions in U-Net are retained. The MDC modules are described in detail in Sec. 3.3. The experimental results on the three testing sets are shown in Table 5, where bold values indicate the better results and MDCs denotes the U-Net with MDC modules. The experimental results show that the MDC modules greatly improve the despeckling performance of the U-Net. Compared with the U-Net, the PSNR on the three testing sets increases by 5.602, 1.441, and 5.002 dB, respectively. The other assessment criteria also improve.
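For reference, the PSNR gains quoted above are differences between values on a logarithmic (dB) scale. The following sketch uses the standard PSNR definition, which may differ in detail from the formula given earlier in the paper. The peak value of 255 and the function name `psnr` are assumptions.

```python
import numpy as np


def psnr(reference, estimate, peak=255.0):
    """Standard PSNR in dB; peak=255 assumes 8-bit images (an assumption)."""
    ref = np.asarray(reference, dtype=np.float64)
    est = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)


if __name__ == "__main__":
    ref = np.zeros((4, 4))
    # A constant error of 25.5 gives MSE = 650.25, i.e. peak^2 / MSE = 100.
    print(psnr(ref, np.full((4, 4), 25.5)))
```

Because of the logarithm, a gain of about 3 dB corresponds to roughly halving the mean squared error.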

Ablation experiment of the dilation residual skip connections
To demonstrate the effect of the DRS connection on despeckling performance, we replace the skip connections in U-Net with the DRS connections, which are introduced in Sec. 3.4. The experimental results on the three testing sets are shown in Table 6, where DRSs denotes the U-Net with DRS connections. The experimental results show that the DRS connections significantly improve the despeckling ability of the U-Net in simulated and real SAR images. The main reason is that the proposed DRS connection allows the semantic features at each level to undergo the same number of convolution operations. This reduces the difference between deep and shallow semantic features, makes feature fusion more efficient, and thereby improves the ability of the network to remove speckle noise. Compared with the U-Net, the PSNR increases by 5.679 dB on the UCML dataset, 1.335 dB on the mSEN-1 dataset, and 4.957 dB on the mSEN-2 dataset.

Despeckling performance of the L_hybrid loss function
To assess the improvement brought by the L_hybrid loss function, we first train the U-Net with the MSE and MAE loss functions separately. Then, we replace them with the L_hybrid loss function. The detailed analysis of the L_hybrid loss function can be found in Sec. 3.5.
The experimental results obtained on the three testing sets are shown in Table 7. The MAE loss function yields better despeckling performance than the MSE, and the L_hybrid loss function yields better despeckling performance than the MAE.

Despeckling performance of the multiscale dilated residual U-Net
In this section, we verify the despeckling performance of the proposed MDRU-Net for simulated and real SAR images. The training data of the MDRU-Net are noisy-noisy image pairs.

Comparison with the state-of-the-art despeckling methods
To compare the despeckling performance of the MDRU-Net with that of existing methods, we select the refined Lee filter 10 (RLF), the improved sigma filter 17 (ISF), the probabilistic patch-based (PPB) filter, 30 the three-dimensional block matching (BM3D) filter for SAR image despeckling (SAR-BM3D), 28 the SAR-CNN, 32 and the SAR-DRN. 34 Note that the RLF, ISF, PPB, and SAR-BM3D are traditional despeckling algorithms that are widely used to filter SAR images. The SAR-CNN and SAR-DRN are state-of-the-art despeckling CNNs for SAR images, and their training data are noisy-clean image pairs. To ensure a fair comparison, we follow the training data selection of SAR-DRN: 34 we randomly select 400 images from the UCML dataset 50 as training data and then perform data augmentation on the selected images. After rotating, flipping, and mirroring, the final training set contains 1600 images. To construct the noisy-noisy training image pairs, we add simulated speckle noise to the clean images. The difference from SAR-DRN 34 is that the training image pairs of the MDRU-Net are noisy-noisy image pairs and the image size is 256 × 256. After training, the SAR image despeckling model is obtained. During the testing phase, we use the airplane, highway, and buildings scenes as our testing set, which is the same as that used in SAR-DRN. 34 We add simulated speckle noise to the three selected scenes and only compare the case where the speckle noise level is 8. The experimental results for the airplane, highway, and buildings scenes are shown in Table 9. MDRU-Net (N2C) denotes the MDRU-Net trained with noisy-clean image pairs, and MDRU-Net (Ours) denotes the method proposed in this paper, which is trained with noisy-noisy image pairs.
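The construction of a noisy-noisy pair from one clean image can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the speckle is modeled as multiplicative gamma noise with unit mean, the noise level of 8 is interpreted as the number of looks, and the function name `make_noisy_pair` is hypothetical.

```python
import numpy as np


def make_noisy_pair(clean, looks=8, rng=None):
    """Return two independently speckled versions of the same clean image.

    Assumptions: multiplicative speckle drawn from a gamma distribution with
    shape=looks and scale=1/looks (unit mean, variance 1/looks for this
    parameterization); looks=8 is an interpretation of "noise level 8".
    """
    rng = np.random.default_rng() if rng is None else rng
    clean = np.asarray(clean, dtype=np.float64)
    # Two independent noise realizations give the network input and its target.
    noise_1 = rng.gamma(shape=looks, scale=1.0 / looks, size=clean.shape)
    noise_2 = rng.gamma(shape=looks, scale=1.0 / looks, size=clean.shape)
    return clean * noise_1, clean * noise_2


if __name__ == "__main__":
    clean = np.full((256, 256), 100.0)  # synthetic flat image, for illustration
    x1, x2 = make_noisy_pair(clean, rng=np.random.default_rng(0))
    print(x1.shape, x2.shape, x1.mean(), x2.mean())
```

Since both realizations have unit-mean noise, each noisy image is an unbiased observation of the same clean image, which is the property that noisy-noisy training relies on.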
The experimental results show that, even when it is trained with noisy-noisy image pairs, the MDRU-Net achieves significantly higher despeckling performance than the other algorithms. In terms of PSNR, the MDRU-Net obtains 31.82, 32.34, and 32.87 dB in the airplane, buildings, and highway scenes, respectively. Compared with the SAR-DRN, 34 the PSNR of the MDRU-Net (Ours) is higher by approximately 3.81, 0.56, and 6.34 dB on the three scenes, respectively.
In addition, the experimental results of MDRU-Net (N2C) and MDRU-Net (Ours) show that their despeckling performance is very close. Therefore, the MDRU-Net (Ours), which does not require clean images for training, is recommended for removing the speckle noise in real SAR images.

Conclusion
In this paper, the despeckling network MDRU-Net for SAR images is proposed. The MDRU-Net can use the noisy-noisy image pairs to train in the absence of clean SAR images.
The MDRU-Net consists of an encoder, a decoder, and multiple DRS connections. The encoder acts as a feature extractor, which extracts deep semantic features of the SAR images. The decoder is responsible for restoring a clean SAR image. To preserve more details of SAR images while extracting deep semantic features, we design the MDC module, which is used in both the encoder and the decoder. The MDC module has five types, each containing multiple dilated convolutions. However, there is a great difference between shallow and deep semantic features in fusion. To reduce this difference, the DRS connection is proposed. The DRS connection not only reduces the difference but also preserves more important details. To make up for the drawbacks of MAE and MSE, we propose the L_hybrid loss function. The L_hybrid loss function not only improves the stability of the despeckling network but also suppresses the artifacts in predicted clean SAR images. We conduct extensive experiments on simulated and real SAR images. The experimental results show that the proposed method achieves state-of-the-art despeckling performance in several key metrics.
Gang Zhang is a PhD student at the Space Engineering University. He received his master's degree from the Xidian University and his bachelor's degree from Xi'an Polytechnic University. His research interests include synthetic aperture radar image processing, pattern recognition, and deep learning.
Zhi Li is a professor at the Space Engineering University. He received his PhD from the Institute of Geology of the China Earthquake Administration in 2003. He received his BE and master's degrees from the National University of Defense Technology in 1994 and 1997, respectively. He has authored or co-authored more than 60 papers and 12 books. His research interests include space system applications and artificial intelligence.
Xuewei Li is a PhD student at Beijing University of Posts and Telecommunications. She received her bachelor's and master's degrees from Xi'an University of Technology. Her current research interests include image aesthetic assessment, image processing, and machine learning.
Yiqiao Xu is a PhD student at the Space Engineering University. He received his master's degree from the Electronic Engineering Institute and his bachelor's degree from the Anhui University of Science and Technology. His research interests include signal processing, image pattern recognition, and deep learning.