Image inpainting using frequency domain priors

In this paper, we present a novel image inpainting technique that uses frequency domain information. Prior works on image inpainting predict the missing pixels by training neural networks using only spatial domain information. However, these methods still struggle to reconstruct high-frequency details for real complex scenes, leading to discrepancies in color, boundary artifacts, distorted patterns, and blurry textures. To alleviate these problems, we investigate whether better performance can be obtained by training the networks using frequency domain information (via the Discrete Fourier Transform) along with spatial domain information. To this end, we propose a frequency-based deconvolution module that enables the network to learn the global context while selectively reconstructing the high-frequency components. We evaluate our proposed method on the publicly available CelebA, Paris StreetView, and DTD texture datasets, and show that our method outperforms current state-of-the-art image inpainting techniques both qualitatively and quantitatively.


Introduction
In computer vision, the task of filling in missing pixels of an image is known as image inpainting. It has wide application in creative editing tasks such as removing unwanted or distracting objects from an image, generating the missing regions of an occluded image, or improving data availability for satellite imagery. The main challenge in this task is to synthesize the missing pixels in such a way that they look visually realistic and coherent to human eyes.
Traditional image inpainting algorithms [1][2][3][4][5][6][7][8][9][10][11] can be broadly divided into two categories. Diffusion-based algorithms [1][2][3][4] focus on propagating the local image appearance into the missing regions. Although these methods can fill in small holes, they produce overly smooth results as the hole grows bigger. Patch-based algorithms [5][6][7][8][9][10][11], on the other hand, iteratively search for the best-fitting patch in the image to fill in the missing region. These methods can fill in bigger holes, but they are not effective either at inpainting missing regions that have complex structures or at generating unique patterns or novel objects that are not available elsewhere in the image in the form of a patch.
Recent research works on image inpainting [12][13][14][15][16][17] leverage advancements in generative models such as Generative Adversarial Networks (GANs) [18] and show that it is possible to learn and predict missing pixels in coherence with the existing neighboring pixels by training a convolutional encoder-decoder network. In this paradigm, generally speaking, the model is trained in a two-stage manner: i) in the first stage, the missing regions are coarsely filled in with initial structures by minimizing a traditional reconstruction loss; ii) in the second stage, the initially reconstructed regions are refined using an adversarial loss. Although these methods are good at generating visually plausible novel content such as human faces, structures, and natural scenes in the missing region, they still struggle to reconstruct high-frequency details for real complex scenes, leading to discrepancies in color, boundary artifacts, distorted patterns, and blurry textures. Additionally, the reconstruction quality of previous methods deteriorates as the size of the missing region increases. These problems can be attributed to the following reason: existing methods use only spatial domain information during learning and, much like diffusion-based techniques, draw information from the mask boundary. Thus, as the mask size increases, the interior reconstruction details are lost and only a low-frequency component of the original patch is estimated by these methods.
To alleviate the above problem, we resort to frequency-based image inpainting. We show that image inpainting can be converted to the problem of deconvolution in the frequency domain, which can predict local structure in the missing regions using global context from the image. Qualitative analysis shows that our proposed frequency domain image inpainting approach helps in improving the texture details of missing regions. Previous methods make use of only spatial domain information; therefore, the reconstruction of information close to the mask boundary is good compared to the interior region, since the local context is available only in the boundary regions. In contrast, a frequency-based approach takes information from the global context of the image, because the Discrete Fourier Transform (DFT) considers all pixels when computing each frequency component. As a result, it captures more detailed structural and textural content of the missing regions in the learned representation. For these reasons, we propose a two-stage network consisting of i) a deconvolution stage and ii) a refinement stage. In the first stage, the DFT image is computed from the original RGB image. Each frequency component in the DFT image captures the global context, thus forming a better representation of the global structure. We employ a Convolutional Neural Network (CNN) to learn the mapping between the masked DFT and the original DFT, which is a deconvolution operation obtained by minimizing the ℓ2 loss. While DFT-based deconvolution can reconstruct the global structural outline, we observe that there exists a mismatch in the color space of the first stage output. Therefore, in the second stage, we fine-tune the output of the first stage using adversarial methods to match the color distribution of the true image. Figure 1 shows an example of the reconstructed output using our method, where Figure 1b) shows the DFT map of our first stage reconstruction obtained from the deconvolution network.
This additional frequency domain information is later used by the refinement network to obtain the final output as shown in Figure 1c). Our main contributions in this paper can be summarized as follows:
1. We introduce a new frequency domain-based image inpainting framework that learns the high-frequency components of the masked region by using the global context of the image. We find that the network preserves image information better when it is trained in the frequency domain. Accordingly, to enable better inpainting, we train the network using both frequency domain and spatial domain information, which leads to better consistency of the inpainted results in terms of the local and global context.

2. We validate our method on benchmark datasets including the CelebA faces, Paris StreetView, and DTD texture datasets, and show that our method achieves better inpainting results in terms of visual quality and evaluation metrics, outperforming state-of-the-art results. To the best of our knowledge, this is the first work that explores the benefits of using frequency domain information for image inpainting.

Traditional Inpainting Techniques
Diffusion-based image completion methods [1][2][3][4] are based on Partial Differential Equations (PDE) where a diffusive process is modeled using PDE to propagate colors into the missing regions. These methods work well for inpainting small missing regions but fail to reconstruct the structural component or texture for larger missing regions.
Patch-based algorithms, on the other hand, are based on iteratively searching for similar patches in the existing image and pasting/stitching the most similar block onto the image. Efros and Freeman [5] first proposed a patch-based algorithm for texture synthesis based on this philosophy. These algorithms perform well on textured images by assuming that the texture of the missing region is similar to the rest of the image. However, they often fail at inpainting missing regions in natural images, because the patterns in such images are locally unique. Moreover, these methods are computationally expensive because of the need to compute similarity scores for every target-source pair. For more accurate and faster image inpainting, several optimal patch search based methods were proposed by Drori et al. [6] (a fragment-based image completion algorithm) and Criminisi et al. [7] (a patch-based greedy sampling algorithm). Another optimization method to synthesize visual data (images or video) based on a bi-directional similarity measure was proposed by Simakov et al. [8]. Afterward, these techniques were expedited by Barnes et al. [9], who proposed PatchMatch, a fast randomized patch search algorithm that reduced the high computational and memory cost. Later, such patch-based image completion techniques were improved by Darabi et al. [10] by incorporating gradient-domain image blending, by He et al. [11] by computing the statistics of patch offsets, and by Ogawa et al. [19] by optimizing sparse representations w.r.t. the SSIM perceptual metric. However, these methods rely only on existing image patches and use low-level image features. Therefore, they cannot perform semantically aware patch selection and are not effective at filling in complex structures.

Deep Learning-based Inpainting
Recently, CNN models [20] have shown tremendous success in solving high-level tasks such as classification, object detection, and segmentation, as well as low-level tasks such as image inpainting. Xie et al. [21] proposed Stacked Sparse Denoising Auto-encoders (SSDA), which combine sparse coding with deep networks pre-trained as denoising auto-encoders, to solve the blind image inpainting task. Blind image inpainting is harder because the missing pixel locations are not available to the algorithm, which has to learn to find the missing pixel locations and then restore them. Kohler et al. [22] showed a mask-specific deep neural network-based blind inpainting technique for filling in small missing regions in an image. Chaudhury et al. [23] attempted to solve this problem by proposing a lightweight fully convolutional network (FCN) and demonstrated that their method can achieve performance comparable to the sparse coding based K-singular value decomposition (K-SVD) [24] technique. However, these inpainting approaches were limited to very small masks. More recently, adversarial learning-based inpainting algorithms have shown promising results on image inpainting problems because of their ability to learn and synthesize novel and visually plausible content for different images, such as objects [12], scene completion [13], faces [25], etc. A seminal work by Pathak et al. [12] showed that their proposed Context Encoder network can predict missing pixels of an image based on the context of the surrounding areas of that region. They used both a standard ℓ2 loss and an adversarial loss [18] to train their network. Later, Iizuka et al. [13] demonstrated that their encoder-decoder model could reconstruct pixels in the missing region that are consistent both locally and globally, by leveraging the benefits of dilated convolution layers, a variant of standard convolutional layers.
Similar to [12], this approach also uses adversarial learning for image completion, but unlike [12] it can handle arbitrary image and mask sizes, thanks to the proposed global and local context discriminator networks. Recently, Yu et al. [14] introduced the concept of attention for solving the image inpainting task by proposing a novel contextual attention layer, and trained a unified feed-forward generative network with a reconstruction loss and two Wasserstein GAN losses [26,27]. They showed that their method can inpaint images with multiple missing regions of different sizes located arbitrarily in the image. Later, Liu et al. [28] proposed a partial convolution layer with an automatic mask-update rule that can handle free-form/irregular masks. Here, the mask is updated in such a way that the missing pixels are predicted based on the real pixel values of the original image where the partial convolution can operate. Song et al. [15] showed that it is possible to perform image inpainting using segmentation information. To this end, they proposed a model that first predicts the segmentation labels of the corrupted image and then fills in the segmentation mask, which serves as guidance to complete the image. Nazeri et al. [16] introduced an edge generator that first predicts the edges of the missing regions and then uses the predicted edges as guidance to complete the image. Yu et al. [17] proposed a gated convolution based approach to handle free-form image completion.

Frequency-domain Learning
Recently, enabling networks to learn information in the frequency domain has gained popularity, because frequency domain information contains rich representations that allow a network to perform image understanding tasks better than when using only spatial domain information. Gueguen et al. [29] proposed image classification using features from the frequency domain. Xu et al. [30] showed that it is possible to perform object detection and instance segmentation by learning in the frequency domain, with only slight modifications to existing CNN models that take RGB input. In this paper, we propose to use frequency domain information along with spatial domain information to achieve better image inpainting performance.

Proposed Method
Given a corrupted input image, our aim is to predict the missing region in such a way that it looks similar to a clean image to human eyes. In this paper, we propose a frequency-based non-blind image inpainting framework that consists of two stages: i) a frequency domain deconvolution network and ii) a refinement network. The overall framework of the proposed method is shown in Figure 2. In the first stage, we compute the DFT of the masked image (both magnitude and phase information) and of the original RGB image, and train a CNN for deconvolution to learn the mapping between the two signals by minimizing the ℓ2 loss. Here we formalize the problem of inpainting in the spatial domain as deconvolution in the frequency domain. We employ the feed-forward denoising convolutional neural network (DnCNN) [31], a manifestation of deconvolution, which uses residual learning to predict the denoised image. The motivation behind this DFT-based deconvolution operation is to learn a better representation of the global structure that can serve as guidance to the second network. In the second stage, we use the spatial domain information (of the masked image and the mask) and train a generative adversarial network (GAN) based model [18] by minimizing an adversarial loss along with an ℓ2 loss. The motivation for this stage is to fine-tune the output of the first stage by refining the structural details and matching the color distribution of the true image at a local scale. The various components of our model are explained in the following subsections.
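As a minimal sketch of moving between the two domains (a NumPy stand-in, not the authors' code; the function names `to_frequency` and `to_spatial` are ours), the per-channel DFT and its inverse are lossless up to floating-point error, so no image information is discarded by working in the frequency domain:

```python
import numpy as np

def to_frequency(img):
    """Per-channel 2-D DFT: every frequency bin depends on all spatial
    pixels, which is what gives the representation its global context."""
    return np.fft.fft2(img, axes=(0, 1))

def to_spatial(freq):
    """Inverse 2-D DFT; for real images the imaginary residue is
    only numerical noise, so we keep the real part."""
    return np.real(np.fft.ifft2(freq, axes=(0, 1)))

rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))          # stand-in for a 64x64 RGB image
recon = to_spatial(to_frequency(img))  # DFT -> IDFT round trip
```

The round trip recovers the image exactly (to machine precision), which is why the first stage can operate entirely on the complex DFT coefficients and still hand a faithful spatial image to the refinement stage.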

Problem Formulation
Let us consider I_in as the corrupted/incomplete input image, I_gt as the ground truth image, and I^1_pred as the predicted output image after the first stage. At first, we calculate the DFT of I_in and I_gt as I^f_in = DFT(I_in) and I^f_gt = DFT(I_gt). Let us consider a mask function in the spatial domain as M, with its frequency domain counterpart M^f. A masked image is represented as I_in(x, y) = I_gt(x, y) ⊙ M(x, y), where ⊙ denotes element-wise multiplication. Our contribution in this paper is to analyze this relation between the frequency domain signals of I_in, I_gt, and M. For example, if we consider a mask of size (2W, 2H), the power spectral density of the DFT of the (one-dimensional) mask signal can be given as

    |M^f(k)|^2 = [sin(2πWk/N) / sin(πk/N)]^2,

where k = 0, 1, ..., (N − 1) represents the discrete frequency, with N being the number of samples. The frequency domain representation of the mask signal is shown in Figure 3, which depicts a decaying pulse from the origin. By the convolution-multiplication property of the DFT, we can show that the multiplication of the mask with the image in the spatial domain is equivalent to the convolution of the mask with the image in the frequency domain (Figure 3). Mathematically, this is represented as

    I^f_in = I^f_gt * M^f,

where * denotes the (circular) convolution operation, i.e., the masked frequency signal is the output of the convolution of the mask and clean image DFT signals. Therefore, we perform a deconvolution operation to predict the missing region of the incomplete image. Let F(·; θ) be the deconvolutional neural network that converts the corrupted input into the first stage prediction. After calculating I^f_in and I^f_gt, we train the network to learn the mapping between them; we denote the frequency domain prediction as I^{1f}_pred = F(I^f_in; θ). Next, we perform an inverse DFT of the first stage output and get the predicted output image I^1_pred = IDFT(I^{1f}_pred).
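The convolution-multiplication property invoked above can be verified numerically in one dimension. This is a NumPy sketch under the usual DFT normalization, in which the frequency-domain convolution carries a 1/N factor:

```python
import numpy as np

N = 16
rng = np.random.default_rng(1)
x = rng.random(N)            # 1-D "image" row
m = np.zeros(N)
m[4:12] = 1.0                # 1-D rectangular mask (1 = known pixel)

X, M = np.fft.fft(x), np.fft.fft(m)

def circular_convolve(a, b):
    """Direct O(N^2) circular convolution of two equal-length sequences."""
    n = len(a)
    return np.array([sum(a[j] * b[(k - j) % n] for j in range(n))
                     for k in range(n)])

# Element-wise masking in the spatial domain equals
# (1/N) * circular convolution of the DFTs in the frequency domain.
lhs = np.fft.fft(x * m)
rhs = circular_convolve(X, M) / N
```

Because every DFT bin of the masked signal is a mixture of all bins of the clean signal, undoing the masking really is a deconvolution problem with global support, which is what the first stage network is trained to solve.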

Network Architecture
To perform the deconvolution operation in the frequency domain, we adopt a CNN model with 17 layers, similar to [31]. This deconvolution network contains three types of layers, as shown in Figure 2. The first layer is a Conv layer with ReLU non-linearity, using 64 filters of size (3×3×3). The next layers (2nd–16th) are each a combination of a Conv layer, a batch normalization layer [32], and a ReLU layer, using 64 filters of size (3×3×64). The last layer is a Conv layer using 3 filters of size (3×3×64) to reconstruct the output. Details of our first stage deconvolution network are given in Table 1.
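The layer layout above can be captured as a small spec table. This is a sketch for checking filter shapes and counting weights; the tuple format and names are ours, and biases and batch-norm parameters are ignored in the count:

```python
# Hypothetical spec of the 17-layer DnCNN-style deconvolution network.
# Each entry: (kernel_h, kernel_w, in_channels, out_channels, has_bn, has_relu)
layers = (
    [(3, 3, 3, 64, False, True)]          # layer 1: Conv + ReLU
    + [(3, 3, 64, 64, True, True)] * 15   # layers 2-16: Conv + BN + ReLU
    + [(3, 3, 64, 3, False, False)]       # layer 17: Conv reconstructing 3 channels
)

def conv_weights(spec):
    """Weight count of one conv layer (kernel area x in x out channels)."""
    kh, kw, cin, cout, _, _ = spec
    return kh * kw * cin * cout

total_weights = sum(conv_weights(s) for s in layers)
```

All convolutions keep 3×3 kernels and a constant 64-channel width in the middle of the network, so the parameter budget is dominated by the fifteen 64-to-64 layers.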

Training
To train our deconvolution network, we use an ℓ2 loss that minimizes the distance between the DFT of the ground-truth image, I^f_gt, and the DFT of the inpainted image, I^{1f}_pred:

    L_deconv = || I^{1f}_pred − I^f_gt ||_2^2 .

After training the first stage deconvolution network, we compute the inverse DFT of I^{1f}_pred, which is used as guidance to train the refinement stage as shown in Figure 2. We choose the frequency domain for the first network because it contains rich information [30,33] for high-frequency preservation.
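A minimal NumPy sketch of this frequency-domain ℓ2 objective (our own formulation; the authors train a CNN to produce the prediction, which we replace with a dummy array here):

```python
import numpy as np

def freq_l2_loss(pred_rgb, gt_rgb):
    """Mean squared error between the DFTs of prediction and ground truth.
    The complex difference magnitude penalizes both magnitude and phase
    errors, so high-frequency structure cannot be silently dropped."""
    diff = np.fft.fft2(pred_rgb, axes=(0, 1)) - np.fft.fft2(gt_rgb, axes=(0, 1))
    return np.mean(np.abs(diff) ** 2)

rng = np.random.default_rng(2)
gt = rng.random((8, 8, 3))
loss_self = freq_l2_loss(gt, gt)                    # identical inputs
loss_other = freq_l2_loss(rng.random((8, 8, 3)), gt)
```

By Parseval's theorem this loss is proportional to a spatial-domain ℓ2 loss, but computing it per frequency bin makes it natural to monitor (or reweight) how well the high-frequency bins are being reconstructed.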

Refinement Network
The refinement network is a GAN-based model [18]; GANs have shown promising results in generative modeling of images [34] in recent years. Our refinement network has a generator and a discriminator network. The generator takes the output of the first stage (the frequency domain deconvolution module), the original masked image, and the corresponding binary mask (spatial domain information) as input, and outputs the generated image. The discriminator takes the generator output and, through adversarial training, drives the generator to minimize the Jensen-Shannon divergence between the generated and real data distributions, matching the color distribution and structural details of the output image to the true image.

Network Architecture
Generator: We adapt the generator architecture from Johnson et al. [35], which has exhibited good performance on image-to-image translation tasks [36]. Our generator network is an encoder-decoder architecture having three convolution layers for down-sampling, eight residual blocks [37], and three convolution layers for up-sampling. Here, the Conv-2 and Conv-3 layers are stride-2 convolution layers responsible for down-sampling twice, and the Conv-4 and Conv-5 layers are transpose convolution layers responsible for up-sampling twice back to the original image size. We use instance normalization [38] and the ReLU activation function across all layers of the generator network.
Algorithm 1 Training the refinement network.
1: while Generator G has not converged do
2:   Sample batch images I_in from training data;
3:   Generate random masks M;
4:   Construct combined input (I_in, M, and I^1_pred);
5:   Get masked region prediction I^2_pred = G(I_in, M, I^1_pred);
6:   Generate inpainted image by modifying the masked region: I_pred ← I_in + I^2_pred ⊙ (1 − M);
7:   Update G with ℓ1 loss and adversarial critic loss;
8:   Update discriminator critic D with I_in, I_pred;
9: end while

Discriminator: We adapt the discriminator network from [36,39], a Markovian discriminator similar to the 70×70 PatchGAN. The advantage of using a PatchGAN discriminator is that it has fewer parameters than a standard discriminator, because it operates on image patches rather than the entire image. Furthermore, it can be applied to arbitrarily-sized images in a fully convolutional fashion [36,39]. We apply a sigmoid function after the last convolution layer, which produces a 1-dimensional output score predicting whether each 70×70 overlapping image patch is real or fake. To stabilize discriminator training, we use spectral normalization [40] as our weight normalization method. Moreover, we use leaky ReLUs [41] with a slope of 0.2. The details of our second stage refinement network (generator and discriminator) and the output size of each layer are given in Table 2.

Training
After obtaining the first stage output, we feed it to the refinement network along with the spatial domain information (the masked image and the mask). During training, the generator G of the inpainting network takes the combination of the input image I_in, the image mask M, and the first stage output image I^1_pred, and generates I^2_pred = G(I_in, M, I^1_pred). Then, by adding I^2_pred to the input image, we get the completed image as I_pred = I_in + [I^2_pred ⊙ (1 − M)]. The training procedure of the refinement stage is described in Algorithm 1. We train our refinement module using two loss functions: a reconstruction loss and an adversarial loss [18]. For the reconstruction loss, we use the ℓ1 loss [12] that minimizes the distance between the clean/ground-truth image I_gt and the completed/inpainted image I_pred:

    L_rec = || I_pred − I_gt ||_1 ,

where I_pred ← I_in + G(I_in, M, I^1_pred) ⊙ (1 − M). For the adversarial loss, we follow the min-max optimization strategy, where the generator G is trained to produce inpainted samples from the artificially corrupted images such that the inpainted samples appear as "real" as possible, and the adversarially trained discriminator critic D tries to distinguish between the ground truth clean samples and the generator predictions/inpainted samples. The objective function can be expressed as

    min_G max_D L_adv = E_{x ~ P_r}[log D(x)] + E_{x̃ ~ P_g}[log(1 − D(x̃))],

where P_r is the real/ground truth data distribution and P_g is the model/generated data distribution defined by x̃ = G(I_in, M, I^1_pred). Thus, our overall loss function for the refinement stage becomes

    L_total = λ_1 L_rec + λ_2 L_adv,

where λ_1 = 1 and λ_2 = 0.1. The weighted sum of these two loss functions complements each other in the following way: i) the GAN loss helps to improve the realism of the inpainted images by fooling the discriminator.
ii) The ℓ1 reconstruction loss serves as a regularization term for training GANs [14], helps stabilize GAN training, and encourages the generator to generate images from modes that are close to the ground truth in an ℓ1 sense.
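The compositing step and the weighted objective above can be sketched as follows (NumPy, with a random array standing in for the generator output; mask convention as in the text: 1 marks a known pixel, and the masked input is already zero inside the hole):

```python
import numpy as np

def composite(i_in, g_out, mask):
    """Keep the known pixels from the masked input and take the
    generator's prediction only inside the hole (mask == 1 = known,
    and i_in is already zero where mask == 0)."""
    return i_in + g_out * (1.0 - mask)

def total_loss(l_rec, l_adv, lam1=1.0, lam2=0.1):
    """Weighted refinement-stage objective: lam1*L_rec + lam2*L_adv."""
    return lam1 * l_rec + lam2 * l_adv

rng = np.random.default_rng(3)
gt = rng.random((8, 8, 3))
mask = np.ones((8, 8, 3))
mask[2:6, 2:6, :] = 0.0        # square hole in the middle
i_in = gt * mask               # corrupted input
i_pred = composite(i_in, rng.random((8, 8, 3)), mask)
```

Note that compositing makes the known pixels of the output exactly equal to the input, so the reconstruction and adversarial losses only ever act on what the generator synthesized inside the hole and on its seam with the surroundings.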

Implementation Details
Our proposed model is implemented in PyTorch. In our experiments, we resize the images to 64×64 and linearly scale the pixel values from the range [0, 255] to [−1, 1]. For the first stage, we initialize the weights using He initialization [42] and use an SGD optimizer with a weight decay of 0.0001, momentum of 0.9, and mini-batch size of 128. To train the first stage network, we decay the learning rate exponentially from 10^−1 to 10^−4 over 50 epochs. For the second stage, our generator G and discriminator D are trained together using the following settings: i) G and D learning rates of 10^−4 and 10^−5, respectively; ii) optimization using the Adam optimizer [43].
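One plausible realization of the exponential decay from 10^−1 to 10^−4 over 50 epochs is a constant per-epoch multiplier (our own scheduling sketch; the authors' exact schedule may differ):

```python
# Exponentially decay the learning rate from 1e-1 (epoch 0) to 1e-4 (epoch 49).
START, END, EPOCHS = 1e-1, 1e-4, 50
factor = (END / START) ** (1.0 / (EPOCHS - 1))   # constant per-epoch multiplier

def lr_at(epoch):
    """Learning rate at a given epoch under geometric decay."""
    return START * factor ** epoch

schedule = [lr_at(e) for e in range(EPOCHS)]
```

In PyTorch this corresponds to `torch.optim.lr_scheduler.ExponentialLR` with `gamma` set to the same per-epoch factor.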

Experiments
In this section, we evaluate the inpainting performance of our proposed method on three standard datasets: the CelebFaces Attributes Dataset (CelebA) [44], Paris StreetView [45], and the Describable Texture Dataset (DTD) [46]. For our experiments, we use both regular and irregular masks. Regular masks refer to square masks of fixed size covering 25% of the total image pixels, randomly located in the image. For irregular masks, during training we use the masks from the work of Liu et al. [28]; this irregular mask dataset contains augmented versions of each mask (rotated by 0, 90, 180, and 270 degrees, and horizontally reflected) and is divided according to the percentage of the image covered by the mask, in increments of 10% (0-10%, 10-20%, etc.).
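The regular-mask setup can be sketched as follows (assumed details for illustration: a 64×64 image with a 32×32 hole, which covers exactly 25% of the pixels; the function name is ours):

```python
import numpy as np

def random_square_mask(img_size=64, coverage=0.25, rng=None):
    """Binary mask (1 = known, 0 = hole) with a randomly placed square
    hole covering `coverage` of the pixels."""
    rng = rng or np.random.default_rng()
    side = int(round(img_size * coverage ** 0.5))   # 32 for 64x64 at 25%
    top = rng.integers(0, img_size - side + 1)
    left = rng.integers(0, img_size - side + 1)
    mask = np.ones((img_size, img_size))
    mask[top:top + side, left:left + side] = 0.0
    return mask

mask = random_square_mask(rng=np.random.default_rng(4))
hole_fraction = 1.0 - mask.mean()
```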

Qualitative Evaluation
Figures 4 and 5 compare the inpainting results of our method with previous image inpainting methods, PatchMatch (PM) [9], Context Encoder (CE) [12], Contextual Attention (CA) [14], and Generative Inpainting (GI) [17], for regular masks on the CelebA and Paris StreetView datasets. The last six columns of these figures show the magnitude spectrum of the DFT map obtained from the different methods [9,12,14,17], our method (first stage reconstruction), and the ground truth image. We can see that PM copies incorrect patches into the missing regions, whereas the other methods (CE, CA, GI) sometimes fail to achieve plausible results and generate distinct artifacts. In contrast, our method restores the missing regions with sharp structural details, minimal blurriness, and hardly any "checkerboard" artifacts. Moreover, the inpainting results of our method look the most similar to the ground truth images. We conjecture that, in the presence of frequency domain information, the network efficiently learns the high-frequency details, which enables it to preserve the structural details in the restored image. This can be confirmed from the DFT maps, where we see that our deconvolution network learns to predict the missing region in such a way that the DFT map of our first stage reconstruction looks similar to that of the ground truth image. The refinement network then uses this frequency domain information to produce better inpainting results. We also show the performance of our proposed method on the CelebA and Paris StreetView datasets for irregular masks. Figure 6 shows the inpainting results using Generative Inpainting (GI) [17] and our proposed method for different mask sizes (10-50%). Our method can generate photo-realistic images having texture and structures similar to the original clean images even when a large region (50-60%) of the image is missing.
Table 3: Quantitative results on CelebA [44], Paris StreetView [45], and the DTD texture dataset [46] for different inpainting models: PatchMatch (PM) [9], Context Encoder (CE) [12], Contextual Attention (CA) [14], Generative Inpainting (GI) [17], and Ours. The best result in each row is shown in bold. − Lower is better. + Higher is better.

Quantitative Evaluation
We report the quantitative performance of our method in terms of the following metrics: i) peak signal-to-noise ratio (PSNR); ii) structural similarity index (SSIM) [47]; and iii) mean absolute error (MAE). Table 3 compares the metric values on the CelebA, Paris StreetView, and DTD datasets for the state-of-the-art inpainting methods and our method. Our method outperforms previous methods in terms of these metrics on both regular and irregular masks, demonstrating the effectiveness of using frequency domain information. Note that we obtain the metrics for the Context Encoder [12] by using the ℓ1 and adversarial loss in our network settings.
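For reference, PSNR and MAE can be computed as below for images scaled to [0, 1] (a generic sketch, not the authors' evaluation code; SSIM is omitted for brevity):

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between two images."""
    return np.mean(np.abs(pred - gt))

def psnr(pred, gt, peak=1.0):
    """Peak signal-to-noise ratio in dB for images with maximum value `peak`."""
    mse = np.mean((pred - gt) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

gt = np.zeros((4, 4))
noisy = gt + 0.1        # uniform error of 0.1 -> MSE = 0.01 -> PSNR = 20 dB
```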

Ablation Study
We perform an ablation study to investigate the role of our frequency deconvolution network and to analyze the effect of the different loss components used to train our model. Figure 7 shows the inpainting results using only the ℓ1 loss, the ℓ1 loss with the adversarial loss, and our proposed method of incorporating frequency domain information (the DFT component). We can see blurry reconstructions in Figure 7a) when we use only the ℓ1 loss in the spatial domain. Inpainting performance improves to a certain extent when the adversarial loss component is added; nevertheless, in Figure 7b) we can still find structural and blurry artifacts in the reconstructions. Figure 7c) demonstrates the inpainting results of our proposed method of training the model using both the frequency and spatial components. We can see that with our method the model performs significantly better, restoring fine structural details. Therefore, we conclude that training the model with frequency domain information helps the network learn high-frequency components and restore the missing region with better reconstruction quality.

Conclusion
We presented a frequency-based image inpainting algorithm that enables the network to use both frequency and spatial information to predict the missing region of an image. Our model first learns the global context using frequency domain information and selectively reconstructs the high-frequency components. It then uses the spatial domain information as guidance to match the color distribution of the true image and to fine-tune the details and structures obtained in the first stage, leading to better inpainting results. Experimental results show that our method outperforms the state of the art on challenging datasets, generating sharper details and perceptually realistic inpainting results. Based on our empirical results, we believe that methods using both frequency and spatial information should gain dominance because of their superior performance. In the future, we plan to extend this work to other frequency domain transformations, e.g., the Discrete Cosine Transform, and to other image restoration tasks, e.g., image denoising.

Acknowledgments
This work was partially supported by JSPS KAKENHI Grant (JP19K22863).