Manipulation and generation of synthetic satellite images using deep learning models

Abstract. Generation and manipulation of digital images based on deep learning (DL) are receiving increasing attention for both benign and malevolent uses. As the importance of satellite imagery grows, DL has also started being used for the generation of synthetic satellite images. However, techniques developed for computer vision applications cannot be used directly, due to the different nature of satellite images. The goal of our work is to describe a number of methods to generate manipulated and synthetic satellite images. Specifically, we focus on two different types of manipulations: full image modification and local splicing. In the former case, we rely on generative adversarial networks commonly used for style transfer applications, adapting them to implement two different kinds of transfer: (i) land cover transfer, aiming at modifying the image content from vegetation to barren and vice versa, and (ii) season transfer, aiming at modifying the image content from winter to summer and vice versa. With regard to local splicing, we present two different architectures. The first one uses an image generative pretrained transformer and is trained on pixel sequences in order to predict pixels in semantically consistent regions identified using watershed segmentation. The second technique uses a vision transformer operating on image patches rather than on a pixel-by-pixel basis. We use the trained vision transformer to generate synthetic image segments and splice them into a selected region of the to-be-manipulated image. All the proposed methods generate highly realistic synthetic satellite images. Among the possible applications of the proposed techniques, we mention the generation of proper datasets for the evaluation and training of tools for the analysis of satellite images.


Introduction
Manipulation and generation of synthetic images by means of deep learning (DL) architectures are receiving increasing attention due to the demand for large labeled datasets for artificial intelligence (AI) applications [1,2]. The usage of synthetically generated images for the entertainment industry and even for malevolent disinformation campaigns is also growing. Moreover, satellite images are receiving increasing attention in several application areas, including meteorological forecasts and the monitoring and detection of natural disasters. As a result, the number of commercial satellites is constantly growing and the accessibility of imagery is increasing on a daily basis.][5] Malevolent uses of synthetic and manipulated satellite images are also possible. The development of tools for the detection and localization of manipulated satellite images [6][7][8][9][10][11] also requires the availability of adequate training and test datasets. The goal of this paper is to describe a number of methods to generate globally and locally manipulated satellite images.
While deep neural networks have been successfully applied to the generation and tampering of natural images and multimedia content, their use to generate synthetic satellite images has received limited attention. In addition, the tools developed for media applications cannot be used directly to generate synthetic satellite images, due to the different nature of such images in terms of semantic content, number of bands, bit depth, and spatial resolution. In this paper, we present a number of DL architectures for the generation of synthetic and manipulated satellite images, focusing on two different kinds of manipulations: full image modification and local splicing. We demonstrate the validity of the proposed methods using Sentinel-2 images [12].
Generative adversarial networks (GANs) are popular DL architectures widely used for both synthetic image generation [13] and image style transfer [14]. In this paper, we use them for the global manipulation of satellite images. In particular, we use them for transferring the style of satellite images in such a way as to change their overall content and semantic meaning. In contrast, we do not report any efforts to generate satellite images from scratch (as done, for example, in Ref. 15). Specifically, we apply two different kinds of image style transfer. The first one, referred to as land cover transfer, aims at changing the land cover of the manipulated images from vegetation to barren and vice versa. To do so, we rely on a properly trained version of the CycleGAN architecture [16]. The goal of the second global transformation is season transfer, whereby summer satellite images are transformed into winter ones and vice versa. For this transformation, we use the pix2pix GAN architecture [14]. The final goal of these transformations is to obtain fully synthetic images that can be used to construct large labeled datasets.
With regard to local tampering, we present two types of transformer-based image generation techniques [17]. In particular, we generate the manipulated images by inserting synthetic splices into irregular regions of the target image, making sure that the spliced boundaries are not visible. Transformers were initially used for natural language processing applications [18]. However, in the last few years, they have also been used to process still images [17,19]. The first architecture we present is the image generative pretrained transformer (iGPT) [19], which is an autoregressive network trained to predict pixels without observing the entire region of the 2D input image. The iGPT is trained to generate the missing part of a satellite image from its neighboring pixels. In this way, parts of the image can be removed without introducing noticeable artifacts. The second method is based on a vision transformer, originally developed for image classification [17]. We use the vision transformer to create synthetic regions (splices) of an image by modifying the output of the last layer of the transformer. The final goal of the local image manipulation architectures is the creation of a large labeled dataset of images with synthetically generated splices.
The remainder of this paper is organized as follows. In Sec. 2, we provide a brief review of the state of the art of satellite image generation and manipulation using DL networks. In Sec. 3, we list the datasets used throughout our work. Then, in Sec. 4, we describe the architectures for the generation of synthetic manipulated images. In Sec. 5, we provide some examples of the images produced by the proposed architectures. Finally, in Sec. 6, we summarize the results of our work and make some final remarks.

Relevant Work
1][22] Lately, they have also been used to generate synthetic images with nonfacial content [23]. Even more recently, these techniques have started being used to generate synthetic remote sensing images. In Ref. 15, a progressive GAN [20] was trained on all 180k samples of the SEN12MS dataset [24] to generate 13-band images that imitate Sentinel-2 level-1C products [25]. GANs have also been used to generate different types of satellite images that are correlated with the input images. For example, Fuentes Reyes et al. [3] used a GAN to generate optical images starting from synthetic aperture radar (SAR) images. They did so by training a CycleGAN architecture with 512 × 512 patches of SAR images as input (domain A) and optical images as a reference (domain B). They constrained the output of the network to be a gray-scale image similar to SAR images.0][31][32] A pix2pix architecture was used in Ref. 5 to generate satellite images starting from 256 × 256 historical maps and RGB optical satellite reference images.
Despite the increasing interest in the use of GANs for satellite images, only a few works have used GANs to change the semantic content of existing images, which is the goal of this paper. To the best of our knowledge, the only papers dealing with this issue are Refs. 15, 33, and 34. Abady et al. [15] proposed an image-to-image translation approach to change the land cover of a satellite image. Specifically, a NICE GAN [35] is applied to achieve land cover transfer on four-band [RGB and near-infrared (NIR)] images of 480 × 480 resolution. The land cover transfer consists of transforming vegetation terrains to barren and vice versa. A similar task is addressed in Ref. 33, where the authors exploit a CycleGAN architecture to translate 10 bands of a Sentinel-2 level-1C image, namely the 10- and 20-m bands, from barren to vegetation and vice versa. Finally, in Ref. 34, a method is presented for the creation of synthetic images having the urban structure of a given city (i.e., Tacoma in Washington, USA) but with the landscape features of another city (i.e., Seattle in Washington, USA and Beijing, China). A CycleGAN architecture is used to transfer the style of the cartoDB basemap to satellite images and vice versa. The CycleGAN model is trained on a specific city B to generate simulated satellite images from the basemap of another city A.
The techniques for global manipulations proposed in this paper focus on land cover and season transfer. With regard to land cover transfer, the method we propose is a highly improved version of the technique described in Ref. 15. The new approach is based on CycleGAN instead of NICE GAN and produces transferred images with very good quality in both directions (vegetation to barren and vice versa). In Ref. 15, instead, the transfer was successful only in one direction (vegetation to barren), whereas in the other direction (barren to vegetation), the quality of the transferred images was poor [15]. As to season transfer, this is a new kind of manipulation that has never been addressed before. In particular, we propose an architecture that transforms a satellite image taken in the winter (summer) into its summer (winter) counterpart.
With regard to local splicing, to the best of our knowledge, no work has been proposed in the literature performing local tampering of remote sensing images using generative models. Therefore, our paper represents a first attempt in this direction.

Datasets
In this section, we describe the datasets we have used in our experiments to demonstrate the validity of the techniques we have developed. The datasets have been used to train the architectures proposed in this paper and to assess their performance. All the datasets are obtained starting from Sentinel-2 products [12]. Our methods can also be used on imagery from other satellites; we have chosen Sentinel-2 images because of their availability for research purposes.
For global manipulations, we have used Sentinel-2 level-1C images, whereas for local manipulations, we have used both Sentinel-2 level-1C images and Sentinel-2 level-2A images. Sentinel-2 level-1C images consist of 13 bands, with band 2 representing the blue channel, band 3 the green channel, and band 4 the red channel. Band 8 is one of the NIR channels. Bands 2 (blue), 3 (green), 4 (red), and 8 (NIR) have a spatial resolution of 10 m, with size 10,980 × 10,980 pixels. Six other bands (bands 5, 6, 7, 8A, 11, and 12) have a spatial resolution of 20 m, for a size of 5490 × 5490 pixels. Finally, bands 1, 9, and 10 have a spatial resolution of 60 m, with size equal to 1830 × 1830 pixels. Sentinel-2 level-2A is the bottom-of-atmosphere product obtained by applying atmospheric correction to top-of-atmosphere level-1C products. Its bands are similar to those of the level-1C products, except for band 10, which is not present. All the Sentinel-2 level-1C images were downloaded directly from the ESA Copernicus hub [36], whereas Sentinel-2 level-2A images were taken from the SEN12MS dataset [24]. In Table 1, we summarize the datasets that we have used in our research.
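As an illustration of the band layout described above, the following minimal sketch maps each band to its ground sampling distance and raster size. The band identifiers (B02, B8A, and so on) are used only as labels, and the tile extent of 109,800 m is an assumption inferred from 10,980 pixels at 10 m.

```python
# Sketch of the Sentinel-2 band layout described in the text.
# Assumption: one product covers 109,800 m per side (10,980 px at 10 m).
TILE_EXTENT_M = 109_800

# Bands grouped by ground sampling distance (GSD) in metres.
BANDS_BY_GSD_M = {
    10: ["B02", "B03", "B04", "B08"],
    20: ["B05", "B06", "B07", "B8A", "B11", "B12"],
    60: ["B01", "B09", "B10"],  # B10 is not present in level-2A
}


def raster_side(gsd_m):
    """Number of pixels per side for a band sampled at gsd_m metres."""
    return TILE_EXTENT_M // gsd_m


def band_shapes(level="1C"):
    """Return {band: (rows, cols)} for a level-1C or level-2A product."""
    shapes = {}
    for gsd, bands in BANDS_BY_GSD_M.items():
        for band in bands:
            if level == "2A" and band == "B10":
                continue  # level-2A drops band 10
            shapes[band] = (raster_side(gsd), raster_side(gsd))
    return shapes
```

Calling `band_shapes()` yields 13 entries for level-1C and `band_shapes("2A")` yields 12, with sides of 10,980, 5490, and 1830 pixels for the three resolutions.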

Alps Dataset
We designed the Alps dataset to train and test the architecture for the season transfer manipulation. The alpine area, in fact, is characterized by marked differences between winter and summer, with winter images largely covered by snow and summer images containing large areas of green vegetation. For this dataset, we only selected the RGB and NIR bands, with ground sampling distance (GSD) equal to 10 m, for an image size of 10,980 × 10,980 pixels and 16 bits per pixel. We collected images representing exactly the same area taken in two different months, each month representing a different season. In this way, in addition to using the images for training the season transfer architecture, we also have a way to compare the results of the season transfer with ground-truth images. In particular, we selected images taken in June 2019 for the summer dataset and in December 2019 for the winter dataset.
The procedure we followed to build the dataset is as follows. To start, we selected only images with limited cloud coverage. Since it was not possible to get only images with 0% cloud coverage, we limited the search to images with a cloud coverage lower than 9%. We extracted the bands of interest (RGB and NIR) from the downloaded products as JPEG 2000 images. We used the GDAL software library [37] for reading and writing raster and vector geospatial data formats. Specifically, we used the gdal_retile command to tile the downloaded images from their initial size into several 512 × 512 images. Then we removed the edge tiles. Finally, we paired the images of areas existing in both the winter and summer domains. As a result, we built a dataset of 3936 pairs of images with 512 × 512 resolution. In Fig. 1, we show the RGB version of some sample images of the Alps dataset.
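The tiling step can be sketched in a few lines. This is only an illustration, not the GDAL-based pipeline used in the paper: it assumes that "edge tiles" are the incomplete tiles along the right and bottom borders, and the function name is hypothetical.

```python
def full_tile_offsets(raster_side=10_980, tile=512):
    """Top-left (row, col) offsets of all complete tile x tile blocks.

    Assumption: incomplete tiles on the right and bottom borders are the
    "edge tiles" that get discarded.
    """
    n = raster_side // tile  # complete tiles per side
    return [(r * tile, c * tile) for r in range(n) for c in range(n)]


offsets = full_tile_offsets()
print(len(offsets))  # 441 complete tiles (21 per side)
```

Under this assumption, a 10,980-pixel side holds 21 complete 512-pixel tiles, leaving a 228-pixel strip that is dropped.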

Scandinavian Dataset
To validate the effectiveness of the season transfer architecture on different kinds of landscapes, we built another dataset with marked differences between winter and summer. The dataset includes images from Scandinavia and was built similarly to the Alps dataset. In this case, the selected date range includes June 2020 for the summer and February 2020 for the winter. For both domains, we selected only images with 0% cloud coverage. The final dataset consists of 9000 paired images of size 512 × 512. Figure 1 shows some RGB examples of the images from the Scandinavian dataset.

Land Cover Datasets
The second kind of global manipulation we have considered aims at transforming barren covers into vegetation areas and vice versa. For this reason, we created the land cover dataset by selecting images in which one of the two types of cover is predominant with respect to the other. To do so, we first selected the areas of interest based on the statistics provided by the Organisation for Economic Co-operation and Development [38]. With regard to the areas with a high percentage of vegetation, we extracted images from Congo, Salvador, Montenegro, Gabon, and Guyana. The cloud coverage was set to 0%, and the range of dates spanned from June 2019 to December 2019. For the barren domain, we selected the areas of interest in South and Central America. For both domains, we used a linear discriminant analysis classifier [39] to make sure that, after cropping the images to 512 × 512 patches, they contain a high percentage of vegetation and barren soil, respectively. For each domain, we collected 10,000 images. For the vegetation domain, the average percentage of vegetation pixels in an image is 98%, with a maximum of 100% and a minimum of 60%. For the barren domain, the average percentage of barren pixels in an image is 82%, with a maximum of 100% and a minimum of 63%. In Fig. 1, we show some RGB examples for the two different domains.

SEN12MS
The SEN12MS dataset [24] was created to provide a large-scale satellite dataset for developing DL-based methods. As opposed to the previous datasets, SEN12MS has a larger variety of images regarding spatial coverage, diversity, and number of available samples. The dataset contains 180,662 triplets of multispectral patches, dual-pol SAR image patches, and MODIS land cover maps collected from Sentinel-1 and Sentinel-2 satellite imagery. The images span all seasons, with an approximately equal number of images captured during winter, spring, summer, and fall. The images' locations vary with respect to sea level elevation, climate, latitude, and urbanization level. Each patch has a resolution of 256 × 256 pixels, and we only used the RGB channels of the multispectral patches. In our experiments, we used images from Africa, Europe, Asia, Australia, and South America, for a total of ∼120,000 images. Figure 1 shows some RGB images from SEN12MS.

World Dataset
Finally, we collected images from various regions of the world by downloading them from the Copernicus Hub to construct the world dataset. The images were captured in 2018 and are equally distributed across seasons. We only used the RGB bands, which are sampled at 10-m GSD with image size equal to 10,980 × 10,980 pixels. The images can contain clouds up to 2% of the total number of pixels. From these images, we extracted nonoverlapping patches of size 512 × 512. The world dataset contains 285,768 images. We used the world dataset images to generate images containing spliced objects, as described in the subsequent sections.
Figure 2 shows four examples of world dataset images.

DNN Models
In this section, we describe the architectures we used to create the synthetic images. We start with the architectures for global manipulations, followed by those targeting local manipulations.

Season Transfer
As stated above, with regard to global manipulation, we considered two different objectives. The first objective was to transfer images taken in the winter to their summer counterpart and vice versa. Paired images, with synchronized input and ground truth images, are not difficult to get in this case since the same location is usually available for download in both seasons. For this reason, and since using paired images facilitates training, we used the pix2pix architecture [14]. Pix2pix is a variant of traditional GANs where, instead of generating an image from noise, a conditional GAN (cGAN) [40] generates the output image conditioned on an input image. A trained pix2pix model performs the transfer in one direction only, which is from winter to summer or from summer to winter, so to be able to apply bidirectional transfers, we had to train two models.

Architecture
The generator part of the pix2pix season transfer architecture consists of 8 U-Net blocks with skip connections [41], and the input size of the first layer is 512 × 512 × 4. Each block is made up of two convolutional layers, two batch normalization layers, a leaky ReLU activation function layer with dropout 0.2, and a ReLU activation function layer with dropout 0.2. As for the discriminator, we used seven convolutional layers, each followed by batch normalization and leaky ReLU activation. The input of the discriminator is the input image $x$ concatenated either with the ground truth image $y$ or with the generated image $x'$. Both the generator and the discriminator have been trained using Adam optimization [42].
The loss function used to train the pix2pix architecture is made up of two partial losses. The first one is the adversarial binary cross entropy loss $L_{adv}$:

$$L_{adv} = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x}[\log(1 - D(x, G(x)))], \tag{1}$$

where $D$ represents the discriminator network, $G$ represents the generator network, $x$ is the to-be-translated image, and $y$ is the ground truth image of $x$ in the target season. The second partial loss aims at minimizing the $L_1$ distance between the generated image $G(x)$ and its ground truth counterpart $y$:

$$L_{1} = \mathbb{E}_{x,y}[\lVert y - G(x) \rVert_1]. \tag{2}$$

The objective of the generator is to minimize a weighted combination of $L_{adv}$ and $L_1$, that is:

$$\min_{G} \; L_{adv} + \lambda L_{1}, \tag{3}$$

where $\lambda$ should be set to adjust the relative weights of the partial losses. Finally, the discriminator's objective is to distinguish real and synthetic images, that is, to maximize $L_{adv}$.
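To make the combination of the two partial losses concrete, the following is a minimal numerical sketch, assuming the discriminator outputs are probabilities in (0, 1) and images are given as flat lists of pixel values. The function names are illustrative and do not come from the authors' implementation.

```python
import math

EPS = 1e-12  # numerical guard against log(0)


def adversarial_loss(d_real, d_fake):
    """E[log D(x, y)] + E[log(1 - D(x, G(x)))] from discriminator outputs."""
    real_term = sum(math.log(p + EPS) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p + EPS) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def l1_loss(generated, target):
    """Mean absolute difference between generated and ground truth pixels."""
    return sum(abs(g - t) for g, t in zip(generated, target)) / len(target)


def generator_objective(d_real, d_fake, generated, target, lam=100.0):
    """Weighted sum minimized by the generator: L_adv + lambda * L_1."""
    return adversarial_loss(d_real, d_fake) + lam * l1_loss(generated, target)
```

With a large weight on the L1 term (the default of 100 here is only a placeholder), the generator is pushed mainly toward pixel-level agreement with the ground truth, while the adversarial term penalizes outputs that the discriminator recognizes as synthetic.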

Training
We trained the pix2pix architecture separately on the Alps and the Scandinavian datasets. Both datasets are characterized by extensive snow coverage in the winter and large green vegetation areas in the summer. The main difference between the datasets is that Alps images are mostly mountainous, whereas the majority of the Scandinavian dataset consists of meadows. For the Scandinavian dataset, 6000 images were used for training, 2000 for testing, and 1000 for validation. For the Alps dataset, 2800 images were used for training, 787 images for testing, and 349 images for validation. In total, we trained four models, corresponding to two different transfer directions for each dataset. Training a model took about 4.5 days on one NVIDIA GeForce RTX 2070 with Max-Q Design. The optimization parameters selected for the Adam optimizer were $\beta_1 = 0.5$ and $\beta_2 = 0.999$. We set the learning rate to 0.0001. The number of selected filters is 64, and the slope of the leaky ReLU was set to 0.2. Each model was trained for 200 epochs with a batch size equal to 1. The weight $\lambda$ of the $L_1$ loss was set to 100.
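For readers unfamiliar with the role of the Adam parameters reported above, here is a sketch of a single Adam update on one scalar parameter, using the standard update rule with bias correction. The `eps` value is the usual default and is an assumption, since the paper does not report it.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.5, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad    # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                 # bias-corrected second moment
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


p, m, v = adam_step(param=1.0, grad=2.0, m=0.0, v=0.0, t=1)
```

On the first step, the bias correction makes the update magnitude approximately equal to the learning rate regardless of the gradient scale; a smaller first-moment decay such as 0.5 makes the moving average of the gradient react faster to recent gradients.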

Land Cover Transfer
The second global manipulation we have considered is land cover transfer, whereby barren images are transferred to vegetation images and vice versa. A noticeable difference with respect to season transfer is that, in this case, we have no ground truth images since in most cases the barren (resp. vegetation) version of a vegetation (resp. barren) image does not exist. This requires that a different architecture and training strategy be used. In particular, we opted for the CycleGAN architecture [16], which, unlike pix2pix, does not require the availability of paired images for training. The CycleGAN architecture consists of two generators and two discriminators. Let us assume that the goal of the CycleGAN is to transfer images from a domain A to a domain B and vice versa. Each generator translates the images in one direction. Specifically, the first generator transfers images from domain A to domain B, whereas the second generator transforms the images in the opposite direction. In this way, each generator can act as an additional constraint for the other. The basic idea behind CycleGAN is to enforce a cyclic consistency in such a way that when the output of the first generator is used as an input for the second, the image produced by the second generator should be as close as possible to the original input image (thus avoiding the need for paired images belonging to the two domains). Note that, unlike with pix2pix, it is not necessary to carry out two separate training runs for the two directions of the transfer, since the two generators that are part of the CycleGAN architecture provide the models for the two directions.

Architecture
The exact architecture we have used to implement the land cover CycleGAN is described in the following. The generators are implemented by means of a residual network [43] with six residual blocks and skip connections. The input size we used is 512 × 512 × 4. Each residual block has a convolutional layer, a batch normalization layer, and a leaky ReLU activation function layer. Regarding the discriminator, we used seven convolutional layers, each followed by batch normalization and a leaky ReLU activation. For both networks, the Adam optimizer was used. Figure 4 shows the different losses of a CycleGAN architecture. Training is obtained by finding a good trade-off between three partial losses. The first partial loss is the typical adversarial GAN cross entropy loss ($L_{adv}$), defined as

$$L_{adv} = \mathbb{E}_{b}[\log D_b(b)] + \mathbb{E}_{v}[\log(1 - D_b(G_{v2b}(v)))] + \mathbb{E}_{v}[\log D_v(v)] + \mathbb{E}_{b}[\log(1 - D_v(G_{b2v}(b)))], \tag{4}$$

where $G_{v2b}$ is the generator that translates the images from vegetation to barren, $G_{b2v}$ is the generator that transfers the images from barren to vegetation, $D_b$ and $D_v$ are the discriminator networks that classify images as real or fake for the barren and vegetation domains, respectively, and $b$ and $v$ are generic barren and vegetation input images. The second loss is the cyclic consistency loss, defined by

$$L_{cyc} = \mathbb{E}_{v}[\lVert G_{b2v}(G_{v2b}(v)) - v \rVert_1] + \mathbb{E}_{b}[\lVert G_{v2b}(G_{b2v}(b)) - b \rVert_1], \tag{5}$$

whose goal is to make sure that vegetation (resp. barren) images that are translated to barren (resp. vegetation) and then back to vegetation (resp. barren) are as close as possible to the original input.
Finally, an extra constraint is added to ensure that when a generator is fed with an image belonging to the output domain, it leaves the image as is, since no transformation is actually necessary. Such a goal is achieved by defining a third loss, namely the identity loss, as follows:

$$L_{id} = \mathbb{E}_{b}[\lVert G_{v2b}(b) - b \rVert_1] + \mathbb{E}_{v}[\lVert G_{b2v}(v) - v \rVert_1]. \tag{6}$$

The goal of the generators is to minimize an overall loss combining the three partial losses described above:

$$\min_{G_{v2b},\, G_{b2v}} \; \alpha_1 L_{adv} + \alpha_2 L_{cyc} + \alpha_3 L_{id}, \tag{7}$$

where $\alpha_1$, $\alpha_2$, and $\alpha_3$ are the weights of the adversarial, cycle, and identity losses, respectively. In contrast, the discriminators aim at distinguishing between real and synthetic images, each in its domain, a goal that is achieved by solving the following optimization problem:

$$\max_{D_b,\, D_v} \; L_{adv}. \tag{8}$$
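The cycle and identity constraints can be illustrated with toy "generators" acting on flat lists of pixel values. This is a minimal sketch under the assumption that both losses use a mean absolute (L1) difference; the toy generators and default weights are placeholders for illustration only.

```python
def l1(a, b):
    """Mean absolute difference between two equally sized pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_loss(g_v2b, g_b2v, v, b):
    """Translate to the other domain and back, then compare with the input."""
    return l1(g_b2v(g_v2b(v)), v) + l1(g_v2b(g_b2v(b)), b)


def identity_loss(g_v2b, g_b2v, v, b):
    """A generator fed an image of its output domain should leave it unchanged."""
    return l1(g_v2b(b), b) + l1(g_b2v(v), v)


def generator_total(l_adv, l_cyc, l_id, a1=1.0, a2=5.0, a3=3.0):
    """Weighted sum of the adversarial, cycle, and identity losses."""
    return a1 * l_adv + a2 * l_cyc + a3 * l_id


# Toy generators: exact inverses of each other, so the cycle loss vanishes,
# but neither is the identity, so the identity loss does not.
def toy_v2b(img):
    return [p + 0.1 for p in img]


def toy_b2v(img):
    return [p - 0.1 for p in img]


v, b = [0.2, 0.6], [0.5, 0.7]
l_cyc = cycle_loss(toy_v2b, toy_b2v, v, b)
l_id = identity_loss(toy_v2b, toy_b2v, v, b)
```

The toy pair shows why the two terms are complementary: a pair of mutually inverse mappings satisfies cycle consistency perfectly while still altering images that already belong to the target domain, which only the identity term penalizes.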

Training
The dataset we used to train the CycleGAN architecture described in the previous section is the land cover dataset, with 8000 images from each domain used for training and 2000 images kept for testing. Training the model took about 10 days on one NVIDIA GeForce RTX 2070 with Max-Q Design GPU. For this task, the CycleGAN model was trained for 200 epochs with an input size of 512 × 512 × 4. For each network of the model, the Adam optimizer was used with $\beta_1$ set to 0.5, $\beta_2$ set to 0.999, and the learning rate set to 0.0001. The number of filters used is 32, and the slope of the leaky ReLU was set to 0.2. The batch size was set to 1. The GAN adversarial loss weight $\alpha_1$ was set to 1 and the cyclic consistency weight $\alpha_2$ was set to 5, whereas the identity loss weight $\alpha_3$ was set to 3.

Splicing with iGPT
The next manipulation we considered is local splicing. We used iGPT to generate synthetic image splices that we then inserted into images from the world dataset. The iGPT was trained on the SEN12MS dataset. The goal in this case was to remove some parts of an image from the world dataset and replace them with content generated by iGPT, taking care to enforce the consistency of the spliced patch with the surroundings of the removed part and the rest of the image. The overall splicing process is depicted in Fig. 5.
iGPT [19] is a transformer-based unsupervised image classification and image generation model. Early transformer-based models, such as BERT [44], RoBERTa [45], and T5 [46], could be used directly with 1D sequences in any form but were not easily extendable to 2D data, such as images. More recently, the GPT-2 [47] model has been adapted to work with 2D data by training it on pixel sequences (a GPT-2 model trained on images is known as iGPT). To identify the areas to be spliced, we apply watershed [48] unsupervised segmentation to the images of the world dataset and randomly select several of the largest segments as the regions where the splices generated by iGPT are to be inserted. Then we use iGPT to generate the to-be-inserted splices based on the mask given by the watershed segmentation.
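The region selection and compositing steps can be sketched as follows, assuming a segmentation label map is already available (for example, from a watershed algorithm) and images are plain 2D lists. The function names and the `n_largest`, `n_pick`, and `seed` parameters are hypothetical and serve only to illustrate the idea.

```python
import random
from collections import Counter


def pick_splice_mask(labels, n_largest=3, n_pick=1, seed=0):
    """Binary mask covering n_pick segments drawn from the n_largest segments.

    labels: 2D list of integer segment ids, one per pixel.
    """
    sizes = Counter(seg for row in labels for seg in row)
    largest = [seg for seg, _ in sizes.most_common(n_largest)]
    rng = random.Random(seed)
    chosen = set(rng.sample(largest, min(n_pick, len(largest))))
    return [[1 if seg in chosen else 0 for seg in row] for row in labels]


def splice(original, generated, mask):
    """Take generated pixels where mask is 1 and original pixels elsewhere."""
    return [
        [g if m else o for o, g, m in zip(orow, grow, mrow)]
        for orow, grow, mrow in zip(original, generated, mask)
    ]


labels = [[1, 1, 2], [1, 1, 2], [3, 3, 3]]
mask = pick_splice_mask(labels, n_largest=1, n_pick=1)
result = splice([[0] * 3 for _ in range(3)], [[9] * 3 for _ in range(3)], mask)
```

Because the mask follows segment boundaries rather than a rectangle, the inserted content occupies an irregular, semantically coherent region of the image.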

Architecture
iGPT [19] is very similar to GPT-2 [47]. An important difference between the two architectures is the activation function. Specifically, a quick Gaussian error linear unit (GELU) is used in iGPT, instead of the GELU used in GPT-2. GELU combines some useful features of the most commonly used activation functions. It can be interpreted as stochastically multiplying its input by zero or one, with a probability that depends on the input, and taking the expected value. It can be approximated by the following equation:

$$\mathrm{GELU}(x) \approx 0.5\,x\left(1 + \tanh\left[\sqrt{2/\pi}\,\left(x + 0.044715\,x^{3}\right)\right]\right). \tag{9}$$

Quick GELU is similar to GELU but with a lower computational complexity, being defined as

$$\mathrm{QuickGELU}(x) = x\,\sigma(1.702\,x). \tag{10}$$

iGPT also differs from GPT-2 in the number of normalization layers, which is much lower for iGPT, thus significantly decreasing the number of operations. The reader may refer to Ref. 19 for more details. The input of the model is a sequence of pixels $U = \{x_1, \ldots, x_n\}$, and its objective is to maximize the following likelihood:

$$L(U) = \sum_{i} \log P(x_i \mid x_{i-k}, \ldots, x_{i-1}; \Theta), \tag{11}$$

where $P$ is the conditional probability modeled by a neural network with parameters $\Theta$, and $k$ is the size of the context window. The transformer inside the iGPT creates a model of the probability density function of the current pixel $x_i$, given the previously observed pixels $x_1, \ldots, x_{i-1}$, as shown in the following equation:

$$p(x) = \prod_{i=1}^{n} p(x_{\pi_i} \mid x_{\pi_1}, \ldots, x_{\pi_{i-1}}, \theta), \tag{12}$$

where, in general, $\pi_i$ indicates a permutation of the pixel sequence. In our case, we simply let $\pi_i$ be the identity permutation, that is, $\pi_i = i$ [19]. $\theta$ contains all the other parameters of the neural network used during training. The 2D image is transformed into a 1D sequence by lexicographical ordering.
The input of the decoder part of the transformer is a sequence of discrete pixels $x_1, \ldots, x_{i-1}$, and the output is a $d$-dimensional vector, as shown in Fig. 6. The decoder is implemented by a stack of $L$ blocks, in which the $l$'th block produces an intermediate embedded vector $h^l_1, \ldots, h^l_d$. iGPT uses the same formulation as GPT-2 for the transformer decoder block, in which we input $h^l$ in the order seen in Eqs. (13)-(15) to obtain $h^{l+1}$:

$$n^{l} = \mathrm{layer\_norm}(h^{l}), \tag{13}$$

$$a^{l} = h^{l} + \mathrm{multihead\_attention}(n^{l}), \tag{14}$$

$$h^{l+1} = a^{l} + \mathrm{mlp}(\mathrm{layer\_norm}(a^{l})). \tag{15}$$

The final layer of the transformer decoder is followed by a normalization layer and a projection that produces logits (real numbers with unlimited range), which parameterize the conditional probability distributions of each sequence element. In the final step, the output vector is reshaped into a 2D image. In our implementation, we used eight layers with two heads to construct the iGPT, and we set the embedded vector dimension to 16, as suggested in the original iGPT paper [19].
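The relationship between the two activation functions can be checked numerically. The sketch below uses the widely known tanh approximation of GELU and the sigmoid-based quick GELU with the constant 1.702; both function names are illustrative.

```python
import math


def gelu_tanh(x):
    """Tanh approximation of the Gaussian error linear unit."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def quick_gelu(x):
    """Sigmoid-based approximation: x * sigmoid(1.702 * x)."""
    return x / (1.0 + math.exp(-1.702 * x))


# Compare the two activations on a few sample points.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"{x:+.1f}  gelu={gelu_tanh(x):+.4f}  quick={quick_gelu(x):+.4f}")
```

Quick GELU replaces the tanh and cubic term with a single sigmoid evaluation, at the cost of a small deviation from GELU over the range shown.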

Training
The iGPT model was trained for 20,000 epochs using the SEN12MS dataset. We had to train the model for a large number of epochs to generate realistic spliced objects. The validation loss was decreasing throughout the entire training process. The size of the patches extracted from the SEN12MS dataset was 28 × 28 × 3. At each iteration, we extracted one random patch for each image. The model was trained using the negative log likelihood loss function $L(x, y)$, where $x_n$ is the predicted pixel value and $y_n$ is the target pixel value. The target pixel values are the pixel values in the SEN12MS dataset.
For training, we used the Adam optimizer with $\beta_1 = 0.5$, $\beta_2 = 0.999$, and learning rate equal to 0.003. $\beta_1$ and $\beta_2$ are the exponential decay rates used to estimate the first and second moments of the gradient. The batch size was set to 64. iGPT was trained on the SEN12MS dataset described in Sec. 3.4. Training went on for 4 months on one GPU. A total of 6747 spliced images were generated in ∼3 weeks.
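As a generic illustration of training on pixel sequences with a negative log likelihood, the sketch below flattens a 2D image in row-major (lexicographic) order and averages the negative log probability assigned to each target value. It assumes the model outputs one probability distribution over discrete pixel values per position; this is a common formulation and not necessarily the exact expression used by the authors.

```python
import math


def raster_flatten(image):
    """Row-major (lexicographic) flattening of a 2D image into a 1D sequence."""
    return [pixel for row in image for pixel in row]


def negative_log_likelihood(pred_probs, targets):
    """Mean negative log probability assigned to the target pixel values.

    pred_probs[n] is a list of probabilities over the discrete pixel values
    for sequence position n; targets[n] is the index of the true value.
    """
    total = sum(math.log(probs[t]) for probs, t in zip(pred_probs, targets))
    return -total / len(targets)


sequence = raster_flatten([[0, 1], [2, 3]])
uniform = [[0.25, 0.25, 0.25, 0.25] for _ in sequence]
loss = negative_log_likelihood(uniform, sequence)  # equals log(4) for a uniform guess
```

A model that predicts every value with equal probability yields a loss of log(K) for K possible pixel values, which is a convenient baseline when monitoring training.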

Splicing with Vision Transformer
The vision transformer [17] was originally developed for image classification by replacing the convolution layers with a transformer model [49]. Transformers lack some of the inductive biases built into convolutional networks, and therefore tend to generalize poorly when trained on small datasets. The vision transformer addresses this issue by training on large-scale datasets, which allows it to provide good results for image classification. In our work, we modified the vision transformer so as to use it for synthetic splice generation. In particular, we edited the last layers of the vision transformer so as to generate an image instead of a classification score. The modified layer has a size of 3 × 256 × 256.
We use the modified vision transformer to generate synthetic image splices to be inserted into images from the world dataset. The modified vision transformer was trained on the SEN12MS dataset. The goal here is to remove some regions of an image from the world dataset and replace them with content generated by the modified vision transformer. The content should be consistent with the surroundings of the removed part and the rest of the image. As with iGPT, we use watershed [48] segmentation on images in the world dataset and randomly select several of the largest segments to insert the splices. The whole procedure is shown in Fig. 7.

Architecture
The vision transformer inputs are image patches, as shown in Fig. 8. In our technique, the original image size is 128 × 128, whereas the size of the patches inside the transformer is 64 × 64. The dimensionality of the input image patches is then reduced using a linear projection: $T_i = W \hat{I}_i$, where $W \in \mathbb{R}^{D \times N}$ is a linear mapping function learned during training, $\hat{I}_i \in \mathbb{R}^{N}$ is the flattened $i$'th image patch, and $T_i \in \mathbb{R}^{D}$ is known as the $i$'th image token. Therefore, an image token is a linear projection of an input patch.

Fig. 7 Spliced images created using vision transformer.
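The patch tokenization can be sketched in a few lines, under the assumption of a 128 × 128 × 3 input and non-overlapping 64 × 64 patches, so that N = 64 · 64 · 3 = 12288. The token dimension `D = 32` and the random matrix `W` are placeholders; in the actual model, W is learned during training.

```python
import numpy as np

def image_to_patches(image, patch):
    """Split an (H, W, C) image into flattened non-overlapping patches,
    ordered row by row. Returns an array of shape (num_patches, patch*patch*C)."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be a multiple of patch size"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)          # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.random((128, 128, 3))
patches = image_to_patches(image, 64)             # shape (4, 12288)

D = 32                                            # assumed token dimension
W = 0.01 * rng.standard_normal((D, patches.shape[1]))  # stand-in for the learned W
tokens = patches @ W.T                            # row i is T_i = W I_i, shape (4, D)
```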
The vision transformer incorporates a traditional transformer used on one-dimensional sequences. The transformer uses self-attention modules to incorporate long-range information, which can contain information about all the inputs. 49 Let us call these inputs "position-aware tokens." The position-aware tokens indicate where each image patch is located in the input image. The self-attention modules are invariant to the order of the position-aware tokens, so the order in which we input the list of position-aware tokens into the transformer is irrelevant. We create the position-aware tokens by concatenating the image token (the projection of the input image patch) and the positional embeddings. The positional embeddings incorporate positional information for each input image token. 17 We created these positional embeddings by numbering the 64 × 64 patches in order. These positional embeddings, $P_i \in \mathbb{R}^{D}$ for $i \in \{0, 1, \ldots\}$, are used to add positional information about the input patches to the transformer. We input the position-aware tokens into the transformer to produce a "transformer output." We created the output image by reshaping the transformer output to the shape of the original image.
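The order property mentioned above can be checked numerically with a small single-head self-attention sketch: permuting the input tokens only permutes the output rows in the same way, which is why positional information has to be carried by the tokens themselves. The dimensions, the random weights, and the use of concatenation with randomly initialized positional embeddings are assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with the usual max-shift for stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over an (n, d) token list."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
n, D = 4, 8                                       # 4 patches, assumed token size
image_tokens = rng.standard_normal((n, D))
pos_emb = rng.standard_normal((n, D))             # one embedding P_i per patch index
tokens = np.concatenate([image_tokens, pos_emb], axis=1)   # position-aware tokens

Wq, Wk, Wv = (rng.standard_normal((2 * D, 2 * D)) for _ in range(3))
out = self_attention(tokens, Wq, Wk, Wv)

perm = np.array([2, 0, 3, 1])                     # feed the same tokens in another order
out_perm = self_attention(tokens[perm], Wq, Wk, Wv)
```

Here `out_perm` matches `out[perm]` up to floating-point error, so each token receives the same output regardless of where it appears in the input list.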

Training
We trained the vision transformer with images from the SEN12MS dataset, segmented using the watershed algorithm as shown in Figs. 7 and 8. The training loss is defined as the difference between the reconstructed and the original SEN12MS images. The vision transformer was trained for 4000 epochs on 128 × 128 × 3 patches extracted from the 124,511 images of the SEN12MS dataset. For training, we used the Adam optimizer with $\beta_1 = 0.5$, $\beta_2 = 0.999$, and a learning rate of $3 \times 10^{-6}$. The batch size was set to 1. The vision transformer incorporated a Linformer. 50 The Linformer reduces the memory used in the transformer self-attention module by reducing its space complexity. The Linformer we used has an internal dimension of 2048, a sequence length of 65, and a depth of 12. It uses 1024 heads and introduces a low-rank matrix to approximate the self-attention part of the transformer. In this way, the space and time complexity of the model is reduced to $O(n)$. More detailed information about this architecture can be found in the original Linformer publication. 50 We used the trained vision transformer to generate 285,768 spliced world images. The model was trained for 2 months on one GPU, and the spliced images were generated in ∼2 weeks.
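The low-rank idea behind the Linformer can be sketched for a single attention head as follows: keys and values are projected along the sequence axis from length n to a fixed length r, so the attention matrix has shape n × r instead of n × n. The sequence length of 65 follows the text, whereas the projected length `r`, the head dimension `d`, and the random projections `E` and `F` are assumptions; in the real model the projections are learned.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with the usual max-shift for stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def full_attention(q, k, v):
    """Standard attention: the score matrix has shape (n, n)."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def linformer_attention(q, k, v, E, F):
    """Linformer-style attention: E and F, both of shape (r, n), compress
    keys and values along the sequence axis, so scores have shape (n, r)."""
    k_proj = E @ k                                 # (r, d)
    v_proj = F @ v                                 # (r, d)
    scores = q @ k_proj.T / np.sqrt(q.shape[-1])   # (n, r)
    return softmax(scores) @ v_proj                # (n, d)

rng = np.random.default_rng(0)
n, d, r = 65, 32, 16                               # r and d are assumed values
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
E = rng.standard_normal((r, n)) / np.sqrt(n)       # stand-ins for learned projections
F = rng.standard_normal((r, n)) / np.sqrt(n)
out = linformer_attention(q, k, v, E, F)           # shape (65, 32)
```

For a fixed r, memory and compute for the score matrix grow linearly with n, and choosing identity projections with r = n recovers standard attention.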

Results
In the following, we show some examples of global and local image manipulations generated with the models described in the previous sections. For global manipulations, we applied γ-correction with γ set to 2 to the RGB bands of the images to visualize them properly, since otherwise they would be too dark for human inspection. In addition to γ-correction, for the season transfer task, we applied image stretching to the R, G, and B bands of each image. The original images displayed in this section (without corrections) can be accessed in Ref. 51. The quality of the season-transferred images can be judged by comparing them with the available ground truth. (Although comparing the transformed images with the ground truth is a reasonable way to judge the quality of the transfer, it is worth recalling that no unique ground truth exists for season transfer, given that the same region may look different on different days of the same season, or across different years.) For the land cover transformation, we judge the quality of the produced images by means of an objective spectral measure, the normalized difference vegetation index (NDVI), 52 and by classifying the image pixels by means of a general-purpose classifier.
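The visualization step can be sketched as follows. The exact conventions are not specified above, so this example assumes band values normalized to [0, 1], the brightening form of γ-correction (exponent 1/γ), and a percentile-based linear stretch; the function names and percentile defaults are likewise hypothetical.

```python
import numpy as np

def gamma_correct(band, gamma=2.0):
    """Brighten a band with values in [0, 1] using the exponent 1/gamma
    (assumed convention; gamma = 2 corresponds to a square root)."""
    band = np.clip(np.asarray(band, dtype=float), 0.0, 1.0)
    return band ** (1.0 / gamma)

def stretch(band, low_pct=2.0, high_pct=98.0):
    """Linearly rescale a band so that the chosen percentiles map to 0 and 1."""
    band = np.asarray(band, dtype=float)
    lo, hi = np.percentile(band, [low_pct, high_pct])
    if hi <= lo:                                   # constant band: nothing to stretch
        return np.zeros_like(band)
    return np.clip((band - lo) / (hi - lo), 0.0, 1.0)
```

Both functions act on a single band, so they would be applied separately to the R, G, and B bands before composing the color image.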

Season Transfer
In Fig. 9, we show an example of season transfer for the Alps dataset. Real winter and real summer images are shown in columns (a) and (c), respectively, whereas the synthetically generated images are displayed in columns (b) and (d). We report the color image obtained by combining the RGB bands (first row) and the single-band images (rows 2 to 5). As can be seen, the synthetic images produced by the pix2pix model approximate the real images very well and, in any case, provide a realistic view of the framed region in the target season. A similar example for the Scandinavian dataset is shown in Fig. 10. In this case too, the synthetic images are very close to the real ones.

Land Cover Transfer
In Fig. 11, we show an example of land cover transfer from barren to vegetation and vice versa. We notice that the generated vegetation bands are darker than the input barren images and consistent with real vegetation. In the same way, the generated barren image is lighter than the vegetation input, as it should be for real barren terrain. To give an objective measure of the effectiveness of the land cover transfer (in the absence of ground truth images), we used the NDVI, which is usually adopted to estimate the vegetation content of satellite multispectral images and is defined as

$$\mathrm{NDVI} = \frac{\mathrm{nir} - \mathrm{red}}{\mathrm{nir} + \mathrm{red}},$$

where nir is the near-infrared band (band 4 in our case) and red is the red band (band 3). We then used the NDVI to classify each pixel into one of four classes. 53 Specifically, pixels for which the NDVI is lower than −0.1 are classified as water, as barren when NDVI ∈ [−0.1, 0.1], as low vegetation when NDVI ∈ [0.1, 0.4], and as high vegetation when the NDVI is larger than 0.4. In Table 2, we report the result of the pixel classification into the above four classes for 2000 real vegetation images, 2000 real barren images, 2000 synthetic barren images (GAN barren), and 2000 synthetic vegetation images (GAN vegetation). We confirm that the majority of the pixels of real vegetation images belong to the high vegetation class, whereas the majority of the pixels of real barren images belong to the barren class. For the synthetic vegetation images, most pixels are classified as high vegetation, even though no pixels of the input images belong to that class. As for the synthetic barren images, most pixels were classified as barren and a few as low vegetation, even though most input pixels belonged to the high vegetation class. In addition, we compared our results with those obtained by applying the NICE GAN model presented in Ref. 15. To do so, we applied the NICE GAN model to the same pristine images of each class and then computed the percentage of pixels classified into the four terrain classes, as we did for the GAN images generated by our model. The results show that, in the case of GAN vegetation, our model achieves a stronger transfer, with only 1.1% of the pixels classified as barren, whereas with the model of Ref. 15, 8.3% of the pixels remained in the barren class. As for GAN barren, our model also shows a stronger transfer capability, with more than 97% of the pixels classified as barren, whereas with the model described in Ref. 15 a larger number of pixels are still classified as vegetation.
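The NDVI-based pixel classification can be sketched as follows, using the thresholds given above. The assignment of the boundary value 0.1 to the barren class and the handling of pixels where nir + red is zero are assumptions, since the stated intervals share their endpoints; the function and constant names are illustrative.

```python
import numpy as np

CLASSES = ("water", "barren", "low vegetation", "high vegetation")

def ndvi(nir, red):
    """NDVI = (nir - red) / (nir + red); set to 0 where the denominator is 0."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    denom = nir + red
    return np.divide(nir - red, denom, out=np.zeros_like(denom), where=denom != 0)

def classify_ndvi(values):
    """Map NDVI values to class indices 0..3 following the order of CLASSES."""
    values = np.asarray(values, dtype=float)
    idx = np.ones(values.shape, dtype=int)         # default: barren, [-0.1, 0.1]
    idx[values < -0.1] = 0                         # water
    idx[values > 0.1] = 2                          # low vegetation, (0.1, 0.4]
    idx[values > 0.4] = 3                          # high vegetation
    return idx

def class_percentages(values):
    """Percentage of pixels that fall into each of the four terrain classes."""
    idx = classify_ndvi(np.ravel(values))
    counts = np.bincount(idx, minlength=len(CLASSES))
    return dict(zip(CLASSES, 100.0 * counts / idx.size))
```

Applying `class_percentages` to the NDVI map of each image and averaging over a set of images yields per-class percentages of the kind reported in Table 2.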

IGPT
In Fig. 12, we show some images containing splices generated by the iGPT model. Figure 12(a) shows the original images, (b) the replacement masks (spliced regions), and (c) the manipulated images. We note that the model can generate splices that blend well into the images. The spliced images contain areas with different climates, vegetation types, and urbanization levels. For example, the top urban region in Fig. 13(a) has been synthetically generated, whereas the bottom green vegetation part belongs to the original image. In the same figure, the bottom vegetation part in Fig. 13(b) corresponds to a synthetic region, whereas the surrounding barren pixels are part of the original image.

Vision Transformer
In Fig. 14, we show some examples of the spliced images generated by the vision transformer. We noticed several differences between the splices generated by iGPT and those generated by the vision transformer. iGPT tends to generate more diverse objects. For example, in Fig. 13(a), the original image was mainly vegetation and rural areas, whereas the generated spliced region is an urban area. iGPT was able to generate splices with pixel-level detail for varying spliced regions; however, it took a long time (5 to 30 min) to generate these splices. The vision transformer tends to generate splices very similar to their surroundings. For example, in Fig. 15(a), the original image was a rural housing area, whereas the generated splices were vegetation, which is consistent with the rest of the image, which is mainly occupied by vegetation. The vision transformer sometimes had difficulty generating very detailed spliced regions; however, creating a spliced image took several seconds, in contrast to the long generation time of iGPT.

Conclusion
DL generative techniques are able to generate realistic images that can even deceive human inspection. Although most attention has so far been given to the generation of face images, we expect that the generation of synthetic satellite images will gain more and more interest in the future. However, conventional techniques used to generate face images cannot be directly applied to create synthetic satellite images, due to the particular nature of satellite multispectral imagery. In this paper, we presented a number of DL architectures aimed at generating labeled, synthetically manipulated satellite images. We focused on two kinds of manipulations: full image modification and local splicing. With regard to full image modification, we adapted two GANs commonly used for style transfer applications to implement two different kinds of transfer: (i) land cover transfer and (ii) season transfer. As to local manipulations, we presented two architectures for local splicing. All the proposed methods can generate highly realistic images, opening the way for several uses across different application scenarios. As future work, we plan to examine whether the synthetic images generated by our architectures can be distinguished from real images by means of specific image forensic detectors.

Fig. 2 Example of images from the world dataset.

Fig. 8 Block diagram of the vision transformer.

Fig. 12 Example images generated by the iGPT architecture: (a) the original world image, (b) the spliced regions, and (c) the manipulated images.

Figure 14(a) contains the original images, (b) contains the replacement masks (spliced regions), and (c) contains the manipulated images. We generated the spliced regions as in Sec. 5.3. As can be seen, the vision transformer generates realistic spliced regions. The spliced images contain areas with different climates, including desert, Mediterranean, continental, and tundra, and with different levels of vegetation. In addition, the spliced regions blend well into the surrounding areas. For example, the bottom forest area in Fig. 15(a) has been synthetically generated, whereas all the other green vegetation parts belong to the original image. In the same figure, the top desert part in Fig. 15(b) corresponds to a synthetic region, whereas the other surrounding desert pixels are original.

Fig. 14 Examples of images generated by the vision transformer architecture: (a) the original world image, (b) the spliced regions, and (c) the manipulated images.

Table 1
Datasets used for our experiments.

Table 2
Percentage of pixels classified into four terrain classes based on NDVI.