Self-supervised embedding for generalized zero-shot learning in remote sensing scene classification

Abstract. Generalized zero-shot learning (GZSL) extends zero-shot learning (ZSL) by involving both seen and unseen classes in the classification process. Many existing GZSL approaches for scene classification in remote sensing images use word embeddings that do not effectively describe unseen categories. We explore word embeddings for describing the classes of remote sensing scenes to improve the classification accuracy of unseen categories. The proposed method uses a data2vec embedding based on self-supervised learning to obtain a continuous and contextualized latent representation. This representation leverages two advantages of the standard transformer architecture. First, targets are not predefined as visual tokens. Second, latent representations preserve contextual information. We conducted experiments on three benchmark scene classification datasets of remote sensing images. The proposed approach demonstrates its efficacy over existing GZSL approaches.


Introduction
Advances in remote sensing platforms have enabled the acquisition of high-resolution imagery, which poses challenges in semantically understanding such abundant volumes of data. Scene classification is a preliminary task that helps analyze these volumes of remote sensing images at a coarser level. It aims to assign a label to a given scene from a set of predefined categories based on its content. Most works [1][2][3] address the problem of scene classification using supervised learning with convolutional neural networks (CNNs). With the large coverage of remote sensing satellite images, the tedious annotation process for all scene categories becomes practically infeasible. With its innate capability to accommodate new unseen/undiscovered classes, zero-shot learning (ZSL) would benefit many existing categorization applications [4][5][6] of remote sensing imagery.
ZSL 7 is a task that helps in understanding scenes using only descriptions of the classes, without involving any sample of a new class during training. Hence, in ZSL, the training and testing class sets are disjoint. ZSL can be accomplished by sharing semantic information from seen to unseen class samples. Here, semantic information is a high-level description of the classes; how it can be obtained and transferred to unseen classes is discussed in Sec. 2. ZSL approaches can be divided into two categories: conventional ZSL (CZSL) and generalized ZSL (GZSL). The objective of CZSL is to predict only unseen classes, whereas GZSL predicts samples of both seen and unseen classes. 8 These techniques are illustrated in Fig. 1. GZSL is more challenging than CZSL because many unseen classes are prone to being misclassified into one of the seen classes at the testing phase.
Mainly, the GZSL-based approaches focus on natural images and address the issues of mapping from visual features to semantic embeddings 9,10 and the seen-unseen bias. 11,12 Nevertheless, although numerous CZSL techniques have been explored for classifying remote sensing (RS) images, GZSL is hardly explored in remote sensing images. To the best of our knowledge, we are the first to explore GZSL for scene classification tasks in remote sensing images. GZSL aims to categorize RS samples of both seen and unseen classes by establishing a mapping between the feature and semantic spaces. With overhead imaging, the semantics of an image may ignore other modalities, such as the elevation of objects above the ground, which is described by digital elevation models. 13 To incorporate information from other modalities, word embeddings become an alternative for describing them rather than utilizing them explicitly in deriving the semantics. Generally, both seen and unseen classes are represented as semantic vectors in the embedding space, in terms of word/sentence embeddings 14 or attribute vectors. 14 For feature extraction, we use models (e.g., AlexNet, 15 VGG, 16 GoogLeNet, 17 and ResNet 18 ) pre-trained on ImageNet, 19 which ignore the cross-dataset bias 20 between the ImageNet dataset and remote sensing benchmark datasets. Often, the cross-dataset bias results in low-quality visual features for GZSL in remote sensing scene classification (RSSC), which affects the classification accuracy on both seen and novel scene classes.
In general, the methods used to extract visual features of remote sensing images are weak due to the cross-dataset bias 20 because they rely on ImageNet pre-trained models. 19 Also, feature vectors obtained from the word2vec representation are limited to a fixed representation irrespective of context. Hence, they are not effective in capturing appropriate semantics. By alleviating these issues, the extracted visual and semantic features can be enhanced to improve GZSL-RSSC classification performance. Thus, we propose a method called "GZSL for RSSC using data2vec representations," termed GZSL-RSD2V. The proposed GZSL-RSD2V uses a feature enhancement (FE) 21 module to obtain discriminative features, which effectively enhances the visual features. It also uses a data2vec model based on a standard transformer architecture to obtain continuous and contextualized hidden features. The representation of the data2vec model has two benefits: (i) targets are not fixed as visual tokens, and (ii) latent representations preserve contextual information. We conducted experiments using GZSL-RSD2V for scene classification in remote sensing images. The main contributions of this paper are summarized as follows.
• We propose a feature variational autoencoder generative adversarial network (f-VAEGAN) to learn a mapping from semantics to the visual domain for visual feature generation.
• A practical embedding approach based on a standard transformer architecture is developed to represent semantic features of remote sensing images.
• We introduce an FE module to refine the visual features of both seen and unseen classes.
• Our representation demonstrates the compactness of within-class similarity and the separability of inter-class variations.
The remainder of this paper is organized as follows. Section 2 presents various methods for scene classification using ZSL and embedding approaches for encoding semantic information. In Sec. 3, we explain the proposed GZSL-RSD2V. The experimental results and the analysis of the proposed approach relative to existing GZSL approaches are discussed in Sec. 4. The conclusion of this paper is provided in Sec. 5.

Related Work
This section presents the existing ZSL approaches and some important methods explored for encoding semantic information.

Zero-Shot Learning
ZSL-based scene classification in remote sensing images is divided into two categories, as explained in the following subsections.

Embedding-based methods
These methods aim to map seen class samples and their class semantic vectors into an embedding space; a nearest neighbour search in the embedding space is then used to classify unseen class samples with their class semantic vectors. In the domain of remote sensing images, a method based on label propagation has been proposed for ZSL. 22 The label propagation mechanism helps construct a semantic-directed graph to share semantic information from seen to unseen classes, thereby classifying a test image into one of the unseen classes.
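The nearest-neighbour decision rule these methods share can be sketched as follows; the function name `zsl_classify` and the toy two-dimensional semantic vectors are illustrative, not taken from any of the cited works.

```python
import numpy as np

def zsl_classify(visual_embedding, class_semantics):
    """Assign the class whose semantic vector lies nearest to the
    (already projected) visual feature in the shared embedding space.

    class_semantics: dict mapping class name -> semantic vector (np.ndarray).
    """
    names = list(class_semantics)
    vectors = np.stack([class_semantics[n] for n in names])
    # Euclidean nearest-neighbour search over all candidate class semantics
    dists = np.linalg.norm(vectors - visual_embedding, axis=1)
    return names[int(np.argmin(dists))]
```

At test time, only unseen-class semantic vectors would be supplied in CZSL, while GZSL passes both seen and unseen classes, which is exactly where the bias toward seen classes arises.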
Quan et al. 23 employed the Shannon embedding method to implement ZSL for scene classification in remote sensing images. This method aligns the features in the semantic space with the corresponding features in the visual space to maintain class structure consistency between the visual and semantic spaces. Another work 24 proposes a semantic autoencoder-based method that imposes conditions on the distance to align the visual and semantic spaces for ZSL in remote sensing images. Further, a technique 25 maps the visual space to the semantic space by training a projection network to perform ZSL tasks in remote sensing images. With the learned mapping function, semantic knowledge can be transferred during inference on unseen classes. However, embedding-based methods cannot perform well in GZSL settings because predictions for unseen classes are essentially biased 26,27 toward seen classes during the testing process. This motivates us to explore generative methods for ZSL in remote sensing images.

Generative methods for zero-shot learning
In generative methods, a generative model is first trained to generate unseen class image features for data augmentation. A classifier (CLS) is then learned over the seen features and the generated novel class features to perform the ZSL task. To implement the ZSL task, we utilize recent work on generative models, such as variational autoencoders (VAEs), 26,28,29 GANs, 11,30,31 and generative flows. 32 Xian et al. 11 were the first to use generative adversarial networks (GANs) 33 to map semantic to visual features, establishing a state-of-the-art approach for ZSL. Li et al. 34 first implemented the ZSL task in remote sensing images using GANs, achieving within-class similarity and between-class discrimination.
The description of the semantic information is as follows.

Semantic Information
In ZSL, only seen class images are available during training. Semantic vectors of remote sensing scene categories serve as a bridge between seen and unseen class images, enabling the classification of unseen classes. These semantics are what make ZSL possible. Semantic information can be extracted from semantic attributes or word vectors.

Semantic attributes
Semantic or manually defined attributes are high-level descriptions of objects, such as an object's color or shape. Unseen classes can be recognized based on semantic attributes, but human annotation is required. As an example from natural image analysis, the CUB dataset was annotated with 312 semantic attributes corresponding to 200 different bird classes. 35 However, semantic attributes for the remote sensing benchmark datasets have not yet been explored.

Word embeddings
In general, natural language processing models such as word2vec, 36 GloVe, 37 and fastText, 38 which are trained over corpora of up to one trillion words, often result in very high-dimensional vector representations. They do not require human annotation. However, they have some limitations.
In the word2vec model, each word has a fixed representation irrespective of its role in the context, so the vector representation does not deliver the expected performance. These embeddings also contain considerable noise, which compromises the model's performance. To overcome these limitations of the word2vec model, we explore the representation from the data2vec model 39 as the word embedding in ZSL. Data2vec tries to predict a contextualized latent representation based on a limited view of the input sample. The representation of data2vec has two benefits. First, targets are not fixed as visual tokens. Second, hidden representations preserve contextual information.
Generalized Zero-Shot Learning for Remote Sensing Scene Classification Using Data2vec Representations
The block diagram of the proposed GZSL-RSD2V for GZSL is shown in Fig. 2. It comprises f-VAEGAN, 40 an FE module, 21 and a CLS. In Fig. 2(a), f-VAEGAN synthesizes visual features during the training process from the semantic vector of the data2vec embedding d. Here, we introduce the FE module to determine discriminative seen visual features in conjunction with f-VAEGAN. Specifically, the FE module is optimized to learn discriminative features using a joint center-triplet (JCT) loss and an iterative semantic consistency (ISC) loss. 21 In Fig. 2(b), we enhance the visual features for both seen and unseen class samples with the help of the trained FE. Then, we train a CLS on both the enhanced seen and unseen class features for classification. Finally, we classify the enhanced unseen features using the trained CLS at the testing phase.

Formulation
Let L_s and L_u denote the sets of seen and unseen class labels, respectively. We denote the seen class samples as S = {(f_i, l_i)}_{i=1}^{N}, where f_i represents the visual feature, l_i ∈ L_s is the corresponding class label, and N is the total number of seen samples. The relationship between the seen and unseen sets is L_s ∩ L_u = ∅ and L = L_s ∪ L_u. We denote the set of semantic vectors for every seen and unseen class as d_j ∈ D, ∀ j ∈ L, which helps share semantic information from seen to unseen class samples.

Data2vec Embedding
The data2vec model is "a general framework for self-supervised learning" 39 that works across several modalities, such as vision, speech, and language. Here, we use only the data2vec vision model. The data2vec model obtains continuous and contextualized hidden features of the input data. The main idea of data2vec is to regress contextualized hidden representations based on a masked view of the input. It has teacher and student networks built on a standard transformer architecture. 41 The teacher network generates contextualized representations of the full input data. The student network tries to predict the full-data representations based on a "blockwise masking view" 42 of the input sample; however, data2vec masks 60% of the patches instead of 40%. The weights of the teacher network are updated as an exponentially decaying moving average 43,44 of the student weights. The regression target is then constructed from the outputs of the transformer's top K blocks, which are continuous and contextualized. Prior methods predict targets lacking contextual information. In contrast, the data2vec model predicts contextualized latent target representations that embody related features from the whole image, unlike targets that contain information local to the current patch, such as visual tokens or pixels.
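The two mechanics described above, the exponentially decaying teacher update and the top-K block target, can be sketched as follows. `ema_update` and `contextual_target` are illustrative names; the real data2vec operates on transformer weight tensors and per-patch activations rather than flat lists.

```python
import numpy as np

def ema_update(teacher_params, student_params, tau=0.999):
    """Exponential moving average: the teacher slowly tracks the student,
    so its representations change smoothly during training."""
    return [tau * t + (1.0 - tau) * s
            for t, s in zip(teacher_params, student_params)]

def contextual_target(block_outputs, k=8):
    """Average the teacher's top-K transformer block outputs to form the
    continuous, contextualized regression target for the student."""
    top_k = block_outputs[-k:]  # last K blocks of the teacher
    return np.mean(top_k, axis=0)
```

With tau close to 1, the teacher changes only slightly per step, which is what makes its averaged top-K activations a stable target for the masked student.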

Feature Generating Models
We use f-VAEGAN 40 as a baseline for generating synthetic CNN features, mapping from semantic vectors to visual features conditioned on the data2vec embedding d. The f-VAEGAN combines a feature-generating VAE (f-VAE) 45 and a feature-generating Wasserstein GAN (f-WGAN) 11 to improve the feature generator. The f-VAE comprises an encoder E(f, d) and a decoder Dec(h, d). Here, the encoder converts the input f to hidden features h, and the decoder Dec(h, d) reconstructs the input f from h. The loss function for the f-VAE is

L_VAE = KL(q(h|f, d) ‖ p(h|d)) − E_{q(h|f,d)}[log p(f|h, d)],   (1)

where the conditional distribution q(h|f, d) is modeled by E(f, d), p(h|d) is taken to be N(0, 1), KL is the Kullback-Leibler divergence, and p(f|h, d) corresponds to Dec(h, d). In the f-WGAN, the generator G(h, d) generates a synthetic CNN feature f̃ from random input noise h, whereas the discriminator D(f, d) outputs a real-valued score and tries to discriminate real from synthetic features by optimizing

L_WGAN = E[D(f, d)] − E[D(f̃, d)] − η E[(‖∇_{f′} D(f′, d)‖_2 − 1)^2],   (2)

where f̃ = G(h, d) is the synthetic feature, f′ = ρf + (1 − ρ)f̃ with ρ ∼ U(0, 1), and η is the penalty multiplier. The parameters of the decoder Dec(h, d) and the generator G(h, d) are shared to improve the feature generator.
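Assuming a diagonal Gaussian encoder, the two loss terms can be sketched numerically as follows. The helper names, and the use of precomputed critic scores and interpolation-point gradient norms instead of autograd, are simplifications for illustration, not the authors' implementation.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(q(h|f,d) || N(0, I)) for a diagonal Gaussian
    encoder with mean `mu` and log-variance `log_var`."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def wgan_critic_loss(d_real, d_fake, grad_norms, eta=10.0):
    """Critic objective with gradient penalty: maximize the real/fake
    score gap while pushing gradient norms at interpolated points f'
    toward 1 (weighted by the penalty multiplier eta)."""
    penalty = eta * np.mean((grad_norms - 1.0) ** 2)
    return np.mean(d_real) - np.mean(d_fake) - penalty
```

The KL term is zero exactly when the encoder posterior equals the standard normal prior, and the penalty term vanishes when the critic is 1-Lipschitz along the interpolation lines.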

Feature Enhancement Module
The cross-dataset bias is alleviated by processing the visual features of remote sensing images through an FE module. Here, the FE module is constrained by the JCT loss and the ISC loss.

Joint center-triplet loss
This loss is introduced to learn discriminative features. These features are obtained by encouraging features with the same class label to stay together and features with different class labels to stay apart; this corresponds to the compactness of within-class similarity and the separability of inter-class variations, respectively. It is achieved with the help of class label information, the center loss, and the triplet loss. The JCT loss is formally defined as

L_JCT = Σ_i [ ψ ‖ω_i − c_{l_i}‖_2^2 + (1 − ψ) max(0, ‖ω_i − c_{l_i}‖_2^2 − ‖ω_i − c_{l′}‖_2^2 + Γ) ],   (3)

where c_l is the l-th class center, c_{l′} is the l′-th class center (l′ ≠ l), Γ denotes the margin that controls the separability of intra-class pairs from inter-class pairs, ω represents the features encoded by FE, and ψ ∈ [0, 1] is the balancing factor between the compactness of within-class similarity and the separability of inter-class variations.
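One plausible realization of this loss, consistent with the symbol definitions above (centers c_l, margin Γ, balancing factor ψ) but not necessarily the authors' exact formulation, is:

```python
import numpy as np

def jct_loss(features, labels, centers, margin=1.0, psi=0.5):
    """Joint center-triplet loss sketch: the center term pulls each
    feature toward its own class center; the triplet term keeps it
    at least `margin` closer to its own center than to the nearest
    other-class center. `psi` balances the two terms.

    Illustrative reconstruction from the paper's symbol definitions,
    not the authors' verified equation."""
    loss = 0.0
    for x, l in zip(features, labels):
        own = np.sum((x - centers[l]) ** 2)
        others = [np.sum((x - c) ** 2)
                  for j, c in enumerate(centers) if j != l]
        triplet = max(0.0, own - min(others) + margin)
        loss += psi * own + (1.0 - psi) * triplet
    return loss / len(features)
```

When every feature sits on its own center and the centers are farther apart than the margin, both terms vanish, which is the compact-and-separable configuration the loss rewards.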

Iterative semantic consistency loss
This loss is introduced at the last layer of the FE module to learn semantic features. The ISC loss regenerates the semantic features d from f or f̃ using the "reparameterization trick." 45 To learn effective semantic features, the ISC loss is applied to the synthetic semantic features to ensure that they map back to the original semantic vectors. It is realized as an l_1 reconstruction loss, formally defined as

L_ISC = ‖d̂_real − d‖_1 + ‖d̂_syn − d‖_1,   (4)

where d̂_real represents the semantic features synthesized from f with the help of FE and d̂_syn represents the semantic features synthesized from f̃. Note that d̂ = d̂_real ∪ d̂_syn, and d̂ represents the semantic features for the given visual features f or f̃.
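As a minimal sketch, the l_1 consistency between the regressed and original semantic vectors could be computed as follows (`isc_loss` is an illustrative name, not the authors' code):

```python
import numpy as np

def isc_loss(d_hat_real, d_hat_syn, d):
    """l1 reconstruction loss tying the semantic features regressed from
    real (f) and synthetic (f~) visual features back to the original
    semantic vector d."""
    return np.sum(np.abs(d_hat_real - d)) + np.sum(np.abs(d_hat_syn - d))
```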

Extracting enhanced features
In this stage, we extract the enhanced features f̂_s and f̂_u from the trained FE. Using a residual connection, 18 we combine the visual feature f, the corresponding latent vector z_s ∈ Z, and the semantic embedding d̂_s ∈ D into f̂_s. Similarly, we combine the visual feature f̃, the corresponding latent vector z_u, and the semantic embedding d̂_u into f̂_u. Figure 2(b) illustrates the fully enhanced features f̂_s and f̂_u, formally defined as

f̂_s = f ⊙ z_s ⊙ d̂_s,   (5)
f̂_u = f̃ ⊙ z_u ⊙ d̂_u,   (6)

where ⊙ denotes the concatenation operation and f̂_s, f̂_u ∈ F. Hence, the visual features f̂_s and f̂_u are enhanced into discriminative features that are class- and semantically appropriate, avoiding ambiguities between feature samples of distinct classes.
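Since ⊙ is plain concatenation, the enhancement step reduces to stacking the three vectors; a minimal sketch, with `enhance` as an illustrative name:

```python
import numpy as np

def enhance(visual, latent, semantic):
    """Form an enhanced feature by concatenating the visual feature,
    the FE latent vector, and the regressed semantic embedding."""
    return np.concatenate([visual, latent, semantic])
```

The enhanced dimensionality is therefore the sum of the three input sizes, which fixes the input width of the downstream CLS.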
Finally, our model GZSL-RSD2V is trained with the overall objective

L = L_VAE + L_WGAN + λ_JCT L_JCT + λ_R_d L_ISC,   (7)

where λ_JCT and λ_R_d are the hyperparameters weighting the JCT loss and the ISC loss, respectively.

Experimental Results and Analysis
This section provides the results and analysis of the proposed GZSL approach for scene classification in remote sensing images. We demonstrate the efficacy of the proposed approach on three benchmark scene classification datasets: UCMercedLandUse (UCM21), 46 WHU-RS19 (RS19), 47 and the aerial image dataset (AID30). 48

Details of Scene Classification Datasets
UCM21 is a 21-class land use RSSC benchmark dataset manually extracted from large images in the US Geological Survey collection. The RS19 dataset contains 19 scene classes extracted from Google Earth at various high resolutions. AID30 is a 30-class scene classification benchmark of remote sensing images. Table 1 provides details of these datasets.

Implementation
Our proposed method employs an encoder, a generator, and a discriminator, each implemented as a multilayer perceptron. Each network has a 4096-node hidden layer with LeakyReLU activation. The FE module is also a multilayer perceptron. It holds two hidden layers of 4096 and 2 × |d̂| nodes, respectively, with LeakyReLU, followed by an encoding layer that uses two feature vectors of size |d̂| to constitute the second hidden layer. Its final layer of size |d̂| corresponds to the semantic vector of the word embedding method (e.g., |d̂| = 768 for data2vec). We used the Adam optimizer 49 with β1 = 0.5 and β2 = 0.999. The penalty multiplier η is set to 10. In this study, the hyperparameters of the JCT loss multiplier (λ_JCT), the ISC loss multiplier (λ_R_d), and the balancing factor (ψ) are set to 0.999.

Extraction of Visual and Semantic Features
Fine-tuned deep features of size 2048 are extracted from remote sensing image scenes using the ResNet-101 18 model pre-trained on ImageNet. 19 Semantic prototypes are extracted from the data2vec model. Here, the data2vec model predicts contextualized hidden features of the entire input image based on a masked version of the input sample in a self-distillation setup using a standard transformer architecture. 41 We used word2vec word embeddings pre-trained on the Google News corpus 36 for a fair comparison. The details of the word embedding dimensions are shown in Table 2.

Quantitative Analysis
We used the unified evaluation protocol 11 to evaluate our proposed approach for a fair comparison. We assess the top-1 accuracy for seen and unseen class samples (indicated by S and U, respectively). The harmonic mean (indicated by H) of S and U is also estimated as H = (2 × S × U)/(S + U). All of the zero-shot RSSC experiments are repeated 25 times with random seen/unseen splits, and the average classification accuracies are reported. Tables 3-5 show the top-1 classification accuracies of the word2vec and data2vec methods on the UCM21, RS19, and AID30 datasets, respectively. It can be observed from the results that our approach with data2vec embedding performs better than with word2vec embedding on all three benchmark datasets. To the best of our knowledge, we are the first to implement GZSL for scene classification tasks in remote sensing images.
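The harmonic mean reported in Tables 3-5 can be computed as follows:

```python
def harmonic_mean(seen_acc, unseen_acc):
    """Harmonic mean H of seen (S) and unseen (U) top-1 accuracies,
    the standard GZSL summary metric; H is high only when both S and
    U are high, penalizing bias toward seen classes."""
    if seen_acc + unseen_acc == 0:
        return 0.0
    return 2.0 * seen_acc * unseen_acc / (seen_acc + unseen_acc)
```

For example, S = 60% and U = 40% give H = 48%, whereas S = 90% with U = 10% gives only H = 18%, showing why H rewards balanced seen/unseen performance.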

Analysis on UCM21 dataset
We evaluated our proposed GZSL-RSD2V method with the word2vec and data2vec embedding approaches over the four standard splits 22 of the UCM21 dataset with seen/unseen classes of 16/5, 13/8, 10/11, and 7/14. It is observed from Table 3 that data2vec shows an improvement of 4.5%, 7.4%, 0.3%, and 2.8% in seen class accuracy and 5.3%, 2.7%, 4.2%, and 3.1% in unseen class accuracy on the standard splits, respectively. Also, the efficacy of our proposed method with data2vec embedding is demonstrated in terms of the harmonic mean, with improvements of 5.7%, 4.4%, 5.5%, and 4.4% under the same seen/unseen splits (16/5, 13/8, 10/11, and 7/14, respectively). Our proposed method with the data2vec embedding approach exhibits better classification than word2vec. This may be because data2vec is a self-supervised word embedding, making it capable of learning semantic features for unseen classes.

Table 3 Seen, unseen, and harmonic mean scene classification accuracies with standard seen/unseen splits on the UCM21 dataset.

Analysis on RS19 dataset
We considered the word2vec and data2vec embedding approaches to evaluate our proposed method using the four standard splits 22 of the RS19 dataset with seen/unseen classes of 15/4, 12/7, 9/10, and 6/13. It is observed from Table 4 that data2vec achieved an improvement of 2.1%, 2.4%, 0.7%, and 1.8% in seen class accuracy and 12.5%, 10.6%, 6.4%, and 4.1% in unseen class accuracy on the same standard splits. Also, the efficacy of our proposed method with data2vec embedding is demonstrated in terms of the harmonic mean, with improvements of 9.9%, 11.2%, 7.6%, and 5.6% under the same seen/unseen splits. Our proposed method with the data2vec embedding approach exhibits better classification than word2vec. This may be because data2vec is a self-supervised word embedding, making it able to learn semantic features of unseen classes.

Analysis on AID30 dataset
Our proposed GZSL-RSD2V is evaluated by considering the word2vec and data2vec embedding approaches over the four standard splits 22 of the AID30 dataset with seen/unseen classes of 25/5, 20/10, 15/15, and 10/20. Table 5 shows a rise in the classification accuracy of 23.0%, 3.2%, 9.6%, and 6.7% for unseen classes under these seen/unseen splits with the data2vec approach, though a marginal drop of around 1% in seen class accuracy is observed with data2vec in comparison with word2vec. Our proposed method also exhibits the effectiveness of data2vec embedding in terms of the harmonic mean, with improvements of 21.4%, 4.0%, 13.9%, and 10.2% under the same seen/unseen splits. It is observed from the experiments that data2vec provides better semantic features for unseen classes than for seen classes. Upon evaluating our proposed GZSL-RSD2V method over three challenging scene classification benchmark datasets, we noticed that data2vec embedding shows consistent improvement in classifying scenes of unseen classes. However, data2vec embedding does not show improvement in classifying scenes of seen classes on the AID30 dataset in comparison with the UCM21 and RS19 datasets, though the drop in classification accuracy is relatively small.

Qualitative Analysis
This section provides the qualitative results of our proposed method and their analysis. We used uniform manifold approximation and projection (UMAP) 50 to visualize the real unseen class visual features and the visual features synthesized by our proposed method with the word2vec and data2vec embeddings. Figures 3-5 show the UMAP visualizations for the UCM21, RS19, and AID30 datasets, respectively. It is observed from Figs. 3-5 that the visual features synthesized by our proposed method with data2vec exhibit better separability than those synthesized with word2vec. It is evident from the visualization that our proposed method with data2vec embedding is able to capture meaningful semantics relevant to unseen classes.

Conclusion
This paper proposed a self-supervised embedding to represent semantics for GZSL-based scene classification in remote sensing images. A learning mechanism was devised to map the semantics to the corresponding visual domain during visual feature generation. A feature enhancement module was employed to improve the visual features of both seen and unseen classes of remote sensing images. To the best of our knowledge, we are the first to explore a GZSL approach in the remote sensing domain. Our proposed approach was evaluated using both data2vec and word2vec embeddings. The experiments show that our proposed method with data2vec embedding is able to capture meaningful semantics relevant to unseen classes. In the future, we will explore weighted embeddings for representing the semantics of remote sensing images.

Fig. 2
Fig. 2 Block diagram of the proposed GZSL-RSD2V method. (a) GZSL-RSD2V comprises three modules: f-VAEGAN, an FE module, and a CLS. The f-VAEGAN module synthesizes visual features during the training process from the semantic vector of the data2vec embedding d, whereas the FE module determines discriminative seen visual features in conjunction with f-VAEGAN. We enhance the visual features for both seen and unseen class samples with the help of the trained FE. Then, we train a CLS to classify both the enhanced seen and unseen class features. (b) FE module architecture. FE is optimized to learn discriminative features using the JCT loss and the ISC loss; its discriminative features from different layers are then concatenated to obtain the enhanced features.

Fig. 4
Fig. 4 UMAP visualizations for the features of four unseen class samples from the RS19 dataset. (a) The real unseen class features, (b) the synthesized features of our proposed method with word2vec embedding, and (c) the synthesized features of our proposed method with data2vec embedding.

Fig. 3
Fig. 3 UMAP visualizations for the features of five unseen class samples from the UCM21 dataset. (a) The real unseen class features. (b) The separability of the synthesized features of our proposed method with word2vec embedding. (c) The separability of the synthesized features of our proposed method with data2vec embedding.

Fig. 5
Fig. 5 UMAP visualizations for the features of five unseen class samples from the AID30 dataset. (a) The real unseen class features, (b) the synthesized features of our proposed method with word2vec embedding, and (c) the synthesized features of our proposed method with data2vec embedding.

Table 1
Details of three benchmark datasets for scene classification in remote sensing images.

Table 4
Seen, unseen, and harmonic mean scene classification accuracies with different seen/unseen splits on the RS19 dataset.

Table 5
Seen, unseen, and harmonic mean scene classification accuracies with standard seen/unseen splits on the AID30 dataset.

Table 2
Details of semantic vectors extracted from different embeddings over three datasets.