Multidomain feature fusion method for small object classification: MDFF

Abstract. Classifying small objects remains challenging for current deep learning classification models, such as convolutional neural networks (CNNs) and vision transformers (ViTs). We believe that these algorithms are not designed specifically for small targets, so their feature extraction abilities for small targets are insufficient. To improve the small object classification capabilities of CNN-based and ViT-based models, two multidomain feature fusion (MDFF) frameworks, called MDFF-ConvMixer and MDFF-ViT, are proposed to increase the amount of feature information derived from images. Compared with the basic models, the added design comprises frequency domain feature extraction and MDFF processes. In the frequency domain feature extraction part, the input image is first transformed into the frequency domain through the discrete cosine transform (DCT), and a three-dimensional matrix containing the frequency domain information is then obtained via channel splicing and reshaping. In the MDFF part, MDFF-ConvMixer splices the spatial and frequency domain features by channel, whereas MDFF-ViT uses a cross-attention mechanism to fuse the spatial and frequency domain features. On small target classification tasks, both frameworks clearly improve the underlying classification algorithm. On the DOTA dataset and the CIFAR10 dataset with two downsampling operations, the accuracies of MDFF-ConvMixer relative to ConvMixer increase from 87.82% and 62.14% to 90.14% and 66.00%, respectively, and the accuracies of MDFF-ViT relative to the ViT increase from 79.22% and 36.2% to 88.15% and 59.23%, respectively.

impressive results on regular-sized targets. 3 Convolutional neural networks (CNNs) generally use methods such as network deepening and feature multiplexing to enhance the classifier's ability to extract spatial features from the target. 4,5 ResNet 6 uses the idea of residual learning. On the basis of VGG19, 7 a residual unit is added through a shortcut to solve the degradation problem of deep networks so that the network can become deeper and extract deeper features. However, deeper networks are more likely to lose small target features. DenseNet 8 establishes dense connections between different layers, reuses features between earlier and later layers, and performs well on large targets; however, because small targets carry less information, directly using DenseNet to deepen the network easily causes overfitting. Xu et al. 9 proposed a new target feature extraction approach, which uses adaptive channel pruning to reshape images in the frequency domain and then uses conventional CNNs for classification. This method uses the frequency domain features of the target, and its final effect is better than that of the original CNN method. However, using only frequency domain features with pruning leads to incomplete feature extraction for small objects, so this method is still not suitable for small object classification. From our study of the above methods and our understanding of small target features, we believe that the feature extraction ability of a model for small targets can be enhanced by combining frequency domain features with spatial domain features. Therefore, a spatial and frequency domain feature fusion method [multidomain feature fusion (MDFF)] based on data enhancement is proposed in this paper, which enhances the importance of the frequency domain features to the classifier, enriches the effective features of small targets, and improves the model's small target recognition ability.
Based on this method, two recognition frameworks [MDFF-ConvMixer and MDFF-vision transformer (ViT)] are designed. On the DOTA dataset and the CIFAR10 dataset with two downsampling operations, we verify that the classifiers constructed by these two frameworks achieve improved recognition performance for small objects. For the small target classification task, this paper is a new attempt to fuse spatial and frequency domain features.

Related Works
The main work of this paper is research on classification and recognition technology for small targets. The research idea is to extract and fuse the spatial and frequency domain features of targets based on the structures of ConvMixer and a vision transformer (ViT). In this section, we review related work.

ViTs
In recent years, due to the success of transformers in the field of natural language processing (NLP), nonconvolution models that rely only on transformers have gradually become the most advanced algorithms in the field of computer vision. The transformer structure with attention proposed by Vaswani et al. 10 has achieved good results. The bidirectional encoder representations from transformers (BERT) method proposed by Devlin et al. 11 uses a classification (CLS) token to aggregate the classification information of the entire token sequence, which reduces the computational complexity of the transformer algorithm. Later, a self-attention technique was proposed by Parmar et al. 12 This method focuses only on the local neighborhoods of pixels, which enables the application of transformers in vision tasks. The ViT 13 borrows from previous work and encodes an image into several tokens with location information; it was the first transformer-based approach to match or even surpass CNNs in terms of performance on vision tasks. Cross ViT 14 utilizes a dual-path structure to improve the effect of the ViT on multiscale features. Based on the above algorithms, we propose an MDFF method for feature augmentation to better utilize ViTs for the visual representation of small objects.

CNNs
Although ViTs can outperform CNNs on some tasks with additional pretraining on big data, the lower dataset requirements and faster training speeds of CNNs make them some of the dominant frameworks for vision tasks. A large number of scholars have conducted in-depth research on CNNs. For example, ResNet 6 uses the idea of residual learning to deepen convolutional networks, and DenseNet 8 further reduces the number of required model parameters by reusing features. These improvements all provide CNNs with more powerful classification capabilities. ConvMixer 15 was inspired by the ViT. It directly operates on patches and performs the convolution operation, which further improves the classification ability of the convolutional model. However, the above methods are not suitable for small object classification because when the object sizes are too small, deeper networks with more spatial convolutions tend to lose features more easily. Based on the above considerations, in this paper, the frequency domain features are embedded in a ConvMixer-based MDFF framework to achieve feature enhancement.

Frequency Domain Feature Extraction
The frequency domain representations of images contain rich information, and making full use of this information can improve a computer's understanding in various image processing tasks. Hsu et al. 16 were the first to use the Mandala transform to identify targets. Shen and Sethi 17 directly extracted low-level frequency domain features from images to detect regions of interest and edges. Both Ehrlich and Davis 18 and Gueguen et al. 19 skipped the JPEG decoding step and directly used frequency domain information for learning. Ehrlich and Davis 18 proposed a general learning algorithm in the JPEG transform domain for interconversion between spatial and frequency domain networks, whereas Gueguen et al. 19 used an intermediate JPEG codec module to extract frequency domain features to train a CNN model for image classification. Gueguen et al. 19 also considered DCT to be an alternative convolution. Xu et al. 9 analyzed spectral bias from the perspective of the frequency domain, proposed a learning-based channel pruning algorithm to prune frequency components that are of little use, and used frequency domain information as the input of commonly used neural networks. These methods consider only the frequency domain features and do not consider the fusion of frequency domain features and spatial domain features. The work in this paper is an attempt to fuse the spatial and frequency domain features.

Methodology
Our feature fusion method is built on the ConvMixer and ViT models for small object classification. Therefore, this section first briefly introduces the ConvMixer and ViT models and then describes our proposed algorithm frameworks (MDFF-ConvMixer and MDFF-ViT) in detail.

ViT and ConvMixer Frameworks
A ViT splits an entire image into small image patches and then converts these small patches into linear embedding sequences via linear projection. 13 Since this splitting process loses the position information of the image block, which is indispensable in vision tasks, the ViT adds a position embedding to each token. Similar to BERT's CLS token, an additional CLS token is added to the front of each sequence to facilitate the final classification step. All tokens of a sequence are fed into multiple transformer encoders as inputs, but the final classification process uses only the CLS tokens, not all tokens, because after multiple encoding iterations, the CLS tokens already contain important information from other tokens. A transformer encoder consists of multiple stacked blocks, each of which consists of multiheaded self-attention 12 and a multilayer perceptron (MLP). 20 It is worth noting that since CLS tokens can contain rich feature information, we try to achieve joint feature enhancement by exchanging the CLS tokens and branch tokens of different domain features.
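As a minimal illustration of the tokenization step described above, the following Python sketch splits a square image into non-overlapping patches and counts the resulting tokens (one per patch plus one CLS token). The function names are hypothetical and are not taken from any ViT implementation; the 256 input size and patch size of 16 used in the comment are the values reported in the training settings of this paper.

```python
def vit_token_count(image_size: int, patch_size: int) -> int:
    """Number of tokens fed to the encoder: one per patch plus one CLS token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1


def split_into_patches(image, patch_size):
    """Split a square single-channel image (list of rows) into flattened
    patches, ordered row-major over the patch grid."""
    n = len(image)
    patches = []
    for top in range(0, n, patch_size):
        for left in range(0, n, patch_size):
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A 256 x 256 input with patch size 16 gives a 16 x 16 patch grid plus CLS.
tokens = vit_token_count(256, 16)

# Toy 4 x 4 image with pixel values 0..15, split into four 2 x 2 patches.
toy_image = [[r * 4 + c for c in range(4)] for r in range(4)]
toy_patches = split_into_patches(toy_image, 2)
```

In an actual ViT each flattened patch would additionally pass through the linear projection and receive a position embedding; both are omitted here.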
ConvMixer consists of a patch embedding block and multiple repeated fully convolutional blocks. Although ConvMixer's patch embedding block is similar to the ViT's linear projection function, it is implemented through a 2D convolution operation. Each fully convolutional block of ConvMixer consists of a grouped convolution (the number of groups equals the number of channels) and a point convolution (the convolution kernel has a size of 1 × 1). The group-convolved feature map and the point-convolved feature map undergo residual learning. A pooling and normalization layer is located after each convolution operation to reduce the computational cost of the model. Although Trockman and Kolter 15 believed that ConvMixer is similar to the ViT in terms of its idea, its architecture is more similar to those of CNNs, such as ResNet. Because ConvMixer performs well on small target recognition tasks, this paper builds the proposed approach on the ConvMixer model. While adding frequency domain features, we design a feature fusion recognition framework based on ConvMixer by performing crossdomain feature splicing along the channel dimension.
Aiming at the difficulty of small target feature extraction, we introduce frequency domain features and use MDFF to enhance the classification ability of the model. This design idea broadens the feature extraction channels, rather than only mining features based on depth, so it is more suitable for the extraction of small target features.

Frequency Domain Feature Extraction
For small targets with insufficient spatial information, common classifiers do not perform well. For example, CNNs and ViTs, similar to the human eye, tend to pay more attention to information such as the textures and positions of images when performing classification, which are all spatial features. For objects with higher resolutions and larger sizes, a classifier can achieve better classification results by using only spatial features. For small targets, the advantages of these classifiers, such as their deeper networks and extra spatial convolutions, do not improve the classification effect but rather interfere with the extraction of identifiable features. For this reason, we propose an MDFF method that attempts to increase the frequency domain features to achieve feature enhancement, which is accomplished by expanding the feature domain rather than adding depth features. The frequency domain feature extraction method for the deep learning network is shown in Fig. 1.
As shown in Fig. 1, the resized and cropped RGB image is denoted as $x_{\mathrm{spa}}$, and the image $x^{2}_{\mathrm{spa}}$ is obtained by performing upsampling twice using the bilinear interpolation method. Through the DCT, $x^{2}_{\mathrm{spa}}$ yields a one-dimensional matrix of the Y channel ($Y\text{-}x^{2}_{\mathrm{spa}}$), and $x_{\mathrm{spa}}$ yields one-dimensional Cb and Cr matrices ($Cb\text{-}x^{2}_{\mathrm{spa}}$, $Cr\text{-}x^{2}_{\mathrm{spa}}$). When performing the DCT, the input image is divided into multiple 8 × 8 blocks, the DCT is applied to each block, and a frequency coefficient matrix is obtained for each block. The two-dimensional DCT coefficients with the same frequency are grouped together to form a channel. Reshaping and concatenation are then used to bring the Y, Cb, and Cr matrices into tractable forms for normalization. Finally, the matrices are reshaped into a three-channel image, denoted as $x_{\mathrm{fre}}$, to facilitate the subsequent feature extraction and fusion processes.
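The block-wise DCT and the grouping of same-frequency coefficients into channels can be sketched in pure Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it uses the orthonormal DCT-II on a single image plane, the helper names `dct_2d` and `blockwise_dct_channels` are hypothetical, and the color conversion, upsampling, normalization, and final reshaping into a three-channel image are omitted.

```python
import math

BLOCK = 8  # block size used for the DCT, as stated in the text


def dct_2d(block):
    """Orthonormal 2D DCT-II of a square block given as a list of rows."""
    n = len(block)

    def alpha(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    # Precompute the cosine basis: basis[k][x] = cos((2x + 1) k pi / (2n)).
    basis = [
        [math.cos((2 * x + 1) * k * math.pi / (2 * n)) for x in range(n)]
        for k in range(n)
    ]
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += block[x][y] * basis[u][x] * basis[v][y]
            out[u][v] = alpha(u) * alpha(v) * total
    return out


def blockwise_dct_channels(plane, block=BLOCK):
    """Tile a 2D plane into block x block tiles, apply the DCT to each tile,
    and group coefficients of the same frequency into one channel.

    Returns channels[u * block + v][i][j], the (u, v) coefficient of tile (i, j).
    """
    height, width = len(plane), len(plane[0])
    if height % block or width % block:
        raise ValueError("plane size must be a multiple of the block size")
    rows, cols = height // block, width // block
    channels = [[[0.0] * cols for _ in range(rows)] for _ in range(block * block)]
    for i in range(rows):
        for j in range(cols):
            tile = [
                plane[i * block + r][j * block:(j + 1) * block]
                for r in range(block)
            ]
            coeffs = dct_2d(tile)
            for u in range(block):
                for v in range(block):
                    channels[u * block + v][i][j] = coeffs[u][v]
    return channels


# A constant 16 x 16 plane: all energy lands in the DC channel (index 0).
flat_plane = [[1.0] * 16 for _ in range(16)]
freq_channels = blockwise_dct_channels(flat_plane)
```

For a 16 × 16 plane this produces 64 frequency channels, each of size 2 × 2, so spatial resolution is traded for channel depth before the later reshaping step.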

Spatial and Frequency Domain Feature Fusion
To study the small target recognition effect attained after adding frequency domain features, we improve the feature fusion abilities of two different recognition models: the state-of-the-art CNN-based ConvMixer model and the classic transformer-based ViT classification model. We refer to the improved recognition frameworks as MDFF-ConvMixer and MDFF-ViT, respectively.

MDFF-ConvMixer
As shown in Fig. 2, the input image is subjected to simple image preprocessing techniques (such as resizing, cropping, and rotation) and frequency domain feature extraction (Sec. 3.2) to obtain the target spatial image $x_{\mathrm{spa}} \in \mathbb{R}^{c \times n \times n}$ and the frequency domain image $x_{\mathrm{fre}} \in \mathbb{R}^{c \times n \times n}$. Through their patch embedding modules, $x_{\mathrm{spa}}$ and $x_{\mathrm{fre}}$ yield the spatial feature block $y_{\mathrm{spa}} \in \mathbb{R}^{h \times (n/p) \times (n/p)}$ and the frequency domain feature block $y_{\mathrm{fre}} \in \mathbb{R}^{h \times (n/p) \times (n/p)}$, respectively. Each patch embedding module is composed of a convolutional layer, an activation function, and a normalization layer. 15 The convolutional layer has a kernel size of $p_1$ and a stride of $p_1$, where $p_1$ is the patch size. The transformation process is shown in the following equations:

$$y_{\mathrm{spa}} = \mathrm{BN}(\sigma\{\mathrm{Conv}_{c_{\mathrm{in}} \to h}(x_{\mathrm{spa}})\}), \tag{1}$$

$$y_{\mathrm{fre}} = \mathrm{BN}(\sigma\{\mathrm{Conv}_{c_{\mathrm{in}} \to h}(x_{\mathrm{fre}})\}). \tag{2}$$

The feature blocks obtained by passing $y_{\mathrm{spa}}$ and $y_{\mathrm{fre}}$ through the stacked ConvMixer layers are denoted as $y^{M}_{\mathrm{spa}} \in \mathbb{R}^{h \times (n/p) \times (n/p)}$ and $y^{M}_{\mathrm{fre}} \in \mathbb{R}^{h \times (n/p) \times (n/p)}$, respectively, where $M$ is the number of stacked ConvMixer layers. Each ConvMixer layer is composed of depthwise convolution and pointwise convolution. 15 Depthwise convolution is a grouped convolution with the number of groups equal to the number of channels, whereas pointwise convolution is a 1 × 1 convolution. An activation function and a normalization layer are located after each convolution. The features before and after the depthwise convolution are joined by a residual connection, 15 as shown in Eqs. (3)-(6). The depthwise convolution and pointwise convolution operations are shown in Eqs. (7) and (8), respectively:

$$\hat{y}^{t}_{\mathrm{spa}} = \mathrm{BN}(\sigma\{\mathrm{DepConv}_{h \to h}(y^{t-1}_{\mathrm{spa}})\}) + y^{t-1}_{\mathrm{spa}}, \tag{3}$$

$$y^{t}_{\mathrm{spa}} = \mathrm{BN}(\sigma\{\mathrm{PotConv}_{h \to h}(\hat{y}^{t}_{\mathrm{spa}})\}), \tag{4}$$

$$\hat{y}^{t}_{\mathrm{fre}} = \mathrm{BN}(\sigma\{\mathrm{DepConv}_{h \to h}(y^{t-1}_{\mathrm{fre}})\}) + y^{t-1}_{\mathrm{fre}}, \tag{5}$$

$$y^{t}_{\mathrm{fre}} = \mathrm{BN}(\sigma\{\mathrm{PotConv}_{h \to h}(\hat{y}^{t}_{\mathrm{fre}})\}), \tag{6}$$

$$\mathrm{DepConv} = \mathrm{Conv}(\mathrm{group} = h,\ \mathrm{kernel} = p_3,\ \mathrm{stride} = p_2), \tag{7}$$

$$\mathrm{PotConv} = \mathrm{Conv}(\mathrm{kernel} = 1). \tag{8}$$

Here, the initial values of $y^{t-1}_{\mathrm{spa}}$ and $y^{t-1}_{\mathrm{fre}}$ are $y_{\mathrm{spa}}$ and $y_{\mathrm{fre}}$, respectively, and $t$ indexes the stacked layers, with $1 \le t \le M$. $p_2$ represents the stride parameter in the depthwise convolution, which is determined by the sizes of the input and output. The function of $p_2$ is to ensure that the size of each feature remains unchanged before and after the convolution operation. 21 $p_3$ represents the kernel size in the depthwise convolution, which is generally 7. The kernel size of the pointwise convolution is set to 1. After obtaining the spatial domain depth feature $y^{M}_{\mathrm{spa}}$ and the frequency domain depth feature $y^{M}_{\mathrm{fre}}$, the fused feature $z \in \mathbb{R}^{(2h) \times (n/p) \times (n/p)}$ is obtained through channel splicing, as Eq. (9) shows:

$$z = [\,y^{M}_{\mathrm{spa}} \,\|\, y^{M}_{\mathrm{fre}}\,]. \tag{9}$$

The feature $z$ yields the output category after passing through the fully connected layer. The loss function of MDFF-ConvMixer is calculated using cross-entropy and is defined as

$$L = -\sum_{i} \sum_{c=1}^{M} q_{ic} \log(p_{ic}), \tag{10}$$

where $M$ here represents the number of categories and $q_{ic}$ is an indicator function that takes a value of 0 or 1. If the true category of sample $i$ is $c$, it takes the value of 1; otherwise, it takes the value of 0. The probability $p_{ic}$ represents the likelihood of sample $i$ belonging to category $c$, which is obtained by inputting the feature $z$ into a fully connected layer.
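To make the last two steps concrete, the sketch below shows channel splicing of two feature blocks and the cross-entropy of a single sample in plain Python. It is a toy illustration under assumptions: features are nested lists indexed as channel, row, column; the probabilities come from a softmax over hypothetical class scores rather than from a trained fully connected layer; and none of the function names come from the original code.

```python
import math


def concat_channels(y_spa, y_fre):
    """Channel-wise splicing: both inputs are lists of channels with equal
    spatial size, so joining the channel lists doubles the channel count."""
    if len(y_spa[0]) != len(y_fre[0]) or len(y_spa[0][0]) != len(y_fre[0][0]):
        raise ValueError("spatial sizes of the two feature blocks must match")
    return y_spa + y_fre


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_class):
    """Cross-entropy of one sample. With a one-hot indicator q, the sum over
    categories reduces to minus the log-probability of the true class."""
    return -math.log(probs[true_class])


# Two feature blocks with h = 2 channels and a 1 x 1 spatial grid.
y_spa = [[[0.3]], [[0.7]]]
y_fre = [[[1.2]], [[0.1]]]
z = concat_channels(y_spa, y_fre)  # 2h = 4 channels

# Hypothetical class scores for a two-class problem with equal scores.
probs = softmax([0.0, 0.0])
loss = cross_entropy(probs, true_class=0)
```

With equal scores the two classes each receive probability 0.5, so the loss equals log 2.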
To study the best location for fusing the spatial and frequency domain features, we make attempts with different strategies, such as the following.
MC-strategy1: fuse $x_{\mathrm{spa}}$ and $x_{\mathrm{fre}}$ before extracting features.
MC-strategy2: fuse the image patches $y_{\mathrm{spa}}$ and $y_{\mathrm{fre}}$ before extracting features.
MC-strategy3: use an attention mechanism to perform feature fusion on $y^{M}_{\mathrm{spa}}$ and $y^{M}_{\mathrm{fre}}$ after feature extraction.
MC-strategy4: add the final outputs of the spa-branch and fre-branch instead of fusing features, that is, combine $\mathrm{output}_{\mathrm{spa}}$ and $\mathrm{output}_{\mathrm{fre}}$ into a new output: $\mathrm{output}_{\mathrm{new}} = \mathrm{output}_{\mathrm{spa}} + \mathrm{output}_{\mathrm{fre}}$.
MC-strategy5: take the element-wise maximum of the outputs from the spa-branch and fre-branch to get a new output: $\mathrm{output}_{\mathrm{new}} = \max(\mathrm{output}_{\mathrm{spa}}, \mathrm{output}_{\mathrm{fre}})$.
Through comparative experiments, we find that the optimal feature fusion method for the ConvMixer model is feature splicing before the fully connected layer. We believe that this is because the fusion mechanisms of MC-strategy1 to MC-strategy3 may destroy the location information contained in the spatial domain features, whereas MC-strategy4 and MC-strategy5 fail because the losses in the spatial and frequency domains have quite different meanings and are difficult to fuse. Detailed ablation experiments can be found in Sec. 4.3.
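The two output-level strategies (MC-strategy4 and MC-strategy5) differ only in how the branch outputs are combined, which the following short Python sketch illustrates. The score vectors are invented toy values, and the helper names are hypothetical; the sketch only shows that summation and element-wise maximum can lead to different decisions for the same pair of branch outputs.

```python
def fuse_sum(output_spa, output_fre):
    """MC-strategy4 style: element-wise sum of the two branch outputs."""
    return [a + b for a, b in zip(output_spa, output_fre)]


def fuse_max(output_spa, output_fre):
    """MC-strategy5 style: element-wise maximum of the two branch outputs."""
    return [max(a, b) for a, b in zip(output_spa, output_fre)]


def predict(output):
    """Index of the highest-scoring class."""
    return max(range(len(output)), key=output.__getitem__)


# Toy three-class outputs from the spatial and frequency branches.
output_spa = [3.0, 1.0, 0.0]
output_fre = [0.0, 2.5, 0.0]

summed = fuse_sum(output_spa, output_fre)   # [3.0, 3.5, 0.0]
maxed = fuse_max(output_spa, output_fre)    # [3.0, 2.5, 0.0]
```

Here the summed output favors class 1 while the maximum favors class 0, so the choice of combination rule alone can change the prediction.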

MDFF-ViT
As shown in Fig. 3, the input image is subjected to simple image preprocessing to obtain the spatial domain image $x_{\mathrm{spa}}$, and the frequency domain image $x_{\mathrm{fre}}$ is then obtained through frequency domain feature extraction (Sec. 3.2). Linear projection is used to process $x_{\mathrm{spa}}$ and $x_{\mathrm{fre}}$ to obtain two different tokens ($T_{\mathrm{spa}}$ and $T_{\mathrm{fre}}$, respectively). $T_{\mathrm{spa}}$ and $T_{\mathrm{fre}}$ are processed through the token fusion module to obtain two fused features, $T_{\mathrm{fre+spa}}$ and $T_{\mathrm{spa+fre}}$. Note that the token fusion module takes two tokens as input and also outputs two tokens. A token can be split into a CLS token and multiple patch tokens, where the CLS token contains most of the information of the entire token. Thus, $T^{\mathrm{cls}}_{\mathrm{fre+spa}}$ contains most of the information in the fused feature $T_{\mathrm{fre+spa}}$, and $T^{\mathrm{cls}}_{\mathrm{spa+fre}}$ contains most of the information in the fused feature $T_{\mathrm{spa+fre}}$. Our approach sends $T^{\mathrm{cls}}_{\mathrm{fre+spa}}$ and $T^{\mathrm{cls}}_{\mathrm{spa+fre}}$ to two separate MLP heads and finally performs linear fusion on the two losses.

Effective feature fusion is the key to learning multidomain feature representations. After testing several strategies, such as self-attention feature fusion and simple token splicing and fusion (details of the fusion schemes can be found in Sec. 4.3), we choose a token fusion module based on the cross-attention mechanism, whose design is inspired by Cross ViT. 14 Each token fusion module in MDFF-ViT consists of two parallel transformer encoders and a cross-attention mechanism. The two transformer encoders separately extract the spatial and frequency domain features of the data, and the attention mechanism is mainly responsible for the feature fusion part. The features obtained after the transformer encoders are denoted as $\tilde{T}_{\mathrm{fre}}$ and $\tilde{T}_{\mathrm{spa}}$, respectively, and they are cross-fused using the cross-attention mechanism. The feature cross-fusion process is described in Eqs. (11)-(15).

Taking the case in which $\tilde{T}_{\mathrm{fre}}$ fuses the information of $\tilde{T}_{\mathrm{spa}}$ to obtain $T_{\mathrm{fre+spa}}$ as an example, $\tilde{T}_{\mathrm{fre}}$ can be split into $\tilde{T}^{\mathrm{cls}}_{\mathrm{fre}}$ and $\tilde{T}^{\mathrm{branch}}_{\mathrm{fre}}$, and $\tilde{T}_{\mathrm{spa}}$ can be split into $\tilde{T}^{\mathrm{cls}}_{\mathrm{spa}}$ and $\tilde{T}^{\mathrm{branch}}_{\mathrm{spa}}$:

$$\tilde{T}_{\mathrm{fre}} = [\,\tilde{T}^{\mathrm{cls}}_{\mathrm{fre}} \,\|\, \tilde{T}^{\mathrm{branch}}_{\mathrm{fre}}\,], \qquad \tilde{T}_{\mathrm{spa}} = [\,\tilde{T}^{\mathrm{cls}}_{\mathrm{spa}} \,\|\, \tilde{T}^{\mathrm{branch}}_{\mathrm{spa}}\,]. \tag{11}$$

When calculating the QKV matrices, cross-attention can utilize different processing methods. 22 Here, QKV refers to the query, key, and value matrices proposed in Ref. 10. As shown in Eqs. (12)-(14), $q$ is calculated through the linear projection of $\tilde{T}^{\mathrm{cls}}_{\mathrm{fre}}$, $k$ is calculated by the simple concatenation of the linear projection result of $\tilde{T}^{\mathrm{cls}}_{\mathrm{fre}}$ with $\tilde{T}^{\mathrm{branch}}_{\mathrm{spa}}$, and the calculation process of $v$ is similar to that of $k$. The purpose of linear projection is to align the dimensions of $\tilde{T}^{\mathrm{cls}}_{\mathrm{fre}}$ and $\tilde{T}^{\mathrm{branch}}_{\mathrm{spa}}$ to facilitate the subsequent feature cross-fusion calculations. Linear projection is achieved by adding several linear layers, and the linear projections in the spatial and frequency domains are represented by the functions $f_{\mathrm{spa}}(\cdot)$ and $f_{\mathrm{fre}}(\cdot)$, respectively. The calculation equations of $q$, $k$, and $v$ are as follows:

$$q = f_{\mathrm{fre}}(\tilde{T}^{\mathrm{cls}}_{\mathrm{fre}})\, W_q, \tag{12}$$

$$k = [\,f_{\mathrm{fre}}(\tilde{T}^{\mathrm{cls}}_{\mathrm{fre}}) \,\|\, \tilde{T}^{\mathrm{branch}}_{\mathrm{spa}}\,]\, W_k, \tag{13}$$

$$v = [\,f_{\mathrm{fre}}(\tilde{T}^{\mathrm{cls}}_{\mathrm{fre}}) \,\|\, \tilde{T}^{\mathrm{branch}}_{\mathrm{spa}}\,]\, W_v, \tag{14}$$

where $W_q$, $W_k$, and $W_v$ are learnable parameters. After calculating the QKV matrices, the CLS token $\tilde{T}^{\mathrm{cls}}_{\mathrm{fre+spa}}$ fused with spatial information can be obtained. $\tilde{T}^{\mathrm{cls}}_{\mathrm{fre+spa}}$ is aligned through the backprojection function $g_{\mathrm{fre}}(\cdot)$ and spliced with $\tilde{T}^{\mathrm{branch}}_{\mathrm{fre}}$ to obtain the fused token, which is denoted as $T_{\mathrm{fre+spa}}$:

$$T_{\mathrm{fre+spa}} = [\,g_{\mathrm{fre}}(\tilde{T}^{\mathrm{cls}}_{\mathrm{fre+spa}}) \,\|\, \tilde{T}^{\mathrm{branch}}_{\mathrm{fre}}\,]. \tag{15}$$

The fusion of $T_{\mathrm{spa}}$ with $T_{\mathrm{fre}}$ follows a similar process. For better feature extraction and feature fusion, the token fusion module is stacked $R$ times.
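The cross-attention step for one CLS token can be sketched in plain Python as below. Several points are assumptions made for illustration rather than details confirmed by the text: the attention weights use the standard scaled dot-product with softmax from Ref. 10, a single attention head is used, the frequency domain CLS token is taken to be already projected by the alignment function, and the back-projection and any residual connection are omitted. All function names are hypothetical.

```python
import math


def project(rows, weight):
    """Multiply each row vector (length d_in) by a d_in x d_out weight matrix."""
    d_in, d_out = len(weight), len(weight[0])
    return [
        [sum(row[i] * weight[i][j] for i in range(d_in)) for j in range(d_out)]
        for row in rows
    ]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend_cls(cls_fre, branch_spa, w_q, w_k, w_v):
    """Single-head cross-attention for one CLS token.

    cls_fre:    projected frequency-domain CLS token (list of d floats)
    branch_spa: spatial-domain patch tokens (list of lists of d floats)
    Returns the fused CLS token and the attention weights.
    """
    context = [cls_fre] + branch_spa        # concatenation [cls || patches]
    q = project([cls_fre], w_q)[0]          # query from the CLS token only
    k = project(context, w_k)               # keys from the concatenation
    v = project(context, w_v)               # values from the concatenation
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, key)) / scale for key in k]
    weights = softmax(scores)
    fused = [
        sum(w * value[j] for w, value in zip(weights, v))
        for j in range(len(v[0]))
    ]
    return fused, weights


# Toy example with d = 2, identity weights, and two spatial patch tokens.
identity = [[1.0, 0.0], [0.0, 1.0]]
fused_cls, attn = cross_attend_cls(
    [1.0, 0.0], [[0.0, 1.0], [0.0, 1.0]], identity, identity, identity
)
```

Because only the CLS token forms the query, the cost of this step grows linearly with the number of patch tokens, which is the main appeal of this style of fusion in Cross ViT.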
Here, the feature cross-fusion process of MDFF-ViT is more complicated than that of MDFF-ConvMixer. MDFF-ViT uses cross-attention for feature fusion, whereas MDFF-ConvMixer uses simple channel concatenation. The reason for this design is that the tokens in MDFF-ViT carry location information, and the frequency domain features do not destroy the location information of the spatial domain features during feature crossover; thus, the fusion process is more thorough. Subsequent experiments demonstrate that thorough feature fusion can achieve higher performance gains.
It should be noted that in the entire MDFF-ViT framework, the process leading up to the sum of the MLP heads is coherent. Unlike MDFF-ConvMixer, which merges the spatial and frequency domain features by concatenation, MDFF-ViT feeds the fusion results of the two feature layers ($T^{\mathrm{cls}}_{\mathrm{fre+spa}}$, $T^{\mathrm{cls}}_{\mathrm{spa+fre}}$) into two MLP heads separately. The MLP heads output two recognition scores derived from the feature mappings, and the final fusion decision is obtained by summing the two recognition scores. The loss function of MDFF-ViT is similar to that of MDFF-ConvMixer in that both use cross-entropy. However, due to the presence of an additional MLP head in MDFF-ViT, the probability score calculation is different. In the loss function of MDFF-ViT, $N$ represents the number of categories, and $q_{ic}$ denotes the indicator function, taking binary values of 0 or 1. Specifically, $q_{ic}$ takes the value of 1 if the true category of sample $i$ is $c$; otherwise, it is set to 0. The probabilities $p^{\mathrm{spa+fre}}_{ic}$ and $p^{\mathrm{fre+spa}}_{ic}$ represent the likelihood of sample $i$ belonging to category $c$. These two probabilities are obtained by passing the features $T^{\mathrm{cls}}_{\mathrm{spa+fre}}$ and $T^{\mathrm{cls}}_{\mathrm{fre+spa}}$ through the two MLP heads separately.

Enhancement Analysis of Feature Representation
Our framework is not limited to improving particular spatial models such as ConvMixer and ViT. It can also be applied easily to enhance other spatial feature models, including recognition methods based on contour features, such as (Misra, 2018), 23 (Asem, 2018), 24 and (Saleem, 2019). 25 Due to the significant differences between the spatial and frequency domain features of small objects, our improvement strategy can substantially enhance the information content of the original (spatial) features by introducing frequency domain features. This naturally leads to improved recognition performance. The feature maps before and after feature fusion are shown in Fig. 4. $x_{\mathrm{spa}}$, $y^{M}_{\mathrm{spa}}$, $y^{M}_{\mathrm{fre}}$, and $Z$ in Fig. 4 represent the input RGB image, the spatial feature map, the frequency feature map, and the fused feature map, respectively. For better visualization, all of them have been scaled to a size of 256 × 256. It can be observed that the frequency domain features enrich the information content of the spatial domain features, which is the key to our model's performance.

Experiments
In this section, we present experiments and their results to demonstrate the effectiveness of our proposed method.

Dataset
Our purpose is to conduct research on recognition technology for small objects with sizes between 8 × 8 and 32 × 32. Due to the lack of such publicly available datasets, we use a downsampled DOTA dataset. The target areas marked in the DOTA dataset are cropped and downsampled to 1/4 of the original images, and the targets with pixel areas smaller than 32 × 32 are retained as the classification dataset. The dataset composition is shown in Fig. 5. In the following, this dataset is recorded as Dota$_{32\times32}$, and the smaller target dataset obtained after further downsampling of Dota$_{32\times32}$ is recorded as Dota$_{\frac{32}{2}\times\frac{32}{2}}$. These two datasets are subsequently used for experiments to test the performance of the algorithm on small targets of different sizes. The ratio of the training set to the test set is approximately 5:3. Dota$_{32\times32}$ has 6 types of positive samples and 1 type of negative sample, for a total of 15,065 samples, and Dota$_{\frac{32}{2}\times\frac{32}{2}}$ has the same breakdown. The training sets of Dota$_{32\times32}$ and Dota$_{\frac{32}{2}\times\frac{32}{2}}$ are shown in the following figure. We also use the publicly available dataset cifar10 26 and downsample it by 1/2 to meet the requirements of our research; the result is referred to as cifar10$_{\frac{32}{2}\times\frac{32}{2}}$. In addition to Dota$_{32\times32}$, Dota$_{\frac{32}{2}\times\frac{32}{2}}$, and cifar10$_{\frac{32}{2}\times\frac{32}{2}}$, we conduct experiments on the publicly available dataset Fashion-MNIST. 27
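The halving of image resolution used to build the smaller datasets can be illustrated with a short Python sketch. The text does not state which resampling filter was used, so the 2 × 2 average pooling below is an assumption chosen for simplicity, and the function name is hypothetical.

```python
def downsample_by_two(image):
    """Halve the resolution of a single-channel image (list of rows) by
    averaging each non-overlapping 2 x 2 neighborhood.
    Assumption: the paper does not specify the resampling filter."""
    height, width = len(image), len(image[0])
    if height % 2 or width % 2:
        raise ValueError("image height and width must be even")
    return [
        [
            (
                image[2 * r][2 * c]
                + image[2 * r][2 * c + 1]
                + image[2 * r + 1][2 * c]
                + image[2 * r + 1][2 * c + 1]
            )
            / 4.0
            for c in range(width // 2)
        ]
        for r in range(height // 2)
    ]


# A 32 x 32 toy image becomes 16 x 16 after one downsampling step.
toy = [[float(r + c) for c in range(32)] for r in range(32)]
small = downsample_by_two(toy)
```

Applying the function to a 32 × 32 sample yields a 16 × 16 sample, matching the size relationship between the two DOTA-derived datasets.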

Training and evaluation
When conducting control experiments, we set the same hyperparameters for the same set of experiments. We run the experiments for 200 epochs on two 3080Ti GPUs. The optimizer is adaptive moment estimation (Adam), the default batch size is set to 64 (adjusted per model), the initial learning rate is set to 0.0001, the learning rate decay coefficient is 0.9, and the number of learning rate decay iterations is 20. The datasets are resized to 256 before being input into the classifier, and simple data enhancements such as flipping and cropping are used. The convolutional models participating in the experimental comparison include ResNet50, DenseNet, HorNet, 28 and ConvMixer, and the transformer models include Deit, 29 ViT-pre, ViT, Cross ViT-pre, Cross ViT, Swin-Transformer, 30 and CSWin-Transformer. 31 Among them, the patch embeddings of ViT-pre, ViT, Cross ViT-pre, Cross ViT, and MDFF-ViT are all linear, the patch sizes of the ViT and MDFF-ViT are 16, 13 the small patch size of Cross ViT is 16, and the large patch size is 64. 14 In the experiment, Deit's teacher model is ResNet50, and the patch size is 16. 29 ConvMixer 15 and MDFF-ConvMixer have 256 dimensions and depths of 24. We use the top-1 accuracy attained on the test set as the model performance evaluation metric.
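One plausible reading of the learning rate settings above is a step decay, in which the rate is multiplied by 0.9 every 20 epochs. This reading is an assumption, since the text does not define the schedule precisely, and the function below is a hypothetical helper rather than part of the original training code.

```python
def learning_rate(epoch, base_lr=1e-4, decay=0.9, step=20):
    """Step decay: multiply the base rate by `decay` once every `step` epochs.
    Assumes the '20 decay iterations' in the text means a step size of 20."""
    return base_lr * decay ** (epoch // step)


# Rates at the start, after the first decay, and in the final epoch of 200.
lr_start = learning_rate(0)
lr_after_first_decay = learning_rate(20)
lr_final = learning_rate(199)
```

Under this reading the rate is decayed nine times over 200 epochs, ending at roughly 39% of its initial value.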

Training process
The training processes of MDFF-ConvMixer and MDFF-ViT are largely similar to those of the original algorithms, and both follow an end-to-end approach. The only difference is that the original ConvMixer and ViT require only spatial domain images and labels as inputs during training, whereas MDFF-ConvMixer and MDFF-ViT additionally require frequency domain images. The frequency domain images are generated on the CPU by the dataloader when the dataset is loaded. Therefore, our conversion method does not introduce any additional GPU time. Moreover, the CPU conversion of spatial domain images to frequency domain images is fast, taking less than 1 min to convert 100,000 images on an Intel Xeon Gold 6330. However, our MDFF method also has its limitations. Due to the addition of frequency domain images, the memory usage and the GPU training time of the new network nearly double compared with those of the original network.
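The idea of producing the frequency domain image alongside the spatial image at loading time can be sketched as a small map-style dataset wrapper. The class name, its constructor arguments, and the stand-in conversion function are all hypothetical; a real pipeline would plug in the DCT-based conversion and wrap the dataset in a framework dataloader.

```python
class PairedDomainDataset:
    """Map-style dataset that returns (x_spa, x_fre, label) for each index.

    `to_frequency` is any callable that maps a spatial image to its
    frequency-domain counterpart; it runs on the CPU at access time.
    """

    def __init__(self, images, labels, to_frequency):
        if len(images) != len(labels):
            raise ValueError("images and labels must have the same length")
        self.images = images
        self.labels = labels
        self.to_frequency = to_frequency

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        x_spa = self.images[index]
        x_fre = self.to_frequency(x_spa)  # converted when the sample is loaded
        return x_spa, x_fre, self.labels[index]


# Toy usage with a stand-in conversion (doubling) in place of the DCT.
dataset = PairedDomainDataset(
    images=[[1, 2], [3, 4]],
    labels=[0, 1],
    to_frequency=lambda image: [value * 2 for value in image],
)
sample = dataset[1]
```

Because the conversion happens inside item access, it runs in the data loading workers and the model only ever receives ready-made pairs.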

Main Results
The experimental results are shown in Table 1. Except for the models with "-pre" after their name, the models appearing in this paper do not load any pre-trained weights. The ViT, Deit, and Cross ViT are commonly used transformer models, and ResNet and DenseNet are commonly used convolutional models.

Convolutional models
It can be seen from

Transformer models
As seen from Table 1, on the multiscale small image datasets Dota$_{32\times32}$, Dota$_{\frac{32}{2}\times\frac{32}{2}}$, cifar10$_{\frac{32}{2}\times\frac{32}{2}}$, and Fashion-MNIST, MDFF-ViT has obvious advantages in classification ability over the non-pretrained ViT, Cross ViT, and Deit. Cross ViT can be regarded as an improved version of the ViT from the perspective of multiscale feature fusion, and compared with Cross ViT, MDFF-ViT yields accuracy improvements of 1.62%, 0.61%, 3.39%, and 2.23% on Dota$_{32\times32}$, Dota$_{\frac{32}{2}\times\frac{32}{2}}$, cifar10$_{\frac{32}{2}\times\frac{32}{2}}$, and Fashion-MNIST, respectively, which shows that the improvement exhibited by MDFF-ViT over the ViT is not merely due to an increased number of network calculations. It also shows that MDFF is more effective than multiscale feature fusion under the same number of computations. On the four datasets, MDFF-ViT is stronger than the pretrained Cross ViT, and the classification ability of the pretrained ViT is competitive. The design of MDFF-ViT aims not only to improve accuracy but also to keep the model efficient. Due to the additional features provided by MDFF, MDFF-ViT requires fewer MLP layers and has a smaller parameter size than the ViT and Cross ViT. Furthermore, the FLOPs of MDFF-ViT are only 1/7 of the ViT's FLOPs, thanks to the convergence achieved by the multidomain features of MDFF-ViT in a shallower network. In terms of accuracy, parameter size, and FLOPs, MDFF-ViT outperforms the non-pretrained ViT.
It is worth noting that MDFF-ViT, with its more complex feature fusion, demonstrates inferior recognition performance compared to the simpler feature fusion approach of MDFF-ConvMixer. This is because convolutional models, leveraging their inherent inductive prior for exploiting spatial invariance in 2D image data, outperform transformer-based models in recognition performance, with smaller parameter counts and computational requirements, especially in the case of small datasets and non-pretrained classification models. Consequently, when using non-pretrained models, ConvMixer outperforms ViT significantly in terms of recognition. As a result, MDFF-ConvMixer surpasses MDFF-ViT in recognition performance.

Ablation Study with Each Improvement
The MDFF-ConvMixer and MDFF-ViT frameworks designed in this paper both contain two improvements: the introduction of frequency domain features and the feature fusion module. In this section, a series of ablation experiments is conducted on Dota 32×32, Dota32 2 × 32 2, cifar1032 2 × 32 2, and Fashion-MNIST to better understand the effectiveness of each improvement. In the experiments, A denotes the base algorithm, D denotes the frequency domain features, and F denotes feature fusion; the experimental combinations are listed in Table 2. When only D is added to the ViT or ConvMixer, the model uses only the frequency domain features, so the result is not as good as that obtained with only the spatial domain features, and the accuracy on each dataset decreases. When both improvements are added, the classification accuracy improves the most, which demonstrates the effectiveness of the MDFF approach proposed in this paper.
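To make the D component of the ablation concrete, the following is a minimal sketch of how frequency domain features can be obtained with a blockwise DCT and rearranged into a three-dimensional matrix. It is an illustration only, not the authors' implementation: the block size, the orthonormal DCT-II normalization, the channel ordering (one channel per frequency index), and the function names `dct2` and `block_dct_channels` are all assumptions.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def dct2(block):
    """2D DCT-II of a square block, computed as C @ block @ C^T."""
    c = dct_matrix(len(block))
    c_t = [list(r) for r in zip(*c)]
    return matmul(matmul(c, block), c_t)

def block_dct_channels(image, bs):
    """Split an H x W image into bs x bs blocks, apply the DCT to each block,
    and stack the coefficients so that channel u*bs+v holds frequency (u, v).
    Returns a nested list of shape [bs*bs][H//bs][W//bs].
    (Assumed layout, chosen for illustration.)"""
    gh, gw = len(image) // bs, len(image[0]) // bs
    channels = [[[0.0] * gw for _ in range(gh)] for _ in range(bs * bs)]
    for by in range(gh):
        for bx in range(gw):
            block = [row[bx * bs:(bx + 1) * bs]
                     for row in image[by * bs:(by + 1) * bs]]
            coeffs = dct2(block)
            for u in range(bs):
                for v in range(bs):
                    channels[u * bs + v][by][bx] = coeffs[u][v]
    return channels
```

For a constant image, all of the energy lands in the first (DC) channel and the remaining channels are zero, which is a quick sanity check that the transform and the channel layout behave as intended.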

Ablation Study with Different Feature Fusion Methods of MDFF-ConvMixer
Spatial domain features and frequency domain features are two different types of features; the better they are fused, the richer the information available for classification. Several feasible feature fusion schemes based on ConvMixer were introduced in Sec. 3.3. In this section, to compare the performance of these fusion schemes and verify the effectiveness of our models, a series of experiments is conducted on the Dota 32×32 dataset. Apart from the fusion scheme, all hyperparameters, such as the batch size, are the same as those mentioned before. The same applies to the experiments in the following sections.
The results of the different ConvMixer-based feature fusion schemes on the Dota 32×32 dataset are shown in Table 3, together with their parameters and FLOPs. The symbols used in the table are related to Fig. 2. The schemes in the table correspond to our model MDFF-ConvMixer and the five ConvMixer-related feature fusion schemes (MC-Strategy1 to MC-Strategy5) described in Sec. 3.3. As seen in the table, our MDFF-ConvMixer achieves the best accuracy with the smallest FLOPs and parameter count.

Ablation Study with Different Patch Sizes Used in MDFF-ConvMixer
When images are input into MDFF-ConvMixer, they are all processed into a sequence of embedded image patches by the patch embedding layer. The patch size has a great impact on the model's performance. Here we perform experiments on the Dota 32×32 dataset to understand the effect of the patch size in MDFF-ConvMixer. Because MDFF-ConvMixer fuses features by concatenating y M spa and y M fre, the sizes of y M spa and y M fre must be the same, and consequently the patch sizes of the spa-branch and fre-branch must be the same, too. Since the patch size pair (7,7) is used in MDFF-ConvMixer, we test four other pairs of patch sizes on the Dota 32×32 dataset: (3,3), (5,5), (9,9), and (11,11). Their accuracies, parameters, and FLOPs can be found in Table 4. The symbols used in the table are related to Fig. 2. MC-Strategy6 to MC-Strategy9 denote the four models mentioned above with different patch sizes. Smaller patch sizes lead to more FLOPs and richer information; larger patch sizes reduce computation but omit some details of the targets, which matters especially for small targets. According to the experimental results, with the patch size pair (7,7), MDFF-ConvMixer achieves the best accuracy with only a small increase in parameters and FLOPs, which confirms the suitability of our model for small targets.
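The constraint that both branches share a patch size follows from simple shape arithmetic, which can be sketched as below. The sketch assumes a ConvMixer-style patch embedding (a convolution whose kernel size and stride both equal the patch size, without padding) applied directly to a 32×32 input; whether the paper resizes the inputs first is not stated in this section, and the helper names `patch_grid` and `concat_channels` are hypothetical.

```python
def patch_grid(image_size, patch_size):
    """Side length of the feature map produced by a convolution with
    kernel = stride = patch_size and no padding (assumed embedding)."""
    return (image_size - patch_size) // patch_size + 1

def concat_channels(y_spa, y_fre):
    """Concatenate two feature maps of shape [C][H][W] along the channel
    axis. The spatial sizes must match, otherwise fusion is impossible."""
    shape_spa = (len(y_spa[0]), len(y_spa[0][0]))
    shape_fre = (len(y_fre[0]), len(y_fre[0][0]))
    if shape_spa != shape_fre:
        raise ValueError(f"spatial sizes differ: {shape_spa} vs {shape_fre}")
    return y_spa + y_fre

def zeros(channels, side):
    """Helper that builds an all-zero [channels][side][side] feature map."""
    return [[[0.0] * side for _ in range(side)] for _ in range(channels)]

# Grid side lengths for the five patch sizes tested in this section.
grid_sides = {p: patch_grid(32, p) for p in (3, 5, 7, 9, 11)}

# Same patch size in both branches: the maps can be concatenated.
fused = concat_channels(zeros(2, patch_grid(32, 7)), zeros(3, patch_grid(32, 7)))
```

Under these assumptions, a smaller patch size produces a larger grid and therefore more positions for every later layer to process, which is consistent with the FLOPs trend discussed above; mixing two different patch sizes would make `concat_channels` fail.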

Ablation Study with Different Feature Fusion Methods of MDFF-ViT
Efficient feature fusion is the key to learning multidomain feature representations. To confirm the effectiveness of MDFF-ViT, we propose three other fusion strategies (MV-strategy1 to MV-strategy3) and test them on the Dota 32×32 dataset. The details of each strategy are as follows. MDFF-ViT: the method used in this article. First, T spa and T fre are fused by the cross-attention module; then the two results T cls spa+fre and T cls fre+spa output by the cross-attention module are sent to two separate MLP heads to obtain two classification scores; finally, the two classification scores are added to obtain the final result.
MV-strategy1: the cross-attention module is used to fuse T spa and T fre, then the outputs T cls spa+fre and T cls fre+spa of the cross-attention module are concatenated, and only one MLP head is used to output the classification result. MV-strategy2: compared with MDFF-ViT, only the transformer block is used for feature extraction, and no cross-attention module is used for feature fusion. MV-strategy3: compared with MDFF-ViT, a self-attention module is used instead of the cross-attention module to fuse the features.
The symbols used above are related to Fig. 3. As seen in Table 5, our MDFF-ViT achieves the best accuracy with only a minor increase in FLOPs and parameters.
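The difference between MDFF-ViT (two heads whose scores are added) and MV-strategy1 (concatenation followed by one head) can be illustrated with a small sketch. Everything below is a simplification made for illustration: the cross-attention is single-head with identity projections and a residual connection, each "MLP head" is reduced to one linear layer, and the function names are hypothetical. The actual module is defined by Fig. 3 of the paper.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(cls_query, other_tokens):
    """The class token of one branch attends over the tokens of the other
    branch (single head, identity projections, residual connection)."""
    d = len(cls_query)
    weights = softmax([dot(cls_query, t) / math.sqrt(d) for t in other_tokens])
    attended = [sum(w * t[i] for w, t in zip(weights, other_tokens))
                for i in range(d)]
    return [c + a for c, a in zip(cls_query, attended)]

def linear(x, weight, bias):
    """One linear layer standing in for an MLP head."""
    return [dot(row, x) + b for row, b in zip(weight, bias)]

def mdff_vit_scores(cls_spa, tok_spa, cls_fre, tok_fre, head_a, head_b):
    """Two fused class tokens, two heads, scores added."""
    fused_spa = cross_attend(cls_spa, tok_fre)   # plays the role of T cls spa+fre
    fused_fre = cross_attend(cls_fre, tok_spa)   # plays the role of T cls fre+spa
    s_a = linear(fused_spa, *head_a)
    s_b = linear(fused_fre, *head_b)
    return [a + b for a, b in zip(s_a, s_b)]

def strategy1_scores(cls_spa, tok_spa, cls_fre, tok_fre, head):
    """Two fused class tokens concatenated, then a single head."""
    fused = cross_attend(cls_spa, tok_fre) + cross_attend(cls_fre, tok_spa)
    return linear(fused, *head)
```

In the two-head variant, each branch keeps its own classifier and the decision is made by adding scores, so the two token dimensions do not need to match; in the concatenation variant, a single head must accept an input whose width is the sum of the two token dimensions.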

Ablation Study with Different Patch Sizes Used in MDFF-ViT
Unlike MDFF-ConvMixer, MDFF-ViT adds the two classification scores of its two branches to form the final output, so the sizes of T spa and T fre need not be the same, which means that the patch sizes of the spa-branch and fre-branch can be set differently in MDFF-ViT. We test 9 different patch size pairs on Dota 32×32 to learn the effect of the patch size; the results are shown in Table 6. MDFF-ViT achieves the best performance with the patch size pair (16,16). Intuitively, small patch sizes increase the model's computation and GPU memory usage, whereas large patch sizes lose details. The patch size pair (8,8) might be expected to give better results because it provides more fine-grained features; however, it is not as good as (16,16) because of its huge FLOPs. Halving the patch side length from 16 to 8 quadruples the number of tokens in each branch, and because the cost of attention grows quadratically with the number of tokens, the attention computation becomes roughly 16 times larger. The large number of tokens generates many floating point operations (FLOPs) and takes up a large amount of GPU memory, which forces a very small batch size (such as 1) and causes high randomness in the gradients of each layer; this consumes a lot of training time and makes it difficult for the model to converge.
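The scaling behind this comparison can be checked with a few lines of arithmetic. The sketch assumes standard non-overlapping square patches, ignores the class token, and uses a 32×32 input purely as an example; the ratios are the same for any image size divisible by both patch sizes. The function names are hypothetical.

```python
def num_tokens(image_size, patch_size):
    """Number of non-overlapping patch tokens for a square image."""
    side = image_size // patch_size
    return side * side

def attention_pairs(n_query, n_key):
    """Number of query-key scores one attention layer has to compute."""
    return n_query * n_key

tokens_16 = num_tokens(32, 16)   # patch size 16
tokens_8 = num_tokens(32, 8)     # patch size 8

token_ratio = tokens_8 // tokens_16
pair_ratio = (attention_pairs(tokens_8, tokens_8)
              // attention_pairs(tokens_16, tokens_16))
```

Under these assumptions the token count per branch grows by a factor of 4 and the number of attention scores by a factor of 16 when the patch size goes from 16 to 8, which is why the smaller patch size is so much more expensive to train.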

Conclusion
We propose an MDFF method for small target classification, which realizes multidomain feature extraction through the fusion of frequency domain features and spatial domain features. The MDFF method enriches the information content of targets, which is crucial for improving the accuracy of small target classification tasks. Experiments demonstrate the effectiveness of this method. Although this study only addresses small target classification, the MDFF idea can in theory be applied to other computer vision tasks, such as object detection. This is because, in object detection, networks often generate numerous proposals containing positive and negative samples. When performing bounding box regression on the proposals and ground truth bounding boxes (GT-bbox), it is necessary to assign a class to each positive proposal, a process similar to object classification. Therefore, the MDFF method can increase the information contained in the proposals, which is highly beneficial for generating high-quality proposals. To demonstrate the feasibility of this viewpoint, we made a simple modification to the ROI-HEAD of faster R-CNN 32 by adding a frequency domain branch for class prediction and named it MDFF-faster R-CNN. 33 We trained it on a subset of the COCO 34 dataset (named person-car). To investigate the performance of our algorithm in real-world scenarios, we also conducted transfer learning on the UAV-Human 35 dataset for object detection, where the targets are observed from the perspective of unmanned aerial vehicles (UAVs). The experimental results in Table 7 confirm that MDFF-faster R-CNN outperforms the original faster R-CNN in terms of detection performance.
Despite the increase in network parameters, MDFF-faster R-CNN outperforms faster R-CNN in all COCO AP metrics. Especially noteworthy is the improvement of 1.35% in AP75, which further demonstrates that the MDFF method effectively enhances the quality of proposals. Furthermore, the visualization results in Fig. 6 show that including the MDFF method effectively eliminates false detections and improves the quality of the predicted bounding boxes. Based on the experimental results, we are further convinced that incorporating multidomain features will lead to better performance in object detection. We plan to explore this extension thoroughly in subsequent studies.

Code, Data, and Materials Availability
The truth dataset used in our study is publicly available. The authors' code and dataset are not publicly available at this time but are available from the authors upon reasonable request.