Offset flow-guide transformer network for semisupervised real-world video denoising

Abstract. Video denoising is a fundamental task in low-level computer vision. Most existing denoising algorithms learn from synthetic data. However, there is a significant difference between the noise distributions of synthetic and natural data, which leads to poor generalization in real scenes. Hence, a video denoising method based on an offset optical flow-guided transformer is proposed. The proposed method adopts a semisupervised framework to improve generalization, designs an offset optical flow that guides the transformer in capturing critical information, and performs global self-similarity modeling over neighboring spatiotemporal features to improve denoising performance. In addition, contrastive learning is introduced in the supervised branch to prevent fitting to incorrect labels, imaging prior information is exploited to mine sequence features in the unsupervised branch, and a two-branch memory loss is introduced to reduce the divergence between the two training branches. Experimental results on synthetic and real videos demonstrate that our method achieves clear quantitative and qualitative improvements over state-of-the-art methods with fewer parameters.


Introduction
With the wide application of digital devices in scenarios such as handheld photography, target tracking, and autonomous driving, the low-level vision task of video denoising has become increasingly critical. Denoising algorithms must not only improve the visual quality of sequences but also generalize well to complex natural environments.
Traditional multiframe denoising methods extend image priors to the temporal dimension,1 for example, by extending self-similarity across sequences2 for patch search and compensation, or by using image-variation priors to smooth the noise. However, any single prior applies only to particular situations and cannot cover all scenarios. With the development of deep learning, algorithms based on convolutional neural networks (CNNs) have been proposed; their powerful representation ability can exploit temporal redundancy and improve denoising performance. For example, prior features have been progressively incorporated into the convolutional kernel3 to reduce local redundancy; strategies such as optical flow4,5 and deformable convolution6 are used to achieve frame alignment; and sequence modeling is implemented using U-shaped structures7 and recurrent neural network (RNN) architectures1,8,9 to propagate sequence features. However, CNN models essentially learn within a fixed receptive field, which limits their ability to capture long-range spatiotemporal dependencies and nonlocal self-similarity.
Recently, vision transformer (ViT) approaches have bridged this gap. The transformer captures correlations between pixels through a global attention mechanism, akin to the nonlocal self-similarity property of images,1 and enables the modeling of long-range spatial dependencies. However, existing methods still face some problems. First, processing multiple input sequences with a transformer incurs substantial computational overhead. Although global sliding windows,10 recurrent frame-by-frame parallel processing with wavelet transforms,11,12 and other approaches reduce computational redundancy, they are still deficient in boundary handling and are difficult to train. Second, to fully utilize temporal redundancy, optical flow13 or implicit alignment strategies are introduced to enhance sequence consistency; however, they also increase computation time, and the literature14 has confirmed that existing alignment methods can degrade the performance of the original ViT.
Most methods use synthetic noise sequences for training and validation. Owing to the mismatch between the probability model of synthetic noise and the degradation mechanism of real image sequences, overfitting easily occurs, and the denoising effect in real scenes is poor. Some researchers have therefore turned to semisupervised or unsupervised approaches. MF2F15 uses fine-tuning to minimize input/output differences, and UDVD16 extends image blind-spot techniques to video sequences; these are typical unsupervised algorithms. However, they tend to lose details and generalize insufficiently, so their denoising results are not ideal. In addition, noise-reduction methods exist for RAW sequences.17,18 Although their denoising performance is good, they have yet to be widely used in practical applications. Therefore, it is necessary to consider how to make full use of the limited number of real noise sequences to alleviate the domain-shift problem.
To solve the above problems, this paper proposes a flow-guided double-branch transformer (FGDFormer) video denoising method. This method exploits the long-range temporal modeling ability of the ViT to build an attention block guided by an offset optical flow: guided by the flow, it finds matching key features in adjacent frames and performs self-attention calculations with query elements in the reference frame. In addition, FGDFormer comprises two training branches. The supervised branch is trained on synthetic data and introduces a contrastive regularization constraint to improve the visual quality of denoising, whereas the unsupervised branch is trained on natural noise sequences, using image prior features and a double memory loss as corrective constraints. In summary, the main contributions of the proposed method are as follows.
(1) The use of an offset optical flow to guide the transformer's self-attention calculation is proposed. First, flow guidance reduces redundant self-attention computation and provides image-prior features. Second, the offset optical flow compensates for the inaccuracy of the original optical flow. (2) A contrastive regularization term is designed to constrain supervised training, pulling the denoised sequence closer to the clean sequence in the feature space and retaining sequence details. To the best of our knowledge, this is the first study to explore contrastive learning in the field of video denoising. (3) In the unsupervised branch, an image prior is introduced to preserve the sequence structure and details, and a dual-branch memory loss is proposed to reduce the difficulty of semisupervised learning and the difference in denoising effect between the two branches.

Architecture
In this study, a semisupervised approach was used to train the FGDFormer to mitigate the differences between natural and synthesized data; the overall architecture is shown in Fig. 1. Specifically, the synthesized video dataset {(I_i, L_i)}_{i=1}^{N_s} and the real dataset {I_i}_{i=1}^{N_u} were used for learning, where N_s and N_u denote the total numbers of synthesized and natural sequences, respectively. A network ℏ(·) was trained to learn a clean sequence Y from an input noisy sequence X. Therefore, the overall learning strategy can be formulated as

Y = ℏ(X),  (1)

where ℏ(·) consists of two parts, the supervised branch ℏ_s and the unsupervised branch ℏ_u, which share the same weights. During training, the supervised branch constrains the model using reconstruction loss and contrastive loss, introducing contrastive learning to improve the model's learning ability. The unsupervised branch improves the network's fitting ability based on a priori physical features and helps the model learn the distributional properties of natural noise. The method adopts a typical encoder-decoder architecture with skip connections. Global features in the spatiotemporal dimension are extracted using a series of offset flow-guided attention blocks in the encoding stage. Finally, features at different levels are aggregated in the decoding stage to generate more realistic and natural video denoising results.

Offset Flow-Guided Attention Block
As previously analyzed, the computational complexity and redundant learning of the ViT are challenging issues, particularly when handling high-resolution videos. Based on the existing literature,19 mainstream methods currently employ multiframe fusion recovery. However, these approaches do not address the impact of redundant features in nearby frames, which can generate pseudonoise and reduce generalization performance. Hence, we propose the offset flow-guided attention block (OFGAB), as shown in Fig. 2(a), in which a novel offset flow-guided multihead self-attention (OFG-MSA) is designed. In addition, unlike the traditional ViT, a dual-gate control network (DGCN) is used in place of the feed-forward network.
The DGCN can be formulated as

DGCN(X) = ϕ(C_d^1(LN(X))) ⊙ C_d^2(LN(X)),  (2)

where ϕ(·) denotes the GELU nonlinear activation, C_d^i denotes different 3 × 3 depthwise convolutions, LN denotes the normalization layer, and ⊙ denotes the elementwise dot product.
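As a concrete illustration, the dual-gate idea (two depthwise 3 × 3 branches over the normalized input, with a GELU-activated branch gating the other elementwise) can be sketched in NumPy. The exact layout, including the residual connection, is our assumption rather than the paper's implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the channel axis of a (C, H, W) feature map.
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def depthwise_conv3x3(x, kernels):
    # x: (C, H, W); kernels: (C, 3, 3). Zero padding, stride 1, per-channel.
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for c in range(C):
        for i in range(3):
            for j in range(3):
                out[c] += kernels[c, i, j] * xp[c, i:i + H, j:j + W]
    return out

def dgcn(x, k1, k2):
    # Dual-gate block: one GELU-activated depthwise branch gates the other
    # elementwise; the residual connection is our assumption.
    xn = layer_norm(x)
    gate = gelu(depthwise_conv3x3(xn, k1))
    value = depthwise_conv3x3(xn, k2)
    return x + gate * value
```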

OFG-MSA
Building on the excellent performance of optical flow-guided transformers in low-level vision tasks, this study drew inspiration from a concept put forth in the literature.20 Our proposed multihead attention is executed with optical flow support. Because optical flow estimation is susceptible to noise, the derived motion characteristics may not be accurate and, in certain cases, may even impede the performance of the ViT, as demonstrated in the literature. In this study, we utilized an offset optical flow to define the search range for each query. This assists in identifying key elements that exhibit high similarity to the query elements, reduces key-element extraction driven by erroneous optical flow, and in turn enhances the self-alignment ability of the transformer. Specifically, for the input noise sequence F_t ∈ R^{3×H×W}, the neighboring and reference frames are prealigned according to the computed optical flow. The prealigned and original features are passed through convolution layers to learn residual offsets from the deviation between the two. The residuals are then fused with the original optical flow to obtain the actual offset optical flow off_i, which serves as the motion guide. The formation of off_i can be formulated as

off_i = flow_t + Conv([W(F_{t+i}, flow_t), F_t]),  (3)

where flow_t denotes the optical flow between the reference frame and the neighboring frames, W denotes the warp operation, and Conv denotes the offset-prediction convolution. For the query elements, to fully utilize temporal redundancy, the features are divided into nonoverlapping windows of size P × P. Query and key-value elements are extracted within each window, and the set of query elements is formulated as

query = {q^t_{m,n} | ‖(m, n) − (i, j)‖ < P},  (4)

where q^t_{m,n} denotes the element at (m, n) in a window centered at (i, j), with a distance from the center of less than P.
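The pre-align-then-refine construction of the offset optical flow can be sketched as follows; the warp uses nearest-neighbor sampling for brevity, and `predict_residual` stands in for the learned convolution layers (both simplifications are ours):

```python
import numpy as np

def warp(img, flow):
    # Backward-warp a (H, W) frame by a per-pixel flow (2, H, W): each output
    # pixel samples the source at (x + flow_x, y + flow_y), nearest neighbor.
    H, W = img.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_y = np.clip(np.rint(ys + flow[1]).astype(int), 0, H - 1)
    src_x = np.clip(np.rint(xs + flow[0]).astype(int), 0, W - 1)
    return img[src_y, src_x]

def offset_flow(ref, sup, flow, predict_residual):
    # Pre-align the support frame with the initial flow, then let a small
    # learned predictor (stubbed here) estimate a residual correction from
    # the (pre-aligned, reference) pair; off = flow + residual.
    aligned = warp(sup, flow)
    residual = predict_residual(aligned, ref)  # stands in for the conv layers
    return flow + residual
```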
Guided by the offset optical flow off_i, each query element is matched with highly similar key-value elements in the neighboring frames. The set of key-value elements is formulated as

key, value = Ω^t_{i,j} = {F_sup^f(m + Δx_f, n + Δy_f) | f ∈ [t − r, t + r], f ≠ t},  (5)

with (Δx_f, Δy_f) = [off_i], where Ω^t_{i,j} denotes the set of all key-value elements obtained under the guidance of off_i using q^t_{m,n} as the query feature, t denotes the reference-frame index, f denotes the neighboring-frame index, r denotes the number of neighboring frames, (Δx_f, Δy_f) denotes the displacement that locates the window at position (m + Δx_f, n + Δy_f) according to the motion information from the offset optical flow, F_ref and F_sup denote the reference and support frames, respectively, and [·] denotes the operation of extracting the offset optical flow information. The OFG-MSA is represented by

OFG-MSA(query, key, value) = Concat(head_1, …, head_N) W^O,  head_h = softmax(Q_h K_h^T / √d) V_h,  (6)

where N denotes the number of attention heads and d denotes the per-head feature dimension.
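The core saving of OFG-MSA, reading the key-value window at a flow-shifted location instead of attending over the whole frame, can be illustrated with a single-head NumPy sketch (the shapes and clipping behavior are our assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def flow_guided_attention(q_win, sup_feat, center, offset, P):
    # q_win:    (P*P, d) query tokens from a reference-frame window.
    # sup_feat: (H, W, d) support-frame features.
    # The key/value window is read at the flow-shifted center (m+dy, n+dx)
    # instead of scanning the whole frame; this is the saving of OFG-MSA.
    H, W, d = sup_feat.shape
    m, n = center
    dy, dx = offset
    top = int(np.clip(m + dy - P // 2, 0, H - P))
    left = int(np.clip(n + dx - P // 2, 0, W - P))
    kv = sup_feat[top:top + P, left:left + P].reshape(P * P, d)
    attn = softmax(q_win @ kv.T / np.sqrt(d))  # (P*P, P*P) attention weights
    return attn @ kv, attn
```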

Slip Compensation Strategy
The transformer has the advantage of global spatial modeling. However, considering the computational cost, the radius of the key-value lookup in this paper is limited to r, which restricts long-range modeling to some extent. Meanwhile, if aligned features are transmitted only between modules, subsequent key-value matching will be continuously affected whenever the optical flow is inaccurate. Therefore, a slip compensation strategy (SCS) is proposed to further enhance long-range temporal modeling. Specifically, the output features of the preceding OFGAB are concatenated with the features of the reference frame, and the original reference-frame features are fused with the intermediate outputs by a fusion-extraction operation, so that when the next block searches for key-value elements in the support frame, highly similar key-value regions can still be found through the original optical flow. As shown in Fig. 3, f denotes the features of each input frame, the superscript t denotes the t'th block in the sequence, and the subscript denotes the index of the sequence frame. The slip compensation strategy can be expressed as

f̂_i^t = Conv([f_i^t, f_i^0]),  (7)

where f_i^0 denotes the original features of frame i. The SCS propagates the sequence features of the video without interruption, and the denoised output of the last block is preserved. In addition, fusing the output features with the original features facilitates accurate guidance of subsequent motion information and helps preserve more texture structure.
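A toy version of the block-to-block propagation with slip compensation might look as follows, with a simple averaging stand-in for the learned fusion convolution:

```python
import numpy as np

def propagate_with_scs(blocks, frames):
    # frames: list of (C, H, W) original per-frame features.
    # blocks: list of OFGAB-like callables applied in sequence.
    # After each block, the output is re-fused with the untouched original
    # features so later blocks can still match key-value regions through
    # the original optical flow. The 0.5/0.5 average stands in for a
    # learned fusion convolution.
    feats = list(frames)
    for block in blocks:
        out = [block(f) for f in feats]
        feats = [0.5 * (o + f0) for o, f0 in zip(out, frames)]
    return feats
```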

Design of Loss Function
The supervised branch learns the mapping between synthetic noisy and clean sequences, whereas the unsupervised branch mainly learns the probability distribution of the natural noise. Therefore, different loss functions were designed for the two branches, with the total loss given by

L = L_sup + L_unsup.  (8)

Supervised branch loss functions
The supervised branch is trained using a synthetic dataset, and the supervised branch loss L_sup is defined as

L_sup = L_re + β L_cr,  (9)

where β is used to balance the reconstruction and contrastive losses. Reconstruction loss L_re: L1 and L2 losses are two standard choices. However, the L2 loss penalizes small errors only weakly and thus tends to ignore fine image detail; it has also been confirmed in the literature21 that better practical results are obtained with L1. Therefore, this study used the L1 loss as the reconstruction loss. Contrastive regularization loss L_cr: relying solely on the reconstruction loss increases the likelihood of fitting inaccurate labels. Current approaches refine the output using various regularization terms, including perceptual and prior-based losses. Therefore, this method incorporates the contrastive loss L_cr as a constraint, which pushes the anchor samples toward the positive samples and away from the negative samples in the learned representation. Compared with the perceptual loss, the contrastive loss not only considers the difference between the real and output sequences but also constrains the solution space by treating negative samples as a negative feature space.
22 In this study, the denoised sequence output was used as the anchor, and the positive and negative samples consisted of clean and noisy sequences, respectively. To strengthen the fitting ability of the model, the negative samples contained noise types distinct from the input noise. To extract the latent feature space, a pretrained VGG-1923 was utilized as a fixed feature extractor (FE). L_cr can be formulated as

L_cr = (1/T) Σ_{i=1}^{T} Σ_{j=1}^{K} w_j [d(φ_j(Y_i), φ_j(L_i)) / Σ_m d(φ_j(Y_i), φ_j(ϕ_m(X_i)))],  (10)

where X_i and Y_i denote the input and output frames of the i'th frame, respectively, L_i denotes the clean frame used as the positive sample, ϕ_m(·) denotes adding noise of type m to form a negative sample, Y_i = ℏ_s(ϕ_t(X_i)) denotes the output denoised frame used as the anchor, φ_j(·) denotes the j'th layer of the FE, w_j denotes the weight coefficient of each layer, K denotes the number of layers in the latent feature space, and T denotes the number of input frames. In this study, we used the L1 loss to measure the feature-space distance d(·,·) between the anchor and the positive and negative samples. Therefore, Eq. (10) can be rewritten as

L_cr = (1/T) Σ_{i=1}^{T} Σ_{j=1}^{K} w_j [‖φ_j(Y_i) − φ_j(L_i)‖_1 / Σ_m ‖φ_j(Y_i) − φ_j(ϕ_m(X_i))‖_1].  (11)
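A runnable sketch of this contrastive regularization is shown below; `fake_features` is a stand-in for the frozen VGG-19 extractor, and the layer weights w_j are illustrative:

```python
import numpy as np

def fake_features(x, K=3):
    # Stand-in for the frozen VGG-19 feature extractor: K levels of 2x
    # average pooling over a (H, W) frame (purely illustrative).
    feats = []
    for _ in range(K):
        x = 0.25 * (x[::2, ::2] + x[1::2, ::2] + x[::2, 1::2] + x[1::2, 1::2])
        feats.append(x)
    return feats

def contrastive_loss(anchor, positive, negatives, w=(1.0, 0.5, 0.25), eps=1e-8):
    # Pull the denoised frame (anchor) toward the clean frame (positive) and
    # push it away from noisy frames (negatives), layer by layer, with an L1
    # feature-space distance in a numerator/denominator ratio.
    fa, fp = fake_features(anchor), fake_features(positive)
    fns = [fake_features(n) for n in negatives]
    loss = 0.0
    for j, wj in enumerate(w):
        num = np.abs(fa[j] - fp[j]).mean()
        den = sum(np.abs(fa[j] - fn[j]).mean() for fn in fns) + eps
        loss += wj * num / den
    return loss
```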

Unsupervised branch loss functions
The unsupervised branch utilizes a real dataset for training, and the branch loss L_unsup is defined as

L_unsup = λ_1 L_tv + λ_2 L_cp + λ_3 L_dm.  (12)

Total variation loss L_tv: owing to the lack of clean sequences as labels, image-based prior features are required as supervisory constraints. Total variation (TV), as an image prior, models the distribution of image gradients.24 Therefore, this study introduced the TV loss as an unsupervised branch constraint:

L_tv = (1/T) Σ_{i=1}^{T} (‖∇_h Y_i‖_1 + ‖∇_v Y_i‖_1),  (13)

where T denotes the number of frames of the natural noise sequence, and ∇_h and ∇_v are the gradient operators in the horizontal and vertical directions, respectively. The variational loss preserves the edge features of a sequence; however, it supervises via gradient-error backpropagation, which can make training unstable.
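The TV loss of Eq. (13) is straightforward to implement; a NumPy version over a list of frames:

```python
import numpy as np

def tv_loss(frames):
    # Anisotropic total variation averaged over a sequence of (H, W) frames:
    # mean absolute horizontal plus vertical gradient per frame.
    total = 0.0
    for y in frames:
        dh = np.abs(np.diff(y, axis=1)).mean()  # horizontal gradients
        dv = np.abs(np.diff(y, axis=0)).mean()  # vertical gradients
        total += dh + dv
    return total / len(frames)
```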
Content preservation loss L_cp: to improve the robustness of the network, the input sequence is used as supervision, and the content preservation loss minimizes the difference between input and output as an L1 regular term, which helps the model generate denoising results that remain similar to the original inputs in overall structure and color while alleviating training difficulty:

L_cp = (1/T) Σ_{i=1}^{T} ‖Y_i − X_i‖_1.  (14)

Double memory loss L_dm: semisupervision can enhance the generalization of the method; however, during training, the two conflicting learning modes tend to lose vital features. Thus, the unsupervised branch introduces a double-branch memory loss, which helps retain the knowledge acquired in the supervised branch. Specifically, a copy of the model trained in the supervised branch is first kept as ℏ̄_s. When a real sequence is used for training, the input noise sequence yields the result Y_i^u from the unsupervised branch and is simultaneously processed by the copy to obtain the result Y_i^s. The error between the two is then minimized to ease semisupervised training. Because the commonly used L1 and L2 losses are pixel-level comparisons, and to fully utilize the self-similarity of sequence images, this study adopted the structural similarity index measure (SSIM) as the loss function, and L_dm can be formulated as

L_dm = (1/T) Σ_{i=1}^{T} [1 − SSIM(Y_i^u, Y_i^s)].  (15)
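Both unsupervised terms can be sketched compactly; note that `global_ssim` below uses global image statistics rather than the windowed SSIM used in practice:

```python
import numpy as np

def content_preservation_loss(denoised, noisy_input):
    # L1 distance between the unsupervised output and its own noisy input,
    # averaged over the sequence (Eq. 14).
    return np.mean([np.abs(d - x).mean() for d, x in zip(denoised, noisy_input)])

def global_ssim(x, y, c1=1e-4, c2=9e-4):
    # Single-window (global-statistics) SSIM; a simplification of the
    # windowed SSIM index used in practice.
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx**2 + my**2 + c1) * (vx + vy + c2))

def double_memory_loss(unsup_out, sup_copy_out):
    # Penalize structural disagreement between the unsupervised branch and
    # the frozen supervised-branch copy: mean of 1 - SSIM per frame (Eq. 15).
    return np.mean([1.0 - global_ssim(u, s)
                    for u, s in zip(unsup_out, sup_copy_out)])
```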

Noise Degradation
To better adapt to noise reduction in real environments, the synthetic noise process must be carefully designed. Inspired by the literature,25 and because the real noise distribution is unknown and complex rather than the result of a fixed degradation process, we adjusted the noise synthesis strategy. Specifically, for a given clean sequence, the following probabilistic noise models were randomly added: Gaussian noise, Poisson noise, sensor noise, and JPEG compression noise. Gaussian noise was applied with probability 1, and each other type was applied with a random probability of no more than 0.5. If camera sensor noise was selected, it was applied first; if JPEG compression noise was selected, it was applied last. In addition, blurring and resizing were introduced in the late stage of noise addition; these two strategies better simulate real noise sequences (Algorithm 1).
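The randomized recipe above (Gaussian always; the other degradations each with probability at most 0.5; sensor noise first, JPEG-like degradation last) can be sketched as follows, with 8 × 8 block averaging as a crude stand-in for JPEG compression and illustrative noise magnitudes:

```python
import numpy as np

def synthesize_noise(clean, rng):
    # Randomized degradation sketch for a (H, W) frame in [0, 1].
    x = clean.astype(np.float64)
    if rng.random() < 0.5:  # camera sensor noise applied first
        x += rng.standard_normal(x.shape) * 0.01 * x  # signal-dependent term
    x += rng.standard_normal(x.shape) * rng.uniform(0.01, 0.1)  # Gaussian, p=1
    if rng.random() < 0.5:  # Poisson (shot) noise
        peak = 255.0
        x = rng.poisson(np.clip(x, 0, 1) * peak) / peak
    if rng.random() < 0.5:  # JPEG stand-in, applied last
        H, W = x.shape
        H8, W8 = H - H % 8, W - W % 8
        b = x[:H8, :W8].reshape(H8 // 8, 8, W8 // 8, 8)
        m = b.mean(axis=(1, 3), keepdims=True)
        x[:H8, :W8] = np.broadcast_to(m, b.shape).reshape(H8, W8)
    return np.clip(x, 0.0, 1.0)
```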

Datasets
In this study, DAVIS26 and Set817 were selected as the synthetic datasets, and CRVD17 and IOCV27 were selected as the real noise datasets. DAVIS contains 90 sequences at 480p resolution and 30 test sequences at 854 × 480; Set8 is generally used for testing and contains 8 video sequences at 960 × 540 resolution. CRVD is in RAW format and contains five ISO levels for 16 indoor and outdoor scenes, with real static sequences as references for the indoor scenes and no clean sequences for the outdoor scenes; for a fair comparison, we converted it to sRGB using a pretrained image signal processing pipeline. IOCV consists of dynamic videos captured by handheld devices in various states, with the corresponding "clean" sequences generated by multiframe averaging as label data.
In this study, DAVIS was split 9:1 into training and validation sets, and different types of noise were added during training. Six CRVD indoor scenes were selected as the training set, and the remaining scene videos were used as the test set.

Training and Evaluation Setup
The proposed framework is a U-shaped structure with three feature scales; the intermediate feature dimension is set to 64, and skip connections act as residuals for information complementation. The feature extraction and aggregation phases consist of 3 and 6 residual blocks, respectively, which fuse different levels of spatiotemporal information while maintaining speed. The optical flow is estimated using a pretrained SpyNet.28 After extensive experimentation, the weights of the joint loss function β, λ_1, λ_2, and λ_3 were set to 0.4, 1, 0.3, and 0.5, respectively, at which the model performs best.
The experiments were implemented in PyTorch 1.8 on an NVIDIA GeForce RTX 2080 SUPER GPU. The network parameters were trained with the Adam29 optimizer, with β_1 and β_2 set to 0.9 and 0.99, respectively. The batch size was 4, and the initial learning rate was 1 × 10⁻⁴, controlled by a cosine annealing schedule that decayed it to 1 × 10⁻⁷, which proved effective for stabilizing training in various experiments. The random crop size during training was 128 × 128, and the total number of training iterations was 200k.
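The optimizer and schedule described above map directly onto standard PyTorch components; the single convolution below is a placeholder model, not FGDFormer:

```python
import torch

# Placeholder module standing in for FGDFormer; only the optimizer and
# schedule settings below mirror those reported in the text.
model = torch.nn.Conv2d(3, 3, 3, padding=1)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.99))
# Cosine annealing from 1e-4 down to 1e-7 over 200k iterations.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=200_000, eta_min=1e-7)
```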

Synthetic Denoising
Comparative experiments were conducted using various types of noise. To assess image quality, we utilized metrics commonly employed in video restoration, specifically the peak signal-to-noise ratio (PSNR)30 and SSIM;31 larger values indicate better image quality. We compared our method with state-of-the-art (SOTA) denoising methods, including DVDNet,4 FastDVDnet,7 PaCNet,5 VRT,13 RVRT,32 TempFormer,11 and ASWin.12 Because the type and amount of training data are inconsistent among the compared algorithms, we retrained them according to the aforementioned settings.
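PSNR is computed directly from the mean squared error against the reference frame; a minimal implementation:

```python
import numpy as np

def psnr(ref, test, data_range=1.0):
    # Peak signal-to-noise ratio in dB; higher means closer to the reference.
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)
```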

Quantitative comparison
Table 1 lists the objective metrics and running times under synthetic white Gaussian noise at different noise levels, with the resolution of the test sequences uniformly set at 480p. As shown in the table, the proposed algorithm achieves strong objective metrics compared with the other algorithms, with a noticeable improvement over the CNN-based methods at all noise levels and an average PSNR gain of 2.4 dB; it is close to the current SOTA VRT and RVRT algorithms while having the fastest processing speed. This is attributed to the performance burden of VRT's complex network architecture and RVRT's parallel strategy, whereas our method fully utilizes the advantages of the transformer and achieves a trade-off between denoising performance and speed with the help of offset optical flow guidance and the SCS. In addition, according to the Set8 results in Table 1, the generalization of the proposed method is best at all noise levels, indicating that it is well suited to video denoising in natural scenes. Table 2 shows the PSNR and SSIM of the different methods on the DAVIS test set under different noise types. Poisson noise was generated using the pixel distribution range of the image, the JPEG compression ratio was set to 50, and the mixed noise strategy described above was utilized. Our method exhibits better performance than the others; specifically, it outperforms the previous SOTA RVRT by an average PSNR of 0.45 dB. The other methods are affected by the noise because they neglect generalization. These results demonstrate the superiority of the proposed architecture.

Qualitative comparison
Figure 4 shows the denoising results when the Gaussian noise intensity is 20. As the figure shows, some noise remains in DVDNet, and FastDVDNet lacks the ability to process detailed information, such as the character's facial features; these methods are limited by implicit alignment and cannot capture the fast movement of the tennis racket. PaCNet scores highly because it searches for self-similar patches; however, it ignores edge structure, so detailed information is not recovered. The VRT result is overly smooth, and the character's skin color is influenced by the bright background. Compared with RVRT, the present method recovers the character's facial details and the yellow-line texture of the background more clearly. Figure 5 shows the denoising results when the synthetic noise intensity is 40. Under high-intensity noise, the proposed method recovers the details of the trees on the mountaintop in the background, and the overall color of the characters is closer to that of the original image, whereas other ViT-based methods distort the characters' appearance to some extent, and some results are oversmoothed.

Real-World Denoising
The video denoising performance was verified in real environments using the remaining CRVD indoor and outdoor scenes as tests. In addition to the PSNR metric above, a no-reference image quality assessment was introduced to assess the denoising effect accurately. In this study, we used the natural image quality evaluator (NIQE)33 metric, which requires no reference image but fits a multivariate Gaussian model based on a series of priors to measure differences in the multivariate distribution of a single test image; smaller NIQE values indicate higher image quality. For a fair comparison, we compared our method with SOTA methods supporting blind video denoising, including ViDeNN,34 EDVR,6 RViDeNet,17 UDVD,16 and FloRNN.1

Quantitative comparison
Tables 3 and 4 present quantitative comparisons of the different methods on the CRVD and IOCV datasets using the PSNR and NIQE metrics, respectively. As shown in Table 3, the proposed algorithm obtains the best metrics for the different exposure scenarios, with an average PSNR gain of ∼1 dB over RViDeNet, which provides the original data, and an average gain of 0.3 dB over the current SOTA FloRNN. The proposed method also provides better NIQE metrics under the different exposure scenarios in the no-reference quality assessment. A comparison of denoising performance on the untrained IOCV dataset is shown in Table 4, where the proposed method mostly achieves the best quantitative results and improves NIQE by ∼0.2 over FloRNN, indicating that it generates high-quality denoising results that better match the visual habits of the human eye.

Qualitative comparison
Figure 6 shows a scene with ISO 25600 in CRVD, from which it can be seen that ViDeNN retains some noise because its architecture's modeling ability does not suit complex backgrounds. In contrast, both EDVR and UDVD achieve some denoising but lack the processing of complex edges, which blurs the boundary between the ball and the background.
RViDeNet and FloRNN are close to the original image; however, some blurring remains in the background region in the lower-left corner. The proposed method utilizes the guidance of the offset optical flow, which captures highly similar clean regions as a complement during processing; hence, the overall denoising effect is more consistent with human visual perception.
Figure 7 shows the comparative denoising results on a video sequence from the IOCV data, which better tests real generalization because IOCV is not involved in training. The camera in this sequence moves quickly, producing objects with some blurring and artifacts. Nevertheless, the proposed method still achieves good visual results; despite the complexity of the font details, and as supported by the data in Fig. 7 and Table 4, the method recovers more texture detail and can handle more complex and unknown scenes.

Ablation Study
The following ablation experiments were performed to verify the contribution of each module and loss function to the denoising and generalization performance of the method. The training settings were the same as above, and the test data consisted of 10 randomly selected sequences from the DAVIS test set and CRVD outdoor scenes with an ISO level of 6400. White Gaussian noise of level 30 was used as the synthetic noise. PSNR was used as the evaluation metric for the ablation experiments.

Effectiveness of each module
The effectiveness of each module was investigated (Table 5). Specifically, we took the initial model with plain attention blocks, referred to as "base," and conducted experiments by removing the proposed modules in turn. The model without these modules exhibits decreased performance, demonstrating the importance of the proposed attention block, the DGCN, and the SCS. Regarding the loss functions, comparing perceptual loss with contrastive learning in the supervised branch (S2 and S3 in Table 6) shows that contrastive learning still benefits the overall denoising performance by pushing away negative samples. The introduction of semisupervised learning helps the network produce higher-quality denoised sequences than supervised learning alone. In the semisupervised setting, the double memory loss effectively mitigates the gap between the training of the two branches and improves the overall generalization of the framework.

Comparison of sample sizes for contrastive learning
The numbers of positive and negative samples in contrastive learning also affect the overall performance of the algorithm.22 We therefore explored different sample sizes. Let the number of samples be s, where the positive samples contain the clean sequence by default and the negative samples contain the input sequence by default; the remaining s − 1 positive samples come from other frames of the same video, and the remaining s − 1 negative samples are obtained by adding different types of noise to those frames. Given the limitations of the deployment environment, the maximum value of s was set to 3.
As shown in Table 7, adding negative samples improves the overall performance of the model, because negative samples push the model results "away" from noise features; the setting with three negative samples performs best. Although increasing the number of samples adds some computation, the contrastive loss is not applied at inference time, so only the training time increases partially.

Conclusion
This paper proposed a semisupervised denoising method based on an offset optical flow-guided transformer (FGDFormer) for video noise reduction in natural scenes. The core idea is to use the motion characteristics of the offset optical flow to sparsify the transformer's modeling, achieving fast acquisition of highly similar regions for spatiotemporal learning, and to use an SCS for long-range modeling. The offset optical flow reduces the overall computation, and the slip compensation strategy promotes the temporal consistency of the denoised sequences. Furthermore, this study adopted a semisupervised learning approach with different losses for supervision: contrastive learning was introduced in the supervised branch to comprehensively improve the denoising performance of the model, image priors were combined to preserve the detailed features of the original sequences, and the difficulty of two-branch training was alleviated by the two-branch memory loss. Comprehensive experiments and qualitative comparisons demonstrated that the proposed method achieves the best video denoising effect at low cost. In future work, more effective and lightweight video denoising algorithms will be explored for real-world applications.

Disclosures
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Table 1
Quantitative (PSNR/SSIM) comparison on DAVIS test and Set8 dataset for synthetic Gaussian noise.

Table 6
compares the objective evaluation of the denoising and generalization abilities of different loss functions. It can be observed from this table that the joint use of multiple loss functions leads to better denoising performance while improving generalization.

Table 5
Ablation validation of the effectiveness of each component.

Table 6
Ablation validation with different loss functions.
Note: bold values indicate the best scores.

Table 7
Ablation validation for positive and negative sample sizes.