Global motion compensated visual attention-based video watermarking

Abstract. Imperceptibility and robustness are two key but conflicting requirements of any watermarking algorithm. Low-strength watermarking yields high imperceptibility but exhibits poor robustness. High-strength watermarking schemes achieve good robustness but often suffer from embedding distortions, resulting in poor visual quality of the host media. This paper proposes a video watermarking algorithm that offers a fine balance between imperceptibility and robustness using a motion compensated wavelet-based visual attention model (VAM). The proposed VAM includes both spatial and temporal cues for visual saliency. The spatial modeling uses the spatial wavelet coefficients, while the temporal modeling accounts for both local and global motion to arrive at a spatiotemporal VAM for video. The model is then used to develop a video watermarking algorithm, where a two-level watermarking weighting parameter map is generated from the VAM saliency maps and data are embedded into the host frames according to the visual attentiveness of each region. By avoiding higher strength watermarking in the visually attentive regions, the resulting watermarked video achieves high perceived visual quality while preserving high robustness. The proposed VAM outperforms state-of-the-art video visual attention methods in joint saliency detection and computational complexity performance. For the same embedding distortion, the proposed visual attention-based watermarking achieves up to 39% (nonblind) and 22% (blind) improvement in robustness against H.264/AVC compression, compared to existing watermarking methodology that does not use the VAM. The proposed visual attention-based video watermarking results in visual quality similar to that of low-strength watermarking and robustness similar to that of high-strength watermarking.


Introduction
With the recent rapid growth of digital technologies, content protection now plays a major role within content management systems. Of the current systems, digital watermarking provides a robust and maintainable solution to enhance media security. The visual quality of the host media (often known as imperceptibility) and robustness are widely considered the two main properties vital for a good digital watermarking system. They conflict with each other, hence it is challenging to attain the right balance between them. This paper proposes a new approach to achieve high robustness in watermarking without affecting the perceived visual quality of the host media, by exploiting the concepts of visual attention (VA).
The human visual system (HVS) is sensitive to many features that draw attention toward specific regions in a scene, a well-studied topic in psychology and biology.1,2 VA is an important and complex biological process that helps to identify potential danger, e.g., prey or predators, quickly in a cluttered visual world,3 as attention to one target leaves other targets less available.4 Recently, considerable effort has been directed toward modeling VA,3 which has applications in many related domains, including media quality evaluation. Areas of visual interest stimulate neural nerve cells, causing the human gaze to fixate on a particular scene area. The visual attention model (VAM) highlights these visually sensitive regions, which stimulate a neural response within the primary visual cortex.5 Whether that neural vitalization comes from contrast in intensity, a distinctive face, unorthodox motion, or a dominant color, these stimulative regions divert human attention, providing highly useful saliency maps within the media processing domain.
Human vision behavioral studies6 and feature integration theory1 have prioritized the combination of three visually stimulating low-level features: intensity, color, and orientation, which form the concrete foundations for numerous image domain saliency models.3,7,8,10,11 Temporal features must also be considered, as moving objects are more eye-catching than most static locations.12 Work has seldom been directed toward video saliency estimation, in comparison to its image domain counterpart, as temporal feature consideration dramatically increases the overall VA framework complexity. Most typical video saliency estimation methodologies3,13-18 exist as supplementary extensions of their image domain algorithms. Research estimating VA within video has also been derived from exploiting spatiotemporal cues,19,20 structural tensors,21 and optical flow.22 However, none of these algorithms explicitly captures the spatiotemporal cues that consider object motion between frames as well as the motion caused by camera movements. Motion within a video sequence falls into two categories: local motion and global motion. Local motion is the result of object movement within frames, which comprises all salient temporal data. One major feature associated with local motion is independence, so no single transformation can capture all local movement for the entire frame.
Local motion can only be captured from successive frame differences if the camera remains motionless. On the contrary, global motion describes all motion in a scene based on a single affine transform from the previous frame and usually results from camera movement during a scene. The transform consists of three components, i.e., camera panning, tilting, and zooming, or in image processing terms, translation, rotation, and scaling. Figure 1 shows the three causes of global motion. This paper proposes a new video VAM that accounts for local and global motion using a wavelet-based motion compensated temporal filtering framework. Compensating for any perceived camera movement reduces the overall effect of global movement, so salient local object motion can be captured during scenes involving dynamic camera action.
A region of interest (ROI) dictates the most important visible aspects within media, so distortion within these areas will be highly noticeable to any viewer. The VAM computes such regions. This paper proposes a video watermarking algorithm exploiting the new video VAM. In frequency domain watermarking, robustness is usually achieved by increasing the embedding strength. However, this results in visual distortions in the host media and thus low imperceptibility of embedding. In the proposed method, high watermark robustness without compromising the visual quality of the host media is achieved by embedding with greater watermark strength within the less visually attentive regions of the media, as identified by the video VAM (Sec. 2).
Related work includes defining an ROI23-28 and increasing the watermark strength in the ROI to address cropping attacks. However, in these works, the ROI extraction was based only on foreground-background models rather than a VAM. Such solutions have major drawbacks: (a) increasing the watermark strength within eye-catching frame regions is perceptually unpleasant, as human attention will naturally be drawn toward any additional embedding artifacts, and (b) scenes exhibiting sparse salience will potentially contain extensively fragile or no watermark data. Sur et al.29 proposed a pixel domain algorithm to improve embedding distortion using an existing visual saliency model described in Ref. 3. However, that work only reports limited observations on perceptual quality without considering robustness.
Our previous work30,31 shows the exploitation of image saliency in achieving image watermarking robustness. It is infeasible to simply extend the VA-based image domain algorithm into a frame-by-frame video watermarking scheme, as temporal factors must first be considered within the video watermarking framework. A viewer has unlimited time to absorb all information within an image, so could potentially view all conspicuous and visually uninteresting aspects of a scene. However, in a video sequence, the visual cortex has very limited processing time to analyze each individual frame. Human attention will naturally be drawn toward temporally active, visually attentive regions. Thus, the proposed motion compensated VAM is a suitable choice for VA-based video watermarking. By employing VA concepts within digital watermarking, an increased overall robustness against adversary attacks can be achieved, while subjectively limiting any visual distortions perceived by the human eye. The concept of VA-based image and video watermarking was first introduced in our early work.30,32 Recent work following this concept can be found in watermarking H.264 video33 and applications in cryptography.34 In contrast, in this paper, we propose a video watermark embedding strategy based on VA modeling that uses the same spatiotemporal decomposition used in the video watermarking scheme. In addition, the VAM compensates for global motion in order to capture local motion in the saliency model.
The performance of our saliency model and the watermarking algorithms is evaluated separately by comparison with existing schemes. Subjective tests for media quality assessment recommended by the International Telecommunication Union (ITU),35 largely missing in the watermarking literature, are also conducted to complement the objective measurements. The major contributions of this paper are:
• A new motion compensated spatiotemporal video VAM that considers object motion between frames as well as global motion due to camera movement.
• New blind and nonblind video watermarking algorithms that are highly imperceptible and robust against compression attacks.
• Subjective tests that evaluate the visual quality of the proposed watermarking algorithms.
The saliency model and the watermarking algorithms are evaluated using the existing video datasets described in Sec. 4.1. The initial concept of the motion compensated video attention model was reported earlier in the form of a conference publication,36 while this paper discusses the proposed scheme in detail with an exhaustive evaluation and proposes a case study describing a new video watermarking scheme that uses the attention model.

Motion Compensated Video Visual Attention Model
The most attentive regions within media can be captured by exploiting and imposing characteristics of the HVS. In this section, a method is proposed to detect saliency information within a video. The proposed method incorporates motion compensated spatiotemporal wavelet decomposition combined with HVS modeling to capture saliency information. A unique approach combining salient temporal, intensity, color, and orientation contrasts forms the essential video saliency methodology.
Physiological and psychophysical evidence demonstrates that visually stimulating regions occur at different scales within media37 and through object motion within the scene.12 Consequently, the models proposed in this work exploit the multiresolution property of the wavelet transform and incorporate a motion compensation algorithm. By exploiting the multiresolution spatiotemporal representation of the wavelet transform, VA is estimated directly within the wavelet domain. The video saliency model is divided into three subsections. First, Sec. 2.1 describes the global motion compensation, followed by the spatial saliency model in Sec. 2.2 and the temporal saliency feature map generation in Sec. 2.3. Finally, Sec. 2.4 combines the spatial and temporal models to estimate video visual saliency. An overall functional block diagram of the proposed model is shown in Fig. 2. For the spatial saliency model in this work, we adopted our image VAM proposed in Refs. 30 and 31.

Global Motion Compensated Frame Difference
Compensation for global motion depends upon detecting homogeneous motion vectors (MVs) that are consistent throughout the frame. Figure 3 considers the motion estimation between two consecutive frames, taken from the coastguard sequence. A fixed block size based on the frame resolution determines the number of MV blocks. The magnitude and phase of the MVs are represented by the size and direction of the arrows, respectively, whereas the absence of an arrow portrays an MV of zero. First, it is assumed that a greater percentage of pixels belongs to the background than to moving objects, so large densities of similar MVs are the result of dynamic camera action. To compensate for camera panning, the entire reference frame is spatially translated by the most frequent MV, the global camera MV, M_global. This process is applied prior to the 2-D + t wavelet decomposition to deduce the global motion compensated saliency estimation. The global motion compensation is described in Eq. (1):

M_object = M_total − M_global, (1)

where M_object is the local object MV and M_total is the complete combined MV.
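A minimal sketch of the panning compensation described above, assuming exhaustive sum-of-absolute-differences block matching; the block size, search range, function names, and the wrap-around behavior of np.roll are illustrative simplifications rather than the exact procedure used in the paper.

```python
import numpy as np
from collections import Counter

def block_motion_vectors(ref, cur, block=16, search=4):
    """Exhaustive-search block matching between two grayscale frames (SAD criterion)."""
    h, w = cur.shape
    vectors = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            target = cur[y:y + block, x:x + block].astype(np.int32)
            best_sad, best_mv = None, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if yy < 0 or xx < 0 or yy + block > h or xx + block > w:
                        continue
                    cand = ref[yy:yy + block, xx:xx + block].astype(np.int32)
                    sad = int(np.abs(target - cand).sum())
                    if best_sad is None or sad < best_sad:
                        best_sad, best_mv = sad, (dy, dx)
            vectors.append(best_mv)
    return vectors

def panning_compensated_difference(ref, cur, block=16, search=4):
    """Translate the reference frame by the most frequent MV (the global camera MV,
    M_global) so that the compensated frame difference mainly reflects local motion."""
    mvs = block_motion_vectors(ref, cur, block, search)
    (gy, gx), _count = Counter(mvs).most_common(1)[0]
    # np.roll wraps around the frame borders; acceptable for a rough global estimate.
    ref_aligned = np.roll(ref, shift=(-gy, -gx), axis=(0, 1))
    return np.abs(cur.astype(np.int32) - ref_aligned.astype(np.int32))
```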
Compensating for other camera movements can be achieved by searching for particular patterns of MVs. For example, a circular MV pattern indicates camera rotation, and MVs converging toward or diverging from a particular point indicate camera zooming. An iterative search over all possible MV patterns can cover each type of global camera action.38 Speeded-up robust features detection39 could be used to directly align key feature points between consecutive frames, but this would be computationally exhaustive. This model only requires a fast, rough global motion estimate to neglect the effect of global camera motion on the overall saliency map.

Spatial Saliency Model
As the starting point in generating the saliency map from a color image/frame, the RGB color space is converted to the YUV color space, as the latter exhibits prominent intensity variations through its luminance channel Y. First, the two-dimensional (2-D) forward discrete wavelet transform (FDWT) is applied on each Y, U, and V channel to decompose them into multiple levels. The 2-D FDWT decomposes an image in the frequency domain, expressing a coarse grain approximation of the original signal along with fine grain oriented edge information at multiple resolutions. The discrete wavelet transform (DWT) captures horizontal, vertical, and diagonal contrasts within an image, portraying prominent edges in various orientations. Due to the dyadic nature of the multiresolution wavelet transform, the image resolution decreases after each wavelet decomposition iteration. This is useful in capturing both short and long structural information at different scales and hence for saliency computation. The absolute values of the wavelet coefficients are normalized so that the overall saliency contributions come from each subband, preventing biasing toward the finer scale subbands. An average filter is also applied to remove unnecessary finer details. To provide full resolution output maps, each of the high frequency subbands is then interpolated up to full frame resolution. The interpolated subband feature maps, LH_i (horizontal), HL_i (vertical), and HH_i (diagonal), for decomposition levels i = 1, ..., L, are combined by a weighted linear summation as

LH_X = Σ_{i=1}^{L} τ_i LH_i^X, HL_X = Σ_{i=1}^{L} τ_i HL_i^X, HH_X = Σ_{i=1}^{L} τ_i HH_i^X, (2)

where τ_i is the subband weighting parameter and LH_{1...L}^X, HL_{1...L}^X, and HH_{1...L}^X are the subband feature maps for a given spectral channel X, where X ∈ {Y, U, V}.
Feature map promotion and suppression steps follow next, as shown in Eq. (3). If m is the average of the local maxima present within a feature map and M is its global maximum, the promotion and suppression normalization is achieved by

lh_X = LH_X (M − m)^2, hl_X = HL_X (M − m)^2, hh_X = HH_X (M − m)^2, (3)

where lh_X, hl_X, and hh_X are the normalized set of subband feature maps. The overall saliency map, S, is generated by

S = Σ_{X ∈ {Y,U,V}} w_X S_X, (4)

where w_X is the weight given to each spectral component and S_X is the saliency map for each spectral channel (Y, U, V), which is computed as

S_X = lh_X + hl_X + hh_X. (5)

Finally, the overall map is generated using a weighted summation of all color channels, as shown in Fig. 4.
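A condensed sketch of the spatial model for a single spectral channel, using PyWavelets and SciPy; the filter sizes, the equal subband weights τ_i, the channel weights in the comment, and the per-subband ordering of normalization and combination are illustrative assumptions rather than the paper's exact parameterization.

```python
import numpy as np
import pywt
from scipy.ndimage import uniform_filter, maximum_filter, zoom

def channel_saliency(channel, levels=3, wavelet="db4"):
    """Spatial saliency for one spectral channel (Y, U, or V) -- condensed sketch."""
    h, w = channel.shape
    coeffs = pywt.wavedec2(channel.astype(float), wavelet, level=levels)
    saliency = np.zeros((h, w))
    for lh, hl, hh in coeffs[1:]:                      # detail subbands, coarse to fine
        for band in (lh, hl, hh):
            f = np.abs(band)
            f /= f.max() + 1e-12                       # normalise to avoid scale bias
            f = uniform_filter(f, size=3)              # average filter: drop fine detail
            f = zoom(f, (h / f.shape[0], w / f.shape[1]), order=1)  # full resolution
            # Promotion/suppression: scale by (global max - mean of local maxima)^2.
            m = maximum_filter(f, size=15).mean()      # rough "average of local maxima"
            saliency += f * (f.max() - m) ** 2         # equal subband weights tau_i = 1
    return saliency / (saliency.max() + 1e-12)

# The overall map S would then be a weighted sum of channel maps, e.g.
# S = w_Y * channel_saliency(Y) + w_U * channel_saliency(U) + w_V * channel_saliency(V)
```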

Temporal Saliency Model

2-D + t wavelet domain
We extend our spatial saliency model toward video domain saliency by utilizing a three-dimensional wavelet transform. Video coding research provides evidence of the differing texture and motion characteristics that occur after wavelet decomposition in the t + 2-D domain40 and its alternative, the 2-D + t transform.41,42 The t + 2-D decomposition compacts most of the transform coefficient energy within the low frequency temporal subband and provides efficient compression within the temporal high frequency subbands. Vast quantities of the high frequency coefficients have zero, or near-zero, magnitude, which limits that transform's usefulness within this framework. Alternatively, the 2-D + t decomposition produces greater transform energy within the higher frequency components, i.e., a greater number of larger, nonzero coefficients, and reduces computational complexity to a great extent. A description of the reduced computational complexity of 2-D + t compared to t + 2-D can be found in Ref. 42. Therefore, in this work we use a 2-D + t decomposition, as shown in Fig. 5 (three levels of spatial decomposition followed by one level of temporal Haar wavelet decomposition).

Temporal saliency feature map
To acquire accurate video saliency estimation, both spatial and temporal features within the wavelet transform are considered. The wavelet-based spatial saliency model, described in Sec. 2.2, constitutes the spatial element of the video saliency model, whereas this section concentrates on establishing the temporal saliency maps, S_Temp. A methodology similar to the spatial model in Sec. 2.2 is used to expose temporal conspicuousness. First, the existence of any palpable local object motion is determined within the sequence. Figure 6 shows the histograms of two globally motion compensated frame differences. Global motion is any frame motion due to camera movement, whether that be panning, zooming, or rotation (see Sec. 2.1). The S_Temp methodology bears a distinct similarity to the spatial domain approach, as the high pass temporal subbands LHt_i, HLt_i, and HHt_i, for i levels of spatial decomposition, are combined after the full 2-D + t wavelet decomposition shown in Fig. 5. The decomposed data are fused using logic comparable to Eq. (2), as all transformed coefficients are segregated into one of three temporal subband feature maps. This process is described as

LHt = Σ_{i=1}^{L} τ_i LHt_i, HLt = Σ_{i=1}^{L} τ_i HLt_i, HHt = Σ_{i=1}^{L} τ_i HHt_i, (7)

where LHt, HLt, and HHt are the temporal LH, HL, and HH combined feature maps, respectively. The method captures any subtle conspicuous object motion in horizontal, vertical, and diagonal directions and subsequently fuses the coefficients into a meaningful visual saliency approximation by merging the data across multiple scales. S_Temp is finally generated in Eq. (8) from the combined temporal feature maps, following the promotion and suppression normalization of Eq. (3).
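A sketch of the local motion test and the temporal high-pass subbands for a pair of globally compensated frames, assuming the one-level temporal Haar filter can be taken as the scaled difference of the corresponding spatial detail subbands of the two frames; the threshold values follow the description accompanying Fig. 6 (T = D_max/10, 0.5% of pixels), but the function names are illustrative.

```python
import numpy as np
import pywt

def has_local_motion(f1, f2, ratio=0.005):
    """Frame classification in the spirit of Eq. (6): the frame pair is temporally
    active if more than 0.5% of compensated difference pixels exceed T = D_max/10."""
    d = np.abs(f1.astype(np.int32) - f2.astype(np.int32))
    t = 255 / 10.0                       # D_max = 255 for 8-bit luma frames
    return np.mean(d > t) > ratio

def temporal_feature_maps(f1, f2, levels=3, wavelet="db4"):
    """One level of temporal Haar filtering in the 2-D + t structure: the temporal
    high-pass subbands are scaled differences of the corresponding spatial detail
    subbands of two consecutive (motion compensated) frames."""
    c1 = pywt.wavedec2(f1.astype(float), wavelet, level=levels)
    c2 = pywt.wavedec2(f2.astype(float), wavelet, level=levels)
    highpass = []
    for bands1, bands2 in zip(c1[1:], c2[1:]):         # LH, HL, HH per spatial level
        highpass.append(tuple((b2 - b1) / np.sqrt(2.0)
                              for b1, b2 in zip(bands1, bands2)))
    return highpass
```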

Spatial-Temporal Saliency Map Combination
The spatial and temporal maps are combined to form an overall saliency map. The primary visual cortex is extremely sensitive to object movement, so if enough local motion is detected within a frame, the overall saliency estimation is dominated by the temporal contribution with respect to local motion, M. Hence, the temporal weighting parameter, γ, determined from the motion classification of Eq. (6), is calculated as

γ = 1 if local motion, M, is detected within the frame; γ = 0 otherwise. (9)

If significant motion is detected within a frame, the final saliency map comprises solely the temporal feature. Previous studies support this theory, providing evidence that local motion is the most dominant feature within low level VA.43 Consequently, if no local motion is detected within a frame, the spatial model contributes toward the final saliency map in its entirety; hence γ is a binary variable. The overall saliency map is formed as

S_Final = γ S_Temp + (1 − γ) S_Spat, (10)

where S_Spat, S_Temp, and S_Final are the spatial, temporal, and combined overall saliency maps, respectively. An overall diagram for the entire proposed system is shown in Fig. 2.

Wavelet-Based Watermarking
The wavelet transform is widely used in image coding44 and video coding, e.g., motion JPEG2000 and motion-compensated embedded zeroblock coding (MC-EZBC)45 schemes, leading to smooth adaptability within modern frameworks.48-55 The FDWT is applied on the host image before watermark data are embedded within the selected subband coefficients. The inverse discrete wavelet transform reconstructs the watermarked image. The extraction operation is performed after the FDWT. The extracted watermark data are compared to the original embedded data sequence before an authentication decision verifies the watermark presence. A wide variety of potential adversary attacks, including compression and filtering, can occur in an attempt to distort or remove any embedded watermark data. A detailed discussion of such watermarking schemes can be found in Ref. 56.

Nonblind watermarking
Magnitude-based multiplicative watermarking34,51,53,57-59 is a popular choice for a nonblind watermarking system due to its simplicity. Wavelet coefficients are modified based on the watermark strength parameter, α, the magnitude of the original coefficient, C(m, n), and the watermark information, W(m, n). The watermarked coefficients, C′(m, n), are obtained as

C′(m, n) = C(m, n) + α W(m, n) C(m, n). (11)

W(m, n) is derived from a pseudorandom binary sequence, b, using weighting parameters W1 and W2 (where W2 > W1), which are assigned as

W(m, n) = W2 if b(m, n) = 1, and W(m, n) = W1 if b(m, n) = 0. (12)

To obtain the extracted watermark, W′(m, n), Eq. (11) is rearranged as

W′(m, n) = (C′(m, n) − C(m, n)) / (α C(m, n)). (13)

Since the nonwatermarked coefficients, C(m, n), are needed for comparison, this results in nonblind extraction. A threshold between W1 and W2 is then used to determine the extracted binary watermark, b′: values of W′(m, n) above the threshold are decoded as 1 and the remainder as 0 [Eq. (14)].
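A sketch of Eqs. (11)-(14) applied to a single subband, assuming one watermark bit per coefficient; the values of W1 and W2, and placing the decision threshold midway between them, are illustrative assumptions.

```python
import numpy as np

def embed_nonblind(C, bits, alpha, w1=0.5, w2=1.0):
    """Multiplicative embedding, Eq. (11): C' = C + alpha * W * C, one bit per coefficient."""
    W = np.where(bits.reshape(C.shape).astype(bool), w2, w1)   # Eq. (12)
    return C + alpha * W * C

def extract_nonblind(C_marked, C_original, alpha, w1=0.5, w2=1.0):
    """Nonblind extraction, Eqs. (13)-(14): recover W' and threshold it midway
    between w1 and w2 to obtain the extracted bits b'."""
    W_est = (C_marked - C_original) / (alpha * C_original + 1e-12)   # guard tiny C
    return (W_est >= (w1 + w2) / 2.0).astype(np.uint8)
```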

Blind watermarking
Quantization-based watermarking52,54-63 is a blind scheme that relies on modifying coefficients toward a specific quantization step. As proposed in Ref. 52, the algorithm modifies the median coefficient toward the step size, δ, using a running nonoverlapping 3 × 1 window. The altered coefficient must remain the median of the three coefficients within the window after the modification. The step size δ is calculated in Eq. (15) from the watermark strength, α, and the minimum and maximum coefficients within the window, C_min and C_max. The median coefficient, C_med, is quantized toward the nearest step, depending on the binary watermark bit, b. The extracted watermark, b′, for a given window position, is obtained as

b′ = ⌊C_med / δ⌋ % 2, (16)

where % denotes the modulo operator used to detect an odd or even quantization step and C_med is the median coefficient value within the 3 × 1 window.
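A sketch of the median-quantization idea behind Eqs. (15) and (16): the median of each non-overlapping window of three coefficients is pushed to an even or odd multiple of δ according to the bit. The step size δ is passed in directly, and the ordering repair is simplified, so this is an illustrative instance of the technique rather than the exact procedure of Ref. 52.

```python
import numpy as np

def embed_blind(coeffs, bits, delta):
    """Quantise the median of each non-overlapping 3-coefficient window to an even
    (bit 0) or odd (bit 1) multiple of delta."""
    c = coeffs.astype(float).ravel().copy()
    for start, bit in zip(range(0, c.size - 2, 3), bits):
        order = np.argsort(c[start:start + 3])          # indices of min, median, max
        med = c[start + order[1]]
        q = int(np.floor(med / delta))
        if q % 2 != int(bit):
            q += 1                                       # nearest step of the right parity
        new_med = q * delta
        c[start + order[1]] = new_med
        # Simplified repair: keep the modified sample as the window median.
        if new_med > c[start + order[2]]:
            c[start + order[2]] = new_med
        if new_med < c[start + order[0]]:
            c[start + order[0]] = new_med
    return c.reshape(coeffs.shape)

def extract_blind(coeffs, n_bits, delta):
    """Blind extraction, Eq. (16): the parity of the quantised window median gives b'."""
    c = coeffs.astype(float).ravel()
    bits = []
    for start in range(0, c.size - 2, 3):
        if len(bits) == n_bits:
            break
        med = np.median(c[start:start + 3])
        bits.append(int(np.rint(med / delta)) % 2)
    return np.array(bits, dtype=np.uint8)
```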

Authentication of extracted watermarks
Authentication is performed by comparing the extracted watermark with the original watermark information and computing the closeness between the two in a vector space.
Common authentication methods calculate the similarity correlation or Hamming distance, H, between the original embedded and extracted watermarks as

H(b, b′) = (1/N) Σ_{i=1}^{N} (b_i ⊕ b′_i), (17)

where N represents the length of the watermark sequence and ⊕ is the XOR logical operation between the respective bits.
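The authentication step of Eq. (17) as a minimal helper; the acceptance threshold in the usage example is an illustrative value, not one specified in the paper.

```python
import numpy as np

def hamming_distance(b, b_prime):
    """Normalised Hamming distance between embedded and extracted bits, Eq. (17)."""
    b = np.asarray(b, dtype=np.uint8)
    b_prime = np.asarray(b_prime, dtype=np.uint8)
    return np.count_nonzero(np.bitwise_xor(b, b_prime)) / b.size

# Example: accept the watermark when the distance falls below a chosen threshold.
is_authentic = hamming_distance([1, 0, 1, 1], [1, 0, 0, 1]) < 0.2
```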

Saliency Map Segmentation
This section presents the threshold-based saliency map segmentation, which is used to adapt the watermarking algorithms described in Sec. 3.1 so that the watermark strength changes according to the underlying VA properties. Figures 7(a) and 7(b) show an example original host frame and its corresponding saliency map, respectively, generated from the proposed methodology in Sec. 2. In Fig. 7(b), the light and dark regions within the saliency map represent the visually attentive and nonattentive areas, respectively. At this point, we employ thresholding to quantize the saliency map into coarse saliency levels, as fine granular saliency levels are not important in the proposed application. In addition, this may also reduce errors in saliency map regeneration during watermark extraction, as follows. Recalling the blind and nonblind watermarking schemes in Sec. 3.1, the host media source is only available within nonblind algorithms. In blind algorithms, identical saliency reconstruction might not be possible within the watermark extraction process due to the coefficient values changed by watermark embedding as well as potential attacks. Thus, the saliency map is quantized using thresholds, leading to regions of similar visual attentiveness. The employment of a threshold reduces saliency map reconstruction errors, which may occur as a result of any watermark embedding distortion, as justified further in Sec. 3.4. The thresholding strategy relies upon a histogram analysis approach. Histogram analysis provides automatic segmentation of the saliency map into two independent levels by employing the saliency threshold, T_s, where s ∈ S represents the saliency values in the saliency map, S. In order to segment highly conspicuous locations within a scene, first, the cumulative frequency function, f, of the ordered saliency values, s (from 0 to the maximum saliency value, s_max), is considered. Then T_s is chosen as

T_s = min{ s : f(s) ≥ p f_max }, (18)

where p corresponds to the fraction of pixels that can be set as the least attentive pixels and f_max = f(s_max) is the cumulative frequency corresponding to the maximum saliency value, s_max. An example of a cumulative frequency plot of a saliency map and the determination of T_s for p = 0.75 is shown in Fig. 7(c). Saliency-based thresholding determines the coefficients' eligibility for low- or high-strength watermarking. To ensure VA-based embedding, the watermark weighting parameter, α, in Eqs. (11) and (15) is made variable, α(j, k), dependent upon T_s, as follows:

α(j, k) = α_max if s(j, k) < T_s; α_min if s(j, k) ≥ T_s, (19)

where α(j, k) is the adaptive watermark strength map giving the α value for the corresponding saliency at a given pixel coordinate (j, k). The watermark weighting parameters, α_min and α_max, correspond to the low and high strength values, respectively, and their typical values are determined from the analysis within Sec. 3.3. As shown in Fig. 7(d), the most and the least salient regions are given watermark weighting parameters of α_min and α_max, respectively. An example of the final VA-based watermarking strength map is shown in Fig. 7(e), where a brighter intensity represents an increase in α.
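A sketch of the segmentation in Eqs. (18) and (19): the threshold T_s is the saliency value below which a fraction p of pixels fall, and the two α levels would come from the PSNR analysis of Sec. 3.3; the numeric defaults here are placeholders.

```python
import numpy as np

def watermark_strength_map(saliency, p=0.75, alpha_min=0.02, alpha_max=0.10):
    """Two-level watermark weighting map: alpha_max in the least attentive regions,
    alpha_min in the salient ones (Eqs. (18) and (19))."""
    t_s = np.quantile(saliency, p)       # Eq. (18): T_s from the cumulative frequency
    alpha = np.where(saliency < t_s, alpha_max, alpha_min)   # Eq. (19)
    return alpha, t_s
```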

Watermark Embedding Strength Calculation
The watermark weighting parameter strengths, α_max and α_min, can be calculated from the peak signal-to-noise ratio (PSNR) limits at which artifacts become visible within the image.
Visual distortion becomes gradually noticeable as the overall PSNR drops below 40 dB and toward 35 dB,64 so the minimum and maximum PSNR requirements are set to approximately 35 and 40 dB, respectively, for both the blind and nonblind watermarking schemes. These PSNR limits ensure the maximum amount of data can be embedded into any host image to enhance watermark robustness without substantially distorting the media quality. Therefore, it is sensible to incorporate PSNR in determining the watermark strength parameter α.
Recall that the PSNR, which measures the error between two images with dimensions X × Y, is expressed in the pixel domain as

PSNR(I, I′) = 10 log10( M² X Y / Σ_{j,k} (I(j, k) − I′(j, k))² ), (20)

where M is the maximum possible value of the data, and I(j, k) and I′(j, k) are the original and watermarked image pixel values at indices (j, k), respectively. Considering the use of orthogonal wavelet kernels and Parseval's theorem, the mean square error in the wavelet domain due to watermarking is equal to the mean square error in the spatial domain.48 Therefore, Eq. (20) can be redefined in the transform domain for nonblind magnitude-based multiplicative watermarking, shown in Eq. (11), as

PSNR = 10 log10( M² X Y / Σ_{m,n} (α W(m, n) C(m, n))² ). (21)

By rearranging for α, an expression determining the watermark weighting parameter for a desired PSNR value is derived for nonblind watermarking as

α = M √( X Y / ( 10^{PSNR/10} Σ_{m,n} (W(m, n) C(m, n))² ) ). (22)

Similarly, for the blind watermarking scheme described in Sec. 3.1.2, the PSNR in the transform domain can be estimated by substituting the median and modified median coefficients, C_med and C′_med, respectively, into Eq. (20). Subsequent rearranging results in an expression for the total error in the median values in terms of the desired PSNR:

Σ_{m,n} (C′_med(m, n) − C_med(m, n))² = M² X Y / 10^{PSNR/10}. (23)

Equation (23) determines the total coefficient modification for a given PSNR requirement; hence, it is used to set α in Eq. (15).
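A sketch of Eq. (22) for the nonblind case; it assumes an orthogonal wavelet so that, by Parseval's theorem, the pixel-domain PSNR can be written directly in terms of the modified subband coefficients. The function name and argument layout are illustrative.

```python
import numpy as np

def alpha_for_target_psnr(subband, W, image_pixels, target_psnr_db, max_val=255.0):
    """Watermark strength alpha that approximately yields a target PSNR for the
    multiplicative embedding C' = C + alpha * W * C (Eqs. (20)-(22))."""
    err_energy = np.sum((W * subband) ** 2)          # sum over embedded coefficients
    return max_val * np.sqrt(image_pixels / (10 ** (target_psnr_db / 10.0) * err_energy))
```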

Saliency Map Reconstruction
For nonblind watermarking, the host data are available during watermark extraction, so an identical saliency map can be generated. However, a blind watermarking scheme requires the saliency map to be reconstructed from the watermarked media, whose pixel values may differ slightly from the original host media.
Thresholding the saliency map into two levels, as described in Sec. 3.2, ensures high accuracy in the saliency model reconstruction for blind watermarking. Further objective experimental analysis reveals that the use of thresholding improves the match of saliency coefficients to up to 99.4%, compared with only approximately 55.6% of coefficients when thresholding is not used; hence, reconstruction errors are greatly reduced.

Experimental Results and Discussion
The performance of the proposed video VA method and its application to robust video watermarking is presented and discussed in this section. The video VAM is evaluated in terms of accuracy with respect to the ground truth and computational time in Sec. 4.1. The video VA-based watermarking is evaluated in terms of embedding distortion and robustness to compression in Sec. 4.2.

Visual Attention Model Evaluation
For attention model evaluation, the video dataset is taken from the literature.65 Experimental results demonstrate the model performance against existing state-of-the-art methodologies. The proposed algorithm is compared with the Itti,15 dynamic,66 and Fang19 video VAMs in terms of accurate salient region detection and computational efficiency. The Itti framework is seen as the foundation and benchmark for VA model comparison, whereas the dynamic algorithm depends upon locating energy peaks within incremental length coding. The more recent Fang algorithm uses a spatiotemporally adaptive entropy-based uncertainty weighting approach.
Figure 9 shows the performance of the proposed model and compares it against the Itti, dynamic, and Fang algorithms. The Itti motion model saliency maps are depicted in column 2, the dynamic model saliency maps in column 3, and the Fang model in column 4. Results obtained using the proposed model are shown in column 5, where, from top to bottom, the locally moving snowboarder, flower, and bird are clearly identified as salient objects. Corresponding ground truth frames are shown in column 6, which depict all salient local object movement. The saliency maps from our model are dominated by the presence of significant object motion. This is in contrast to the other models, where differences between local and global movements are not fully accounted for; therefore, those maps are dominated by spatially attentive features, leading to salient object misclassification. For example, the trees within the background of the snowboard sequence are estimated as an attentive region while a man is performing acrobatics within the frame foreground.
The receiver operating characteristic (ROC) curves and the corresponding area under curve (AUC) values, shown in Fig. 10 and the top row of Table 1, respectively, provide an objective model evaluation. The results show that the proposed method is close to the recent Fang model and exceeds the performance of the Itti motion and dynamic models, with 3.5% and 8.2% higher ROC-AUCs, respectively. Further results demonstrating our video VA estimation model across four video sequences are shown in Fig. 11. Video saliency becomes more evident when viewed as a sequence rather than from still frames. The video sequences with corresponding saliency maps are available for viewing in Ref. 67.
The bottom row of Table 1 shows the complexity of each algorithm in terms of average per-frame computational time. The values in the table are calculated from the mean computational time over every frame within the video database and give the time required to form a saliency map from the original raw frame, including any required transformations. From the table, the proposed low-complexity methodology produces a video saliency map in around 30%, 88%, and 0.5% of the time required by the Itti, dynamic, and Fang models, respectively. Additionally, the proposed model uses the same wavelet decomposition scheme used for watermarking. Therefore, the overall complexity of visual saliency-based watermarking is low compared to all three methods compared in this paper.

Visual Attention-Based Video Watermarking
The proposed VA-based watermarking is agnostic to the watermark embedding methodology; thus, it can be used with any existing watermarking algorithm. In our experiments, we use the nonblind embedding proposed by Xia et al.51 and the blind algorithm proposed by Xie and Arce52 as our reference algorithms.
A series of experimental results is generated for our video watermarking case study as described in Sec. 3, analyzing both watermark robustness and imperceptibility. Objective and subjective quality evaluation tools are employed to provide a comprehensive embedding distortion measure. Robustness against H.264/AVC compression68 is reported, as common video attacks comprise platform reformatting and video compression. Since the VA-based watermarking scheme is presented here as a case study of the exploitation of the proposed VAM, our performance evaluation focuses on the embedding distortion and the robustness with respect to compression attacks. Compression attacks are given focus because watermarking algorithms often employ a higher watermarking strength to counter compression and requantization attacks. In this work, we demonstrate robustness against H.264/AVC compression as an example. The watermarking evaluation results are reported using the four example video sequences (shown in Fig. 11) from the same data set used for VAM evaluation in the previous section.
α_max and α_min, approximating PSNRs of 35 and 40 dB, respectively, are obtained by applying Eqs. (22) and (23). Four scenarios of varied watermark embedding strength are considered for the VA-based video watermarking evaluation:
1. a uniform α_min throughout the entire sequence;
2. the proposed VAM-based α strength;
3. a uniform average watermark strength, α_ave, chosen as α_ave = (α_min + α_max)/2; and
4. a uniform α_max throughout the entire video sequence.
The experimental results are shown in the following two sections: embedding distortion (visual quality) and robustness.

Embedding distortion
The embedding distortion can be evaluated using objective or subjective metrics. While objective quality measurements are mathematical models that are expected to approximate results from subjective assessments and are easy to compute, subjective measurements capture a viewer's overall opinion of the quality of experience. These metrics are often complementary to each other and are particularly important in this paper for measuring the effect of the proposed watermarking algorithms on imperceptibility.
1. "Objective metrics" define a precise value, dependent upon mathematical modeling, to determine visual quality.Such metrics include PSNR, structural similarity index measure (SSIM), 69 just noticeable difference, 70 and video quality metric (VQM). 71PSNR that calculates the average error between two images is one of the most commonly used visual quality metrics and is described in Eq. (20).Unlike PSNR, SSIM focuses on a quality assessment based on the degradation of structural information.SSIM assumes that the HVS is highly adapted for extracting structural information from a scene.A numeric output is generated between 1 and 0 and higher video quality is represented by values closer to 1. VQM evaluates video quality based upon subjective human perception modeling.It incorporates numerous aspects of early visual processing, including both luma and chroma channels, a combination of temporal and spatial filtering, light adaptation, spatial frequency, global contrast, and probability summation.A numeric output is generated between 1 and 0 and higher video quality is represented by values closer to 0. VQM is a commonly used video quality assessment metric as it eliminates the need for participants to provide a subjective evaluation.
Although subjective evaluation is considered the most suitable evaluation for the proposed method in this paper, the visual quality evaluation in terms of the PSNR, SSIM, and VQM metrics is shown in Tables 2 and 3 for the nonblind and blind watermarking schemes, respectively. For both PSNR and SSIM, higher values signify better visual quality. The performance of the four watermarking scenarios in terms of both SSIM and PSNR, rank ordered from highest visual quality, is: low strength embedding (α_min) > VA-based algorithm / average strength > high strength embedding (α_max). From the tables, PSNR improvements of approximately 3 dB are achieved when comparing the proposed VA-based approach with the constant high strength scenario. The SSIM measures remain consistent for each scenario, with a decrease of 2% for the high-strength watermarking model in most cases. In terms of the VQM metric, which mimics subjective evaluation, the proposed VA-based watermarking consistently performs better than the average or high-strength watermarking scenarios.
Identical objective metric values, such as PSNR, SSIM, and VQM, do not necessarily imply identical perceived visual quality. Two distorted frames with comparable PSNR, SSIM, or VQM values do not guarantee coherent media quality. Two independent viewers can undergo entirely different visual experiences, as two similarly distorted frames can elicit contrasting opinions on which contains higher visual quality. To provide a realistic visual quality evaluation, subjective testing is used to analyze the impact of the proposed watermarking scheme on the overall perceived human viewing experience. 2. "Subjective evaluation" measures visual quality by recording the opinion of human subjects on the perceived visual quality. The watermarked videos were viewed by 30 subjects, following the standard ITU-T35 viewing test specifications often used in compression quality evaluation experiments. The final rating was obtained by averaging all ratings given by the subjects. This work employs two subjective evaluation metrics that are computed from the subjective viewing scores, as follows. The "double stimulus continuous quality test" (DSCQT) subjectively evaluates media distortion using a continuous scale. The original and watermarked media are shown to the viewer in a randomized order. The viewer must rate the media quality of the original and watermarked images individually using a continuous scale, as shown in Fig. 12(a). The degradation category rating (DCR) value is then calculated as the absolute difference between the subjective ratings for the two test images.
The "double stimulus impairment scale test" (DSIST) determines the perceived visual degradation between two media sources, A and B, using a discrete scale. A viewer must compare the quality of B with respect to A on a 5-point discrete absolute category rating (ACR) scale, as shown in Fig. 12(b).
In a subjective evaluation session, training images are first shown to acclimatize viewers to both the ACR and DCR scoring systems. In either of the two subjective tests, a higher value on the DCR or ACR rating scale represents a greater perceived visual quality. Figure 13 shows an overall timing diagram for each subjective testing procedure, showing the sequence of test image displays for scoring by the viewers. The video display time, t1, and the blank screen time, t2, before the change of video are chosen according to the ITU-T viewing test specifications.35 The subjective evaluation performed in this work comprises the DSCQT and DSIST, and the results are shown in Fig. 14 for both nonblind and blind watermarking schemes. The top and bottom rows in Fig. 14 show subjective results for the nonblind and blind watermarking cases, respectively, whereas the left and right columns show the results using the DSCQT and DSIST evaluation tools. Consistent results are obtained for both the blind and nonblind scenarios. Figure 14 shows the subjective test results for the DSCQT and DSIST averaged over the four video test sequences. For the DSCQT, the lower the DCR, the better the visual quality, i.e., the fewer the embedding distortions. In the given results, when comparing the proposed and low strength embedding methodologies, the DCR value deviates by only approximately one unit on the rating scale, suggesting a subjectively similar visual quality. For each of the blind and nonblind watermarking cases, in both the objective and subjective visual quality evaluations, the low strength watermark and VAM-based watermarking sequences yield similar visual quality, whereas the high strength embedded sequence appears severely more distorted. Low-strength watermarking provides high imperceptibility but is fragile, as discussed in Sec. 4.2.2.

Robustness
Video reformatting and compression are frequent and typically unintentional adversary attacks; hence, watermark tolerance to H.264/AVC compression is calculated. Robustness against H.264/AVC compression for the nonblind and blind video watermarking schemes is shown in Figs. 16(a) and 16(b), respectively. For simulating the watermark robustness, five constant quantization parameter (QP) values are used to compress the high strength, average strength, VA-based, and low strength test sequences. In both scenarios, as shown in the plots, the proposed VA-based methodology shows an increase in robustness compared with its low strength counterpart, where a lower Hamming distance indicates better robustness. From the plots in Fig. 16, Hamming distance reductions of up to 39% for the nonblind case and 22% for the blind case are possible when comparing the low strength and VA-based models. Naturally, the high-strength watermarking scheme yields strong robustness (a low Hamming distance) but is highly perceptible (low visual quality), as described previously. The proposed watermarking scheme also shows slightly increased robustness to H.264/AVC compression, as shown in Fig. 16, when compared against a constant average strength watermark. It is worth noting that, for a constant QP value, the compression ratio is inversely proportional to the watermark strength, i.e., as the watermark strength increases, the overall compression ratio decreases due to the extra embedded watermark data.
The proposed VA-based method results in robustness close to that of the high-strength watermarking scheme, while showing low distortion, as in the low-strength watermarking approach. The increase in robustness coupled with high imperceptibility, verified by subjective and objective metrics, makes the VA-based methodology highly suitable for providing an efficient watermarking scheme.

Conclusions
In this paper, we have presented a video watermarking algorithm using a motion compensated VAM. The proposed method exploits both spatial and temporal cues for saliency, modeled in a motion-compensated spatiotemporal wavelet multiresolution analysis framework. The spatial cues were modeled using the 2-D wavelet coefficients. The temporal cues were modeled using the temporal wavelet coefficients by considering the global and local motion in the video. We have used the proposed VAM in visual attention-based video watermarking to achieve robust video watermarking that has minimal or no effect on the visual quality. In the proposed scheme, a two-level watermarking weighting parameter map is generated from the VAM saliency maps and data are embedded into the host frames according to the visual attentiveness of each region. By avoiding higher strength watermarking in the visually attentive regions, the resulting watermarked video achieved high perceived visual quality while preserving high robustness.
The proposed VAM outperforms the state-of-the-art video VA methods in joint saliency detection and computational complexity performance. The saliency maps from the proposed method are dominated by the presence of significant object motion. This is in contrast to the other models, where differences between local and global movements are not fully accounted for; therefore, those maps are dominated by spatially attentive features, leading to salient object misclassification. The watermarking performance was verified by performing subjective evaluations as well as using the objective metric VQM. For the same embedding distortion, the proposed VA-based watermarking achieved up to 39% (nonblind) and 22% (blind) improvement in robustness against H.264/AVC compression attacks, compared to the existing methodology that does not use the VAM. Finally, the proposed VA-based video watermarking results in visual quality similar to that of low-strength watermarking and robustness similar to that of high-strength watermarking.

Fig. 1 The three causes of global motion: camera panning, tilting, and zooming.

Fig. 4 Overall functional diagram of the spatial visual saliency model.

Changes in lighting, noise, and global motion compensation error account for the peaks present within Fig. 6(a), whereas the contribution from object movement is also present within Fig. 6(b). A local threshold, T, segments frames containing sufficiently noticeable local motion, M, from an entire sequence. If F1 and F2 are consecutive 8-bit luma frames within the same sequence, Eq. (6) classifies temporal frame dynamics using the frame difference D:

D(x, y) = |F1(x, y) − F2(x, y)|. (6)

From the histograms shown within Figs. 6(a) and 6(b), a local threshold value of T = D_max/10 determines motion classification, where D_max is the maximum possible frame pixel difference, and T is highlighted by a red dashed line within both figures. To reduce frame misclassification, at least 0.5% of the difference coefficients representing local motion, M, must be greater than T. For each temporally active frame, the Y channel provides sufficient information to estimate salient object movement without considering the U and V components.

Fig. 6 Difference frames after global motion compensation: a sequence (a) without local motion and (b) containing local motion.
The dataset comprises 15 video sequences, containing over 2000 frames in total. Ground truth video sequences have been generated from the database by subjective testing. A thumbnail from each of the 15 test sequences is shown in Fig. 8. Common test parameters for the VAM and, later, for watermarking, used throughout all experiments, include the orthogonal Daubechies length-4 (D4) wavelet for three levels of 2-D spatial decomposition and one level of motion compensated temporal Haar decomposition.

Fig. 11 Video visual attention estimation results for four example sequences: row 1, original frame from the sequence; row 2, proposed saliency map; and row 3, ground truth. Video sequences and the VA map sequences are available at Ref. 67.

Fig. 12 Subjective testing visual quality measurement scales: (a) DCR continuous measurement scale and (b) ACR ITU 5-point discrete quality scale.

Fig. 15 Example frames from the soccer and tennis sequences after watermarking with different embedding scenarios (for visual inspection): column 1, original frame; column 2, low strength watermarked frame; column 3, VA-based watermarked frame; and column 4, high strength watermarked frame.

Table 1 ROC-AUC values and computational times comparing state-of-the-art video domain VAMs.

Fig. 10 ROC curves comparing the performance of the proposed model with state-of-the-art video domain VAMs: Itti model,15 dynamic model,66 and Fang model.19

Table 2 PSNR, SSIM, and VQM averages over four video sequences for nonblind watermarking.