With the recent rapid growth of digital technologies, content protection now plays a major role within content management systems. Of the current systems, digital watermarking provides a robust and maintainable solution to enhance media security. The visual quality of the host media (often known as imperceptibility) and robustness are widely considered the two main properties vital for a good digital watermarking system. They conflict with each other, since improving one typically degrades the other, hence it is challenging to attain the right balance between them. This paper proposes a new approach to achieve high robustness in watermarking while not affecting the perceived visual quality of the host media by exploiting the concepts of visual attention (VA).
The human visual system (HVS) is sensitive to many features that draw attention toward specific regions in a scene, a well-studied topic in psychology and biology.1,2 VA is an important and complex biological process that helps to quickly identify potential danger, e.g., predators, or prey in a cluttered visual world,3 as attention to one target leaves other targets less available.4 Recently, considerable effort has been devoted in the literature to modeling VA,3 which has applications in many related domains including media quality evaluation. Areas of visual interest stimulate nerve cells, causing the human gaze to fixate on a particular scene area. The visual attention model (VAM) highlights these visually sensitive regions, which stimulate a neural response within the primary visual cortex.5 Whether that neural stimulation comes from contrast in intensity, a distinctive face, unorthodox motion, or a dominant color, these stimulating regions divert human attention, providing highly useful saliency maps within the media processing domain.
Human vision behavioral studies6 and feature integration theory1 have prioritized the combination of three visually stimulating low-level features: intensity, color, and orientation, which form the foundation for numerous image domain saliency models.3,7,8 Most saliency models use multiresolution analysis.9–11 Temporal features must also be considered, as moving objects are more eye-catching than most static locations.12 Work has seldom been directed toward video saliency estimation, in comparison to its image domain counterpart, as temporal feature consideration dramatically increases the overall VA framework complexity. Most typical video saliency estimation methodologies3,13–18 exist as supplementary extensions of their image domain algorithms. Research estimating VA within video can also be derived from exploiting spatiotemporal cues,19,20 structural tensors,21 and optical flow.22
However, none of these algorithms explicitly captures spatiotemporal cues that account for both object motion between frames and the motion caused by camera movement. Motion within a video sequence falls into two categories, namely local motion and global motion. Local motion is the result of object movement within frames and comprises all salient temporal data. One major feature associated with local motion is independence, so no single transformation can capture all local movement for the entire frame. Local motion can only be captured from successive frame differences if the camera remains motionless. In contrast, global motion describes all motion in a scene based on a single affine transform from the previous frame and is usually a result of camera movement during a scene. The transform consists of three components, i.e., camera panning, tilting, and zooming or, in image processing terms, translation, rotation, and scaling. Figure 1 shows three causes of global motion. This paper proposes a new video VAM that accounts for local and global motions using a wavelet-based motion compensated temporal filtering framework. Compensating for any perceived camera movement reduces the overall effect of global movement so that salient local object motion can be captured during scenes involving dynamic camera action.
A region of interest (ROI) dictates the most important visible aspects within media, so distortion within these areas will be highly noticeable to any viewer. The VAM computes such regions. This paper proposes a unique video watermarking algorithm exploiting the new video VAM. In frequency domain watermarking, robustness is usually achieved by increasing the embedding strength. However, this results in visual distortions in the host media and thus lower imperceptibility of the embedding. In the proposed method, high watermark robustness is achieved without compromising the visual quality of the host media by embedding greater watermark strength within the less visually attentive regions of the media, as identified by the video VAM (in Sec. 2).
Related work includes defining an ROI23–28 and increasing the watermark strength in the ROI to address cropping attacks. However, in these works, the ROI extraction was based only on foreground-background models rather than on a VAM. Such solutions have major drawbacks: (a) increasing the watermark strength within eye-catching frame regions is perceptually unpleasant, as human attention will naturally be drawn toward any additional embedding artifacts, and (b) scenes exhibiting sparse salience will potentially contain extremely fragile watermark data or none at all. Sur et al.29 proposed a pixel domain algorithm to improve embedding distortion using an existing visual saliency model described in Ref. 3. However, that work reports only limited observations on perceptual quality and does not consider robustness.
Our previous work30,31 shows the exploitation of image saliency in achieving image watermarking robustness. It is infeasible to simply extend the VA-based image domain algorithm into a frame-by-frame video watermarking scheme, as temporal factors must first be considered within the video watermarking framework. A viewer has unlimited time to absorb all information within an image, so could potentially view both the conspicuous and the visually uninteresting aspects of a scene. However, in a video sequence, the visual cortex has very limited processing time to analyze each individual frame. Human attention will naturally be drawn toward temporally active, visually attentive regions. Thus the proposed motion compensated VAM is a suitable choice for VA-based video watermarking. By employing VA concepts within digital watermarking, increased overall robustness against adversarial attacks can be achieved while limiting the visual distortions perceived by the human eye. The concept of VA-based image and video watermarking was first introduced in our early work.30,32 Recent work following this concept can be found in watermarking H.264 video33 and in an application to cryptography.34 In contrast, in this paper we propose a video watermark embedding strategy based on VA modeling that uses the same spatiotemporal decomposition as the video watermarking scheme. In addition, the VAM compensates for global motion in order to capture local motion in the saliency model.
Performances of our saliency model and the watermarking algorithms are separately evaluated by comparisons with existing schemes. Subjective tests for media quality assessment recommended by the International Telecommunication Union (ITU),35 largely missing in the watermarking literature, are also conducted to complement the objective measurements. Major contributions of this paper are:
• A new motion compensated spatiotemporal video VAM that considers object motion between frames as well as global motions due to camera movement.
• New blind and nonblind video watermarking algorithms that are highly imperceptible and robust against compression attacks.
• Subjective tests that evaluate visual quality of the proposed watermarking algorithms.
The saliency model and the watermarking algorithms are evaluated using the existing video datasets described in Sec. 4.1. The initial concept of the motion compensated video attention model was reported earlier in the form of a conference publication36 while this paper discusses the proposed scheme in detail with an exhaustive evaluation and proposes a case study describing a new video watermarking scheme that uses the attention model.
Motion Compensated Video Visual Attention Model
The most attentive regions within media can be captured by exploiting characteristics of the HVS. In this section, a method is proposed to detect saliency information within a video. The proposed method incorporates motion compensated spatiotemporal wavelet decomposition combined with HVS modeling to capture saliency information. A unique approach combining salient temporal, intensity, color, and orientation contrasts forms the essential video saliency methodology.
Physiological and psychophysical evidence demonstrates that visually stimulating regions occur at different scales within media37 and are influenced by object motion within the scene.12 Consequently, the models proposed in this work exploit the multiresolution property of the wavelet transform and incorporate a motion compensation algorithm. By exploiting the multiresolution spatiotemporal representation of the wavelet transform, VA is estimated directly within the wavelet domain. The description of the video saliency model is organized as follows. First, Sec. 2.1 describes the global motion compensation. Section 2.2 then describes the spatial saliency model, and Sec. 2.3 illustrates the temporal saliency feature map generation. Finally, Sec. 2.4 combines the spatial and temporal models to estimate video visual saliency. An overall functional block diagram of our proposed model is shown in Fig. 2. For the spatial saliency model in this work, we adopted our image VAM proposed in Refs. 30 and 31.
Global Motion Compensated Frame Difference
Compensation for global motion depends upon detecting homogeneous motion vectors (MVs) that are consistent throughout the frame. Figure 3 considers the motion estimation between two consecutive frames taken from the coastguard sequence. A fixed block size based on the frame resolution determines the number of MV blocks. The magnitude and phase of the MVs are represented by the size and direction of the arrows, respectively, whereas the absence of an arrow portrays an MV of zero. First, it is assumed that a greater percentage of pixels lies within the background than within moving objects, so large densities of similar MVs are the result of dynamic camera action. To compensate for camera panning, the entire reference frame is spatially translated by the most frequent MV, the global camera MV. This process is applied prior to the wavelet decomposition to obtain a global motion compensated saliency estimation. The global motion compensation is described in Eq. (1)
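The panning compensation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the implementation used in this work: the function names, the (dx, dy) sign convention, the zero filling of uncovered pixels, and the premise that the block MVs have already been estimated are all choices made here for illustration.

```python
from collections import Counter


def global_motion_vector(block_mvs):
    """Return the most frequent (dx, dy) among the block motion vectors."""
    return Counter(block_mvs).most_common(1)[0][0]


def translate(frame, mv, fill=0):
    """Shift a 2-D frame (list of rows) by mv = (dx, dy).

    Pixels that have no source after the shift are set to `fill`
    (an assumption; border handling is not specified in the text).
    """
    dx, dy = mv
    height, width = len(frame), len(frame[0])
    out = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            src_y, src_x = y - dy, x - dx
            if 0 <= src_y < height and 0 <= src_x < width:
                out[y][x] = frame[src_y][src_x]
    return out


def compensated_difference(reference, current, block_mvs):
    """Absolute frame difference after translating the reference frame
    by the global camera MV."""
    shifted = translate(reference, global_motion_vector(block_mvs))
    return [[abs(c - s) for c, s in zip(cur_row, ref_row)]
            for cur_row, ref_row in zip(current, shifted)]
```

For a pure one-pixel pan with no object motion, the compensated difference is zero everywhere, so any remaining nonzero differences can be attributed to local motion, noise, or compensation error.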
Compensating for other camera movement can be achieved by searching for a particular pattern of MVs. For example, a circular MV pattern indicates camera rotation, and MVs converging on or diverging from a particular point indicate camera zooming. An iterative search over all possible MV patterns can cover each type of global camera action.38 Speeded up robust features detection39 could be used to directly align key feature points between consecutive frames, but this would be computationally expensive. This model only requires a fast, rough global motion estimate to reduce the effect of global camera motion on the overall saliency map.
Spatial Saliency Model
As the starting point in generating the saliency map from a color image/frame, RGB color space is converted to YUV color spectral space as the latter exhibits prominent intensity variations through its luminance channel Y. First, the two-dimensional (2-D) forward discrete wavelet transform (FDWT) is applied on each Y, U, and V channel to decompose them in multiple levels. The 2-D FDWT decomposes an image in frequency domain expressing coarse grain approximation of the original signal along with three fine grain orientated edge information at multiple resolutions. Discrete wavelet transform (DWT) captures horizontal, vertical, and diagonal contrasts within an image portraying prominent edges in various orientations. Due to the dyadic nature of the multiresolution wavelet transform, the image resolutions are decreased after each wavelet decomposition iteration. This is useful in capturing both short and long structural information at different scales and useful for saliency computation. The absolute values of the wavelet coefficients are normalized so that the overall saliency contributions come from each subband and prevent biasing toward the finer scale subbands. An average filter is also applied to remove unnecessary finer details. To provide full resolution output maps, each of the high frequency subbands is consequently interpolated up to full frame resolution. The interpolated subband feature maps, (horizontal), (vertical), and (diagonal), , for all decomposition levels are combined by a weighted linear summation as
A feature map promotion and suppression steps follow next as shown in Eq. (3). If is the average of local maxima present within the feature map and is the global maximum, the promotion and suppression normalization is achieved by
Finally, the overall saliency map, , is generated by
Finally, the overall map is generated by using a weight summation of all color channels as shown in Fig. 4.
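The promotion and suppression step can be illustrated with a short sketch. It assumes the Itti-style normalization, in which a feature map is scaled by the squared difference between its global maximum and the average of its other local maxima; excluding the global maximum from that average, the 8-connected neighbourhood, and the function names are assumptions made here, not details confirmed by the text.

```python
def local_maxima(fmap):
    """Values of pixels strictly greater than all 8-connected neighbours."""
    height, width = len(fmap), len(fmap[0])
    peaks = []
    for y in range(height):
        for x in range(width):
            neighbours = [fmap[j][i]
                          for j in range(max(0, y - 1), min(height, y + 2))
                          for i in range(max(0, x - 1), min(width, x + 2))
                          if (j, i) != (y, x)]
            if all(fmap[y][x] > n for n in neighbours):
                peaks.append(fmap[y][x])
    return peaks


def promote_suppress(fmap):
    """Scale the map by (global_max - mean of the other local maxima)**2."""
    global_max = max(max(row) for row in fmap)
    others = local_maxima(fmap)
    if global_max in others:
        others.remove(global_max)  # keep only the competing peaks
    mean_other = sum(others) / len(others) if others else 0.0
    weight = (global_max - mean_other) ** 2
    return [[value * weight for value in row] for row in fmap]
```

A map with one dominant peak keeps most of its energy, whereas a map with several peaks of comparable height is scaled toward zero, which is the behavior the normalization is intended to produce.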
Temporal Saliency Model
2-D + t wavelet domain
We extend our spatial saliency model toward video domain saliency by utilizing a three-dimensional wavelet transform. Video coding research provides evidence that differing texture and motion characteristics occur after wavelet decomposition from the t + 2-D domain40 and its alternative, the 2-D + t transform.41,42 The t + 2-D decomposition compacts most of the transform coefficient energy within the low frequency temporal subband and provides efficient compression within the temporal high frequency subbands. Vast quantities of the high frequency coefficients have zero or near-zero magnitude, which limits the transform's usefulness within this framework. Alternatively, 2-D + t decomposition produces greater transform energy within the higher frequency components, i.e., a greater number of larger, nonzero coefficients, and reduces computational complexity to a great extent. A description of the reduced computational complexity of 2-D + t compared to t + 2-D can be found in Ref. 42. Therefore, in this work we have used a 2-D + t decomposition as shown in Fig. 5 (three levels of spatial followed by one level of temporal Haar wavelet decomposition).
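The temporal stage of this decomposition can be sketched as a single-level orthonormal Haar transform applied across two co-located arrays, for example the same spatial subband taken from two consecutive frames. The function name and the list-of-rows representation are assumptions made for illustration, and the spatial D4 stage and motion compensation are omitted.

```python
import math


def temporal_haar(band_a, band_b):
    """One-level orthonormal Haar transform along time.

    Returns (low, high): the temporal low-pass and high-pass subbands
    computed element by element from two co-located 2-D arrays.
    """
    scale = 1.0 / math.sqrt(2.0)
    low = [[(a + b) * scale for a, b in zip(row_a, row_b)]
           for row_a, row_b in zip(band_a, band_b)]
    high = [[(a - b) * scale for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(band_a, band_b)]
    return low, high
```

The high-pass output is zero wherever the two inputs agree, so nonzero temporal high-pass coefficients mark locations that changed between the frames.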
Temporal saliency feature map
To acquire accurate video saliency estimation, both spatial and temporal features within the wavelet transform are considered. The wavelet-based spatial saliency model, described in Sec. 2.2, constitutes the spatial element for the video saliency model, whereas this section concentrates upon establishing temporal saliency maps, .
Similar methodology to expose temporal conspicuousness is implemented in comparison to the spatial model in Sec. 2.2. First, the existence of any palpable local object motion is determined within the sequence. Figure 6 shows the histograms of two globally motion compensated frames. Global motion is any frame motion due to camera movement, whether that be panning, zooming, or rotation (see Sec. 2.1). Change within lighting, noise, and global motion compensation error account for the peaks present within Fig. 6(a), whereas the contribution from object movement is also present within Fig. 6(b). A local threshold, , segments frames containing sufficiently noticeable local motion, , from an entire sequence. If and are consecutive 8-bit luma frames within the same sequence, Eq. (6) classifies temporal frame dynamics using frame difference
From the histograms shown within Figs. 6(a) and 6(b), a local threshold value of determines motion classification, where is the maximum possible frame pixel difference, and is highlighted by a red dashed line within both figures. A 0.5 percent error ratio of coefficients representing local motion must be greater than to reduce frame misclassification. For each temporally active frame, the Y channel renders sufficient information to estimate salient object movement without considering the U and V components.
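The frame classification can be sketched as follows, assuming the rule is that a frame is temporally active when more than 0.5 percent of its globally compensated pixel differences exceed the local threshold. The function name is hypothetical, and the threshold is left as a parameter because its value is not reproduced here.

```python
def has_local_motion(prev_frame, cur_frame, threshold, min_ratio=0.005):
    """Return True if the share of pixels whose absolute luma difference
    exceeds `threshold` is greater than `min_ratio` (0.5 percent by default).

    Both frames are 2-D lists of 8-bit luma values and are assumed to be
    globally motion compensated already.
    """
    total = 0
    active = 0
    for prev_row, cur_row in zip(prev_frame, cur_frame):
        for p, c in zip(prev_row, cur_row):
            total += 1
            if abs(c - p) > threshold:
                active += 1
    return active / total > min_ratio
```

Small differences caused by lighting changes or noise stay below the threshold and do not trigger the classification, whereas a compact moving object produces enough large differences to exceed the ratio.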
The methodology bears a distinct similarity to the spatial domain approach as the high pass temporal subbands: , , and , for levels of spatial decomposition, combine after full wavelet decomposition, which is shown in Fig. 5. The decomposed data are forged using comparable logic as Eq. (2), as all transformed coefficients are segregated into 1 of 3 temporal subband feature maps. This process is described as
Spatial-Temporal Saliency Map Combination
The spatial and temporal maps are combined to form an overall saliency map. The primary visual cortex is extremely sensitive to object movement so if enough local motion is detected within a frame, the overall saliency estimation is dominated by any temporal contribution with respect to local motion . Hence, the temporal weightage parameter, , determined from Eq. (6) is calculated as43 Consequently, if no local motion is detected with a frame, the spatial model contributes toward the final saliency map in its entirety, hence is a binary variable. The equation forging the overall saliency map is Fig. 2.
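One plausible reading of this combination is a per-pixel weighted sum whose temporal weight is switched on only for frames classified as containing local motion. The sketch below assumes that form; the convex combination, the function name, and the default `motion_weight` are assumptions, not the weighting defined in this work.

```python
def combine_saliency(spatial, temporal, local_motion, motion_weight=1.0):
    """Blend spatial and temporal saliency maps.

    The temporal weight is `motion_weight` when local motion was detected
    and 0 otherwise, so a static frame falls back to the spatial map alone.
    """
    gamma = motion_weight if local_motion else 0.0
    return [[gamma * t + (1.0 - gamma) * s for s, t in zip(s_row, t_row)]
            for s_row, t_row in zip(spatial, temporal)]
```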
Visual Attention-Based Video Watermarking
We propose an algorithm that provides a solution for blind and nonblind VA-based video watermarking. The video saliency model described in Sec. 2 is utilized within the video watermarking framework to determine the watermark embedding strength. In keeping with the video VA model, watermark data are embedded within the wavelet domain as outlined in Sec. 2.3.1. The VAM identifies the ROI most conspicuous to human vision, which is a highly exploitable property when designing watermarking systems. The subjective effect of watermark embedding distortion can be greatly reduced if any artifacts occur within inattentive regions. By incorporating VA-based characteristics within the watermarking framework, algorithms can retain media visual quality and increase overall watermark robustness compared with methodologies that do not exploit VA. This section proposes two new video watermarking approaches (blind and nonblind) that incorporate the VAM. In both scenarios, a content-dependent saliency map is generated and used to calculate the region-adaptive watermarking strength parameter alpha, α. A lower value of α in salient regions and a higher value in nonsalient regions ensure higher imperceptibility of the watermark embedding distortions while maintaining greater robustness.
At this point, we describe the classical wavelet-based watermarking schemes without considering the VAM and subsequently propose the new approach that incorporates the saliency model. Frequency-based watermarking methodologies, more precisely wavelet domain watermarking, are highly favored in current research. The wavelet domain is also compatible with many image coding schemes, e.g., JPEG2000,44 and video coding schemes, e.g., motion JPEG2000 and motion-compensated embedded zeroblock coding (MC-EZBC),45 leading to smooth adaptability within modern frameworks. Due to its multiresolution decomposition and its ability to retain spatial synchronization, which are not provided by other transforms (the discrete cosine transform, for example), the DWT provides an ideal choice for robust watermarking.46–55
The FDWT is applied on the host image before watermark data are embedded within the selected subband coefficients. The inverse discrete wavelet transform reconstructs the watermarked image. The extraction operation is performed after the FDWT. The extracted watermark data are compared to the original embedded data sequence before an authentication decision verifies the watermark presence. A wide variety of potential adversary attacks, including compression and filtering, can occur in an attempt to distort or remove any embedded watermark data. A detailed discussion of such watermarking schemes can be found in Ref. 56.
Magnitude-based multiplicative watermarking34,51,53,57–59 is a popular choice when using a nonblind watermarking system due to its simplicity. Wavelet coefficients are modified based on the watermark strength parameter, α, the magnitude of the original coefficient, and the watermark information. The watermarked coefficients are obtained as follows:
Quantization-based watermarking52,54–63 is a blind scheme that relies on modifying various coefficients toward a specific quantization step. As proposed in Ref. 52, the algorithm is based on modifying the median coefficient toward the step size by using a running nonoverlapping window. The altered coefficient must retain the median value of the three coefficients within the window after the modification. The equation calculating the step size is described as follows:
Authentication of extracted watermarks
Authentication is performed by comparing the extracted watermark with the original watermark information and computing the closeness between the two in a vector space. Common authentication methods calculate the similarity correlation or the Hamming distance between the original embedded and extracted watermarks as follows:
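To make the nonblind scheme and the authentication step concrete, the sketch below assumes one common magnitude-based multiplicative form, in which each selected coefficient is changed by α times its own magnitude with the sign set by the watermark bit, and a Hamming distance normalized by the watermark length. The exact embedding equation, the bit-to-sign mapping, and the function names are assumptions made for illustration.

```python
def embed_multiplicative(coeffs, bits, alpha):
    """Assumed form: C' = C + alpha * |C| * w, with w = +1 for bit 1
    and w = -1 for bit 0. Zero-valued coefficients carry no information."""
    return [c + alpha * abs(c) * (1 if b else -1)
            for c, b in zip(coeffs, bits)]


def extract_nonblind(marked, original):
    """Recover bits by comparing watermarked and original coefficients
    (the original host is available in a nonblind scheme)."""
    return [1 if m > c else 0 for m, c in zip(marked, original)]


def hamming_distance(embedded_bits, extracted_bits):
    """Fraction of bit positions that differ (0 means a perfect match)."""
    differing = sum(a != b for a, b in zip(embedded_bits, extracted_bits))
    return differing / len(embedded_bits)
```

A lower Hamming distance after an attack indicates a more robust embedding, which is how the strength parameter α trades off against visual quality.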
Saliency Map Segmentation
This section presents the threshold-based saliency map segmentation which is used for adapting the watermarking algorithms described in Sec. 3.1 in order to change the watermark strength according to the underlying VA properties. Figures 7(a) and 7(b) show an example original host frame and its corresponding saliency map, respectively, generated from the proposed methodology in Sec. 2. In Fig. 7(b), the light and dark regions within the saliency map represent the visually attentive and nonattentive areas, respectively. At this point, we employ thresholding to quantize the saliency map into coarse saliency levels as fine granular saliency levels are not important in the proposed application. In addition, that may also lead to reducing errors in saliency map regeneration during watermark extraction as follows. Recalling blind and nonblind watermarking schemes in Sec. 3.1, the host media source is only available within nonblind algorithms. However in blind algorithms, identical saliency reconstruction might not be possible within the watermark extraction process due to the coefficient values changed by watermark embedding as well as potential attacks. Thus, the saliency map is quantized using thresholds leading to regions of similar visual attentiveness. The employment of a threshold reduces saliency map reconstruction errors, which may occur as a result of any watermark embedding distortion, as justified further in Sec. 3.4.
The thresholding strategy relies upon a histogram analysis approach. Histogram analysis depicts automatic segmentation of the saliency map into two independent levels by employing the saliency threshold, , where represents the saliency values in the saliency map, . In order to segment highly conspicuous locations within a scene, first, the cumulative frequency function, , of the ordered saliency values, , (from 0 to the maximum saliency value, ) is considered. Then is chosen asFig. 7(c).
Saliency-based thresholding determines each coefficient's eligibility for low- or high-strength watermarking. To ensure VA-based embedding, the watermark weighting parameter strength, α, in Eqs. (11) and (15), is made variable, dependent upon the saliency threshold; the two strength values are calculated as described in Sec. 3.3. As shown in Fig. 7(d), the most and the least salient regions are given watermark weighting parameters of α_min and α_max, respectively. An example of the final VA-based alpha watermarking strength map is shown in Fig. 7(e), where a brighter intensity represents an increase in α.
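The two-level segmentation and the resulting strength map can be sketched as below. The threshold is taken here as the saliency value at which the cumulative frequency reaches a chosen fraction; that fraction, the function names, and the `alpha_min` and `alpha_max` parameter names are assumptions for illustration, since the selection rule itself is not reproduced here.

```python
import math


def saliency_threshold(saliency_map, fraction):
    """Smallest saliency value whose cumulative frequency reaches `fraction`."""
    values = sorted(v for row in saliency_map for v in row)
    index = math.ceil(fraction * len(values)) - 1
    index = min(len(values) - 1, max(0, index))
    return values[index]


def alpha_map(saliency_map, threshold, alpha_min, alpha_max):
    """Assign the low strength to salient pixels (above the threshold)
    and the high strength to everything else."""
    return [[alpha_min if v > threshold else alpha_max for v in row]
            for row in saliency_map]
```

Because only two levels are kept, small changes in saliency values caused by embedding or mild attacks rarely move a pixel across the threshold, which is what makes the map reproducible at the blind extractor.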
Watermark Embedding Strength Calculation
The watermark weighting parameter strengths, and , can be calculated from the visible artifact peak signal-to-noise ratio (PSNR) limitations within the image. Visual distortion becomes gradually noticeable as the overall PSNR drops below ,64 so minimum and maximum PSNR requirements are set to approximate 35 and 40 dB, respectively, for both the blind and nonblind watermarking schemes. These PSNR limits ensure a maximum amount of data can be embedded into any host image to enhance watermark robustness without substantially distorting the media quality. Therefore, it is sensible to incorporate PSNR in determining the watermark strength parameter .
Recall that PSNR, which measures the error between two images with dimensions , is expressed on the pixel domain as follows:48 Therefore, Eq. (20) can be redefined on the transform domain for nonblind magnitude-based multiplicative watermarking, shown in Eq. (11), as follows:
Similarly, for the blind watermarking scheme described in Sec. 3.1.2, PSNR in the transform domain can be estimated by substituting the median and modified median coefficients, and , respectively, in Eq. (20). Then subsequent rearranging results in an expression for the total error in median values, in terms of the desired PSNR as follows:
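For the nonblind case, the link between a target PSNR and the strength parameter can be sketched numerically. The sketch assumes an orthonormal wavelet transform, so that the squared error is the same in the pixel and transform domains, watermark values of ±1, and an embedding error of α times the coefficient magnitude; under those assumptions the total squared error equals α² times the energy of the selected coefficients. The function names are hypothetical.

```python
import math


def alpha_for_psnr(selected_coeffs, num_pixels, target_psnr_db, peak=255.0):
    """Strength alpha that yields `target_psnr_db` under the stated assumptions."""
    energy = sum(c * c for c in selected_coeffs)
    allowed_sse = num_pixels * peak ** 2 / 10 ** (target_psnr_db / 10.0)
    return math.sqrt(allowed_sse / energy)


def psnr_from_alpha(selected_coeffs, num_pixels, alpha, peak=255.0):
    """PSNR implied by embedding with strength alpha (same assumptions)."""
    mse = alpha ** 2 * sum(c * c for c in selected_coeffs) / num_pixels
    return 10.0 * math.log10(peak ** 2 / mse)
```

Evaluating `alpha_for_psnr` at 35 and 40 dB gives a higher and a lower strength, respectively, since a lower PSNR target tolerates a larger embedding error.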
Saliency Map Reconstruction
For nonblind watermarking, the host data are available during watermark extraction, so an identical saliency map can be generated. However, a blind watermarking scheme requires the saliency map to be reconstructed from the watermarked media, whose pixel values may differ slightly from those of the original host media. Thresholding the saliency map into two levels, as described in Sec. 3.2, ensures high accuracy in the saliency model reconstruction for blind watermarking. Experimental objective analysis reveals that with thresholding up to 99.4% of the saliency coefficients match, compared to only approximately 55.6% when thresholding is not used; hence reconstruction errors are greatly reduced.
Experimental Results and Discussion
The performance of the proposed video VA method and its application in robust video watermarking is presented and discussed in this section. The video VAM is evaluated in terms of the accuracy with respect to the ground truth and computational time in Sec. 4.1. The video VA-based watermarking is evaluated in terms of embedding distortion and robustness to compression in Sec. 4.2.
Visual Attention Model Evaluation
For attention model evaluation, the video dataset is taken from the literature65 and comprises 15 video sequences containing over 2000 frames in total. Ground truth video sequences have been generated from the database by subjective testing. A thumbnail from each of the 15 test sequences is shown in Fig. 8. Common test set parameters for the VAM and, later, for watermarking, used throughout all experiments, include the orthogonal Daubechies length 4 (D4) wavelet for three levels of 2-D spatial decomposition and one level of motion compensated temporal Haar decomposition.
Experimental results demonstrate the model performance against the existing state-of-the-art methodologies. The proposed algorithm is compared with the Itti,15 dynamic,66 and Fang19 video VAMs, in terms of accurate salient region detection and computational efficiency. The Itti framework is seen as the foundation and benchmark used for VA model comparison, whereas the dynamic algorithm is dependent upon locating energy peaks within incremental length coding. A more recent Fang algorithm uses a spatiotemporally adaptive entropy-based uncertainty weighting approach.
Figure 9 shows the performance of the proposed model and compares it against the Itti, dynamic, and Fang algorithms. The Itti motion model saliency maps are depicted in column 2, the dynamic model saliency maps in column 3 and the Fang model in column 4. Results obtained using the proposed model are shown in column 5 where from top to bottom, the locally moving snowboarder, flower, and bird are clearly identified as salient objects. Corresponding ground truth frames are shown in column 6, which depict all salient local object movement. Results from our model are subjected to the presence of significant object motion, which dominates the saliency maps. This is in contrast to the other models where differences between local and global movements are not fully accounted for, therefore, those maps are dominated by spatially attentive features, leading to salient object misclassification. For example, the trees within the background of the snowboard sequence are estimated as an attentive region when a man is performing acrobatics within the frame foreground.
The receiver operating characteristic (ROC) curves and corresponding area under curve (AUC) values, shown in Fig. 10 and the top row in Table 1, respectively, display an objective model evaluation. The results show the proposed method is close to the recent Fang model and exceeds the performance of the Itti motion and dynamic models having 3.5% and 8.2% higher ROC-AUCs, respectively. Further results demonstrating our video VA estimation model across four video sequences are shown in Fig. 11. Video saliency becomes more evident when viewed as a sequence rather than from still frames. The video sequences with corresponding saliency maps are available for viewing in Ref. 67.
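The ROC-AUC used in this comparison can be computed from the standard pairwise definition: the probability that a randomly chosen salient (ground truth) pixel receives a higher saliency value than a randomly chosen nonsalient pixel, with ties counted as one half. The sketch below assumes a binary ground truth mask and flattened pixel lists; how the ground truth was binarized and whether scores were pooled per frame or per sequence are not specified here, and the function name is hypothetical.

```python
def roc_auc(saliency_values, ground_truth):
    """Area under the ROC curve via pairwise comparison.

    `saliency_values` and `ground_truth` are equal-length flat lists;
    a truthy ground-truth entry marks a salient pixel. The pairwise loop
    is quadratic, so this is intended for small illustrative inputs.
    """
    positives = [s for s, g in zip(saliency_values, ground_truth) if g]
    negatives = [s for s, g in zip(saliency_values, ground_truth) if not g]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

A value of 1.0 means every salient pixel outranks every nonsalient one, while 0.5 corresponds to a map that carries no information about the ground truth.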
Table 1 AUC and computational times comparing state-of-the-art video domain VAMs.
| |Itti|Dynamic|Fang|Proposed|
|Average frame computational time (s)|0.244|0.194|31.54|0.172|
The bottom row in Table 1 shows the complexity of each algorithm in terms of average frame computational time. The values in the table are calculated from the mean computational time over every frame within the video database and give the time required to form a saliency map from the original raw frame. All calculations include any transformations required. From the table, the proposed low-complexity methodology produces a video saliency map in around 70%, 88%, and 0.5% of the time required by the Itti, dynamic, and Fang models, respectively (0.172 s against 0.244, 0.194, and 31.54 s). Additionally, the proposed model uses the same wavelet decomposition scheme used for watermarking. Therefore, the overall complexity of visual saliency-based watermarking is low compared to all three methods compared in this paper.
Visual Attention-Based Video Watermarking
The proposed VA-based watermarking is agnostic to the watermark embedding methodology. Thus, it can be used on any existing watermarking algorithm. In our experiments, we use the nonblind embedding proposed by Xia et al.51 and the blind algorithm proposed by Xie and Arce52 as our reference algorithms.
A series of experimental results are generated for our video watermarking case study as described in Sec. 3, analyzing both watermark robustness and imperceptibility. Objective and subjective quality evaluation tools are employed to provide a comprehensive embedding distortion measure. Robustness is evaluated against H.264/AVC compression,68 as common video attacks comprise platform reformatting and video compression. Since the VA-based watermarking scheme is presented here as a case study exploiting the proposed VAM, our performance evaluation focuses on the embedding distortion and the robustness with respect to compression attacks. Compression attacks are given focus because watermarking algorithms often employ a higher watermarking strength to counter compression and requantization attacks. The watermarking evaluation results are reported using the four example video sequences (shown in Fig. 11) from the same data set used for VAM evaluation in the previous section.
Two watermark strengths, one high and one low, approximating a PSNR of 35 and 40 dB, respectively, are used by applying Eqs. (22) and (23). Four scenarios of varied watermark embedding strengths are considered for the VA-based video watermarking evaluation as follows:
1. a uniform low strength used throughout the entire video sequence;
2. the proposed VAM-based strength;
3. a uniform average watermark strength; and
4. a uniform high strength used throughout the entire video sequence.
The experimental results are shown in the following two sections: embedding distortion (visual quality) and robustness.
The embedding distortion can be evaluated using objective metrics or subjective metrics. While objective quality measurements are mathematical models that are expected to approximate results from subjective assessments and are easy to compute, subjective measurements capture a viewer’s overall opinion of the visual quality of experience. These metrics are often complementary, and both are particularly important in this paper for measuring the effect of the proposed watermarking algorithms on imperceptibility.
1. “Objective metrics” define a precise value, dependent upon mathematical modeling, to determine visual quality. Such metrics include PSNR, structural similarity index measure (SSIM),69 just noticeable difference,70 and video quality metric (VQM).71 PSNR, which is computed from the average error between two images, is one of the most commonly used visual quality metrics and is described in Eq. (20). Unlike PSNR, SSIM bases its quality assessment on the degradation of structural information, assuming that the HVS is highly adapted for extracting structural information from a scene. Its numeric output lies between 0 and 1, and higher video quality is represented by values closer to 1. VQM evaluates video quality based upon subjective human perception modeling. It incorporates numerous aspects of early visual processing, including both luma and chroma channels, a combination of temporal and spatial filtering, light adaptation, spatial frequency, global contrast, and probability summation. Its numeric output also lies between 0 and 1, but higher video quality is represented by values closer to 0. VQM is a commonly used video quality assessment metric as it eliminates the need for participants to provide a subjective evaluation.
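As a rough illustration of the PSNR metric mentioned above, the following sketch computes PSNR from the mean squared error between two 8-bit frames. It uses the standard textbook definition, which is assumed here (not confirmed) to correspond to Eq. (20). The function name `psnr` and the flat-list frame representation are illustrative choices only.

```python
import math


def psnr(original, distorted, peak=255.0):
    """PSNR in dB between two equally sized frames given as flat pixel lists.

    Assumes the standard definition 10 * log10(peak^2 / MSE), where `peak`
    is the maximum possible pixel value (255 for 8-bit frames).
    """
    if len(original) != len(distorted) or not original:
        raise ValueError("frames must be non-empty and of equal size")
    # Mean squared error over all pixel positions.
    mse = sum((o - d) ** 2 for o, d in zip(original, distorted)) / len(original)
    if mse == 0:
        return math.inf  # identical frames: no distortion
    return 10.0 * math.log10(peak ** 2 / mse)
```

A higher value indicates a smaller average error, which is why a lower watermark strength yields a higher PSNR.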
Although the subjective evaluation is considered the most suitable evaluation for the proposed method in this paper, the visual quality evaluation in terms of the PSNR, SSIM, and VQM metrics is shown in Tables 2 and 3 for the nonblind and blind watermarking schemes, respectively. In both PSNR and SSIM, higher values signify better visual quality. In terms of both SSIM and PSNR, the four watermarking scenarios rank from highest to lowest visual quality as follows: low strength embedding, the proposed VA-based algorithm, average strength embedding, and high strength embedding. From the tables, PSNR improvements are achieved when comparing the proposed VA-based approach with the constant high strength scenario. The SSIM measures remain consistent for each scenario, with a decrease of 2% for the high-strength watermarking model in most cases. In terms of the VQM metric, which mimics subjective evaluation, the proposed VA-based watermarking consistently performs better than the average or high-strength watermarking scenarios.
Table 2. PSNR, SSIM, and VQM average of four video sequences for nonblind watermarking.
| Low strength | Proposed | Average strength | High strength |
Table 3. PSNR, SSIM, and VQM average of four video sequences for blind watermarking.
| Low strength | Proposed | Average strength | High strength |
Objective metrics, such as PSNR, SSIM, and VQM, do not necessarily correspond to the perceived visual quality. Two distorted frames with comparable PSNR, SSIM, or VQM values do not necessarily have the same perceived quality, and two independent viewers can disagree about which of two similarly distorted frames has the higher visual quality. To provide a realistic visual quality evaluation, subjective testing is used to analyze the impact of the proposed watermarking scheme on the overall perceived human viewing experience.
2. “Subjective evaluation” measures the visual quality by recording the opinion of human subjects on the perceived visual quality. The watermarked videos were viewed by 30 subjects, following the standard ITU-T35 viewing test specifications, often used in compression quality evaluation experiments. The final rating was arrived at by averaging all ratings given by the subjects. This work employs two subjective evaluation metrics that are computed based on the subjective viewing scores, as follows:
“Double stimulus continuous quality test” (DSCQT) subjectively evaluates any media distortion by using a continuous scale. The original and watermarked media are shown to the viewer in a randomized order. The viewer must rate the quality of the original and the watermarked images individually on a continuous scale, as shown in Fig. 12(a). The degradation category rating (DCR) value is then calculated as the absolute difference between the subjective ratings for the two test images.
Double stimulus impairment scale test (DSIST) determines the perceived visual degradation between two media sources, A and B, by implementing a discrete scale. A viewer must compare the quality of B with respect to A, on a 5-point discrete absolute category rating (ACR) scale, as shown in Fig. 12(b).
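The two scores described above can be sketched as follows. This is a minimal illustration assuming that the DCR is computed per viewer as the absolute difference between the two ratings and then averaged over viewers, and that the ACR scores are averaged into a mean opinion score. The function names and the order of averaging are assumptions, not details taken from the test specification.

```python
from statistics import mean


def mean_dcr(original_ratings, watermarked_ratings):
    """Average DCR over viewers for the DSCQT.

    Each list holds one continuous-scale rating per viewer; the DCR of a
    viewer is |rating(original) - rating(watermarked)|.
    """
    if len(original_ratings) != len(watermarked_ratings) or not original_ratings:
        raise ValueError("rating lists must be non-empty and of equal length")
    return mean(abs(o - w) for o, w in zip(original_ratings, watermarked_ratings))


def mean_opinion_score(acr_scores):
    """Average of the 5-point discrete ACR scores given in the DSIST."""
    return mean(acr_scores)
```

Under these assumptions, a small mean DCR means that viewers rated the watermarked media almost the same as the original.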
In a subjective evaluation session, training images are first shown to acclimatize viewers to both the ACR and DCR scoring systems. A lower DCR value and a higher ACR value each represent a greater perceived visual quality. Figure 13 shows an overall timing diagram for each subjective testing procedure, showing the sequence of test image displays for scoring by the viewers. Note that the video display time, , and blank screen time, , before the change of video, should satisfy the following condition: .
The subjective evaluation performed in this work comprises DSCQT and DSIST, and the results, averaged over four video test sequences, are shown in Fig. 14 for both nonblind and blind watermarking schemes. The top and bottom rows in Fig. 14 show subjective results for the nonblind and blind watermarking cases, respectively, whereas the left and right columns show the results using the DSCQT and DSIST evaluation tools. The results are consistent across the blind and nonblind scenarios. For the DSCQT, the lower the DCR, the better the visual quality, i.e., fewer embedding distortions. In the given results, when comparing the proposed and low strength embedding methodologies, the DCR value only deviates by approximately one unit in the rating scale, suggesting a subjectively similar visual quality. The high-strength watermarking scheme shows a high DCR value, indicating significantly higher degradation of subjective visual quality compared with the VAM-based methodology. Similar outcomes are evident from the DSIST plots, where a higher mean opinion score (MOS) on the ACR scale corresponds to better visual quality, i.e., fewer visible embedding distortions. DSIST plots for the low-strength and VAM-based schemes show a similar ACR MOS, approximately in the range 3 to 4, whereas the high strength watermark yields an ACR of less than 1 for nonblind and nearly 2 for blind watermarking. Compared with an average watermark strength, the proposed watermarking scheme shows an improved subjective image quality in all four graphs by around 0.5 to 1 units. Because the constant average strength scheme embeds more data within the visually salient regions, the subjective visual quality of its watermarked images is worse than that of the proposed methodology.
For visual inspection, an example of watermark embedding distortion is shown in Fig. 15. The original, the low strength watermarked, the VAM-based watermarked, and the high strength watermarked images are shown in four consecutive columns. The distortions around the legs of the player in the blue jersey (row 1) and around the tennis player (row 2) are distinctly visible in high-strength watermarking.
For both the blind and nonblind watermarking cases, in both the objective and subjective visual quality evaluations, the low strength watermark and VAM-based watermarking sequences yield similar visual quality, whereas the high strength embedded sequence appears far more distorted. Low-strength watermarking provides high imperceptibility but is fragile, as discussed in Sec. 4.2.2.
Video reformatting and compression are frequent and typically unintentional attacks, hence watermark tolerance to H.264/AVC compression is evaluated. Robustness against H.264/AVC compression for the nonblind and blind video watermarking schemes is shown in Figs. 16(a) and 16(b), respectively. To simulate the watermark robustness, five constant quantization parameter (QP) values are used to compress the high strength, average strength, VA-based, and low strength test sequences. In both scenarios, as shown in the plots, the proposed VA-based methodology shows an increase in robustness compared with its low strength watermark counterpart, where a lower Hamming distance indicates better robustness. From the plots in Fig. 16, Hamming distance reductions of up to 39% for the nonblind case and 22% for the blind case are possible when comparing the low strength and VA-based models. Naturally, the high-strength watermarking scheme achieves a low Hamming distance but is highly perceptible (low visual quality), as described previously. The proposed watermarking scheme has slightly increased robustness toward H.264/AVC compression, as shown in Fig. 16, when compared against a constant average strength watermark. It is worth noting that, for a constant QP value, the overall compression ratio decreases as the watermark strength increases, due to the extra watermark capacity.
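The robustness measure used above can be illustrated with a short sketch. It assumes the Hamming distance is normalized by the watermark length, so that 0 means perfect recovery; the exact normalization used for Fig. 16 is not restated here, and the function name is hypothetical.

```python
def hamming_distance(embedded_bits, extracted_bits):
    """Fraction of watermark bits that differ after an attack.

    Both arguments are equal-length sequences of 0/1 values: the watermark
    that was embedded and the one extracted from the compressed video.
    """
    if len(embedded_bits) != len(extracted_bits) or not embedded_bits:
        raise ValueError("bit sequences must be non-empty and of equal length")
    # Count positions where the extracted bit disagrees with the embedded bit.
    mismatches = sum(a != b for a, b in zip(embedded_bits, extracted_bits))
    return mismatches / len(embedded_bits)
```

A lower value means that more watermark bits survived compression, which matches the reading of the plots in which a lower Hamming distance indicates better robustness.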
The proposed VA-based method results in a robustness close to that of the high-strength watermarking scheme, while showing low distortions, as in the low-strength watermarking approach. The increase in robustness, coupled with high imperceptibility verified by subjective and objective metrics, makes the VA-based methodology highly suitable for providing an efficient watermarking scheme.
In this paper, we have presented a video watermarking algorithm using a motion compensated VAM. The proposed method exploits both spatial and temporal cues for saliency, modeled in a motion-compensated spatiotemporal wavelet multiresolution analysis framework. The spatial cues were modeled using the 2-D wavelet coefficients. The temporal cues were modeled using the temporal wavelet coefficients by considering the global and local motion in the video. We have used the proposed VA model in visual-attention-based video watermarking to achieve robust video watermarking that has minimal or no effect on the visual quality. In the proposed scheme, a two-level watermarking weighting parameter map is generated from the VAM saliency maps using the proposed saliency model, and data are embedded into the host image according to the visual attentiveness of each region. By avoiding higher strength watermarking in the visually attentive region, the resulting watermarked video achieved high perceived visual quality while preserving high robustness.
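The two-level weighting idea summarized above can be sketched as follows. This is only an illustration under stated assumptions: the saliency map is taken to be a 2-D list of values, a single hypothetical `threshold` separates attentive from non-attentive regions, and `alpha_low` and `alpha_high` are placeholder names for the two embedding strengths. How the actual scheme selects the threshold and the strengths is not reproduced here.

```python
def strength_map(saliency, threshold, alpha_low, alpha_high):
    """Build a two-level watermark strength map from a 2-D saliency map.

    Salient positions (saliency >= threshold) receive the low strength so
    that distortion stays out of visually attentive regions; all other
    positions receive the high strength for robustness.
    """
    return [
        [alpha_low if s >= threshold else alpha_high for s in row]
        for row in saliency
    ]
```

For example, `strength_map([[0.9, 0.1]], 0.5, 0.01, 0.05)` assigns the low strength to the first position and the high strength to the second.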
The proposed VAM outperforms the state-of-the-art video VA methods in terms of both saliency detection performance and computational complexity. The saliency maps from the proposed method are dominated by the presence of significant object motion. This is in contrast to the other models, where differences between local and global movements are not fully accounted for; those maps are therefore dominated by spatially attentive features, leading to salient object misclassification. The watermarking performance was verified using subjective evaluation methods as well as the objective metric VQM. For the same embedding distortion, the proposed VA-based watermarking achieved up to 39% (nonblind) and 22% (blind) improvement in robustness against H.264/AVC compression attacks, compared with the existing methodology that does not use the VAM. Finally, the proposed VA-based video watermarking resulted in visual quality similar to that of low-strength watermarking and robustness similar to that of high-strength watermarking.
We acknowledge the support of the UK Engineering and Physical Sciences Research Council (EPSRC), through a Dorothy Hodgkin Postgraduate Award and a Doctoral Training Award at the University of Sheffield.
Matthew Oakes graduated with an MEng degree in electronic and electrical engineering from the University of Sheffield in 2009. He received his PhD in electronic and electrical engineering also at the University of Sheffield in 2014. He is currently working at the University of Buckingham as a knowledge transfer partnership associate. His main expertise lies in image/video processing, compression, digital watermarking, and visual saliency estimation. His current research includes biometric security systems and machine learning.
Deepayan Bhowmik received his PhD in electronic and electrical engineering from the University of Sheffield, UK, in 2011. Previously, he worked as a research associate at Heriot-Watt University, Edinburgh, UK and the University of Sheffield, UK, and a system engineer in ABB Ltd., India. He is currently working as a lecturer in Sheffield Hallam University, Sheffield, UK. His current research interests include computer vision, machine learning, embedded imaging hardware on FPGA, and multimedia security.
Charith Abhayaratne received his BE degree in electrical and electronic engineering from the University of Adelaide, Australia, in 1998, and his PhD in the same from the University of Bath, UK, in 2002. He is currently a lecturer in the Department of Electronic and Electrical Engineering at the University of Sheffield, UK. His research interests include video and image compression, watermarking, image and video analysis, multidimensional signal processing, graph spectral analysis, and computer vision.