## 1. Introduction

Video matching techniques are motivated by various applications, including video retrieval, surveillance, and watermarking. In video retrieval,^{1, 2} frame matching is used to retrieve video clips similar to a query clip from a database. In video surveillance using multiple cameras,^{3} it is necessary to establish correspondences in time and space to combine asynchronous multiview sequences. In video watermarking,^{4, 5, 6} temporal synchronization should be established between the original and watermarked videos before the extraction of frame-dependent watermarks.

Video matching techniques have different accuracy requirements, depending on their target applications. In video retrieval,^{1, 2} video matching need not be accurate at the frame level. It is acceptable that a retrieved frame does not match a query frame exactly, as long as they contain similar contents. In contrast, in video watermarking, more accurate matching is required, and two video sequences should be synchronized at the frame level, since the matching is a preprocessing step before the extraction of frame-dependent watermarks.^{8} For instance, in Ref. 6, the pattern of skipped frames is employed as a watermark, and the accurate frame-level synchronization between an original video and its watermarked version is an essential step in watermark detection.

In watermarking applications, Delannay, Roover, and Macq^{4} used an affine model to align video sequences. An affine model, however, cannot effectively describe irregular frame insertion and removal as well as frame shuffling, which are employed to attack watermarked video sequences. Cheng^{5} proposed a temporal registration algorithm to correct temporal misalignment, which occurs in frame rate conversion or video capturing. The two-frame integration model was employed to represent temporally overlapping frames in the acquisition procedures.

In video watermarking, the original video can be modified by various temporal attacks, such as frame removal, insertion, and shuffling. To establish frame-level synchronization between the original and modified sequences in video watermarking applications, we propose a dynamic programming algorithm that matches each frame in the modified sequence to the corresponding frame in the original sequence. Although Ref. 5 also employed dynamic programming to match two sequences, it used only frame dissimilarity in the matching. In contrast, we introduce the notion of an adaptive unmatched cost, in addition to the matching cost, to achieve more accurate matching. Moreover, the unmatched cost is used to handle frame shuffling attacks as well as frame removal and insertion attacks.

## 2. Matching Function and Matching Cost

Suppose that a video clip $\mathbf{X}=\{X_1,\dots,X_{l_X}\}$, where $X_j$ denotes the $j$th frame, is modified from the original clip $\mathbf{Y}=\{Y_1,\dots,Y_{l_Y}\}$. A frame $X_j$ may differ from every frame $Y_k$ in the original clip, since the clip $\mathbf{X}$ may be a compressed version of $\mathbf{Y}$. Moreover, some frames can be removed from or inserted into $\mathbf{X}$, and the resulting frames can then be shuffled. We assume that the number of removed frames is at most a threshold $\nu_R$. Similarly, let $\nu_I$ and $\nu_S$ denote the maximum numbers of inserted and shuffled frames, respectively.

We say that a frame $X_j$ in $\mathbf{X}$ matches a frame $Y_k$ in $\mathbf{Y}$ if $X_j$ originates from $Y_k$. On the other hand, $X_j$ is called an unmatched frame if it does not match any frame in $\mathbf{Y}$. The temporal modification of a video can then be represented by a function on the space of frame indices. Specifically, we define the matching function $\lambda$ as

## Eq. 1

$$\lambda(j)=\left\{\begin{array}{ll}k, & \text{if } X_j \text{ matches } Y_k,\\ 0, & \text{if } X_j \text{ is an unmatched frame}.\end{array}\right.$$

## Table 1

A matching function λ(j) that describes the temporal alignment between two sequences.

| $j$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| $\lambda(j)$ | 1 | 3 | 4 | 5 | 0 | 7 | 9 | 8 | 10 |

Let ${d}_{m}(j,k)$ denote the matching cost between two frames ${X}_{j}$ and ${Y}_{k}$ , which measures the dissimilarity between ${X}_{j}$ and ${Y}_{k}$ . The mean squared error (MSE) is employed as the matching cost.
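As an illustration, the MSE matching cost between two frames can be computed as follows. This is a generic sketch rather than code from the paper; frames are modeled as NumPy arrays, and the name `matching_cost` is ours.

```python
import numpy as np

def matching_cost(frame_x, frame_y):
    """MSE between two equal-size frames, i.e., the matching cost d_m."""
    diff = frame_x.astype(np.float64) - frame_y.astype(np.float64)
    return float(np.mean(diff ** 2))

# Two 4x4 toy "frames": identical frames cost 0; a uniform offset of 2 costs 4.
a = np.full((4, 4), 10.0)
b = a + 2.0
```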

## 3. Proposed Matching Algorithm

Given a video clip $\mathbf{X}$ and its original $\mathbf{Y}$, we attempt to find a matching function $\lambda(j)$ such that $Y_{\lambda(j)}$ in $\mathbf{Y}$ is identical or similar to $X_j$ in $\mathbf{X}$ for each $1 \leqslant j \leqslant l_X$. The simplest matching can be achieved by the local minimization of matching costs (LMMC) via

## Eq. 2

$$\lambda_{\mathrm{LMMC}}(j)=\underset{1\leqslant k\leqslant l_Y}{\arg\min}\; d_m(j,k) \qquad \text{for } 1\leqslant j\leqslant l_X.$$

We propose a globally optimal matching algorithm that minimizes the total matching cost subject to the one-to-one matching constraint. First, we assume that the original video clip is temporally modified by frame removal and insertion attacks only, and find the best matching function $\lambda^{*}$ under this monotonicity assumption. Later, in the second phase, we detect frame shuffling attacks by further matching the unmatched frames of $\lambda^{*}$ without imposing the monotonicity constraint.
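For reference, the per-frame LMMC matching of Eq. 2 can be sketched as follows, assuming frames are NumPy arrays and the MSE is used as $d_m$ (indices here are 0-based, whereas the text is 1-based):

```python
import numpy as np

def mse(x, y):
    return float(np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2))

def lmmc(X, Y):
    """Eq. 2: match each frame X_j to its individually best frame in Y."""
    return [min(range(len(Y)), key=lambda k: mse(x, Y[k])) for x in X]

# Toy clips of constant 2x2 frames: X is a noisy copy of frames 1 and 2 of Y.
Y = [np.full((2, 2), v) for v in (0.0, 10.0, 20.0)]
X = [np.full((2, 2), 9.5), np.full((2, 2), 20.5)]
```

Because each frame is matched independently, LMMC cannot exploit the temporal order of the frames, which is why a global formulation is preferable.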

Assuming frame removal and insertion attacks only, the matching function can be estimated by minimizing the sum of the costs of matched frames and unmatched frames, given by

## Eq. 3

$$\lambda^{*}=\underset{\lambda\in\Lambda}{\arg\min}\left\{\sum_{j:\lambda(j)>0} d_m[j,\lambda(j)]+\sum_{l:\lambda(l)=0} d_u(l)\right\},$$

where $\Lambda$ denotes the set of admissible matching functions, and the unmatched cost $d_u(l)$ is defined as

## Eq. 4

$$d_u(l)=\underset{j\in[l-\delta,\,l+\delta]:\,\lambda(j)>0}{\max}\;\;\underset{1\leqslant k\leqslant l_Y}{\min}\; d_m(j,k).$$

In Eq. 3, when $i$ frames, $\max\{l_X-l_Y,0\}\leqslant i\leqslant\nu_I$, are inserted into $\mathbf{X}$, there are $\binom{l_X}{i}$ possible choices of unmatched frames and $\binom{l_Y}{l_X-i}$ possible choices of matched frames. Therefore, the complexity of the exhaustive minimization in Eq. 3 is proportional to $\sum_{i=\max\{0,\,l_X-l_Y\}}^{\nu_I}\binom{l_X}{i}\binom{l_Y}{l_X-i}$, which is too demanding for most applications.
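The growth of this sum can be checked numerically; the clip lengths below are illustrative, and the function name is ours:

```python
from math import comb

def exhaustive_candidates(l_x, l_y, nu_i):
    """Number of matchings examined by brute-force minimization of Eq. 3."""
    lo = max(0, l_x - l_y)
    return sum(comb(l_x, i) * comb(l_y, l_x - i) for i in range(lo, nu_i + 1))

# For two 100-frame clips and at most 3 insertions, a brute-force search
# would have to examine more than 2.6e10 candidate matchings.
n = exhaustive_candidates(100, 100, 3)
```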

To reduce the complexity, we minimize the total cost function in Eq. 3 based on dynamic programming. The dynamic programming method in this work is similar to that for computing the Levenshtein distance, also called the edit distance, between two text strings.^{10, 11} The edit distance counts the minimum number of letter substitutions, insertions, and removals needed to convert one text string into another. Similarly, the proposed algorithm matches a video sequence to another sequence by considering frame insertions and removals. However, whereas the distance between two letters is binary (identical or not), the distance between two frames should represent the degree of dissimilarity between them. Thus, compared with letter insertions or removals, frame insertions or removals are more difficult to identify. To overcome this difficulty, the proposed algorithm defines the matching cost as the MSE and the unmatched cost as in Eq. 4, and minimizes the total cost in Eq. 3 to achieve reliable video matching.
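For reference, the classical edit-distance recursion that motivates the proposed method can be written as follows (a standard textbook implementation, not taken from Refs. 10 and 11):

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum substitutions, insertions, removals."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1  # binary letter "distance"
            d[i][j] = min(d[i - 1][j] + 1,          # removal
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + sub)    # match / substitution
    return d[m][n]
```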

Suppose that the first $j$ frames of $\mathbf{X}$ are obtained by removing $r$ frames from the first $k$ frames of $\mathbf{Y}$ and then inserting $i$ new frames. Note that $k=j-i+r$ , since $(j-i)$ frames in $\mathbf{X}$ match $(k-r)$ frames in $\mathbf{Y}$ . Let $s(j;i,r)$ denote the minimum sum of the matching costs for the first $j$ frames in $\mathbf{X}$ when $i$ frames are inserted and $r$ frames are removed.

We compute $s(j;i,r)$ recursively. First, we initialize $s(j;i,r)=\infty $ if $j<1$ , $i<0$ , or $r<0$ with the exception $s(0;0,0)=0$ . Then, $s(j;i,r)$ can be obtained by finding the minimum among three cases

## Eq. 5

$$s(j;i,r)=\min\left\{\begin{array}{l}s(j-1;i,r)+d_m(j,k),\\ s(j-1;i-1,r)+d_u(j),\\ s(j;i,r-1),\end{array}\right.$$

where $k=j-i+r$. The first case corresponds to matching $X_j$ to $Y_k$, the second to declaring $X_j$ unmatched, and the third to removing a frame of $\mathbf{Y}$.

Next, let us consider frame shuffling attacks, in addition to frame removal and insertion attacks. In Eqs. 3 and 5, we impose the monotonicity constraint that the matching function is monotonically increasing over matched frames; hence, shuffled frames are classified as unmatched frames. Therefore, when $X_j$ is classified as unmatched, it can be further matched to a frame in $\mathbf{Y}$ that is not matched by the function in Eq. 3. More specifically, $\lambda^{*}(j)$ is refined by

## Eq. 6

$$\lambda^{*}(j)=\underset{k\in U}{\arg\min}\; d_m(j,k) \qquad \text{if } \underset{k\in U}{\min}\; d_m(j,k)<d_u(j),$$

where $U$ denotes the set of indices of the frames in $\mathbf{Y}$ that are not matched by the function in Eq. 3.

## 4. Experimental Results

The performance of the proposed algorithm is evaluated using five common intermediate format (CIF) sequences, Foreman, Paris, Mobile, Tempete, and Carphone, each of which consists of 100 frames. We also use 33,220 video clips selected from 20 Korean movies.^{12} Each clip consists of 100 frames of resolution $360\times 240$ and has a frame rate of either 24 or 30 frames per second. Thus, 3,322,500 frames are used in total.

To match a video clip with another clip, we minimize the total cost in Eq. 3. It can be shown that the dynamic programming method requires $O(l_X l_Y)$ recursion steps. We use a personal computer with an Intel Pentium D $3\text{-}\mathrm{GHz}$ processor for the simulations. With this setup, the dynamic programming method requires about $5.32\phantom{\rule{0.3em}{0ex}}\mathrm{ms}$ to match a video clip of 100 frames to another clip.
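For concreteness, the recursion in Eq. 5 can be sketched as below. This is our illustrative reconstruction, not the authors' code: frames are NumPy arrays, indices are 0-based, and, because the unmatched cost in Eq. 4 depends on the matching function being estimated, $d_u$ is approximated here from per-frame best-match costs over a window of half-width `delta`.

```python
import numpy as np

def mse(x, y):
    return float(np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2))

def match_sequences(X, Y, nu_i, nu_r, delta=2):
    """Dynamic program of Eq. 5. Returns (cost, lam), where lam[j] is the
    0-based index of the Y frame matched to X_j, or None if X_j is unmatched."""
    lx, ly = len(X), len(Y)
    dm = np.array([[mse(x, y) for y in Y] for x in X])
    best = dm.min(axis=1)  # min_k d_m(j, k) for each j
    # Approximation of the unmatched cost d_u(l) in Eq. 4.
    du = [best[max(0, l - delta):l + delta + 1].max() for l in range(lx)]

    INF = float("inf")
    s = np.full((lx + 1, nu_i + 1, nu_r + 1), INF)
    s[0, 0, 0] = 0.0
    choice = {}
    for j in range(lx + 1):
        for i in range(nu_i + 1):
            for r in range(nu_r + 1):
                if (j, i, r) == (0, 0, 0):
                    continue
                k = j - i + r  # 1-based index of the Y frame aligned with X_j
                cands = []
                if j >= 1 and 1 <= k <= ly:   # X_j matches Y_k
                    cands.append((s[j - 1, i, r] + dm[j - 1, k - 1], "match"))
                if j >= 1 and i >= 1:         # X_j is an inserted (unmatched) frame
                    cands.append((s[j - 1, i - 1, r] + du[j - 1], "insert"))
                if r >= 1:                    # a frame of Y was removed
                    cands.append((s[j, i, r - 1], "remove"))
                if cands:
                    s[j, i, r], choice[j, i, r] = min(cands)

    # The end state must align both sequence ends: lx - i + r == ly.
    cost, i, r = min((s[lx, i, r], i, r) for i in range(nu_i + 1)
                     for r in range(nu_r + 1) if lx - i + r == ly)
    lam, j = [None] * lx, lx
    while (j, i, r) != (0, 0, 0):  # backtrack the optimal decisions
        op = choice[j, i, r]
        if op == "match":
            lam[j - 1], j = j - i + r - 1, j - 1
        elif op == "insert":
            j, i = j - 1, i - 1
        else:
            r -= 1
    return cost, lam

# Toy check: X is Y with its third frame removed.
Y = [np.full((2, 2), float(v)) for v in (0, 10, 20, 30)]
X = [Y[0], Y[1], Y[3]]
cost, lam = match_sequences(X, Y, nu_i=1, nu_r=2)
```

A subsequent pass in the spirit of Eq. 6 would then try to re-match the frames with `lam[j] is None` against the frames of `Y` that remain unmatched.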

Table 2 shows the matching error probabilities. The matching error probability is defined as the probability that the estimated matching function differs from the true matching function. The sequences are compressed by the H.264/AVC standard with QP 20 and 35, which yield average peak signal-to-noise ratios (PSNRs) of about 42.0 and $30.8\phantom{\rule{0.3em}{0ex}}\mathrm{dB}$, respectively. The sequences then undergo frame removal attacks $\left(R\right)$, frame insertion attacks $\left(I\right)$, and frame shuffling attacks $\left(S\right)$. The numbers of removed and inserted frames are randomly selected from 1 to 10 (${\nu}_{R}=10$) and from 1 to 3 (${\nu}_{I}=3$), respectively. An inserted frame is constructed by averaging two adjacent frames in the original sequence. For shuffling attacks, the number of pairs of adjacent frames is randomly selected from 1 to 3 $({\nu}_{S}=3)$, and each selected pair of frames swaps places.
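The attack pipeline described above can be simulated as follows. This is a toy sketch with hypothetical parameters that mirrors the experimental protocol only in spirit; the function name and defaults are ours.

```python
import random
import numpy as np

def apply_attacks(Y, n_remove=2, n_insert=1, n_swap=1, seed=0):
    """Remove random frames, insert averaged frames, then swap adjacent pairs,
    mimicking the R, I, and S attacks of the experiments."""
    rng = random.Random(seed)
    X = list(Y)
    for _ in range(n_remove):                     # frame removal (R)
        X.pop(rng.randrange(len(X)))
    for _ in range(n_insert):                     # frame insertion (I)
        p = rng.randrange(len(X) - 1)
        X.insert(p + 1, (X[p] + X[p + 1]) / 2.0)  # average of adjacent frames
    for _ in range(n_swap):                       # frame shuffling (S)
        p = rng.randrange(len(X) - 1)
        X[p], X[p + 1] = X[p + 1], X[p]
    return X

Y = [np.full((2, 2), float(v)) for v in range(10)]
X = apply_attacks(Y)  # 10 - 2 + 1 = 9 frames
```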

## Table 2

Comparison of matching error probabilities when a video is modified from its original through frame removal (R), insertion (I), shuffling (S), and data compression attacks.

| Algorithm | QP (H.264/AVC) | R (%) | I (%) | R+I (%) | R+I+S (%) |
|---|---|---|---|---|---|
| Cheng^{5} | 20 | 0.00 | 0.46 | 0.66 | N/A |
| Cheng^{5} | 35 | 0.03 | 1.75 | 2.12 | N/A |
| LMMC | 20 | 0.27 | 0.51 | 0.59 | 0.59 |
| LMMC | 35 | 1.71 | 3.43 | 4.11 | 4.12 |
| Proposed | 20 | 0.00 | 0.00 | 0.00 | 0.00 |
| Proposed | 35 | 0.00 | 0.09 | 0.10 | 0.17 |

We compare the performance of the proposed algorithm with those of Cheng’s algorithm^{5} and the LMMC approach. In Ref. 5, frame shuffling attacks are not considered, and the matching fails under those attacks. Since the weights in the two-frame integration model are overparameterized for insertion attacks, matching errors occur even when QP is as low as 20. The LMMC approach is also vulnerable to insertion attacks, since it simply searches for the best matching frame for each individual frame. In contrast, the proposed algorithm provides much better performance by globally minimizing the total matching cost. For example, when QP is 35 and frame removal and insertion attacks are combined $(R+I)$, the matching error probability of the proposed algorithm $(0.10\%)$ is about 41 times lower than that of LMMC $(4.11\%)$ and about 21 times lower than that of Cheng’s algorithm $(2.12\%)$. As QP increases, the quality of the modified videos degrades and the matching error probabilities rise. Nevertheless, the proposed algorithm provides significantly better performance than the conventional algorithms, even under these severe conditions.

## 5. Conclusion

We propose a temporal alignment algorithm for two video sequences, where one sequence is modified from the other through frame removal, insertion, and shuffling attacks as well as data compression attacks. We define a cost function for global matching errors and develop a dynamic programming algorithm to minimize the cost efficiently. Experimental results show that the proposed algorithm provides a significantly lower matching error probability than the conventional algorithms.

## Acknowledgment

This work was supported partly by the Ministry of Knowledge Economy, Korea, under the Information Technology Research Center support program supervised by the Institute of Information Technology Advancement (grant number IITA-2008-C1090-0801-0017) and partly by the Korea Science and Engineering Foundation (KOSEF) grant funded by the Korea government (MEST) (number R01-2008-000-20292-0).