1 October 2010 Efficient temporal alignment of video sequences using unbiased bidirectional dynamic time warping
Author Affiliations +
J. of Electronic Imaging, 19(4), 040501 (2010). doi:10.1117/1.3488415
We propose an efficient technique for temporally aligning video sequences of similar activities. The proposed technique is able to synchronize view-variance videos from different scenes performing similar 3-D activities. Unlike existing techniques that just consider unidirectional alignment, the proposed technique considers symmetric temporal alignment and computes the optimal alignment by eliminating any view-based bias. The advantages of our technique are validated by experiments conducted on synthetic and real video data. The experimental results show that the proposed technique out-performs existing techniques in several test video sequences.
Lu and Mandal: Efficient temporal alignment of video sequences using unbiased bidirectional dynamic time warping



Temporal alignment of video sequences is important in applications such as superresolution imaging,1 robust multiview surveillance,2 and mosaicking. In some applications, it is required to align video sequences from two similar scenes, where analogous motions have different trajectories through the video sequence. Figure 1 illustrates two similar motions occurring in related 3-D planar scenes with respect to time. Camera 1 views 3-D scene [TeX:] $X(X_1 ,Y_1 ,Z_1 ,t_1 )$X(X1,Y1,Z1,t1) in view 1(ν1) and acquires video [TeX:] $I_1 (x_1 ,y_1 ,t_1 )$I1(x1,y1,t1). Camera 2 views another 3-D scene[TeX:] $X(X_2 ,Y_2 ,Z_{2,} \,t_1 )$X(X2,Y2,Z2,t1)in view 2 (ν2) and acquires video [TeX:] $I_2 (x_2 ,y_2 ,t_2 )$I2(x2,y2,t2). Note that the motions in these two scenes are similar but have dynamic time shift. The homography matrix H is typically used to represent the spatial relationship between these two views.

Fig. 1

Illustration of two distinct scenes acquired using two distinct cameras.


A typical schematic for temporal alignment is shown in Fig. 2. Note that for the sake of correlating two videos and representing the motions, features are extracted and tracked separately in each video. Robust view-invariance tracker methods are used to generate feature trajectories [TeX:] ${\cal F}_1 (x_1 ,y_1 ,t_1 )$F1(x1,y1,t1) and [TeX:] ${\cal F}_2 (x_2 ,y_2 ,t_2 )$F2(x2,y2,t2) from video [TeX:] $I_1$I1 and [TeX:] $I_2$I2, respectively.

Fig. 2

A typical schematic of existing techniques.


Existing techniques vary on how to compute the temporal alignments. Giese and Poggio3 computed the temporal alignment of activities of different people using dynamic time warping (DTW) between the feature trajectories, but limited their technique to a fixed viewpoint. Rao 4 used a rank-constraint-based technique (RCB) in DTW to calculate the synchronization. Such techniques only consider unidirectional alignment,3, 4 i.e., they project the trajectory from one scene to the other, which designates one view as the reference for computing the temporal alignment. Such techniques introduce the bias toward the reference trajectory, i.e., due to the noise and imperfection of the obtained reference trajectory, such a technique will produce erroneous alignment. Therefore, for the sake of minimizing the bias, one should consider computing the alignment in a symmetric way. Singh 5 formulated a symmetric transfer error (STE) as a functional of regularized temporal warp. The technique determines the time warp that has the smallest STE. It then chooses one of the symmetric warps as the final temporal alignment. The STE technique provides better results than unidirectional alignment schemes. The accuracy of the temporal alignment can be improved further, since the STE technique does not really eliminate the reference-view bias between two sequences.

In this work, we propose an unbiased bidirectional dynamic time warping (UBDTW) technique that can remove biasing and provide more accurate results.


Proposed Technique

The schematic of the proposed temporal alignment technique is shown in Fig. 3. The technique consists of three steps which are explained in the following sections.

Fig. 3

The schematic of UBDTW.



Bidirectional Projections

Since feature trajectories represent the activities in the video sequences, we compute the projections of the feature trajectories [TeX:] ${\cal F}_1$F1 from scene 1 to 2 and [TeX:] ${\cal F}_2$F2 from scene 2 to 1 using Eq. 1 as follows:

[TeX:] \documentclass[12pt]{minimal}\begin{document}\begin{equation*} {\cal F}^p _2 (x^\prime _1 ,y'_1 ,t_1 ) = H_{1 \to 2} \,\cdot\,{\cal F}_1 (x_1 ,y_1 ,t_1 ), \end{equation*}\end{document}F2p(x1,y1,t1)=H12·F1(x1,y1,t1),


[TeX:] \documentclass[12pt]{minimal}\begin{document}\begin{equation} {\cal F}^p _1 (x^\prime _2 ,y'_2 ,t_2 ) = H_{2 \to 1} \,\cdot\,{\cal F}_2 (x_2 ,y_2 ,t_2 ), \end{equation}\end{document}F1p(x2,y2,t2)=H21·F2(x2,y2,t2),
where [TeX:] $H_{1 \to 2}$H12 and [TeX:] $H_{2 \to 1}$H21 are the homographies from scene 1 to 2 and scene 2 to 1, respectively. Homographies are independent of the scene structure and can be computed from [TeX:] $X(x_1 ,y_1 ,z_1 ,t_1 ) \cap X(x_2 ,y_2 ,z_2 ,t_2 )$X(x1,y1,z1,t1)X(x2,y2,z2,t2) using the direct linear transform (DLT) algorithm.6


Computation of Symmetric Warps

Once we obtain two pairs of feature trajectories, [TeX:] $({\cal F}_1 ,{\cal F}_2 ^p )$(F1,F2p) and [TeX:] $({\cal F}_1 ^p ,{\cal F}_2 )$(F1p,F2), we compute the symmetric warps [TeX:] ${\cal W}_{1,2p}$W1,2p and [TeX:] ${\cal W}_{1p,2}$W1p,2 using regularized DTW. We construct the warp [TeX:] ${\cal W}$W as follows:


[TeX:] \documentclass[12pt]{minimal}\begin{document}\begin{equation} {\cal W} = w_1 ,w_2 ,\ldots,w_{L\,} \,\,\max ({\cal L}_1 ,{\cal L}_2 ) \le L < {\cal L}_1 + {\cal L}_2 , \end{equation}\end{document}W=w1,w2,...,wLmax(L1,L2)L<L1+L2,
where [TeX:] ${\cal L}_1$L1 and [TeX:] ${\cal L}_2$L2 are the length of trajectories [TeX:] ${\cal F}_1$F1 and [TeX:] ${\cal F}_2$F2, respectively. The L'th element of the warp [TeX:] ${\cal W}$W is [TeX:] $w_L = (i,j)$wL=(i,j), where i and j are the time indices of [TeX:] ${\cal F}_1$F1 and [TeX:] ${\cal F}_2$F2, respectively. The optimal warp is the minimum distance warp, where the distance of a warp is defined as follows:


[TeX:] \documentclass[12pt]{minimal}\begin{document}\begin{equation} {\rm dist}({\cal W}) = \sum\limits_{k = 1}^L {{\rm dist}[{\cal F}(i_k ),{\cal F}^p (j_k )],} \end{equation}\end{document} dist(W)=k=1L dist[F(ik),Fp(jk)],
where [TeX:] ${\rm dist}[{\cal F}(i_k ),{\cal F}^p (j_k )]$ dist[F(ik),Fp(jk)] is the distance between the two values of the given time indices (i, j) in the k'th element of the warp. We propose a regularized distance metric function as follows:


[TeX:] \documentclass[12pt]{minimal}\begin{document}\begin{equation} {\rm dist}[{\cal F}(i),{\cal F}^p (j)] = ||{\cal F}(i) - {\cal F}^p (j)||^2 + w\,\,{\rm reg}, \end{equation}\end{document} dist[F(i),Fp(j)]=||F(i)Fp(j)||2+w reg,


[TeX:] \documentclass[12pt]{minimal}\begin{document}\begin{equation} {\rm reg} = ||\partial {\cal F}(i) - \partial {\cal F}^p (j)||^2 + ||\partial ^2 {\cal F}(i) - \partial ^2 {\cal F}^p (j)||^2 , \end{equation}\end{document} reg=||F(i)Fp(j)||2+||2F(i)2Fp(j)||2,
where [TeX:] $\partial {\cal F}$F and [TeX:] $\partial ^2 {\cal F}$2F are the first and second derivatives of [TeX:] ${\cal F}$F. The regularization term can be considered a smoothness penalty, where w is the weight (normally, w = 25).

To find the optimal warp, an accumulated distance matrix is created. The value of the element in the accumulated distance matrix is:


[TeX:] \documentclass[12pt]{minimal}\begin{document}\begin{equation} {\bm {\cal D}}(i,j) = {\rm dist}[{\cal F}(i),{\cal F}^p (j)] + \min (\phi ), \end{equation}\end{document}D(i,j)= dist[F(i),Fp(j)]+min(φ),


[TeX:] \documentclass[12pt]{minimal}\begin{document}\begin{equation} \phi = [{\bm {\cal D}}(i - 1,j),{\bm {\cal D}}(i - 1, \,j - 1),{\bm {\cal D}}(i,j - 1)]. \end{equation}\end{document}φ=[D(i1,j),D(i1,j1),D(i,j1)].
A greedy search technique is employed to find the optimal warp [TeX:] ${\cal W}$W, such that [TeX:] ${\rm dist}({\cal W})$ dist(W) is minimum. We can now obtain the symmetric warps [TeX:] ${\cal W}_{1,2p}$W1,2p and [TeX:] ${\cal W}_{1p,2}$W1p,2 for [TeX:] $({\cal F}_1 ,{\cal F}_2 ^p )$(F1,F2p) and [TeX:] $({\cal F}_1 ^p ,{\cal F}_2)$(F1p,F2), respectively.


Optimal Warp Calculation

Note that we calculated symmetric warps [TeX:] ${\cal W}_{1,2p}$W1,2p and [TeX:] ${\cal W}_{1p,2}$W1p,2, and corresponding distance matrixes [TeX:] ${\bm {\cal D}}_{1,2p}$D1,2p and [TeX:] ${\bm {\cal D}}_{1p,2}$D1p,2 in the last step. However, the warps still have bias ([TeX:] ${\cal W}_{1,2p}$W1,2p biased toward [TeX:] ${\cal F}_1$F1 while [TeX:] ${\cal W}_{1p,2}$W1p,2 is biased toward [TeX:] ${\cal F}_2$F2). To minimize the effect of biasing on alignment, we first combine [TeX:] ${\bm {\cal D}}_{1,2p}$D1,2p and [TeX:] ${\bm {\cal D}}_{1p,2}$D1p,2 to make a new distance matrix [TeX:] ${\bm {\cal D}}_c$Dc as follows:


[TeX:] \documentclass[12pt]{minimal}\begin{document}\begin{equation} {\bm {\cal D}}_c = {\bm {\cal D}}_{1,2p} + {\bm {\cal D}}_{1p,2} . \end{equation}\end{document}Dc=D1,2p+D1p,2.
Once [TeX:] ${\bm {\cal D}}_c$Dc is obtained, global constraint based on [TeX:] ${\cal W}_{1,2p}$W1,2p and [TeX:] ${\cal W}_{1p,2}$W1p,2 is added into this matrix. Denote [TeX:] $W_c$Wc as the warps under the global constraint, which is restricted by the symmetric warps, as follows:


[TeX:] \documentclass[12pt]{minimal}\begin{document}\begin{equation} \min ({\cal W}_{1,2} ,{\cal W}_{2,1} ) \le {\cal W}_c \le \max ({\cal W}_{1,2} ,{\cal W}_{2,1} ). \end{equation}\end{document}min(W1,2,W2,1)Wcmax(W1,2,W2,1).
Finally, warp [TeX:] ${\cal W}_c$Wc, which satisfies the following equation, is chosen as the final warp.


[TeX:] \documentclass[12pt]{minimal}\begin{document}\begin{equation} {\cal W}_{\rm copt} = \arg _{{\cal W}_c } \min [{\rm dist}({\cal W}_c )]. \end{equation}\end{document}W copt=argWcmin[ dist(Wc)].
Figure 4 shows an intuitive explanation for the proposed optimal warp calculation. The grid represents the distance matrix [TeX:] ${\bm {\cal D}}_c$Dc. The horizontal and vertical axes represent the index of trajectories [TeX:] ${\cal F}_2$F2 and [TeX:] ${\cal F}_1$F1, respectively. The two bold lines represent the symmetric warps [TeX:] ${\cal W}_{1,2p}$W1,2p and [TeX:] ${\cal W}_{1p,2}$W1p,2. The gray area represents the search area constrained by [TeX:] ${\cal W}_{1,2p}$W1,2p and [TeX:] ${\cal W}_{1p,2}$W1p,2. The warp [TeX:] ${\cal W}_{\rm copt}$W copt inside the global constraint is considered as the final unbiased warp.

Fig. 4

An intuitive illustration of the UBDTW.



Experiments and Comparative Analysis

We evaluated our technique using both synthetic and real videos and compared it with RCB4 and STE techniques.5


Synthetic Data Evaluation

In the synthetic data evaluation, we generate planar trajectories 100 frames long using a pseudorandom number gen- erator. These trajectories are then projected onto two image planes using user-defined camera projection matrices. A 60-frames-long time warp is then applied to a section of one of the trajectory projections. The temporal alignment techniques are then applied to the synthetic trajectories. The test was repeated on 100 different synthetic trajectories and 100 similar trajectories with noise added. The added noise was a normally distributed random variate, with zero mean and variance [TeX:] $\sigma ^2 = 0.1$σ2=0.1. The mean absolute error between the warp obtained by different techniques and the ground truth is computed as the evaluation metric. The results are shown in Table 1. The percentage in the parentheses represents theimprovement obtained by an alignment technique with respect to the original error. Figure 5 shows a synchronization result with a synthetic trajectory. The performance of the RCB, STE and the proposed techniques are compared. It is clear that the proposed technique outperformed the other techniques.

Performance improvement of the proposed technique over existing techniques

SynData with noise25.5017.18(32.63%)3.45(86.47%)2.97(88.35%)

Fig. 5

(a) Synthetic trajectory; (b) result of temporal alignment for synthetic trajectories using RCB, STE and the proposed technique.



Real Data Evaluation

For the real video test, we use two videos (54 frames and 81 frames long, respectively) capturing the activity of lifting a coffee cup by different people. We tracked the coffee cup that can represent the activity in a video to generate feature trajectories. Since ground-truth information is not available, we used visual judgement to assess whether the alignment was correct or not.

Figure 6 shows some representative aligned frames in the 4th, 8th, and 12th elements of the alignment warp computed using the STE and the proposed technique. Note that if the coffee cup is at the same position in two frames, we marked it as “matched,” otherwise, “mismatched.” In the results obtained using the STE technique, only one pair of frames is matched, indicating that such technique can often result in erroneous alignments. The performance of the proposed technique is shown in the last two rows. It is observed that all the alignments are correct.

Fig. 6

Comparisons of temporal alignment results on real data using STE technique (shown in the first and second rows) and the proposed technique (shown in the third and fourth rows).




An efficient technique is proposed to synchronize video sequences captured from planar scenes and related by varying temporal offsets. The proposed UBDTW technique is able to remove the biasing and lead to accurate temporal alignment. In the future, we would like to extend this work to more general scenes.


1.  Y. Caspi and M. Irani, “Spatio-temporal alignment of sequences,” IEEE Trans. Pattern Anal. Mach. Intell. 24(11), 1409–1424 (2002). 10.1109/TPAMI.2002.1046148 Google Scholar

2.  L. Lee, R. Romano, and G. Stein, “Monitoring activities from multiple video streams: establishing a common coordinate frame,” IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 758–767 (2000). 10.1109/34.868678 Google Scholar

3.  M. A. Giese and T. Poggio, “Morphable models for the analysis and synthesis of complex motion patterns,” Int. J. Comput. Vis. 38(1), 59–73 (2000). 10.1023/A:1008118801668 Google Scholar

4.  C. Rao, A. Gritai, M. Shah, and T. F. S. Mahmood, “View-invariant alignment and matching of video sequences,” Proc. ICCV03, pp. 939–945 (2003). Google Scholar

5.  M. Singh, I. Cheng, M. Mandal, and A. Basu, “Optimization of symmetric transfer error for sub-frame video synchronization,” Proc. ECCV03, 554–567 (2008). Google Scholar

6.  R. I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed., Cambridge University Press, Cambridge, UK (2004). Google Scholar

Cheng Lu, Mrinal K. Mandal, "Efficient temporal alignment of video sequences using unbiased bidirectional dynamic time warping," Journal of Electronic Imaging 19(4), 040501 (1 October 2010). https://doi.org/10.1117/1.3488415

Back to Top