Today, a typical end-user of a multimedia system is overwhelmed with video collections and faces the problem of organizing them so that they remain easily accessible. To enable efficient browsing of these video collections, techniques and methods for indexing and retrieving video data must be designed. The research community has therefore addressed the problem of analyzing and automatically indexing video content by retrieving highly representative information (e.g., shot boundaries).
Several approaches have been proposed in the literature for automatic shot boundary detection (SBD), which can be basically classified according to the detection algorithm that each method implements.
The first group of SBD methods exploits the variation of the color intensity histograms between consecutive frames: under the hypothesis that all frames belonging to the same scene share a similar color histogram, a detected change in the color histogram signals a possible scene cut.1 Another group of methods classifies frames using mathematical models, such as the analysis of statistics derived from a specific pixel area along the video sequence.2 Similarly, other methods rely on edge detection and edge comparison between successive frames,3 while some specialized methods for MPEG-coded signals have also been proposed.4, 5
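As an illustration of the histogram-based family, the following sketch compares normalized intensity histograms of consecutive frames; it is illustrative only (the function names and bin count are assumptions, not the exact method of Ref. 1):

```python
def color_histogram(frame, bins=16, levels=256):
    """Normalized histogram of pixel intensities (one channel, 0..levels-1).

    `frame` is a flat list of integer pixel values.
    """
    hist = [0] * bins
    for px in frame:
        hist[px * bins // levels] += 1
    # Normalize so frames of different sizes are comparable.
    total = len(frame)
    return [h / total for h in hist]

def histogram_difference(prev, curr):
    """L1 distance between two normalized histograms (0 = identical)."""
    return sum(abs(a - b) for a, b in zip(color_histogram(prev),
                                          color_histogram(curr)))
```

A large difference between two consecutive frames then suggests a possible cut, subject to the threshold selection problem discussed next.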
However, all the aforementioned methods use a threshold parameter to distinguish shot boundaries from other changes. Thus, a common challenge shared by these methods is the selection, prior to the SBD process, of the appropriate threshold identifying the level of variation that defines a shot boundary.6 If a global threshold is used for detecting shot boundaries over the whole video, the successful detection rate may vary by up to 20%, even for the same video content.7 To improve efficiency and eliminate this performance variation, some later works propose an adaptive threshold, dynamically determined from the video content.8, 9 Even these methods, however, require considerable computational power to estimate the appropriate threshold parameter, making their implementation challenging, especially for real-time applications. Another approach uses supervised classifiers instead of thresholds.10
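A sliding-window scheme is one simple way to realize such an adaptive threshold. The sketch below is an illustrative assumption, not the exact rule of Refs. 8 and 9: it flags a frame as a cut when its difference metric exceeds the recent mean by a few standard deviations:

```python
def adaptive_cut_detector(diffs, window=5, k=3.0):
    """Flag frame n as a cut when its difference metric exceeds the mean
    of the preceding `window` values by k standard deviations.
    (Illustrative sliding-window scheme, not the rule of Refs. 8 and 9.)
    """
    cuts = []
    for n in range(window, len(diffs)):
        recent = diffs[n - window:n]
        mean = sum(recent) / window
        var = sum((d - mean) ** 2 for d in recent) / window
        if diffs[n] > mean + k * var ** 0.5:
            cuts.append(n)
    return cuts
```

Note that the threshold must be re-estimated at every frame, which hints at the computational cost mentioned above.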
This paper introduces a novel method for SBD, which enables the quick and easy extraction of the most significant frames from a discrete cosine transform (DCT)-based encoded video, without requiring any threshold calculation. The proposed method makes use of a multimetric pixel-based algorithm, which calculates for each frame the mean pixel value differences across and at both sides of DCT block margins. Then, the normalized results indicate the magnitude of the tiling effect. The proposed method exploits the fact that during an abrupt scene change over an interframe, the motion estimation and compensation algorithms of the encoding process perform poorly; the immediate outcome is an intensification of the blockiness effect, which may not be perceptually observable (due to the short display duration of each frame) but is measurable.
Proposed Block-based Method
Multimedia applications that distribute audiovisual content are mainly based on DCT-based digital encoding techniques (e.g., MPEG-1/2/4), which achieve high compression ratios by exploiting the spatial and temporal redundancy in video sequences. Most of the standards are based on motion estimation and compensation, using the block-based DCT. The transform facilitates the exploitation of various psychovisual redundancies by mapping the sequence to a domain where frequency ranges with dissimilar sensitivities to the human visual system (HVS) can be accessed independently.
The DCT operates on an $N \times N$ block $X$ of image samples or residual values after prediction and creates $Y$, which is an $N \times N$ block of coefficients. The action of the DCT can be described in terms of a transform matrix $A$. The forward DCT is given by $Y = A X A^{T}$, where $X$ is a matrix of samples, $Y$ is a matrix of coefficients, and $A$ is an $N \times N$ transform matrix. The elements of $A$ are $A_{ij} = C_i \cos\frac{(2j+1)i\pi}{2N}$, with $C_i = \sqrt{1/N}$ for $i = 0$ and $C_i = \sqrt{2/N}$ for $i > 0$.
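Under these standard definitions, the forward transform $Y = A X A^{T}$ can be sketched directly (a minimal pure-Python illustration; the function names are not from the paper):

```python
import math

def dct_matrix(n):
    """N x N DCT transform matrix A with A[i][j] = C_i cos((2j+1) i pi / 2N),
    where C_0 = sqrt(1/N) and C_i = sqrt(2/N) for i > 0."""
    a = []
    for i in range(n):
        c = math.sqrt(1.0 / n) if i == 0 else math.sqrt(2.0 / n)
        a.append([c * math.cos((2 * j + 1) * i * math.pi / (2 * n))
                  for j in range(n)])
    return a

def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def forward_dct(x):
    """Forward DCT of an N x N sample block: Y = A X A^T."""
    a = dct_matrix(len(x))
    at = [list(row) for row in zip(*a)]
    return matmul(matmul(a, x), at)
```

Because $A$ is orthonormal ($A A^{T} = I$), the samples are recovered by the inverse transform $X = A^{T} Y A$.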
The blockiness effect refers to a block pattern of size $N \times N$ (typically $8 \times 8$) in the compressed sequence, which results from the independent quantization of the individual blocks of the block-based DCT. Within a block, the luminance discontinuities between any pair of adjacent pixels are reduced by the encoding and compression process. On the contrary, for all pairs of adjacent pixels located across and on both sides of the border between adjacent DCT blocks, the luminance discontinuities are increased by the encoding process.
Especially for video services in the framework of 3G/4G mobile communication systems, where the encoding bit rate is very low, the blockiness effect is the dominant artifact. During a scene change in particular, where the motion estimation and compensation efficiency drops, the blockiness effect is intensified; this is usually not noticeable by the viewer,11 but it is easily measurable by an image processing tool. Thus, by measuring the variation of the blockiness effect along a video sequence, it is possible to identify where and when a scene change takes place.
To measure the intensity of the blockiness effect, the average luminance discontinuities at the boundaries of adjacent blocks are calculated by simply comparing the corresponding luminance pixel values: the larger the difference, the more severe the blockiness effect. For this purpose, for each frame of the video sequence, the individual offsets of the block pixel pairs that Fig. 1 demonstrates are calculated; Fig. 2 depicts a graphical representation of the offset that Eq. 3 calculates.
The vertical ⟨offset⟩ values of a frame are defined analogously, and the ⟨offset⟩ averaged over the frame width and height is calculated; a clipping function then normalizes the result within the range [0, 1]. Therefore, by applying Eq. 7 to encoded video sequences, the clipped fluctuation of the averaged offset (i.e., the blockiness effect) per frame can be deduced. Taking into consideration that during a scene change the blockiness effect is instantaneously strengthened, Eq. 7 provides a quick and simple metric of scene changes.
Because during an abrupt scene change the values of the ⟨offset⟩ become significantly larger than those of an intrascene ⟨offset⟩, applying the value normalization of Eq. 7 yields a clear association between the clipped ⟨offset⟩ values and abrupt scene changes. More specifically, all measured clipped ⟨offset⟩ values coming from intrascene frames are relatively low (i.e., close to 0), while the measured clipped ⟨offset⟩ values resulting from a frame over an abrupt scene change are equal to 1. Most importantly, no more than a few intermediate values (i.e., around 0.5) are observed; these denote severe camera motion, such as zooming or panning. Thus, the difference between the intrascene and interscene ⟨offset⟩ values is so pronounced that the requirement for any sophisticated threshold estimation for shot boundary detection is eliminated.
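A simplified stand-in for the per-frame ⟨offset⟩ and its normalization can be sketched as follows. The boundary-difference metric and function names here are illustrative assumptions; the paper's multimetric Eqs. (1)-(7), which also compare differences on both sides of each block border, are not reproduced:

```python
def blockiness_offset(frame, block=8):
    """Mean absolute luminance difference across the vertical DCT-block
    boundaries of a 2-D frame (list of pixel rows). Simplified stand-in
    for the paper's multimetric <offset>."""
    diffs = []
    for row in frame:
        for x in range(block, len(row), block):
            diffs.append(abs(row[x] - row[x - 1]))
    return sum(diffs) / len(diffs) if diffs else 0.0

def normalized_offsets(frames, block=8):
    """Normalize each per-frame offset by the sequence maximum into [0, 1],
    mimicking the clipping of Eq. 7."""
    raw = [blockiness_offset(f, block) for f in frames]
    peak = max(raw) or 1.0
    return [v / peak for v in raw]
```

In such a sketch, intrascene frames yield values near 0 and a frame over an abrupt cut yields 1, so no threshold tuning is needed.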
Evaluation of the Proposed Method on Real Video Clips
To evaluate the proposed method, a video sequence of 1500 frames from the motion picture Spider-Man II was used as the test signal. The initial PAL (phase alternation line) MPEG-2 video content was transcoded to CIF (common intermediate format) MPEG-4 at advanced simple profile. On the final coded signal, an implementation12 of the aforementioned blockiness estimation algorithm was applied to perform the shot boundary detection. Fig. 3 depicts the deduced ⟨offset⟩ per frame, which was calculated by this procedure.
Based on Fig. 3, it is also experimentally verified that all measured clipped ⟨offset⟩ values coming from intrascene frames are relatively low (i.e., close to 0), while the measured clipped ⟨offset⟩ values resulting from a frame over an abrupt scene change are equal to 1.
To eliminate false frame reports caused by the propagation of blockiness from a successfully detected scene-cut frame to its immediately following frames, an interval after the last detected scene change (e.g., 25 frames) is considered, during which no scene change is reported even if one is detected.
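This refractory interval can be sketched as a simple post-filter over the detected frame indices (the function name and structure are illustrative):

```python
def suppress_neighbors(detections, guard=25):
    """Keep a detected cut only if at least `guard` frames have passed
    since the last reported one, discarding the blockiness-propagation
    echoes that follow a genuine scene-cut frame."""
    reported, last = [], None
    for n in sorted(detections):
        if last is None or n - last >= guard:
            reported.append(n)
            last = n
    return reported
```

With the 25-frame guard of the paper (one second of PAL video), echoes immediately after a reported cut are silently dropped.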
The efficiency of the proposed method, with the aforementioned configuration, was also tested on a set of heterogeneous CIF MPEG-4 video clips, containing media clips with both abrupt and gradual scene cuts. The corresponding results are depicted in Table 1, along with the performance, for the same encoding bit rate area, of two other existing threshold-based shot boundary detection methods for MPEG video13 (for method 1, see Ref. 14; for method 2, see Ref. 15).
Table 1. Comparison of the proposed method for abrupt and gradual scene changes.

| Method | Abrupt Scene Change | Gradual Scene Change |
From Table 1 we can deduce that, although the proposed method performs similarly to the existing threshold-based methods in terms of the recall metric, it outperforms them in the precision of scene detection for both abrupt and gradual scene changes, while retaining a significantly lower computational cost due to the absence of a threshold parameter.
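For reference, the recall and precision figures used in such a comparison can be computed from the detected and ground-truth cut positions as follows (a generic sketch; the tolerance parameter is an assumption, not part of the paper's protocol):

```python
def precision_recall(detected, ground_truth, tolerance=0):
    """Recall = fraction of true cuts that were found; precision =
    fraction of reported cuts that are true, matched within
    +/- `tolerance` frames."""
    truth = set(ground_truth)
    hits = {t for t in truth
            if any(abs(t - d) <= tolerance for d in detected)}
    true_pos = {d for d in detected
                if any(abs(t - d) <= tolerance for t in truth)}
    recall = len(hits) / len(truth) if truth else 1.0
    precision = len(true_pos) / len(detected) if detected else 1.0
    return precision, recall
```

A high recall with low precision indicates many false cut reports; the proposed method's advantage in Table 1 lies in the precision column.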
We presented a method for SBD that requires no threshold parameter. Using only the increase of the blockiness effect during a scene cut, the proposed method successfully detects where a scene cut occurs. The efficiency of the proposed technique was successfully tested on both abrupt and gradual scene changes and compared to other existing shot boundary detection methods.
This work was carried out within the “PYTHAGORAS II” research framework, jointly funded by the European Union and the Hellenic Ministry of Education.