Shot boundary detection (SBD), also known as temporal video segmentation, is a preprocessing task for many video applications, such as indexing and retrieval. Its output provides coherent temporal units that are easy to manipulate. Most previous works build their frameworks on visual features, which are used to measure similarity for the transition detection task. However, video carries other rich data that could also be beneficial. In this paper, following recent multimodal works, we propose introducing the audio component to improve SBD. First, we work on candidate segments obtained by measuring the similarity between low-level features (SURF, HSF) extracted from the original video. Then, we use deep features obtained from a trained model (ResNet-50) to measure visual similarity, and we introduce audio segmentation based on power spectral density (PSD) to contribute to transition detection. The proposed method is evaluated on the ClipShots dataset. Experiments on this dataset show that the proposed multimodal approach achieves better performance than state-of-the-art methods that rely on a visual-only approach.
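To make the audio cue concrete, the following is a minimal sketch of how a PSD-based change score might be computed and combined with a visual dissimilarity score. It is an illustration under stated assumptions, not the implementation evaluated in this paper: the function names, the periodogram estimate of the PSD, the non-overlapping windows, the cosine distance, and the linear fusion weight are all hypothetical choices.

```python
import numpy as np


def power_spectral_density(window):
    """Periodogram estimate of the PSD of a 1-D audio window (assumed estimator)."""
    tapered = window * np.hanning(len(window))  # reduce spectral leakage
    spectrum = np.fft.rfft(tapered)
    return (np.abs(spectrum) ** 2) / len(window)


def cosine_distance(a, b, eps=1e-12):
    """1 - cosine similarity; eps guards against division by zero."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def audio_change_scores(audio, window_size):
    """Distance between the PSDs of consecutive non-overlapping windows.

    Returns one score per pair of adjacent windows; a high score suggests
    that the audio content changes between the two windows.
    """
    n_windows = len(audio) // window_size
    psds = [
        power_spectral_density(audio[i * window_size:(i + 1) * window_size])
        for i in range(n_windows)
    ]
    return np.array(
        [cosine_distance(psds[i], psds[i + 1]) for i in range(n_windows - 1)]
    )


def fuse_scores(visual_scores, audio_scores, weight=0.5):
    """Hypothetical linear fusion of visual and audio dissimilarity scores."""
    visual_scores = np.asarray(visual_scores, dtype=float)
    audio_scores = np.asarray(audio_scores, dtype=float)
    return weight * visual_scores + (1.0 - weight) * audio_scores
```

In such a sketch, the visual scores would come from distances between deep features of adjacent candidate frames, aligned to the same time windows as the audio, and a transition would be declared where the fused score exceeds a threshold chosen on validation data.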