Speaker change detection (SCD) is a preliminary step for many audio applications such as speaker segmentation
and recognition. Its robustness is therefore crucial for good performance in the subsequent steps. In particular,
misses (false negatives) degrade the results. For some applications, domain-specific characteristics can be
exploited to improve the reliability of SCD. In broadcast news and discussions, the co-occurrence of shot
boundaries and change points provides a robust cue for speaker changes.
In this paper, two multimodal approaches are presented that use the results of a shot boundary detection
(SBD) step to improve the robustness of SCD. Both approaches clearly outperform the audio-only approach,
but they are applicable only to TV broadcast news and plenary discussions.
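The abstract does not specify how the SBD results are fused with the audio change points. One plausible fusion rule, sketched below under our own assumptions (the function name, the tolerance window, and the keep-if-confirmed policy are hypothetical, not taken from the paper), is to retain an audio speaker-change candidate only when a shot boundary occurs within a small temporal window around it:

```python
def fuse_change_points(audio_changes, shot_boundaries, tol=1.0):
    """Hypothetical multimodal fusion rule: keep an audio speaker-change
    candidate (in seconds) only if a shot boundary from SBD lies within
    `tol` seconds of it. This reduces false alarms at the cost of
    discarding changes that happen without a camera cut."""
    confirmed = []
    for t in audio_changes:
        if any(abs(t - s) <= tol for s in shot_boundaries):
            confirmed.append(t)
    return confirmed

# Example: two of three audio candidates coincide with shot boundaries.
print(fuse_change_points([3.0, 10.0, 20.0], [2.5, 19.2]))  # [3.0, 20.0]
```

A stricter variant could instead lower the detection threshold of the audio SCD near shot boundaries, which would address misses rather than false alarms; the paper's actual mechanism is not stated in this abstract.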
In this paper, we present an automatic extraction of goal events in soccer videos using audio track features alone, without relying on expensive-to-compute video track features. The extracted goal events can be used for high-level indexing and selective browsing of soccer videos. The detection of soccer video highlights from audio content comprises three steps: 1) extraction of audio features from a video sequence, 2) candidate detection of highlight events based on the extracted features and a Hidden Markov Model (HMM), and 3) goal event selection to finally determine the video intervals to be included in the summary. For this purpose, we compared the performance of the well-known Mel-scale Frequency Cepstral Coefficients (MFCC) feature extraction method against the MPEG-7 Audio Spectrum Projection (ASP) feature extraction method based on three different decomposition methods, namely Principal Component Analysis (PCA), Independent Component Analysis (ICA), and Non-Negative Matrix Factorization (NMF). To evaluate our system, we collected five soccer game videos from various sources, totaling seven hours of footage and eight gigabytes of data. One of the five games is used as training data for the audio classes of interest (e.g., announcers' excited speech, ambient audience noise, audience clapping, and environmental sounds). Our goal event detection results are encouraging.
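The HMM-based candidate detection in step 2 typically amounts to decoding the most likely sequence of audio states from the observed features. As a minimal self-contained sketch (the states, symbols, and probabilities below are illustrative assumptions, not the paper's trained model, and the real system would use continuous MFCC/ASP observations rather than discrete symbols), Viterbi decoding over a two-state "normal"/"excited" commentary model looks like this:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Log-domain Viterbi decoding: return the most likely hidden state
    sequence for a discrete observation sequence `obs`."""
    # Initialize with the first observation.
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # Best predecessor state for s at time t.
            prob, prev = max(
                (V[t - 1][p] + math.log(trans_p[p][s]) + math.log(emit_p[s][obs[t]]), p)
                for p in states
            )
            V[t][s] = prob
            back[t][s] = prev
    # Backtrack from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Illustrative model: excited speech tends to follow loud audio frames.
states = ["normal", "excited"]
start_p = {"normal": 0.8, "excited": 0.2}
trans_p = {"normal": {"normal": 0.9, "excited": 0.1},
           "excited": {"normal": 0.3, "excited": 0.7}}
emit_p = {"normal": {"quiet": 0.8, "loud": 0.2},
          "excited": {"quiet": 0.2, "loud": 0.8}}

path = viterbi(["quiet", "loud", "loud"], states, start_p, trans_p, emit_p)
print(path)  # ['normal', 'excited', 'excited']
```

Runs of the "excited" state decoded this way would then serve as highlight-event candidates for the goal-selection stage in step 3.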