25 October 1994 Audio indexing using speaker identification
Author Affiliations +
In this paper, a technique for audio indexing based on speaker identification is proposed. When speakers are known a priori, a speaker index can be created in real time using the Viterbi algorithm to segment the audio into intervals from a single talker. Segmentation is performed using a hidden Markov model network consisting of interconnected speaker sub- networks. Speaker training data is used to initiate sub-networks for each speaker. Sub- networks can also be used to model silence, or non-speech sounds such as musical theme. When no prior knowledge of the speakers is available, unsupervised segmentation is performed using a non-real time iterative algorithm. The speaker sub-networks are first initialized, and segmentation is performed by iteratively generating a segmentation using the Viterbi algorithm, and retraining the sub-networks based on the results of the segmentation. Since the accuracy of the speaker segmentation depends on how well the speaker sub-networks are initiated, agglomerative clustering is used to approximately segment the audio according to speaker for initialization of the speaker sub-networks. The distance measure for the agglomerative clustering is a likelihood ratio in which speed segments are characterized by Gaussian distributions. The distance between merged segments is recomputed at each stage of the clustering, and a duration model is used to bias the likelihood ratio. Segmentation accuracy using agglomerative clustering initialization matches accuracy using initialization with speaker labeled data.
© (1994) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Lynn D. Wilcox, Lynn D. Wilcox, Don Kimber, Don Kimber, Francine R. Chen, Francine R. Chen, "Audio indexing using speaker identification", Proc. SPIE 2277, Automatic Systems for the Identification and Inspection of Humans, (25 October 1994); doi: 10.1117/12.191878; https://doi.org/10.1117/12.191878

Back to Top