This research addresses the problem of automatically extracting semantic video scenes from daily movies using multimodal information. A 3-stage scene detection scheme is proposed. In the first stage, we use pure visual information to extract a coarse-level scene structure based on generated shot sinks. In the second stage, the audio cue is integrated to further refine scene detection results by considering various kinds of audio scenarios. Finally, in the third stage, we allow users to directly interact with the system so as to fine-tune the detection results to their own satisfaction. The generated scene structure can provide a compact yet meaningful abstraction of the video data, which will apparently facilitate the content access. Preliminary experiments on integrating multiple media cues for movie scene extraction have yielded encouraging results.