Moving object extraction plays a fundamental role in computer vision, pattern recognition, and object-based video coding and indexing. However, low-level visual homogeneity criteria (such as color, texture, intensity, or motion) do not by themselves yield regions that correspond to meaningful objects from the human point of view, because a single semantic video object can contain very different gray levels, colors, textures, or even motions. In this paper, a novel moving object extraction algorithm is proposed that collaboratively integrates the results of a semantic object generation algorithm and a temporal tracking technique. First, semantic objects are generated by a seeded region aggregation procedure guided by perceptual visual models or by a human–computer interaction procedure. Second, the correspondence of these extracted semantic objects along the time axis is established through a temporal tracking procedure.
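The seeded region aggregation step can be illustrated with a minimal sketch. Note that the homogeneity test below (intensity within a fixed threshold of the running region mean, over 4-connected neighbors) is an illustrative assumption, not the paper's perceptual visual models, which are richer:

```python
import numpy as np
from collections import deque

def seeded_region_aggregation(image, seed, threshold):
    """Grow a region from `seed` by absorbing 4-connected neighbors whose
    intensity lies within `threshold` of the running region mean.
    (Simplified stand-in for the paper's perceptual homogeneity criteria.)"""
    h, w = image.shape
    mask = np.zeros((h, w), dtype=bool)
    mask[seed] = True
    region_sum, region_count = float(image[seed]), 1
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < h and 0 <= nc < w and not mask[nr, nc]:
                mean = region_sum / region_count
                if abs(float(image[nr, nc]) - mean) <= threshold:
                    mask[nr, nc] = True
                    region_sum += float(image[nr, nc])
                    region_count += 1
                    queue.append((nr, nc))
    return mask

# Toy example: a bright 3x3 object on a dark background.
img = np.zeros((6, 6), dtype=np.uint8)
img[1:4, 1:4] = 200
region = seeded_region_aggregation(img, seed=(2, 2), threshold=50)
print(int(region.sum()))  # -> 9 (only the bright block is aggregated)
```

The dark background pixels fail the homogeneity test against the region mean, so growth stops exactly at the object boundary; in the full algorithm the seeds and the aggregation criteria come from the perceptual models or from user interaction.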