Motion-compensated video coders typically segment a scene into arbitrary tiles, producing a compressed bitstream that bears no physical or semantic relation to the scene structure. This paper presents a method for segmenting video frames and coding the motion of regions, where the regions are defined in terms of several distinct properties. The goal is a video coder that achieves good compression while identifying coherent regions in a manner useful both to human users and to automated scene-understanding processes. Both a supervised and an unsupervised clustering algorithm are used to segment an image sequence; both make use of multiple features, including motion, texture, position, and color. By exploiting both structure and motion information, we preserve the semantic/structural content of the different regions while removing the redundancy between successive frames by describing the motion of each region with a six-parameter affine model. In the supervised algorithm, the first frame is segmented manually and used as training data; subsequent frames are classified automatically by a MAP estimate, with the n-dimensional feature space modeled as jointly Gaussian for each region. The unsupervised algorithm is an iterative process that reassigns each point to the region whose mean, computed from the previous iteration's segmentation, is nearest. In both algorithms, distances and means are measured in the n-dimensional feature space, n being the number of features used.
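The six-parameter affine model mentioned above maps each pixel of a region to its position in the next frame via x' = a1 + a2·x + a3·y, y' = a4 + a5·x + a6·y. A minimal sketch of this model and a least-squares fit of its parameters from per-pixel motion vectors follows; the function names `affine_motion` and `fit_affine` are illustrative, not from the paper, and the fitting procedure shown is ordinary least squares rather than whatever estimator the authors used:

```python
import numpy as np

def affine_motion(params, x, y):
    """Map pixel (x, y) to its position in the next frame under the
    six-parameter affine model x' = a1 + a2*x + a3*y, y' = a4 + a5*x + a6*y."""
    a1, a2, a3, a4, a5, a6 = params
    return a1 + a2 * x + a3 * y, a4 + a5 * x + a6 * y

def fit_affine(points, flows):
    """Least-squares estimate of the six affine parameters of a region.
    points: (N, 2) pixel coordinates; flows: (N, 2) displacements (dx, dy)."""
    x, y = points[:, 0], points[:, 1]
    A = np.column_stack([np.ones_like(x), x, y])  # shared design matrix
    targets = points + flows                      # displaced positions (x', y')
    ax, *_ = np.linalg.lstsq(A, targets[:, 0], rcond=None)
    ay, *_ = np.linalg.lstsq(A, targets[:, 1], rcond=None)
    return np.concatenate([ax, ay])               # (a1, ..., a6)
```

Coding one such parameter vector per region, instead of a dense motion field, is what removes the frame-to-frame redundancy within each region.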
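The supervised step — a MAP classification under a jointly Gaussian model of the n-dimensional feature space — can be sketched as below. The helper names `fit_region_models` and `map_classify` are illustrative, and the small diagonal regularizer on the covariance is an assumption for numerical stability, not something stated in the paper:

```python
import numpy as np

def fit_region_models(features_by_region):
    """From the manually segmented first frame, estimate a Gaussian model
    (mean, covariance) and a prior for each region.
    features_by_region: list of (N_i, n) feature arrays, one per region."""
    total = sum(len(f) for f in features_by_region)
    models = []
    for f in features_by_region:
        mu = f.mean(axis=0)
        cov = np.cov(f, rowvar=False) + 1e-6 * np.eye(f.shape[1])  # regularized
        models.append((mu, cov, len(f) / total))
    return models

def map_classify(x, models):
    """Assign feature vector x to the region maximizing the posterior:
    Gaussian log-likelihood plus log prior."""
    best, best_score = -1, -np.inf
    for k, (mu, cov, prior) in enumerate(models):
        d = x - mu
        _, logdet = np.linalg.slogdet(cov)
        score = -0.5 * (d @ np.linalg.solve(cov, d) + logdet) + np.log(prior)
        if score > best_score:
            best, best_score = k, score
    return best
```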
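The unsupervised nearest-mean reassignment described above is essentially k-means clustering in the joint feature space. A minimal sketch under that reading (the function name and the random initialization are assumptions, not details from the paper):

```python
import numpy as np

def iterative_segmentation(features, k, iters=20, seed=0):
    """Unsupervised segmentation: repeatedly reassign each point's n-dim
    feature vector to the region with the nearest mean, then recompute each
    region's mean from the previous iteration's assignment.
    features: (N, n) array; returns (labels, means)."""
    rng = np.random.default_rng(seed)
    means = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # Distance of every point to every region mean (N, k)
        d = np.linalg.norm(features[:, None, :] - means[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):      # skip regions that lost all points
                means[j] = features[labels == j].mean(axis=0)
    return labels, means
```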