Human postures and gestures are considered highly relevant semantic information in videos and surveillance systems. We present a new three-step approach to classifying the posture or gesture of a person based on segmentation, classification, and aggregation. A background image is constructed from successive frames using motion compensation, and the shapes of people are segmented by comparing the background image with each frame. We use a modified curvature scale space (CSS) approach to classify a shape. A major drawback of the standard CSS approach is its poor representation of convex segments in shapes: convex objects cannot be represented at all since they have no inflection points. We have extended the CSS approach to generate feature points for both the concave and convex segments of a shape. The key idea is to reflect each contour pixel and map the original shape to a second one whose curvature is reversed: strongly convex segments in the original shape are mapped to concave segments in the second one and vice versa. For each shape a CSS image is generated whose feature points characterize the shape of a person very well. The last step aggregates the matching results. A transition matrix is defined that classifies possible transitions between adjacent frames, e.g., a person who is sitting on a chair in one frame cannot be walking in the next. A valid transition requires several intermediate frames in which the posture is classified as "standing-up". We present promising results and compare the classification rates of postures and gestures for the standard CSS and our new approach.
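The aggregation step can be illustrated with a minimal sketch. It is not the method of the paper: the label set, the boolean matrix ALLOWED, and the function aggregate are hypothetical, and a Viterbi-style search is swapped in as one plausible way to pick the best label sequence that contains only permitted transitions. The sketch demands at least one "standing-up" frame between sitting and walking, whereas the abstract asks for several; enforcing a minimum duration would need extra states.

```python
import numpy as np

POSTURES = ["sitting", "standing-up", "walking"]  # hypothetical label set

# ALLOWED[i, j] is True if posture j may directly follow posture i (assumed values).
ALLOWED = np.array([
    [True,  True,  False],   # sitting     -> sitting, standing-up
    [True,  True,  True],    # standing-up -> any posture
    [False, True,  True],    # walking     -> standing-up, walking
])

def aggregate(scores, allowed=ALLOWED):
    """Return the highest-scoring label sequence that uses only allowed transitions.

    scores: (frames, classes) per-frame matching scores, higher is better.
    """
    scores = np.asarray(scores, dtype=float)
    n_frames, n_classes = scores.shape
    penalty = np.where(allowed, 0.0, -np.inf)   # forbidden transitions cost -inf
    best = scores[0].copy()                     # best path score ending in each class
    back = np.zeros((n_frames, n_classes), dtype=int)
    for t in range(1, n_frames):
        cand = best[:, None] + penalty          # cand[i, j]: come from i, move to j
        back[t] = np.argmax(cand, axis=0)
        best = cand[back[t], np.arange(n_classes)] + scores[t]
    # Trace the best path backwards from the last frame.
    path = [int(np.argmax(best))]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [POSTURES[k] for k in reversed(path)]
```

If the per-frame winners were sitting, sitting, walking, the frame-wise result would contain a forbidden jump; the constrained search instead reinterprets the middle frame as "standing-up" when its score for that class is high enough.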
Many TV broadcasters and film archives are planning to make their
collections available on the Web. However, a major problem with large
film archives is the fact that it is difficult to search the content
visually. A video summary is a sequence of video clips extracted from
a longer video. Much shorter than the original, the summary preserves
its essential messages. Hence, video summaries may speed up the search.
Videos that have full horizontal and vertical resolution will usually
not be accepted on the Web, since the bandwidth required to transfer
the video is generally very high. If the resolution of a video is
reduced in an intelligent way, its content can still be understood. We
introduce a new algorithm that reduces the resolution while preserving
as much of the semantics as possible.
In the MoCA (movie content analysis) project at the University of
Mannheim we developed the video summarization component and tested it
on a large collection of films. In this paper we discuss the
particular challenges which the reduction of the video length poses,
and report empirical results from the use of our summarization tool.
The live-wire approach is a well-known algorithm that uses a graph search to locate boundaries for image segmentation. We extend the original cost function, which is based solely on finding strong edges, so that the approach can take a large variety of boundaries into account. The cost function adapts to the local characteristics of a boundary by analyzing a user-defined sample using a continuous wavelet decomposition. Finally, we extend the approach to 3D in order to segment objects in volumetric data, e.g., from medical CT and MR scans.
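The graph search underlying the basic live-wire approach can be sketched as a shortest-path search on the pixel grid. This is a minimal illustration, not the extended cost function described above: edge_cost is a simple gradient-magnitude stand-in for the purely edge-based cost, and the function names, the 8-connected neighbourhood, and the diagonal weighting are assumptions.

```python
import heapq
import numpy as np

def edge_cost(image):
    """Assumed edge-based cost: close to 0 on strong edges, 1 in flat regions."""
    gy, gx = np.gradient(image.astype(float))
    mag = np.hypot(gx, gy)
    return 1.0 - mag / (mag.max() + 1e-12)

def live_wire(cost, seed, target):
    """Minimum-cost 8-connected path from seed to target (Dijkstra search)."""
    h, w = cost.shape
    dist = np.full((h, w), np.inf)
    prev = {}
    dist[seed] = 0.0
    heap = [(0.0, seed)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if (r, c) == target:
            break
        if d > dist[r, c]:
            continue  # stale heap entry
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr or dc) and 0 <= nr < h and 0 <= nc < w:
                    # Diagonal steps are longer, so weight them by sqrt(2).
                    step = float(cost[nr, nc]) * (2 ** 0.5 if dr and dc else 1.0)
                    nd = d + step
                    if nd < dist[nr, nc]:
                        dist[nr, nc] = nd
                        prev[(nr, nc)] = (r, c)
                        heapq.heappush(heap, (nd, (nr, nc)))
    # Walk back from the target to the seed.
    path = [target]
    while path[-1] != seed:
        path.append(prev[path[-1]])
    return path[::-1]
```

In an interactive tool the seed would be the last point the user clicked and the target the current cursor position, so the boundary "snaps" to low-cost pixels as the cursor moves.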
We present a method for analyzing and resynthesizing inhomogeneously textured regions in images for the purpose of advanced compression. First the user defines image blocks so that they cover regions with homogeneous texture. These blocks are each transformed in turn. For the transform we use Principal Component Analysis (PCA). After the transform into the new domain we statistically analyze the resulting coefficients. To resynthesize new texture we generate random numbers that exactly meet these statistics. Using the inverse transform, the random coefficients are finally transformed back into the spatial domain. The visual appearance of the resulting artificial texture matches the original to a very high degree.
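The analysis and resynthesis pipeline can be sketched as follows. This is a minimal illustration under stated assumptions: the function names fit_pca and synthesize are hypothetical, the coefficient statistics are reduced to one standard deviation per component, and the random coefficients are drawn from a Gaussian, which matches those statistics only in expectation rather than exactly.

```python
import numpy as np

def fit_pca(blocks):
    """Analyze texture blocks of shape (n, b, b).

    Returns the mean block (flattened), the principal directions (rows of vt),
    and the standard deviation of the coefficients along each direction.
    """
    x = blocks.reshape(len(blocks), -1).astype(float)
    mean = x.mean(axis=0)
    # SVD of the centered data; rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    coeffs = (x - mean) @ vt.T          # forward transform into the PCA domain
    return mean, vt, coeffs.std(axis=0)

def synthesize(mean, vt, std, shape, rng):
    """Draw random coefficients and transform them back to the spatial domain."""
    coeffs = rng.normal(0.0, std)       # assumption: Gaussian coefficients
    return (mean + coeffs @ vt).reshape(shape)
```

A possible usage, with random data standing in for user-selected texture blocks:

```python
rng = np.random.default_rng(0)
blocks = rng.random((50, 4, 4))
mean, vt, std = fit_pca(blocks)
texture = synthesize(mean, vt, std, (4, 4), rng)
```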
Video conferencing and high-quality video-on-demand services are very desirable for many Internet users. However, Internet access channels ranging from wireless connections to high-speed ATM networks mean great heterogeneity with respect to bandwidth. Hierarchical video encoders that scale and distribute video data over different layers enable users to adapt video quality to the capacity of their Internet connection. However, the construction of the layers at the encoder determines the video quality that can be expected at the receiver. To achieve an optimal configuration of the different layers with respect to visual quality, we propose a hybrid scaling algorithm that scales video data in both the spatial and temporal dimensions. Using a quality metric based on properties of the human visual system, our algorithm calculates an optimal ratio between spatial and temporal information. Additionally, we present experimental results that demonstrate the capabilities of our approach.