3D-Video systems allow a user to perceive depth in the viewed scene and to display the scene from arbitrary viewpoints
interactively and on-demand. This paper presents a prototype implementation of a 3D-video streaming
system using an IP network. The architecture of our streaming system is layered, where each information layer
conveys a single coded video signal or coded scene-description data. We demonstrate the benefits of a layered
architecture with two examples: (a) stereoscopic video streaming, (b) monoscopic video streaming with remote
multiple-perspective rendering. Our implementation experiments confirm that prototyping 3D-video streaming
systems is possible with today's software and hardware. Furthermore, our current operational prototype demonstrates
that highly heterogeneous clients can coexist in the system, ranging from auto-stereoscopic 3D displays
to resource-constrained mobile devices.
A 3D video stream is typically obtained from a set of synchronized cameras, which are simultaneously capturing
the same scene (multiview video). This technology enables applications such as free-viewpoint video which
allows the viewer to select a preferred viewpoint, or 3D TV, where the depth of the scene can be perceived
using a special display. Because the user-selected view does not always correspond to a camera position, it may
be necessary to synthesize a virtual camera view. To synthesize such a virtual view, we have adopted a depth image-based rendering technique that employs one depth map for each camera. Consequently, remote rendering
of the 3D video requires a compression technique for texture and depth data. This paper presents a predictive-coding
algorithm for the compression of depth images across multiple views. The presented algorithm provides
(a) improved coding efficiency for depth images compared with block-based motion-compensation encoders (H.264), and
(b) random access to different views for fast rendering. The proposed depth-prediction technique works by
synthesizing/computing the depth of 3D points based on the reference depth image. The attractiveness of the
depth-prediction algorithm is that the prediction of depth data avoids an independent transmission of depth for
each view, while simplifying the view interpolation by synthesizing depth images for arbitrary view points. We
present experimental results for several multiview depth sequences, which show a quality improvement of up
to 1.8 dB compared with H.264 compression.
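The prediction step described above can be sketched as a forward warp of the reference depth map: each pixel's 3-D point is reconstructed from its depth and reprojected into the target view. This is a minimal illustration, not the paper's exact algorithm; the intrinsic matrix `K` and relative pose `(R, t)` are assumed known, and occlusion handling is reduced to a simple z-buffer.

```python
import numpy as np

def predict_depth(depth_ref, K, R, t):
    """Warp a reference depth map into a target view.

    Every reference pixel is back-projected to a 3-D point using its
    depth, transformed into the target camera frame, and reprojected;
    the new z-coordinate becomes the predicted depth at the landing
    position.  Unseen pixels remain at infinity.
    """
    h, w = depth_ref.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])  # homogeneous pixels
    rays = np.linalg.inv(K) @ pix                             # back-projected rays
    pts = rays * depth_ref.ravel()                            # 3-D points, reference frame
    pts_t = R @ pts + t[:, None]                              # into the target frame
    proj = K @ pts_t
    u = np.round(proj[0] / proj[2]).astype(int)
    v = np.round(proj[1] / proj[2]).astype(int)
    pred = np.full((h, w), np.inf)
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (pts_t[2] > 0)
    # z-buffer: when several points collide, keep the nearest surface
    for ui, vi, zi in zip(u[ok], v[ok], pts_t[2][ok]):
        if zi < pred[vi, ui]:
            pred[vi, ui] = zi
    return pred
```

In the full scheme, the synthesized depth serves as the prediction for the target view's depth image, so only the residual has to be coded.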
This paper presents a software framework providing a platform for parallel and distributed processing of video
data on a cluster of SMP computers. Existing video-processing algorithms can be easily integrated into the
framework by considering them as atomic processing tiles (PTs). PTs can be connected to form processing graphs
that model the data flow of a specific application. This graph also defines the data dependencies that determine
which tasks can be computed in parallel. Scheduling of the tasks in this graph is carried out automatically using
a pool-of-tasks scheme. The data format that can be processed by the framework is not restricted to image data;
intermediate data, such as detected feature points or object positions, can also be transferred between PTs.
Furthermore, the processing can optionally be carried out efficiently on special-purpose processors with separate
memory, since the framework minimizes the transfer of data. Finally, we describe an example application for a
multi-camera view-interpolation system that we successfully implemented on the proposed framework.
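The pool-of-tasks idea behind the framework can be sketched as follows. The names `graph` and `run` are hypothetical stand-ins for the PT dependency graph and the worker execution, and the sketch runs tasks sequentially; in the real framework, worker threads would pull from the ready pool in parallel.

```python
from collections import deque

def schedule(graph, run):
    """Execute a processing-tile graph with a pool-of-tasks scheme.

    graph maps each tile to the tiles it depends on; a tile enters the
    ready pool as soon as all of its inputs are available, which is
    exactly the parallelism the data-flow graph permits.
    """
    pending = {t: set(deps) for t, deps in graph.items()}
    ready = deque(t for t, deps in pending.items() if not deps)
    done, order = set(), []
    while ready:
        tile = ready.popleft()
        run(tile)                       # a worker thread would pick this up
        done.add(tile)
        order.append(tile)
        for t, deps in pending.items():
            if t not in done and t not in ready and deps <= done:
                ready.append(t)         # all inputs available: join the pool
    return order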
Emerging 3-D displays show several views of the scene simultaneously. A direct transmission of a selection of these views is impractical, because various types of displays support a different number of views and the decoder has to interpolate the intermediate views. The transmission of multiview image information can be simplified by only transmitting the texture data for the central view and a corresponding depth map. In addition to the coding of the texture data, this technique requires the efficient coding of depth maps. Since the depth map represents the scene geometry and thereby determines the 3-D perception of the scene, sharp edges, which correspond to object boundaries, should be preserved. We propose an algorithm that models depth maps using piecewise-linear functions (platelets). To adapt to varying scene detail, we employ a quadtree decomposition that divides the image into blocks of variable size, each block being approximated by one platelet. In order to preserve sharp object boundaries, the support area of each platelet is adapted to the object boundary. The subdivision of the quadtree and the selection of the platelet type are optimized such that a global rate-distortion trade-off is realized. Experimental results show that the described method can improve the resulting picture quality after compression of depth maps by 1-3 dB compared with a JPEG-2000 encoder.
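A toy sketch of the quadtree/platelet decomposition is given below. It uses a plain mean-squared-error threshold (`max_err`, a hypothetical parameter) in place of the paper's rate-distortion optimization, and models every leaf with a single least-squares plane rather than the full set of edge-adaptive platelet types.

```python
import numpy as np

def fit_platelet(block):
    """Least-squares plane z = a*x + b*y + c over one block."""
    h, w = block.shape
    ys, xs = np.mgrid[0:h, 0:w]
    A = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)], axis=1)
    coef, *_ = np.linalg.lstsq(A, block.ravel(), rcond=None)
    approx = (A @ coef).reshape(h, w)
    return approx, np.mean((approx - block) ** 2)

def quadtree(depth, x, y, size, max_err, out):
    """Split a block until each leaf is well modelled by one platelet."""
    block = depth[y:y + size, x:x + size]
    approx, err = fit_platelet(block)
    if err <= max_err or size <= 4:          # leaf: store the platelet
        out[y:y + size, x:x + size] = approx
        return [(x, y, size)]
    half = size // 2                          # otherwise recurse on quadrants
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            leaves += quadtree(depth, x + dx, y + dy, half, max_err, out)
    return leaves
```

On a perfectly planar depth ramp this produces a single leaf; real depth maps split finely around object boundaries and coarsely over smooth surfaces, which is what makes the representation compact.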
An efficient way to transmit multi-view images is to send the texture image together with a corresponding depth image. The depth image specifies the distance between each pixel and the camera. With this information, arbitrary views can be generated at the decoder. In this paper, we propose a new algorithm for the coding of depth images that provides an efficient representation of smooth regions as well as geometric features such as object contours. Our algorithm uses a segmentation procedure based on a quadtree decomposition and models the depth image content with piecewise-linear functions. We achieved a bit-rate as low as 0.33 bit/pixel, without any entropy coding. The attractiveness of the coding algorithm is that, by exploiting specific properties of depth images, no degradation appears along discontinuities, which is important for the perceived depth.
Global-motion estimators are an important part of current video-coding systems like MPEG-4, content analysis
and description systems like MPEG-7, and many video-object segmentation algorithms. Feature-based motion
estimators use the motion vectors obtained for a set of selected points to calculate the parameters of the global-motion
model. This involves the detection of feature points, the computation of correspondences between two sets
of features, and the motion parameter estimation. In this paper, we will present a feature-based global-motion
estimation system and discuss each of its parts in detail. The idea is to provide an overview of a general purpose
feature-based motion estimator and to point out the important design aspects. We evaluate the performance
of different feature detection algorithms, propose an efficient feature-correspondence algorithm, and compare
a non-linear parameter estimation with its linear approximation. Finally, RANSAC-based
robust parameter estimation is examined; we show why it does not reach its theoretical performance
and propose a modification to increase its accuracy. Our global-motion estimator achieves an average accuracy of ≈0.15 pixels with real-time execution.
Global-motion estimation algorithms as employed in the MPEG-4 or H.264 video coding standards describe motion with a set of abstract parameters. These parameters model the camera motion, but they cannot be directly related to physical quantities such as rotation angles or the focal length. We present a two-step algorithm to factorize these abstract parameters into physically meaningful operations. The first step applies a fast linear estimation method. In an optional second step, these parameters can be refined with a non-linear optimization algorithm. The attractiveness of our algorithm is its combination with the multi-sprite concept, which allows for unrestricted rotational camera motion, including varying focal lengths. We present results for several sequences, including the well-known Stefan sequence, which can only be processed with the multi-sprite approach.
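Under the pure-rotation assumption, the abstract motion parameters form a homography H ∝ K₂ R K₁⁻¹. The following sketch isolates the rotation once the focal lengths are known; it assumes the principal point at the origin and square pixels, and is an illustration of the factorization idea rather than the paper's linear estimator.

```python
import numpy as np

def factorize(H, f1, f2):
    """Recover the relative camera rotation from a homography,
    assuming purely rotational camera motion: H ~ K2 @ R @ inv(K1)."""
    K1 = np.diag([f1, f1, 1.0])
    K2 = np.diag([f2, f2, 1.0])
    R = np.linalg.inv(K2) @ H @ K1
    R /= np.cbrt(np.linalg.det(R))       # remove the projective scale
    U, _, Vt = np.linalg.svd(R)          # snap to the nearest rotation matrix
    R = U @ Vt
    pan = np.degrees(np.arctan2(R[0, 2], R[2, 2]))  # rotation about the y-axis
    return R, pan
```

The SVD projection absorbs estimation noise, so the returned matrix is a proper rotation from which pan, tilt, and roll angles can be read off.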
Object-oriented coding in the MPEG-4 standard enables the separate processing of foreground objects and the scene background (sprite). Since the background sprite only has to be sent once,
transmission bandwidth can be saved. This paper shows that the concept of merging several views of a non-changing scene background into a single background sprite is usually not the most efficient way to transmit the background image. We have found that the counter-intuitive approach of splitting the background into several independent parts can reduce the overall amount of data. For this reason, we propose an algorithm that provides an optimal partitioning
of a video sequence into independent background sprites (a multi-sprite), resulting in a significant reduction of the involved coding cost. Additionally, our algorithm results in background sprites with better quality by ensuring that the sprite resolution has at least the final display resolution throughout the sequence.
Even though our sprite generation algorithm creates multiple sprites
instead of a single background sprite, it is fully compatible with the existing MPEG-4 standard. The algorithm has been evaluated with several test-sequences, including the well-known Table-tennis and Stefan sequences. The total coding cost could be reduced by factors of about 2.7 or even higher.
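The optimal partitioning into independent sprites can be found by dynamic programming over all split points, as sketched below; `cost` is a hypothetical stand-in for the per-sprite coding cost (in the paper, this reflects sprite area and quality constraints).

```python
def partition(n, cost):
    """Optimal partition of frames 0..n-1 into consecutive sprite ranges.

    cost(i, j) returns the coding cost of one sprite built from frames
    i..j inclusive; dynamic programming finds the set of split points
    minimizing the total cost over the whole sequence.
    """
    best = [0.0] + [float('inf')] * n     # best[j]: min cost of frames 0..j-1
    cut = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            c = best[i] + cost(i, j - 1)
            if c < best[j]:
                best[j], cut[j] = c, i
    ranges, j = [], n                     # backtrack the chosen split points
    while j > 0:
        ranges.append((cut[j], j - 1))
        j = cut[j]
    return best[n], ranges[::-1]
```

When merging distant views inflates the sprite area superlinearly, the optimum naturally splits the sequence into several sprites, which matches the counter-intuitive finding above.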
Many TV broadcasters and film archives are planning to make their
collections available on the Web. However, a major problem with large
film archives is the fact that it is difficult to search the content
visually. A video summary is a sequence of video clips extracted from
a longer video. Much shorter than the original, the summary preserves
its essential messages. Hence, video summaries may speed up the search.
Videos that have full horizontal and vertical resolution will usually
not be accepted on the Web, since the bandwidth required to transfer
the video is generally very high. If the resolution of a video is
reduced in an intelligent way, its content can still be understood. We
introduce a new algorithm that reduces the resolution while preserving
as much of the semantics as possible.
In the MoCA (movie content analysis) project at the University of
Mannheim we developed the video summarization component and tested it
on a large collection of films. In this paper we discuss the
particular challenges which the reduction of the video length poses,
and report empirical results from the use of our summarization tool.
We propose an automatic camera calibration algorithm for court sports. The obtained camera calibration parameters are required for applications that need to convert positions in the video frame to real-world coordinates or vice versa. Our algorithm uses a model of the arrangement of court lines for calibration. Since the court
model can be specified by the user, the algorithm can be applied to a variety of different sports.
The algorithm starts with a model initialization step which locates the court in the image without any user assistance or a priori knowledge about the most probable position. Image pixels are classified as court line pixels if they pass several tests including color and local texture constraints. A Hough transform is applied to
extract line elements, forming a set of court line candidates. The subsequent combinatorial search establishes correspondences between lines in the input image and lines from the court model. For the succeeding input frames, an abbreviated calibration algorithm is used, which predicts the camera parameters for the new image
and optimizes the parameters using a gradient-descent algorithm.
We have conducted experiments on a variety of sports videos (tennis, volleyball, and goal-area sequences of soccer games). Video scenes with considerable difficulties were selected to test the robustness of the algorithm. Results show that the algorithm is very robust to occlusions, partial court views, and bad lighting conditions.
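The line-extraction step can be illustrated with a bare-bones Hough transform over the binary court-line mask; real implementations add peak refinement and line merging, which are omitted here.

```python
import numpy as np

def hough_lines(mask, n_theta=180, top=4):
    """Accumulate (theta, rho) votes for the white pixels of a binary
    mask and return the strongest line candidates as (theta, rho) pairs,
    where x*cos(theta) + y*sin(theta) = rho."""
    ys, xs = np.nonzero(mask)
    thetas = np.linspace(0, np.pi, n_theta, endpoint=False)
    diag = int(np.ceil(np.hypot(*mask.shape)))
    n_rho = 2 * diag + 1                        # rho in [-diag, diag]
    acc = np.zeros((n_theta, n_rho))
    for t, th in enumerate(thetas):
        rho = np.round(xs * np.cos(th) + ys * np.sin(th)).astype(int) + diag
        np.add.at(acc[t], rho, 1)               # one vote per line pixel
    peaks = np.argsort(acc.ravel())[::-1][:top]
    return [(thetas[p // n_rho], (p % n_rho) - diag) for p in peaks]
```

The detected peaks form the court-line candidates that are then matched combinatorially against the user-specified court model.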
This paper presents a new algorithm for video-object segmentation,
which combines motion-based segmentation, high-level object-model
detection, and spatial segmentation into a single framework.
This joint approach overcomes the disadvantages of these algorithms
when applied independently. These disadvantages include the low semantic accuracy of spatial segmentation and the inexact object boundaries obtained from object-model matching and motion segmentation. The proposed algorithm alleviates three problems common to all motion-based segmentation algorithms. First, it completes object areas that cannot be clearly distinguished
from the background because their color is near the background color.
Second, object parts that would otherwise be excluded
because they are not moving are still added to the object mask. Finally, when several objects are moving but only one is of interest, the algorithm detects that the remaining regions
do not belong to any object model and removes them from the foreground. This suppresses regions erroneously classified as moving, as well as moving objects that are irrelevant to the user.
In this paper, we propose a new system for video object detection
based on user-defined models. Object models are described by
'model graphs' in which nodes represent image regions and edges
denote spatial proximity. Each node is attributed with color and
shape information about the corresponding image region. Model
graphs are specified manually based on a sample image of the
object. Object recognition starts with automatic color segmentation of the input image. For each region, the same features are extracted as specified in the model graph. Recognition is based on finding a
subgraph in the image graph that matches the model graph. Evidently, it is not possible to find an isomorphic subgraph, since node and edge attributes will not match exactly. Furthermore, the automatic segmentation step leads to an oversegmented image. For this reason, we employ inexact graph matching, where several nodes of the image graph may be mapped onto a single node in the model graph. We have applied our object recognition algorithm to cartoon sequences. This class of sequences is difficult to handle with current automatic segmentation algorithms because motion estimation is hampered by large homogeneous regions and because the object appearance is typically highly variable. Experiments show that our algorithm can robustly detect the specified objects and also accurately find the object boundary.
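The many-to-one mapping at the heart of the inexact matching can be sketched as below. To stay short, node attributes are reduced to a single scalar "color", edge (proximity) consistency is omitted, the discard penalty is a hypothetical weighting, and the search is brute force, so it only works for small graphs.

```python
from itertools import product

def match_graph(model, image, discard_penalty=0.3):
    """Inexact matching of an over-segmented image to a model graph.

    Every image region is mapped to a model node or discarded (at a
    penalty); several regions may share one model node, which absorbs
    oversegmentation.  model and image map node names to a scalar
    color attribute.  Returns (cost, mapping)."""
    i_nodes = list(image)
    best_cost, best_map = float('inf'), None
    for assign in product(list(model) + [None], repeat=len(i_nodes)):
        cost, used = 0.0, set()
        for region, node in zip(i_nodes, assign):
            if node is None:
                cost += discard_penalty           # unexplained region
            else:
                cost += abs(image[region] - model[node])
                used.add(node)
        if used == set(model) and cost < best_cost:  # every model node covered
            best_cost, best_map = cost, dict(zip(i_nodes, assign))
    return best_cost, best_map
```

Two oversegmented regions with similar color are correctly absorbed into one model node, which is the behavior that makes the approach robust against automatic segmentation errors.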
This paper presents a fully software-based MPEG-2 encoder architecture, which uses scene-change detection to optimize the Group-of-Pictures (GOP) structure for the actual video sequence. This feature enables easy, lossless edit cuts at scene-change positions and it also improves overall picture quality by providing good reference frames for motion prediction. Another favorable aspect is the high coding speed obtained, because the encoder is based on a novel concept for parallel MPEG coding on SMP machines. This concept allows the use of advanced frame-based coding algorithms for motion estimation and adaptive quantization, thereby enabling high-quality software encoding in real-time. Our proposal can be combined with the conventional parallel computing approach on a slice basis, to further improve parallelization efficiency. The concepts in the current SAMPEG implementation for MPEG-2 are directly applicable to MPEG-4 encoders.
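A common way to place GOP boundaries at scene changes, sketched here as an illustration (the histogram-difference cut detector and its threshold are generic choices, not necessarily the SAMPEG detector):

```python
import numpy as np

def gop_boundaries(frames, bins=32, thresh=0.5):
    """Place GOP starts at detected scene changes.

    A cut is declared when the L1 distance between successive
    normalized luminance histograms exceeds a threshold; each detected
    cut starts a new GOP with an I-frame, giving later frames a clean
    reference for motion prediction."""
    starts, prev = [0], None
    for i, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
        hist = hist / hist.sum()
        if prev is not None and np.abs(hist - prev).sum() > thresh:
            starts.append(i)            # new GOP begins with an I-frame here
        prev = hist
    return starts
```

Aligning GOPs with cuts in this way is also what permits lossless editing: a splice at a scene change never breaks an open prediction chain.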