In today multimedia audio-video communication systems, data compression plays a fundamental role by reducing the bandwidth waste and the costs of the infrastructures and equipments. Among the different compression standards, the MPEG-4 is becoming more and more accepted and widespread. Even if one of the fundamental aspects of this standard is the possibility of separately coding video objects (i.e. to separate moving objects from the background and adapt the coding strategy to the video content), currently implemented codecs work only at the full-frame level. In this way, many advantages of the flexible MPEG-4 syntax are missed. This lack is due both to the difficulties in properly segmenting moving objects in real scenes (featuring an arbitrary motion of the objects and of the acquisition sensor), and to the current use of these codecs, that are mainly oriented towards the market of DVD backups (a full-frame approach is enough for these applications).
In this paper we propose a codec for MPEG-4 real-time object streaming, that codes separately the moving objects and the scene background. The proposed codec is capable of adapting its strategy during the transmission, by analysing the video currently transmitted and setting the coder parameters and modalities accordingly. For example, the background can be transmitted as a whole or by dividing it into “slightly-detailed” and “highly detailed” zones that are coded in different ways to reduce the bit-rate while preserving the perceived quality. The coder can automatically switch in real-time, from one modality to the other during the transmission, depending on the current video content. Psychovisual masks and other video-content based measurements have been used as inputs for a Self Learning Intelligent Controller (SLIC) that changes the parameters and the transmission modalities.
The current implementation is based on the ISO 14496 standard code that allows Video Objects (VO) transmission (other Open Source Codes like: DivX, Xvid, and Cisco’s Mpeg-4IP, have been analyzed but, as for today, they do not support VO). The original code has been deeply modified to integrate the SLIC and to adapt it for real-time streaming. A personal RTP (Real Time Protocol) has been defined and a Client-Server application has been developed. The viewer can decode and demultiplex the stream in real-time, while adapting to the changing modalities adopted by the Server according to the current video content.
The proposed codec works as follows: the image background is separated by means of a segmentation module and it is transmitted by means of a wavelet compression scheme similar to that used in the JPEG2000. The VO are coded separately and multiplexed with the background stream. At the receiver the stream is demultiplexed to obtain the background and the VO that are subsequently pasted together.
The final quality depends on many factors, in particular: the quantization parameters, the Group Of Video Object (GOV) length, the GOV structure (i.e. the number of I-P-B VOP), the search area for motion compensation. These factors are strongly related to the following measurement parameters (that have been defined during the development): the Objects Apparent Size (OAS) in the scene, the Video Object Incidence factor (VOI), the temporal correlation (measured through the Normalized Mean SAD, NMSAD). The SLIC module analyzes the currently transmitted video and selects the most appropriate settings by choosing from a predefined set of transmission modalities. For example, in the case of a highly temporal correlated sequence, the number of B-VOP is increased to improve the compression ratio. The strategy for the selection of the number of B-VOP turns out to be very different from those reported in the literature for B-frames (adopted for MPEG-1 and MPEG-2), due to the different behaviour of the temporal correlation when limited only to moving objects. The SLIC module also decides how to transmit the background. In our implementation we adopted the Visual Brain theory i.e. the study of what the “psychic eye” can get from a scene. According to this theory, a Psychomask Image Analysis (PIA) module has been developed to extract the visually homogeneous regions of the background. The PIA module produces two complementary masks one for the visually low variance zones and one for the higly variable zones; these zones are compressed with different strategies and encoded into two multiplexed streams. From practical experiments it turned out that the separate coding is advantageous only if the low variance zones exceed 50% of the whole background area (due to the overhead given by the need of transmitting the zone masks). The SLIC module takes care of deciding the appropriate transmission modality by analyzing the results produced by the PIA module.
The main features of this codec are: low bitrate, good image quality and coding speed. The current implementation runs in real-time on standard PC platforms, the major limitation being the fixed position of the acquisition sensor. This limitation is due to the difficulties in separating moving objects from the background when the acquisition sensor moves. Our current real-time segmentation module does not produce suitable results if the acquisition sensor moves (only slight oscillatory movements are tolerated). In any case, the system is particularly suitable for tele surveillance applications at low bit-rates, where the camera is usually fixed or alternates among some predetermined positions (our segmentation module is capable of accurately separate moving objects from the static background when the acquisition sensor stops, even if different scenes are seen as a result of the sensor displacements). Moreover, the proposed architecture is general, in the sense that when real-time, robust segmentation systems (capable of separating objects in real-time from the background while the sensor itself is moving) will be available, they can be easily integrated while leaving the rest of the system unchanged.
Experimental results related to real sequences for traffic monitoring and for people tracking and afety control are reported and deeply discussed in the paper. The whole system has been implemented in standard ANSI C code and currently runs on standard PCs under Microsoft Windows operating system (Windows 2000 pro and Windows XP).