Scene boundary detection is important for the semantic understanding of video data and is usually based on the coherence between shots. Two approaches have been proposed to measure coherence: a discrete approach and a continuous approach. In this paper, we follow the continuous approach and propose modifications to the causal First-In-First-Out (FIFO) short-term-memory-based model. First, we allow a dynamic memory size so that coherence can be computed reliably regardless of the size of each shot. Second, some shots can be removed from the memory buffer outside the FIFO rule; these removed shots have no, or only small, foreground objects. Using this model, we detect scene boundaries by computing shot coherence. In the coherence computation we add a new term, the number of intermediate shots between the two compared shots, because the effect of intermediate shots is important in modeling shot recall. We also consider shot activity, which is important for reflecting human perception. We have experimented with our model on videos of different genres and obtained reasonable results.
In this paper, an efficient tool to extract video objects from video sequences is presented. With this tool, it is possible to segment video content in a user-friendly manner to provide easy manipulation of video content. The tool is comprised of two stages. Firstly, the initial object extraction is performed using the Recursive Shortest Spanning Tree (RSST) algorithm and the Binary Partition Tree (BPT) technique. Secondly, automatic object tracking is performed using a single frame forward region tracking method. In the first stage, an initial partition is created using the RSST algorithm which allows the user to specify the initial number of regions. This process is followed by progressive binary merging of these regions to create the BPT. The purpose of creating the BPT is to allow the user to browse the content of the scene in a hierarchical manner. This merging step creates the binary tree with nearly double the user-specified number of homogenous regions. User interaction then allows grouping particular regions into objects. In the second stage, each subsequent frame is segmented using the RSST and corresponding regions are identified using a forward region tracking method.
We propose a multi-scale and multi-modal analysis and processing scheme for audio-video data. Using a non-linear scale-space technique, audio-video data is analyzed and processed so that it is invariant under various imaging and hearing conditions. Degradations due to Lyapunov and structural instabilities are suppressed by this scale-space technique without destroying essential semantic relations. On the basis of an audio-video segmentation, its arrangements are quantified in terms of spatio-temporal inclusion relations and dynamic ordering relations by means of scaling connectivity relations. These relations impose a topological structure on top of the audio-video scale-space, inducing unimodal and multi-modal semantics. Our scheme is illustrated separately for video, audio, and audio-video material, the latter pointing out the added value of integrating audio and video.
A practical method is presented for creating a high-dimensional index structure that adapts to the data distribution and scales well with database size. Typical media descriptors, such as texture features, are high dimensional and not uniformly distributed in the feature space. The performance of many existing methods degrades when the data is not uniformly distributed; the proposed method offers an efficient solution to this problem. First, the data's marginal distribution along each dimension is characterized using a Gaussian mixture model, whose parameters are estimated with the well-known Expectation-Maximization method. These model parameters can also be estimated sequentially for on-line updating. Using the marginal distribution information, each data dimension can be partitioned so that each bin contains approximately an equal number of objects. Experimental results on a real image-texture data set are presented. Comparisons with existing techniques, such as the well-known VA-File, demonstrate a significant overall improvement.
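The equal-occupancy partitioning step can be sketched as follows. This is a minimal illustration, not the paper's method: it derives the bin boundaries directly from empirical quantiles, whereas the paper obtains them from the CDF of the EM-fitted Gaussian mixture; the variable names, data, and bin count are invented for the example.

```python
import numpy as np

def equal_count_edges(values, n_bins):
    """Place bin boundaries at empirical quantiles so that every bin
    holds roughly the same number of objects (the paper instead derives
    these boundaries from the CDF of an EM-fitted Gaussian mixture)."""
    return np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1))

# A skewed, clustered 1-D marginal, like one texture-feature dimension.
rng = np.random.default_rng(0)
dim = np.concatenate([rng.normal(0.0, 1.0, 6000), rng.normal(8.0, 0.5, 2000)])

edges = equal_count_edges(dim, n_bins=8)
counts, _ = np.histogram(dim, bins=edges)
print(counts)          # every bin holds close to 1000 of the 8000 objects
```

Because the boundaries track the marginal distribution, skewed or clustered dimensions still yield balanced bins, which is the property the index relies on.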
A fundamental task in video analysis is to organize and index multimedia data in a meaningful manner so as to facilitate user access for tasks such as browsing and retrieval. This paper addresses the problem of automatic index generation of movie databases based on audiovisual information. In particular, given a movie we first extract key movie events including two-speaker dialog scenes, multiple-speaker dialog scenes and hybrid scenes by using the proposed window-based sweep algorithm and the K-means clustering algorithms. Following event detection, the identity of each individual speaker in a dialog scene is recognized based on a statistical maximum likelihood approach. The identification relies on the likelihood ratio calculation between the incoming speech data and Gaussian mixture models of the speakers and the background. It is evident that the event and the speaker identity information will serve as a crucial part of the movie index table. Preliminary experimental results show that, by integrating multiple media information, we can obtain robust and meaningful event detection and speaker identification results.
Content-based image retrieval (CBIR) is receiving much attention because of the ever-growing amount of pictorial content in digital libraries worldwide. Much of this content lacks textual metadata because generating it is prohibitively expensive in time or money. CBIR systems operate only on the images themselves, extracting visual primitives such as color, texture, or shape. However, this approach has a downside: full semantic information cannot be extracted from images alone, a problem known as the semantic gap. This paper introduces an approach that aims at bridging this gap and thereby improving system performance. This is achieved by using user feedback to cluster images into thematic groups. The feedback is used for global improvement of the system rather than only within the scope of one query, leading to a system that continuously learns the semantics of the image base. A prototype has been implemented and evaluated against a commercial system. The results show a significant increase in both recall and precision.
The current state of the art in content description (MPEG-7) does not provide a rich set of tools to create functional metadata (metadata that contains not only the description of the content but also a set of methods that can be used to interpret, change, or analyze the content). This paper presents a framework whose primary goal is the integration of functional metadata into the existing standards. Whenever it is important to know not only what is in the multimedia content but also what is happening with the information in the content, functional metadata can describe this; examples include news tickers, sport results, and online auctions. To extend content description schemes with extra functionality, MPEG-7-based descriptors are defined that allow the content creator to add his own properties and methods to the multimedia data, making the multimedia data self-describing and manipulable. These descriptors incorporate concepts from object technology such as objects, interfaces, and events: descriptors allow the content creator to add properties to objects and interfaces, and methods can be defined with a descriptor and activated through events. The generic use of these properties and methods is the core of the functional metadata framework. A complete set of MPEG-7-based descriptors and description schemes is presented, enabling the content creator to add functional metadata to multimedia data, and an implementation of the proposed framework demonstrates the principles of functional metadata. The paper thus presents a method for adding extra functionality to metadata, and hence to multimedia data, and shows that doing so preserves existing content description methods while extending the possibilities of content description.
This paper presents an XML-based graphic database system for 3D graphic data. We describe a 3D database system supporting semantics of 3D objects and content-based retrievals. The data model underlying the 3D graphic database system represents 3D scenes using domain objects and their spatial relations. An XML-based data modeling language called 3DGML has been designed to support the data model. It offers an object-oriented 3D image modeling mechanism that separates low-level implementation details of 3D objects from their semantic roles in a 3D scene. The user can pose a visual query based on the contents of 3D images including 3D shapes and spatial relations. The query based on the shapes of 3D objects retrieves 3D images containing objects of similar shapes to a given object. The similarity of objects is determined by comparing their contours. We believe our work is one of the earliest efforts to support content-based retrieval for 3D graphic images.
Smart rooms provide advanced interfaces for networked information systems. Smart rooms include a variety of sensors that can analyze the behavior of persons in the room; these sensors allow people to issue commands without direct contact with equipment. Video is one important modality for smart room input--video analysis can be used for determining the presence of people in the room, gesture analysis, facial analysis, etc. This paper outlines the architecture of a real-time video analysis system for smart rooms. The system uses multiple cameras, each with its own video signal processor. We use algorithms that can be performed in real-time to capture basic information about the persons in the room.
The increasing amount of visual data in many applications necessitates sophisticated indexing techniques for retrieval based on image content. The recent JPEG2000 standard addresses the need for content-based coding and manipulation of visual media, and future multimedia databases are expected to store images in the JPEG2000 format. Hence, it is crucial to develop indexing and retrieval systems that operate in this framework. Several content-based indexing techniques have been proposed in the wavelet domain, the transform of choice in the JPEG2000 standard. However, most of these techniques rely on extracting low-level features such as color, texture, and shape that represent the global image content. In this paper, we propose a region-based indexing technique in the JPEG2000 framework. Specific regions of interest (ROI), which ensure the reconstructed quality of the image, are tracked and analyzed through the different layers of the wavelet transform in the coding process. Shape features are extracted from the ROI sketch in the uncompressed domain; texture and color features are extracted in the compressed domain at the wavelet resolutions corresponding to these regions. Indexing and retrieval are based on a combination of these features. Extensive simulations have been performed in the JPEG2000 framework, and experimental results demonstrate that the proposed scheme outperforms existing wavelet-based indexing approaches.
Most content-based image retrieval systems are based on the RGB color space. Using the average amounts of red, green, and blue is appropriate for natural images, but an e-catalog image contains only a few colors, and its average color is meaningless to customers. This paper presents a color comparison scheme based on HSI color ratios to improve retrieval accuracy on e-catalog images. We divide hue into 30-degree intervals, yielding 12 colors. By considering saturation and intensity, and eliminating some duplicate combinations, we further divide each hue into 15 categories, resulting in 186 representative colors, far fewer than the 16.7 million colors of 24-bit RGB. The resulting 186-element HSI histogram is represented using a presence bitmap vector (186 bits) and a ratio vector (93 bytes). For e-catalog images, most of the presence vector bits are 0 since there are only a few colors. We have implemented a prototype retrieval system and shown the usefulness of the proposed scheme by comparing measures such as precision and search speed with RGB-histogram-based schemes.
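The presence-bitmap idea can be sketched for the hue component alone. This is a hedged simplification: the real scheme further splits each hue by saturation and intensity to reach 186 representative colors and handles achromatic pixels separately, and the function names and test pixels here are illustrative only.

```python
import colorsys

def hue_bin(r, g, b):
    """Map an RGB pixel (0-255 channels) to one of 12 hue bins of 30
    degrees each -- the hue-only slice of the paper's 186-color palette."""
    h, _, _ = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return round(h * 360) % 360 // 30

def presence_bitmap(pixels):
    """Set bit i when hue bin i occurs anywhere in the image; catalog
    images have few colors, so most bits stay 0."""
    bits = 0
    for rgb in pixels:
        bits |= 1 << hue_bin(*rgb)
    return bits

# A two-color catalog-style image: pure red (bin 0) and pure green (bin 4).
img = [(255, 0, 0)] * 50 + [(0, 255, 0)] * 30
bm = presence_bitmap(img)
print(f"{bm:012b}")    # -> 000000010001
```

Comparing two such bitmaps with a bitwise AND gives a fast first-stage filter before the finer ratio-vector comparison.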
Recently, web sites such as e-business and shopping-mall sites deal with large amounts of image information. To find a specific image from these sources, we usually use web search engines or image database engines that rely on keyword-only retrieval or color-based retrieval with limited search capabilities. This paper presents an intelligent web image retrieval system. We propose the system architecture, texture- and color-based image classification and indexing techniques, and representation schemes for user usage patterns. A query can be given by providing keywords, by selecting one or more sample texture patterns, by assigning color values within positional color blocks, or by combining some or all of these factors. The system keeps track of users' preferences by generating user query logs and automatically adds more search information to subsequent user queries. To show the usefulness of the proposed system, experimental results reporting recall and precision are also presented.
Retrieving images from a large image dataset using image content as a key is an important issue. In this paper, we present a new content-based image retrieval approach using a wavelet transform and subband image segmentation. We first decompose the image using a wavelet transform and adopt a vector quantization (VQ) algorithm to perform automatic segmentation based on image features such as color and texture. The wavelet transform decomposes the image into four subbands (LL, LH, HL, HH); only the LL component is further decomposed until the desired depth is reached. The image segmentation is performed using HSI color and texture features of the low-pass subband component image. The VQ provides a transformation from raw pixel data to a small group of homogeneous classes that are coherent in color and feature space. For managing a large image dataset, image compression is usually considered; segmenting a compressed or subband image is therefore more efficient than using an uncompressed image, as long as the compressed image preserves the information needed for the segmentation task. An important aspect of the system is that using a subband image of the wavelet transform reduces the size and noise of the image, which in turn reduces the computational burden of the segmentation. The experimental results of the proposed image retrieval system confirm the feasibility of our approach both in retrieval accuracy and in computational cost compared to using the original image.
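The first decomposition step can be illustrated with a single level of an unnormalized 2-D Haar transform, the simplest wavelet; the abstract does not specify which wavelet filter is used, so this is only a sketch of the LL/LH/HL/HH split that the segmentation then operates on, with subband labels following one common convention.

```python
import numpy as np

def haar2d(img):
    """One level of an (unnormalized) 2-D Haar wavelet transform: split
    the image into a low-pass LL subband plus three detail subbands."""
    a = img[0::2, 0::2].astype(float)   # top-left pixel of each 2x2 block
    b = img[0::2, 1::2].astype(float)   # top-right
    c = img[1::2, 0::2].astype(float)   # bottom-left
    d = img[1::2, 1::2].astype(float)   # bottom-right
    ll = (a + b + c + d) / 4            # half-size approximation image
    lh = (a - b + c - d) / 4            # detail across columns
    hl = (a + b - c - d) / 4            # detail across rows
    hh = (a - b - c + d) / 4            # diagonal detail
    return ll, lh, hl, hh

img = np.arange(64, dtype=float).reshape(8, 8)
ll, lh, hl, hh = haar2d(img)
print(ll.shape)        # each subband is a quarter-size image: (4, 4)
```

Recursing on `ll` alone reproduces the paper's decomposition strategy, and running the segmentation on `ll` operates on a smaller, denoised image.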
Extraction of repetitive patterns of the main melody in a given music piece is investigated in this research. A dictionary-based approach is proposed to achieve the task. The input to the proposed system is a piece of music in numerical music scores (e.g. the MIDI file format); other music forms, such as sound waves, must first be converted to numerical scores. In the system, segmentation is done based on the tempo information, and a music score is decomposed into bars. Each bar is indexed, and a bar index table is built accordingly. Then an adaptive dictionary-based algorithm known as Lempel-Ziv 78 (LZ78) is modified and applied to the bar-represented music scores to extract repetitive patterns. The LZ78 algorithm is slightly modified to achieve better results; the modified version is named "Exhaustive Search with Progressive Length" (ESPLE). After this step, pruning is applied to the dictionary to remove non-repeating patterns. The modified LZ78 and pruning are applied repeatedly to the dictionary generated by the previous cycle until the dictionary converges. Experiments are performed on MIDI files to demonstrate the superior performance of the proposed algorithm.
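The dictionary-building and pruning cycle can be sketched with a plain LZ78 parse over bar indices. This is a simplified stand-in for ESPLE, whose exact modifications the abstract does not give; the phrase-count pruning rule and the toy motif below are assumptions for illustration.

```python
def lz78_parse(bars):
    """LZ78-style dictionary build over a bar-index sequence: grow the
    current phrase while it is already known, and register it (with a
    count) the first time it is new."""
    dictionary, counts, phrase = {}, {}, ()
    for bar in bars:
        phrase = phrase + (bar,)
        if phrase in dictionary:
            counts[phrase] += 1          # phrase seen again: repetition
        else:
            dictionary[phrase] = len(dictionary)
            counts[phrase] = 1
            phrase = ()                  # start a new phrase
    return counts

def prune(counts):
    """Drop phrases seen only once: they are not repetitive patterns."""
    return {p: c for p, c in counts.items() if c > 1}

# A two-bar motif (bars 1, 2) repeated four times.
patterns = prune(lz78_parse([1, 2] * 4))
print(patterns)
```

In the full system this parse-and-prune pass would be re-applied to the surviving dictionary until it converges.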
While most previous work on musical instrument recognition is focused on the classification of single notes in monophonic music, a scheme is proposed in this paper for the distinction of instruments in continuous music pieces which may contain one or more kinds of instruments. Highlights of the system include music segmentation into notes, harmonic partial estimation in polyphonic sound, note feature calculation and normalization, note classification using a set of neural networks, and music piece categorization with fuzzy logic principles. Example outputs of the system are `the music piece is 100% guitar (with 90% likelihood)' and `the music piece is 60% violin and 40% piano, thus a violin/piano duet'. The system has been tested with twelve kinds of musical instruments, and very promising experimental results have been obtained. An accuracy of about 80% is achieved, and the number can be raised to 90% if misindexings within the same instrument family are tolerated (e.g. cello, viola and violin). A demonstration system for musical instrument classification and music timbre retrieval is also presented.
By applying video smoothing techniques to real-time video transmission, the peak rate and rate variability of compressed video streams can be significantly reduced. Moreover, statistical multiplexing of the smoothed traffic can substantially improve network utilization. In this paper we propose a new smoothing scheme, which exploits statistical multiplexing gain that can be obtained after smoothing of individual video streams. We present a new bandwidth allocation algorithm that allows for responsive interactivity. The local re-smoothing algorithm is carried out using an iterative process.
How to facilitate efficient video manipulation and access in a web-based environment is becoming a popular trend for video applications. In this paper, we present a web-oriented video management and application processing system, based on our previous work on multimedia databases and content-based retrieval. In particular, we extend the VideoMAP architecture with specific web-oriented mechanisms, which include: (1) concurrency control facilities for the editing of video data among different types of users, such as Video Administrator, Video Producer, Video Editor, and Video Query Client; different users are assigned various priority levels for different operations on the database; (2) a versatile video retrieval mechanism which employs a hybrid approach integrating a query-based (database) mechanism with content-based retrieval (CBR) functions; its specific language (CAROL/ST with CBR) supports the spatio-temporal semantics of video objects, and also offers an improved mechanism to describe the visual content of videos by a content-based analysis method; (3) a query profiling database which records the `histories' of various clients' query activities; such profiles can be used to provide a default query template when a similar query is encountered by the same kind of user. An experimental prototype system is being developed on top of the existing VideoMAP prototype, using Java and VC++ on the PC platform.
With the advent of set-top boxes, the convergence of TV (broadcasting) and PC (Internet) is set to enter the home environment. Currently, a great deal of activity is occurring in developing standards (TV-Anytime Forum) and devices (TiVo) for local storage on Home Media Servers (HMS). These devices lie at the heart of the convergence of the triad communications/networks - content/media - computing/software. Besides massive storage capacity and being a communications 'gateway', the home media server is characterised by the ability to handle metadata and software that provides an easy-to-use on-screen interface and intelligent search/content-handling facilities. In this paper, we describe a research prototype HMS being developed within the GigaCE project at the Telematica Instituut. Our prototype demonstrates advanced search and retrieval (video browsing), adaptive user profiling, and an innovative 3D component of the Electronic Program Guide (EPG) which represents online presence. We discuss the use of MPEG-7 for representing metadata; the use of MPEG-21 working-draft standards for content identification, description, and rights expression; and the use of HMS peer-to-peer content distribution approaches. Finally, we outline explorative user-behaviour experiments that aim to investigate the effectiveness of the prototype HMS during development.
An automatic web content classification system for web information filtering is proposed in this research. A sample group of web contents is first collected via commercial search engines. The contents are then classified into different subject groups, and more related web pages can be searched for further analysis. This frees users from the troublesome, routine process performed manually with most search engines, and the clustered information can be updated automatically at any specified time. Preliminary experimental results demonstrate the effectiveness of the proposed system.
All traffic models for MPEG-like encoded variable bit rate (VBR) video can be categorized into (i) data rate models (DRMs) and (ii) frame size models (FSMs). Almost all proposed VBR traffic models are DRMs. Since DRMs generate only a data arrival rate, they are good for estimating average packet-loss and ATM buffer-overflow probabilities, but fail to capture such details as the percentage of frames affected. FSMs generate the sizes of individual MPEG frames, and are good for studying frame loss rates in addition to data loss rates. Of the three previously proposed FSMs, (i) one generates frame sizes for full-length movies without preserving GOP periodicity; (ii) another generates frame sizes for full-length movies without preserving size-based video-segment transitions; and (iii) the third generates VBR video traffic for news videos from a scene content description provided to it, presupposing a proper segmentation. In this paper, we propose two segmentation techniques for VBR videos: (a) Equal Number of GOPs in all shot classes (ENG), and (b) Geometrically Increasing Interval Lengths for shot classes (GIIL). Each technique partitions the GOPs in the video into size-based shot classes. The frames in each class produce three data-sets, one each for I-, B-, and P-type frames, and each data-set can be modeled with an axis-shifted Gamma distribution. Markov renewal processes model the inter-class transitions. We have used Q-Q plots to show the visual similarity of model-generated VBR video data-sets with the original data-set, and a leaky-bucket simulation study to show the similarity of data and frame loss rates between model-generated videos and the original video. Our study of frame-based VBR video revealed that the GIIL segmentation technique separates the I-, B-, and P-frames into well-behaved shot classes whose statistical properties are captured by the Gamma-based models.
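Fitting the axis-shifted Gamma model to one of the per-frame-type data-sets can be sketched with a method-of-moments estimate. The abstract does not state which estimator is used, so method of moments is an assumption here, and the shift heuristic and synthetic frame sizes are invented for the example.

```python
import numpy as np

def fit_shifted_gamma(sizes, shift=None):
    """Method-of-moments fit of an axis-shifted Gamma distribution to a
    set of frame sizes (the paper fits one such model per I/B/P data-set)."""
    sizes = np.asarray(sizes, dtype=float)
    if shift is None:
        shift = 0.9 * sizes.min()        # heuristic: shift just below minimum
    x = sizes - shift
    mean, var = x.mean(), x.var()
    shape = mean ** 2 / var              # k     = m^2 / s^2
    scale = var / mean                   # theta = s^2 / m
    return shift, shape, scale

rng = np.random.default_rng(1)
# Synthetic I-frame sizes: Gamma(k=4, theta=300) shifted by 2000 bytes.
sizes = 2000.0 + rng.gamma(4.0, 300.0, 20000)
shift, shape, scale = fit_shifted_gamma(sizes, shift=2000.0)
print(round(shape, 2), round(scale, 1))
```

Drawing from the fitted distribution (per frame type, with class transitions driven by the Markov renewal process) then regenerates synthetic VBR traces.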
Current well-known web browsers are limited to graphical user interfaces (GUIs) geared towards an iconic or hieroglyphic system of communication. Modern speech-enabled user interfaces are built over GUIs and apply a 'Say What You See' (SWYS) paradigm. GUIs and SWYS fail to satisfy the user's need for natural-language multimedia interfaces. The first voice-based OS interfaces, such as the Vocal/Auditory Multimedia Browser (VAMB), combined spoken quasi-natural-language input with a graphical display, audio and verbal output, and supported a rudimentary browsing capability. The MultiMedia Browser (MMB) improves on VAMB's speech recognition capabilities and includes a complete web browser with advanced page navigation. MMB applies a What You Say Is What You Get (WYSIWYG) paradigm that utilizes both an Associative Calculus Environment (ACE) and a Dynamic Hierarchical Limited Vocabulary (DHLV). The associative calculus is a set of rules specifying how sentences using the ACE syntax are to be interpreted relative to graphical displays; the DHLV dynamically minimizes the contents of the vocabulary pool depending on the current multimedia configuration.
Due to increased interest in interactive personalized multimedia services, the design of continuous media (CM) servers supporting this functionality has received attention from industry and academia. These systems automatically create a sequence of CM segments optimized to the interests of each user based on predefined preferences and impromptu queries. Applications such as distance learning, news-on-demand, interactive training, and home shopping would be significantly improved by this functionality. Two critical issues are 1) the automatic creation of an optimized script for each user, and 2) data management and retrieval that maximize server performance in a multi-user environment. Because of the great flexibility of presentation and the potential conflicts among the requirements of concurrent users sharing huge amounts of CM data without redundancy, the design of a CM server that supports these applications, especially its data placement and retrieval scheduling, is challenging. This paper investigates and proposes data placement and retrieval scheduling techniques for multi-disk CM servers to support such applications: how to share the same content among multiple users, how to compose personalized content on demand for each user, and how to support continuous display of the edited content without jitters or interruptions, termed hiccups. The proposed techniques provide fast retrieval of CM data using random data placement across disks with deadline-driven scheduling, and prefetch data with minimal latency to statistically guarantee a continuous display. Simulation results demonstrate the feasibility of on-demand composition and continuous display of personalized content with a hiccup probability of less than one in a million. They also show a startup latency of less than a second, which is acceptable for most interactive applications.
Industrial solutions for the distribution of video (and other multimedia sources) are beginning to assume no quality-of-service (QoS) properties from the network. To overcome congestion problems in the core of the worldwide Internet, mirror sites are dispatched at the edges of the network, so the QoS problem remains relevant only at the network extremities. Nevertheless, this strategy requires replicating the multimedia database (denoted MDB) at multiple edge points to meet real-time constraints, and establishing specific mechanisms between mirror sites to satisfy customer needs for video distribution. For each of these two kinds of constraints, we propose a unified data/network representation.
Interval caching can boost the throughput of a video server by caching consecutive video requests in a global cache. In this paper, we propose a novel cache admission control and replacement algorithm called ROC (Resist-Overload Capability) to efficiently manage cache usage in a video server with interval caching. First, we introduce a deterministic cache admission control scheme that guarantees QoS but suffers from under-utilization of cache resources. Then a statistical-multiplexing-based admission control scheme is presented to improve the efficiency of cache usage by converting the characteristics of VBR video into the number of memory pages that a video interval requires during a service round. This statistical scheme suffers from heavy convolution computation, which decreases its efficiency. Third, we simplify the convolution computation using the Central Limit Theorem, and propose the Resist-Overload Capability metric to characterize the capability to resist occurrences of cache overload. The corresponding ROC-based admission control scheme and replacement algorithm are proposed accordingly. Simulation results indicate that the ROC scheme greatly improves the efficiency of cache management for a VBR video server with interval caching.
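The Central-Limit-Theorem simplification can be sketched as follows: instead of convolving the per-interval page-demand distributions, the aggregate demand is approximated as Gaussian and the overload probability is read off the normal tail. The per-interval means and variances and the cache size below are made-up inputs, not values from the paper.

```python
import math

def overload_probability(means, variances, cache_pages):
    """CLT approximation of the probability that the memory pages demanded
    by all admitted intervals in one service round exceed the cache size,
    replacing the exact convolution of per-interval demand distributions."""
    mu = sum(means)
    sigma = math.sqrt(sum(variances))
    z = (cache_pages - mu) / sigma
    # Gaussian upper-tail probability P[demand > cache_pages]
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# 50 VBR intervals, each demanding 100 pages on average (variance 400).
p = overload_probability([100.0] * 50, [400.0] * 50, cache_pages=5300)
print(round(p, 4))
```

An admission controller in this style would accept a new interval only while the resulting overload probability stays below a target threshold.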
Transmission of real-time components, such as the video and voice of multimedia streams, over internets requires pre-allocation of network bandwidth from source to destination, as well as CPU cycles, I/O bandwidth, etc. in the server and the client providing multimedia services. This paper presents a distributed version of the Utility Model for admission control and Quality of Service (QoS) adaptation in a multi-server multimedia service provider. We propose a broker for managing the resources of the servers. This version of the Utility Model is quasi-distributed: computations for resource allocation are done at a single site (the broker), but the resources considered are distributed over multiple servers. The paper presents the architecture of the broker and the algorithm it uses to select sessions so that QoS requirements are met while revenue is maximized. The QoS adaptation policy used to achieve fault tolerance during server failure is also described.
Modern optical transmission technology using wavelength division multiplexing can support aggregate transmission speeds on the order of hundreds of gigabits per second, but photonic switching speeds are relatively more constrained at the present time. Burst switching has recently been proposed as a solution, where long segments of data with identical destinations and quality-of-service objectives are handled together as unified switchable entities. In this paper we examine application of this concept to video-on-demand servers, and analyze the performance. A burst-switching distributed-server architecture is proposed. It is shown that the switch performance may be enhanced by using recent concepts of bifurcated input queueing switches.
MPEG-family codecs generate variable-bit-rate (VBR) compressed video with significant multiple-time-scale bit-rate variability. Smoothing techniques remove the periodic fluctuations generated by the coding modes, but the global efficiency of network resource allocation remains low due to scene-time-scale variability. RCBR techniques provide suitable means of achieving higher efficiency. Among the RCBR techniques described in the literature, the 2RCBR mechanism seems especially suitable for video-on-demand: it takes advantage of knowledge of the stored video to calculate the renegotiation intervals, and of the client buffer memory to perform work-ahead buffering. 2RCBR achieves 100% bandwidth global efficiency with only two renegotiation levels. The algorithm studies the second derivative of the cumulative video sequence to find sharp-sloped inflection points that indicate changes in scene complexity. By its nature, 2RCBR is well suited to delivering MPEG-2 scalable sequences over the network, because it can assure a constant bit rate to the base MPEG-2 layer and use the higher-rate intervals to deliver the enhancement MPEG-2 layer. However, slight changes in the algorithm parameters must be introduced to attain optimal behavior. This is verified by means of simulations on MPEG-2 video patterns.
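The inflection-point search can be sketched by thresholding the discrete second difference of the cumulative bit curve, which equals the frame-to-frame change in bit rate. The threshold value and the synthetic two-scene trace are assumptions for illustration.

```python
import numpy as np

def inflection_points(frame_bits, threshold):
    """Flag sharp-sloped inflection points of the cumulative bit curve by
    thresholding its discrete second difference; second[i] equals
    frame_bits[i+2] - frame_bits[i+1], so i + 2 indexes the first frame
    at the new complexity level."""
    cumulative = np.cumsum(frame_bits)
    second = np.diff(cumulative, n=2)
    return np.nonzero(np.abs(second) > threshold)[0] + 2

# A quiet scene (1000 bits/frame) followed by a complex one (5000 bits/frame).
bits = np.concatenate([np.full(100, 1000.0), np.full(100, 5000.0)])
points = inflection_points(bits, threshold=2000.0)
print(points)          # the rate change is located at frame 100
```

The detected points delimit the renegotiation intervals between the two rate levels.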
The proxy mechanisms widely used in WWW systems offer low-delay data delivery by means of a ``proxy server''. By applying the proxy mechanism to video transfer, we expect real-time and interactive video streaming without introducing extra load on the system. In addition, if the proxy appropriately adjusts the quality of cached video data to the user's demand, video streams can be delivered to users in accordance with their heterogeneous QoS requirements. In this paper, we propose proxy caching mechanisms that achieve high-quality video transfer considering the user's demand and the available bandwidth. In our system, a video stream is divided into pieces; the proxy caches them in a local buffer, adjusts their quality if necessary, transmits them to users, replaces cached data as needed, and retrieves pieces from the video server, all in consideration of the user's requirements. We evaluate the proposed caching mechanisms and compare their performance in terms of required buffer size, play-out delay, and video quality. The results confirm the validity of video quality adjustment at the proxy.
With the dramatic growth of multimedia streams, the efficient distribution of stored video has become a major concern. There are two basic caching strategies: whole-video caching and caching based on layered encoded video; the latter can satisfy the requirements of highly heterogeneous access to the Internet. Conventional caching strategies assign each object a cache gain calculated from its popularity or popularity density, and determine which videos and which layers should be cached. In this paper, we first investigate the proxy-based delivery model for stored video, and then propose two novel caching algorithms for multimedia proxy caching: DPLayer (for the layered-encoding caching scheme) and DPWhole (for the whole-video caching scheme). The two algorithms use a dynamic-programming resource-allocation model to select the optimal subset of objects to cache at the proxy. Simulations show that our algorithms achieve better performance than existing schemes. We also analyze the computational and space complexity of the algorithms, and introduce a regulative parameter to compress the state space of the dynamic programming problem and reduce the complexity.
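The dynamic-programming selection can be sketched as a 0/1 knapsack over candidate objects, maximizing total cache gain under the proxy's storage capacity. The abstract does not give the exact recurrence behind DPLayer/DPWhole, so this generic knapsack, and the gains and sizes below, are illustrative assumptions.

```python
def select_cache_set(objects, capacity):
    """0/1 knapsack by dynamic programming: choose the subset of video
    objects (or layers) whose total cache gain is maximal under the
    proxy's storage capacity (a simplified resource-allocation core;
    real gains would come from popularity estimates)."""
    # best[c] = max total gain using at most c units of storage
    best = [0] * (capacity + 1)
    choice = [[] for _ in range(capacity + 1)]
    for name, size, gain in objects:
        for c in range(capacity, size - 1, -1):   # reverse scan: 0/1 items
            if best[c - size] + gain > best[c]:
                best[c] = best[c - size] + gain
                choice[c] = choice[c - size] + [name]
    return best[capacity], choice[capacity]

# Hypothetical videos: (name, storage units, cache gain).
videos = [("news", 3, 40), ("movie", 7, 90), ("clip", 2, 25), ("sports", 5, 60)]
gain, cached = select_cache_set(videos, capacity=10)
print(gain, cached)
```

The paper's regulative parameter would coarsen the capacity axis of exactly this kind of DP table to shrink its state space.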
This paper reports our design and implementation of an automatic lecture-room camera-management system. The motivation for building this system is to facilitate online lecture access and reduce the expense of producing high-quality lecture videos. The goal of this project is a camera-management system that can perform like a human video-production team. To achieve this goal, our system collects the audio/video signals available in the lecture room and uses the multimodal information to direct our video cameras at interesting events. Compared to previous work, which has tended to be technology-centric, we started with interviews with professional video producers and used their knowledge and expertise to create video-production rules. We then targeted technology components that allowed us to implement a substantial portion of these rules, including the design of a virtual video director, a speaker cinematographer, and an audience cinematographer. The complete system is installed in parallel with a human-operated video production system in a middle-sized corporate lecture room, and is used for broadcasting lectures over the web. The system's performance was compared to that of a human operator via a user study; the results suggest that our system's quality is close to that of a human-controlled system.