This paper presents our latest work on analyzing and understanding
the content of learning media such as instructional and training
videos, based on the identification of video frame types. In
particular, we achieve this goal by first partitioning a video
sequence into homogeneous segments where each segment contains
frames of the same image type such as slide or web-page; then we
categorize the frames within each segment into one of the
following four classes: slide, web-page, instructor and picture-in-picture, by analyzing various
visual and text features. Preliminary experiments carried out on
two seminar talks have yielded encouraging results. It is our
belief that by classifying video frames into semantic image
categories, we are able to better understand and annotate the
learning media content and subsequently facilitate its content
access, browsing and retrieval.
In content-based image retrieval (CBIR), in order to alleviate learning in the high-dimensional space, Fisher discriminant analysis (FDA) and multiple discriminant analysis (MDA) are commonly used to find an optimal discriminating subspace in which the data are clustered, so that the probabilistic structure of the data can be captured by simpler model assumptions, e.g., Gaussian mixtures. However, because (i) the real number of classes in the image database is usually unknown, and (ii) the image retrieval system acts as a classifier that divides the images into two classes, relevant and irrelevant, the effective dimension of the projected subspace is usually one. In this paper, a novel hybrid feature dimension reduction technique is proposed to construct descriptive and discriminant features at the same time by maximizing the Rayleigh coefficient. The hybrid LDA and PCA analysis not only increases the effective dimension of the projected subspace, but also offers more flexibility and alternatives to LDA and PCA. Extensive tests on benchmark and real image databases have shown the superior performance of the hybrid analysis.
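As a rough illustration of such a hybrid criterion (a sketch only; the paper's exact Rayleigh coefficient may differ, and the blend parameter `alpha` below is an assumption), one can solve a generalized eigenproblem that mixes between-class scatter with total scatter:

```python
import numpy as np

def hybrid_lda_pca(X, y, alpha=0.5, dim=2):
    """Project X onto directions maximizing a hybrid Rayleigh coefficient
    J(w) = w^T (S_b + alpha*S_t) w / w^T (S_w + alpha*I) w.
    alpha=0 recovers Fisher LDA; larger alpha moves toward PCA-like axes.
    (Illustrative formulation, not necessarily the paper's criterion.)"""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    St = (X - mu).T @ (X - mu)            # total scatter
    Sw = np.zeros_like(St)                # within-class scatter
    Sb = np.zeros_like(St)                # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)
    A = Sb + alpha * St
    B = Sw + alpha * np.eye(X.shape[1])
    vals, vecs = np.linalg.eig(np.linalg.solve(B, A))
    order = np.argsort(vals.real)[::-1]   # largest Rayleigh quotients first
    W = vecs.real[:, order[:dim]]
    return X @ W
```

With `alpha > 0` the denominator stays well conditioned even for few samples, and more than one projected dimension carries discriminative energy.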
Nowadays most digital cameras have the functionality of taking short video clips, with the length of video ranging from several seconds to a couple of minutes. The purpose of this research is to develop an algorithm which extracts an optimal set of keyframes from each short video clip so that the user could obtain proper video frames to print out. In current video printing systems, keyframes are normally obtained by evenly sampling the video clip over time. Such an approach, however, may not reflect highlights or regions of interest in the video. Keyframes derived in this way may also be improper for video printing in terms of either content or image quality. In this paper, we present an intelligent keyframe extraction approach to derive an improved keyframe set by performing semantic analysis of the video content. For a video clip, a number of video and audio features are analyzed to first generate a candidate keyframe set. These features include accumulative color histogram and color layout differences, camera motion estimation, moving object tracking, face detection and audio event detection. Then, the candidate keyframes are clustered and evaluated to obtain a final keyframe set. The objective is to automatically generate a limited number of keyframes to show different views of the scene; to show different people and their actions in the scene; and to tell the story in the video shot. Moreover, frame extraction for video printing, which is a rather subjective problem, is considered in this work for the first time, and a semi-automatic approach is proposed.
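The candidate-generation step based on accumulative color histogram differences (one of the features listed above) can be sketched as follows; the bin count and threshold are illustrative assumptions, and the subsequent clustering and evaluation stage is omitted:

```python
import numpy as np

def color_hist(frame, bins=16):
    """Normalized intensity histogram of one video frame."""
    h, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return h / h.sum()

def candidate_keyframes(frames, thresh=0.2):
    """Mark frame i as a candidate when its histogram differs enough
    from the last selected candidate (accumulative-difference idea)."""
    cands = [0]
    ref = color_hist(frames[0])
    for i in range(1, len(frames)):
        h = color_hist(frames[i])
        if 0.5 * np.abs(h - ref).sum() > thresh:   # L1 histogram distance
            cands.append(i)
            ref = h
    return cands
```

In a full system these candidates would then be clustered and scored with the other cues (motion, faces, audio events) to produce the final keyframe set.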
With the multimedia content description interface MPEG-7, we have powerful tools for video indexing, based on which content-based search and retrieval with respect to separate shots and scenes in video can be performed. We especially focus on the parametric motion descriptor. The motion parameters, which are ultimately coded in the descriptor values, require robust content extraction methods. In this paper, we introduce our approach to the extraction of global motion from video. For this purpose, we apply a constrained feature point selection and matching approach in order to find correspondences between images. Subsequently, an M-estimator is used for robust estimation of the motion model parameters. We evaluate the performance of our approach using affine and biquadratic motion models, also in comparison with a standard least-median-of-squares based approach to global motion estimation.
In this paper, we present a general guideline for establishing the relation between a noise distribution model and its corresponding error metric. By designing error metrics, we obtain a much richer set of distance measures beyond the conventional Euclidean distance or SSD (sum of squared differences) and the Manhattan distance or SAD (sum of absolute differences). The corresponding nonlinear estimators, such as the harmonic mean and the geometric mean, as well as their generalized nonlinear operations, are derived. This analysis not only offers more flexibility than the conventional metrics but also discloses the coherent relation between a noise model and its corresponding error metric. We experiment with different error metrics for similarity noise estimation and compute the accuracy of the different methods in three kinds of applications: content-based image retrieval from a large database, stereo matching, and motion tracking in video sequences. In all the experiments, robust results are obtained for noise estimation based on the proposed error metric analysis.
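The correspondence between noise models, error metrics, and their minimizing estimators can be illustrated numerically (a textbook-style sketch, not the paper's derivation):

```python
import numpy as np

# Each noise model pairs with an error metric whose minimizer is a
# different location estimate (standard correspondences):
#   Gaussian noise   -> SSD, minimized by the arithmetic mean
#   Laplacian noise  -> SAD, minimized by the median
#   log-normal noise -> SSD on log-values, minimized by the geometric mean
x = np.array([1.0, 2.0, 4.0, 8.0])

arith = x.mean()                     # argmin_m sum (x - m)^2
med = np.median(x)                   # argmin_m sum |x - m|
geom = np.exp(np.log(x).mean())      # argmin_m sum (log x - log m)^2
harm = 1.0 / np.mean(1.0 / x)        # minimizer of a reciprocal-error metric
```

Each "mean" is simply the optimal estimate under a different assumed noise distribution, which is the coherence between metric and noise model that the paper exploits.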
Content-based image retrieval involves a search through a database of stored images for the best match to a query image. The task is re-formulated as the global optimization problem of finding the correct mapping between corresponding points of the query image and a database image. For 2-dimensional grayscale images, the quality of a match is evaluated as the difference between the pixel values in the area of intersection of the two images: the minimum value of the difference indicates a potential match between the images, with the corresponding optimal values of the parameters defining the mapping. The stated problem is a nonlinear, multimodal global optimization problem. In general form, the mapping includes a rigid-body transform and local object deformation. If no prior information about the images is available, the search space of potential solutions becomes so large that the brute-force approach is intractable. Classical optimization techniques fail due to the presence of many local minima and the non-convex shape of the nonlinear function defining the difference between the images. The following stochastic optimization techniques are compared in the paper: parallel simulated annealing, multi-start, and a hybrid evolutionary algorithm. The methods differ in the degree to which they utilize global and local search, and in the strategy of the global search. The comparison is presented for grayscale images with different initial settings.
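A toy version of one of the compared methods, simulated annealing restricted to integer translations with an SSD cost, might look as follows (parameters and cooling schedule are illustrative assumptions; the paper's mapping additionally covers rotation and local deformation):

```python
import numpy as np

def match_cost(query, image, dx, dy):
    """SSD between the query and the image patch at offset (dx, dy)."""
    h, w = query.shape
    patch = image[dy:dy + h, dx:dx + w]
    return float(((patch - query) ** 2).sum())

def anneal_match(query, image, steps=5000, t0=20.0, seed=0):
    """Toy simulated annealing over integer translations only."""
    rng = np.random.default_rng(seed)
    max_dy = image.shape[0] - query.shape[0]
    max_dx = image.shape[1] - query.shape[1]
    dx = int(rng.integers(0, max_dx + 1))
    dy = int(rng.integers(0, max_dy + 1))
    cost = match_cost(query, image, dx, dy)
    best = (dx, dy, cost)
    for k in range(steps):
        t = t0 * (1.0 - k / steps) + 1e-9            # linear cooling schedule
        ndx = int(np.clip(dx + rng.integers(-2, 3), 0, max_dx))
        ndy = int(np.clip(dy + rng.integers(-2, 3), 0, max_dy))
        c = match_cost(query, image, ndx, ndy)
        # accept downhill moves always, uphill moves with Boltzmann probability
        if c < cost or rng.random() < np.exp((cost - c) / t):
            dx, dy, cost = ndx, ndy, c
            if cost < best[2]:
                best = (dx, dy, cost)
    return best
```

The occasional acceptance of uphill moves is what lets the search escape the many local minima mentioned above.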
This paper proposes a novel approach to the automatic segmentation of singing voice within music signals, based on the difference between the dynamic harmonic content of the singing voice and that of musical instrument signals. The obtained results are compared with those of another approach proposed in the literature, over the same music database. For both techniques, an accuracy rate of around 80% is obtained, even though a more rigorous performance measure is applied to our approach only. As an advantage, the new procedure has lower computational complexity. In addition, we discuss further results obtained by extending the tests over the whole database (upholding the same performance level) and by discriminating the error types (boundaries shifted in time, insertion and deletion of singing segments). The analysis of these errors suggests some alternative ways of reducing them, such as adopting a confidence level based on a minimum harmonic content for the input signals. Considering only signals with a confidence level equal to one, the obtained performance improves to almost 87%.
Music genre provides an efficient way to index songs in a music database, and can serve as an effective means to retrieve music of a similar type, i.e., content-based music retrieval. In addition to other features, the temporal-domain features of a music signal are exploited in this research so as to increase the classification rate. Three temporal techniques are examined in depth. First, the hidden Markov model (HMM) is used to emulate the time-varying properties of music signals. Second, to further increase the classification rate, we propose another feature set that focuses on the residual part of music signals. Third, the overall classification rate is enhanced by classifying smaller segments of the test material individually and making the decision via majority voting. Experimental results are given to demonstrate the performance of the proposed techniques.
The Music Information Retrieval (MIR) and Music Digital Library (MDL) research communities have long noted the need for formal evaluation mechanisms. Issues concerning the unavailability of freely-available music materials have greatly hindered the creation of standardized test collections with which these communities could scientifically assess the strengths and weaknesses of their various music retrieval techniques. The International Music Information Retrieval Systems Evaluation Laboratory (IMIRSEL) is being developed at the University of Illinois at Urbana-Champaign (UIUC) specifically to overcome this hindrance to the scientific evaluation of MIR/MDL systems. Together with its subsidiary Human Use of Music Information Retrieval Systems (HUMIRS) project, IMIRSEL will allow MIR/MDL researchers access to the standardized large-scale collection of copyright-sensitive music materials and standardized test queries being housed at UIUC's National Center for Supercomputing Applications (NCSA). Virtual Research Labs (VRL), based upon NCSA's Data-to-Knowledge (D2K) tool set, are being developed through which MIR/MDL researchers will interact with the music materials under a "trusted code" security model.
Techniques for semantic weighting and decomposition of XML schemas are investigated in this work to support their efficient management. Two approaches are proposed to calculate the weights of XML elements: the first is based on the analysis of links and their attributes, while the other is based on the information propagated from all reachable nodes. We analyze the influence of different types of links on the weights of XML elements. The weights are then used to decompose an XML schema and to choose representatives of the decomposed clusters. The two weighting approaches provide consistent results. The decomposition of an XML schema can be conducted via two methods: repetition-based decomposition and weight-based decomposition.
It is shown that the weight-based solution can achieve a multi-resolution decomposition result.
This paper introduces the principal approach and describes the basic architecture and current implementation of the knowledge-based multimedia adaptation framework we are currently developing. The framework can be used in Universal Multimedia Access scenarios, where multimedia content has to be adapted to specific usage environment parameters (network and client device capabilities, user preferences). Using knowledge-based techniques (state-space planning), the framework automatically computes an adaptation plan, i.e., a sequence of media conversion operations, to transform the multimedia resources to meet the client's requirements or constraints. The system takes as input standards-compliant descriptions of the content (using MPEG-7 metadata) and of the target usage environment (using MPEG-21 Digital Item Adaptation metadata) to derive start and goal states for the planning process, respectively. Furthermore, declarative descriptions of the conversion operations (such as those available via software library functions) enable existing adaptation algorithms to be invoked without requiring programming effort. A running example in the paper illustrates the descriptors and techniques employed by the knowledge-based media adaptation system.
This paper introduces a novel paradigm for integrated retrieval and browsing in content-based visual information retrieval systems. The proposed approach uses feature transformations and distance measures for content-based media access and similarity measurement. The first innovation is that distance space is visualised in a 3D user interface: 2D representations of media objects are shown on the image plane, while the floor plane is used to show their distance relationships. Queries can be defined interactively by browsing through the 3D space and selecting media objects as positive or negative examples. Each selection operation defines hyper-clusters that are used for querying, and causes query execution and distance space adaptation in a background process. To help the user understand distance space, descriptions are visualised in diagrams and associated with media objects. Changes in distance space are visualised by tree-like graphs. Furthermore, the user can select subspaces of distance space and choose new distance metrics for them. This allows dealing with multiple similarity judgements in one retrieval process. The proposed components for visual data mining will be implemented in the visual information retrieval project VizIR. All VizIR components can be arbitrarily combined into sophisticated retrieval applications.
Because of the transition from analog to digital technologies, content owners are seeking technologies for the protection of copyrighted multimedia content. Encryption and watermarking are two major tools that can be used to prevent unauthorized consumption and duplication. In this paper, we generalize an idea in a recent paper that embeds a binary pattern in the form of a binary image in the LL and HH bands at the second level of Discrete Wavelet Transform (DWT) decomposition. Our generalization includes all four bands (LL, HL, LH, and HH) and a comparison of embedding the watermark at the first and second decomposition levels. We tested the proposed algorithm against fifteen attacks. Embedding the watermark in lower frequencies is robust against one group of attacks, while embedding it in higher frequencies is robust against another set. Only for the rewatermarking and collusion attacks are the watermarks extracted from all four bands identical. Our experiments indicate that first-level decomposition appears advantageous for two reasons: the area for watermark embedding is maximized, and the extracted watermarks are more textured, with better visual quality.
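The subband embedding idea can be sketched with a one-level Haar DWT (a simplified, non-blind sketch; the embedding strength `alpha` and additive rule are assumptions, not the paper's exact algorithm):

```python
import numpy as np

def haar_dwt2(img):
    """One-level 2-D Haar transform -> (LL, HL, LH, HH)."""
    a = (img[0::2, :] + img[1::2, :]) / 2.0   # row averages
    d = (img[0::2, :] - img[1::2, :]) / 2.0   # row differences
    LL = (a[:, 0::2] + a[:, 1::2]) / 2.0
    HL = (a[:, 0::2] - a[:, 1::2]) / 2.0
    LH = (d[:, 0::2] + d[:, 1::2]) / 2.0
    HH = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return LL, HL, LH, HH

def haar_idwt2(LL, HL, LH, HH):
    """Exact inverse of haar_dwt2 (perfect reconstruction)."""
    a = np.empty((LL.shape[0], 2 * LL.shape[1]))
    d = np.empty_like(a)
    a[:, 0::2], a[:, 1::2] = LL + HL, LL - HL
    d[:, 0::2], d[:, 1::2] = LH + HH, LH - HH
    img = np.empty((2 * a.shape[0], a.shape[1]))
    img[0::2, :], img[1::2, :] = a + d, a - d
    return img

def embed(img, wm_bits, band=3, alpha=4.0):
    """Additively embed +-alpha (from binary watermark bits) in one subband."""
    bands = list(haar_dwt2(img))
    bands[band] = bands[band] + alpha * (2.0 * wm_bits - 1.0)
    return haar_idwt2(*bands)

def extract(img_wm, img_orig, band=3, alpha=4.0):
    """Non-blind extraction: sign of the subband difference recovers the bits."""
    diff = haar_dwt2(img_wm)[band] - haar_dwt2(img_orig)[band]
    return (diff > 0).astype(int)
```

Choosing `band=0` (LL) versus `band=3` (HH) trades off robustness against the two different attack groups discussed above.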
Digital fingerprinting has been widely used to protect multimedia content from being used for unauthorized purposes. Digital fingerprints are often embedded in the host media signal using watermarking techniques that are known to be resistant to a variety of processing attacks. However, one cost-effective strategy to attack digital fingerprints is collusion, where several colluders average their individual copies to disrupt the underlying fingerprints. Recently, a new class of fingerprinting codes, called anti-collusion codes (ACC), has been proposed for use with code-modulated data embedding. In designing digital fingerprints that are resistant to collusion attacks, there are several important design considerations: how can we accommodate as many users as possible for a given fingerprint dimensionality, and how can we identify the colluders effectively from the colluded signal? In this work, we identify an underlying similarity between the colluder detection problem and the multiuser detection problem from code division multiple access (CDMA). We propose that fingerprints can be constructed using sequence sets satisfying the Welch Bound Equality (WBE). WBE sequences have been shown to be optimal in synchronous CDMA. In order to identify the colluders when employing WBE-based ACC, we further propose a detection algorithm utilizing sphere decoding that identifies the colluders from the colluded signal. We evaluate the performance of the proposed WBE-based ACC fingerprints with our proposed detection algorithm through simulations, and show that the algorithm performs well at moderate noise levels. Finally, we compare our design scheme against orthogonal fingerprints and the BIBD
anti-collusion codes proposed earlier, and show that the proposed WBE-based ACC and detection algorithm have better performance than BIBD-based ACC under the same configuration.
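The averaging collusion attack and the baseline orthogonal-fingerprint correlation detector that the proposed scheme is compared against can be sketched as follows (dimensions, noise level, and threshold are illustrative; the WBE construction and sphere-decoding detector themselves are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_users = 256, 8

# One orthonormal fingerprint per user (rows of a random orthonormal matrix).
F = np.linalg.qr(rng.standard_normal((n, n_users)))[0].T

host = rng.standard_normal(n)
copies = host + 2.0 * F                    # each user's fingerprinted copy

colluders = [1, 4, 6]
colluded = copies[colluders].mean(axis=0)  # averaging collusion attack
colluded += 0.05 * rng.standard_normal(n)  # mild additional noise

# Correlation detection: each colluder retains ~1/K of fingerprint energy,
# so colluder scores cluster around 2/K while innocents stay near zero.
scores = F @ (colluded - host)
detected = sorted(np.nonzero(scores > scores.max() / 2)[0].tolist())
```

This also shows why averaging is cost-effective for the attackers: each fingerprint's detection statistic shrinks linearly with the number of colluders K.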
Scalable coding is a technology that encodes a multimedia signal in a scalable manner, so that various representations can be extracted from a single codestream to fit a wide range of applications. Many new scalable coders such as JPEG 2000 and MPEG-4 FGS offer fine granularity scalability to provide a near-continuous optimal tradeoff between quality and rate over a large range. This fine granularity scalability poses great new challenges to the design of encryption and authentication systems for scalable media in Digital Rights Management (DRM) and other applications. It may be desirable or even mandatory to maintain a certain level of scalability in the encrypted or signed codestream, so that no decryption or re-signing is needed when legitimate adaptations are applied. In other words, the encryption and authentication should be scalable, i.e., adaptation friendly. Otherwise, secrets have to be shared with every intermediate stage along the content delivery system that performs adaptation manipulations. Sharing secrets with many parties would jeopardize the overall security of a system, since the security depends on the weakest component of the system. In this paper, we first describe general requirements and desirable features of an encryption or authentication system for scalable media, especially those not encountered in the non-scalable case. Then we present an overview of the current state of the art in scalable encryption and authentication. These technologies include full and selective encryption schemes that maintain the original or coarser granularity of scalability offered by an unencrypted scalable codestream, layered access control, and block-level authentication that reduces the fine granularity of scalability to a block level, among others. Finally, we summarize existing challenges and propose future research directions.
A large number of methods have been proposed for encrypting images with shared-key encryption mechanisms. However, the existing techniques are primarily applicable to non-compressed images, whereas most imaging applications nowadays, including digital photography, archiving, and Internet communications, use images in the JPEG domain. Applying the existing shared-key cryptographic schemes to these images requires conversion back into the spatial domain. In this paper we propose a shared-key algorithm that works directly in the JPEG domain, thus enabling shared-key image encryption for a variety of applications. The scheme operates directly on the quantized DCT coefficients, and the resulting noise-like shares are also stored in the JPEG format. The decryption process is lossless. Our experiments indicate that each share image is approximately the same size as the original JPEG, retaining the storage advantage provided by JPEG.
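The lossless, coefficient-domain sharing can be illustrated with a simple modular 2-out-of-2 scheme over quantized DCT values (an illustrative construction, not necessarily the paper's exact algorithm; the coefficient range is an assumption):

```python
import numpy as np

M = 2048  # assumed range of quantized DCT coefficients, here [-1024, 1023]

def make_shares(coeffs, rng):
    """Split quantized DCT coefficients into two noise-like shares
    via modular 2-out-of-2 secret sharing; either share alone is
    uniformly random and reveals nothing about the coefficients."""
    s1 = rng.integers(0, M, size=coeffs.shape)
    s2 = (coeffs + 1024 - s1) % M      # offset to non-negative, then share
    return s1, s2

def reconstruct(s1, s2):
    """Combining both shares recovers the coefficients exactly (lossless)."""
    return (s1 + s2) % M - 1024
```

Because both shares are integer arrays in the coefficient range, they can themselves be stored as JPEG-style coefficient data, which is the storage property the abstract highlights.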
More and more digital services provide the capability of distributing digital content to end-users through high-bandwidth networks, such as satellite systems. In such systems, Digital Rights Management (DRM) has become increasingly important and is encountering great challenges. Digital watermarking has been proposed as a possible solution for digital copyright tracking and enforcement. The nature of DRM systems places high requirements on a watermark's robustness, uniqueness, easy detection, accurate retrieval and convenient management. We have developed a series of feature-based watermarking algorithms for digital video for satellite transmission. In this paper, we first describe a general secure digital content distribution system model and the requirements on watermarking as one mechanism of DRM in digital content distribution applications. Then we present in detail a few feature-based digital watermarking methods which are integrated with a dynamic watermarking schema to protect digital content in a dynamic environment. For example, a watermark embedded in the DFT feature domain is invariant to rotation, scale and translation. Our proposed DFT domain watermarking schemas, which exploit the magnitude property of the DFT feature domain, allow both robust and easy watermark tracking and detection in the case of copyright infringement using cameras or camcorders. This DFT feature-based watermarking algorithm is able to tolerate large-angle rotation without searching over possible rotation angles, which reduces the complexity of the watermark detection process and allows fast retrieval and easy management. We then present a wavelet feature-based watermarking algorithm for dynamic watermark key updates and key management, and we conclude the paper with a summary, pointing out future research directions.
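The magnitude property mentioned above is easy to verify numerically: a spatial translation changes only the phase of the DFT, never the magnitude, so anything embedded in the magnitude spectrum survives translation unchanged (a sketch of the property, not of the watermarking algorithm itself):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))

mag = np.abs(np.fft.fft2(img))
shifted = np.roll(img, shift=(5, 9), axis=(0, 1))   # circular translation
mag_shifted = np.abs(np.fft.fft2(shifted))

# A translation multiplies the spectrum by a unit-modulus phase ramp,
# leaving the magnitude untouched.
magnitude_invariant = np.allclose(mag, mag_shifted)
```

Rotation and scale invariance additionally require a log-polar style resampling of the magnitude spectrum, which is why feature-domain constructions are used on top of the raw DFT.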
In this work, we consider the problem of assigning OVSF (Orthogonal
Variable Spreading Factor) codes to arriving calls for multi-rate
code-division multiple access systems, and propose a sequence of
algorithms to solve this problem from different angles. First, we
introduce two new policies, called FCA (Fixed Code Assignment) with
fixed set partitioning and DCA (Dynamic Code Assignment) with call
admission control, with the objective of maximizing the average data
throughput of the system. Numerical simulation confirms that optimized
FCA and DCA perform better than DCA with a greedy policy as the traffic load increases and high-rate calls become dominant. Second, a
suboptimal DCA with call admission control is examined. The objective
is to generate an average data throughput of the system close to that of the optimal scheme while demanding much lower design and implementation complexity than the optimal scheme. By means of capacity or class partitioning and partial resource sharing, we can significantly reduce the computational complexity, thus achieving good design and implementation scalability. Numerical evaluation shows the superior performance of the proposed schemes with low complexity.
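The OVSF code tree underlying both FCA and DCA can be generated by the standard recursive expansion, in which each code c spawns the two children (c, c) and (c, -c) (a sketch; the row ordering below follows the Hadamard-style construction rather than the tree's left-to-right order):

```python
import numpy as np

def ovsf_codes(sf):
    """All OVSF codes of spreading factor sf (sf a power of two),
    built by repeatedly expanding c -> (c, c) and (c, -c)."""
    codes = np.array([[1]])
    while codes.shape[1] < sf:
        codes = np.vstack([np.hstack([codes, codes]),
                           np.hstack([codes, -codes])])
    return codes
```

Codes at the same spreading factor are mutually orthogonal, which is exactly the resource being partitioned: assigning a code blocks its ancestors and descendants in the tree, so the assignment policy determines how much multi-rate traffic the tree can carry.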
The research on the Novelty Detection System (NDS), called VENUS, at the authors' universities has generated exciting results. For example, we can detect abnormal behavior (such as car thefts from a parking lot) in a series of video frames based on the cognitively motivated theory of habituation. In this paper, we describe implementation strategies for lower-layer protocols that apply large-scale Wireless Sensor Networks (WSN) to NDS with Quality-of-Service (QoS) support. A wireless data collection framework, consisting of small and low-power sensor nodes, provides an alternative mechanism to observe the physical world, using various types of sensing capabilities that include images (and even videos using Panoptos), sound, and basic physical measurements such as temperature. We do not want to lose any 'data query command' packets (in the downstream direction: sink-to-sensors) or have any bit errors in them, since they are so important to the whole sensor network. In the upstream direction (sensors-to-sink), we may tolerate the loss of some sensing data packets. However, the sensing flow of interest should be assigned a higher priority in terms of multi-hop path choice, network bandwidth allocation, and sensing data packet generation frequency (we hope to generate more sensing data packets for a novel event in the specified network area).
The focus of this paper is to investigate MAC-level Quality of Service (QoS) issues in Wireless Sensor Networks (WSN) for novelty detection applications. Although QoS has been widely studied in other types of networks, including the wired Internet, general ad hoc networks and mobile cellular networks, we argue that QoS in WSN has its own characteristics. In the wired Internet, the main QoS parameters include delay, jitter and bandwidth. In mobile cellular networks, the two most common QoS metrics are the handoff call dropping probability and the new call blocking probability. Since the main task of a WSN is to detect and report events, the most important QoS parameters should include sensing data packet transmission reliability, lifetime extension from sensor sleeping control, event detection latency, and congestion reduction through removal of redundant sensing data. In this paper, we focus on the following bi-directional QoS topics: (1) downstream (sink-to-sensor) QoS: reliable data query command forwarding to particular sensor(s); in other words, we do not want to lose the query command packets; (2) upstream (sensor-to-sink) QoS: transmission of sensed data with priority control. Data of greater interest, which can better help in novelty detection, should be transmitted on an optimal path with higher reliability. We propose the use of Differentiated Data Collection. Due to the large-scale nature and resource constraints of typical wireless sensor networks, such as limited energy, small memory (typically RAM < 4K bytes) and short communication range, the above problems become even more challenging. Besides the QoS support issue, we also describe our low-energy sensing data transmission network architecture. Our research results show the scalability and energy efficiency of the proposed WSN QoS schemes.
In recent years, there has been tremendous interest and progress in the field of wireless communications. Call admission control (CAC) is the key component for maximizing system utilization under certain QoS constraints such as call blocking rates. Among CAC approaches, the Markov decision process (MDP) is a popular method for optimizing objectives of interest. However, the computational complexity of deriving optimal policies makes this approach less tractable for large problem sizes. In this paper, we use sensitivity analysis to address how the optimal solution fluctuates as traffic conditions change, in order to cut down unnecessary computing time when the optimal policy does not change as the traffic conditions vary. First, the LP problem is solved by the simplex method to determine the best policy once the optimal solution is found; then sensitivity analysis is applied by perturbing the traffic parameters to indicate the range over which the optimal bases are invariant. Analytical results on the reduction of computational complexity are presented to analyze the performance under various traffic conditions.
Many applications require network performance bounds, or Quality
of Service (QoS), for their proper operation. This is achieved
through the appropriate allocation of network resources; however,
providing end-to-end QoS is becoming more complex, due to the
increasing heterogeneity of networks. For example, end-to-end QoS
can be provided through the concatenation of services across
multiple networks (domains), but each domain may employ different
network technologies as well as different QoS methodologies. As a
result, management strategies are needed to provide QoS across
multiple domains in a scalable and economically feasible manner.
This paper describes a microeconomic-based middleware architecture
that allows the specification and acquisition of QoS and resource
policies. The architecture consists of users, bandwidth brokers,
and network domains. When executing applications, users obtain network
QoS via middleware from a bandwidth broker. Bandwidth
brokers then interact with one another to provide end-to-end QoS
connections across multiple domains. This is done in a BGP-like manner
that recursively provides end-to-end services in a scalable
fashion. Using this framework, this paper describes management
strategies to optimally provision and allocate end-to-end
connections. The methods maintain a low blocking probability, and
maximize utility and profit, which are increasingly important as
network connectivity evolves into an industry.
Dynamic quality of service (QoS) mapping control in a relative service differentiation network is a forward-looking framework for achieving high-quality end-to-end video streaming. Proper QoS mapping, in terms of delay and loss, between categorized-packet video and a proportional differentiated services (DiffServ, DS) network can improve video quality under the same cost constraint. However, network congestion caused by traffic load fluctuation remains the main hindrance to providing persistently better service, even when the resource provisioning policy of the underlying network is well established. To address this issue, we propose a class-based feedback control to enhance relative service differentiation-aware video streaming. The major idea of our proposal is to employ an explicit congestion notification (ECN) mechanism in conjunction with QoS mapping control at the ingress of a DiffServ domain. In this way, not only is the network congestion status notified to end-host video applications, but a reactive QoS mapping control is also triggered at the ingress side. NS-2-based simulation results will be
presented to show the enhanced performance of the QoS mapping control framework.
In this paper, we investigate a dynamic admission control (DAC) scheme designed for guaranteed wireless video transmission in the IEEE 802.11e wireless LAN (WLAN) environment. To guarantee differentiated QoS for network-adaptive video streaming, the proposed DAC is designed to utilize the video codec's layering characteristics as well as the differentiation capability of the IEEE 802.11e MAC (medium access control). In particular, to cope with the time-varying, hostile wireless environment, limited wireless resources for transmission opportunities must be dynamically reserved, coordinated, and utilized. The proposed realization of DAC is composed of three sub-modules: reservation-based call admission control (CAC), dynamic service resource allocation, and on-flow service differentiation. To evaluate the performance of the proposed DAC, we apply it to the wireless streaming of ITU-T H.263+ streams over an IEEE 802.11e WLAN. Network simulator (NS-2) based simulation results show that it achieves both acceptable receiver-side video quality and efficient resource utilization in the face of network loads and channel variations.
This paper describes how parallel retrieval is implemented in the content-based visual information retrieval framework VizIR. Generally, two major use cases for parallelisation exist in visual retrieval systems: distributed querying and simultaneous multi-user querying. Distributed querying includes parallel query execution and querying multiple databases. Content-based querying is a two-step process: transformation of feature space to distance space using distance measures and selection of result set elements from distance space. Parallel distance measurement is implemented by sharing example media and query parameters between querying threads. In VizIR, parallelisation is heavily based on caching strategies. Querying multiple distributed databases is already supported by standard relational database management systems. The most relevant issues here are error handling and minimisation of network bandwidth consumption. Moreover, we describe strategies for distributed similarity measurement and content-based indexing. Simultaneous multi-user querying raises problems such as caching of querying results and usage of relevance feedback and user preferences for query refinement. We propose a 'real' multi-user querying environment that allows users to interact in defining queries and browse through result sets simultaneously. The proposed approach opens an entirely new field of applications for visual information retrieval systems.
Iris and face biometric systems are under intense study as a multimodal pair, due in part to the ability to acquire both with the same capture system. While several successful research efforts have considered facial images as part of an iris-face multimodal biometric system, little work has explored the iris recognition problem under different poses of the subject. This is because most commercial iris recognition systems depend on the high-performance algorithm patented by Daugman, which does not take the pose and illumination variations of iris acquisition into consideration. Hence there is a pressing need for sophisticated iris detection systems that localize the iris region under different poses and facial views.
In this paper we present a non-frontal/non-ideal iris acquisition technique in which iris images are extracted from regular visual video sequences. The video sequence is captured 3 feet from the subject along a 90-degree arc spanning the profile view to the frontal view. We present a novel design for an iris detection filter that locates the iris, the pupil and the sclera using a Laplacian-of-Gaussian ellipse detection technique. Experimental results show that the proposed approach can localize the iris in facial images over a wide range of pose variations, including semi-frontal views.
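The core of a Laplacian-of-Gaussian (LoG) detector of the kind mentioned above can be sketched as follows. This toy version detects only a dark circular blob (a crude pupil-centre estimate) on a synthetic image; the paper's actual filter handles ellipses and real facial video, which this sketch does not attempt.

```python
import numpy as np


def log_kernel(size, sigma):
    """Laplacian-of-Gaussian kernel: responds strongly to dark
    circular blobs (such as the pupil) on a brighter background."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx ** 2 + yy ** 2
    k = (r2 - 2 * sigma ** 2) / sigma ** 4 * np.exp(-r2 / (2 * sigma ** 2))
    return k - k.mean()  # zero-mean: flat regions produce no response


def detect_dark_blob(image, size=9, sigma=2.0):
    """Slide the LoG kernel over the image and return the centre of
    the strongest response (exhaustive correlation, for clarity)."""
    k = log_kernel(size, sigma)
    h, w = image.shape
    best, best_pos = -np.inf, (0, 0)
    for y in range(h - size + 1):
        for x in range(w - size + 1):
            resp = np.sum(k * image[y:y + size, x:x + size])
            if resp > best:
                best, best_pos = resp, (y + size // 2, x + size // 2)
    return best_pos


# Synthetic test: bright background with a dark disc (the "pupil").
img = np.ones((30, 30))
yy, xx = np.ogrid[:30, :30]
img[(yy - 15) ** 2 + (xx - 20) ** 2 < 9] = 0.0
print(detect_dark_blob(img))
```

Extending this to ellipse detection amounts to using anisotropic (stretched and rotated) LoG kernels, so that off-axis iris views, which appear elliptical, still produce a strong response.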
Video coding standards for object-based video coding and tools for multimedia content description are now available. Hence, we have powerful tools for content-based video coding, description, indexing and organization. In the past, it was difficult to extract higher-level semantics, such as video objects, automatically. In this paper, we present a novel approach to moving-object region detection. For this purpose, we developed a framework that applies bidirectional global motion estimation and compensation to identify potential foreground object regions. After spatial image segmentation, the results are assigned to image segments and further diffused over the image region. This enables robust object region detection even in cases where the investigated object does not move continuously. Finally, each image segment is classified as belonging either to the foreground or to the background. Subsequent region merging delivers foreground object masks that can be used to define the region of attention for content-based video coding, as well as for contour-based object classification.
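The principle behind global motion compensation for foreground detection can be sketched as follows, assuming a purely translational camera motion and a single-direction (not bidirectional) estimate; the paper's framework is considerably more general.

```python
import numpy as np


def estimate_global_motion(prev, curr, max_shift=3):
    """Exhaustive-search translational global motion estimation:
    find the (dy, dx) shift that best aligns the two frames."""
    best, best_mv = np.inf, (0, 0)
    h, w = prev.shape
    m = max_shift
    for dy in range(-m, m + 1):
        for dx in range(-m, m + 1):
            a = prev[m + dy:h - m + dy, m + dx:w - m + dx]
            b = curr[m:h - m, m:w - m]
            err = np.mean((a - b) ** 2)
            if err < best:
                best, best_mv = err, (dy, dx)
    return best_mv


def foreground_mask(prev, curr, thresh=0.1, max_shift=3):
    """Compensate the global (camera) motion, then threshold the
    residual: pixels that still differ are potential foreground."""
    dy, dx = estimate_global_motion(prev, curr, max_shift)
    h, w = prev.shape
    m = max_shift
    comp = prev[m + dy:h - m + dy, m + dx:w - m + dx]
    residual = np.abs(curr[m:h - m, m:w - m] - comp)
    return residual > thresh


# Textured background shifted by (1, 1), plus a bright moving object.
rng = np.random.default_rng(0)
prev = rng.random((20, 20))
curr = np.roll(prev, (1, 1), axis=(0, 1))
curr[5:8, 5:8] = 1.0  # the foreground object
print(foreground_mask(prev, curr).sum())
```

Assigning such per-pixel residuals to spatial segments and diffusing them over segment regions, as described above, is what makes the detection robust when the object pauses.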
The proxy mechanism widely used in WWW systems offers low-delay, scalable delivery of data by means of a "proxy server". By applying the proxy mechanism to video streaming systems, high-quality, low-delay video distribution can be accomplished without imposing extra load on the system. We have proposed proxy caching mechanisms to accomplish high-quality and highly interactive video streaming services. In our proposed mechanisms, proxies communicate with each other and retrieve missing video data from an appropriate server, taking into account the transfer delay and the quality each server can offer. In addition, the quality of cached video data can be adapted appropriately in the proxy to cope with client-to-client heterogeneity in terms of available bandwidth, end-system performance, and user preferences on perceived video quality. In this paper, to verify the practicality of our mechanisms, we implemented them in a real system for MPEG-4 video streaming services and conducted experiments. The evaluations showed that our proxy caching system can provide users with continuous, high-quality video distribution in accordance with network conditions.
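The cache-and-adapt behaviour described above can be sketched in a few lines. This is a deliberately simplified model (all names are ours; eviction is arbitrary rather than policy-driven, and the real system streams MPEG-4 rather than returning opaque segments):

```python
class VideoProxy:
    """Sketch of a quality-adaptive proxy cache for video segments."""

    def __init__(self, origin, capacity=3):
        self.origin = origin    # origin server: seg_id -> (quality, data)
        self.cache = {}         # cached segments: seg_id -> (quality, data)
        self.capacity = capacity

    def get(self, seg_id, client_bw):
        # Cache hit: adapt the cached quality down to what the
        # client's available bandwidth can sustain.
        if seg_id in self.cache:
            quality, data = self.cache[seg_id]
            return min(quality, client_bw), data
        # Cache miss: retrieve from the origin server, cache the
        # segment (evicting an arbitrary entry when full), then adapt.
        quality, data = self.origin[seg_id]
        if len(self.cache) >= self.capacity:
            self.cache.pop(next(iter(self.cache)))
        self.cache[seg_id] = (quality, data)
        return min(quality, client_bw), data


origin = {"s1": (10, "frame-1"), "s2": (4, "frame-2")}
proxy = VideoProxy(origin, capacity=1)
print(proxy.get("s1", client_bw=6))  # fetched from origin, served at quality 6
```

In the cooperative setting described above, a miss would first query neighbouring proxies, falling back to the origin server only when no proxy holds the segment at acceptable delay and quality.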