Fusing actively acquired depth with passively estimated depth has proven to be an effective strategy for improving depth estimation, since the two modalities complement each other. To fuse data from the two sensors into a more accurate depth map, the limitations of active sensing, such as low lateral resolution, must be taken into account when combining it with a passive depth map. We present an approach for accurately fusing active time-of-flight depth with passive stereo depth. We propose a multimodal sensor fusion strategy based on a weighted energy optimization problem, where the weights are generated by combining edge information from a texture map with the active and passive depth maps. An objective evaluation of our fusion algorithm shows improved accuracy of the generated depth map in comparison with the depth map of each single modality and with the results of other fusion methods. A visual comparison additionally shows better recovery at edges where passive stereo estimates erroneous depth values. Moreover, a left-right consistency check on the result illustrates the ability of our approach to fuse the sensors consistently.
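The edge-weighted fusion idea can be illustrated with a minimal per-pixel sketch. This is not the paper's actual energy-minimization formulation; the function name and the simple gradient-based weighting are illustrative assumptions, showing only how texture edges can steer trust between the two modalities.

```python
import numpy as np

def fuse_depth(d_tof, d_stereo, texture):
    """Illustrative weighted fusion of a ToF and a stereo depth map.

    Near texture edges the stereo estimate is trusted more (stereo is
    sharper at depth discontinuities); in flat regions the ToF
    measurement is trusted more (stereo fails in low-texture areas).
    """
    # Edge magnitude of the texture map (simple gradient norm).
    gy, gx = np.gradient(texture.astype(float))
    edges = np.hypot(gx, gy)
    # Normalize edge strength into [0, 1] to act as the stereo weight.
    w_stereo = edges / (edges.max() + 1e-9)
    w_tof = 1.0 - w_stereo
    return (w_tof * d_tof + w_stereo * d_stereo) / (w_tof + w_stereo)
```

A full energy-minimization scheme would couple neighbouring pixels through a smoothness term; the per-pixel average above only conveys the role of the edge-derived weights.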
With the advent of light field acquisition technologies, the captured information of a scene is enriched with both angular and spatial information. This additional information enables new capabilities in the post-processing stage, e.g. refocusing, 3D scene reconstruction, and synthetic aperture. Light field capturing devices fall into two categories: in the first, a single plenoptic camera captures a densely sampled light field; in the second, multiple traditional cameras capture a sparsely sampled light field. In both cases, the size of the captured data increases with the additional angular information. The recent call for proposals on light field compression by the Joint Photographic Experts Group (JPEG), also called JPEG Pleno, reflects the need for a new and efficient light field compression solution. In this paper, we propose a compression solution for sparsely sampled light field data. Each view of the multi-camera system is interpreted as a frame of a multi-view sequence. The pseudo multi-view sequences are compressed using the state-of-the-art Multiview extension of High Efficiency Video Coding (MV-HEVC). A subset of four light field images from the Stanford dataset is compressed at four bit rates in order to cover low to high bit-rate scenarios. The comparison is made with the state-of-the-art reference encoder HEVC and its real-time implementation x265. The rate-distortion analysis shows that the proposed compression scheme outperforms both reference schemes in all tested bit-rate scenarios for all test images; an average BD-PSNR gain of 1.36 dB over HEVC and 2.15 dB over x265 is achieved.
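The BD-PSNR figures quoted above follow the standard Bjøntegaard procedure: fit a polynomial of PSNR as a function of log bitrate for each codec and average the gap over the overlapping rate range. A minimal sketch of that computation (the function name is our own; the cubic-fit-and-integrate method is the conventional one):

```python
import numpy as np

def bd_psnr(rate_ref, psnr_ref, rate_test, psnr_test):
    """Bjontegaard delta PSNR: average PSNR gain (dB) of the test
    codec over the reference, over the overlapping bitrate range."""
    lr_ref = np.log10(np.asarray(rate_ref, float))
    lr_test = np.log10(np.asarray(rate_test, float))
    # Fit third-order polynomials of PSNR vs. log-rate for each codec.
    p_ref = np.polyfit(lr_ref, psnr_ref, 3)
    p_test = np.polyfit(lr_test, psnr_test, 3)
    lo = max(lr_ref.min(), lr_test.min())
    hi = min(lr_ref.max(), lr_test.max())
    # Integrate both fitted curves over the common interval and average.
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    return (int_test - int_ref) / (hi - lo)
```

With four rate points per codec, the cubic fit passes exactly through the measured points, matching the four-bit-rate setup described in the abstract.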
The ongoing success of three-dimensional (3D) cinema fuels increasing efforts to spread the commercial success of 3D to new markets. The possibility of a convincing 3D experience at home, such as three-dimensional television (3DTV), has generated a great deal of interest within the research and standardization community. A central issue for 3DTV is the creation and representation of 3D content. Acquiring scene depth information is a fundamental task in computer vision, yet complex and error-prone. Dedicated range sensors, such as the Time-of-Flight (ToF) camera, can simplify the scene depth capture process and overcome shortcomings of traditional solutions, such as active or passive stereo analysis. Admittedly, currently available ToF sensors deliver only a limited spatial resolution. However, sophisticated depth upscaling approaches use texture information to match depth and video resolution. At Electronic Imaging 2012 we proposed an upscaling routine based on error energy minimization, weighted with edge information from an accompanying video source. In this article we develop our algorithm further. By adding temporal consistency constraints to the upscaling process, we reduce disturbing depth jumps and flickering artifacts in the final 3DTV content. Temporal consistency in depth maps enhances the 3D experience, leading to a wider acceptance of 3D media content. More content in better quality can boost the commercial success of 3DTV.
Multi-view three-dimensional television relies on view synthesis to reduce the number of views being transmitted. Arbitrary views can be synthesized by utilizing corresponding depth images together with textures. The depth images obtained from stereo pairs or range cameras may contain erroneous values, which entail artifacts in a rendered view. Post-processing of the data may then be utilized to enhance the depth image, with the purpose of achieving better quality in synthesized views. We propose a Partial Differential Equation (PDE)-based interpolation method that reconstructs the smooth areas in depth images while preserving significant edges. We model the depth image by adjusting thresholds for edge detection and a uniform sparse sampling factor, followed by second-order PDE interpolation. The objective results show that a depth image processed by the proposed method can achieve better quality in synthesized views than the original depth image. Visual inspection confirmed these results.
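Second-order PDE interpolation of sparsely sampled depth amounts to solving Laplace's equation with the retained samples as Dirichlet constraints. A minimal iterative sketch (Jacobi relaxation; the function name and iteration count are illustrative, and the edge-preserving sampling step from the abstract is assumed to have produced the `known` mask):

```python
import numpy as np

def pde_interpolate(depth, known, n_iter=2000):
    """Fill unknown depth pixels by relaxing Laplace's equation.

    `known` is a boolean mask of retained samples (e.g. pixels kept by
    edge detection plus uniform sparse sampling); all other pixels are
    iteratively replaced by their four-neighbour average, which
    converges to the harmonic (smooth) interpolant.
    """
    d = depth.astype(float).copy()
    for _ in range(n_iter):
        # Four-neighbour average with replicated borders (Jacobi step).
        pad = np.pad(d, 1, mode='edge')
        avg = 0.25 * (pad[:-2, 1:-1] + pad[2:, 1:-1]
                      + pad[1:-1, :-2] + pad[1:-1, 2:])
        # Known samples stay fixed; unknown pixels take the average.
        d = np.where(known, depth, avg)
    return d
```

Because edges are kept in the sample set, the smooth reconstruction never interpolates across a significant depth discontinuity.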
3D video quality is of the highest importance for the adoption of a new technology from a user's point of view. In this paper we evaluate the impact of coding artefacts on stereoscopic 3D video quality using several existing full-reference 2D objective metrics. We analyze the performance of the objective metrics by comparing them with the results of a subjective experiment. The results show that the pixel-based Visual Information Fidelity metric fits the subjective data best. The 2D video quality appears to have the dominant impact on the perceived quality of stereoscopic videos impaired by coding artefacts.
Multi-view three-dimensional television requires many views, which may be synthesized from two-dimensional images with accompanying pixel-wise depth information. This depth image, which typically consists of smooth areas and sharp transitions at object borders, must be consistent with the acquired scene in order for synthesized views to be of good quality. We have previously proposed a depth image coding scheme that preserves significant edges and encodes the smooth areas between them. An objective evaluation considering the structural similarity (SSIM) index for synthesized views demonstrated an advantage of the proposed scheme over the High Efficiency Video Coding (HEVC) intra mode in certain cases. However, there were some discrepancies between the outcomes of the objective evaluation and of our visual inspection, which motivated this study based on subjective tests. The test was conducted according to the ITU-R BT.500-13 recommendation with stimulus-comparison methods. The results from the subjective test showed that the proposed scheme performs slightly better than HEVC, with statistical significance at the majority of the tested bit rates for the given contents.
Integral Imaging is a technique to obtain true color 3D images that can provide full and continuous motion parallax for several viewers. The depth of field of these systems is mainly limited by the numerical aperture of each lenslet of the microlens array. A digital method has been developed to increase the depth of field of Integral Imaging systems in the reconstruction stage. By means of the disparity map of each elemental image, it is possible to classify the objects of the scene according to their distance from the microlenses and apply a selective deconvolution for each depth of the scene. Topographical reconstructions with enhanced depth of field of a 3D scene are presented to support our proposal.
Depth-Image-Based Rendering (DIBR) of virtual views is a fundamental method in three-dimensional (3D) video applications for producing different perspectives from texture and depth information, in particular from the multi-view-plus-depth (MVD) format. Artifacts are still present in virtual views as a consequence of imperfect rendering with existing DIBR methods. In this paper, we propose an alternative DIBR method for MVD. In the proposed method we introduce edge pixels and interpolate pixel values in the virtual view using the actual projected coordinates from two adjacent views, whereby cracks and disocclusions are filled automatically. In particular, we propose a method to merge pixel information from the two adjacent views in the virtual view before the interpolation: we apply weighted averaging of projected pixels within the range of one pixel in the virtual view. We compared virtual view images rendered by the proposed method with the corresponding images rendered by state-of-the-art methods. Objective metrics demonstrated an advantage of the proposed method for most of the investigated media contents. Subjective test results showed a preference for different methods depending on media content, and the test could not demonstrate a significant difference between the proposed method and the state-of-the-art methods.
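The merging step can be sketched in one dimension: pixels projected from adjacent views land at real-valued coordinates, and each contributes to the neighbouring integer pixels with a weight that falls off linearly within a one-pixel range. This is a simplified stand-in for the paper's method (the function name and linear weighting are assumptions); target pixels receiving no contribution remain unfilled, marking disocclusions.

```python
import numpy as np

def merge_projected(coords, values, width):
    """Accumulate projected pixels into an integer grid.

    `coords` are real-valued projected positions in the virtual view,
    `values` their colors. Each sample contributes to its two nearest
    pixels with a linear weight; NaN in the output marks pixels with
    no contribution (disocclusions to be handled separately).
    """
    acc = np.zeros(width)
    wsum = np.zeros(width)
    for x, v in zip(coords, values):
        for px in (int(np.floor(x)), int(np.ceil(x))):
            if 0 <= px < width:
                w = max(0.0, 1.0 - abs(px - x))  # weight within one pixel
                acc[px] += w * v
                wsum[px] += w
    out = np.full(width, np.nan)
    mask = wsum > 0
    out[mask] = acc[mask] / wsum[mask]  # weighted average per pixel
    return out
```

Merging contributions from both adjacent views before interpolation is what lets cracks fill automatically: a crack in one view's projection is usually covered by samples from the other view.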
Complex multidimensional capturing setups such as plenoptic cameras (PC) introduce a trade-off between various system properties. Consequently, established capturing properties, like image resolution, need to be described thoroughly for these systems. Models and metrics that assist in exploring and formulating this trade-off are therefore highly beneficial for studying as well as designing complex capturing systems. This work demonstrates the capability of our previously proposed sampling pattern cube (SPC) model to extract the lateral resolution of plenoptic capturing systems. The SPC carries both ray information and focal properties of the capturing system it models. The proposed operator extracts the lateral resolution from the SPC model over an arbitrary number of depth planes, giving a depth-resolution profile. The operator utilizes the focal properties of the capturing system as well as the geometrical distribution of the light containers, which are the elements of the SPC model. We have validated the lateral resolution operator for different capturing setups by comparing the results with those from Monte Carlo numerical simulations based on the wave optics model. The lateral resolution predicted by the SPC model agrees with the results from the more complex wave optics model better than both the ray-based model and our previously proposed lateral resolution operator do. This agreement strengthens the conclusion that the SPC fills the gap between ray-based models and real system performance by including the focal information of the system as a model parameter. The SPC thus proves to be a simple yet efficient model for extracting the lateral resolution as a high-level property of complex plenoptic capturing systems.
New display technologies enable the use of 3D visualization in a medical context. Even though user performance seems to be enhanced with respect to 2D thanks to the addition of recreated depth cues, human factors, and more particularly visual comfort and visual fatigue, can still be an obstacle to the widespread use of these systems. This study aimed at evaluating and comparing two different 3D visualization systems (a commercial stereoscopic display and a state-of-the-art multi-view display) in terms of quality of experience (QoE) in the context of interactive medical visualization. An adapted methodology was designed in order to evaluate the experience of users subjectively. 14 medical doctors and 15 medical students took part in the experiment. After solving different tasks using the 3D reconstruction of a phantom object, they were asked to judge the quality of their experience according to specific features. They were also asked to give their opinion about the influence of 3D systems on their working conditions. The results suggest that medical doctors are open to 3D visualization techniques and are confident about their beneficial influence on their work. However, visual comfort and visual fatigue remain an issue with 3D displays. Results obtained with the multi-view display suggest that the use of continuous horizontal parallax might be the answer to these current limitations.
Autostereoscopic multiview displays require multiple views of a scene to provide motion parallax. When an observer changes viewing angle, different stereoscopic pairs are perceived, allowing new perspectives of the scene to be seen and giving a more realistic 3D experience. However, capturing an arbitrary number of views is at best cumbersome and on some occasions impossible. Conventional stereo video (CSV) operates on two video signals captured by two cameras at two different perspectives. Generating and transmitting two views is more feasible than doing so for multiple views; it would therefore be more efficient if the multiple views required by an autostereoscopic display could be synthesized from this sparse set of views. This paper addresses the conversion of stereoscopic video to multiview video using the video effect known as morphing. Different morphing algorithms are implemented and evaluated. Contrary to traditional conversion methods, these algorithms disregard explicit physical depth and instead generate intermediate views using sparse sets of correspondence features and image morphing. A novel morphing algorithm is also presented that uses the scale-invariant feature transform (SIFT) and segmentation to construct robust correspondence features and high-quality intermediate views. All algorithms are evaluated on a subjective and objective basis, and the comparison results are presented.
Presentations on multiview and light field displays have become increasingly popular. The restricted number of views implies non-smooth transitions between views if objects with sharp edges are far from the display plane; the phenomenon is explained by inter-perspective aliasing. This is undesirable in applications where a correct perception of the scene is required, such as in science and medicine. Anti-aliasing filters have been proposed in the literature and are defined according to the minimum and maximum depth present in the scene. We suggest a method that subdivides the ray-space and adjusts the anti-aliasing filter to the scene contents locally. We further propose new filter kernels, based on the ray-space frequency domain, that assure no aliasing while keeping the maximum amount of information unaltered. The proposed method outperforms filters from earlier works. Different filter kernels are compared; details of the output are sharper using one of the proposed filter kernels, which also preserves the most information.
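The local adaptation idea can be sketched as follows: instead of one global filter sized for the full scene depth range, each region is low-pass filtered with a support proportional to its own distance from the display plane. This is a simplified box-kernel stand-in for the paper's ray-space-derived kernels; the tile size, gain, and function name are illustrative assumptions.

```python
import numpy as np

def local_antialias(image, depth, display_plane, tile=8, gain=0.5):
    """Content-adaptive anti-aliasing (simplified sketch).

    Each tile is horizontally box-filtered with a kernel whose width
    grows with the tile's maximum distance from the display plane;
    tiles at the display plane are left untouched (no aliasing there).
    """
    out = image.astype(float).copy()
    h, w = image.shape
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            d = depth[y:y + tile, x:x + tile]
            # Odd kernel width proportional to local depth deviation.
            k = 1 + 2 * int(gain * np.abs(d - display_plane).max())
            if k > 1:
                block = out[y:y + tile, x:x + tile]
                pad = np.pad(block, ((0, 0), (k // 2, k // 2)), mode='edge')
                kernel = np.ones(k) / k
                out[y:y + tile, x:x + tile] = np.apply_along_axis(
                    lambda r: np.convolve(r, kernel, mode='valid'), 1, pad)
    return out
```

A global filter sized by the scene's extreme depths would blur these in-plane tiles unnecessarily, which is exactly the information loss the local method avoids.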
Accurate depth maps are a prerequisite in three-dimensional television, e.g. for high-quality view synthesis, but this information is not always easily obtained. Depth information gained by correspondence matching from two or more views suffers from disocclusions and low-texture regions, leading to erroneous depth maps. These errors can be avoided by using depth from dedicated range sensors, e.g. time-of-flight sensors. Because these sensors only have a restricted resolution, the resulting depth data need to be adjusted to the resolution of the corresponding texture frame, and standard upscaling methods provide only limited quality. This paper proposes a solution for upscaling low-resolution depth data to match high-resolution texture data. We introduce the Edge Weighted Optimization Concept (EWOC) for fusing low-resolution depth maps with corresponding high-resolution video frames by solving an overdetermined linear equation system. Similar to other approaches, we take information from the high-resolution texture, but additionally validate this information against the low-resolution depth to accentuate correlated data. Objective tests show an improvement in depth map quality in comparison with other upscaling approaches. This improvement is subjectively confirmed in the resulting view synthesis.
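The overdetermined-system formulation can be sketched in one dimension: low-resolution samples anchor every scale-th high-resolution pixel, smoothness equations tie neighbours together, and the smoothness weights are reduced where the texture indicates an edge. This is an illustrative simplification of EWOC, not the published formulation; the function name, the data-term weight, and the 1-D setting are assumptions.

```python
import numpy as np

def edge_weighted_upscale(d_low, scale, edge_w, data_w=100.0):
    """Edge-weighted depth upscaling as a least-squares sketch (1-D).

    Data equations pin every `scale`-th high-resolution pixel to a
    low-resolution sample; smoothness equations w*(d[i+1]-d[i])=0 are
    weakened (edge_w near 0) where the texture shows a likely edge,
    allowing a depth discontinuity there.
    """
    n = len(d_low) * scale
    rows, rhs = [], []
    # Data term: heavily weighted anchors at the known samples.
    for i, v in enumerate(d_low):
        r = np.zeros(n)
        r[i * scale] = data_w
        rows.append(r)
        rhs.append(data_w * v)
    # Smoothness term: edge-weighted first differences driven to zero.
    for i in range(n - 1):
        r = np.zeros(n)
        r[i], r[i + 1] = -edge_w[i], edge_w[i]
        rows.append(r)
        rhs.append(0.0)
    sol, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return sol
```

In two dimensions the same construction yields a sparse overdetermined system with one smoothness equation per neighbouring pixel pair, solved with a sparse least-squares solver.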
Broadcasting of high definition (HD) stereo-based 3D (S3D) TV is planned, or has already begun, in Europe, the US, and Japan. Specific data processing operations, such as compression and temporal and spatial resampling, are commonly used tools for saving network bandwidth when IPTV is the distribution form, as they allow more efficient recording and transmission of 3DTV signals; at the same time, however, they inevitably introduce quality degradations into the processed video. This paper investigates observers' quality judgments of state-of-the-art video coding schemes (simulcast H.264/AVC or H.264/MVC), with or without added temporal and spatial resolution reduction of S3D videos, through subjective experiments using the Absolute Category Rating (ACR) method. The results showed that a certain spatial resolution reduction combined with high-quality video compression was the most bandwidth-efficient way of processing the video data when the required video quality is to be judged as "good". As the subjective experiment was performed in parallel in two different laboratories in two different countries, a detailed analysis of the inter-lab differences was also performed.
Different compression formats for stereo- and multiview-based 3D video are being standardized, and software players capable of decoding and presenting these formats on different display types are a vital part of the commercialization and evolution of 3D video. However, the number of publicly available software video players capable of decoding and playing multiview 3D video is still quite limited. This paper describes the design and implementation of a GPU-based real-time 3D video playback solution, built on top of cross-platform, open source libraries for video decoding and hardware-accelerated graphics. A software architecture is presented that efficiently processes and presents high definition 3D video in real time and flexibly supports both current 3D video formats and emerging standards. Moreover, a set of bottlenecks in the processing of 3D video content in such a GPU-based real-time playback solution is identified and discussed.
Autostereoscopic multiview 3D displays have been available for a number of years, capable of producing a perception of depth in a 3D image without requiring user-worn glasses. Different approaches to compressing these 3D images exist. This paper investigates two compression schemes, JPEG 2000 and H.264/AVC, and how they affect the 3D image with respect to induced distortion. The investigation is conducted in three parts: an objective measurement, a qualitative subjective evaluation, and a quantitative user test. The objective measurement shows that the rate-distortion (RD) characteristics of the two compression schemes differ in character as well as in PSNR level. The qualitative evaluation is performed at bitrates where the two schemes have the same RD performance, and a number of distortion characteristics are found to be significantly different. However, the quantitative evaluation, performed with 14 non-expert viewers, indicates that the different distortion types do not significantly contribute to the overall perceived 3D quality. The bitrate used and the content of the original 3D image are the two factors that most significantly affect the perceived 3D image quality. In addition, the evaluation results suggest that viewers prefer less apparent depth and motion parallax when exposed to compressed 3D images on an autostereoscopic multiview display.
Common autostereoscopic 3D displays are based on multi-view projection. The diversity of resolutions and numbers of views of such displays implies that 3D content formats must be flexible in order to make broadcasting efficient. Furthermore, distribution of content over a heterogeneous network should adapt to the available network capacity. Present scalable video coding provides the ability to adapt to network conditions: it allows for quality, temporal, and spatial scaling of 2D video. Scalability for 3D data extends this list to the depth and view domains. We have introduced scalability with respect to depth information. Our proposed scheme is based on the multi-view-plus-depth format, where the center view data are preserved and side views are extracted into enhancement layers depending on depth values. We investigate the performance of various layer assignment strategies: the number of layers, and the distribution of layers in depth, based either on an equal number of pixels or on histogram characteristics. We further consider the consequences of variable distortion due to encoder parameters. The results are evaluated with respect to overall distortion versus bit rate, distortion per enhancement layer, as well as visual quality appearance. Scalability with respect to depth (and views) allows for an increased number of quality steps; the cost is a slight increase in the required capacity for the whole sequence. The main advantage, however, is improved quality for objects close to the viewer, even if the overall quality is worse.
The two-dimensional quality metric Peak Signal-to-Noise Ratio (PSNR) is often used to evaluate the quality of coding schemes for different types of light-field-based 3D images, e.g. integral imaging or multi-view. The metric results in a single accumulated quality value for the whole 3D image. Evaluating single views -- seen from specific viewing angles -- gives a quality matrix that presents the 3D image quality as a function of viewing angle. However, these two approaches do not capture all aspects of the induced distortion in a coded 3D image. We have previously shown coding schemes of similar kind whose coding artifacts are distributed differently with respect to the 3D image's depth. In this paper we propose a novel metric that captures the depth distribution of coding-induced distortion, such that each element in the resulting quality vector corresponds to the quality at a specific depth. First we introduce the proposed full-reference metric and the operations on which it is based. Second, the experimental setup is presented. Finally, the metric is evaluated on a set of differently coded 3D images and the results are compared, both with previously proposed quality metrics and with visual inspection.
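The idea of a quality vector over depth can be sketched by evaluating PSNR separately per depth bin. This is a simplified illustration, not the paper's full-reference metric (which operates on the reconstructed 3D scene); the function name, binning scheme, and 8-bit peak value are assumptions.

```python
import numpy as np

def depth_quality_vector(ref, coded, depth, n_bins=4, peak=255.0):
    """PSNR per depth bin: element k is the PSNR over pixels whose
    depth falls in bin k, exposing how coding distortion is
    distributed over the 3D scene's depth."""
    edges = np.linspace(depth.min(), depth.max(), n_bins + 1)
    q = []
    for k in range(n_bins):
        if k == n_bins - 1:  # last bin is closed on both sides
            mask = (depth >= edges[k]) & (depth <= edges[k + 1])
        else:
            mask = (depth >= edges[k]) & (depth < edges[k + 1])
        if not mask.any():
            q.append(np.nan)  # no pixels at this depth
            continue
        mse = np.mean((ref[mask].astype(float) - coded[mask]) ** 2)
        q.append(np.inf if mse == 0 else 10 * np.log10(peak ** 2 / mse))
    return np.array(q)
```

A coding scheme whose artifacts concentrate far from the display plane would show a clear dip in this vector that an accumulated single-value PSNR hides.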
To provide sufficient 3D depth fidelity, integral imaging (II) requires an increase in spatial resolution of several orders of magnitude over today's 2D images. We have recently proposed a pre-processing and compression scheme for still II-frames based on forming a pseudo video sequence (PVS) from sub-images (SIs), which is then coded using the H.264/MPEG-4 AVC video coding standard. The scheme has shown good performance on a set of reference images. In this paper we first investigate and present how five different ways of selecting the SIs when forming the PVS affect the scheme's compression efficiency. We also study how the II-frame structure relates to the performance of a PVS coding scheme. Finally, we examine the nature of the coding artifacts that are specific to the evaluated PVS schemes. We can conclude that, for all but the most complex reference image, all evaluated SI selection orders significantly outperform JPEG 2000, with compression ratios of up to 342:1 achieved while still keeping PSNR > 30 dB. We can also confirm that when selecting a PVS scheme, the scheme which results in a higher PVS picture resolution should be preferred in order to maximize compression efficiency. Our study of the coded II-frames also indicates that the SI-based PVS, contrary to other PVS schemes, tends to distribute its coding artifacts more homogeneously over all 3D scene depths.
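Forming an SI-based PVS can be sketched directly: sub-image (s, t) collects pixel offset (s, t) from every lenslet of the integral image, and the sub-images are then ordered as the frames of the sequence. A minimal sketch for a grayscale II-frame with a regular lenslet grid (the function name and the simple row-major SI ordering are illustrative assumptions; the paper evaluates five different orderings):

```python
import numpy as np

def subimage_pvs(ii, lens_h, lens_w):
    """Form a pseudo video sequence from an integral image.

    `ii` is a 2-D II-frame whose lenslets are lens_h x lens_w pixels.
    Sub-image (s, t) gathers pixel (s, t) from every lenslet; the SIs
    are stacked row by row as PVS frames of shape (n_u, n_v).
    """
    h, w = ii.shape
    nu, nv = h // lens_h, w // lens_w
    # Split into (lenslet row, offset row, lenslet col, offset col).
    grid = ii.reshape(nu, lens_h, nv, lens_w)
    # One frame per pixel offset, in row-major offset order.
    frames = [grid[:, s, :, t] for s in range(lens_h) for t in range(lens_w)]
    return np.stack(frames)  # shape: (lens_h * lens_w, nu, nv)
```

Because adjacent sub-images differ only by a small parallax shift, the resulting sequence is highly suited to the motion-compensated prediction of a video codec, which is the source of the compression gains reported above.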