This paper presents a Kinect–stereo camera fusion system that significantly improves the accuracy of depth map acquisition. A depth map acquired from a single Kinect typically suffers from missing depth values and depth errors. To mitigate these problems, the proposed system couples a Kinect with a stereo RGB camera, which provides an additional disparity map. The Kinect depth map and the stereo disparity map are fused in real time using a spatiotemporal Markov random field framework implemented on a graphics processing unit. An efficient temporal data cost is proposed to maintain temporal coherence between frames. We demonstrate the performance of the proposed system on challenging real-world examples. Experimental results confirm that the proposed system acquires depth video robustly and accurately.
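
For orientation, a fusion of this kind can be written as a generic per-frame Markov random field energy. The form below is only an illustrative sketch under stated assumptions, not the formulation used in the paper: the symbols $D^{K}_p$ (Kinect data cost), $D^{S}_p$ (stereo data cost), $D^{T}_p$ (temporal data cost), $V$ (pairwise smoothness cost), the weights $\lambda_T$ and $\lambda_S$, and the neighborhood $\mathcal{N}$ are all hypothetical names introduced here.

```latex
% Illustrative generic energy; every symbol is an assumption, not the paper's notation.
% d^t     : depth (or disparity) labeling of frame t
% d^{t-1} : labeling estimated for the previous frame
\begin{equation}
E\left(d^{t}\right)
  = \sum_{p} \Big[ D^{K}_{p}\left(d^{t}_{p}\right)
                 + D^{S}_{p}\left(d^{t}_{p}\right)
                 + \lambda_{T}\, D^{T}_{p}\left(d^{t}_{p};\, d^{t-1}\right) \Big]
  + \lambda_{S} \sum_{(p,q) \in \mathcal{N}} V\left(d^{t}_{p},\, d^{t}_{q}\right)
\end{equation}
```

In this sketch the temporal term is unary: it penalizes labels at pixel $p$ that disagree with the previous frame's estimate, so temporal coherence enters through the data cost rather than through extra edges between frames. Minimizing $E$ over $d^{t}$ yields the fused depth map for frame $t$.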