Semantic scene parsing is important in many intelligent-system applications, including perceptual robotics. Over the past few years, pixel-wise prediction tasks such as semantic segmentation of RGB images have been studied extensively and have reached remarkable accuracy, thanks to convolutional neural networks (CNNs) and large scene datasets. With the development of stereo cameras and RGBD sensors, additional depth information is expected to help improve accuracy. In this paper, we propose a semantic segmentation framework that incorporates RGB and complementary depth information. Motivated by the success of fully convolutional networks (FCNs) in semantic segmentation, we design a fully convolutional network consisting of two branches, which extract features from RGB and depth data simultaneously and fuse them as the network goes deeper. Instead of aggregating multiple models, our goal is to utilize RGB and depth data more effectively in a single model. We evaluate our approach on the NYU-Depth V2 dataset, which consists of 1449 cluttered indoor scenes, and achieve results competitive with state-of-the-art methods.
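
To make the two-branch idea concrete, the following is a minimal, illustrative sketch in plain Python and not the network evaluated in this paper. It assumes 1x1 convolutions with random weights, element-wise summation as the fusion operator, and fusion after every stage; the abstract specifies none of these details, and the names `conv1x1`, `fuse`, and `segment` are hypothetical. It only shows how RGB and depth features can be extracted in parallel, merged progressively, and turned into one label per pixel within a single model.

```python
import random


def conv1x1(fmap, weights, relu=True):
    """Per-pixel linear map (a 1x1 convolution) over fmap[h][w][c_in].

    weights has shape [c_out][c_in]; spatial size is preserved, as in an FCN.
    """
    out = []
    for row in fmap:
        out_row = []
        for pixel in row:
            values = [sum(w * x for w, x in zip(w_row, pixel)) for w_row in weights]
            if relu:
                values = [max(0.0, v) for v in values]
            out_row.append(values)
        out.append(out_row)
    return out


def fuse(rgb_feat, depth_feat):
    """Assumed fusion rule: element-wise sum of same-shaped feature maps."""
    return [
        [[a + b for a, b in zip(p, q)] for p, q in zip(rgb_row, depth_row)]
        for rgb_row, depth_row in zip(rgb_feat, depth_feat)
    ]


def rand_weights(rng, c_out, c_in):
    """Random stand-in for learned weights (assumption: no training here)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(c_in)] for _ in range(c_out)]


def segment(rgb, depth, num_classes=4, widths=(8, 8), seed=0):
    """Run both branches stage by stage and return a label per pixel."""
    rng = random.Random(seed)
    rgb_feat, depth_feat = rgb, depth
    c_rgb, c_depth = 3, 1  # RGB has 3 input channels, depth has 1
    for width in widths:
        rgb_feat = conv1x1(rgb_feat, rand_weights(rng, width, c_rgb))
        depth_feat = conv1x1(depth_feat, rand_weights(rng, width, c_depth))
        # Depth features are merged into the RGB branch at every stage.
        rgb_feat = fuse(rgb_feat, depth_feat)
        c_rgb = c_depth = width
    # Linear classifier on the fused features, then per-pixel argmax.
    scores = conv1x1(rgb_feat, rand_weights(rng, num_classes, c_rgb), relu=False)
    return [[max(range(num_classes), key=px.__getitem__) for px in row] for row in scores]


if __name__ == "__main__":
    # Tiny 2x3 synthetic RGB-D input, purely for illustration.
    rgb = [[[0.1 * (i + j), 0.5, 0.2] for j in range(3)] for i in range(2)]
    depth = [[[1.0 + i + j] for j in range(3)] for i in range(2)]
    for row in segment(rgb, depth):
        print(row)
```

The output has the same spatial size as the input, which is the defining property of a fully convolutional design. A practical implementation would use spatial convolutions, pooling, upsampling, and learned weights in a deep learning framework.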