Automatic extraction of buildings from remote sensing data is an attractive research topic, useful for several applications such as cadastre and urban planning. It remains a challenging task, mainly due to the inherent artifacts of the data used and the differences in viewpoint, surrounding environment, and the complex shapes and sizes of buildings. This paper introduces an efficient deep learning framework based on convolutional neural networks (CNNs) for building extraction from orthoimages. In contrast to conventional deep approaches, in which only the raw image data are fed as input to the deep neural network, this paper exploits height information as an additional feature, derived by applying a dense image matching algorithm. The test sites are several complex urban regions located in Vaihingen, Germany, and Perissa, Greece, covering various building types, pixel resolutions, and data types. The method is evaluated using the rates of completeness, correctness, and quality, and is compared with conventional and other “shallow” learning paradigms such as support vector machines. Experimental results indicate that combining raw image data with height information as input to a deep CNN model shows potential for building detection in terms of robustness, flexibility, and efficiency.
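The three evaluation rates named above can be illustrated with a minimal sketch. It assumes the commonly used pixel-based definitions (completeness = TP / (TP + FN), correctness = TP / (TP + FP), quality = TP / (TP + FP + FN)); the abstract does not spell out the formulas, so this is an illustration under that assumption and not necessarily the exact evaluation protocol of the paper. The function name `building_metrics` and the toy masks are hypothetical.

```python
def building_metrics(predicted, reference):
    """Pixel-based completeness, correctness, and quality.

    `predicted` and `reference` are equally sized 2-D lists of 0/1 values,
    where 1 marks a building pixel. Hypothetical helper for illustration;
    it assumes the standard pixel-based definitions of the three rates.
    """
    tp = fp = fn = 0
    for pred_row, ref_row in zip(predicted, reference):
        for p, r in zip(pred_row, ref_row):
            if p and r:
                tp += 1          # building pixel correctly detected
            elif p and not r:
                fp += 1          # detected, but not a building in the reference
            elif r and not p:
                fn += 1          # building in the reference that was missed

    # Guard against empty denominators (e.g. no building pixels at all).
    completeness = tp / (tp + fn) if (tp + fn) else 0.0
    correctness = tp / (tp + fp) if (tp + fp) else 0.0
    quality = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return completeness, correctness, quality


# Toy 2x3 masks: 2 true positives, 1 false positive, 1 false negative.
predicted = [[1, 1, 0],
             [0, 1, 0]]
reference = [[1, 0, 0],
             [0, 1, 1]]
completeness, correctness, quality = building_metrics(predicted, reference)
print(completeness, correctness, quality)
```

Under these definitions, quality penalizes both missed and spurious building pixels, so it is never higher than either completeness or correctness.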