Because of now-outdated construction technology, houses that have not been retrofitted since they were built typically fail to meet modern energy performance levels. However, identifying at city scale which houses would benefit most from retrofit solutions is currently a labour-intensive process. This paper presents a system that uses a vehicle-mounted camera to capture images of residential buildings and then performs semantic segmentation to differentiate the components of each captured building. An ensemble of U-Net semantic segmentation models is trained to identify walls, roofs, chimneys, windows and doors in building façade images, and to distinguish window and door instances that are partially visible or obscured from fully visible ones. Results show that the ensemble achieved high accuracy in identifying walls, roofs and chimneys, moderate accuracy in identifying windows, and low accuracy in identifying doors and partially visible or obscured instances of windows and doors. When the U-Net models were retrained to identify doors and windows irrespective of whether instances were partially visible or obscured, a significant rise in door and window identification accuracy was observed. It is believed that a larger training dataset would produce significantly improved results across all classes. The results presented here demonstrate the operational feasibility of the first stage of a process that combines this model with high-resolution thermography and GPS to automate building retrofit evaluations.
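As an illustration only, the two ideas of ensembling segmentation outputs and folding the partially visible or obscured classes into their parent classes can be sketched as follows. This is a minimal sketch under stated assumptions, not the method used in this work: the combination rule (averaging per-pixel class probabilities), the class names and their ordering, and the function names `ensemble_predict` and `merge_partial_classes` are all hypothetical.

```python
import numpy as np

# Hypothetical label set and ordering (an assumption, not the paper's encoding).
CLASSES = ["background", "wall", "roof", "chimney",
           "window", "window_partial", "door", "door_partial"]


def ensemble_predict(prob_maps):
    """Combine several models by averaging their per-pixel class probabilities.

    prob_maps: list of arrays shaped (H, W, C), one per model, where C is
    the number of classes. Returns an (H, W) array of class indices.
    Averaging is one common ensembling rule; it is assumed here.
    """
    mean_probs = np.mean(np.stack(prob_maps, axis=0), axis=0)
    return np.argmax(mean_probs, axis=-1)


def merge_partial_classes(label_map):
    """Relabel partially visible or obscured instances as their parent class."""
    merged = label_map.copy()
    merged[label_map == CLASSES.index("window_partial")] = CLASSES.index("window")
    merged[label_map == CLASSES.index("door_partial")] = CLASSES.index("door")
    return merged
```

Applying `merge_partial_classes` to the ground-truth masks before retraining would yield a single window class and a single door class, which corresponds to identifying doors and windows irrespective of partial visibility.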