Most methods of monocular visual three dimensional (3D) scene recognition involve supervised machine learning. However, these methods often rely on prior knowledge. Specifically, they learn the image scene as part of a training dataset. For this reason, when the sampling equipment or scene is changed, monocular visual 3D scene recognition may fail. To cope with this problem, a new method of unsupervised learning for monocular visual 3D scene recognition is here proposed. First, the image is made using superpixel segmentation based on the CIELAB color space values L, a, and b and on the coordinate values x and y of pixels, forming a superpixel image with a specific density. Second, a spectral clustering algorithm based on the superpixels’ color characteristics and neighboring relationships was used to reduce the dimensions of the superpixel image. Third, the fuzzy distribution density functions representing sky, ground, and façade are multiplied with the segment pixels, where the expectations of these segments are obtained. A preliminary classification of sky, ground, and façade is generated in this way. Fourth, the most accurate classification images of sky, ground, and façade were extracted through the tier-1 wavelet sampling and Manhattan direction feature. Finally, a depth perception map is generated based on the pinhole imaging model and the linear perspective information of ground surface. Here, 400 images of Make3D Image data from the Cornell University website were used to test the algorithm. The experimental results showed that this unsupervised learning method provides a more effective monocular visual 3D scene recognition model than other methods.