Conventional optical video surveillance systems usually just record what they view and cannot interpret it. Large amounts of useless video are stored and transmitted every day, wasting storage space and bandwidth. To reduce overall system cost and increase the practical value of the monitoring system, we use the Kinect sensor, with its CMOS infrared sensor, as a supplement to the traditional video surveillance system and build a natural user interface system for indoor surveillance. This paper discusses the architecture of the natural user interface system, the separation of monitored objects from complex backgrounds, and user behavior analysis algorithms. The system analyzes the monitored object rather than relying on a command language grammar, so when the monitored object needs immediate help, the natural user interface system sends help information. We also introduce a method for combining the new system with a traditional monitoring system. Theoretical analysis and experimental results show that the proposed system is reasonable and efficient. It meets the requirements of non-contact, online, real-time operation with higher precision and rapid response for keeping the situation at the scene under control.