Person detection, tracking and following is a key enabling technology for mobile robots in many human–robot interaction applications. In this article, we present a system which is composed of visual human detection, video tracking and following. The detection is based on YOLO(You only look once), which applies a single convolution neural network(CNN) to the full image, thus can predict bounding boxes and class probabilities directly in one evaluation. Then the bounding box provides initial person position in image to initialize and train the KCF(Kernelized Correlation Filter), which is a video tracker based on discriminative classifier. At last, by using a stereo 3D sparse reconstruction algorithm, not only the position of the person in the scene is determined, but also it can elegantly solve the problem of scale ambiguity in the video tracker. Extensive experiments are conducted to demonstrate the effectiveness and robustness of our human detection and tracking system.