Sign language is conveyed primarily through changes in hand posture. However, traditional colour-based detection methods are easily affected by complex backgrounds, skin tones, and other parts of the body. To overcome these problems, this article adopts an RGB-D based method to detect the gesture area in the video. Key frames of the sign language sequence are then extracted adaptively, according to changes in the gesture area, so that the problem is converted into obtaining standard static gesture images. The identification results are then sent to a NAO robot, which completes the human-robot interaction. Experimental results show that combining colour space with a depth threshold greatly reduces the influence of complex backgrounds and skin-coloured regions, and that key frame extraction provides a solid foundation for improving the hand gesture recognition rate.
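The two processing steps summarized in the abstract can be illustrated with a minimal sketch. It is not the article's implementation: the choice of the YCrCb colour space, the Cr/Cb bounds, the depth window (`near_mm`, `far_mm`), the stability rule (`rel_tol`, `min_run`), and all function names are assumptions made purely for illustration. The first part keeps only pixels that are both skin-coloured and inside a depth window; the second picks one key frame from each run of frames in which the gesture area stays nearly constant.

```python
import numpy as np

def skin_mask_ycrcb(rgb):
    """Boolean skin mask for an (H, W, 3) uint8 RGB image.

    Assumption: fixed Cr/Cb bounds commonly used for skin detection;
    the abstract does not state which colour space or bounds are used.
    """
    rgb = rgb.astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    # BT.601 full-range conversion from RGB to the Cr and Cb channels
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    return (cr >= 133) & (cr <= 173) & (cb >= 77) & (cb <= 127)

def gesture_mask(rgb, depth_mm, near_mm=400, far_mm=900):
    """Keep pixels that are skin-coloured AND inside the depth window.

    Assumption: depth is given in millimetres and the hand lies
    between near_mm and far_mm from the sensor (hypothetical values).
    """
    in_range = (depth_mm >= near_mm) & (depth_mm <= far_mm)
    return skin_mask_ycrcb(rgb) & in_range

def key_frames(areas, rel_tol=0.05, min_run=3):
    """Return the middle index of each stable run of gesture areas.

    A frame counts as stable when its area differs from the previous
    frame's area by at most rel_tol (relative). Runs shorter than
    min_run frames are ignored. This rule is an illustrative
    assumption, not the article's adaptive criterion.
    """
    areas = np.asarray(areas, dtype=np.float64)
    keys, start = [], None
    for i in range(1, len(areas)):
        prev = areas[i - 1]
        stable = prev > 0 and abs(areas[i] - prev) / prev <= rel_tol
        if stable:
            if start is None:
                start = i - 1          # run begins at the previous frame
        else:
            if start is not None and i - start >= min_run:
                keys.append((start + i - 1) // 2)
            start = None
    if start is not None and len(areas) - start >= min_run:
        keys.append((start + len(areas) - 1) // 2)
    return keys
```

In use, `gesture_mask(rgb, depth).sum()` would give the gesture area of one frame, and the list of per-frame areas would be passed to `key_frames`. Requiring both colour and depth agreement is what suppresses skin-coloured background objects and the signer's face, which normally lie outside the depth window of the hand.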