Region-based convolutional neural networks (R-CNNs) have recently achieved significant success in object detection, but their accuracy remains limited for small objects and for visually similar objects such as gestures. To address this problem, we present an online hard example testing (OHET) technique that evaluates the confidence of the R-CNN outputs and treats low-confidence outputs as hard examples. Building on OHET, we propose a cascaded network for gesture recognition. First, we use a region-based fully convolutional network (R-FCN), which is well suited to detecting small objects, to detect the gestures, and then apply OHET to select the hard examples. To improve recognition accuracy, we re-classify the hard examples with a VGG-19 classification network to obtain the final output of the gesture recognition system. Comparative experiments with other methods show that the cascaded network combined with OHET achieves state-of-the-art results of 99.3% mAP on small and similar gestures in complex scenes.
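
The cascade described above can be illustrated with a minimal sketch of the routing logic. It is not the authors' implementation: the abstract does not say how OHET scores confidence, so the sketch assumes the detector's class score is compared against a fixed threshold. The names `Detection`, `cascade`, and `reclassify`, and the threshold value 0.9, are hypothetical. The R-FCN detector and the VGG-19 classifier are stood in for by plain Python values and a callable.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), hypothetical box format


@dataclass
class Detection:
    box: Box
    label: str
    score: float  # detector confidence in [0, 1]


def cascade(
    detections: List[Detection],
    reclassify: Callable[[Box], str],
    threshold: float = 0.9,  # assumed value, not taken from the paper
) -> List[Detection]:
    """Route low-confidence detections to a second-stage classifier."""
    final = []
    for det in detections:
        if det.score < threshold:
            # Assumed OHET rule: a low detector score marks a hard example,
            # so the second-stage classifier decides the label for this box.
            final.append(Detection(det.box, reclassify(det.box), det.score))
        else:
            # Confident detections keep the first-stage label.
            final.append(det)
    return final
```

In a full system, `detections` would come from the first-stage detector and `reclassify` would crop the box from the image and run the second-stage classifier on the crop. Only the hard examples pay the cost of the second network.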