Scene text detection is an important step in text-based information extraction systems. The problem is challenging because of variations in text size, unknown text colors, and complex backgrounds. We present a novel algorithm for robustly detecting text in scene images. To segment candidate text connected components (CCs) from an image, a text region detector first estimates a text probability map that encodes text position and scale information. To filter out non-text CCs, we use a hierarchical model consisting of two cascaded classifiers. The first-stage classifier estimates a text probability for each CC from unary component features. The second-stage classifier is trained on both these probability features and similarity features. Because the method is learning-based, it requires very few manually tuned parameters. Experimental results on the public ICDAR benchmark dataset show that our algorithm outperforms other state-of-the-art methods.
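The two-stage CC filtering cascade described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the classifier type, the unary features, or the similarity measure, so the logistic-regression stages, the `Component` fields, the `similarity` function, the stage-2 feature vector, and all weights below are hypothetical placeholders chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Component:
    """A candidate connected component (CC); all fields are hypothetical."""
    features: List[float]  # unary features (assumed, e.g. shape/stroke cues)
    height: float          # bounding-box height in pixels (must be > 0)
    mean_color: float      # mean gray level in [0, 255]
    x: float               # bounding-box center, x
    y: float               # bounding-box center, y


def logistic(weights: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """Probability from a linear logistic classifier (assumed classifier type)."""
    z = bias + sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def similarity(a: Component, b: Component) -> float:
    """Hypothetical similarity in [0, 1]: height ratio x color closeness x proximity gate."""
    height_ratio = min(a.height, b.height) / max(a.height, b.height)
    color_close = 1.0 - min(abs(a.mean_color - b.mean_color) / 255.0, 1.0)
    dist = math.hypot(a.x - b.x, a.y - b.y)
    is_near = 1.0 if dist <= 3.0 * max(a.height, b.height) else 0.0
    return height_ratio * color_close * is_near


def cascade_filter(ccs, w1, b1, w2, b2, threshold=0.5):
    """Return a keep/reject flag per CC from a two-stage classifier cascade."""
    # Stage 1: per-CC text probability from unary features only.
    p1 = [logistic(w1, b1, cc.features) for cc in ccs]
    keep = []
    for i, cc in enumerate(ccs):
        # Stage 2 input: own stage-1 probability plus similarity-based context
        # from the other CCs (this feature design is an assumption).
        pairs = [(similarity(cc, other), p1[j]) for j, other in enumerate(ccs) if j != i]
        support = max((s * p for s, p in pairs), default=0.0)
        best_sim = max((s for s, _ in pairs), default=0.0)
        p2 = logistic(w2, b2, [p1[i], support, best_sim])
        keep.append(p2 >= threshold)
    return keep


# Toy example with hand-set (hypothetical) weights: two similar, nearby
# letter-like CCs and one isolated, dissimilar blob.
ccs = [
    Component([1.0, 0.0], height=20, mean_color=30, x=0, y=0),
    Component([1.0, 0.0], height=22, mean_color=35, x=25, y=0),
    Component([0.0, 1.0], height=80, mean_color=200, x=500, y=300),
]
result = cascade_filter(ccs, w1=[2.0, -1.0], b1=0.0, w2=[2.0, 2.0, 1.0], b2=-2.5)
print(result)  # [True, True, False]
```

In this sketch the second stage sees context that unary features cannot: a CC's final score rises when a similar, nearby CC also scored highly at stage 1, which is one plausible way to combine probability and similarity features in a cascade.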