Most existing supervised machine learning frameworks assume that the training samples contain no labeling mistakes or misinterpretations. This assumption may not hold in practical applications: when human annotators provide the training samples, the training set may contain errors. In this paper, we study the effect of imperfect training samples on the supervised machine learning framework. We focus on the mathematical framework that describes learnability from noisy training data, and we study theorems that estimate the error bounds of the generated models and the required number of training samples. These error bounds and sample requirements depend on the amount of training data and on the probability that a training label is correct. Building on the learnability of models from imperfect annotations, we describe an autonomous learning framework that uses cross-modality information to learn concept models. For instance, visual concept models can be trained on the detection results of Automatic Speech Recognition, Closed Captions, or prior detection results from the same modality. These detection results on an unlabeled training set serve as imperfect labels for the models to be built. We have built a prototype system based on this learning technique, and experiments with it show promising results.