Investigators are people who are listed as members of corporate organizations but not entered as authors in an article.
Beginning with journals published in 2008, investigator names are required to be included in a new bibliographic field in
MEDLINE citations. Automatic extraction of investigator names is necessary because collaborative biomedical research is
increasing and, with it, the number of such names. We implemented two discriminative SVM
models, i.e., SVM and structural SVM, to identify named entities such as the first and last names of investigators from
online medical journal articles. Both approaches achieve good performance at the word and name chunk levels. We
further conducted an error analysis and found that SVM and structural SVM can offer complementary information about
the patterns to be classified. Hence, we combined the two independently trained classifiers, using the SVM as the
base learner and enhancing its outputs with the predictions from the structural SVM. The overall performance, especially
the recall of investigator name retrieval, exceeds that of the standalone SVM model.

"Investigator Names" is a newly required field in MEDLINE citations. It consists of personal names listed as members
of corporate organizations in an article. Extracting investigator names automatically is necessary because of the
increasing volume of articles reporting collaborative biomedical research in which a large number of investigators
participate. In this paper, we present an SVM-based stacked sequential learning method in a novel application:
recognizing named entities such as the first and last names of investigators from online medical journal articles. Stacked
sequential learning is a meta-learning algorithm that can boost any base learner. It exploits contextual information by
adding the predicted labels of the surrounding tokens as features. We apply this method to tag words in text paragraphs
containing investigator names, and demonstrate that stacked sequential learning improves the performance of a non-sequential
base learner such as an SVM classifier.

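
The feature-augmentation step described above can be illustrated with a minimal sketch. It is not the implementation from the paper. It assumes labels in {-1, +1}, a context window of one token on each side, zero padding at sequence boundaries, and cross-validated first-pass predictions for training the second-level model. To stay dependency-free it uses a nearest-centroid classifier as a stand-in for the SVM base learner. All function names (`fit_nearest_centroid`, `add_neighbor_labels`, `cross_val_predictions`, `fit_stacked`) and the toy data are hypothetical.

```python
from typing import Callable, List, Sequence

Token = List[float]                   # feature vector of one word
Predictor = Callable[[Token], int]    # maps a token to a label in {-1, +1}
Learner = Callable[[Sequence[Token], Sequence[int]], Predictor]


def fit_nearest_centroid(x: Sequence[Token], y: Sequence[int]) -> Predictor:
    """Stand-in base learner (NOT an SVM): predict the class with the closer mean."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [xi for xi, yi in zip(x, y) if yi == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(t: Token) -> int:
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(t, centroids[c])),
        )

    return predict


def add_neighbor_labels(x: Sequence[Token], y_hat: Sequence[int],
                        window: int = 1) -> List[Token]:
    """Append the predicted labels of the surrounding tokens to each token's features."""
    n = len(x)
    out = []
    for i in range(n):
        context = [
            float(y_hat[j]) if 0 <= j < n else 0.0   # 0.0 pads positions outside the text
            for j in range(i - window, i + window + 1)
            if j != i                                 # neighbors only, not the token itself
        ]
        out.append(list(x[i]) + context)
    return out


def cross_val_predictions(fit: Learner, x: Sequence[Token], y: Sequence[int],
                          folds: int = 2) -> List[int]:
    """Label every token with a model that was trained without that token."""
    n = len(x)
    y_hat = [0] * n
    for k in range(folds):
        train = [i for i in range(n) if i % folds != k]
        predict = fit([x[i] for i in train], [y[i] for i in train])
        for i in range(k, n, folds):                  # held-out tokens of fold k
            y_hat[i] = predict(x[i])
    return y_hat


def fit_stacked(fit: Learner, x: Sequence[Token], y: Sequence[int],
                window: int = 1, folds: int = 2):
    """Train a base model and a second model that also sees neighbor predictions."""
    base = fit(x, y)
    y_hat = cross_val_predictions(fit, x, y, folds)
    stacked = fit(add_neighbor_labels(x, y_hat, window), y)

    def predict_sequence(x_new: Sequence[Token]) -> List[int]:
        first_pass = [base(t) for t in x_new]
        return [stacked(t) for t in add_neighbor_labels(x_new, first_pass, window)]

    return predict_sequence


# Toy sequence: one feature per word, +1 = part of a name, -1 = other text.
x = [[0.1], [0.9], [0.8], [0.2], [0.7], [0.9], [0.1], [0.2]]
y = [-1, 1, 1, -1, 1, 1, -1, -1]
tagger = fit_stacked(fit_nearest_centroid, x, y)
labels = tagger([[0.15], [0.85], [0.8]])
```

Any classifier with the same fit-then-predict shape, including an SVM, could be passed in place of `fit_nearest_centroid`; the stacking logic does not depend on the choice of base learner.
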
Traditional classifiers are trained from labeled data only. Labeled samples are often expensive to obtain, while unlabeled
data are abundant. Semi-supervised learning can therefore be of great value by using both labeled and unlabeled data for
training. We introduce a semi-supervised learning method named decision-directed approximation combined with
Support Vector Machines to detect zones containing information on grant support (a type of bibliographic data) from
online medical journal articles. We analyzed the performance of our model with unlabeled sample sets of different sizes
and demonstrated that our proposed rules are effective in boosting classification accuracy. The experimental results show
that the decision-directed approximation method with SVM improves the classification accuracy when a small amount of
labeled data is used in conjunction with unlabeled data to train the SVM.
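
A rough sketch of how a decision-directed loop around an SVM might look is given below. It is not the method from the paper, whose specific rules are not stated here. It reads decision-directed approximation in its classic sense: train on the labeled data, let the current classifier label the unlabeled samples, retrain on both, and repeat. The score threshold `min_score` for accepting a pseudo-label, the number of rounds, the pure-Python linear SVM trainer, and all function names and toy data are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def decision(w: Vector, b: float, x: Vector) -> float:
    """Signed score w.x + b; its sign is the predicted label."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(x: Sequence[Vector], y: Sequence[int], epochs: int = 200,
                     lam: float = 0.01, lr: float = 0.1) -> Tuple[Vector, float]:
    """Linear SVM fitted by plain subgradient descent on the regularized hinge loss."""
    w = [0.0] * len(x[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(x, y):
            violated = yi * decision(w, b, xi) < 1.0
            w = [wj * (1.0 - lr * lam) for wj in w]        # step on the L2 penalty
            if violated:                                    # hinge term is active
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def decision_directed_svm(x_lab: Sequence[Vector], y_lab: Sequence[int],
                          x_unlab: Sequence[Vector], rounds: int = 5,
                          min_score: float = 1.0) -> Tuple[Vector, float]:
    """Retrain repeatedly on labeled data plus confidently pseudo-labeled samples."""
    w, b = train_linear_svm(x_lab, y_lab)
    for _ in range(rounds):
        pseudo_x, pseudo_y = [], []
        for xi in x_unlab:
            score = decision(w, b, xi)
            if abs(score) >= min_score:                     # assumed acceptance rule
                pseudo_x.append(xi)
                pseudo_y.append(1 if score > 0 else -1)
        w, b = train_linear_svm(list(x_lab) + pseudo_x, list(y_lab) + pseudo_y)
    return w, b


# Toy data: two labeled samples and four unlabeled ones on a line.
x_lab, y_lab = [[-2.0], [2.0]], [-1, 1]
x_unlab = [[-3.0], [-1.5], [1.5], [3.0]]
w, b = decision_directed_svm(x_lab, y_lab, x_unlab)


def predict(x: Vector) -> int:
    return 1 if decision(w, b, x) > 0 else -1
```

Pseudo-labels are recomputed from scratch in every round, so a sample that was labeled wrongly early on can be dropped or relabeled once the classifier improves.
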