There is strong demand for automated tools that extract pertinent information from the biomedical
literature, a rich, complex, and rapidly growing resource that is increasingly accessed via the web. This paper
presents a hybrid method based on contextual and statistical information to automatically identify two MEDLINE
citation terms, NIH grant numbers and databank accession numbers, in HTML-formatted online biomedical
documents. Detecting these terms is challenging because their formats vary and are often inconsistent (although
recommended formats exist) and because they resemble other technical or biological terms. Our proposed
method first extracts potential candidates for these terms using a rule-based method. Each candidate is then scored, and
the final candidates are submitted to a human operator for verification. The confidence score for each term is calculated
from statistical, morphological, and contextual information. Experiments conducted on more than ten
thousand HTML-formatted online biomedical documents show that most NIH grant numbers and databank accession
numbers can be successfully identified by the proposed method, with recall rates of 99.8% and 99.6%, respectively.
However, owing to a high false alarm rate, the proposed method yields F-measures of only 86.6% and 87.9% for NIH
grant numbers and databank accession numbers, respectively.
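
The gap between the recall and F-measure figures above can be made concrete with a short calculation. The sketch below is illustrative only: it assumes that the reported F-measure is the balanced F1 score, F = 2PR / (P + R), which the abstract does not state explicitly, and it solves that formula for the precision P. The function name precision_from_f1 is hypothetical and is not part of the proposed method.

```python
def precision_from_f1(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for the precision P.

    Assumes the balanced F1 definition; values are fractions in (0, 1].
    """
    denominator = 2.0 * recall - f1
    if denominator <= 0.0:
        # No valid precision exists: F1 can never exceed 2 * recall.
        raise ValueError("F1 is inconsistent with the given recall")
    return (f1 * recall) / denominator


# Figures taken from the abstract: (F-measure, recall) per term type.
reported = {
    "NIH grant numbers": (0.866, 0.998),
    "databank accession numbers": (0.879, 0.996),
}

for term, (f1, recall) in reported.items():
    precision = precision_from_f1(f1, recall)
    print(f"{term}: implied precision = {precision:.1%}")
```

Under this assumption, the implied precision is roughly 76.5% for NIH grant numbers and 78.7% for databank accession numbers. These are derived values, not results reported by the experiments, but they are consistent with the high false alarm rate noted above and with the decision to submit final candidates to a human operator.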