Locating and parsing bibliographical references in HTML medical articles
19 January 2009
Proceedings Volume 7247, Document Recognition and Retrieval XVI; 724708 (2009); doi: 10.1117/12.805946
Event: IS&T/SPIE Electronic Imaging, 2009, San Jose, California, United States
Abstract
Bibliographical references that appear in journal articles can provide valuable hints for subsequent information extraction. We describe our statistical machine learning algorithms for locating and parsing such references in HTML medical journal articles. Reference locating identifies the reference sections and then decomposes them into individual references. We formulate reference locating as a two-class classification problem based on text and geometric features. An evaluation conducted on 500 articles from 100 journals achieves near-perfect precision and recall for locating references. Reference parsing identifies the components of each individual reference, e.g. author, article title, and journal title. We implement and compare two reference parsing algorithms. One relies on sequence statistics and trains a Conditional Random Field. The other focuses on local feature statistics: it trains a Support Vector Machine to classify each individual word, and a search algorithm then systematically corrects low-confidence labels if the label sequence violates a set of predefined rules. The overall performance of the two reference parsing algorithms is about the same: above 99% accuracy at the word level and over 97% accuracy at the chunk level.
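The second parsing approach (per-word classification followed by rule-based correction of low-confidence labels) can be illustrated with a minimal sketch. The abstract does not specify the features, the rule set, or the search strategy, so everything below is an assumption made for illustration: the label set, the single rule that each label may form only one contiguous chunk, the confidence margin used to rank words, and the names `violates_rules` and `correct_labels` are all hypothetical. The per-word scores stand in for classifier outputs such as SVM confidences.

```python
from itertools import product

# Hypothetical label set; the paper's actual set is not given in the abstract.
LABELS = ("author", "title", "journal")


def violates_rules(labels):
    """Assumed rule: each label may form only one contiguous chunk."""
    seen = set()
    prev = None
    for lab in labels:
        if lab != prev:
            if lab in seen:  # label reappears after a different label
                return True
            seen.add(lab)
            prev = lab
    return False


def correct_labels(scores, max_flips=3):
    """Relabel the least confident words until the rules are satisfied.

    scores: one dict per word, mapping each label to a confidence value
    (a stand-in for per-word classifier output).
    """
    best = [max(s, key=s.get) for s in scores]
    if not violates_rules(best):
        return best

    def margin(s):
        # Gap between the top score and the runner-up; small gap = low confidence.
        top, second = sorted(s.values(), reverse=True)[:2]
        return top - second

    order = sorted(range(len(scores)), key=lambda i: margin(scores[i]))
    # Widen the set of words allowed to change, least confident first.
    for k in range(1, max_flips + 1):
        candidates = order[:k]
        best_fix, best_total = None, float("-inf")
        for combo in product(LABELS, repeat=k):
            trial = list(best)
            for i, lab in zip(candidates, combo):
                trial[i] = lab
            if violates_rules(trial):
                continue
            total = sum(scores[i][lab] for i, lab in enumerate(trial))
            if total > best_total:
                best_fix, best_total = trial, total
        if best_fix is not None:
            return best_fix
    return best  # no rule-satisfying relabeling found within max_flips


# Made-up scores for six words; the fourth word is ambiguous.
scores = [
    {"author": 0.90, "title": 0.05, "journal": 0.05},
    {"author": 0.80, "title": 0.10, "journal": 0.10},
    {"author": 0.05, "title": 0.90, "journal": 0.05},
    {"author": 0.40, "title": 0.35, "journal": 0.25},
    {"author": 0.10, "title": 0.80, "journal": 0.10},
    {"author": 0.05, "title": 0.05, "journal": 0.90},
]
print(correct_labels(scores))
```

In this made-up example the fourth word is first labeled as an author, which splits the title chunk; the search relabels only that low-confidence word and leaves the confident labels untouched.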
© (2009) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Jie Zou, Daniel Le, George R. Thoma, "Locating and parsing bibliographical references in HTML medical articles", Proc. SPIE 7247, Document Recognition and Retrieval XVI, 724708 (19 January 2009); doi: 10.1117/12.805946; https://doi.org/10.1117/12.805946
PROCEEDINGS
12 PAGES


KEYWORDS
Feature extraction
Machine learning
Statistical modeling
Binary data
Associative arrays
Genetic algorithms
Crystals
