10 February 2010 A case study on rule-based and CRF-based author extraction methods
Information extraction (IE) is the task of automatically extracting structured information from unstructured documents. A typical application of IE is to process a set of documents written in a natural language and populate a database with the extracted information. This paper presents a case study on author extraction from unstructured documents. A rule-based method and a conditional random field (CRF)-based method are implemented for this task. The rule-based method involves defining a set of heuristic rules and leveraging prior knowledge of author names and affiliations to identify metadata. The CRF-based method involves preparing a labeled training dataset, defining a set of feature functions, learning a CRF model, and applying the model to label new documents. We evaluate and compare the performance of the two methods through experiments, and offer guidance for application developers on choosing between heuristics and formal methods when addressing real-world information extraction problems.
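To make the contrast between the two approaches concrete, the following is a minimal Python sketch and not the authors' implementation. The name lexicon, the affiliation keywords, the rule thresholds, and the function names `looks_like_author_line` and `token_features` are all hypothetical, chosen only for illustration. The first function shows the kind of heuristic rule a rule-based extractor might apply to a line of text; the second shows the kind of per-token feature function that could be supplied to a CRF toolkit alongside labeled training data.

```python
import re

# Hypothetical prior knowledge (assumed for illustration, not from the paper).
KNOWN_GIVEN_NAMES = {"shengwen", "yuhong", "john", "mary"}
AFFILIATION_KEYWORDS = {"university", "institute", "laboratory", "labs", "department"}

# A name-like token: a capitalized word ("Yang") or an initial ("J.").
NAME_TOKEN = re.compile(r"^[A-Z][a-z]+$|^[A-Z]\.$")


def looks_like_author_line(line):
    """Rule-based sketch: accept a short line made only of name-like tokens
    that contains at least one given name from the lexicon."""
    tokens = [t for t in re.findall(r"[A-Za-z][A-Za-z.\-]*", line)
              if t.lower() != "and"]
    if not tokens or len(tokens) > 12:
        return False
    if any(t.lower() in AFFILIATION_KEYWORDS for t in tokens):
        return False
    if not all(NAME_TOKEN.match(t) for t in tokens):
        return False
    return any(t.lower() in KNOWN_GIVEN_NAMES for t in tokens)


def token_features(tokens, i):
    """CRF-style sketch: describe token i and its neighbors as a feature dict.
    A CRF learner would weight such features using labeled examples."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_initial": bool(re.fullmatch(r"[A-Z]\.", tok)),
        "in_name_lexicon": tok.lower() in KNOWN_GIVEN_NAMES,
        "is_affiliation_word": tok.lower() in AFFILIATION_KEYWORDS,
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }
```

In this sketch the rule makes a hard yes/no decision from hand-written conditions, whereas the feature function only describes each token and leaves the labeling decision to a model trained on annotated documents.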
© (2010) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Shengwen Yang and Yuhong Xiong, "A case study on rule-based and CRF-based author extraction methods", Proc. SPIE 7540, Imaging and Printing in a Web 2.0 World; and Multimedia Content Access: Algorithms and Systems IV, 754005 (10 February 2010).
