Title identification of web article pages using HTML and visual features
7 February 2011
Proceedings Volume 7879, Imaging and Printing in a Web 2.0 World II; 78790K (2011) https://doi.org/10.1117/12.876708
Event: IS&T/SPIE Electronic Imaging, 2011, San Francisco Airport, California, United States
Abstract
Extracting informative content from Web article pages has many applications, such as printing and content reuse. The title is a significant and unique component of an article. However, identifying the true title is not an easy problem, even for human readers. In this paper, we present a title identification method that takes into account several features, including the title field of the HTML page and the HTML tag of a DOM node, as well as font size and horizontal alignment. We tested our method on a ground truth data set consisting of 1993 pages from 98 web sites and achieved 97.5% accuracy, about 20% above a baseline method based only on font size.
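As a rough illustration of how the four features named in the abstract might be combined, the following Python sketch scores candidate DOM nodes and compares the result with a font-size-only baseline. It is not the authors' method: the abstract does not give the scoring rule, so the `Candidate` type, the `TAG_WEIGHT` table, the linear weights, and the use of `difflib.SequenceMatcher` for text similarity are all assumptions made for this example.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Candidate:
    """A candidate DOM node (hypothetical structure for this sketch)."""
    text: str         # visible text of the node
    tag: str          # HTML tag name, e.g. "h1" or "div"
    font_size: float  # rendered font size in pixels
    align: str        # horizontal alignment: "left", "center", or "right"


# Assumed tag weights: heading tags are treated as stronger title evidence.
TAG_WEIGHT = {"h1": 1.0, "h2": 0.7, "h3": 0.5}


def title_similarity(text: str, page_title: str) -> float:
    """Similarity in [0, 1] between node text and the HTML title field."""
    return SequenceMatcher(None, text.lower(), page_title.lower()).ratio()


def score(c: Candidate, page_title: str, max_font: float) -> float:
    """Combine the four features with illustrative (assumed) weights."""
    sim = title_similarity(c.text, page_title)
    tag = TAG_WEIGHT.get(c.tag, 0.0)
    font = c.font_size / max_font if max_font > 0 else 0.0
    align = 1.0 if c.align in ("left", "center") else 0.0
    return 0.4 * sim + 0.2 * tag + 0.3 * font + 0.1 * align


def identify_title(cands: list[Candidate], page_title: str) -> Candidate:
    """Return the highest-scoring candidate under the combined score."""
    max_font = max(c.font_size for c in cands)
    return max(cands, key=lambda c: score(c, page_title, max_font))


def font_size_baseline(cands: list[Candidate]) -> Candidate:
    """Baseline: pick the node with the largest font size."""
    return max(cands, key=lambda c: c.font_size)


# Made-up page: the site banner uses a larger font than the article headline.
page_title = "Scientists discover new planet - Daily News"
candidates = [
    Candidate("Daily News", "div", 40.0, "center"),
    Candidate("Scientists discover new planet", "h1", 28.0, "left"),
    Candidate("Astronomers announced the finding on Monday.", "p", 14.0, "left"),
]

print(identify_title(candidates, page_title).text)
print(font_size_baseline(candidates).text)
```

On this made-up page the combined score selects the article headline, while the font-size baseline selects the site banner. This is the kind of failure that motivates adding the HTML title field and tag information, though the weights here are arbitrary and would need to be tuned or learned on labeled pages.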
© (2011) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Jian Fan, Ping Luo, and Parag Joshi "Title identification of web article pages using HTML and visual features", Proc. SPIE 7879, Imaging and Printing in a Web 2.0 World II, 78790K (7 February 2011); https://doi.org/10.1117/12.876708
Proceedings paper, 5 pages

