7 February 2011
Title identification of web article pages using HTML and visual features
Extracting informative content from Web article pages has many applications, such as printing and content reuse. The title is a significant and unique component of an article. However, identifying the true title is not an easy problem, even for human readers. In this paper, we present a title identification method that takes into account several features, including the title field of the HTML page and the HTML tag of a DOM node, as well as font size and horizontal alignment. We tested our method on a ground truth data set consisting of 1993 pages from 98 web sites and achieved 97.5% accuracy, about 20% above a baseline method based only on font size.
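To make the idea of combining these features concrete, the following is a minimal sketch, not the authors' implementation. The abstract names the four features but gives no formula, so the linear scoring, the weights, the tag priors, the alignment heuristic, and the Candidate structure are all assumptions chosen for illustration. It also includes the font-size-only baseline for comparison.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Candidate:
    """A rendered DOM text node (hypothetical structure, not from the paper)."""
    text: str
    tag: str          # HTML tag name, e.g. "h1" or "div"
    font_size: float  # rendered font size in pixels
    left: float       # left edge of the node's box in pixels
    width: float      # width of the node's box in pixels


# Assumed values for illustration only; the abstract does not give any.
TAG_PRIOR = {"h1": 1.0, "h2": 0.7, "h3": 0.4}
WEIGHTS = {"title_field": 0.4, "tag": 0.2, "font": 0.3, "align": 0.1}


def title_field_similarity(text: str, title_field: str) -> float:
    """Similarity in [0, 1] between node text and the page's <title> string."""
    a = text.lower().strip()
    b = title_field.lower().strip()
    return SequenceMatcher(None, a, b).ratio()


def alignment_score(c: Candidate, page_width: float) -> float:
    """Assumed heuristic: reward nodes that are centered or start at the left margin."""
    center_offset = abs(c.left + c.width / 2 - page_width / 2) / page_width
    left_offset = c.left / page_width
    return max(0.0, 1.0 - 2.0 * min(center_offset, left_offset))


def score(c: Candidate, title_field: str, max_font: float, page_width: float) -> float:
    """Weighted sum of the four features named in the abstract."""
    return (
        WEIGHTS["title_field"] * title_field_similarity(c.text, title_field)
        + WEIGHTS["tag"] * TAG_PRIOR.get(c.tag, 0.0)
        + WEIGHTS["font"] * c.font_size / max_font
        + WEIGHTS["align"] * alignment_score(c, page_width)
    )


def identify_title(candidates, title_field: str, page_width: float) -> Candidate:
    """Return the candidate with the highest combined score."""
    max_font = max(c.font_size for c in candidates)
    return max(candidates, key=lambda c: score(c, title_field, max_font, page_width))


def font_size_baseline(candidates) -> Candidate:
    """Baseline: pick the node with the largest font size."""
    return max(candidates, key=lambda c: c.font_size)
```

On a made-up page where a site banner is set in a larger font than the headline, font_size_baseline returns the banner, while identify_title can still select the headline because the title field and the h1 tag both support it. Real weights would need to be tuned on labeled pages.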
© (2011) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Jian Fan, Ping Luo, Parag Joshi, "Title identification of web article pages using HTML and visual features", Proc. SPIE 7879, Imaging and Printing in a Web 2.0 World II, 78790K (7 February 2011); doi: 10.1117/12.876708; https://doi.org/10.1117/12.876708
