7 February 2011
Title identification of web article pages using HTML and visual features
Extracting informative content from Web article pages has many applications, such as printing and content reuse. The title is a significant and unique component of an article. However, identifying the true title is not an easy problem, even for human readers. In this paper, we present a title identification method that takes into account several features, including the title field of the HTML page and the HTML tag of a DOM node, as well as font size and horizontal alignment. We tested our method on a ground truth data set consisting of 1993 pages from 98 web sites and achieved 97.5% accuracy, about 20% above a baseline method based only on font size.
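As a rough illustration of how the four features named above might be combined, the following Python sketch scores each candidate DOM node and picks the highest-scoring one. It is not the method from the paper: the `Candidate` fields, the `TAG_WEIGHT` table, the feature weights, and the way alignment is measured (distance of the node's left edge from a hypothetical article body column) are all assumptions chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Candidate:
    """A rendered DOM text node (hypothetical representation)."""
    text: str
    tag: str          # HTML tag name, e.g. "h1"
    font_size: float  # rendered font size in pixels
    left: float       # left edge of the node in pixels


# Assumed prior for heading tags; any other tag gets 0.0.
TAG_WEIGHT = {"h1": 1.0, "h2": 0.7, "h3": 0.4}


def score(c, title_field, max_font, body_left, page_width):
    """Weighted sum of the four features; the weights are illustrative."""
    # Similarity between the node text and the page's <title> field.
    sim = SequenceMatcher(None, c.text.lower(), title_field.lower()).ratio()
    tag = TAG_WEIGHT.get(c.tag.lower(), 0.0)
    font = c.font_size / max_font if max_font > 0 else 0.0
    # Alignment: 1.0 when the node starts at the article body's left edge.
    align = 1.0 - min(abs(c.left - body_left) / page_width, 1.0)
    return 0.4 * sim + 0.2 * tag + 0.3 * font + 0.1 * align


def pick_title(candidates, title_field, body_left, page_width):
    """Return the candidate with the highest combined score."""
    max_font = max(c.font_size for c in candidates)
    return max(
        candidates,
        key=lambda c: score(c, title_field, max_font, body_left, page_width),
    )


def pick_title_by_font(candidates):
    """Font-size-only baseline: the largest text wins."""
    return max(candidates, key=lambda c: c.font_size)
```

On a page where a site banner is rendered in larger type than the headline, `pick_title_by_font` would return the banner, while `pick_title` can still prefer the headline because it also matches the title field and uses a heading tag.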
© (2011) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Jian Fan, Ping Luo, Parag Joshi, "Title identification of web article pages using HTML and visual features", Proc. SPIE 7879, Imaging and Printing in a Web 2.0 World II, 78790K (7 February 2011); doi: 10.1117/12.876708; https://doi.org/10.1117/12.876708
