Translator Disclaimer
12 April 2004 Web mining for topics defined by complex and precise predicates
Author Affiliations +
Abstract
The enormous growth of the World Wide Web has made it important to perform resource discovery efficiently for any given topic. Several new techniques have been proposed in the recent years for this kind of topic specific web-mining, and among them a key new technique called focused crawling which is able to crawl topic-specific portions of the web without having to explore all pages. Most existing research on focused crawling considers a simple topic definition that typically consists of one or more keywords connected by an OR operator. However this kind of simple topic definition may result in too many irrelevant pages in which the same keyword appears in a wrong context. In this research we explore new strategies for crawling topic specific portions of the web using complex and precise predicates. A complex predicate will allow the user to precisely specify a topic using Boolean operators such as "AND", "OR" and "NOT". Our work will concentrate on defining a format to specify this kind of a complex topic definition and secondly on devising a crawl strategy to crawl the topic specific portions of the web defined by the complex predicate, efficiently and with minimal overhead. Our new crawl strategy will improve the performance of topic-specific web crawling by reducing the number of irrelevant pages crawled. In order to demonstrate the effectiveness of the above approach, we have built a complete focused crawler called "Eureka" with complex predicate support, and a search engine that indexes and supports end-user searches on the crawled pages.
© (2004) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Ching-Cheng Lee and Sushma Sampathkumar "Web mining for topics defined by complex and precise predicates", Proc. SPIE 5433, Data Mining and Knowledge Discovery: Theory, Tools, and Technology VI, (12 April 2004); https://doi.org/10.1117/12.542359
PROCEEDINGS
8 PAGES


SHARE
Advertisement
Advertisement
Back to Top