There are two types of web crawling strategies deployed by web search engines, viz., the breadth-first search strategy and the ``best''-first search strategy. The breadth-first strategy aims to build a general index of the web covering any conceivable topic by attempting to search a significant portion of the web. The ``best''-first strategy focuses on the retrieval of pages which are relevant to a particular given topic. A crawler using a ``best''-first search strategy is known as a ``focused crawler''.
A focused crawler has two main components: (a) a way to determine whether a particular web page is relevant to the given topic, and (b) a way to determine how to proceed from a known set of pages. An early search engine which deployed the focused crawling strategy was proposed in [1], based on the intuition that relevant pages often contain relevant links. It searches deeper when relevant pages are found, and stops searching at pages which are less relevant to the topic. Unfortunately, such crawlers show an important drawback when the pages about a topic are not directly connected, in which case the crawling might stop prematurely.
This problem is tackled in [3], where reinforcement learning permits credit assignment during the search process, hence allowing off-topic pages to be included in the search path. However, this approach requires a large number of training examples, and the method can only be trained offline. In [2], a set of classifiers is trained on examples to estimate the distance of the current page from the closest on-topic page, but the training procedure is quite complex.
Our focused crawler aims at providing a simpler alternative for overcoming the issue of intermediate pages which are lowly ranked with respect to the topic at hand. The idea is to recursively execute an exhaustive search up to a given depth $d$, starting from the ``relatives'' of a highly ranked page. Hence, a set of candidate pages is obtained by retrieving the pages reachable within a given perimeter, of depth $d$, from a set of initial seeds. From the set of candidate pages, we look for the page which has the best score with respect to the topic at hand. This page and its ``relatives'' are inserted into the set of pages from which to proceed with the crawling process. Our assumption is that an ``ancestor'' with a good reference is likely to have other useful references in its descendants further down the lineage, even if the immediate scores of the web pages closer to the ancestor are low. We define a degree of relatedness $r$ with respect to the page with the best score. If $r$ is large, we include more distant ``cousins'' into the set of seeds, which are further and further away from the highest scored page.
This device overcomes the difficulties of using reinforcement learning to assign credit, without the burden of solving a dynamic programming problem. These ideas may be considered an extension of [1,2]: the use of a degree of relatedness extends the concept of child pages in [1] while avoiding the complex issue of inheritance of scores, and the use of a perimeter is similar to the ``layer'' concept used in [2].
The web is a directed graph $G = (N, E)$, where each $n \in N$ is a node (i.e. a page) and each $(i, j) \in E$ is a directed link from node $i$ to node $j$. The crawler, given a seed node $s \in N$, explores the web and builds the visit tree $T = (V, A)$, where $V \subseteq N$ collects the visited pages and $A \subseteq E$. The border $B \subseteq V$ are the nodes in $V$ that have outlinks to nodes outside $V$. The depth, $d$, is the maximum distance from the seed node, and the degree of relatedness, $r$, is defined as follows: an ancestor of a node $v \in V$ is a node $u \in V$ for which there exists a path in $T$ with $k$ arcs from $u$ to $v$; then, if $k \le r$, the degree-$r$ relatives of $v$ are the nodes $w \in V$ for which $u$ is also an ancestor. Finally, the score $\sigma(v)$ of a page $v$ is the relative importance of page $v$ with respect to a particular topic. Important pages are scored high.
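To make the definition of degree-$r$ relatives concrete, the following is a minimal Python sketch, assuming the visit tree $T$ is stored as a dictionary of parent pointers; the helper names and the tiny example tree are illustrative only, not part of the original formulation.

```python
# Sketch of the "degree-r relatives" definition, assuming the visit tree T is
# given as parent pointers (the seed maps to None). Names are illustrative.

def ancestors_within(parent, v, r):
    """Ancestors of v reachable in at most r arcs of the visit tree."""
    found, u, k = set(), parent.get(v), 1
    while u is not None and k <= r:
        found.add(u)
        u, k = parent.get(u), k + 1
    return found

def all_ancestors(parent, w):
    """All ancestors of w, up to the seed."""
    found, u = set(), parent.get(w)
    while u is not None:
        found.add(u)
        u = parent.get(u)
    return found

def degree_relatives(parent, v, r):
    """Nodes sharing with v an ancestor that lies at most r arcs above v."""
    anc = ancestors_within(parent, v, r)
    return {w for w in parent if w != v and anc & all_ancestors(parent, w)}

# Tiny example tree: s -> a -> b -> v and s -> a -> c
T = {"s": None, "a": "s", "b": "a", "v": "b", "c": "a"}
print(degree_relatives(T, "v", 2))   # {'b', 'c'}: both descend from the ancestor 'a'
```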
Given the depth $d$ and the degree of relatedness $r$, the algorithm repeatedly expands the current seeds up to depth $d$, selects the candidate page with the best score, and continues the crawl from this page and its degree-$r$ relatives.
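A minimal Python sketch of this crawling loop is given below; `score(page)` and `outlinks(page)` are hypothetical callables (a topic scorer and a link extractor), and the bookkeeping is a simplification rather than the authors' implementation.

```python
# Sketch of the crawling loop described above. score() and outlinks() are
# hypothetical; the bookkeeping is simplified for illustration.
from collections import deque

def ancestors(parent, w, limit=None):
    """Ancestors of w in the visit tree, optionally at most `limit` arcs up."""
    found, u, k = set(), parent.get(w), 1
    while u is not None and (limit is None or k <= limit):
        found.add(u)
        u, k = parent.get(u), k + 1
    return found

def focused_crawl(seed, score, outlinks, d, r, budget=1000):
    parent = {seed: None}              # visit tree T as parent pointers
    seeds, visited = {seed}, set()
    while seeds and len(visited) < budget:
        # 1. Exhaustively collect candidates within depth d of the current seeds.
        frontier, candidates = deque((s, 0) for s in seeds), set()
        while frontier:
            page, depth = frontier.popleft()
            visited.add(page)
            if depth == d:
                continue
            for child in outlinks(page):
                if child not in parent:
                    parent[child] = page
                    candidates.add(child)
                    frontier.append((child, depth + 1))
        if not candidates:
            break
        # 2. Pick the candidate with the best score for the topic.
        best = max(candidates, key=score)
        # 3. Continue from the best page and its degree-r relatives.
        anc = ancestors(parent, best, limit=r)
        relatives = {w for w in candidates if anc & ancestors(parent, w)}
        seeds = {best} | relatives
    return visited
```

Note that with $d=1$ and $r=0$ this sketch reduces to a greedy best-first crawl over the children of the current seed.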
In order to validate our algorithm, we carried out some preliminary experiments. The purpose was to understand the behaviour of the algorithm and to clarify whether the method can be used to speed up a basic focused crawler (a crawler without sophisticated mechanisms for tunneling through lowly ranked pages). The data set we used was WT10G, distributed by CSIRO in Australia, which contains a snapshot of a portion of the world wide web consisting of 1,692,096 documents from 11,680 servers. We used a page classifier based on the naive Bayes method [4] to assign a score to each page with respect to some randomly chosen topic.
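For illustration, a page score of this kind could be obtained as in the following sketch, which uses scikit-learn's multinomial naive Bayes rather than the exact classifier of [4]; the training documents and the topic label are invented placeholders.

```python
# Hedged sketch of a naive Bayes page scorer; scikit-learn stands in for the
# classifier of [4], and the tiny training set below is invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["linux kernel scheduler patch notes",
              "how to install linux on a laptop",
              "chocolate cake recipe for beginners",
              "holiday pictures from the beach"]
train_labels = [1, 1, 0, 0]        # 1 = on-topic ("Linux"), 0 = off-topic

vectorizer = CountVectorizer()
classifier = MultinomialNB().fit(vectorizer.fit_transform(train_docs), train_labels)

def score(page_text):
    """Posterior probability that a page is on-topic, used as its score."""
    return classifier.predict_proba(vectorizer.transform([page_text]))[0, 1]

print(score("linux device driver documentation"))
```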
The measure used to characterize the behaviour of the crawler is the precision $P = N_r / N_v$, where $N_r$ is the number of relevant pages that the crawler has visited, and $N_v$ is the total number of pages visited. The experiment illustrated in Figure 1 reports $P$ as a function of $d$ and $r$, respectively. Plotted is the number of web pages visited against the number of relevant pages retrieved. The topic used is ``Linux'', which has 2600 relevant pages in the dataset.
[Figure 1: Number of relevant pages retrieved versus the number of web pages visited, for various values of $d$ and $r$ (topic ``Linux'').]
Note that for $d=1$ and $r=0$ our crawler becomes a basic focused crawler. This experiment shows that, at the beginning of the crawling, the basic focused crawler is more efficient than our crawlers. But, as the number of visited pages increases, the crawlers with larger $d$ and $r$ give better results. More precisely, Figure 1 shows that choosing larger values of $d$ and/or $r$ produces the best results in the later stage of the search process. This observation was confirmed by other experiments (not shown due to lack of space) on other general and special topics.
The results have an interesting interpretation. The seeds, which are on-topic pages, are probably directly connected to other on-topic pages. Those pages can be easily collected at the beginning by a basic focused crawler. Then, the crawling task becomes more difficult, making crawlers with larger $d$ and/or $r$ more efficient. In the above example, pages addressing Linux are often tightly linked to each other and hence can be collected quite easily. It is in the retrieval of pages which are isolated and difficult to find that the proposed algorithm gives better results. However, the approach cannot tunnel indefinitely, as large values of $d$ and $r$ may force the retrieval of more and more pages which are not immediately relevant to the topic at hand.
One of the main features of our proposed algorithm is its ability to tunnel through pages with a low score. An experiment has been conducted to give some indication of how this ability of the algorithm is employed. The upper graph in Figure 2 illustrates this behaviour.
[Figure 2: Illustration of the crawler's ability to tunnel through pages with a low score.]
Finally, note that even if the presented results are preliminary, and further experiments are needed to assess the capabilities of the algorithm and to establish a direct comparison with other focused crawlers, the algorithm is considerably less computationally demanding than the ideas expressed in [1,2]. In addition, we do not require a large training set, and the training procedure is relatively simple.