One of the difficult challenges in Web-related research is that of
clustering or classifying web pages. Clustering refers to the
grouping of pages into categories, in a fashion similar to Yahoo
or the Open Directory.
These two directories, however, are maintained entirely by human editors,
using no automated techniques. Manual techniques do not scale to the
entire web: although there are over 1 billion pages on the web
(http://www.inktomi.com/webmap/), Yahoo and the Open Directory each have
fewer than 2 million URLs in their hierarchy, under 0.2% of the total.
We are currently investigating techniques to efficiently cluster the
entire web. Traditional IR approaches are not appropriate in the context
of the web, due to both the enormous size and hyperlinked nature of the
web. We plan to use recently developed techniques that allow for
similarity searches in high-dimensional spaces (for instance,
http://theory.stanford.edu/~indyk/vldb99.ps) to perform offline
clustering of the web. Even with the newer techniques, the resource
requirements will be large, especially as precision requirements are
raised. Supercomputing resources will be a valuable asset in performing
clustering and other mining operations on the contents of the web. Such
resources will allow us to explore and evaluate more of the available
clustering options as we develop the most effective techniques.
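
As a rough illustration of how a hashing-based similarity search can
group pages without comparing every pair, the sketch below buckets page
vectors by the sign pattern of their projections onto random
hyperplanes. It is a minimal sketch under stated assumptions, not the
method of the work cited above: the choice of random-hyperplane hashing,
the representation of pages as fixed-length term-weight vectors, and the
names make_hyperplanes, signature, and bucket_pages are all assumptions
made for the example.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplane normals in a dim-dimensional space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vector, hyperplanes):
    """Return one bit per hyperplane: which side of it the vector falls on."""
    bits = []
    for plane in hyperplanes:
        dot = sum(p * v for p, v in zip(plane, vector))
        bits.append(1 if dot >= 0 else 0)
    return tuple(bits)


def bucket_pages(pages, num_bits=8, seed=0):
    """Group pages whose vectors share the same hash signature.

    pages maps a page identifier to its term-weight vector (hypothetical
    input format). Vectors pointing in similar directions are likely,
    though not guaranteed, to land in the same bucket.
    """
    dim = len(next(iter(pages.values())))
    planes = make_hyperplanes(num_bits, dim, seed)
    buckets = defaultdict(list)
    for page_id, vec in pages.items():
        buckets[signature(vec, planes)].append(page_id)
    return dict(buckets)


if __name__ == "__main__":
    # Toy vectors; the page identifiers are placeholders, not real pages.
    toy_pages = {
        "page_a": [1.0, 2.0, 0.0, 0.0],
        "page_b": [2.0, 4.0, 0.0, 0.0],    # same direction as page_a
        "page_c": [-1.0, -2.0, 0.0, 0.0],  # opposite direction
    }
    for sig, ids in bucket_pages(toy_pages).items():
        print(sig, ids)
```

Each page is hashed once, so the cost grows linearly with the number of
pages. Raising num_bits makes buckets more selective, and in practice
several independent hash tables would be needed to recover similar pages
that a single table separates, which is one place where the resource
requirements grow as precision requirements are raised.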