STANFORD DIGITAL LIBRARIES TECHNOLOGIES
PROJECTS	DOCUMENTS	PEOPLE
SEMINARS	TESTBED	RESOURCES

Projects In Brief

Web Clustering

One of the difficult challenges in Web-related research is clustering or classifying web pages. Clustering refers to grouping pages into categories, in a fashion similar to Yahoo or the Open Directory. These two directories, however, are maintained entirely by human editors, using no automated techniques. Manual techniques do not scale to the entire web: although there are over 1 billion pages on the web (http://www.inktomi.com/webmap/), Yahoo and the Open Directory each have fewer than 2 million URLs in their hierarchy, which is under 0.2% of the web.

We are currently investigating techniques to efficiently cluster the entire web. Traditional IR approaches are not appropriate in the context of the web, due to both its enormous size and its hyperlinked nature. We plan to use recently developed techniques that allow for similarity searches in high-dimensional spaces (for instance, http://theory.stanford.edu/~indyk/vldb99.ps) to allow for offline clustering of the web. Even with the newer techniques, the resource requirements will be large, especially as precision requirements are raised. Supercomputing resources will be a valuable asset in performing clustering and other mining operations on the contents of the web. Such resources will allow us to explore and evaluate more of the available clustering options as we develop the most effective techniques.
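
To make the idea of similarity search in high-dimensional spaces more concrete, here is a minimal sketch of one widely used family of such techniques, locality-sensitive hashing. It uses the random-hyperplane variant for cosine similarity, chosen here purely for illustration; it is not necessarily the method of this project or of the paper linked above. The function names, the toy term-count vectors, and the parameters (num_bits, seed) are all assumptions made for the example.

```python
import random
from collections import defaultdict
from itertools import combinations

def signature(vector, hyperplanes):
    """Return one bit per hyperplane: the side of the plane the vector lies on."""
    return tuple(
        1 if sum(v * h for v, h in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    )

def build_buckets(vectors, num_bits=8, seed=0):
    """Hash every page vector into a bucket keyed by its bit signature.

    `vectors` maps a page id to a fixed-length list of numbers
    (for example, term counts). Both are hypothetical inputs.
    """
    rng = random.Random(seed)
    dim = len(next(iter(vectors.values())))
    hyperplanes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_bits)]
    buckets = defaultdict(list)
    for page_id, vec in vectors.items():
        buckets[signature(vec, hyperplanes)].append(page_id)
    return buckets, hyperplanes

def candidate_pairs(buckets):
    """Yield only the pairs that share a bucket, instead of all n*(n-1)/2 pairs."""
    for members in buckets.values():
        for pair in combinations(sorted(members), 2):
            yield pair

# Toy data: "b" points in the same direction as "a", so it always shares a bucket.
pages = {
    "a": [3, 0, 1, 0],
    "b": [6, 0, 2, 0],
    "c": [0, 4, 0, 5],
}
buckets, planes = build_buckets(pages, num_bits=8, seed=42)
pairs = list(candidate_pairs(buckets))
```

Pages whose vectors point in similar directions tend to receive the same signature, so an offline clustering pass only needs to compare pages inside each bucket. In practice several independent hash tables are usually combined to trade precision against cost, which is one reason resource requirements grow as precision requirements are raised.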

Questions or Comments? Send email to dlwebmaster@db.stanford.edu
