The Stanford WebBase Project
Our DLI2 WebBase project builds on our previous Google activity. It has the following goals:
- Provide a storage infrastructure for Web-like content
- Store a sizeable portion of the Web
- Enable researchers to easily build indexes of page features across large sets of pages
- Distribute WebBase content via multicast channels
Some of the challenges we are addressing include:
- The huge size of the information space
  - Very large URL space
  - Content size
- Limited resources
  - Disk space
  - Memory
  - Time
  - Bandwidth
  - Web server tolerance
- Continuous update and maintenance
  - Smart crawling to maximize freshness and likely relevance
  - Crawler parallelism
  - Crawling robustness; interruptible crawls
- Making it easy for researchers to use the facilities
  - Standard access API
  - API for building new indexes
  - Convenient distribution of entire content
The project is developing the following facilities:
- Smart crawling technology. This will allow us to crawl 'valuable' sites more frequently or more deeply. The
measure of 'value' is, of course, itself an important research topic; Google's PageRank is one such measure.
- Web repository. This infrastructure will hold large numbers of Web pages and will allow experimental
search and analysis over that information. We are working on the following components:
- An archive to hold the (compressed) pages that our crawler acquires. The archive supports random
access to individual pages via an index, and it can also stream its entire content to a client.
- An index factory. This facility will enable researchers to easily add novel indexes to the
WebBase. To use the index factory, a researcher implements a feature extraction engine: when presented with a
Web page, the engine computes some page property of interest to the researcher. Examples include reading level,
genre, and link-property analyses. Once the engine is constructed, the researcher requests that the archive be
'streamed past' the extraction engine. Based on the engine's output, the index factory will add an appropriate
index to the WebBase. This index will then be available to the community. WebBase will also keep the index up to
date as the archive is refreshed through new crawls.
- Wide-Area Web Data Distribution. Our distribution machinery will allow researchers everywhere
to take advantage of our collected data. Rather than forcing all index creation and data analysis to run at the
site where the data is located, our wide-area facility will multicast the archive's content over subscription
channels. Channels may vary in bandwidth and content; subscribers specify the parameters of their channel.
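The index-factory workflow described above can be sketched in a few lines of Python. The names here (`build_index`, `reading_level`, the in-memory archive) are illustrative stand-ins, not the actual WebBase API: a feature extraction engine computes one property per page, and the factory streams the archive past it to produce an index.

```python
# Illustrative sketch of the index-factory idea. All names here are
# hypothetical; the real WebBase streams a large on-disk archive.

def reading_level(page_text):
    """Toy 'feature extraction engine': average word length as a
    crude stand-in for a reading-level score."""
    words = page_text.split()
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)

def build_index(archive, extract):
    """Stream every (url, text) pair in the archive past the
    extraction engine; record the computed feature per URL."""
    return {url: extract(text) for url, text in archive}

# A tiny in-memory 'archive' standing in for the streamed page store.
archive = [
    ("http://example.edu/a", "short simple words here"),
    ("http://example.edu/b", "considerably lengthier vocabulary throughout"),
]
index = build_index(archive, reading_level)
```

The point of the design is that the researcher supplies only the extraction function; streaming, storage, and index maintenance across recrawls stay inside the factory.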
See Junghoo Cho's PowerPoint presentation for more detail.
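The page-value measure mentioned above, Google's PageRank, can be illustrated with a short power iteration over a toy link graph. This is a simplified sketch of the idea, not the production algorithm or its actual parameters:

```python
# Minimal PageRank power iteration over a toy link graph; a simplified
# illustration of the page-'value' measure discussed above.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to.
    Returns a rank score per page; scores sum to 1."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

ranks = pagerank({
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
})
```

Here C ends up ranked highest: it is linked to by both A and B, so a 'smart' crawler using this measure would revisit it most often.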