Fetching Web Pages from the WebBase Web Page Repository

InfoLab (formerly the Database Group)
Gary Wesley (Bluegee gmail)
Updated October 2013

Herein is described how to retrieve Web pages from the Stanford WebBase Archive, a World Wide Web page repository built as part of the Stanford Digital Libraries Project by members of the Stanford InfoLab.

WebBase has been visited by users from over 37 countries.

The Repository

This web repository contains over 260 TB (uncompressed, as of August 2011) of roughly 7 billion web pages, intended for research into topics such as web-graph analysis and election or disaster press coverage (we have a workbench for press-coverage analysis and coding).
The general text crawls are each about 0.5 TB compressed (1.5 TB uncompressed); sizes below are given in compressed units. Because the general crawls use almost the same site list each time, we now effectively have rudimentary time-series data. Lists of sites with page counts are available via the "sites" links below. Our web crawler, or spider, is named WebVac; see the technical report Stanford WebBase Components and Applications. We are working in cooperation with the Library of Congress and the California Digital Library.

We now have tools for computational sociology in our Web Sociologist's Workbench, which was used for election-coverage analysis by the Stanford Communication Department. A picture of a sample screen is available (the letter in the checkbox label is a keyboard shortcut), along with a 2007 report on our efforts. A version of this workbench is being used for a memetic epidemiology project with the Stanford Medical School involving Myspace blogs.

We have a collection of the links from each of the general crawls. These are available upon request via ftp.

We have a C++ tool to convert from our format to ARC version 1 format (used by the Internet Archive and Heritrix). We are developing one for WARC, now an ISO and International Internet Preservation Consortium (IIPC) standard. County, city, state, and federal crawls through 2008 have been converted to ARC. We are considering converting the entire operation to WARC.

If you don't want to bother with the client because you will not be building custom handlers, there is now a Web interface to the crawls, Wibbi. There are several custom filters to choose from, such as AND and OR. Wibbi gives slower throughput than our C++ client, even with no filtering. A browser limit on Windows and Linux (except in Opera and Firefox) means you can only download 4 GB at a time. Since the filters run on our server, you can filter through more data than that without the download reaching that limit.

If you decide to use the data, please email Gary ( Jeez cs stanford Edu ) so that we can mention your usage in our funding requests.
We would also appreciate knowing of any papers that come out of your usage.

WebVac spider

WebVac crawls depth-first, generally to a depth of 7 levels, and fetches a maximum of 10,000 pages per site.
We only follow links to pages within the domain. Until 2007 our general policy was to gather a 1.5 TB sample;
now we crawl a list of sites until the list is done or the month is over. We retry unavailable pages several times.
We pause 1 to (almost always) 12 seconds between pages, depending on IP-address bottlenecks.
For the federal government crawls, we take up to 150,000 pages to 12 levels over
a fairly static group of sites.
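As a sketch of the crawl policy just described (depth-first within a site, a depth limit, a per-site page cap, and a politeness pause between fetches), consider the following. This is illustrative only, not WebVac's actual code; `fetch_links` is a hypothetical stand-in for real HTTP fetching and link extraction.

```python
import time
from urllib.parse import urlparse

def crawl_site(start_url, fetch_links, max_depth=7, max_pages=10_000, delay=0.0):
    """Depth-first crawl of one site, mirroring the stated policy: stay within
    the domain, limit depth, cap total pages, pause between fetches.
    fetch_links(url) must return the list of URLs linked from that page."""
    domain = urlparse(start_url).netloc
    seen, fetched = set(), []
    stack = [(start_url, 0)]          # (url, depth); a stack gives depth-first order
    while stack and len(fetched) < max_pages:
        url, depth = stack.pop()
        if url in seen or urlparse(url).netloc != domain:
            continue                  # only follow links within the domain
        seen.add(url)
        fetched.append(url)
        if delay:
            time.sleep(delay)         # politeness pause between pages
        if depth < max_depth:
            for link in fetch_links(url):
                stack.append((link, depth + 1))
    return fetched
```

With a real fetcher plugged in, `delay` would be chosen per IP address (1 to 12 seconds, per the policy above).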




WebBase Architecture
Overall system screenshot from 2007.

Client Software (RunHandlers)

  • If you don't want to bother with the client because you will not be building custom handlers, there is now a Web interface, Wibbi, to get pieces of up to 4GB of the 2003-present crawls
  • These instructions assume Internet access to the machine hosting WebBase data and a CVS checkout of the WebBase code or an FTP get.
  • We allow specification of machine, port, first site and last site for the stream (e.g. www.ibm.com). distribrequestor.pl and getpages.pl also take those arguments. The webpage repository is organised by site, so offset means offset within the site.

  • RunHandlers is supported on 32-bit GNU/Linux and Solaris systems with GNU make (gmake), g++ (<= 3.4.0), Perl 5.05+, and W3C's libwww.

    1. Fetch the latest WebBase client source code from ftp://db.stanford.edu/pub/webbase.
    2. Unroll the source code. For example, GNU tar can do this with

       > tar xfz webbase-client-????-??-??.tar.gz

    3. Follow the instructions in the source code's README.client:

       > chdir dli2/src/WebBase/ && more README.client

    Build everything (use a 32-bit Linux box):

    Make sure the library path includes W3C's libwww.
    This library must be installed by a system administrator with root privileges.

    Make sure the environment variable WEBBASE points to WebBase:
    setenv WEBBASE [absolute path]/WebBase
    (that is csh syntax; in bash, use: export WEBBASE=[absolute path]/WebBase)

    (1) Run GNU make:

       WebBase/> ./configure
       WebBase/> make client

    If you get:
    handlers/extract-hosts.h:27:21: WWWCore.h: No such file or directory

    handlers/extract-hosts.h:28:21: HTParse.h: No such file or directory

         Your include path may be wrong.
    We expect the header at /usr/local/include/w3c-libwww/WWWCore.h,
    so you may need to change this in Makefile.in and configure
    (order MAY matter), then rerun ./configure.

    To use later gcc versions, here's the hack. After running ./configure:

       1. add -fpermissive to the CPPFLAGS on line 68 of the makefile
       2. comment out lines 34 and 35 in hashlookup/hashlookup.h:
              extern unsigned int hashlookup_error;
              extern unsigned int verbose_error;

    (2) Test your build.
         (a) Turn on cat-handler, which simply outputs what it receives.
                In inputs/webbase.conf, set
                CAT_ON = 1
         (b) Try RunHandlers on a local example file:
                bin/RunHandlers inputs/webbase.conf \
              [50 sample pages are printed]

    Now try the network version:

    Method 1:
     Run scripts/distribrequestor.pl to start a distributor:
     (either chmod +x scripts/*.pl or invoke it with "perl")
    args: (must be in this order)
    # host
    # port
    # num pages
    # starting web site (optional) e.g. www.ibm.com
    # ending web site (optional)
    # offset in bytes within web site (optional)

    [example run:]
    WebBase/scripts> distribrequestor.pl wb1 7008 100
     distrib daemon returned 7160
     (use as ../bin/RunHandlers ../inputs/webbase.conf "net://" )
     Now you can invoke RunHandlers with the above info:
     ( cut and paste it from the echo)
    WebBase/scripts> ../bin/RunHandlers ../inputs/webbase.conf "net://"
     will print back 100 sample pages.  All instances of RunHandlers connected to
     the above port will share the same pool of pages.  To get an independent
     stream, run distribrequestor.pl to get a  new port.
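The shared-pool behavior above can be modeled in a few lines: every client attached to one distributor port draws from the same stream, while a fresh distributor (new port) restarts an independent stream. This toy model is illustrative only; the real distributor speaks its own network protocol.

```python
class Distributor:
    """Toy model of a WebBase distributor: one port = one shared pool of
    pages, so each page is handed out once, to whichever client asks next."""
    def __init__(self, pages):
        self._pages = iter(pages)

    def next_page(self):
        return next(self._pages, None)     # None once the pool is exhausted

pages = ["p1", "p2", "p3", "p4"]
shared = Distributor(pages)                # two clients on the same port...
client_a = [shared.next_page(), shared.next_page()]
client_b = [shared.next_page(), shared.next_page()]   # ...split the same pool
fresh = Distributor(pages)                 # a new port: independent stream
client_c = [fresh.next_page(), fresh.next_page()]     # starts over from p1
```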

    Method 2:
     You can also use our one-step script getpages.pl (no need to specify a first site)
    (either chmod +x scripts/*.pl or invoke it with "perl")

    [example run:]
    args: (must be in this order)
    # num pages
    # host
    # port
    # starting web site (optional) e.g. www.ibm.com
    # ending web site (optional)
    # offset in bytes within web site (optional)

    WebBase/scripts> getpages.pl 2 wb1 7008 www.ibm.com www.ibm.com (only give me www.ibm.com)
     Starting getpages.pl using Perl 5.6.0
     Do you want to run
    /dfs/sole/6/gary/dli2/src/WebBase//bin/RunHandlers /dfs/sole/6/gary/dli2/src/WebBase//inputs/webbase.conf "net://" now?(Y/N):
    WebBase/scripts> Y

    To get the full page contents, set CAT_ON = 1 in inputs/*.conf.

    If you get the ERROR:
    bin/RunHandlers: error while loading shared libraries: libwwwcore.so.0:
    cannot open shared object file: No such file or directory
    then your paths are not set right. Fix this by setting the variable
    LD_LIBRARY_PATH where you're about to run the WebBase client. For example,
    if you found libwwwcore.so at /opt/somewhere/lib/libwwwcore.so, then you
    could tell your system:
    setenv LD_LIBRARY_PATH /opt/somewhere/lib

    Return codes (contact us to report these):

    A blank page means there is no server running on that port.
    If you get a line of just numbers and not much else:
    256 means a distributor is running on a server with no data or a dangling softlink
    32512 is usually a missing softlink on the server
    ( the fix is ln -s /u/gary/WebBase.centos/bin/runhandlers /lfs/1/tmp/webbase/runhandlers )
    or missing shared libraries: libwwwutils.so.0

    Note on the output:

    This next line is just a separator, so that RunHandlers knows it is getting a new page:
    ==P=>>>>=i===<<<<=T===>=A===<=!Jung[...]  -- page separator
    URL: http://www.powa.org/         -- page URL
    Date: June 3, 2004                -- when crawled
    Position: 695                     -- bytes into the site so far
    DocId: 1                          -- sequential page id within site
    HTTP/1.1 200 OK                   -- response to our HTTP request
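The per-page header layout shown above can be parsed mechanically. Here is a minimal sketch, assuming each record begins with a line starting `==P=>>>>` followed by the URL, Date, Position and DocId headers (the separator's suffix varies from page to page):

```python
def parse_stream(lines):
    """Split RunHandlers output into page records keyed by the headers shown
    above; everything after the headers (HTTP status line, page bytes) is
    collected as content."""
    pages, current = [], None
    for line in lines:
        if line.startswith("==P=>>>>"):            # page separator
            current = {"content": []}
            pages.append(current)
        elif current is not None:
            for field in ("URL", "Date", "Position", "DocId"):
                if field not in current and line.startswith(field + ":"):
                    current[field] = line.split(":", 1)[1].strip()
                    break
            else:
                current["content"].append(line)    # body line
    return pages
```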

    Death threat:
    If a distributor is inactive for a while, it may be killed by us so that we can reuse the resources.
    To restart at the same point, you must start a new distributor at the offset where it left off
    ( + 1 to prevent getting the previous page again).

    Putting out a contract:
    If you are done, you can run distribrelease.pl [remote-host] [host port] [stream port]
    from the same machine you requested on, and we will immediately kill the distributor for you.
    We especially recommend this if you are running
    many requests in one day, so that we do not run out of resources.

    If you specify firstSite/lastSite, please note that you can only use the root
    (e.g. www.ibm.com), not a page within the site (e.g. 01net.com/envoyerArticle/1),
    and don't include the http:// part.


    To create a new webpage stream handler:

    You can use the other handlers in the distribution as templates.
    To add a new handler, add the following in the appropriate places:
     * 1) #include "myhandler.h" into handlers/all_handlers.h
     * 2) handler.push_back(new MyHandler()); into handlers/all_handlers.h
             (following the template of the handlers already there)
     * 3) in the Makefile, add entries for your segments to compile
             in the line: HANDLER_OBJS = jhandler.o [...]
     * opt) in the Makefile, customize your build if necessary by adding a line
               jhandler_CXXFLAGS = -Iyour-include-dir --your-switches [...]
              (following the template of the handlers already there)

    We also have a one-button script called scripts/addHandler.pl that will
    prompt you for all your pieces and put them in place, without you having
    to do the above file surgery yourself.



    WebVac - the WebBase web crawler, or spider. Used to be called Pita.

    RunHandlers - (formerly "process") an executable that indexes a stream,
                  file or repository.
                  Made up basically of a feeder and one or more handlers.

    handler - the interface that any index-building piece of code must implement.
              The interface's main (only) method provides a page and associated
              metadata, and the implementor of the method can do whatever they want
              with it.

    feeder -  the interface for receiving a page stream from any kind of source
              (directly from the repository, via Webcat, via network, etc.). The
              key method of the interface is "next", which advances the stream by one
              page. After calling next, various other methods can be used to get the
              associated metadata for the current page in the stream. Can also be used
              to build indexes if the index-building code is written to process page
              streams.

    distributor - a program that disseminates pages to multiple clients
               over the network, supporting session IDs, etc. -- a generalization of
               what Distributor.cc in Text-index/ does.

    offset - used in distributor requests to specify how many bytes to start from
             the beginning of the site.

    DocId - computed within the download. If you download any portion of the crawl,
                   even from the middle, DocIds will begin at 0. If you download the whole crawl,
                   they will be monotonically increasing from start to end.
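The DocId numbering described above can be illustrated with a toy example (illustrative only; real DocIds are assigned by the client as it downloads):

```python
def assign_docids(downloaded_pages):
    """DocIds are relative to the download: whatever slice you fetch,
    numbering starts at 0 and increases monotonically."""
    return [(doc_id, page) for doc_id, page in enumerate(downloaded_pages)]

crawl = ["page0", "page1", "page2", "page3"]
full = assign_docids(crawl)        # whole crawl: ids 0..3
middle = assign_docids(crawl[2:])  # a slice from the middle: ids start at 0 again
```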