Fetching Web Pages from the WebBase Web Page Repository

The InfoLab was formerly the Database Group.
Gary Wesley <Gary at InfoLab.Stanford.Edu>
Updated November 16, 2009


This page describes how to retrieve Web pages from the Stanford WebBase Archive,
a World Wide Web page repository built as part of the Stanford Digital Libraries Project
by members of the Stanford InfoLab.

WB8 and WB19 are currently not working.

You should see better performance since we are now compressing our stream.

Support is very limited because Gary is at 1/4 time now.

Countries using WebBase
WebBase has visited 28 countries

The Repository

This web repository holds over 145TB (uncompressed size as of July 2009) of web pages intended for research into topics such as web graph analysis and election or disaster press coverage (we have a workbench for press coverage analysis and coding).
The general text crawls are each about 0.5TB compressed (1.5TB uncompressed). Sizes below are compressed sizes. We now effectively have rudimentary time series data: general crawls use the same site list each time. Lists of sites with page counts are available via the "site list" links below. Our web crawler, or spider, is named WebVac (it was formerly called Pita). We are working in cooperation with the Library of Congress and the California Digital Library.

See also: Building the client software. Architecture diagram. Technical report: Stanford WebBase Components and Applications.

We now have tools for computational sociology in our Web Sociologist's Workbench. It was used for election coverage analysis by the Stanford Communication Department. Picture of a sample screen. (The letter in the checkbox label is a keyboard shortcut.) Here is a 2007 report on our efforts. A version of this is being used for a memetic epidemiology project with the Stanford Medical School involving Myspace blogs.

A current project is duplicate detection, because newspaper crawls carry so many duplicated articles, both within and across papers (wire stories, for example). This will be integrated into the Web Sociologist's Workbench.
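
The text above does not say which duplicate-detection method the project uses. As an illustration only, the sketch below uses a common textbook approach: word shingles compared by Jaccard similarity. The function names, the shingle length `k`, and the `threshold` value are all assumptions made for this example, not WebBase code.

```python
import re


def shingles(text, k=5):
    """Return the set of k-word shingles in text (case-insensitive)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Very short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_near_duplicates(articles, threshold=0.8, k=5):
    """articles maps an id to its text.

    Returns the id pairs whose shingle sets have Jaccard similarity of at
    least `threshold`. This is a plain all-pairs comparison, which is only
    practical for small collections.
    """
    sets = {key: shingles(text, k) for key, text in articles.items()}
    keys = sorted(sets)
    pairs = []
    for i, first in enumerate(keys):
        for second in keys[i + 1:]:
            if jaccard(sets[first], sets[second]) >= threshold:
                pairs.append((first, second))
    return pairs
```

An all-pairs comparison grows quadratically with the number of articles, so a collection the size of a newspaper crawl would need some form of indexing or sampling on top of this idea.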

We have a collection of the links from each of the general crawls. These are available upon request via ftp.


We have a C++ tool to convert from our format to ARC version 1 format (Internet Archive and Heritrix). We will be developing one for WARC, now an ISO and International Internet Preservation Consortium (IIPC) standard. County, city, state and federal crawls through 2008 have been converted to ARC.



Wibbi:

If you don't want to bother with the client because you will not be building custom handlers, there is now a Web interface to the crawls. There are several custom filters to choose from, such as "and" and "or". Wibbi will give slower throughput than our C++ client, even with no filtering. A browser limit on Windows and Linux (except in Opera and Firefox 2.0.0.1+) means you can download only 4GB at a time. Since the filters run on our server, it is possible to filter more data than that, as long as the filtered result stays under the limit.

If you decide to use the data, please email Gary; we need to know for our funding.
We would also appreciate knowing of any papers that come out of your usage.


The Crawls

General Monthly Crawls
US Government
State and Local Governments
Newspapers
Universities
California 2003 Governor Recall
2004 National Elections
2005 California Special Election
Hurricane Katrina aftermath
2006 Mid Term Elections
Virginia Tech shooting
2008 Presidential Primaries and Election

General Crawls
2004   2005   2006   2007   2008   2009


Host       Port     Million pgs  Size       Date           Mimetype   Type of web crawl

                                                
WB9        7003     119      343GB         1/2001         Text     general crawl    site list

WB9        7005      44      152GB         3/2002         Text     general crawl    site list(use 2002getpages.pl) 

WB1        7006      96      406GB         6/2003         Text     general crawl    site list

WB1        7008      96      423GB         8/2003         Text     general crawl    site list

WB1        7010     102      451GB        10/2003         Text     general crawl    site list

                             526GB
WB5        7012      36                   12/2003         Text     general crawl    site list
           7032      14                   12/2003         Image    general crawl    site list  

2004

WB1        7103      95      450GB         3/2004         Text     general crawl    site list
   
WB1        7114      6       447GB         4/2004         Image    general crawl  site list

                                457GB
WB2        7107     11.5                   7/2004         Text     general crawl   site list
           7117      4.2                   7/2004         Image    general crawl   site list
           7127      0.02                  7/2004         Audio    general crawl   site list
           7137      2.3                   7/2004         Other    general crawl   site list

WB23       7108      72      363GB         8/2004         Text     general crawl   site list


                             474GB
WB22       7109      36                    9/2004         Text     general crawl   site list
           7119       7                    9/2004         Image    general crawl   site list

WB4        7190     105       495GB       10/2004         Text     general crawl   site list

                             1561GB
[by special arrangement]
           7192     37                    12/2004         Text      general crawl  site list

           7193     14                    12/2004         Image     general crawl  site list
           7194     0.08                  12/2004         Audio     general crawl  site list
           7195     7.7                   12/2004         Other     general crawl  site list

             


Host       Port     Million pgs  Size       Date           Mimetype   Type of web crawl


2005                               

                              980GB
WB23       7601     27                     1/2005         Text       general crawl  site list

           7611     6                      1/2005         Image      general crawl  site list
           7621     0.04                   1/2005         Audio      general crawl  site list
           7631     3.5                    1/2005         Other      general crawl  site list 

The next crawl is deeper: it was done with a pagemax of 20k per site instead of the usual 10k:
WB1        7603     85        440GB        3/2005         Text       general crawl  site list

WB18       7489     0.48      192GB      3-5/2005         Audio      general audio  site list

WB3        7604     98        480GB        4/2005         Text       general crawl  site list 

WB3        7605     79        460GB        5/2005         Text       general crawl  site list
 

WB8        7606    101        503GB        6/2005         Text       general crawl  site list

                              487GB
WB16       7658     9.5                    8/2005         Text       general crawl  site list
           7668     3.4                                   Image                     site list
           7658     .02                                   Audio                     site list
           7678     2                                     Other                     site list       

WB3        7609     97        490GB        9/2005         Text       general crawl  site list

WB3        7610     97        508GB       10/2005         Text       general crawl  site list

WB18       7691     93        527GB       11/2005         Text       general crawl  site list

                              945GB
WB1        7612     20.7                  12/2005         Text       general crawl  site list
           7622      7                    12/2005         Image      general crawl  site list
           7632     0.04                  12/2005         Audio      general crawl  site list
           7642     4.5                   12/2005         Other      general crawl  site list
                                                                                      

                                                                                     


Host       Port     Million pgs  Size       Date           Mimetype   Type of web crawl


2006

WB1        7701     98        515GB        1/2006         Text       general crawl  site list

WB19       7702     93        490GB        2/2006         Text       general crawl  site list

WB15       7703     95        497GB        3/2006         Text       general crawl  site list

WB17       7704     92        493GB        4/2006         Text       general crawl  site list

WB17       7705     93        499GB        5/2006         Text       general crawl  site list    

WB5        7706     90        497GB        6/2006         Text       general crawl  site list

WB19       7707     92        501GB        7/2006         Text       general crawl  site list

WB5        7708     93        515GB        8/2006         Text       general crawl  site list

WB8        7709     90        502GB        9/2006         Text       general crawl  site list

WB15       7710     90        497GB        10/2006        Text       general crawl  site list

WB22       7730     10        353GB        10-11/2006     Image      general crawl  site list

WB16       7711     90        506GB        11/2006        Text       general crawl  site list

WB1        7712     90        511GB        12/2006        Text       general crawl  site list

WB1        7713     0.5       222GB        12/2006        Audio      general crawl  site list

                                                                                      

    


Host       Port     Million pgs  Size       Date           Mimetype   Type of web crawl


2007

WB2        7118     87        502GB        1/2007         Text       general crawl  site list

WB11       7106    103        590GB        2/2007         Text       general crawl  site list

WB23       7161    102        578GB        3/2007         Text       general crawl  site list

WB5        7163    100        578GB        4/2007         Text       general crawl  site list

WB15       7239     98        573GB        5/2007         Text       general crawl  site list

WB4        7260     98        590GB        6/2007         Text       general crawl  site list

WB15       7579     86        525GB        7/2007         Text       general crawl  site list

WB9        7262     87        514GB        8/2007         Text       general crawl  site list

WB11       7266     79        486GB        9/2007         Text       general crawl  site list

WB3        7272     80        492GB       10/2007         Text       general crawl  site list

WB19       7289     80        497GB       11/2007         Text       general crawl  site list

WB4        7291     79        494GB       12/2007         Text       general crawl  site list



Host        Port     Million pgs  Size       Date           Mimetype   Type of web crawl


2008

WB20        7320     81        498GB        1/2008         Text       general crawl  site list

WB21        7298     79        496GB        2/2008         Text       general crawl  site list

WB20        7299     80        507GB        3/2008         Text       general crawl  site list

WB11        7301     68        439GB        4/2008         Text       general crawl  site list

WB22        7476     66        500GB        5/2008         Text       general crawl  site list

WB6         7482     67        485GB        6/2008         Text       general crawl  site list

WB7         7483     75        498GB        7/2008         Text       general crawl  site list

WB2         7495     77        516GB        8/2008         Text       general crawl  site list

WB21        7518     76        522GB        9/2008         Text       general crawl  site list

WB22        7220     77        526GB       10/2008         Text       general crawl  site list

WB2         7571     77        522GB       11/2008         Text       general crawl  site list

WB7         7572     61        430GB       12/2008         Text       general crawl  site list



Host       Port     Million pgs  Size       Date           Mimetype   Type of web crawl


2009

WB23       7565     75        525GB        1/2009         Text       general crawl  site list
WB23       7587     75        515GB        2/2009         Text       general crawl  site list
WB8        7254     72        506GB        3/2009         Text       general crawl  site list
WB2        7578     71        498GB        4/2009         Text       general crawl  site list
WB7        7633     71        497GB        5/2009         Text       general crawl  site list
WB15       7650     69        505GB        6/2009         Text       general crawl  site list
WB15       7651     67        482GB        7/2009         Text       general crawl  site list
WB2        7659     64        466GB        8/2009         Text       general crawl  site list
WB7        7671     64        478GB        9/2009         Text       general crawl  site list
WB12       7688     64        468GB       10/2009         Text       general crawl  site list

We also crawled a small subset of the general crawl list weekly from January to May 2006 (available on Wibbi):
around 2 million pages and 12.5GB per week, taken from the highest-ranked sites.
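
The crawl tables above are whitespace-aligned text, so a script that needs the host and port of a crawl has to parse them. The sketch below is one possible way to turn a table row into a record. The `CrawlRow` type, the `parse_row` function, and the regular expression are hypothetical helpers written for this page layout; they are not part of the WebBase client. Rows that omit the host or the size (the continuation rows of a multi-mimetype crawl) are accepted with those fields set to `None`, and anything that does not look like a data row yields `None`.

```python
import re
from typing import NamedTuple, Optional


class CrawlRow(NamedTuple):
    host: Optional[str]      # e.g. "WB1"; None on continuation rows
    port: int
    million_pages: float
    size: Optional[str]      # compressed size, e.g. "515GB"; None if absent
    date: str                # e.g. "1/2006" or "6-7/2005"
    mimetype: str
    description: str


ROW_RE = re.compile(
    r"""^\s*
    (?:(?P<host>WB\d+)\s+)?                      # optional host name
    (?P<port>\d{4})\s+                           # four-digit port
    (?P<pages>\d*\.?\d+)\s+                      # millions of pages
    (?:(?P<size>\d+(?:\.\d+)?\s?[GM]B)\s+)?      # optional size
    (?P<date>[\d-]+/\d{4})\s+                    # month or month range / year
    (?P<mime>Text|Image|Audio|Other|All)\s+
    (?P<desc>.*?)\s*$
    """,
    re.VERBOSE,
)


def parse_row(line):
    """Parse one table row; return None for headers, notes and blank lines."""
    match = ROW_RE.match(line)
    if match is None:
        return None
    return CrawlRow(
        host=match.group("host"),
        port=int(match.group("port")),
        million_pages=float(match.group("pages")),
        size=match.group("size"),
        date=match.group("date"),
        mimetype=match.group("mime"),
        description=" ".join(match.group("desc").split()),
    )


row = parse_row("WB1        7701     98        515GB        1/2006"
                "         Text       general crawl  site list")
print(row.host, row.port, row.size)
```

Rows with a missing date or a bracketed note in place of the host are deliberately rejected rather than guessed at.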



Specialized Crawls

University

Host       Port     Million pgs  Size       Date           Mimetype   Type of web crawl

WB1        7022      .28     1GB         11/2002         Text    U Cal@Berkeley site list

WB8        7050      .35    13GB          8/2003         All     Stanford University www.stanford.edu
[ we crawl 202 (now 239) Stanford sites in our monthly text crawl ]

WB1        7300      .4      2GB         11/2004         Text    US CS        site list 

                                7.6GB
WB1        7440      .14                  1/2005         Text    U Cal@Berkeley site list
           7641      .07                  1/2005         Image   U Cal@Berkeley site list
           7492      .0001                1/2005         Audio   U Cal@Berkeley site list
           7443      .02                  1/2005         Other   U Cal@Berkeley site list

                              3GB
WB3        7060      .040   1.5GB         6/2005         Text    Stanford University site list
           7061      .038   125MB         6/2005         Image   Stanford University site list
           7062      60pgs                6/2005         Audio   Stanford University site list
           7063      .011   1.4GB         6/2005         Other   Stanford University site list

                                62GB
WB11       7688      0.9                 10/2009         Text    Stanford University site list
           7689      .03                 10/2009         Image   Stanford University site list
           7690      .001                10/2009         Audio   Stanford University site list
           7692      .2                  10/2009         Other   Stanford University site list


 [ we crawled 239 Stanford sites in our monthly text crawl (as of 11-2009 it is 1277)]


Government

US Government
    .mil is in the general crawl


Host       Port     Million pgs  Size       Date           Mimetype   Type of web crawl


                               213GB
WB4        7567      4.3                  7/2003         Text     US Government  site list
                        270GB
WB3        7506      3.4                  6/2004         Text     US Government  site list
           7516      1.6                  6/2004         Image                   site list
           7516      .003                 6/2004         Audio                   site list
           7536      1.2                  6/2004         Other                   site list

                            274GB
[by request] 7508    3.2                  8/2004         Text     US Government  site list
           7518      1.7                  8/2004         Image                   site list
           7538      1.2                  8/2004         Other                   site list

                            259GB
WB1        7509      2.8                  9/2004         Text     US Government  site list
           7519      1.5                  9/2004         Image                   site list
           7529      .006                 9/2004         Audio                   site list
           7539      1.1                  9/2004         Other                   site list

                            274GB
[by request] 7570    2.9                  10/2004         Text    US Govt early Oct  site list
           7580      1.5                  10/2004         Image                      site list
           7590      2.2                  10/2004         Other                      site list

                            280GB
[by request] 7573    3.0                  10/2004         Text   US Govt, very late Oct site list
           7583      1.6                  10/2004         Image                         site list
           7563      0.004                10/2004         Audio                         site list
           7593      1.2                  10/2004         Other                         site list

                            283GB                          
WB8        7511      3.0                  11/2004    Text   US Govt+election, early Nov site list
           7521      1.6                  11/2004    Image                              site list
           7531      .004                 11/2004    Audio                              site list
           7541      1.3                  11/2004    Other                              site list

                            277GB
[by request] 7512     2.9                 12/2004         Text   US Government site list
            7522      1.5                 12/2004         Image                site list
            7532     .004                 12/2004         Audio                site list
            7542      1.2                 12/2004         Other                site list


Host       Port     Million pgs  Size       Date           Mimetype   Type of web crawl


2005                         
                                  274GB

[upon request]       3.0                  1/2005         Text   US Government, January site list
           7781      1.5                  1/2005         Image                         site list
           7791     .004                  1/2005         Audio                         site list
           7792      1.2                  1/2005         Other                         site list

                                  483GB
WB3        7644      2.5                  4/2005         Text  US Govt .gov + election site list
           7614      1.3                  4/2005         Image
           7624     .003                  4/2005         Audio
           7634      1.1                  4/2005         Other


Next 3: 20,000/site max on .gov only
                                    363GB
WB18       7607      4.0                  6-7/2005       Text   US .gov       site list
           7617      2.0                  6-7/2005       Image  US .gov       site list

           7627      .004                 6-7/2005       Audio  US .gov       site list
           7637      1.7                  6-7/2005       Other  US .gov       site list

                                   336GB (updated site list from LOC)
WB4        7799      3.3                  9/2005         Text   US .gov       site list
           7719      1.1                  9/2005         Image  US .gov       site list
           7729                           9/2005         Audio  US .gov       site list
           7739      1.4                  9/2005         Other  US .gov       site list

                                   233GB
WB8        8012      2.2                 12/2005         Text   US .gov       site list
           8022      1.1                 12/2005         Image  US .gov       site list
           8032      0.004               12/2005         Audio  US .gov       site list
           8042      1.0                 12/2005         Other  US .gov       site list

From here on we crawl up to 150,000 pages per .gov site, to a depth of 12, quarterly.
For the crawls below, we have removed the ca.gov sites, which are the state sites for California.
The ca.gov pages are about 100GB for each crawl and can be made available upon request. They are also in the
state crawls.
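
To make the two limits above concrete (a page cap per site and a maximum depth), here is a minimal sketch of a crawl loop that enforces both. It is not WebVac code. It assumes a breadth-first visiting order, assumes the start page counts as depth 0, and takes a caller-supplied `get_links` function (hypothetical) that returns the same-site links found on a page.

```python
from collections import deque


def crawl_site(start_url, get_links, max_pages=150_000, max_depth=12):
    """Visit one site breadth-first, stopping at max_pages pages.

    Pages deeper than max_depth links from start_url are never queued.
    get_links(url) must return an iterable of URLs on the same site.
    Returns the URLs in the order they were visited.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    fetched = []
    while queue and len(fetched) < max_pages:
        url, depth = queue.popleft()
        fetched.append(url)
        if depth >= max_depth:
            continue  # at the depth limit: keep the page, ignore its links
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return fetched


# Tiny in-memory "site" used instead of real HTTP fetches.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
print(crawl_site("a", lambda u: graph.get(u, []), max_pages=10, max_depth=2))
```

The same loop covers the earlier pagemax settings (10k or 20k pages per site) by changing `max_pages`.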

2006                              484GB
WB2        8001      5.6                  3/2006          Text   US .gov         site list
           8011      2.8                  3/2006          Image  US .gov         site list
            8021      0.007                3/2006          Audio  US .gov         site list
            8031      2.2                  3/2006          Other  US .gov         site list
                                                                       
                                    658GB
WB21       8041      6.1                 6-7/2006         Text   US .gov         site list
           8051      3.3                 6-7/2006         Image  US .gov         site list
           8052      0.01                6-7/2006         Audio  US .gov         site list
            8053      2.3                 6-7/2006         Other  US .gov         site list       

                                    726GB
WB1        7100      6.6                 9-10/2006        Text   US .gov         site list
           7101      3.3                                  Image                  site list
           7102      0.01                                 Audio                  site list
           7104      2.9                                  Other                  site list
                                                                    
                                    609GB
WB9        7149      7.0                  12/2006         Text   US .gov         site list
           7150      3.0                                  Image                  site list
           7151      0.01                                 Audio                  site list
           7152      3.0                                  Other                  site list
                                                                      
2007                              681GB
WB2        7157      8.1                  3/2007          Text   US .gov         site list
           7158      3.4                                  Image                  site list
           7159      0.01                                 Audio                  site list
           7160      3.1                                  Other                  site list

(We updated our site list here.)

                                     613GB
WB15       7255      7.0                  6/2007          Text   US .gov         site list
           7256      3.0                                  Image                  site list
           7257      0.01                                 Audio                  site list
           7258      2.8                                  Other                  site list
 

(California ca.gov is not crawled from here on, except as part of the state crawls.)
                                    636GB
WB[fixing]       7267      5.5                  9/2007          Text   US .gov         site list
           7268      2.7                                  Image                  site list
           7269      0.01                                 Audio                  site list
           7270      2.4                                  Other                  site list
  

 
                                    629 GB
WB23       7292      5.4                 12/2007          Text   US .gov         site list
           7293      2.5                                  Image                  site list
           7295      0.01                                 Audio                  site list
           7296      2.3                                  Other                  site list
         
2008

                                    654 GB
WB7        7369      7.4                  3/2008          Text   US .gov         site list
           7370      3.4                                  Image                  site list
           7371      0.01                                 Audio                  site list
           7372      3.0                                  Other                  site list

                                    650 GB
WB6        7484      5.5                  6/2008          Text   US .gov         site list
           7485      2.5                                  Image                  site list
           7488      0.01                                 Audio                  site list
           7490      2.7                                  Other                  site list

                                    755 GB
WB2        7510      8.0                  9/2008          Text   US .gov         site list
           7513      3.2                                  Image                  site list
           7514      0.01                                 Audio                  site list
           7515      3.3                                  Other                  site list

                                    762 GB
WB6        7574      7.3                 12/2008          Text   US .gov         site list
           7575      3.2                                  Image                  site list
           7576      0.01                                 Audio                  site list
           7577      3.2                                  Other                  site list

2009

                                    880 GB
WB11       7645      8.0                 03/2009          Text   US .gov         site list
           7512      3.6                                  Image                  site list
           7522      0.01                                 Audio                  site list
           7532      3.4                                  Other                  site list

                                    826 GB
WB14       7646      7.1                 06/2009          Text   US .gov         site list
           7647      3.3                                  Image                  site list
           7648      0.01                                 Audio                  site list
           7649      3.0                                  Other                  site list

Updated site list from: http://www.lib.lsu.edu/gov/index.html, 230 new sites

Also got 2682 new sites by feeding back a crawl and looking for links to federal sites
not already in list.
NASA to a depth of 6 instead of 12
USGS and NOAA limited to 8 levels
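
The feedback step mentioned above (finding federal sites that are linked from a crawl but missing from the site list) can be sketched as follows. This is an illustration, not the script that was actually used: the function name is hypothetical, and treating every host that ends in `.gov` as a candidate federal site is an assumption. State hosts such as ca.gov also end in `.gov`, so a real run would need further filtering.

```python
from urllib.parse import urlsplit


def new_sites_from_links(links, known_sites, suffixes=(".gov",)):
    """Return hosts found in links that match a suffix and are not yet known.

    links: iterable of URLs extracted from crawled pages.
    known_sites: iterable of host names already in the site list.
    """
    known = {site.lower() for site in known_sites}
    found = set()
    for link in links:
        host = urlsplit(link).hostname  # None for relative links
        if not host:
            continue
        host = host.lower()
        if host.endswith(suffixes) and host not in known:
            found.add(host)
    return sorted(found)


links = [
    "http://www.nasa.gov/index.html",
    "http://WWW.EPA.gov/air",
    "http://example.com/",
    "/relative/page.html",
    "http://www.usgs.gov/maps",
]
print(new_sites_from_links(links, known_sites=["www.nasa.gov"]))
```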

                            1600 GB
WB1        7663      9.9                 09/2009          Text   US .gov         site list
           7665      4.7                                  Image                  site list
           7667      0.02                                 Audio                  site list
           7670      4.7                                  Other                  site list

Even tighter page limits on NASA, USGS and NOAA below





 
State, County and Local Governments


Host       Port     Million pgs  Size       Date           Mimetype   Type of web crawl


These sitelists were compiled from the site http://www.statelocalgov.net

State                               211GB
WB11       7204      2.3                  5/2005         Text    State govt   site list
           7214      0.7                  5/2005         Image   State govt   site list
           7224     .005                  5/2005         Audio   State govt   site list
           7234      1.4                  5/2005         Other   State govt   site list

County                              90GB
WB8        7264      1.2                  5/2005         Text    County govt  site list
           7274      0.5                  5/2005         Image   County govt  site list
           7284     .060                  5/2005         Audio   County govt  site list   
           7294      0.5                  5/2005         Other   County govt  site list

City and town                       188GB
WB6        7664      2.5                  5/2005         Text    City govt    site list
           7674      1.2                  5/2005         Image   City govt    site list
           7684     .001                  5/2005         Audio   City govt    site list
           7694      1.0                  5/2005         Other   City govt    site list  

 
 Post Katrina crawl
State                               217GB

WB2        7465      2.1                  9/2005         Text    State govt   site list
           7466      0.7                  9/2005         Image   State govt   site list
           7467     .060                  9/2005         Audio   State govt   site list
           7468      1.3                  9/2005         Other   State govt   site list    


2006

State                              280GB
WB17       7365      2.0                  4/2006         Text    State govt    site list  
            7366      0.7                  4/2006         Image   State govt    site list
           7367     .006                  4/2006         Audio   State govt    site list
           7368      1.3                  4/2006         Other   State govt    site list
      
County                            115GB
WB1        7364      1.2                  4/2006         Text    County govt   site list
           7374      0.4                  4/2006         Image   County govt   site list
           7384     .002                  4/2006         Audio   County govt   site list
           7394      0.6                  4/2006         Other   County govt   site list

City and town                     237GB
WB16       7165      2.7                  4/2006         Text    City govt     site list
           7175      1.1                  4/2006         Image   City govt     site list
           7185      0.001                4/2006         Audio   City govt     site list
           7186      1.2                  4/2006         Other   City govt     site list


State                             251GB
WB15       7395      2.4                  9/2006         Text    State govt    site list
           7966      0.7                  9/2006         Image   State govt    site list
           7367     .008                  9/2006         Audio   State govt    site list
           7968      1.5                  9/2006         Other   State govt    site list

County                            126GB
WB1        7964      1.2                  9/2006         Text    County govt   site list
           7974      0.4                  9/2006         Image   County govt   site list
           7987     .002                  9/2006         Audio   County govt   site list
           7407      0.7                  9/2006         Other   County govt   site list

City and town                     258GB
WB18       7965      2.9                  9/2006         Text    City govt     site list
           7975      1.1                  9/2006         Image   City govt     site list
           7985     .002                  9/2006         Audio   City govt     site list
           7986      1.3                  9/2006         Other   City govt     site list


State                             263GB
WB4        7133      2.4                 12/2006         Text    State govt    site list
           7138      0.7                 12/2006         Image   State govt    site list
           7139     .008                 12/2006         Audio   State govt    site list
           7140      1.5                 12/2006         Other   State govt    site list

County                            129GB
WB1        7141      1.3                 12/2006         Text    County govt   site list
           7142      0.5                 12/2006         Image   County govt   site list
           7143     .002                 12/2006         Audio   County govt   site list
           7144      0.7                 12/2006         Other   County govt   site list

City and town                     270GB
WB1        7145      2.9                 12/2006         Text    City govt     site list
           7146      1.1                 12/2006         Image   City govt     site list
           7147     .002                 12/2006         Audio   City govt     site list
           7148      1.3                 12/2006         Other   City govt     site list

2007

State sites                        260GB

WB22       7246      2.4                 5/2007         Text    State govt     site list
           7247      0.7                 5/2007         Image   State govt     site list
           7248     .008                 5/2007         Audio   State govt     site list
           7249      1.5                 5/2007         Other   State govt     site list

County sites                       140GB

WB10       7242      1.3                 5/2007         Text    County govt    site list
           7243      0.5                 5/2007         Image   County govt    site list
           7244     .002                 5/2007         Audio   County govt    site list
           7245      0.7                 5/2007         Other   County govt    site list

City and town sites                279GB
WB11       7236      2.9                 5/2007         Text    City govt      site list
           7237      1.1                 5/2007         Image   City govt      site list
           7240     .002                 5/2007         Audio   City govt      site list
           7241      1.3                 5/2007         Other   City govt      site list     

 (Updated sites to be crawled here. )

State sites                        296 GB
WB8        7273      2.3                 10/2007         Text    State govt    site list
           7275      0.7                 10/2007         Image   State govt    site list
           7276      .01                 10/2007         Audio   State govt    site list
           7277      1.6                 10/2007         Other   State govt    site list

County sites                       143 GB
WB4        7278      1.4                 10/2007         Text    County govt   site list
           7280      0.4                 10/2007         Image   County govt   site list
           7281     .002                 10/2007         Audio   County govt   site list
           7282      0.8                 10/2007         Other   County govt   site list

City and town sites                301 GB
WB7        7283      3.1                 10/2007         Text    City govt     site list
           7285      1.1                 10/2007         Image   City govt     site list
           7286     .003                 10/2007         Audio   City govt     site list
           7287      1.5                 10/2007         Other   City govt     site list     

2008
State sites                        309GB
WB11       7436      2.2                 5/2008         Text    State govt     site list
           7437      0.7                 5/2008         Image   State govt     site list
           7438     .008                 5/2008         Audio   State govt     site list
           7439      1.6                 5/2008         Other   State govt     site list

County sites                       153GB
WB22       7456      1.4                 5/2008         Text    County govt    site list
           7457      0.4                 5/2008         Image   County govt    site list
           7458     .002                 5/2008         Audio   County govt    site list
           7459      0.8                 5/2008         Other   County govt    site list

City and town sites                327GB
WB19       7469      3.1                 5/2008         Text    City govt      site list
           7470      1.5                 5/2008         Image   City govt      site list
           7471     .004                 5/2008         Audio   City govt      site list
           7472      1.3                 5/2008         Other   City govt      site list 


State sites                        298GB

WB8        7544      2.2                 11/2008         Text    State govt     site list
           7545      0.7                 11/2008         Image   State govt     site list
           7546     .008                 11/2008         Audio   State govt     site list
           7547      1.6                 11/2008         Other   State govt     site list

County sites                       167GB
WB11       7548      1.4                 11/2008         Text    County govt    site list
           7549      0.4                 11/2008         Image   County govt    site list
           7550     .002                 11/2008         Audio   County govt    site list
           7551      0.8                 11/2008         Other   County govt    site list

City and town sites                324GB
WB1        7552      3.2                 11/2008         Text    City govt      site list
           7553      1.0                 11/2008         Image   City govt      site list
           7554     .004                 11/2008         Audio   City govt      site list
           7555      1.5                 11/2008         Other   City govt      site list


Host     Port   Million pgs       Date         Mimetype Type of web crawl


2009

State sites                       297GB
WB2        7599    2.1                05/2009         Text    State govt     site list
           7615    0.6                05/2009         Image   State govt     site list
           7616    .01                05/2009         Audio   State govt     site list
           7618    1.6                05/2009         Other   State govt     site list

County sites                      163GB
WB2        7619    1.4                05/2009         Text    County govt    site list
           7620    0.4                05/2009         Image   County govt    site list
           7623    .002               05/2009         Audio   County govt    site list
           7625    0.8                05/2009         Other   County govt    site list

City and town sites               304GB
WB2       7626     3.0                05/2009         Text    City govt      site list
          7628     0.9                05/2009         Image   City govt      site list
          7629     .003               05/2009         Audio   City govt      site list
          7630     1.4                05/2009         Other   City govt      site list

Curated a new list of state sites here by feeding back a crawl and looking for links to state sites
not already in list.

State sites                       332GB
WB13       7679    2.1                11/2009         Text    State govt     site list
           7680    0.5                11/2009         Image   State govt     site list
           7681    .01                11/2009         Audio   State govt     site list
           7682    1.6                11/2009         Other   State govt     site list

County sites                      163GB
WB11       7673    1.3                10/2009         Text    County govt    site list
           7675    0.4                10/2009         Image   County govt    site list
           7676    .002               10/2009         Audio   County govt    site list
           7677    0.8                10/2009         Other   County govt    site list

City and town sites               322GB
WB14      7683     3.0                11/2009         Text    City govt      site list
          7685     1.0                11/2009         Image   City govt      site list
          7686     .004               11/2009         Audio   City govt      site list
          7687     1.5                11/2009         Other   City govt      site list


Host     Port  Million pgs           Date       Mimetype  Type of web crawl  


California 2003 Governor Recall

WB1        7081    .006                   9/26/03          All     California recall site list
WB1        7082    .008                   9/27/03            "     California recall site list
WB1        7083    .2        5GB          9/29/03            "     California circus w/county gov site list site list
WB1        7084    .05      1.3GB         9/30/03            "     California recall site list
WB1        7085    .05                    10/1/03            "     California recall site list
WB1        7086    .05                    10/2/03            "     California recall site list
WB1        7087    .05                    10/3/03            "     California recall site list
WB1        7088    .05                    10/4/03            "     California recall site list
WB1        7089    .05                    10/5/03            "     California recall site list
WB1        7090    .05                    10/6/03            "     California recall site list
WB1        7091    .05                    10/7/03            "     California recall site list
WB1        7092    .05    1.3GB           10/8/03            "     California recall site list

WB1        7094    .05                    10/10/03           "     California recall site list
WB1        7095    .05                    11/04/03           "     California recall site list
WB1        7096    .05                    12/12/03           "     California recall site list

 
2004 American Elections
Available via Wibbi

California  2005 Special Election

Available via ftp

Hurricane Katrina (August 29th 2005) aftermath, Rita and Wilma

( 3 of the 6 most intense Atlantic Hurricanes ever recorded ) 
9/03/05-10/29/05 news, gov and charities (available on Wibbi )
~400 sites crawled daily, 1GB increasing up to 30 GB/day, all mime types.

Also good for researching non-hurricane  press coverage on consecutive days,
for instance doing sociological analysis or topical analysis.
We do not filter by topic, though papers are only Gulf Coast regional press.
Newspaper crawls contain many archival stories and duplicates.

2006 Mid Term Elections

We did daily text crawls of the 40 largest US papers up through the week after election day.
These are about 2.5GB per day. (available on Wibbi )
Newspaper crawls contain many archival stories and duplicates.

Monthly newspaper crawls

There are over 140 US papers in our general monthly crawls,
made available as a separate collection on Wibbi.
560-740k pages per crawl. We could index earlier crawls for
you upon request.
Newspaper crawls contain many archival stories and duplicates.

Virginia Tech Shooting

We crawled regional news, college papers, psychiatric, supremacist,
gun control and European/Indian/Arabic/Korean news sites daily for a couple of weeks. (available on Wibbi)
Newspaper crawls contain many archival stories and duplicates.

2008 Primaries and presidential Election

Crawling the 13 largest US newspapers, plus magazine and candidate sites. (available on Wibbi)
Newspaper crawls contain many archival stories and duplicates.

2008 Hurricane Ike

Crawling 349 regional and news sites, starting before the hurricane and continuing for a month after. (available on Wibbi)

2009 Regime change

Crawling 5 major gov sites weekly to enable change metrics. ( available on Wibbi )


WebVac spider

WebVac crawls depth first, generally to a depth of 7 levels, and fetches a maximum of 10k pages per site.
We only follow links within the domain. Until 2007 our general policy was to gather a 1.5TB sample.
Now we crawl a larger, stable (but gradually shrinking) list of sites until the list is done. We retry anything unavailable several times.
We pause between 1 and 10 seconds (almost always 10) between pages, depending on IP address bottlenecks.
For the federal government crawls, we take up to 150,000 pages, to a depth of 12 levels, over
a fairly static group of sites.
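
The crawl policy above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not WebVac's actual code: the link source is a hypothetical get_links callback (a real crawler would fetch and parse the page there, and pause between requests), "within the domain" is approximated by an exact host match, and the root page is counted as level 0.

```python
from urllib.parse import urlparse

MAX_DEPTH = 7        # "generally to a depth of 7 levels"
MAX_PAGES = 10_000   # "a maximum of 10k pages per site"


def crawl_site(root, get_links, max_depth=MAX_DEPTH, max_pages=MAX_PAGES):
    """Depth-first walk from root that never leaves root's host.

    get_links(url) is a hypothetical callback returning the URLs
    linked from the page at url.
    """
    host = urlparse(root).netloc
    seen = {root}
    order = []                 # pages in the order they were visited
    stack = [(root, 0)]        # (url, depth) pairs still to visit
    while stack and len(order) < max_pages:
        url, depth = stack.pop()
        order.append(url)
        if depth >= max_depth:
            continue           # deep enough: do not expand this page
        # Push in reverse so the first link on the page is visited first.
        for link in reversed(get_links(url)):
            if link not in seen and urlparse(link).netloc == host:
                seen.add(link)
                stack.append((link, depth + 1))
    return order
```

Passing a dictionary lookup as get_links is enough to try the limits on a toy link graph without touching the network.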

 


 

Architecture

WebBase Architecture
Overall system screenshot from 2007


Client Software ( RunHandlers )

  • If you don't want to bother with the client because you will not be building custom handlers, there is now a  Web interface to get pieces of up to 4GB of the 2003-present crawls on Wibbi
  • These instructions assume Internet access to the machine hosting WebBase data and a  CVS checkout of the WebBase code or an ftp get.
  • We allow specification of machine, port, first site and last site for the stream (e.g. www.ibm.com). Distribrequestor.pl and getpages.pl also take those arguments.  The webpage repository is organised by site, so offset means offset within the site.

  • RunHandlers is supported on 32-bit GNU/Linux and Solaris systems with GNU make (gmake), g++ ( <=  3.4.0), Perl 5.05+, and W3C's libwww.

    1. Fetch the latest WebBase client source code from ftp://db.stanford.edu/pub/webbase.
    2. Unroll the source code. For example, GNU tar can do this with

       > tar xfz webbase-client-????-??-??.tar.gz
    3. Follow the instructions in the source code's README.client.

       > chdir dli2/src/WebBase/ && more README.client
    Build everything:
    (Use a  32-bit Linux box)

    Make sure the library path includes W3C's libwww .
    This library must be installed by a  system administrator with root privileges.

    Make sure environment variable WEBBASE points to WebBase:
    setenv WEBBASE  [absolute path]/WebBase

    (1) Run GNU make:

       WebBase/> ./configure
       WebBase/> make client

    If you get:
    handlers/extract-hosts.h:27:21: WWWCore.h: No such file or directory
    handlers/extract-hosts.h:28:21: HTParse.h: No such file or directory

    your include path may be wrong.
    We expect the header to be at /usr/local/include/w3c-libwww/WWWCore.h,
    so you may need to change this in Makefile.in and configure.
    (Order MAY matter)
    Then rerun ./configure.

    To use later gcc versions, here's the hack.
    After running ./configure, do the following:

       1. Add -fpermissive to the CPPFLAGS on line 68 in the makefile.
       2. Comment out lines 34 and 35 in hashlookup/hashlookup.h:
            extern unsigned int hashlookup_error;
            extern unsigned int verbose_error;


    (2) Test your build.
         (a) Turn on cat-handler, which simply outputs what it receives.
                In inputs/webbase.conf, set
                CAT_ON = 1
         (b) Try RunHandlers on a  local example file:
                bin/RunHandlers inputs/webbase.conf \
               "file:///handlers/example-50-pages"
              [50 sample pages are printed]

    Now try the network version:

    Method 1:
     Run scripts/distribrequestor.pl to start a  distributor:
     (either chmod +x scripts/*.pl or invoke it with "perl")
    args: (must be in this order)
    # host
    # port
    # num pages
    # starting web site (optional) e.g. www.ibm.com
    # ending web site (optional)
    # offset in bytes within web site (optional)
     

    [example run:]
    WebBase/scripts> distribrequestor.pl wb1 7008 100
     distrib daemon returned 171.64.75.151 7160
     (use as ../bin/RunHandlers ../inputs/webbase.conf "net://171.64.75.151:7160/?numPages=100" )
    WebBase/scripts>
     Now you can invoke RunHandlers with the above info
     (cut and paste it from the echo):
    WebBase/scripts> ../bin/RunHandlers ../inputs/webbase.conf "net://171.64.75.151:7160/?numPages=100"
     This will print back 100 sample pages.  All instances of RunHandlers connected to
     the above port will share the same pool of pages.  To get an independent
     stream, run distribrequestor.pl again to get a new port.
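
If you script this step instead of cutting and pasting, the distributor's reply can be turned into the RunHandlers invocation mechanically. The Python sketch below assumes the reply line always has the form shown in the example run ("distrib daemon returned <address> <port>"); the function name and the default paths are illustrative choices, not part of the WebBase distribution.

```python
import re


def runhandlers_command(reply, num_pages,
                        runhandlers="../bin/RunHandlers",
                        conf="../inputs/webbase.conf"):
    """Build the RunHandlers argument list from a distributor reply line."""
    match = re.search(r"distrib daemon returned\s+(\S+)\s+(\d+)", reply)
    if match is None:
        raise ValueError(f"unrecognised distributor reply: {reply!r}")
    address, port = match.group(1), int(match.group(2))
    # Same URL shape as in the example run above.
    url = f"net://{address}:{port}/?numPages={num_pages}"
    return [runhandlers, conf, url]
```

The returned list can be handed to subprocess.run as is, which avoids quoting the net:// URL by hand.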
     

    Method 2:
     You can also use our one-step script getpages.pl (no need to specify a  first site )
    (either chmod +x scripts/*.pl or invoke it with "perl")

    args: (must be in this order)
    # host
    # port
    # num pages
    # starting web site (optional) e.g. www.ibm.com
    # ending web site (optional)
    # offset in bytes within web site (optional)

    [example run:]
    WebBase/scripts> getpages.pl 2 wb1 7008 www.ibm.com www.ibm.com (only give me www.ibm.com)
     Starting getpages.pl using Perl 5.6.0
     Do you want to run
    /dfs/sole/6/gary/dli2/src/WebBase//bin/RunHandlers /dfs/sole/6/gary/dli2/src/WebBase//inputs/webbase.conf "net://171.64.75.151:7163/?numPages=2" now?(Y/N):
    WebBase/scripts> Y

    To get the full contents of each page, set CAT_ON = 1 in inputs/*.conf.

    If you get the ERROR:
    bin/RunHandlers: error while loading shared libraries: libwwwcore.so.0:
    cannot open shared object file: No such file or directory
    you don't have your library path set right.
    Try setting a variable called LD_LIBRARY_PATH in the shell where you're about to run the
    WebBase client.  For example, if you found your libwwwcore.so at
    /opt/somewhere/lib/libwwwcore.so, then you could tell your system:
    setenv LD_LIBRARY_PATH /opt/somewhere/lib

    Return codes:
    Contact us to report these:

    A blank page means there is no server running on that port.
    If you get a line of just numbers and not much else:
    256 means we have a distributor running on a server with no data or a dangling
    softlink.
    32512 is usually a missing softlink on the server
    ( fix is ln -s /u/gary/WebBase.centos/bin/runhandlers /lfs/1/tmp/webbase/runhandlers )
    or a missing shared library: libwwwutils.so.0
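
The two numbers above look like raw 16-bit wait statuses of the kind Perl's system() returns, with the child's exit code in the high byte. That reading is an inference, not something this page states. Under that assumption, a short Python sketch decodes them: 256 becomes exit code 1, and 32512 becomes 127, which POSIX shells conventionally use for "command not found" and which fits the missing-softlink explanation.

```python
def exit_code(status):
    """Exit code stored in the high byte of a 16-bit wait status."""
    return (status >> 8) & 0xFF


def signal_number(status):
    """Terminating signal stored in the low 7 bits (0 if none)."""
    return status & 0x7F


for status in (256, 32512):
    print(status, "-> exit code", exit_code(status),
          "signal", signal_number(status))
```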

    Note on the output:

    This "junk" is just a separator, so that RunHandlers knows it is getting a new page:
    ==P=>>>>=i===<<<<=T===>=A===<=!Jung[...]  -- page separator
    URL: http://www.powa.org/ -- page URL
    Date: June 3, 2004                -- when crawled
    Position: 695                         -- bytes into the site so far
    DocId: 1                                 -- sequential page id within site
    HTTP/1.1 200 OK                -- response to our http request
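
For post-processing cat-handler output, the per-page header shown above is easy to pick apart. The Python sketch below assumes each header field sits on its own "Name: value" line and that the header ends at the HTTP status line. The separator is matched only by the prefix visible above, since the rest of it is elided on this page. The function name and the "Status" key are illustrative.

```python
# Only the visible prefix of the separator; the remainder is elided above.
SEPARATOR_PREFIX = "==P=>>>>=i===<<<<=T===>=A===<=!Jung"


def parse_page_header(lines):
    """Collect URL, Date, Position and DocId from one page's header lines."""
    header = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(SEPARATOR_PREFIX):
            continue                      # skip blanks and the page separator
        if line.startswith("HTTP/"):
            header["Status"] = line       # the server's response line
            break                         # HTTP headers and body follow
        key, sep, value = line.partition(": ")
        if sep:
            header[key] = value
    for field in ("Position", "DocId"):   # numeric fields
        if field in header:
            header[field] = int(header[field])
    return header
```

Fed the sample above (without the explanatory "--" notes), this yields the URL, the crawl date, Position 695 and DocId 1.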


    Death threat:
    If a distributor is inactive for a while, we may kill it so that we can reuse the resources.
    To restart at the same point you must start a new distributor at the offset where it left off
    ( + 1 to prevent getting the previous page again).
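
As a sketch of that restart rule, the Python helper below builds the argument list for a new distribrequestor.pl run, in the order given under Method 1. It assumes the "Position" value of the last page you received is the offset where the stream left off. That is a guess based on the output notes above, so check it against your own stream before relying on it. The function name is illustrative.

```python
def restart_args(host, port, num_pages, first_site, last_site, last_position):
    """Arguments for distribrequestor.pl to resume after a killed distributor.

    last_position is assumed to be the "Position" header of the last
    page received from the old stream.
    """
    offset = last_position + 1   # + 1 so the previous page is not sent again
    # Order: host, port, num pages, starting site, ending site, offset.
    return [host, str(port), str(num_pages), first_site, last_site, str(offset)]
```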

    Putting out a  contract:
    If you are done, you can run  distribrelease.pl [remote-host] [host port] [stream port]
    from the same machine you requested on. We will immediately kill the distributor for you.
    We especially recommend this if you are running
    many requests in 1 day so that we do not run out of resources.

    If you specify firstSite/lastSite, please note that you can only use the root
    (e.g. www.ibm.com), not a page within the site (e.g. 01net.com/envoyerArticle/1 ),
    and don't include the http:// part.
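
If your site names come from a list of full URLs, a small helper can reduce each one to the root form described above. This Python sketch uses only the standard library; the function name is illustrative.

```python
from urllib.parse import urlsplit


def site_root(url_or_host):
    """Strip the scheme and any path, leaving just the host name."""
    text = url_or_host.strip()
    if "://" not in text:
        text = "//" + text       # lets urlsplit treat the first part as the host
    host = urlsplit(text).hostname
    if not host:
        raise ValueError(f"no host found in {url_or_host!r}")
    return host


print(site_root("http://www.ibm.com/us/en/"))     # www.ibm.com
print(site_root("01net.com/envoyerArticle/1"))    # 01net.com
```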

    -------------------------------------------------------------------
     

    To create a  new webpage stream handler:

    You can use the other handlers in the distribution as templates.
    To add a  new handler, add the following to the appropriate places:
     * 1) #include "myhandler.h" into handlers/all_handlers.h
     * 2) handler.push_back(new MyHandler()); into handlers/all_handlers.h
             (following the template of the handlers already there)
     * 3) in Makefile, add entries for your segments to compile
             in the line: HANDLER_OBJS = jhandler.o [...]
     * opt) in Makefile, customize your build if necessary by adding a line
               jhandler_CXXFLAGS = -Iyour-include-dir --your-switches [...]
              (following the template of the handlers already there)

    We also have a  one-button script called scripts/addHandler.pl that will
    prompt you for all your pieces and put them in place, without you having
    to do the above file surgery yourself.
     
     


     

    GLOSSARY


    WebVac - the WebBase web crawler or spider. Used to be called Pita.

    RunHandlers - (formerly "process") an executable that indexes a  stream,
                  file or repository.
                  Made up basically of a  feeder and one or more handlers.

    handler - the interface that any index-building piece of code must implement.
              The interface's main (only) method will provide a  page and associated
              metadata and the implementor of the method can do whatever he wants
              with it.

    feeder -  the interface for receiving a  page stream from any kind of source
              (directly from the repository, via Webcat, via network, etc.). The
              key method of the interface is "next" which advances the stream by one
              page. After calling next, various other methods can be used to get the
              associated metadata for the current page in the stream. Can also be used
              to build indexes if the index-building code is written to process page
              streams.

    distributor - a program that disseminates pages to multiple clients
               over the network, supporting session IDs, etc. A generalization of what
                Distributor.cc in Text-index/ does.

    offset - used in distributor requests to specify how many bytes from
             the beginning of the site the stream should start.

    DocId - DocId is computed within the download. If you download any portion of the crawl,
                   even from the middle,  it will begin with 0.  If you download all the crawl,
                   it will be monotonically increasing from start to end.
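
To make the feeder and handler entries concrete, here is a toy Python analogy. It is not the WebBase C++ interface: apart from "next", every class and method name below is hypothetical, and the feeder reads from an in-memory list instead of a repository or network stream.

```python
class ListFeeder:
    """Toy feeder over an in-memory list of (url, body) pairs."""

    def __init__(self, pages):
        self._pages = pages
        self._index = -1

    def next(self):
        """Advance the stream by one page; False once it is exhausted."""
        self._index += 1
        return self._index < len(self._pages)

    # Metadata accessors for the current page (hypothetical names).
    def url(self):
        return self._pages[self._index][0]

    def body(self):
        return self._pages[self._index][1]

    def doc_id(self):
        return self._index        # starts at 0 for any download


class SizeHandler:
    """Toy handler: records the body length of every page it is given."""

    def __init__(self):
        self.sizes = {}

    def process(self, url, doc_id, body):
        self.sizes[url] = len(body)


def run_handlers(feeder, handlers):
    """Feed every page in the stream to every handler; return the page count."""
    count = 0
    while feeder.next():
        for handler in handlers:
            handler.process(feeder.url(), feeder.doc_id(), feeder.body())
        count += 1
    return count
```

The point of the split is that a handler never needs to know where pages come from; swapping the list-backed feeder for a network-backed one would leave SizeHandler unchanged.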