Fetching Web Pages from the WebBase Web Page Repository

InfoLab (formerly the Database Group)
Gary Wesley (Jeez cs stanford Edu)
Updated May 4, 2012


This page describes how to retrieve Web pages from the Stanford WebBase Archive,
a World Wide Web page repository built as part of the Stanford Digital Libraries Project
by members of the Stanford InfoLab.


Countries using WebBase
WebBase has been visited from over 37 countries.


The Repository

This web repository holds about 7 billion web pages and is over 260TB (uncompressed size as of August 2011). It is intended for research into topics such as web graph analysis and press coverage of elections or disasters (we have a workbench for press coverage analysis and coding).
The general text crawls are each about 0.5TB compressed (1.5TB uncompressed). Sizes below are compressed. General crawls use almost the same site list each time, so we now effectively have rudimentary time series data. Lists of sites with page counts are available via the "site list" links below. Our web crawler, or spider, is named WebVac. We are working in cooperation with the Library of Congress and the California Digital Library.

Related pages: Building the client software; Architecture diagram; Technical report: Stanford WebBase Components and Applications.
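
The figures quoted above imply a rough average page size and compression ratio. The arithmetic can be sketched as follows, assuming decimal units (1 TB = 10^12 bytes) and that the 7 billion pages account for the full 260TB; both are assumptions, not statements from the WebBase team.

```python
# Back-of-the-envelope figures derived from the numbers quoted above.
# Assumption: decimal units, 1 TB = 10**12 bytes.
TB = 10**12

repository_bytes = 260 * TB      # uncompressed size as of August 2011
page_count = 7 * 10**9           # about 7 billion pages

avg_page_bytes = repository_bytes / page_count
compression_ratio = 1.5 / 0.5    # uncompressed / compressed, per general text crawl

print(f"average page size: {avg_page_bytes / 1000:.1f} kB")
print(f"compression ratio: {compression_ratio:.0f}:1")
```

Under these assumptions an average page is a few tens of kilobytes uncompressed, and a general text crawl compresses to about a third of its size.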

We now have tools for computational sociology in our Web Sociologist's Workbench. It was used for election coverage analysis by the Stanford Communication Department. Picture of a sample screen. (The letter in each checkbox label is a keyboard shortcut.) Here is a 2007 report on our efforts. A version of the workbench is being used for a memetic epidemiology project with the Stanford Medical School involving MySpace blogs.



We have a collection of the links from each of the general crawls. These are available upon request via ftp.

We have a C++ tool to convert from our format to ARC version 1 format (used by the Internet Archive and Heritrix). We are developing one for WARC, which is now an ISO standard and the International Internet Preservation Consortium (IIPC) standard. County, city, state and federal crawls through 2008 have been converted to ARC. We are considering converting the entire operation to WARC.
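
For readers unfamiliar with the target format, the sketch below shows roughly what one ARC version 1 URL record looks like: a one-line header (URL, IP address, 14-digit archive date, content type, body length) followed by the body. It is an illustration in Python, not the C++ converter mentioned above; the function name and the sample values are hypothetical, and a real ARC file also starts with a version block that is omitted here.

```python
from datetime import datetime

def arc_v1_record(url, ip, fetched, content_type, body):
    """Build one ARC v1 URL record: header line, body bytes, separating newline."""
    # Header fields are space-separated; the date is written as YYYYMMDDhhmmss.
    header = f"{url} {ip} {fetched:%Y%m%d%H%M%S} {content_type} {len(body)}\n"
    return header.encode("ascii") + body + b"\n"

# Hypothetical sample values, for illustration only.
record = arc_v1_record(
    "http://www.example.com/",
    "192.0.2.1",
    datetime(2008, 1, 15, 12, 0, 0),
    "text/html",
    b"<html></html>",
)
print(record.decode("ascii"))
```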



Wibbi:
If you don't want to bother with the client because you will not be building custom handlers, there is now a Web interface to the crawls. There are several custom filters to choose from, such as "and" and "or". Wibbi gives slower throughput than our C++ client, even with no filtering. A browser limit on Windows and Linux (except in Opera and Firefox 2.0.0.1+) restricts each download to 4GB. Since the filters run on our server, it is possible to filter more data than that, as long as the filtered result does not reach the limit.
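
When planning around the 4GB browser limit, it can help to estimate how many separate downloads a filtered result would need. A minimal sketch, assuming the size of the filtered output is known or estimated in advance; the function name is hypothetical and not part of Wibbi.

```python
import math

LIMIT_GB = 4  # per-download browser limit quoted above

def downloads_needed(filtered_output_gb, limit_gb=LIMIT_GB):
    """Number of separate downloads so that none is larger than the limit."""
    if filtered_output_gb <= 0:
        return 0
    return math.ceil(filtered_output_gb / limit_gb)

# Example: a filter expected to return about 10GB.
print(downloads_needed(10))
```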

If you decide to use the data, please email Gary (Jeez cs stanford Edu); we use this information in our funding requests.
We would also appreciate hearing about any papers that come out of your usage.


The Crawls

General Monthly Crawls
US Government
State and Local Governments
Newspapers
Universities
California 2003 Governor Recall
2004 National Elections
2005 California Special Election
Hurricane Katrina aftermath
2006 Mid Term Elections
Virginia Tech shooting
2008 Presidential Primaries and Election

General Crawls
2004   2005   2006   2007   2008   2009   2010   2011   2012


 Host       Port     Million pgs             Date         Mimetype   Type of web crawl  

                                                
WB9        8003     119      343GB         1/2001         Text     general crawl    site list

WB6        7902      44      152GB         3/2002         Text     general crawl    site list (use 2002getpages.pl)

WB1        7006      96      406GB         6/2003         Text     general crawl    site list

WB1        7008      96      423GB         8/2003         Text     general crawl    site list

WB1        7010     102      451GB        10/2003         Text     general crawl    site list

                             526GB
WB5        7012      36                   12/2003         Text     general crawl    site list
           7032      14                   12/2003         Image    general crawl    site list  

2004

WB1        7103      95      450GB         3/2004         Text     general crawl    site list
   
WB1        7114      6       447GB         4/2004         Image    general crawl    site list

                                457GB
WB2        7107     11.5                   7/2004         Text     general crawl   site list
           7117      4.2                   7/2004         Image    general crawl   site list
           7127      0.02                  7/2004         Audio    general crawl   site list
           7137      2.3                   7/2004         Other    general crawl   site list

WB23       7108      72      363GB         8/2004         Text     general crawl   site list


                             474GB
WB3        7109      36                    9/2004         Text     general crawl   site list
           7119       7                    9/2004         Image    general crawl   site list

WB4        7190     105       495GB       10/2004         Text     general crawl   site list

                             1561GB
[by special arrangement]
           7192     37                    12/2004         Text      general crawl  site list

           7193     14                    12/2004         Image     general crawl  site list
           7194     0.08                  12/2004         Audio     general crawl  site list
           7195     7.7                   12/2004         Other     general crawl  site list
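
The rows in these tables are whitespace-delimited, so they can be read mechanically. The following is a minimal sketch of a row parser, assuming the column order shown in the header (host, port, million pages, compressed size, date, mimetype); the regular expression and the function name are illustrative and are not part of the WebBase client. Continuation rows in a grouped crawl omit the host and the size, so both are optional.

```python
import re

ROW = re.compile(
    r"^(?:(?P<host>WB\S*)\s+)?"      # host, omitted on continuation rows
    r"(?P<port>\d{4})\s+"            # port number
    r"(?P<pages>\d*\.?\d+)\s+"       # millions of pages
    r"(?:(?P<size_gb>\d+)GB\s+)?"    # compressed size, omitted in grouped crawls
    r"(?P<date>[\d-]+/\d{4})\s+"     # month/year or month-range/year
    r"(?P<mimetype>Text|Image|Audio|Other)\b"
)

def parse_row(line):
    """Return a dict for a crawl row, or None for headers and notes."""
    m = ROW.match(line.strip())
    if m is None:
        return None
    row = m.groupdict()
    row["port"] = int(row["port"])
    row["pages"] = float(row["pages"])
    if row["size_gb"] is not None:
        row["size_gb"] = int(row["size_gb"])
    return row

full = parse_row("WB1        7103      95      450GB         3/2004         Text     general crawl    site list")
cont = parse_row("           7117      4.2                   7/2004         Image    general crawl   site list")
print(full)
print(cont)
```

Rows that do not fit the pattern (for example, a page count written as "200k") return None and would need separate handling.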

             


Host       Port     Million pgs             Date         Mimetype   Type of web crawl  


2005                               

                              980GB
WB15       7601     27                     1/2005         Text       general crawl  site list

           7611     6                      1/2005         Image      general crawl  site list
           7621     0.04                   1/2005         Audio      general crawl  site list
           7631     3.5                    1/2005         Other      general crawl  site list 

The next crawl is deeper: it was done with a pagemax of 20k per site instead of the usual 10k:
WB1        7603     85        440GB        3/2005         Text       general crawl  site list

WB2        7489     0.48      190GB      3-5/2005         Audio      general audio  site list

WB3        7604     98        480GB        4/2005         Text       general crawl  site list 

WB3        7605     79        460GB        5/2005         Text       general crawl  site list
 

WB22       7606    101        503GB        6/2005         Text       general crawl  site list

                              487GB
WB16       7658     9.5                    8/2005         Text       general crawl  site list
           7668     3.4                                   Image                     site list
           7658     .02                                   Audio                     site list
           7678     2                                     Other                     site list       

WB15       7609     97        490GB        9/2005         Text       general crawl  site list

WB13       7610     97        508GB       10/2005         Text       general crawl  site list

WB16       7691     93        527GB       11/2005         Text       general crawl  site list

                                 945GB
WB15       7612     20.7                  12/2005         Text       general crawl  site list
           7622      7                    12/2005         Image      general crawl  site list
           7632     0.04                  12/2005         Audio      general crawl  site list
           7642     4.5                   12/2005         Other      general crawl  site list
                                                                                      

                                                                                     


Host      Port   Million pgs             Date         Mimetype   Type of web crawl  


2006

WB1        7701     98        515GB        1/2006         Text       general crawl  site list

WB18       7702     93        490GB        2/2006         Text       general crawl  site list

WB15       7703     95        497GB        3/2006         Text       general crawl  site list

WB17       7704     92        493GB        4/2006         Text       general crawl  site list

WB17       7705     93        499GB        5/2006         Text       general crawl  site list    

WB18       7706     60        426GB        6/2006         Text       general crawl  site list

WB6        7707     92        501GB        7/2006         Text       general crawl  site list

WB4        7708     92        514GB        8/2006         Text       general crawl  site list

WB15       7709     91        502GB        9/2006         Text       general crawl  site list

WB15       7710     90        497GB        10/2006        Text       general crawl  site list

WB3        7730     10        353GB        10-11/2006     Image      general crawl  site list

WB4        7711     90        506GB        11/2006        Text       general crawl  site list

WB1        7712     90        511GB        12/2006        Text       general crawl  site list

WB1        7713     0.5       222GB        12/2006        Audio      general crawl  site list

                                                                                      

    


Host     Port   Million pgs           Date        Mimetype   Type of web crawl      


2007

WB2        7118     87        502GB        1/2007         Text       general crawl  site list

WB19       7106    103        590GB        2/2007         Text       general crawl  site list

WB23       7161    102        578GB        3/2007         Text       general crawl  site list

WB5        7163    100        578GB        4/2007         Text       general crawl  site list

WB15       7239     98        573GB        5/2007         Text       general crawl  site list

WB2        7260     98        578GB        6/2007         Text       general crawl  site list

WB15       7579     86        525GB        7/2007         Text       general crawl  site list

WB18       7262     87        514GB        8/2007         Text       general crawl  site list

WB19       7266     79        486GB        9/2007         Text       general crawl  site list

WB3        7272     80        492GB       10/2007         Text       general crawl  site list

WB6        7289     80        497GB       11/2007         Text       general crawl  site list

WB18       7291     79        494GB       12/2007         Text       general crawl  site list



Host   Port    Million pgs            Date      Mimetype   Type of web crawl      


2008

WB20        7320     81        498GB        1/2008         Text       general crawl  site list

WB21        7298     79        496GB        2/2008         Text       general crawl  site list

WB14        7299     80        507GB        3/2008         Text       general crawl  site list

WB1         7301     68        439GB        4/2008         Text       general crawl  site list

WB10        7476     66        500GB        5/2008         Text       general crawl  site list

WB14        7482     67        485GB        6/2008         Text       general crawl  site list

WB9         7483     75        498GB        7/2008         Text       general crawl  site list

WB2         7495     77        516GB        8/2008         Text       general crawl  site list

WB21        7518     76        522GB        9/2008         Text       general crawl  site list

WB3         7220     77        463GB       10/2008         Text       general crawl  site list

WB2         7571     77        522GB       11/2008         Text       general crawl  site list

WB9         7572     61        430GB       12/2008         Text       general crawl  site list



Host  Port  Million pgs  Date  Mimetype  Type of web crawl


2009

WB23     7565           75          525GB          1/2009       Text            general crawl    site list
WB23     7587           75          515GB          2/2009       Text            general crawl    site list
WB14     7254           72          506GB          3/2009       Text            general crawl    site list
WB2       7578           71          498GB          4/2009       Text            general crawl    site list
WB6       7633           71          497GB          5/2009       Text            general crawl    site list
WB15     7650           69          505GB          6/2009       Text            general crawl    site list
WB15     7651           67          482GB          7/2009       Text            general crawl    site list
WB2       7659           64          466GB          8/2009       Text            general crawl    site list
WB9       7671           64          478GB          9/2009       Text            general crawl    site list
WB12     7688           64          468GB       10/2009       Text            general crawl    site list
WB9       7688           64          479GB       11/2009       Text            general crawl    site list
WB19     7717          200k       113GB       11/2009       Audio         general  crawl  site list
WB9       7716           64          480GB       12/2009       Text            general crawl    site list


Host  Port  Million pgs  Date  Mimetype  Type of web crawl


2010

WB19       7718           66          518GB       1/2010       Text            general crawl    site list
WB10       7722           66          500GB       2/2010       Text            general crawl    site list
WB23       7725           66          514GB       3/2010       Text            general crawl    site list
WB20       7726           66          519GB       4/2010       Text            general crawl    site list
WB22       7733           67          523GB       5/2010       Text            general crawl    site list
Added 650 new sites from Google's list of most popular sites.
WB18       7753           62          502GB       6/2010       Text            general crawl    site list
WB21       7756           63          512GB       7/2010       Text            general crawl    site list
WB12       7758           65          518GB       8/2010       Text            general crawl    site list
WB19       7775           63          521GB       9/2010       Text            general crawl    site list
WB2         7776           50          424GB     10/2010       Text            general crawl    site list
WB1         7666           64          540GB     11/2010       Text            general crawl    site list
WB3         7791           56          480GB     12/2010       Text            general crawl    site list



Host  Port  Million pgs  Date  Mimetype  Type


2011

WB3       7792           39          380GB     1/2011      Text            general crawl    site list
WB6       7793           65          544GB     2/2011      Text            general crawl    site list
WB18     7423           66          561GB     3/2011      Text            general crawl    site list
WB1       7424           65          547GB     4/2011      Text            general crawl    site list
WB23     7448           68          599GB     5/2011      Text            general crawl    site list
WB9       7460           68          569GB     6/2011      Text            general crawl    site list
WB6      7521           67          588GB     7/2011      Text            general crawl    site list
WB14     7570           60          557GB     8/2011      Text            general crawl    site list
WB19     7590           45          405GB     9/2011      Text            general crawl    site list
WB19     7594           63          560GB    10/2011     Text            general crawl    site list
WB11     7354           64          581GB    11/2011     Text            general crawl    site list
WB4       7947           48          444GB    12/2011     Text            general crawl    site list

Host  Port  Million pgs  Date  Mimetype  Type


2012

WB11     7953           66          601GB      1/2012     Text            general crawl    site list
WB8       7969           64          597GB      2/2012     Text            general crawl    site list
WB4       7970           52          566GB      3/2012     Text            general crawl    site list
WB4       7988           64          608GB      4/2012     Text            general crawl    site list
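
Because the general crawls reuse almost the same site list, the monthly rows can be read as a rough time series. As a small worked example, the sketch below totals the four 2012 rows listed above and prints the compressed size per million pages for each month. The tuple layout is an assumption made for illustration.

```python
# (host, port, million pages, compressed GB, date) copied from the 2012 rows above
crawls_2012 = [
    ("WB11", 7953, 66, 601, "1/2012"),
    ("WB8", 7969, 64, 597, "2/2012"),
    ("WB4", 7970, 52, 566, "3/2012"),
    ("WB4", 7988, 64, 608, "4/2012"),
]

total_pages = sum(row[2] for row in crawls_2012)
total_gb = sum(row[3] for row in crawls_2012)

for host, port, pages, gb, date in crawls_2012:
    print(f"{date:>7}  {pages:3d}M pages  {gb / pages:5.2f} GB per million pages")

print(f"total: {total_pages}M pages, {total_gb}GB compressed")
```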

Specialized Crawls

University

Host       Port     Million pgs             Date      Mimetype   Type of web crawl    

WB1        7022      .28     1GB         11/2002        Text    U Cal@Berkeley site list

WB8        7050      .35    13GB          8/2003         All    Stanford University www.stanford.edu
[ we crawl 202 (now 1356) Stanford sites in our monthly text crawl ]

WB1        7300      .4      2GB         11/2004         Text    US CS        site list 

                                7.6GB
WB1        7440      .14                  1/2005         Text    U Cal@Berkeley site list
           7641      .07                  1/2005         Image   U Cal@Berkeley site list
           7492      .0001                1/2005         Audio   U Cal@Berkeley site list
           7443      .02                  1/2005         Other   U Cal@Berkeley site list

                              3GB
WB(fixing)        7060      .040   1.5GB         6/2005         Text    Stanford University site list
           7061      .038   125MB         6/2005         Image   Stanford University site list
           7062      60pgs                6/2005         Audio   Stanford University site list
           7063      .011   1.4GB         6/2005         Other   Stanford University site list

                              62GB
WB15       7688      0.9                 10/2009         Text    Stanford University site list
           7689      .03                 10/2009         Image   Stanford University site list
           7690      .001                10/2009         Audio   Stanford University site list
           7692      .2                  10/2009         Other   Stanford University site list


 [ we crawled 239 Stanford sites in our monthly text crawl (as of 10-2011 it is 1356)]


Government

US Government
    .mil is in the general crawl


Host     Port  Million pgs         Date         Mimetype   Type of web crawl         


                               214GB
WB2        7567      4.3                  7/2003         Text     US Government  site list
                        268GB
WB2        7506      3.4                  6/2004         Text     US Government  site list
           7516      1.6                  6/2004         Image                   site list
           7516      .003                 6/2004         Audio                   site list
           7536      1.2                  6/2004         Other                   site list

 

                            479GB
WB1        7509      2.8                  9/2004         Text     US Government  site list
           7519      1.5                  9/2004         Image                   site list
           7529      .006                 9/2004         Audio                   site list
           7539      1.1                  9/2004         Other                   site list


                


Host       Port     Million pgs    Date         Mimetype   Type of web crawl  


2005                         
 
               

                                  477GB
WB3        7644      2.5                  4/2005         Text  US Govt .gov + election site list
           7614      1.3                  4/2005         Image
           7624     .003                  4/2005         Audio
           7634      1.1                  4/2005         Other


The next 3 crawls have a maximum of 20,000 pages per site, on .gov sites only:
                                    363GB
WB15       7607      4.0                  6-7/2005       Text   US .gov       site list
           7617      2.0                  6-7/2005       Image  US .gov       site list

           7627      .004                 6-7/2005       Audio  US .gov       site list
           7637      1.7                  6-7/2005       Other  US .gov       site list

(updated site list from LOC)
                                   337GB
WB23       7799      3.3                  9/2005         Text   US .gov       site list
           7719      1.1                  9/2005         Image  US .gov       site list
           7729                           9/2005         Audio  US .gov       site list
           7739      1.4                  9/2005         Other  US .gov       site list

                                   233GB
WB15       8012      2.2                 12/2005         Text   US .gov       site list
           8022      1.1                 12/2005         Image  US .gov       site list
           8032      0.004               12/2005         Audio  US .gov       site list
           8042      1.0                 12/2005         Other  US .gov       site list

From here on we crawl up to 150,000 pages per .gov site, to a depth of 12, quarterly.
For the crawls below, we have removed the ca.gov sites, which are State of California sites, from the site list.
The ca.gov sites amount to about 100GB per crawl and can be made available upon request. They are also in the
state crawls.
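
The per-site limits described above (a page cap and a depth cap) can be illustrated with a breadth-first traversal. This is only a sketch over an in-memory link graph, not WebVac's actual implementation; the function name, the `links` mapping, and the convention that the start page has depth 0 are all assumptions.

```python
from collections import deque

def crawl_site(start, links, max_pages=150_000, max_depth=12):
    """Breadth-first visit of one site, stopping at a page cap or a depth cap.

    `links` maps a page to the pages it links to (a stand-in for fetching).
    """
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        page, depth = queue.popleft()
        visited.append(page)
        if depth == max_depth:
            continue  # do not follow links beyond the depth cap
        for target in links.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited

# Tiny hypothetical site: a -> b, c; b -> d; c -> d; d -> e
site = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
print(crawl_site("a", site, max_pages=10, max_depth=2))
print(crawl_site("a", site, max_pages=2, max_depth=12))
```

In the first call the depth cap stops the traversal before page "e"; in the second the page cap stops it after two pages.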

2006                              568GB
WB2        8001      5.6                  3/2006          Text   US .gov         site list
           8011      2.8                  3/2006          Image  US .gov         site list
            8021      0.007                3/2006          Audio  US .gov         site list
            8031      2.2                  3/2006          Other  US .gov         site list
                                                                       
                                    658GB
WB21       8041      6.1                 6-7/2006         Text   US .gov         site list
           8051      3.3                 6-7/2006         Image  US .gov         site list
           8052      0.01                6-7/2006         Audio  US .gov         site list
            8053      2.3                 6-7/2006         Other  US .gov         site list       

                                    726GB
WB1        7100      6.6                 9-10/2006        Text   US .gov         site list
           7101      3.3                                  Image                  site list
           7102      0.01                                 Audio                  site list
           7104      2.9                                  Other                  site list
                                                                    
                                    609GB
WB11       7149      7.0                  12/2006         Text   US .gov         site list
           7150      3.0                                  Image                  site list
           7151      0.01                                 Audio                  site list
           7152      3.0                                  Other                  site list
                                                                      
2007                              681GB
WB2        7157      8.1                  3/2007          Text   US .gov         site list
           7158      3.4                                  Image                  site list
           7159      0.01                                 Audio                  site list
           7160      3.1                                  Other                  site list

(Updated our list of sites here )

                                     613GB
WB15       7255      7.0                  6/2007          Text   US .gov         site list
           7256      3.0                                  Image                  site list
           7257      0.01                                 Audio                  site list
           7258      2.8                                  Other                  site list
 

(California ca.gov is not crawled from here on except as part of the state crawls)
                                    636GB
WB[fixing]       7267      5.5                  9/2007          Text   US .gov         site list
           7268      2.7                                  Image                  site list
           7269      0.01                                 Audio                  site list
           7270      2.4                                  Other                  site list
  

 
                                    629 GB
WB15       7292      5.4                 12/2007          Text   US .gov         site list
           7293      2.5                                  Image                  site list
           7295      0.01                                 Audio                  site list
           7296      2.3                                  Other                  site list
         
2008

                                    654 GB
WB9        7369      7.4                  3/2008          Text   US .gov         site list
           7370      3.4                                  Image                  site list
           7371      0.01                                 Audio                  site list
           7372      3.0                                  Other                  site list

                                    650 GB
WB8        7484      5.5                  6/2008          Text   US .gov         site list
           7485      2.5                                  Image                  site list
           7488      0.01                                 Audio                  site list
           7490      2.7                                  Other                  site list

                                    755 GB
WB2        7510      8.0                  9/2008          Text   US .gov         site list
           7513      3.2                                  Image                  site list
           7514      0.01                                 Audio                  site list
           7515      3.3                                  Other                  site list

                                    762 GB
WB2        7574      7.3                 12/2008          Text   US .gov         site list
           7575      3.2                                  Image                  site list
           7576      0.01                                 Audio                  site list
           7577      3.2                                  Other                  site list

2009

                                    880 GB
WB19       7645      8.0                 03/2009          Text   US .gov         site list
           7512      3.6                                  Image                  site list
           7522      0.01                                 Audio                  site list
           7532      3.4                                  Other                  site list

                                    826 GB
WB18       7646      7.1                 06/2009          Text   US .gov         site list
           7647      3.3                                  Image                  site list
           7648      0.01                                 Audio                  site list
           7649      3.0                                  Other                  site list

Updated site list from http://www.lib.lsu.edu/gov/index.html: 230 new sites.
Also added 2682 new sites by extracting links from a crawl.
NASA is crawled to a depth of 6 instead of 12.
USGS and NOAA are limited to 8 levels.

                              1600 GB
WB1        7663      9.9                 09/2009          Text   US .gov         site list
           7665      4.7                                  Image                  site list
           7667      0.02                                 Audio                  site list
           7670      4.7                                  Other                  site list

Even tighter page limits on NASA, USGS and NOAA below, to avoid collecting so many images.

                              599 GB
WB15       7699      5.9              12/2009          Text   US .gov         site list
           7700      2.0                               Image                  site list
           7714      0.01                              Audio                  site list
           7715      4                                 Other                  site list
2010

                              646 GB
WB3        7735      6                03/2010          Text   US .gov         site list
           7727      2.0                               Image                  site list
           7728      0.01                              Audio                  site list
           7731      2.6                               Other                  site list

                              699 GB
WB13       7734      6.3              06/2010          Text   US .gov         site list
           7736      2.1                               Image                  site list
           7737      0.01                              Audio                  site list
           7738      2.8                               Other                  site list

                                                            661GB

WB22       7759      6.4              09/2010          Text   US .gov         site list
           7760      2.0                               Image                  site list
           7761      0.01                              Audio                  site list
           7762      2.9                               Other                  site list

                                                          646GB

WB4        7779      5.7              12/2010          Text   US .gov         site list
           7780      1.9                               Image                  site list
           7781      0.02                              Audio                  site list
           7782      2.6                               Other                  site list

2011
Added 140 new sites by extracting links from December crawl.

                                                         585GB

WB18       7406      5.4             3/2011          Text   US .gov         site list
          7411      1.8                             Image                  site list
           7414      0.02                            Audio                  site list

            7417      2.5                             Other                  site list

Added senators
                                                         614GB

WB6        7406      5.5             6/2011          Text   US .gov         site list
          7411      1.8                             Image                  site list
           7414      0.02                            Audio                  site list

            7417      2.5                             Other                  site list

                                                         570GB

WB11       7573      4.7             9/2011          Text   US .gov         site list
          7580      1.4                             Image                  site list
           7582      0.02                            Audio                  site list

            7583      2.3                             Other                  site list


                          570GB

WB8        7948      4.3            12/2011          Text   US .gov         site list
           7950      2.0                             Image                  site list
           7951      0.02                            Audio                  site list
           7952      2.3                             Other                  site list

2012



 
State, County and Local Governments


Host     Port     Million pgs       Date       Mimetype Type of web crawl  


These sitelists were compiled from the site http://www.statelocalgov.net

State                          211GB
WB19       7204      2.3              5/2005         Text    State govt   site list
           7214      0.7              5/2005         Image   State govt   site list
           7224     .005              5/2005         Audio   State govt   site list
           7234      1.4              5/2005         Other   State govt   site list

County                          90GB
WB15       7264      1.2                  5/2005         Text    County govt  site list
           7274      0.5                  5/2005         Image   County govt  site list
           7284     .060                  5/2005         Audio   County govt  site list   
           7294      0.5                  5/2005         Other   County govt  site list

City and town                 188GB
WB8        7664      2.5                  5/2005         Text    City govt    site list
           7674      1.2                  5/2005         Image   City govt    site list
           7684     .001                  5/2005         Audio   City govt    site list
           7694      1.0                  5/2005         Other   City govt    site list  

 
 
Post Katrina crawl
State                           217GB

WB2        7465      2.1                  9/2005         Text    State govt   site list
           7466      0.7                  9/2005         Image   State govt   site list
           7467     .060                  9/2005         Audio   State govt   site list
           7468      1.3                  9/2005         Other   State govt   site list    


2006

State                           280GB
WB17       7365      2.0                  4/2006         Text    State govt    site list  
           7366      0.7                  4/2006         Image   State govt    site list
           7367     .006                  4/2006         Audio   State govt    site list
           7368      1.3                  4/2006         Other   State govt    site list
      
County                          115GB
WB2        7364      1.2                  4/2006         Text    County govt   site list
           7374      0.4                  4/2006         Image   County govt   site list
           7384     .002                  4/2006         Audio   County govt   site list
           7394      0.6                  4/2006         Other   County govt   site list

City and town                   238GB
WB10       7165      2.7                  4/2006         Text    City govt     site list
           7175      1.1                  4/2006         Image   City govt     site list
           7185      0.001                4/2006         Audio   City govt     site list
           7186      1.2                  4/2006         Other   City govt     site list


State                           251GB
WB15       7395      2.4                  9/2006         Text    State govt    site list
           7966      0.7                  9/2006         Image   State govt    site list
           7367     .008                  9/2006         Audio   State govt    site list
           7968      1.5                  9/2006         Other   State govt    site list

County                          126GB
WB2        7964      1.2                  9/2006         Text    County govt   site list
           7974      0.4                  9/2006         Image   County govt   site list
           7987     .002                  9/2006         Audio   County govt   site list
           7407      0.7                  9/2006         Other   County govt   site list

City and town                   258GB
WB16       7965      2.9              9/2006         Text    City govt     site list
           7975      1.1              9/2006         Image   City govt     site list
           7985     .002              9/2006         Audio   City govt     site list
           7986      1.3              9/2006         Other   City govt     site list


State                             263GB
WB8        7133      2.4                 12/2006         Text    State govt    site list
           7138      0.7                 12/2006         Image   State govt    site list
           7139     .008                 12/2006         Audio   State govt    site list
           7140      1.5                 12/2006         Other   State govt    site list

County                            129GB
WB1        7141      1.3                 12/2006         Text    County govt   site list
           7142      0.5                 12/2006         Image   County govt   site list
           7143     .002                 12/2006         Audio   County govt   site list
           7144      0.7                 12/2006         Other   County govt   site list

City and town                     270GB
WB5        7145      2.9                 12/2006         Text    City govt     site list
           7146      1.1                 12/2006         Image   City govt     site list
           7147     .002                 12/2006         Audio   City govt     site list
           7148      1.3                 12/2006         Other   City govt     site list

2007

State sites                       260GB

WB3        7246      2.4                 5/2007         Text    State govt     site list
           7247      0.7                 5/2007         Image   State govt     site list
           7248     .008                 5/2007         Audio   State govt     site list
           7249      1.5                 5/2007         Other   State govt     site list

County sites                      140GB

WB18       7242      1.3                 5/2007         Text    County govt    site list
           7243      0.5                 5/2007         Image   County govt    site list
           7244     .002                 5/2007         Audio   County govt    site list
           7245      0.7                 5/2007         Other   County govt    site list

City and town sites                279GB
WB6        7236      2.9                 5/2007         Text    City govt      site list
           7237      1.1                 5/2007         Image   City govt      site list
           7240     .002                 5/2007         Audio   City govt      site list
           7241      1.3                 5/2007         Other   City govt      site list     

 (The lists of sites to be crawled were updated here.)

State sites                   296 GB
WB11       7273      2.3                 10/2007         Text    State govt    site list
           7275      0.7                 10/2007         Image   State govt    site list
           7276      .01                 10/2007         Audio   State govt    site list
           7277      1.6                 10/2007         Other   State govt    site list

County sites                  143 GB
WB18       7278      1.4                 10/2007         Text    County govt   site list
           7280      0.4                 10/2007         Image   County govt   site list
           7281     .002                 10/2007         Audio   County govt   site list
           7282      0.8                 10/2007         Other   County govt   site list

City and town sites           301 GB
WB6        7283      3.1                 10/2007         Text    City govt     site list
           7285      1.1                 10/2007         Image   City govt     site list
           7286     .003                 10/2007         Audio   City govt     site list
           7287      1.5                 10/2007         Other   City govt     site list     

2008
State sites                        309GB
WB19       7436      2.2                 5/2008         Text    State govt     site list
           7437      0.7                 5/2008         Image   State govt     site list
           7438     .008                 5/2008         Audio   State govt     site list
           7439      1.6                 5/2008         Other   State govt     site list

County sites                       153GB
WB19       7456      1.4                 5/2008         Text    County govt    site list
           7457      0.4                 5/2008         Image   County govt    site list
           7458     .002                 5/2008         Audio   County govt    site list
           7459      0.8                 5/2008         Other   County govt    site list

City and town sites                327GB
WB6        7469      3.1                 5/2008         Text    City govt      site list
           7470      1.5                 5/2008         Image   City govt      site list
           7471     .004                 5/2008         Audio   City govt      site list
           7472      1.3                 5/2008         Other   City govt      site list
 


State sites                       298GB

WB20       7544      2.2                 11/2008         Text    State govt     site list
           7545      0.7                 11/2008         Image   State govt     site list
           7546     .008                 11/2008         Audio   State govt     site list
           7547      1.6                 11/2008         Other   State govt     site list

County sites                      167GB
WB1        7548      1.4                 11/2008         Text    County govt    site list
           7549      0.4                 11/2008         Image   County govt    site list
           7550     .002                 11/2008         Audio   County govt    site list
           7551      0.8                 11/2008         Other   County govt    site list

City and town sites               324GB
WB1        7552      3.2                 11/2008         Text    City govt      site list
           7553      1.0                 11/2008         Image   City govt      site list
           7554     .004                 11/2008         Audio   City govt      site list
           7555      1.5                 11/2008         Other   City govt      site list


Host     Port   Million pgs     Date       Mimetype Type of web crawl


2009

State sites                    297GB
WB2        7599    2.1                05/2009         Text    State govt     site list
           7615    0.6                05/2009         Image   State govt     site list
           7616    .01                05/2009         Audio   State govt     site list
           7618    1.6                05/2009         Other   State govt     site list

County sites                   163GB
WB2        7619    1.4                05/2009         Text    County govt    site list
           7620    0.4                05/2009         Image   County govt    site list
           7623    .002               05/2009         Audio   County govt    site list
           7625    0.8                05/2009         Other   County govt    site list

City and town sites            304GB
WB2       7626     3.0                05/2009         Text    City govt      site list
          7628     0.9                05/2009         Image   City govt      site list
          7629     3k                 05/2009         Audio   City govt      site list
          7630     1.4                05/2009         Other   City govt      site list


Curated a new list of state sites here by feeding back a crawl and looking for links to new state sites.

State sites                 332GB
WB15       7679    2.1                11/2009         Text    State govt     site list
           7680    0.5                11/2009         Image   State govt     site list
           7681    .01                11/2009         Audio   State govt     site list
           7682    1.6                11/2009         Other   State govt     site list

County sites                163GB
WB19       7673    1.3                10/2009         Text    County govt    site list
           7675    0.4                10/2009         Image   County govt    site list
           7676    .002               10/2009         Audio   County govt    site list
           7677    0.8                10/2009         Other   County govt    site list

City and town sites         322GB
WB20     7683     3.0                11/2009         Text    City govt      site list
         7685     1.0                11/2009         Image   City govt      site list
         7686     4k                 11/2009         Audio   City govt      site list
         7687     1.5                11/2009         Other   City govt      site list


State sites                 358GB
WB2        7740    2.0                5/2010         Text    State govt     site list
           7741    0.5                               Image                  site list
           7742    .01                               Audio                  site list
           7743    1.7                               Other                  site list

County sites               201GB
WB3        7744    1.3                5/2010         Text    County govt    site list
           7745    0.4                               Image                  site list
           7746    .002                              Audio                  site list
           7747    0.8                               Other                  site list

City and town sites       344GB
WB15       7748    3.3                5/2010         Text    City govt      site list
           7749    0.9                               Image                  site list
           7750    4k                                Audio                  site list
           7751    1.5                               Other                  site list


State sites                325GB
WB20        7763    1.6               11/2010         Text    State govt     site list
           7764    0.4                                Image                  site list
           7765    11k                                Audio                  site list
           7766    1.4                                Other                  site list

County sites              168GB
WB12       7767    1.3               11/2010         Text    County govt    site list
           7768    0.3                               Image                  site list
           7769    3k                                Audio                  site list
           7770    0.8                               Other                  site list


Updated Site list for cities from http://www.statelocalgov.net/

City and town sites         376GB
WB18       7771    3.8              11/2010         Text   City govt      site list
           7772    0.9                              Image                 site list
           7773    3.6k                             Audio                 site list
           7774    1.5                              Other                 site list

Updated Site lists here for county and state from http://www.statelocalgov.net/

State sites                  363GB
WB5        7428    1.8               5/2011         Text    State govt     site list
           7429    0.5                              Image                  site list
           7430    12k                              Audio                  site list
           7431    1.7                               Other                 site list

County sites                 195GB
WB3       7434    1.3               5/2011         Text    County govt    site list
          7435    0.3                              Image                  site list
          7441    3k                               Audio                  site list
          7442    0.8                              Other                  site list


City and town sites     400GB
WB5       7444    4.1              5/2011         Text   City govt      site list
          7445    0.9                              Image                 site list
          7446    5k                              Audio                 site list
          7447    1.7                             Other                 site list


State                        348GB
WB15      7319    1.7              11/2011         Text    State govt     site list
          7362    0.5                              Image                 site list
          7817    13k                              Audio                 site list
          7832    1.6                              Other                 site list

County                       195GB
WB19     7913    1.3              11/2011         Text    County govt    site list
         7915    0.3                              Image                  site list
         7916    4k                               Audio                  site list
         7941    0.9                              Other                  site list


City and town           410GB
WB21     7942    3.8               11/2011       Text   City govt      site list
         7943    0.9                              Image                 site list
         7945    5k                               Audio                 site list
         7946    1.7                              Other                 site list




Host  Port  Million pgs  Date  Mimetype  Type of web crawl  


California 2003 Governor Recall

WB1        7081    .006                   9/26/03          All     California recall site list
WB1        7082    .008                   9/27/03            "     California recall site list
WB1        7083    .2        5GB          9/29/03            "     California circus w/county gov site list site list
WB1        7084    .05      1.3GB         9/30/03            "     California recall site list
WB1        7085    .05                    10/1/03            "     California recall site list
WB1        7086    .05                    10/2/03            "     California recall site list
WB1        7087    .05                    10/3/03            "     California recall site list
WB1        7088    .05                    10/4/03            "     California recall site list
WB1        7089    .05                    10/5/03            "     California recall site list
WB1        7090    .05                    10/6/03            "     California recall site list
WB1        7091    .05                    10/7/03            "     California recall site list
WB1        7092    .05      1.3GB         10/8/03            "     California recall site list

WB1        7094    .05                    10/10/03           "     California recall site list
WB1        7095    .05                    11/04/03           "     California recall site list
WB1        7096    .05                    12/12/03           "     California recall site list

 
2004 American Elections
Available via Wibbi

California  2005 Special Election

Available via ftp

Hurricane Katrina (August 29th 2005) aftermath, Rita and Wilma

(3 of the 6 most intense Atlantic hurricanes ever recorded)
9/02/05-10/29/05: news, government and charity sites (available on Wibbi).
About 400 sites crawled daily, growing from 1GB to as much as 30GB per day, all MIME types.

Also good for researching non-hurricane  press coverage on consecutive days,
for instance doing sociological analysis or topical analysis.
We do not filter by topic, though the papers are all Gulf Coast regional press.
Newspaper crawls contain many archival stories and duplicates.

2006 US Mid Term Elections

We did daily text crawls of the 40 largest US papers up through the week after election day.
These are about 2.5GB per day. (available on Wibbi )
Newspaper crawls contain many archival stories and duplicates.

Monthly newspaper crawls

There are 140-160 US papers in our general monthly crawls,
made available as a separate collection on Wibbi.
560-800k pages per crawl. We could index earlier crawls for
you upon request.
Newspaper crawls contain many archival stories and duplicates.

Virginia Tech Shooting

We crawled regional news, college papers, psychiatric, supremacist,
gun control and European/Indian/Arabic/Korean news sites daily for a couple of weeks. (available on Wibbi)
Newspaper crawls contain many archival stories and duplicates.

2008 Primaries and presidential Election

Crawling the 13 largest US newspapers, plus magazine and candidate sites. (available on Wibbi)
Newspaper crawls contain many archival stories and duplicates.

2008 Hurricane Ike

Crawling 349 regional and news sites before the hurricane and for a month after. (available on Wibbi)

2009 Regime change

Crawling 5 major gov sites weekly to enable change metrics. ( available on Wibbi )

2012 Primaries and Elections

Crawling major papers, candidate sites and the "candidate" keyword in 3 separate crawls
until the 2012 elections. (available on Wibbi)
Newspaper crawls contain many archival stories and duplicates.


WebVac spider

WebVac crawls depth first, generally to a depth of 7 levels, and fetches a maximum of 10k pages per site.
We only follow links to pages within the domain. Until 2007 our general policy was to gather a 1.5TB sample.
Now we crawl a list of sites until the list is done or the month is over. We retry anything unavailable several times.
We pause between 1 and 12 seconds (almost always 12) between pages, depending on IP address bottlenecks.
For the federal government crawls, we take up to 150,000 pages to 12 levels over
a fairly static group of sites.
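
The crawl policy above can be pictured with a minimal Python sketch. This is not WebVac's code: the function name crawl_site, the get_links callback and the in-memory link graph are all hypothetical, "within the domain" is approximated as "same host name", and the exact meaning of "7 levels" (here: links are followed from depths 0 through 6) is an assumption. Retries and the pause between pages are left out.

```python
from urllib.parse import urlparse

MAX_DEPTH = 7        # general crawls: roughly 7 levels deep
MAX_PAGES = 10_000   # general crawls: at most 10k pages per site


def crawl_site(root_url, get_links, max_depth=MAX_DEPTH, max_pages=MAX_PAGES):
    """Return pages in depth-first visit order, staying on the root's host.

    get_links(url) is a hypothetical callback that returns the URLs
    linked from a page; a real crawler would fetch and parse the page.
    """
    host = urlparse(root_url).netloc
    visited = []
    seen = {root_url}
    stack = [(root_url, 0)]
    while stack and len(visited) < max_pages:
        url, depth = stack.pop()
        visited.append(url)
        if depth >= max_depth:
            continue  # deep enough: keep the page but do not follow its links
        # Push in reverse so the first link on the page is explored first.
        for link in reversed(get_links(url)):
            if link not in seen and urlparse(link).netloc == host:
                seen.add(link)
                stack.append((link, depth + 1))
    return visited
```

Under the same assumptions, passing max_depth=12 and max_pages=150_000 would mimic the limits quoted for the federal government crawls.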

 


 

Architecture

WebBase Architecture
Overall system screenshot from 2007


Client Software ( RunHandlers )

  • If you don't want to bother with the client because you will not be building custom handlers, there is now a Web interface to get pieces of up to 4GB of the 2003-present crawls on Wibbi.
  • These instructions assume Internet access to the machine hosting WebBase data and a CVS checkout of the WebBase code or an ftp get.
  • We allow specification of machine, port, first site and last site for the stream (e.g. www.ibm.com). distribrequestor.pl and getpages.pl also take those arguments. The webpage repository is organised by site, so offset means offset within the site.

  • RunHandlers is supported on 32-bit GNU/Linux and Solaris systems with GNU make (gmake), g++ (<= 3.4.0), Perl 5.05+, and W3C's libwww.

    1. Fetch the latest WebBase client source code from ftp://db.stanford.edu/pub/webbase.
    2. Unpack the source code. For example, GNU tar can do this with

       > tar xfz webbase-client-????-??-??.tar.gz
    3. Follow the instructions in the source code's README.client.

       > chdir dli2/src/WebBase/ && more README.client
    Build everything:
    (Use a 32-bit Linux box.)

    Make sure the library path includes W3C's libwww.
    This library must be installed by a system administrator with root privileges.

    Make sure the environment variable WEBBASE points to WebBase:
    setenv WEBBASE  [absolute path]/WebBase

    (1) Run GNU make:

       WebBase/> ./configure
       WebBase/> make client

    If you get:
    handlers/extract-hosts.h:27:21: WWWCore.h: No such file or directory
    handlers/extract-hosts.h:28:21: HTParse.h: No such file or directory

    your include path may be wrong.
    We expect the header to be at /usr/local/include/w3c-libwww/WWWCore.h,
    so you may need to change this path in Makefile.in and configure.
    (Order MAY matter.)
    Then rerun ./configure.

    To use later gcc versions, here is the hack.
    After running ./configure, do the following:

      1. Add -fpermissive to the CPPFLAGS on line 68 of the makefile.
      2. Comment out lines 34 and 35 in hashlookup/hashlookup.h:
             extern unsigned int hashlookup_error;
             extern unsigned int verbose_error;


    (2) Test your build.
         (a) Turn on cat-handler, which simply outputs what it receives.
                In inputs/webbase.conf, set
                CAT_ON = 1
         (b) Try RunHandlers on a  local example file:
                bin/RunHandlers inputs/webbase.conf \
               "file:///handlers/example-50-pages"
              [50 sample pages are printed]

    Now try the network version:

    Method 1:
     Run scripts/distribrequestor.pl to start a  distributor:
     (either chmod +x scripts/*.pl or invoke it with "perl")
    args: (must be in this order)
    # host
    # port
    # num pages
    # starting web site (optional) e.g. www.ibm.com
    # ending web site (optional)
    # offset in bytes within web site (optional)
     

    [example run:]
    WebBase/scripts> distribrequestor.pl wb1 7008 100
     distrib daemon returned 171.64.75.151 7160
     (use as ../bin/RunHandlers ../inputs/webbase.conf "net://171.64.75.151:7160/?numPages=100" )
    WebBase/scripts>
     Now you can invoke RunHandlers with the above info:
     ( cut and paste it from the echo)
    WebBase/scripts> ../bin/RunHandlers ../inputs/webbase.conf "net://171.64.75.151:7160/?numPages=100"
     will print back 100 sample pages.  All instances of RunHandlers connected to
     the above port will share the same pool of pages.  To get an independent
     stream, run distribrequestor.pl to get a  new port.
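
If you would rather not cut and paste, the step from the distributor's reply to the RunHandlers invocation can be scripted. The sketch below is ours, not part of the WebBase distribution: the function name runhandlers_command is hypothetical, and it assumes the reply always contains a line of the form shown in the example run ("distrib daemon returned <ip> <port>").

```python
import re


def runhandlers_command(reply, num_pages, conf="../inputs/webbase.conf"):
    """Build the RunHandlers argument list from distribrequestor.pl output.

    Assumes the reply contains 'distrib daemon returned <ip> <port>',
    as in the example run; any other shape raises ValueError.
    """
    match = re.search(
        r"distrib daemon returned\s+(\d{1,3}(?:\.\d{1,3}){3})\s+(\d+)", reply)
    if match is None:
        raise ValueError("no distributor address in reply: %r" % reply)
    ip, port = match.group(1), int(match.group(2))
    url = "net://%s:%d/?numPages=%d" % (ip, port, num_pages)
    return ["../bin/RunHandlers", conf, url]
```

The returned list can be handed to subprocess.run from the scripts directory; the relative paths are the ones used in the example run and would need adjusting elsewhere.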
     

    Method 2:
     You can also use our one-step script getpages.pl (no need to specify a  first site )
    (either chmod +x scripts/*.pl or invoke it with "perl")

    args: (must be in this order)
    # num pages
    # host
    # port
    # starting web site (optional) e.g. www.ibm.com
    # ending web site (optional)
    # offset in bytes within web site (optional)

    [example run:]
    WebBase/scripts> getpages.pl 2 wb1 7008 www.ibm.com www.ibm.com (only give me www.ibm.com)
     Starting getpages.pl using Perl 5.6.0
     Do you want to run
    /dfs/sole/6/gary/dli2/src/WebBase//bin/RunHandlers /dfs/sole/6/gary/dli2/src/WebBase//inputs/webbase.conf "net://171.64.75.151:7163/?numPages=2" now?(Y/N):
    WebBase/scripts> Y

    To get the full content of each page, set CAT_ON = 1 in inputs/*.conf.

    If you get the error:
    bin/RunHandlers: error while loading shared libraries: libwwwcore.so.0:
    cannot open shared object file: No such file or directory
    your library path is not set correctly.
    Try setting a variable called LD_LIBRARY_PATH in the shell where you are about to run the
    WebBase client. For example, if you found libwwwcore.so at
    /opt/somewhere/lib/libwwwcore.so, then you could tell your system:
    setenv LD_LIBRARY_PATH /opt/somewhere/lib

    Return codes:
    Contact us to report these.

    A blank page means there is no server running on that port.
    If you get a line of just numbers and not much else:
    256 means we have a distributor running on a server with no data or with a dangling
    softlink.
    32512 usually means a missing softlink on the server
    ( fix is ln -s /u/gary/WebBase.centos/bin/runhandlers /lfs/1/tmp/webbase/runhandlers )
    or a missing shared library: libwwwutils.so.0
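
The values 256 and 32512 have the shape of a raw 16-bit wait status, the kind of number Perl's system function returns, where the exit code sits in the high byte. Assuming that is what is being printed (the text above does not say so), a small sketch decodes them. The function name decode_status is hypothetical.

```python
def decode_status(status):
    """Split a 16-bit wait status into (exit_code, signal_number)."""
    exit_code = (status >> 8) & 0xFF   # high byte: exit code of the child
    signal_number = status & 0x7F      # low 7 bits: signal that killed it, if any
    return exit_code, signal_number


print(decode_status(256))    # (1, 0)
print(decode_status(32512))  # (127, 0)
```

Under that reading, 256 is a plain exit code of 1, and 32512 is exit code 127, which is the conventional shell status for a command that could not be found. That would be consistent with the missing softlink explanation, but please still report either value to us.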

    Note on the output:

    This next line is just a separator, so that RunHandlers knows it is getting a new page:
    ==P=>>>>=i===<<<<=T===>=A===<=!Jung[...]  -- page separator
    URL: http://www.powa.org/ -- page URL
    Date: June 3, 2004                -- when crawled
    Position: 695                         -- bytes into the site so far
    DocId: 1                                 -- sequential page id within site
    HTTP/1.1 200 OK                -- response to our http request
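
For downstream processing, the per-page header can be picked apart with a few lines of Python. This is a sketch under assumptions: the " -- ..." remarks in the sample above are taken to be annotations rather than part of the real output, the separator is matched only by the prefix shown (the rest is elided above, so it is not guessed here), and the function name parse_page_headers is hypothetical.

```python
# Only the visible start of the separator; the remainder is elided in the sample.
SEPARATOR_PREFIX = "==P=>>>>=i===<<<<=T===>=A===<=!Jung"
HEADER_KEYS = ("URL", "Date", "Position", "DocId")


def parse_page_headers(lines):
    """Yield one dict of header fields for each page in a RunHandlers dump."""
    page = None
    for line in lines:
        line = line.strip()
        if line.startswith(SEPARATOR_PREFIX):
            if page is not None:
                yield page
            page = {}
        elif page is not None:
            key, sep, value = line.partition(": ")
            # Keep only the first occurrence, so page bodies cannot overwrite it.
            if sep and key in HEADER_KEYS and key not in page:
                page[key] = int(value) if key in ("Position", "DocId") else value
    if page is not None:
        yield page
```

Everything that is not one of the four header fields, including the HTTP response line and the page body, is ignored by this sketch.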


    Death threat:
    If a distributor is inactive for a while, it may be killed by us so that we can reuse the resources.
    To restart at the same point you must start a new distributor at the offset where it left off
    (+ 1 to prevent getting the previous page again).
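
As a sketch of the restart arithmetic, the helper below builds the argument list for a new distribrequestor.pl request in the documented order (host, port, num pages, starting site, ending site, offset). It assumes that the last Position value you received is the offset the old distributor "left off" at, which the text does not state explicitly; the function name restart_args is hypothetical.

```python
def restart_args(host, port, num_pages, first_site, last_site, last_position):
    """Arguments for distribrequestor.pl that resume after last_position."""
    offset = last_position + 1  # + 1 so the previous page is not sent again
    return [host, str(port), str(num_pages), first_site, last_site, str(offset)]


print(" ".join(["distribrequestor.pl"]
               + restart_args("wb1", 7008, 100, "www.ibm.com", "www.ibm.com", 695)))
# distribrequestor.pl wb1 7008 100 www.ibm.com www.ibm.com 696
```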

    Putting out a  contract:
    If you are done, you can run  distribrelease.pl [remote-host] [host port] [stream port]
    from the same machine you requested on. We will immediately kill the distributor for you.
    We especially recommend this if you are running
    many requests in 1 day so that we do not run out of resources.

    If you specify firstSite/lastSite, please note that you can only use the root
    (e.g. www.ibm.com), not a page within the site (e.g. 01net.com/envoyerArticle/1),
    and don't include the http:// part.
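
If your site names come from a list of URLs, they can be reduced to the root form first. A minimal sketch using Python's standard urllib.parse; the function name site_root is hypothetical, and lowercasing the host is our choice rather than a documented requirement.

```python
from urllib.parse import urlsplit


def site_root(url_or_host):
    """Reduce a URL or host/path string to the bare host name."""
    text = url_or_host.strip()
    if "://" not in text:
        text = "//" + text  # lets urlsplit read the first component as the host
    return urlsplit(text).hostname or ""


print(site_root("http://www.ibm.com/products/index.html"))  # www.ibm.com
print(site_root("01net.com/envoyerArticle/1"))              # 01net.com
```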

    -------------------------------------------------------------------
     

    To create a  new webpage stream handler:

    You can use the other handlers in the distribution as templates.
    To add a  new handler, add the following to the appropriate places:
     * 1) #include "myhandler.h" into handlers/all_handlers.h
     * 2) handler.push_back(new MyHandler()); into handlers/all_handlers.h
             (following the template of the handlers already there)
     * 3) in Makefile, add entries for your segments to compile
             in the line: HANDLER_OBJS = jhandler.o [...]
     * opt) in Makefile, customize your build if necessary by adding a  line
               jhandler_CXXFLAGS = -Iyour-include-dir --your-switches [...]
              (following the template of the handlers already there)

    We also have a  one-button script called scripts/addHandler.pl that will
    prompt you for all your pieces and put them in place, without you having
    to do the above file surgery yourself.
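
    To picture how a feeder and its handlers fit together (both terms are defined in the glossary), here is a conceptual model in Python. It is not the C++ interface in the distribution: every class, method and function name below is hypothetical, and real handlers also receive metadata that this sketch omits.

```python
class CountingHandler:
    """Hypothetical handler: counts the pages and bytes it is given."""

    def __init__(self):
        self.pages = 0
        self.bytes = 0

    def process(self, url, content):
        self.pages += 1
        self.bytes += len(content)


class ListFeeder:
    """Hypothetical feeder over an in-memory list of (url, content) pairs."""

    def __init__(self, pages):
        self._pages = list(pages)
        self._index = -1

    def next(self):
        """Advance the stream by one page; False when it is exhausted."""
        self._index += 1
        return self._index < len(self._pages)

    def url(self):
        return self._pages[self._index][0]

    def content(self):
        return self._pages[self._index][1]


def run_handlers(feeder, handlers):
    """Give every page in the stream to every handler, in order."""
    while feeder.next():
        for handler in handlers:
            handler.process(feeder.url(), feeder.content())
```

    In this model, adding a handler means appending one more object to the list passed to run_handlers, which mirrors the push_back step in the list above.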
     
     


     

    GLOSSARY


    WebVac - the WebBase web crawler or spider. Used to be called Pita.

    RunHandlers - (formerly "process") an executable that indexes a  stream,
                  file or repository.
                  Made up basically of a  feeder and one or more handlers.

    handler - the interface that any index-building piece of code must implement.
              The interface's main (only) method will provide a  page and associated
              metadata and the implementor of the method can do whatever he wants
              with it.

    feeder -  the interface for receiving a  page stream from any kind of source
              (directly from the repository, via Webcat, via network, etc.). The
              key method of the interface is "next" which advances the stream by one
              page. After calling next, various other methods can be used to get the
              associated metadata for the current page in the stream. Can also be used
              to build indexes if the index-building code is written to process page
              streams

    distributor - a program that disseminates pages to multiple clients
               over the network, supporting session IDs, etc. A generalization of what
                Distributor.cc in Text-index/ does.

    offset - used in distributor requests to specify how many bytes from
             the beginning of the site the stream should start.

    DocId - DocId is computed within the download. If you download any portion of the crawl,
                   even from the middle,  it will begin with 0.  If you download all the crawl,
                   it will be monotonically increasing from start to end.

    flutes