If you decide to use the data, please email Gary (Jeez cs stanford Edu) for our funding requests.
We would also appreciate knowing of any papers that come out of your usage.
General Crawls
2004 2005 2006 2007 2008 2009 2010 2011 2012
WB9 8003 119 343GB 1/2001 Text general crawl site list
WB6 7902 44 152GB 3/2002 Text general crawl site list (use 2002getpages.pl)
WB1 7006 96 406GB 6/2003 Text general crawl site list
WB1 7008 96 423GB 8/2003 Text general crawl site list
WB1 7010 102 451GB 10/2003 Text general crawl site list
526GB
WB5 7012 36 12/2003 Text general crawl site list
7032 14 12/2003 Image general crawl site list
WB1 7103 95 450GB 3/2004 Text general crawl site list
WB1 7114 6 447GB 4/2004 Image general crawl site list
457GB
WB2 7107 11.5 7/2004 Text general crawl site list
7117 4.2 7/2004 Image general crawl site list
7127 0.02 7/2004 Audio general crawl site list
7137 2.3 7/2004 Other general crawl site list
WB23 7108 72 363GB 8/2004 Text general crawl site list
474GB
WB3 7109 36 9/2004 Text general crawl site list
7119 7 9/2004 Image general crawl site list
WB4 7190 105 495GB 10/2004 Text general crawl site list
1561GB [by special arrangement]
7192 37 12/2004 Text general crawl site list
7193 14 12/2004 Image general crawl site list
7194 0.08 12/2004 Audio general crawl site list
7195 7.7 12/2004 Other general crawl site list
Host Port Million pgs Date Mimetype Type of web crawl
980GB
WB15 7601 27 1/2005 Text general crawl site list
7611 6 1/2005 Image general crawl site list
7621 0.04 1/2005 Audio general crawl site list
7631 3.5 1/2005 Other general crawl site list
This next, deeper crawl was done with a pagemax of 20k per site instead of the usual 10k:
WB1 7603 85 440GB 3/2005 Text general crawl site list
WB2 7489 0.48 190GB 3-5/2005 Audio general audio site list
WB3 7604 98 480GB 4/2005 Text general crawl site list
WB3 7605 79 460GB 5/2005 Text general crawl site list
WB22 7606 101 503GB 6/2005 Text general crawl site list
487GB
WB16 7658 9.5 8/2005 Text general crawl site list
7668 3.4 Image site list
7658 .02 Audio site list
7678 2 Other site list
WB15 7609 97 490GB 9/2005 Text general crawl site list
WB13 7610 97 508GB 10/2005 Text general crawl site list
WB16 7691 93 527GB 11/2005 Text general crawl site list
945GB
WB15 7612 20.7 12/2005 Text general crawl site list
7622 7 12/2005 Image general crawl site list
7632 0.04 12/2005 Audio general crawl site list
7642 4.5 12/2005 Other general crawl site list
Host Port Million pgs Date Mimetype Type of web crawl
WB1 7701 98 515GB 1/2006 Text general crawl site list
WB18 7702 93 490GB 2/2006 Text general crawl site list
WB15 7703 95 497GB 3/2006 Text general crawl site list
WB17 7704 92 493GB 4/2006 Text general crawl site list
WB17 7705 93 499GB 5/2006 Text general crawl site list
WB18 7706 60 426GB 6/2006 Text general crawl site list
WB6 7707 92 501GB 7/2006 Text general crawl site list
WB4 7708 92 514GB 8/2006 Text general crawl site list
WB15 7709 91 502GB 9/2006 Text general crawl site list
WB15 7710 90 497GB 10/2006 Text general crawl site list
WB3 7730 10 353GB 10-11/2006 Image general crawl site list
WB4 7711 90 506GB 11/2006 Text general crawl site list
WB1 7712 90 511GB 12/2006 Text general crawl site list
WB1 7713 0.5 222GB 12/2006 Audio general crawl site list
Host Port Million pgs Date Mimetype Type of web crawl
WB2 7118 87 502GB 1/2007 Text general crawl site list
WB19 7106 103 590GB 2/2007 Text general crawl site list
WB23 7161 102 578GB 3/2007 Text general crawl site list
WB5 7163 100 578GB 4/2007 Text general crawl site list
WB15 7239 98 573GB 5/2007 Text general crawl site list
WB2 7260 98 578GB 6/2007 Text general crawl site list
WB15 7579 86 525GB 7/2007 Text general crawl site list
WB18 7262 87 514GB 8/2007 Text general crawl site list
WB19 7266 79 486GB 9/2007 Text general crawl site list
WB3 7272 80 492GB 10/2007 Text general crawl site list
WB6 7289 80 497GB 11/2007 Text general crawl site list
WB18 7291 79 494GB 12/2007 Text general crawl site list
Host Port Million pgs Date Mimetype Type of web crawl
WB20 7320 81 498GB 1/2008 Text general crawl site list
WB21 7298 79 496GB 2/2008 Text general crawl site list
WB14 7299 80 507GB 3/2008 Text general crawl site list
WB1 7301 68 439GB 4/2008 Text general crawl site list
WB10 7476 66 500GB 5/2008 Text general crawl site list
WB14 7482 67 485GB 6/2008 Text general crawl site list
WB9 7483 75 498GB 7/2008 Text general crawl site list
WB2 7495 77 516GB 8/2008 Text general crawl site list
WB21 7518 76 522GB 9/2008 Text general crawl site list
WB3 7220 77 463GB 10/2008 Text general crawl site list
WB2 7571 77 522GB 11/2008 Text general crawl site list
WB9 7572 61 430GB 12/2008 Text general crawl site list
Host Port Million pgs Date Mimetype Type of web crawl
WB1 7300 .4 2GB 11/2004 Text US CS site list
7.6GB
WB1 7440 .14 1/2005 Text U Cal@Berkeley site list
7641 .07 1/2005 Image U Cal@Berkeley site list
7492 .0001 1/2005 Audio U Cal@Berkeley site list
7443 .02 1/2005 Other U Cal@Berkeley site list
3GB
WB(fixing) 7060 .040 1.5GB 6/2005 Text Stanford University site list
7061 .038 125MB 6/2005 Image Stanford University site list
7062 60pgs 6/2005 Audio Stanford University site list
7063 .011 1.4GB 6/2005 Other Stanford University site list
62GB
WB15 7688 0.9 10/2009 Text Stanford University site list
7689 .03 10/2009 Image Stanford University site list
7690 .001 10/2009 Audio Stanford University site list
7692 .2 10/2009 Other Stanford University site list
Government
US Government
.mil is in the general crawl
Host Port Million pgs Date Mimetype Type of web crawl
214GB
WB2 7567 4.3 7/2003 Text US Government site list
268GB
WB2 7506 3.4 6/2004 Text US Government site list
7516 1.6 6/2004 Image site list
7516 .003 6/2004 Audio site list
7536 1.2 6/2004 Other site list
479GB
WB1 7509 2.8 9/2004 Text US Government site list
7519 1.5 9/2004 Image site list
7529 .006 9/2004 Audio site list
7539 1.1 9/2004 Other site list
Host Port Million pgs Date Mimetype Type of web crawl
2005
477GB
WB3 7644 2.5 4/2005 Text US Govt .gov + election site list
7614 1.3 4/2005 Image
7624 .003 4/2005 Audio
7634 1.1 4/2005 Other
Next 3: 20,000/site max on .gov only
363GB
WB15 7607 4.0 6-7/2005 Text US .gov site list
7617 2.0 6-7/2005 Image US .gov site list
7627 .004 6-7/2005 Audio US .gov site list
7637 1.7 6-7/2005 Other US .gov site list
(updated site list from LOC)
337GB
WB23 7799 3.3 9/2005 Text US .gov site list
7719 1.1 9/2005 Image US .gov site list
7729 9/2005 Audio US .gov site list
7739 1.4 9/2005 Other US .gov site list
233GB
WB15 8012 2.2 12/2005 Text US .gov site list
8022 1.1 12/2005 Image US .gov site list
8032 0.004 12/2005 Audio US .gov site list
8042 1.0 12/2005 Other US .gov site list
From here on we crawl up to 150,000 pages per .gov site, to a depth of 12, quarterly.
For the crawls below, we have removed the ca.gov sites (the state sites for California) from the site list.
The ca.gov sites are about 100GB for each crawl and can be made available upon request. They are also in the state crawls.
2006
568GB
WB2 8001 5.6 3/2006 Text US .gov site list
8011 2.8 3/2006 Image US .gov site list
8021 0.007 3/2006 Audio US .gov site list
8031 2.2 3/2006 Other US .gov site list
658GB
WB21 8041 6.1 6-7/2006 Text US .gov site list
8051 3.3 6-7/2006 Image US .gov site list
8052 0.01 6-7/2006 Audio US .gov site list
8053 2.3 6-7/2006 Other US .gov site list
726GB
WB1 7100 6.6 9-10/2006 Text US .gov site list
7101 3.3 Image site list
7102 0.01 Audio site list
7104 2.9 Other site list
609GB
WB11 7149 7.0 12/2006 Text US .gov site list
7150 3.0 Image site list
7151 0.01 Audio site list
7152 3.0 Other site list
2007
681GB
WB2 7157 8.1 3/2007 Text US .gov site list
7158 3.4 Image site list
7159 0.01 Audio site list
7160 3.1 Other site list
(Updated our list of sites here)
613GB
WB15 7255 7.0 6/2007 Text US .gov site list
7256 3.0 Image site list
7257 0.01 Audio site list
7258 2.8 Other site list
(California ca.gov is not crawled from here on except as part of the state crawls)
636GB
WB[fixing] 7267 5.5 9/2007 Text US .gov site list
7268 2.7 Image site list
7269 0.01 Audio site list
7270 2.4 Other site list
629GB
WB15 7292 5.4 12/2007 Text US .gov site list
7293 2.5 Image site list
7295 0.01 Audio site list
7296 2.3 Other site list
2008
654GB
WB9 7369 7.4 3/2008 Text US .gov site list
7370 3.4 Image site list
7371 0.01 Audio site list
7372 3.0 Other site list
650GB
WB8 7484 5.5 6/2008 Text US .gov site list
7485 2.5 Image site list
7488 0.01 Audio site list
7490 2.7 Other site list
755GB
WB2 7510 8.0 9/2008 Text US .gov site list
7513 3.2 Image site list
7514 0.01 Audio site list
7515 3.3 Other site list
762GB
WB2 7574 7.3 12/2008 Text US .gov site list
7575 3.2 Image site list
7576 0.01 Audio site list
7577 3.2 Other site list
2009
880GB
WB19 7645 8.0 03/2009 Text US .gov site list
7512 3.6 Image site list
7522 0.01 Audio site list
7532 3.4 Other site list
826GB
WB18 7646 7.1 06/2009 Text US .gov site list
7647 3.3 Image site list
7648 0.01 Audio site list
7649 3.0 Other site list
Updated site list from: http://www.lib.lsu.edu/gov/index.html, 230 new sites
Also got 2682 new sites by extracting links from a crawl.
NASA to a depth of 6 instead of 12
USGS and NOAA limited to 8 levels
1600GB
WB1 7663 9.9 09/2009 Text US .gov site list
7665 4.7 Image site list
7667 0.02 Audio site list
7670 4.7 Other site list
Even tighter page limits on NASA, USGS, and NOAA below, so as not to get so many images
599GB
WB15 7699 5.9 12/2009 Text US .gov site list
7700 2.0 Image site list
7714 0.01 Audio site list
7715 4 Other site list
2010
646GB
WB3 7735 6 03/2010 Text US .gov site list
7727 2.0 Image site list
7728 0.01 Audio site list
7731 2.6 Other site list
699GB
WB13 7734 6.3 06/2010 Text US .gov site list
7736 2.1 Image site list
7737 0.01 Audio site list
7738 2.8 Other site list
WB22 7759 6.4 09/2010 Text US .gov site list
7760 2.0 Image site list
7761 0.01 Audio site list
7762 2.9 Other site list
WB4 7779 5.7 12/2010 Text US .gov site list
7780 1.9 Image site list
7781 0.02 Audio site list
7782 2.6 Other site list
WB18 7406 5.4 3/2011 Text US .gov site list
7411 1.8 Image site list
7414 0.02 Audio site list
7417 2.5 Other site list
WB6 7406 5.5 6/2011 Text US .gov site list
7411 1.8 Image site list
7414 0.02 Audio site list
7417 2.5 Other site list
WB11 7573 4.7 9/2011 Text US .gov site list
7580 1.4 Image site list
7582 0.02 Audio site list
7583 2.3 Other site list
570GB
WB8 7948 4.3 12/2011 Text US .gov site list
7950 2.0 Image site list
7951 0.02 Audio site list
7952 2.3 Other site list
State, County and Local Governments
Host Port Million pgs Date Mimetype Type of web crawl
State 211GB
WB19 7204 2.3 5/2005 Text State govt site list
7214 0.7 5/2005 Image State govt site list
7224 .005 5/2005 Audio State govt site list
7234 1.4 5/2005 Other State govt site list
County 90GB
WB15 7264 1.2 5/2005 Text County govt site list
7274 0.5 5/2005 Image County govt site list
7284 .060 5/2005 Audio County govt site list
7294 0.5 5/2005 Other County govt site list
City and town 188GB
WB8 7664 2.5 5/2005 Text City govt site list
7674 1.2 5/2005 Image City govt site list
7684 .001 5/2005 Audio City govt site list
7694 1.0 5/2005 Other City govt site list
Post Katrina crawl
State 217GB
WB2 7465 2.1 9/2005 Text State govt site list
7466 0.7 9/2005 Image State govt site list
7467 .060 9/2005 Audio State govt site list
7468 1.3 9/2005 Other State govt site list
State 280GB
WB17 7365 2.0 4/2006 Text State govt site list
7366 0.7 4/2006 Image State govt site list
7367 .006 4/2006 Audio State govt site list
7368 1.3 4/2006 Other State govt site list
County 115GB
WB2 7364 1.2 4/2006 Text County govt site list
7374 0.4 4/2006 Image County govt site list
7384 .002 4/2006 Audio County govt site list
7394 0.6 4/2006 Other County govt site list
City and town 238GB
WB10 7165 2.7 4/2006 Text City govt site list
7175 1.1 4/2006 Image City govt site list
7185 0.001 4/2006 Audio City govt site list
7186 1.2 4/2006 Other City govt site list
State 251GB
WB15 7395 2.4 9/2006 Text State govt site list
7966 0.7 9/2006 Image State govt site list
7367 .008 9/2006 Audio State govt site list
7968 1.5 9/2006 Other State govt site list
County 126GB
WB2 7964 1.2 9/2006 Text County govt site list
7974 0.4 9/2006 Image County govt site list
7987 .002 9/2006 Audio County govt site list
7407 0.7 9/2006 Other County govt site list
County 129GB
WB1 7141 1.3 12/2006 Text County govt site list
7142 0.5 12/2006 Image County govt site list
7143 .002 12/2006 Audio County govt site list
7144 0.7 12/2006 Other County govt site list
State sites 260GB
WB3 7246 2.4 5/2007 Text State govt site list
7247 0.7 5/2007 Image State govt site list
7248 .008 5/2007 Audio State govt site list
7249 1.5 5/2007 Other State govt site list
County sites 140GB
WB18 7242 1.3 5/2007 Text County govt site list
7243 0.5 5/2007 Image County govt site list
7244 .002 5/2007 Audio County govt site list
7245 0.7 5/2007 Other County govt site list
City and town sites 279GB
WB6 7236 2.9 5/2007 Text City govt site list
7237 1.1 5/2007 Image City govt site list
7240 .002 5/2007 Audio City govt site list
7241 1.3 5/2007 Other City govt site list
(Updated sites to be crawled here.)
State sites 296GB
WB11 7273 2.3 10/2007 Text State govt site list
7275 0.7 10/2007 Image State govt site list
7276 .01 10/2007 Audio State govt site list
7277 1.6 10/2007 Other State govt site list
County sites 143GB
WB18 7278 1.4 10/2007 Text County govt site list
7280 0.4 10/2007 Image County govt site list
7281 .002 10/2007 Audio County govt site list
7282 0.8 10/2007 Other County govt site list
City and town sites 301GB
WB6 7283 3.1 10/2007 Text City govt site list
7285 1.1 10/2007 Image City govt site list
7286 .003 10/2007 Audio City govt site list
7287 1.5 10/2007 Other City govt site list
2008
State sites 309GB
WB19 7436 2.2 5/2008 Text State govt site list
7437 0.7 5/2008 Image State govt site list
7438 .008 5/2008 Audio State govt site list
7439 1.6 5/2008 Other State govt site list
County sites 153GB
WB19 7456 1.4 5/2008 Text County govt site list
7457 0.4 5/2008 Image County govt site list
7458 .002 5/2008 Audio County govt site list
7459 0.8 5/2008 Other County govt site list
City and town sites 327GB
WB6 7469 3.1 5/2008 Text City govt site list
7470 1.5 5/2008 Image City govt site list
7471 .004 5/2008 Audio City govt site list
7472 1.3 5/2008 Other City govt site list
State sites 298GB
WB20 7544 2.2 11/2008 Text State govt site list
7545 0.7 11/2008 Image State govt site list
7546 .008 11/2008 Audio State govt site list
7547 1.6 11/2008 Other State govt site list
County sites 167GB
WB1 7548 1.4 11/2008 Text County govt site list
7549 0.4 11/2008 Image County govt site list
7550 .002 11/2008 Audio County govt site list
7551 0.8 11/2008 Other County govt site list
City and town sites 324GB
WB1 7552 3.2 11/2008 Text City govt site list
7553 1.0 11/2008 Image City govt site list
7554 .004 11/2008 Audio City govt site list
7555 1.5 11/2008 Other City govt site list
Host Port Million pgs Date Mimetype Type of web crawl
2009
State sites 297GB
WB2 7599 2.1 05/2009 Text State govt site list
7615 0.6 05/2009 Image State govt site list
7616 .01 05/2009 Audio State govt site list
7618 1.6 05/2009 Other State govt site list
County sites 163GB
WB2 7619 1.4 05/2009 Text County govt site list
7620 0.4 05/2009 Image County govt site list
7623 .002 05/2009 Audio County govt site list
7625 0.8 05/2009 Other County govt site list
City and town sites 304GB
WB2 7626 3.0 05/2009 Text City govt site list
7628 0.9 05/2009 Image City govt site list
7629 3k 05/2009 Audio City govt site list
7630 1.4 05/2009 Other City govt site list
Curated a new list of state sites here by feeding back a crawl and looking for links to new state sites.
State sites 332GB
WB15 7679 2.1 11/2009 Text State govt site list
7680 0.5 11/2009 Image State govt site list
7681 .01 11/2009 Audio State govt site list
7682 1.6 11/2009 Other State govt site list
County sites 163GB
WB19 7673 1.3 10/2009 Text County govt site list
7675 0.4 10/2009 Image County govt site list
7676 .002 10/2009 Audio County govt site list
7677 0.8 10/2009 Other County govt site list
City and town sites 322GB
WB20 7683 3.0 11/2009 Text City govt site list
7685 1.0 11/2009 Image City govt site list
7686 4k 11/2009 Audio City govt site list
7687 1.5 11/2009 Other City govt site list
State sites 358GB
WB2 7740 2.0 5/2010 Text State govt site list
7741 0.5 Image site list
7742 .01 Audio site list
7743 1.7 Other site list
County sites 201GB
WB3 7744 1.3 5/2010 Text County govt site list
7745 0.4 Image site list
7746 .002 Audio site list
7747 0.8 Other site list
City and town sites 344GB
WB15 7748 3.3 5/2010 Text City govt site list
7749 0.9 Image site list
7750 4k Audio site list
7751 1.5 Other site list
State sites 325GB
WB20 7763 1.6 11/2010 Text State govt site list
7764 0.4 Image site list
7765 11k Audio site list
7766 1.4 Other site list
County sites 168GB
WB12 7767 1.3 11/2010 Text County govt site list
7768 0.3 Image site list
7769 3k Audio site list
7770 0.8 Other site list
Updated site list for cities from http://www.statelocalgov.net/
City and town sites 376GB
WB18 7771 3.8 11/2010 Text City govt site list
7772 0.9 Image site list
7773 3.6k Audio site list
7774 1.5 Other site list
Updated site lists here for county and state from http://www.statelocalgov.net/
State sites 363GB
WB5 7428 1.8 5/2011 Text State govt site list
7429 0.5 Image site list
7430 12k Audio site list
7431 1.7 Other site list
County sites 195GB
WB3 7434 1.3 5/2011 Text County govt site list
7435 0.3 Image site list
7441 3k Audio site list
7442 0.8 Other site list
City and town sites 400GB
WB5 7444 4.1 5/2011 Text City govt site list
7445 0.9 Image site list
7446 5k Audio site list
7447 1.7 Other site list
State 348GB
WB15 7319 1.7 11/2011 Text State govt site list
7362 0.5 Image site list
7817 13k Audio site list
7832 1.6 Other site list
County 195GB
WB19 7913 1.3 11/2011 Text County govt site list
7915 0.3 Image site list
7916 4k Audio site list
7941 0.9 Other site list
City and town 410GB
WB21 7942 3.8 11/2011 Text City govt site list
7943 0.9 Image site list
7945 5k Audio site list
7946 1.7 Other site list
Host Port Million pgs Date Mimetype Type of web crawl
WB1 7094 .05 10/10/03 " California recall site list
WB1 7095 .05 11/04/03 " California recall site list
WB1 7096 .05 12/12/03 " California recall site list
2004 American Elections
Available via Wibbi
California 2005 Special Election
Also good for researching non-hurricane press coverage on consecutive days, for instance doing sociological analysis or topical analysis.
We do not filter by topic, though papers are only Gulf Coast regional press.
Newspaper crawls contain many archival stories and duplicates.

Make sure the library path includes W3C's libwww.
This library must be installed by a system administrator with root privileges.
Make sure the environment variable WEBBASE points to WebBase:
setenv WEBBASE [absolute path]/WebBase
(1) Run GNU make:
WebBase/> ./configure
WebBase/> make client
If you get:
handlers/extract-hosts.h:27:21: WWWCore.h: No such file or directory
handlers/extract-hosts.h:28:21: HTParse.h: No such file or directory
Your include path may be wrong. We expect WWWCore.h to be in /usr/local/include/w3c-libwww/WWWCore.h,
so you may need to change this path in Makefile.in and configure.
To use later gcc versions:
(Order MAY matter)
Rerun ./configure.
Now try the network version:
Method 1:
Run scripts/distribrequestor.pl to start a distributor
(either chmod +x scripts/*.pl or invoke it with "perl").
args: (must be in this order)
# host
# port
# num pages
# starting web site (optional) e.g. www.ibm.com
# ending web site (optional)
# offset in bytes within web site (optional)
[example run:]
WebBase/scripts> distribrequestor.pl wb1 7008 100
distrib daemon returned 171.64.75.151 7160
WebBase/scripts>
Now you can invoke RunHandlers with the above info:
WebBase/scripts> ../bin/RunHandlers ../inputs/webbase.conf "net://171.64.75.151:7160/?numPages=100"
will print back 100 sample pages. All instances of RunHandlers connected to [...]
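The hand-off from distribrequestor.pl to RunHandlers can be scripted. The following is a minimal Python sketch, assuming the daemon's reply always has the form "distrib daemon returned <ip> <port>" as in the example run. The function names are hypothetical and are not part of the WebBase distribution.

```python
import re


def parse_distrib_reply(line):
    """Extract (ip, port) from a 'distrib daemon returned <ip> <port>' line.

    The reply format is assumed from the single example run in this document.
    """
    m = re.search(r"distrib daemon returned\s+(\S+)\s+(\d+)", line)
    if m is None:
        raise ValueError("unrecognized distributor reply: %r" % line)
    return m.group(1), int(m.group(2))


def net_url(ip, port, num_pages):
    """Build the net:// argument that RunHandlers takes in the example."""
    return "net://%s:%d/?numPages=%d" % (ip, port, num_pages)


if __name__ == "__main__":
    ip, port = parse_distrib_reply("distrib daemon returned 171.64.75.151 7160")
    print(net_url(ip, port, 100))
```

The resulting string is what would be passed, quoted, as the second argument to ../bin/RunHandlers.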
Method 2:
You can also use our one-step script getpages.pl (no need to specify a first site)
(either chmod +x scripts/*.pl or invoke it with "perl").
args: (must be in this order)
# num pages
# host
# port
# starting web site (optional) e.g. www.ibm.com
# ending web site (optional)
# offset in bytes within web site (optional)
[example run:]
WebBase/scripts> getpages.pl 2 wb1 7008 www.ibm.com www.ibm.com (only give me www.ibm.com)
Starting getpages.pl using Perl 5.6.0
To get all of the page, set CAT_ON = 1 in the inputs/*.conf.
If you get the ERROR:
bin/RunHandlers: error while loading shared libraries: libwwwcore.so.0:
cannot open shared object file: No such file or directory
you don't have your paths set right.
Try setting a variable called LD_LIBRARY_PATH in the shell where you're about to run the
WebBase client. For example, if you found your libwwwcore.so at
/opt/somewhere/lib/libwwwcore.so, then you could tell your system:
setenv LD_LIBRARY_PATH /opt/somewhere/lib
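Before launching the client it can help to confirm that the library really is in one of the directories you listed. This is a rough Python sketch; the helper name is hypothetical, and it only inspects the directories you pass in, whereas the dynamic loader also consults system default locations.

```python
import glob
import os


def find_library(prefix, search_path):
    """Return files whose names start with `prefix` in a colon-separated path.

    `search_path` has the same shape as LD_LIBRARY_PATH, e.g. "/opt/somewhere/lib".
    """
    hits = []
    for directory in search_path.split(os.pathsep):
        if directory:
            pattern = os.path.join(directory, prefix + "*")
            hits.extend(sorted(glob.glob(pattern)))
    return hits


if __name__ == "__main__":
    path = os.environ.get("LD_LIBRARY_PATH", "")
    matches = find_library("libwwwcore.so", path)
    print(matches if matches else "libwwwcore.so not found on LD_LIBRARY_PATH")
```

An empty result suggests that LD_LIBRARY_PATH still needs to be set as described above.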
Return codes:
(contact us to report these)
A blank page means there is no server running on that port.
If you get a line of just numbers and not much else:
256 means I have a distributor running on a server with no data or a dangling softlink
32512 is usually a missing softlink on the server
( fix is ln -s /u/gary/WebBase.centos/bin/runhandlers /lfs/1/tmp/webbase/runhandlers )
or it is missing shared libraries: libwwwutils.so.0
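The two numbers above look like raw wait statuses of the kind Perl's system() returns, where the exit code sits in the high byte. That reading is an assumption, not something the WebBase scripts document. Under it, a short Python sketch decodes them:

```python
def decode_wait_status(raw):
    """Split a raw 16-bit wait status into (exit_code, signal_number).

    Assumes the classic Unix layout: exit code in the high byte,
    terminating signal in the low 7 bits.
    """
    exit_code = (raw >> 8) & 0xFF
    signal_number = raw & 0x7F
    return exit_code, signal_number


if __name__ == "__main__":
    for raw in (256, 32512):
        code, sig = decode_wait_status(raw)
        print("%d -> exit code %d, signal %d" % (raw, code, sig))
```

Read this way, 256 is exit code 1 and 32512 is exit code 127. Exit code 127 is what shells conventionally report when a command cannot be found, which would fit the missing softlink described above.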
Note on the output:
This next line is just a separator, so that RunHandlers knows it is getting a new page:
==P=>>>>=i===<<<<=T===>=A===<=!Jung[...] -- page separator
URL: http://www.powa.org/ -- page URL
Date: June 3, 2004 -- when crawled
Position: 695 -- bytes into the site so far
DocId: 1 -- sequential page id within site
HTTP/1.1 200 OK -- response to our http request
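If you post-process the stream yourself, the per-page header can be read with a few lines of code. This Python sketch assumes each field arrives on a single line in "Name: value" form, followed by the HTTP status line, as in the sample above. The function name is hypothetical, and the "-- ..." annotations are this document's commentary, not part of the stream.

```python
def parse_page_header(text):
    """Parse the URL/Date/Position/DocId header that precedes each page.

    Stops at the HTTP status line. Lines without a known field name,
    such as the separator line, are ignored.
    """
    fields = {}
    for line in text.splitlines():
        if line.startswith("HTTP/"):
            fields["status"] = line.strip()
            break
        key, sep, value = line.partition(":")
        if sep and key in ("URL", "Date", "Position", "DocId"):
            fields[key] = value.strip()
    # Position and DocId are numeric in the sample output.
    for key in ("Position", "DocId"):
        if key in fields:
            fields[key] = int(fields[key])
    return fields


if __name__ == "__main__":
    sample = (
        "URL: http://www.powa.org/\n"
        "Date: June 3, 2004\n"
        "Position: 695\n"
        "DocId: 1\n"
        "HTTP/1.1 200 OK\n"
    )
    print(parse_page_header(sample))
```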
Death threat:
If a distributor is inactive for a while, it may be killed by us so that we can reuse the resources.
To restart at the same point you must start a new distributor at the offset where it left off
(+ 1 to prevent getting the previous page again).
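The restart rule can be turned into an argument list for getpages.pl. This Python sketch assumes that "where it left off" is the Position value of the last page you received, which is an interpretation rather than something stated here. The function name is hypothetical.

```python
def restart_args(num_pages, host, port, first_site, last_site, last_position):
    """Build getpages.pl arguments, in the documented order, for a restart.

    Order: num pages, host, port, starting site, ending site, byte offset.
    """
    offset = last_position + 1  # + 1 so the previous page is not sent again
    return [str(num_pages), host, str(port), first_site, last_site, str(offset)]


if __name__ == "__main__":
    args = restart_args(100, "wb1", 7008, "www.ibm.com", "www.ibm.com", 695)
    print("getpages.pl " + " ".join(args))
```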
Putting out a contract:
If you are done, you can run distribrelease.pl [remote-host] [host port] [stream port]
from the same machine you requested on. We will immediately kill the distributor for you.
We especially recommend this if you are running many requests in 1 day so that we do not run out of resources.
If you specify firstSite/lastSite, please note that you can only use the root
(e.g. www.ibm.com), not a page within the site (e.g. 01net.com/envoyerArticle/1),
and don't include the http:// part.
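If your site names come from a list of full URLs, they can be reduced to the root form first. This Python sketch uses the standard urllib.parse module; the function name is hypothetical.

```python
from urllib.parse import urlsplit


def site_root(url_or_host):
    """Reduce a URL or host/path string to the bare host name.

    Drops the scheme (http://), any port, and any path within the site.
    """
    text = url_or_host.strip()
    if "://" not in text:
        # Without a scheme, urlsplit needs a leading // to see a host.
        text = "//" + text
    return urlsplit(text).hostname


if __name__ == "__main__":
    print(site_root("http://www.ibm.com/"))
    print(site_root("01net.com/envoyerArticle/1"))
```

The results are suitable as the starting and ending web site arguments.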
-------------------------------------------------------------------
To create a new webpage stream handler:
You can use the other handlers in the distribution as templates.
To add a new handler, add the following to the appropriate places:
* 1) #include "myhandler.h" into handlers/all_handlers.h
* 2) handler.push_back(new MyHandler()); into handlers/all_handlers.h
(following the template of the handlers already there)
* 3) in Makefile, add entries for your segments to compile in the line: HANDLER_OBJS = jhandler.o [...]
* opt) in Makefile, customize your build if necessary by adding a line
jhandler_CXXFLAGS = -Iyour-include-dir --your-switches [...]
(following the template of the handlers already there)
We also have a one-button script called scripts/addHandler.pl that will
prompt you for all your pieces and put them in place, without you having
to do the above file surgery yourself.
WebVac - the WebBase web crawler or spider. Used to be called Pita.
RunHandlers - (formerly "process") an executable that indexes a stream, file or repository.
Made up basically of a feeder and one or more handlers.
handler - the interface that any index-building piece of code must implement.
The interface's main (only) method will provide a page and associated metadata,
and the implementor of the method can do whatever they want with it.
feeder - the interface for receiving a page stream from any kind of source
(directly from the repository, via Webcat, via network, etc.). The
key method of the interface is "next", which advances the stream by one
page. After calling next, various other methods can be used to get the
associated metadata for the current page in the stream. Can also be used
to build indexes if the index-building code is written to process page streams.
distributor - a program that disseminates pages to multiple clients
over the network, supporting session IDs, etc. A generalization
of what Distributor.cc in Text-index/ does.
offset - used in distributor requests to specify how many bytes from the
beginning of the site to start at.
DocId - computed within the download.
If you download any portion of the crawl, even from the middle, it will begin with 0.
If you download all the crawl, it will be monotonically increasing from start to end.