If you decide to use the data, please email Gary for our funding.
We would also appreciate knowing of any papers that come out of your usage.
General Crawls
WB9 7003 119 343GB 1/2001 Text general crawl site list
WB9 7005 44 152GB 3/2002 Text general crawl site list (use 2002getpages.pl)
WB1 7006 96 406GB 6/2003 Text general crawl site list
WB1 7008 96 423GB 8/2003 Text general crawl site list
WB1 7010 102 451GB 10/2003 Text general crawl site list
526GB WB5 7012 36 12/2003 Text general crawl site list
    7032 14 12/2003 Image general crawl site list
WB1 7103 95 450GB 3/2004 Text general crawl site list
WB1 7114 6 447GB 4/2004 Image general crawl site list
457GB WB2 7107 11.5 7/2004 Text general crawl site list
    7117 4.2 7/2004 Image general crawl site list
    7127 0.02 7/2004 Audio general crawl site list
    7137 2.3 7/2004 Other general crawl site list
WB23 7108 72 363GB 8/2004 Text general crawl site list
474GB WB22 7109 36 9/2004 Text general crawl site list
    7119 7 9/2004 Image general crawl site list
WB4 7190 105 495GB 10/2004 Text general crawl site list
1561GB [by special arrangement] 7192 37 12/2004 Text general crawl site list
    7193 14 12/2004 Image general crawl site list
    7194 0.08 12/2004 Audio general crawl site list
    7195 7.7 12/2004 Other general crawl site list
Host Port Million pgs Date Mimetype Type of web crawl
980GB WB23 7601 27 1/2005 Text general crawl site list
    7611 6 1/2005 Image general crawl site list
    7621 0.04 1/2005 Audio general crawl site list
    7631 3.5 1/2005 Other general crawl site list
This next, deeper crawl was done with a pagemax of 20k per site instead of the usual 10k:
WB1 7603 85 440GB 3/2005 Text general crawl site list
WB18 7489 0.48 192GB 3-5/2005 Audio general audio site list
WB3 7604 98 480GB 4/2005 Text general crawl site list
WB3 7605 79 460GB 5/2005 Text general crawl site list
WB8 7606 101 503GB 6/2005 Text general crawl site list
487GB WB16 7658 9.5 8/2005 Text general crawl site list
    7668 3.4 Image site list
    7658 .02 Audio site list
    7678 2 Other site list
WB3 7609 97 490GB 9/2005 Text general crawl site list
WB3 7610 97 508GB 10/2005 Text general crawl site list
WB18 7691 93 527GB 11/2005 Text general crawl site list
945GB WB1 7612 20.7 12/2005 Text general crawl site list
    7622 7 12/2005 Image general crawl site list
    7632 0.04 12/2005 Audio general crawl site list
    7642 4.5 12/2005 Other general crawl site list
Host Port Million pgs Date Mimetype Type of web crawl
WB1 7701 98 515GB 1/2006 Text general crawl site list
WB19 7702 93 490GB 2/2006 Text general crawl site list
WB15 7703 95 497GB 3/2006 Text general crawl site list
WB17 7704 92 493GB 4/2006 Text general crawl site list
WB17 7705 93 499GB 5/2006 Text general crawl site list
WB5 7706 90 497GB 6/2006 Text general crawl site list
WB19 7707 92 501GB 7/2006 Text general crawl site list
WB5 7708 93 515GB 8/2006 Text general crawl site list
WB8 7709 90 502GB 9/2006 Text general crawl site list
WB15 7710 90 497GB 10/2006 Text general crawl site list
WB22 7730 10 353GB 10-11/2006 Image general crawl site list
WB16 7711 90 506GB 11/2006 Text general crawl site list
WB1 7712 90 511GB 12/2006 Text general crawl site list
WB1 7713 0.5 222GB 12/2006 Audio general crawl site list
Host Port Million pgs Date Mimetype Type of web crawl
WB2 7118 87 502GB 1/2007 Text general crawl site list
WB11 7106 103 590GB 2/2007 Text general crawl site list
WB23 7161 102 578GB 3/2007 Text general crawl site list
WB5 7163 100 578GB 4/2007 Text general crawl site list
WB15 7239 98 573GB 5/2007 Text general crawl site list
WB4 7260 98 590GB 6/2007 Text general crawl site list
WB15 7579 86 525GB 7/2007 Text general crawl site list
WB9 7262 87 514GB 8/2007 Text general crawl site list
WB11 7266 79 486GB 9/2007 Text general crawl site list
WB3 7272 80 492GB 10/2007 Text general crawl site list
WB19 7289 80 497GB 11/2007 Text general crawl site list
WB4 7291 79 494GB 12/2007 Text general crawl site list
Host Port Million pgs Date Mimetype Type of web crawl
WB20 7320 81 498GB 1/2008 Text general crawl site list
WB21 7298 79 496GB 2/2008 Text general crawl site list
WB20 7299 80 507GB 3/2008 Text general crawl site list
WB11 7301 68 439GB 4/2008 Text general crawl site list
WB22 7476 66 500GB 5/2008 Text general crawl site list
WB6 7482 67 485GB 6/2008 Text general crawl site list
WB7 7483 75 498GB 7/2008 Text general crawl site list
WB2 7495 77 516GB 8/2008 Text general crawl site list
WB21 7518 76 522GB 9/2008 Text general crawl site list
WB22 7220 77 526GB 10/2008 Text general crawl site list
WB2 7571 77 522GB 11/2008 Text general crawl site list
WB7 7572 61 430GB 12/2008 Text general crawl site list
Host Port Million pgs Date Mimetype Type of web crawl
WB23 7565 75 525GB 1/2009 Text general crawl site list
WB23 7587 75 515GB 2/2009 Text general crawl site list
WB1 7300 .4 2GB 11/2004 Text US CS site list
7.6GB WB1 7440 .14 1/2005 Text U Cal@Berkeley site list
    7641 .07 1/2005 Image U Cal@Berkeley site list
    7492 .0001 1/2005 Audio U Cal@Berkeley site list
    7443 .02 1/2005 Other U Cal@Berkeley site list
3GB WB3 7060 .040 1.5GB 6/2005 Text Stanford University site list
    7061 .038 125MB 6/2005 Image Stanford University site list
    7062 60pgs 6/2005 Audio Stanford University site list
    7063 .011 1.4GB 6/2005 Other Stanford University site list
62GB WB11 7688 0.9 10/2009 Text Stanford University site list
    7689 .03 10/2009 Image Stanford University site list
    7690 .001 10/2009 Audio Stanford University site list
    7692 .2 10/2009 Other Stanford University site list
Government
US Government
.mil is in the general crawl
Host Port Million pgs Date Mimetype Type of web crawl
213GB WB4 7567 4.3 7/2003 Text US Government site list
270GB WB3 7506 3.4 6/2004 Text US Government site list
    7516 1.6 6/2004 Image site list
    7516 .003 6/2004 Audio site list
    7536 1.2 6/2004 Other site list
274GB [by request] 7508 3.2 8/2004 Text US Government site list
    7518 1.7 8/2004 Image site list
    7538 1.2 8/2004 Other site list
259GB WB1 7509 2.8 9/2004 Text US Government site list
    7519 1.5 9/2004 Image site list
    7529 .006 9/2004 Audio site list
    7539 1.1 9/2004 Other site list
274GB [by request] 7570 2.9 10/2004 Text US Govt early Oct site list
    7580 1.5 10/2004 Image site list
    7590 2.2 10/2004 Other site list
280GB [by request] 7573 3.0 10/2004 Text US Govt, very late Oct site list
    7583 1.6 10/2004 Image site list
    7563 0.004 10/2004 Audio site list
    7593 1.2 10/2004 Other site list
283GB WB8 7511 3.0 11/2004 Text US Govt+election, early Nov site list
    7521 1.6 11/2004 Image site list
    7531 .004 11/2004 Audio site list
    7541 1.3 11/2004 Other site list
277GB [by request] 7512 2.9 12/2004 Text US Government site list
    7522 1.5 12/2004 Image site list
    7532 .004 12/2004 Audio site list
    7542 1.2 12/2004 Other site list
Host Port Million pgs Date Mimetype Type of web crawl
2005
274GB [upon request] 3.0 1/2005 Text US Government, January site list
    7781 1.5 1/2005 Image site list
    7791 .004 1/2005 Audio site list
    7792 1.2 1/2005 Other site list
483GB WB3 7644 2.5 4/2005 Text US Govt .gov + election site list
    7614 1.3 4/2005 Image
    7624 .003 4/2005 Audio
    7634 1.1 4/2005 Other
Next 3: 20,000/site max on .gov only
363GB WB18 7607 4.0 6-7/2005 Text US .gov site list
    7617 2.0 6-7/2005 Image US .gov site list
    7627 .004 6-7/2005 Audio US .gov site list
    7637 1.7 6-7/2005 Other US .gov site list
(updated site list from LOC)
336GB WB4 7799 3.3 9/2005 Text US .gov site list
    7719 1.1 9/2005 Image US .gov site list
    7729 9/2005 Audio US .gov site list
    7739 1.4 9/2005 Other US .gov site list
233GB WB8 8012 2.2 12/2005 Text US .gov site list
    8022 1.1 12/2005 Image US .gov site list
    8032 0.004 12/2005 Audio US .gov site list
    8042 1.0 12/2005 Other US .gov site list
From here on we crawl up to 150,000 pages per .gov site to a depth of 12, quarterly.
For the crawls below, we have removed the ca.gov sites, which are state sites for California.
The ca.gov data is about 100GB for each crawl and can be made available upon request. These sites are also in the state crawls.
2006
484GB WB2 8001 5.6 3/2006 Text US .gov site list
    8011 2.8 3/2006 Image US .gov site list
    8021 0.007 3/2006 Audio US .gov site list
    8031 2.2 3/2006 Other US .gov site list
658GB WB21 8041 6.1 6-7/2006 Text US .gov site list
    8051 3.3 6-7/2006 Image US .gov site list
    8052 0.01 6-7/2006 Audio US .gov site list
    8053 2.3 6-7/2006 Other US .gov site list
726GB WB1 7100 6.6 9-10/2006 Text US .gov site list
    7101 3.3 Image site list
    7102 0.01 Audio site list
    7104 2.9 Other site list
609GB WB9 7149 7.0 12/2006 Text US .gov site list
    7150 3.0 Image site list
    7151 0.01 Audio site list
    7152 3.0 Other site list
2007
681GB WB2 7157 8.1 3/2007 Text US .gov site list
    7158 3.4 Image site list
    7159 0.01 Audio site list
    7160 3.1 Other site list
(Updated our list of sites here.)
613GB WB15 7255 7.0 6/2007 Text US .gov site list
    7256 3.0 Image site list
    7257 0.01 Audio site list
    7258 2.8 Other site list
(California ca.gov is not crawled from here on except as part of the state crawls)
636GB WB[fixing] 7267 5.5 9/2007 Text US .gov site list
    7268 2.7 Image site list
    7269 0.01 Audio site list
    7270 2.4 Other site list
629 GB WB23 7292 5.4 12/2007 Text US .gov site list
    7293 2.5 Image site list
    7295 0.01 Audio site list
    7296 2.3 Other site list
2008
654 GB WB7 7369 7.4 3/2008 Text US .gov site list
    7370 3.4 Image site list
    7371 0.01 Audio site list
    7372 3.0 Other site list
650 GB WB6 7484 5.5 6/2008 Text US .gov site list
    7485 2.5 Image site list
    7488 0.01 Audio site list
    7490 2.7 Other site list
755 GB WB2 7510 8.0 9/2008 Text US .gov site list
    7513 3.2 Image site list
    7514 0.01 Audio site list
    7515 3.3 Other site list
762 GB WB6 7574 7.3 12/2008 Text US .gov site list
    7575 3.2 Image site list
    7576 0.01 Audio site list
    7577 3.2 Other site list
2009
880 GB WB11 7645 8.0 03/2009 Text US .gov site list
    7512 3.6 Image site list
    7522 0.01 Audio site list
    7532 3.4 Other site list
826 GB WB14 7646 7.1 06/2009 Text US .gov site list
    7647 3.3 Image site list
    7648 0.01 Audio site list
    7649 3.0 Other site list
Updated site list from: http://www.lib.lsu.edu/gov/index.html, 230 new sites.
Also got 2682 new sites by feeding back a crawl and looking for links to federal sites not already in list.
NASA to a depth of 6 instead of 12.
USGS and NOAA limited to 8 levels.
1600 GB WB1 7663 9.9 09/2009 Text US .gov site list
    7665 4.7 Image site list
    7667 0.02 Audio site list
    7670 4.7 Other site list
Even tighter page limits on NASA, USGS, NOAA below
State, County and Local Governments
Host Port Million pgs Date Mimetype Type of web crawl
State
211GB WB11 7204 2.3 5/2005 Text State govt site list
    7214 0.7 5/2005 Image State govt site list
    7224 .005 5/2005 Audio State govt site list
    7234 1.4 5/2005 Other State govt site list
County
90GB WB8 7264 1.2 5/2005 Text County govt site list
    7274 0.5 5/2005 Image County govt site list
    7284 .060 5/2005 Audio County govt site list
    7294 0.5 5/2005 Other County govt site list
City and town
188GB WB6 7664 2.5 5/2005 Text City govt site list
    7674 1.2 5/2005 Image City govt site list
    7684 .001 5/2005 Audio City govt site list
    7694 1.0 5/2005 Other City govt site list
Post Katrina crawl
State
217GB WB2 7465 2.1 9/2005 Text State govt site list
    7466 0.7 9/2005 Image State govt site list
    7467 .060 9/2005 Audio State govt site list
    7468 1.3 9/2005 Other State govt site list
State
280GB WB17 7365 2.0 4/2006 Text State govt site list
    7366 0.7 4/2006 Image State govt site list
    7367 .006 4/2006 Audio State govt site list
    7368 1.3 4/2006 Other State govt site list
County
115GB WB1 7364 1.2 4/2006 Text County govt site list
    7374 0.4 4/2006 Image County govt site list
    7384 .002 4/2006 Audio County govt site list
    7394 0.6 4/2006 Other County govt site list
City and town
237GB WB16 7165 2.7 4/2006 Text City govt site list
    7175 1.1 4/2006 Image City govt site list
    7185 0.001 4/2006 Audio City govt site list
    7186 1.2 4/2006 Other City govt site list
State
251GB WB15 7395 2.4 9/2006 Text State govt site list
    7966 0.7 9/2006 Image State govt site list
    7367 .008 9/2006 Audio State govt site list
    7968 1.5 9/2006 Other State govt site list
County
126GB WB1 7964 1.2 9/2006 Text County govt site list
    7974 0.4 9/2006 Image County govt site list
    7987 .002 9/2006 Audio County govt site list
    7407 0.7 9/2006 Other County govt site list
County
129GB WB1 7141 1.3 12/2006 Text County govt site list
    7142 0.5 12/2006 Image County govt site list
    7143 .002 12/2006 Audio County govt site list
    7144 0.7 12/2006 Other County govt site list
State sites
260GB WB22 7246 2.4 5/2007 Text State govt site list
    7247 0.7 5/2007 Image State govt site list
    7248 .008 5/2007 Audio State govt site list
    7249 1.5 5/2007 Other State govt site list
County sites
140GB WB10 7242 1.3 5/2007 Text County govt site list
    7243 0.5 5/2007 Image County govt site list
    7244 .002 5/2007 Audio County govt site list
    7245 0.7 5/2007 Other County govt site list
City and town sites
279GB WB11 7236 2.9 5/2007 Text City govt site list
    7237 1.1 5/2007 Image City govt site list
    7240 .002 5/2007 Audio City govt site list
    7241 1.3 5/2007 Other City govt site list
(Updated sites to be crawled here. )
State sites
296 GB WB8 7273 2.3 10/2007 Text State govt site list
    7275 0.7 10/2007 Image State govt site list
    7276 .01 10/2007 Audio State govt site list
    7277 1.6 10/2007 Other State govt site list
County sites
143 GB WB4 7278 1.4 10/2007 Text County govt site list
    7280 0.4 10/2007 Image County govt site list
    7281 .002 10/2007 Audio County govt site list
    7282 0.8 10/2007 Other County govt site list
City and town sites
301 GB WB7 7283 3.1 10/2007 Text City govt site list
    7285 1.1 10/2007 Image City govt site list
    7286 .003 10/2007 Audio City govt site list
    7287 1.5 10/2007 Other City govt site list
2008
State sites
309GB WB11 7436 2.2 5/2008 Text State govt site list
    7437 0.7 5/2008 Image State govt site list
    7438 .008 5/2008 Audio State govt site list
    7439 1.6 5/2008 Other State govt site list
County sites
153GB WB22 7456 1.4 5/2008 Text County govt site list
    7457 0.4 5/2008 Image County govt site list
    7458 .002 5/2008 Audio County govt site list
    7459 0.8 5/2008 Other County govt site list
City and town sites
327GB WB19 7469 3.1 5/2008 Text City govt site list
    7470 1.5 5/2008 Image City govt site list
    7471 .004 5/2008 Audio City govt site list
    7472 1.3 5/2008 Other City govt site list
State sites
298GB WB8 7544 2.2 11/2008 Text State govt site list
    7545 0.7 11/2008 Image State govt site list
    7546 .008 11/2008 Audio State govt site list
    7547 1.6 11/2008 Other State govt site list
County sites
167GB WB11 7548 1.4 11/2008 Text County govt site list
    7549 0.4 11/2008 Image County govt site list
    7550 .002 11/2008 Audio County govt site list
    7551 0.8 11/2008 Other County govt site list
City and town sites
324GB WB1 7552 3.2 11/2008 Text City govt site list
    7553 1.0 11/2008 Image City govt site list
    7554 .004 11/2008 Audio City govt site list
    7555 1.5 11/2008 Other City govt site list
Host Port Million pgs Date Mimetype Type of web crawl
2009
State sites
297GB WB2 7599 2.1 05/2009 Text State govt site list
    7615 0.6 05/2009 Image State govt site list
    7616 .01 05/2009 Audio State govt site list
    7618 1.6 05/2009 Other State govt site list
County sites
163GB WB2 7619 1.4 05/2009 Text County govt site list
    7620 0.4 05/2009 Image County govt site list
    7623 .002 05/2009 Audio County govt site list
    7625 0.8 05/2009 Other County govt site list
City and town sites
304GB WB2 7626 3.0 05/2009 Text City govt site list
    7628 0.9 05/2009 Image City govt site list
    7629 3k 05/2009 Audio City govt site list
    7630 1.4 05/2009 Other City govt site list
Curated a new list of state sites here by feeding back a crawl and looking for links to state sites not already in the list.
State sites
332GB WB13 7679 2.1 11/2009 Text State govt site list
    7680 0.5 11/2009 Image State govt site list
    7681 .01 11/2009 Audio State govt site list
    7682 1.6 11/2009 Other State govt site list
County sites
163GB WB11 7673 1.3 10/2009 Text County govt site list
    7675 0.4 10/2009 Image County govt site list
    7676 .002 10/2009 Audio County govt site list
    7677 0.8 10/2009 Other County govt site list
City and town sites
322GB WB14 7683 3.0 11/2009 Text City govt site list
    7685 1.0 11/2009 Image City govt site list
    7686 4k 11/2009 Audio City govt site list
    7687 1.5 11/2009 Other City govt site list
Host Port Million pgs Date Mimetype Type of web crawl
California 2003 Governor Recall
WB1 7081 .006 9/26/03 All California recall site list
WB1 7082 .008 9/27/03 " California recall site list
WB1 7083 .2 5GB 9/29/03 " California circus w/county gov site list site list
WB1 7084 .05 1.3GB 9/30/03 " California recall site list
WB1 7085 .05 10/1/03 " California recall site list
WB1 7086 .05 10/2/03 " California recall site list
WB1 7087 .05 10/3/03 " California recall site list
WB1 7088 .05 10/4/03 " California recall site list
WB1 7089 .05 10/5/03 " California recall site list
WB1 7090 .05 10/6/03 " California recall site list
WB1 7091 .05 10/7/03 " California recall site list
WB1 7092 .05 1.3GB 10/8/03 " California recall site list
WB1 7094 .05 10/10/03 " California recall site list
WB1 7095 .05 11/04/03 " California recall site list
WB1 7096 .05 12/12/03 " California recall site list
2004 American Elections
Available via Wibbi
California 2005 Special Election
Also good for researching non-hurricane press coverage on consecutive days, for instance doing sociological analysis or topical analysis.
We do not filter by topic, though papers are only Gulf Coast regional press.
Newspaper crawls contain many archival stories and duplicates.

Make sure the library path includes W3C's libwww.
This library must be installed by a system administrator with root privileges.
Make sure environment variable WEBBASE points to WebBase:
setenv WEBBASE [absolute path]/WebBase
(1) Run GNU make:
WebBase/> ./configure
WebBase/> make client
If you get:
handlers/extract-hosts.h:27:21: WWWCore.h: No such file or directory
handlers/extract-hosts.h:28:21: HTParse.h: No such file or directory
your include path may be wrong.
We expect the header to be in /usr/local/include/w3c-libwww/WWWCore.h,
so you may need to change this in Makefile.in and configure.
(Order MAY matter)
Rerun ./configure.
To use later gcc versions:
Now try the network version:
Method 1:
Run scripts/distribrequestor.pl to start a distributor:
(either chmod +x scripts/*.pl or invoke it with "perl")
args: (must be in this order)
# host
# port
# num pages
# starting web site (optional) e.g. www.ibm.com
# ending web site (optional)
# offset in bytes within web site (optional)
[example run:]
WebBase/scripts> distribrequestor.pl wb1 7008 100
distrib daemon returned 171.64.75.151 7160
WebBase/scripts>
Now you can invoke RunHandlers with the above info:
WebBase/scripts> ../bin/RunHandlers ../inputs/webbase.conf "net://171.64.75.151:7160/?numPages=100"
will print back 100 sample pages. All instances of RunHandlers connected to
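The step from the distributor's reply to the RunHandlers argument can be sketched in Python. This is only an illustration: the helper name build_feeder_url is hypothetical, and it assumes the reply always has the form "distrib daemon returned <address> <port>" as in the example run.

```python
import re


def build_feeder_url(daemon_reply, num_pages):
    """Turn the distributor's reply line into a net:// URL for RunHandlers."""
    # Assumed reply format: "distrib daemon returned <address> <port>"
    match = re.search(r"distrib daemon returned\s+(\S+)\s+(\d+)", daemon_reply)
    if match is None:
        raise ValueError(f"unexpected distributor reply: {daemon_reply!r}")
    host, port = match.group(1), int(match.group(2))
    return f"net://{host}:{port}/?numPages={num_pages}"


url = build_feeder_url("distrib daemon returned 171.64.75.151 7160", 100)
print(url)
```

The resulting string is what gets passed, quoted, as the second argument to ../bin/RunHandlers.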
Method 2:
You can also use our one-step script getpages.pl
(no need to specify a first site)
(either chmod +x scripts/*.pl or invoke it with "perl")
args: (must be in this order)
# host
# port
# num pages
# starting web site (optional) e.g. www.ibm.com
# ending web site (optional)
# offset in bytes within web site (optional)
[example run:]
WebBase/scripts> getpages.pl 2 wb1 7008 www.ibm.com www.ibm.com (only give me www.ibm.com)
Starting getpages.pl using Perl 5.6.0
To get all of the page, set CAT_ON = 1 in the inputs/*.conf.
If you get the ERROR:
bin/RunHandlers: error while loading shared libraries: libwwwcore.so.0:
cannot open shared object file: No such file or directory
you don't have your paths set right.
Fix this by setting a variable called LD_LIBRARY_PATH in the shell where you're about to run the
WebBase client. For example, if you found your libwwwcore.so in
/opt/somewhere/lib/libwwwcore.so, then you could tell your system:
setenv LD_LIBRARY_PATH /opt/somewhere/lib
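Before rerunning the client, it can help to check whether any directory on LD_LIBRARY_PATH actually holds the library. The following Python sketch does only that; the function name find_shared_library is hypothetical, and it checks only LD_LIBRARY_PATH, not the system's default library directories.

```python
import os


def find_shared_library(name, search_path):
    """Return the first directory in a colon-separated path that contains name, else None."""
    for directory in search_path.split(":"):
        if directory and os.path.isfile(os.path.join(directory, name)):
            return directory
    return None


where = find_shared_library("libwwwcore.so.0", os.environ.get("LD_LIBRARY_PATH", ""))
print(where or "libwwwcore.so.0 not found on LD_LIBRARY_PATH")
```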
Return codes:
Contact us to report these:
A blank page means there is no server running on that port.
If you get a line of just numbers and not much else:
256 means I have a distributor running on a server with no data or a dangling softlink.
32512 is usually a missing softlink on the server
( fix is ln -s /u/gary/WebBase.centos/bin/runhandlers /lfs/1/tmp/webbase/runhandlers )
or it is missing shared libraries: libwwwutils.so.0
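The two numbers above look like raw 16-bit wait statuses of the kind Perl's system() returns, where the exit code sits in the high byte. That is an assumption, not something the scripts document. Under that assumption, a short Python sketch (decode_status is a hypothetical name) shows what they decode to:

```python
def decode_status(status):
    """Split a 16-bit wait status into (exit_code, signal_number)."""
    return status >> 8, status & 0x7F


for raw in (256, 32512):
    exit_code, signal_number = decode_status(raw)
    print(raw, "-> exit code", exit_code, "signal", signal_number)
```

Read this way, 256 is exit code 1 and 32512 is exit code 127, the code shells conventionally report when a command cannot be found or run, which would be consistent with a missing softlink or shared library.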
Note on the output:
This "junk" is just a separator, so that RunHandlers knows it is getting a new page:
==P=>>>>=i===<<<<=T===>=A===<=!Jung[...] -- page separator
URL: http://www.powa.org/ -- page URL
Date: June 3, 2004 -- when crawled
Position: 695 -- bytes into the site so far
DocId: 1 -- sequential page id within site
HTTP/1.1 200 OK -- response to our http request
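If you post-process the printed output yourself, the per-page metadata can be picked out with a few lines of Python. This is a sketch under assumptions: it takes each field to be a "Key: value" line of its own, as in the sample above without the "--" annotations, and parse_page_header is a hypothetical name.

```python
def parse_page_header(lines):
    """Collect the 'Key: value' metadata lines that precede the HTTP status line."""
    wanted = {"URL", "Date", "Position", "DocId"}
    header = {}
    for line in lines:
        if line.startswith("HTTP/"):
            break  # the stored HTTP response starts here
        key, sep, value = line.partition(":")  # split at the first colon only
        if sep and key.strip() in wanted:
            header[key.strip()] = value.strip()
    for field in ("Position", "DocId"):
        if field in header:
            header[field] = int(header[field])  # numeric fields become ints
    return header


sample = [
    "URL: http://www.powa.org/",
    "Date: June 3, 2004",
    "Position: 695",
    "DocId: 1",
    "HTTP/1.1 200 OK",
]
print(parse_page_header(sample))
```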
Death threat:
If a distributor is inactive for a while, it may be killed by us so that we can reuse the resources.
To restart at the same point you must start a new distributor at the offset where it left off
(+ 1 to prevent getting the previous page again).
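As a small worked example of the restart rule, assuming "the offset where it left off" is the last Position value you saw in the output (an assumption; restart_offset is a hypothetical helper):

```python
def restart_offset(last_position):
    """Offset to request from a new distributor so the last page is not sent again."""
    if last_position < 0:
        raise ValueError("position must be non-negative")
    return last_position + 1


# Last page received reported "Position: 695", so ask the new distributor for 696.
print(restart_offset(695))
```

The result is the value to pass as the optional offset argument when starting the new distributor.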
Putting out a contract:
If you are done, you can run
distribrelease.pl [remote-host] [host port] [stream port]
from the same machine you requested on. We will immediately kill the distributor for you.
We especially recommend this if you are running many requests in 1 day so that we do not run out of resources.
If you specify firstSite/lastSite, please note that you can only use the root
(e.g. www.ibm.com), not a page within the site (e.g. 01net.com/envoyerArticle/1),
and don't include the http:// part.
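One way to avoid passing a bad firstSite/lastSite value is to reduce whatever you have to the bare host name first. The Python sketch below uses the standard urllib.parse module; site_root is a hypothetical helper, not part of the WebBase scripts.

```python
from urllib.parse import urlsplit


def site_root(value):
    """Reduce a URL or host/path string to the bare host name."""
    text = value.strip()
    if "://" not in text:
        text = "//" + text  # lets urlsplit treat the leading part as the host
    host = urlsplit(text).netloc
    if not host:
        raise ValueError(f"no host name in {value!r}")
    return host


print(site_root("http://www.ibm.com/"))
print(site_root("01net.com/envoyerArticle/1"))
```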
-------------------------------------------------------------------
To create a new webpage stream handler:
You can use the other handlers in the distribution as templates.
To add a new handler, add the following to the appropriate places:
* 1) #include "myhandler.h" into handlers/all_handlers.h
* 2) handler.push_back(new MyHandler()); into handlers/all_handlers.h
(following the template of the handlers already there)
* 3) in Makefile, add entries for your segments to compile in the
line: HANDLER_OBJS = jhandler.o [...]
* opt) in Makefile, customize your build if necessary by adding a line
jhandler_CXXFLAGS = -Iyour-include-dir --your-switches [...]
(following the template of the handlers already there)
We also have a one-button script called scripts/addHandler.pl that will
prompt you for all your pieces and put them in place, without you having
to do the above file surgery yourself.
WebVac - the WebBase web crawler or spider. Used to be called Pita.
RunHandlers - (formerly "process") an executable that indexes a stream, file or repository.
Made up basically of a feeder and one or more handlers.
handler - the interface that any index-building piece of code must implement.
The interface's main (only) method will provide a page and associated
metadata, and the implementor of the method can do whatever they want with it.
feeder - the interface for receiving a page stream from any kind of source
(directly from the repository, via Webcat, via network, etc.). The
key method of the interface is "next", which advances the stream by one
page. After calling next, various other methods can be used to get the
associated metadata for the current page in the stream. Can also be used
to build indexes if the index-building code is written to process page streams.
distributor - a program that disseminates pages to multiple clients
over the network, supporting session IDs, etc... a generalization
of what Distributor.cc in Text -index/ does.
offset - used in distributor requests to specify how many bytes from the
beginning of the site to start at.
DocId - DocId is computed within the download.
If you download any portion of the crawl, even from the middle, it will begin with 0.
If you download all the crawl, it will be monotonically increasing from start to end.
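The feeder and handler roles defined above can be illustrated with a toy Python sketch. WebBase itself is C++, and apart from the method name "next", every class, method and function name here is hypothetical, chosen only to show how a feeder loop drives one or more handlers.

```python
class ListFeeder:
    """Toy feeder: serves pages from an in-memory list of (url, body) pairs."""

    def __init__(self, pages):
        self._pages = list(pages)
        self._index = -1

    def next(self):
        """Advance by one page; return False once the stream is exhausted."""
        self._index += 1
        return self._index < len(self._pages)

    def url(self):
        return self._pages[self._index][0]

    def body(self):
        return self._pages[self._index][1]


class CountingHandler:
    """Toy handler: counts pages and total characters seen."""

    def __init__(self):
        self.pages = 0
        self.size = 0

    def process(self, url, body):
        self.pages += 1
        self.size += len(body)


def run_handlers(feeder, handlers):
    """Pull every page from the feeder and pass it to each handler in turn."""
    while feeder.next():
        for handler in handlers:
            handler.process(feeder.url(), feeder.body())


counter = CountingHandler()
feeder = ListFeeder([("www.ibm.com", "<html></html>"), ("www.powa.org", "<p>hi</p>")])
run_handlers(feeder, [counter])
print(counter.pages, counter.size)
```

Keeping the feeder behind a single "next" call is what lets the same handlers run unchanged over a repository, a file, or a network stream.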