Project Description
Key Components of the Stanford Archival DL Architecture
Under our architecture, a Digital Library Repository (DLR) is formed by a
collection of independent but collaborating sites. Each site
manages a collection of digital objects and provides services (to be
defined) to other sites. Each site uses one or more computers, and can
run different software, as long as it follows certain simple conventions
that we describe in this paper. Our architecture is based on the following
key components.
Signatures as Object Handles
Each object in a DLR has a handle used to identify and retrieve it.
Handles are internal to the DLR and are not used by end users to
identify documents. (Example: If a user is searching for report
STAN-1998-347-B, a naming facility not discussed here will translate it into
the appropriate handle, or handles if the report has multiple components.)
Given an object, we define its handle to be a (large) signature computed
exclusively from its contents, using a checksum or a Cyclic Redundancy
Check (CRC). If the contents are smaller than the size of the signature,
the object (at creation time) is ``padded'' with a random string to make
its size larger than the size of a signature. This scheme has the
following properties, which are important in an archival environment:
- Each site can generate objects and their handles
without consulting other sites.
This makes it possible for sites to operate independently.
Furthermore, sites only need to agree on the signature function,
not on software versions, character sets, timestamp services, and so on.
- The handle for an object can be reconstructed from the object itself.
As we will see, this is an extremely useful property,
since we do not need to reliably save any handle-to-object mappings.
- If copies of an object are made at different sites,
all copies will have identical handles.
This may seem disconcerting at first, but if the
contents are identical, it makes management simpler to call
``a spade a spade.''
- Objects with different contents will, with extremely
high probability, have different handles.
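As a rough illustration of the handle scheme, the sketch below derives a handle purely from an object's contents. It rests on several assumptions: the text asks only for a large checksum or CRC, and SHA-256 from Python's standard library is substituted here as the agreed signature function; the names `SIGNATURE_BYTES`, `make_object`, and `handle_of` are hypothetical; and appending random bytes is just one reading of the padding rule.

```python
import hashlib
import os

SIGNATURE_BYTES = 32  # assumption: 256-bit signatures (SHA-256 stands in for the checksum)


def make_object(contents: bytes) -> bytes:
    """Pad short contents at creation time so the object is larger than a signature.

    How padding is later told apart from real content is not specified in the
    text, so this sketch does not attempt to strip it.
    """
    if len(contents) <= SIGNATURE_BYTES:
        shortfall = SIGNATURE_BYTES + 1 - len(contents)
        contents += os.urandom(shortfall)
    return contents


def handle_of(obj: bytes) -> str:
    """Compute the handle exclusively from the object's contents."""
    return hashlib.sha256(obj).hexdigest()
```

Because `handle_of` needs nothing but the bytes of the object, any site that agrees on the signature function can recompute a handle locally, and two copies of the same object yield the same handle.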
The last property listed above requires some discussion, since it is possible for two
different objects to share a handle, which would be disastrous. However, by
making the signature large (e.g., 128 bits or more), the likelihood of
this disaster happening is so extremely low that it is not rational to
worry about it. To illustrate, in Appendix 1 we derive a bound for the
probability p that there is no disaster in a DLR with n objects and
signatures of size b bits. The bound is extremely conservative, yet
it shows that, say, a 256-bit signature can make even a DLR with 10 billion
objects incredibly safe.
| Collection Size (n) | Probability of no collisions (p) | Signature Size (b) |
| 10^7 | 1-10^-9 | 76 bits (10 bytes) |
| 10^8 | 1-10^-9 | 83 bits (11 bytes) |
| 10^9 | 1-10^-9 | 89 bits (12 bytes) |
| 10^7 | 1-10^-24 | 128 bits (16 bytes) |
| 10^7 | 1-10^-63 | 256 bits (32 bytes) |
| 10^10 | 1-10^-18 | 128 bits (16 bytes) |
| 10^10 | 1-10^-57 | 256 bits (32 bytes) |
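Figures of this kind can be checked with the standard union (birthday) bound: if signatures behave like uniformly random b-bit strings, the probability of any collision among n objects is at most n(n-1)/2^(b+1). Whether this is exactly the bound derived in Appendix 1 is an assumption on our part; the sketch below uses it only to show how such a table can be verified. The function names are hypothetical.

```python
from fractions import Fraction


def collision_bound(n: int, b: int) -> Fraction:
    """Upper bound on the probability of at least one collision.

    Union bound over all n*(n-1)/2 pairs, each colliding with probability 2**-b
    (assumes uniformly distributed b-bit signatures).
    """
    return Fraction(n * (n - 1), 2 ** (b + 1))


def min_signature_bits(n: int, max_collision_prob: Fraction) -> int:
    """Smallest b for which the bound does not exceed max_collision_prob."""
    b = 1
    while collision_bound(n, b) > max_collision_prob:
        b += 1
    return b
```

Under these assumptions, `min_signature_bits(10**7, Fraction(1, 10**9))` reproduces the 76-bit row of the table, and the 83-bit and 89-bit rows follow in the same way for 10^8 and 10^9 objects. Exact rational arithmetic is used because the probabilities involved are far below floating-point resolution near 1.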
If some applications (or paranoid users) need an absolute certainty that
each signature is unique, then we offer the following enhanced
identification scheme. Handles are extended to have two fields: a unique
publisher field and the signature of the object. The publisher field is
the unique code of the site that first publishes the object; this
publisher code is assigned to the site by some authority. The publisher
field of an object does not change when the object migrates to other
repositories. The second field is the same as the signature described
earlier. When a site creates a new object, it first stores its publisher
field in the object header. Then it computes the signature of this
extended object and checks if any other local object has the
same signature. In the extremely rare case there is a conflict, we add a
discriminator, a random string of bytes, at the end of the new
object. The discriminator is included in the computation of the signature
(and therefore will make the object map to a different signature), but it
is filtered out when the object is returned to a user. From then on, the
handle of an object is computed (at any site) by reading its publisher
value and appending the object signature to it.
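The enhanced scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not a specification: the header encoding (publisher code followed by a newline), the 16-byte discriminator, the `publisher:signature` handle format, and the names `publish`, `handle_of`, and `sha256_signature` are all hypothetical, and SHA-256 again stands in for the signature function. Filtering the discriminator out before returning an object to a user is omitted.

```python
import hashlib
import os
from typing import Callable, Dict, Tuple


def sha256_signature(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def handle_of(stored: bytes,
              sig: Callable[[bytes], str] = sha256_signature) -> str:
    """Recompute the two-field handle at any site from the stored object alone."""
    # Assumption: the publisher code is the first line and contains no newline.
    publisher = stored.split(b"\n", 1)[0].decode("utf-8")
    return publisher + ":" + sig(stored)


def publish(publisher: str, contents: bytes, local: Dict[str, bytes],
            sig: Callable[[bytes], str] = sha256_signature) -> Tuple[str, bytes]:
    """Create an object at its first-publishing site; return (handle, stored bytes).

    `local` maps signatures to the extended objects already held at this site.
    """
    header = publisher.encode("utf-8") + b"\n"
    stored = header + contents
    # Conflict: a *different* local object already has this signature.
    while sig(stored) in local and local[sig(stored)] != stored:
        # The random discriminator is part of the signed bytes.
        stored = header + contents + os.urandom(16)
    local[sig(stored)] = stored
    return handle_of(stored, sig), stored
```

The `sig` parameter exists only so that the conflict branch can be exercised with a deliberately weak signature function; with a large signature that branch is, as the text notes, extremely unlikely to run.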
No Deletions
Because of our handle scheme, objects cannot be updated in place. That
is, if the contents of an object are modified, it automatically becomes a
new object, with a different handle. This is actually an important
advantage, since it eliminates many sources of confusion. For instance,
one cannot correct a typo in a report and make it pass as the same object.
(We do provide higher-level mechanisms for tracking versions of an
object.) Similarly, if a stored object is corrupted due to a disk error,
the corrupted object will not be confused with the original.
Another fundamental rule in our architecture is that objects are never
(voluntarily) deleted. Allowing deletions is dangerous when sites are
managed independently; in particular, it makes it hard to distinguish
between a deleted object and one that was corrupted (``morphed'' into
another) and needs to be restored. Ruling out deletions is natural in a
digital library, where it is important to keep a historical record. Thus,
books are not ``burned'' but ``removed from circulation.'' We can provide
an analogous high level mechanism for indicating that certain objects
should not be provided to the public.
Having immutable objects presents some management challenges. For
example, say we create a new version Y of some object (say a video
clip) X. We cannot directly mark X to indicate that there is a new
version Y that should be accessed, because this would be an
in-place update to X. In Section 5 we show how we can
``indirectly'' record such changes. Of course, having no deletions
increases storage requirements. We do not believe this is an
important issue because (1) storage costs are so low, and (2) we
archive only library objects in this fashion, not all possible data.
Layered Architecture
Since each DLR site may be implemented differently, it is important
to have site interfaces that are well defined and as simple as possible.
Furthermore, it is also important to have clean interfaces for
services within a site, so that different software systems
could be used to implement individual components. We achieve this in
our architecture by defining service layers at each site. The
layers include:
- Object Store Layer: The Object Store layer uses a
Data Store (e.g., file system, database management system) to
persistently save objects. This layer may use its own scheme to
identify objects (e.g., file names, tuple-ids). We refer to these
local identifiers as disk-ids.
- Identity Layer: This layer has two main functions:
(i) it provides access to objects via their handles (signatures); and
(ii) it provides basic facilities for
reporting changes to its objects to other interested parties.
- Complex objects layer: Manages collections of related objects. Its
services could be used to maintain the different versions (or
representations) of a document.
- Reliability layer: Coordinates replication of objects to
multiple stores, for long term archiving. The assumption is
that the Object Store layer makes a reasonable effort at reliable
storage, but it cannot be counted on to keep objects forever.
- Upper layers: Provide mechanisms for protecting intellectual
property, enforcing security, and charging customers under various revenue
models. They can also provide associative search for objects, based on
metadata or contents of objects, as well as user access.
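To make the division of labor between the two lowest layers concrete, the following sketch models an Object Store that knows only disk-ids and an Identity layer that exposes objects by handle. All class and method names are hypothetical, the Data Store is reduced to an in-memory dictionary, and SHA-256 is assumed as the signature function; a real site could use a file system or DBMS behind the same interface.

```python
import hashlib
from typing import Dict


class ObjectStore:
    """Object Store layer: persists bytes under local disk-ids (here, integers)."""

    def __init__(self) -> None:
        self._blocks: Dict[int, bytes] = {}
        self._next_id = 0

    def save(self, obj: bytes) -> int:
        disk_id = self._next_id
        self._next_id += 1
        self._blocks[disk_id] = obj
        return disk_id

    def load(self, disk_id: int) -> bytes:
        return self._blocks[disk_id]


class IdentityLayer:
    """Identity layer: gives access to objects via their handles (signatures)."""

    def __init__(self, store: ObjectStore) -> None:
        self._store = store
        self._index: Dict[str, int] = {}  # handle -> disk-id

    @staticmethod
    def handle_of(obj: bytes) -> str:
        return hashlib.sha256(obj).hexdigest()

    def put(self, obj: bytes) -> str:
        handle = self.handle_of(obj)
        if handle not in self._index:  # identical contents are stored once
            self._index[handle] = self._store.save(obj)
        return handle

    def get(self, handle: str) -> bytes:
        obj = self._store.load(self._index[handle])
        if self.handle_of(obj) != handle:
            # The stored bytes no longer match their handle.
            raise ValueError("stored object does not match handle " + handle)
        return obj
```

Only `IdentityLayer` ever sees handles, and only `ObjectStore` ever sees disk-ids, so either side can be replaced without touching the other.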

Figure: Layers of a Cellular Repository
In the figure above we illustrate the layers of a DLR.
Each ``column'' in the figure represents a site, and each ``row'' a
software layer. We call the implementation of a layer at a site a
cell, and the complete repository a cellular DLR. Cells
can collaborate with others to achieve their goals. For example, the
reliability cell at Site 1 communicates with the reliability cell at
Site 2. Cells below the reliability layer only deal with their local
site. In this paper we only study the grayed-out cells.
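One way two reliability cells might collaborate is sketched below. It is an assumption-laden illustration rather than the protocol itself: each site is reduced to a dictionary from handle to object, the function name `replicate` is hypothetical, and SHA-256 is assumed as the signature function. Because handles are derived from contents, a set difference on handles is enough to find what the peer is missing.

```python
import hashlib
from typing import Dict


def handle_of(obj: bytes) -> str:
    return hashlib.sha256(obj).hexdigest()


def replicate(source: Dict[str, bytes], target: Dict[str, bytes]) -> int:
    """Copy objects that `target` lacks from `source`; return how many were copied."""
    copied = 0
    for handle in source.keys() - target.keys():
        obj = source[handle]
        # Re-verify before copying so a corrupted object is never propagated.
        if handle_of(obj) == handle:
            target[handle] = obj
            copied += 1
    return copied
```

Running `replicate` in both directions brings two sites to the same set of intact objects without either site consulting a shared catalog.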
Awareness Everywhere
Awareness services (standing orders, subscriptions, alerts) are
important in digital libraries. They are also important for our
reliability and indexing layers: if one site is backing up another,
it must be aware of new objects or corrupted objects to take
appropriate action. Similarly, to keep an index up to date,
changes need to be propagated. In many systems, awareness services
are added as an afterthought, once the base storage system is
developed, and this makes it hard to detect all changes. In
our architecture, awareness services are an integral part of every
layer. This makes it possible to build very reliable awareness
services that can be used for replication and indexing.
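A minimal sketch of awareness built into a layer, rather than bolted on afterwards, might look as follows. The class `AwareStore`, the event names `"created"` and `"corrupted"`, and the callback signature are hypothetical, and SHA-256 is assumed as the signature function. The point is only that every change passes through the same code path that notifies subscribers.

```python
import hashlib
from typing import Callable, Dict, List

Listener = Callable[[str, str], None]  # called with (event, handle)


class AwareStore:
    """A handle-addressed store that reports changes to subscribers."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}
        self._listeners: List[Listener] = []

    def subscribe(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def _notify(self, event: str, handle: str) -> None:
        for listener in self._listeners:
            listener(event, handle)

    def put(self, obj: bytes) -> str:
        handle = hashlib.sha256(obj).hexdigest()
        if handle not in self._objects:
            self._objects[handle] = obj
            self._notify("created", handle)  # new objects are always announced
        return handle

    def get(self, handle: str) -> bytes:
        return self._objects[handle]

    def audit(self) -> None:
        """Recompute every signature and announce objects that no longer match."""
        for handle, obj in list(self._objects.items()):
            if hashlib.sha256(obj).hexdigest() != handle:
                self._notify("corrupted", handle)
```

A backup site or an indexer would call `subscribe` once and then react to `"created"` and `"corrupted"` events, for example by fetching a new object or restoring a damaged one from its own copy.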
Disposable Auxiliary Structures
Layers typically maintain auxiliary structures for improving
performance. In our architecture these structures are designed to be
disposable, so they can be reconstructed from the underlying
digital objects. To illustrate, consider the Identity layer. For
efficient lookup, it needs an index structure that maps a handle
(signature) into the local disk-id (e.g., file name). One option
would be to store this index as a digital object, making it part of
the DLR. However, this opens the door for inconsistencies. For
instance, the index may say that the object with handle H can be
found at disk-id D, but the signature of the object at disk-id D
is not H. Instead, we say that no auxiliary structures are part of
the DLR. (The structures may be on secondary storage that is not
part of the DLR.) If the structures become corrupted or inconsistent
with the DLR, they should be deleted and reconstructed from scratch.
In addition to avoiding potential inconsistencies, this approach also
makes it easy to migrate objects to a new store, when the old one
becomes obsolete. Auxiliary structures, which are typically
intricate, do not have to be migrated to the new system. The new
system can simply obtain the digital objects and build its own
structures, using whatever implementation it desires.
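The disposable handle-to-disk-id index can be illustrated with a short sketch. The Data Store is modeled as a dictionary from disk-id (a file name) to bytes, the function names are hypothetical, and SHA-256 is assumed as the signature function. Since handles are computed from contents, the index can always be rebuilt by scanning the objects.

```python
import hashlib
from typing import Dict, List


def handle_of(obj: bytes) -> str:
    return hashlib.sha256(obj).hexdigest()


def rebuild_index(data_store: Dict[str, bytes]) -> Dict[str, str]:
    """Scan every stored object and map handle -> disk-id.

    If two disk-ids hold identical contents, the last one scanned wins;
    either is an equally valid location for that handle.
    """
    return {handle_of(obj): disk_id for disk_id, obj in data_store.items()}


def stale_entries(index: Dict[str, str], data_store: Dict[str, bytes]) -> List[str]:
    """Handles whose index entry points at a missing or mismatched object."""
    return [
        handle
        for handle, disk_id in index.items()
        if disk_id not in data_store or handle_of(data_store[disk_id]) != handle
    ]
```

If `stale_entries` returns anything, the index is simply discarded and `rebuild_index` is run again; nothing in the index needs to be repaired or migrated.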