Project Description

Key Components of the Stanford Archival DL Architecture

Under our architecture, a Digital Library Repository (DLR) is formed by a collection of independent but collaborating sites. Each site manages a collection of digital objects and provides services (to be defined) to other sites. Each site uses one or more computers, and can run different software, as long as it follows certain simple conventions that we describe in this paper. Our architecture is based on the following key components.

Signatures as Object Handles

Each object in a DLR has a handle used to identify and retrieve it. Handles are internal to the DLR and are not used by end users to identify documents. (Example: If a user is searching for report STAN-1998-347-B, a naming facility not discussed here will translate that name into the appropriate handle, or handles if the report has multiple components.)

Given an object, we define its handle to be a (large) signature computed exclusively from its contents, using a checksum or a Cyclic Redundancy Check (CRC). If the contents are smaller than the size of the signature, the object (at creation time) is ``padded'' with a random string to make its size larger than the size of a signature. This scheme has the following properties, which are important in an archival environment:

The last item requires some discussion, since it may be possible that two different objects share a handle, which would be disastrous. However, by making the signature large (e.g., 128 bits or more), the likelihood of this disaster happening is so low that it is not rational to worry about it. To illustrate, in Appendix 1 we derive a bound for the probability p that there is no disaster in a DLR with n objects and signatures of size b bits. The bound is extremely conservative, yet we see that, say, a 256-bit signature can make even a DLR with 10 billion objects incredibly safe.

Collection Size (n)   Probability of no collisions (p)   Signature Size (b)
10^7                  1 - 10^-9                          76 bits (10 bytes)
10^8                  1 - 10^-9                          83 bits (11 bytes)
10^9                  1 - 10^-9                          89 bits (12 bytes)
10^7                  1 - 10^-24                         128 bits (16 bytes)
10^7                  1 - 10^-63                         256 bits (32 bytes)
10^10                 1 - 10^-18                         128 bits (16 bytes)
10^10                 1 - 10^-57                         256 bits (32 bytes)
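The derivation in Appendix 1 is not reproduced in this section, so the following is only a sketch of how numbers of this kind can be checked. It uses the standard union ("birthday") bound, and it assumes that signatures behave like independent, uniformly distributed b-bit values, which is an idealization. The function names are illustrative and not part of the architecture.

```python
from fractions import Fraction


def collision_bound(n: int, b: int) -> Fraction:
    """Upper bound on the probability that any two of n objects share a b-bit signature.

    Union bound: each of the n*(n-1)/2 pairs collides with probability 2**-b,
    assuming signatures are uniform and independent (an idealization).
    """
    pairs = n * (n - 1) // 2
    return Fraction(pairs, 2 ** b)


def min_signature_bits(n: int, max_collision_prob: Fraction) -> int:
    """Smallest b for which the union bound is at most max_collision_prob."""
    b = 1
    while collision_bound(n, b) > max_collision_prob:
        b += 1
    return b


if __name__ == "__main__":
    target = Fraction(1, 10 ** 9)
    for n in (10 ** 7, 10 ** 8, 10 ** 9):
        print(n, min_signature_bits(n, target))
    # Probability of no collision is at least 1 - collision_bound(n, b).
    print(float(collision_bound(10 ** 10, 256)))
```

Under these assumptions the bound calls for 76, 83, and 89 bits for the first three collection sizes, which agrees with the table. For 128 and 256 bits it gives failure probabilities slightly smaller than those in the table, so the tabulated values remain on the safe side.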

If some applications (or paranoid users) need absolute certainty that each signature is unique, then we offer the following enhanced identification scheme. Handles are extended to have two fields: a unique publisher field and the signature of the object. The publisher field is the unique code of the site that first publishes the object; this publisher code is assigned to the site by some authority. The publisher field of an object does not change when the object migrates to other repositories. The second field is the same as the signature described earlier. When a site creates a new object, it first stores its publisher field in the object header. Then it computes the signature of this extended object and checks if any other local object has the same signature. In the extremely rare case that there is a conflict, we add a discriminator, a random string of bytes, at the end of the new object. The discriminator is included in the computation of the signature (and therefore will make the object map to a different signature), but it is filtered out when the object is returned to a user. From then on, the handle of an object is computed (at any site) by reading its publisher value and appending to it the object signature.
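The enhanced scheme can be sketched as follows. This is a minimal illustration under several assumptions, not the definitive implementation: SHA-256 stands in for the large checksum or CRC, the header is encoded as a length-prefixed publisher code, the discriminator is 16 random bytes, and the names `StoredObject` and `Site` are hypothetical.

```python
import hashlib
import os
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Handle = Tuple[bytes, bytes]      # (publisher code, signature)
DISCRIMINATOR_BYTES = 16          # assumed size of the random discriminator


def signature(data: bytes) -> bytes:
    """Stand-in signature function (assumption: SHA-256 instead of a large CRC)."""
    return hashlib.sha256(data).digest()


@dataclass(frozen=True)
class StoredObject:
    publisher: bytes              # header field, fixed when the object is first published
    content: bytes                # what is returned to the user
    discriminator: bytes = b""    # random bytes, filtered out on retrieval

    def signed_bytes(self) -> bytes:
        # Everything covered by the signature: header, content, discriminator.
        header = len(self.publisher).to_bytes(2, "big") + self.publisher
        return header + self.content + self.discriminator


class Site:
    def __init__(self, publisher: bytes,
                 sig: Callable[[bytes], bytes] = signature) -> None:
        self.publisher = publisher
        self.sig = sig
        self.objects: Dict[bytes, StoredObject] = {}   # local signature -> object

    def create(self, content: bytes) -> Handle:
        obj = StoredObject(self.publisher, content)
        s = self.sig(obj.signed_bytes())
        while s in self.objects:
            if self.objects[s].signed_bytes() == obj.signed_bytes():
                break             # the very same object is already stored
            # Conflict with a different object: retry with a fresh discriminator.
            obj = StoredObject(self.publisher, content,
                               os.urandom(DISCRIMINATOR_BYTES))
            s = self.sig(obj.signed_bytes())
        self.objects[s] = obj
        return (obj.publisher, s)

    def fetch(self, handle: Handle) -> bytes:
        _publisher, s = handle
        return self.objects[s].content   # discriminator is not returned
```

Passing a deliberately weak `sig` (for example, the first byte of a SHA-256 digest) is one way to exercise the conflict path, which would otherwise essentially never run.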

No Deletions

Because of our handle scheme, objects cannot be updated in place. That is, if the contents of an object are modified, it automatically becomes a new object, with a different handle. This is actually an important advantage, since it eliminates many sources of confusion. For instance, one cannot correct a typo in a report and make it pass as the same object. (We do provide higher-level mechanisms for tracking versions of an object.) Similarly, if a stored object is corrupted due to a disk error, the corrupted object will not be confused with the original.
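A small sketch of these two consequences, assuming SHA-256 as a stand-in for the signature function (the names `handle_of` and `is_intact` are illustrative):

```python
import hashlib


def handle_of(contents: bytes) -> str:
    """Handle derived only from the contents (assumption: SHA-256 as the signature)."""
    return hashlib.sha256(contents).hexdigest()


def is_intact(handle: str, stored: bytes) -> bool:
    """True if the stored bytes still match the handle they were filed under."""
    return handle_of(stored) == handle


original = b"Tehcnical report, draft 1"    # contains a typo
corrected = b"Technical report, draft 1"   # the fix yields a different object

# Correcting the typo cannot pass as the same object: the handle changes.
assert handle_of(original) != handle_of(corrected)

# A disk error that flips a single bit is detected by recomputing the signature.
damaged = bytes([original[0] ^ 0x01]) + original[1:]
assert is_intact(handle_of(original), original)
assert not is_intact(handle_of(original), damaged)
```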

Another fundamental rule in our architecture is that objects are never (voluntarily) deleted. Allowing deletions is dangerous when sites are managed independently; in particular, it makes it hard to distinguish between a deleted object and one that was corrupted (``morphed'' into another) and needs to be restored. Ruling out deletions is natural in a digital library, where it is important to keep a historical record. Thus, books are not ``burned'' but ``removed from circulation.'' We can provide an analogous high level mechanism for indicating that certain objects should not be provided to the public.

Having immutable objects presents some management challenges. For example, say we create a new version Y of some object X (say, a video clip). We cannot directly mark X to indicate that there is a new version Y that should be accessed, because this would be an in-place update to X. In Section 5 we show how we can ``indirectly'' record such changes. Of course, having no deletions increases storage requirements. We do not believe this is an important issue because (1) storage costs are so low, and (2) we are archiving only library objects in this fashion, not all possible data.

Layered Architecture

Since each DLR site may be implemented differently, it is important that site interfaces be well defined and as simple as possible. It is also important to have clean interfaces for services within a site, so that different software systems can be used to implement individual components. We achieve this in our architecture by defining service layers at each site. The layers include:

  1. Object Store Layer: The Object Store layer uses a Data Store (e.g., file system, database management system) to persistently save objects. This layer may use its own scheme to identify objects (e.g., file names, tuple-ids). We refer to these local identifiers as disk-ids.
  2. Identity Layer: This layer has two main functions: (i) it provides access to objects via their handles (signatures); and (ii) it provides basic facilities for reporting changes to its objects to other interested parties.
  3. Complex objects layer: Manages collections of related objects. Its services could be used to maintain the different versions (or representations) of a document.
  4. Reliability layer: Coordinates replication of objects to multiple stores, for long term archiving. The assumption is that the Object Store layer makes a reasonable effort at reliable storage, but it cannot be counted on to keep objects forever.
  5. Upper layers: Provide mechanisms for protecting intellectual property, enforcing security, and charging customers under various revenue models. They can also provide associative search for objects, based on metadata or contents of objects, as well as user access.
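To make the separation between the two lowest layers concrete, here is a minimal sketch. The interface and method names (`ObjectStore`, `write`, `read`, `disk_ids`, `IdentityLayer`) are assumptions made for illustration, SHA-256 stands in for the signature, and the in-memory store is only a placeholder for a file system or database.

```python
import hashlib
from typing import Dict, Iterable, Protocol


class ObjectStore(Protocol):
    """Object Store layer: persists bytes under store-chosen disk-ids."""

    def write(self, contents: bytes) -> str: ...
    def read(self, disk_id: str) -> bytes: ...
    def disk_ids(self) -> Iterable[str]: ...


class InMemoryStore:
    """Toy Data Store; a real site might use files or a DBMS instead."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}
        self._next = 0

    def write(self, contents: bytes) -> str:
        disk_id = f"blob-{self._next}"   # local identifier, meaningless outside this store
        self._next += 1
        self._blobs[disk_id] = contents
        return disk_id

    def read(self, disk_id: str) -> bytes:
        return self._blobs[disk_id]

    def disk_ids(self) -> Iterable[str]:
        return list(self._blobs)


class IdentityLayer:
    """Identity layer: exposes objects by handle, hiding disk-ids from callers."""

    def __init__(self, store: ObjectStore) -> None:
        self._store = store
        self._index: Dict[str, str] = {}   # handle -> disk-id

    def put(self, contents: bytes) -> str:
        handle = hashlib.sha256(contents).hexdigest()
        if handle not in self._index:      # identical contents are stored once
            self._index[handle] = self._store.write(contents)
        return handle

    def get(self, handle: str) -> bytes:
        return self._store.read(self._index[handle])
```

Because `IdentityLayer` depends only on the three `ObjectStore` methods, a site could swap in a different store implementation without changing the layers above it.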

Layers of a Cellular Repository

In the figure above we illustrate the layers of a DLR. Each ``column'' in the figure represents a site, and each ``row'' a software layer. We call the implementation of a layer at a site a cell, and the complete repository a cellular DLR. Cells can collaborate with others to achieve their goals. For example, the reliability cell at Site 1 communicates with the reliability cell at Site 2. Cells below the reliability layer only deal with their local site. In this paper we only study the grayed-out cells.

Awareness Everywhere

Awareness services (standing orders, subscriptions, alerts) are important in digital libraries. They are also important for our reliability and indexing layers: if one site is backing up another, it must be aware of new or corrupted objects so that it can take appropriate action. Similarly, to keep an index up to date, changes need to be propagated to it. In many systems, awareness services are added as an afterthought, once the base storage system is developed, and this makes it hard to detect all changes. In our architecture, awareness services are an integral part of every layer. This makes it possible to build very reliable awareness services that can be used for replication and indexing.
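As an illustration of awareness built into a layer rather than bolted on, the sketch below has a store notify its subscribers whenever an object is created or found to be corrupted, and uses those notifications to drive a backup site. The event names, the `AwareStore` class, and the `mirror` helper are hypothetical, and SHA-256 stands in for the signature.

```python
import hashlib
from typing import Callable, Dict, List

Listener = Callable[[str, str], None]   # called as listener(event, handle)


def _handle(contents: bytes) -> str:
    return hashlib.sha256(contents).hexdigest()


class AwareStore:
    """Handle-addressed store that reports every change to its subscribers."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}
        self._listeners: List[Listener] = []

    def subscribe(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def _notify(self, event: str, handle: str) -> None:
        for listener in self._listeners:
            listener(event, handle)

    def put(self, contents: bytes) -> str:
        handle = _handle(contents)
        if handle not in self._objects:
            self._objects[handle] = contents
            self._notify("created", handle)
        return handle

    def get(self, handle: str) -> bytes:
        return self._objects[handle]

    def restore(self, handle: str, contents: bytes) -> None:
        if _handle(contents) != handle:
            raise ValueError("replacement does not match the handle")
        self._objects[handle] = contents

    def audit(self) -> None:
        """Recompute every signature and report objects that no longer match."""
        for handle, contents in list(self._objects.items()):
            if _handle(contents) != handle:
                self._notify("corrupted", handle)


def mirror(primary: AwareStore, backup: AwareStore) -> None:
    """Have `backup` follow `primary`: copy new objects, repair corrupted ones."""

    def on_event(event: str, handle: str) -> None:
        if event == "created":
            backup.put(primary.get(handle))
        elif event == "corrupted":
            primary.restore(handle, backup.get(handle))

    primary.subscribe(on_event)
```

Since every `put` goes through `_notify`, a subscriber cannot miss a new object, which is the property that is hard to obtain when notification is added after the storage system is built.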

Disposable Auxiliary Structures

Layers typically maintain auxiliary structures for improving performance. In our architecture these structures are designed to be disposable, so they can be reconstructed from the underlying digital objects. To illustrate, consider the Identity layer. For efficient lookup, it needs an index structure that maps a handle (signature) into the local disk-id (e.g., file name). One option would be to store this index as a digital object, making it part of the DLR. However, this opens the door to inconsistencies. For instance, the index may say that the object with handle H can be found at disk-id D, but the signature of the object at disk-id D is not H. Instead, we say that no auxiliary structures are part of the DLR. (The structures may be on secondary storage that is not part of the DLR.) If the structures become corrupted or inconsistent with the DLR, they should be deleted and reconstructed from scratch.

In addition to avoiding potential inconsistencies, this approach also makes it easy to migrate objects to a new store when the old one becomes obsolete. Auxiliary structures, which are typically intricate, do not have to be migrated to the new system. The new system can simply obtain the digital objects and build its own structures, using whatever implementation it desires.
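The following sketch shows a disposable handle-to-disk-id index. It assumes the Object Store can be modeled as a mapping from disk-ids to bytes and that SHA-256 stands in for the signature; `rebuild_index` and `lookup` are illustrative names.

```python
import hashlib
from typing import Dict


def handle_of(contents: bytes) -> str:
    return hashlib.sha256(contents).hexdigest()


def rebuild_index(store: Dict[str, bytes]) -> Dict[str, str]:
    """Reconstruct the handle -> disk-id index from the stored objects alone."""
    return {handle_of(contents): disk_id for disk_id, contents in store.items()}


def lookup(index: Dict[str, str], store: Dict[str, bytes], handle: str) -> bytes:
    """Fetch by handle, refusing to trust an index entry that does not check out."""
    disk_id = index.get(handle)
    contents = store.get(disk_id) if disk_id is not None else None
    if contents is None or handle_of(contents) != handle:
        raise LookupError("index is missing or inconsistent; rebuild it")
    return contents


# Old store, with file names as disk-ids.
old_store = {"a.dat": b"report body", "b.dat": b"video clip"}
h = handle_of(b"report body")

# An index that points handle H at the wrong disk-id D is detected, then discarded.
stale_index = {h: "b.dat"}
try:
    lookup(stale_index, old_store, h)
except LookupError:
    stale_index = rebuild_index(old_store)
assert lookup(stale_index, old_store, h) == b"report body"

# Migration: only the objects move; the new store assigns its own disk-ids
# and builds a fresh index, so handles are unchanged.
new_store = {f"obj/{i}": contents for i, contents in enumerate(old_store.values())}
new_index = rebuild_index(new_store)
assert lookup(new_index, new_store, h) == b"report body"
```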