Digital Libraries Research Agenda Report - Appendix 3: Working Group Reports

Report of the Commercial Perspective Working Group

IITA Digital Libraries Workshop

Rashomon Meets Digital Libraries

Michael Lesk

Commercial exploitation of the Internet, the Web, and digital libraries is rushing towards us. What important limits on economic use of digital libraries could be alleviated by new research? This was the theme of our group discussion.

We discussed briefly the definitions of "digital library" and "infrastructure." Key ingredients in a "library" are organization and access tools, in addition to piles of bits. Publishers must be involved in the infrastructure debate, and we note approvingly that the existing digital library projects all involve publishers in cooperative roles. But a key question is what new research can do to assist the broad range of economic impacts from digital libraries, not just the effects on publishers or libraries.

Commercial organizations can exploit digital libraries in many ways. They can obtain information from libraries to help their operations; they can use library software to manage their own information; and they can sell services based on the delivery of information. The information industry is in a state of flux and rapid development, taking advantage of the quick progress in computer networks. However, new research could overcome the obstacles that remain and create new opportunities, both of which would facilitate the development of the industry. Many of these issues concern opening the markets for information services and digital libraries, so that all companies have an equal chance to participate in these efforts.

We assume a familiar context: Digital libraries are collections of byte streams, stored in ways that permit users to retrieve and view information that they want. There will be many such collections, distributed around the United States and the world. Some will be freshly created material, and some will be converted from other forms. The various repositories of digital information will be connected, and users will be able to view the multiple repositories as if they were one big library, even though they will not be owned by one organization. There are many technical advances needed to bring about this world, and many are being developed right now, funded by various governmental and private organizations. This report points to some new areas where additional research is of highest priority.

The most important issues concern basic infrastructure support. In the context of digital libraries, these issues include preservation, collection, location, and mapping of information names ("handles") to locations. Preservation includes technology for long-term stable storage, techniques for managing archiving practice and refreshing, ways of verifying the integrity of stored files, and methods of tracking the number of copies of files in distributed repositories. For example, it should be possible to design file repositories that keep a count of the number of copies of each file, even though the copies are geographically spread, and that maintain a threshold so that a file cannot be deleted if doing so would reduce the number of copies below the stated threshold. Collection involves technology to select material for long-term storage, to provide assistance in cataloging and storing the material, and to describe the collection for use in information navigation and retrieval. Location and location mapping require technologies for rapid retrieval of items at known locations and for mapping semantically meaningful names to locations.
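
As a rough illustration of the copy-count threshold and the mapping of handles to locations described above, the following Python sketch keeps an in-memory record of which sites hold each file and refuses a deletion that would leave too few copies. The class name ReplicaRegistry, its method names, and the handle and site names are all hypothetical, chosen only for this example. A real repository would need to coordinate the count across geographically separate sites, which is not attempted here.

```python
class ReplicaRegistry:
    """Hypothetical registry: maps each handle to the sites holding a copy."""

    def __init__(self, min_copies: int) -> None:
        # Assumed policy: one threshold applies to every file.
        self.min_copies = min_copies
        self._sites_by_handle: dict[str, set[str]] = {}

    def add_copy(self, handle: str, site: str) -> None:
        """Record that `site` now stores a copy of `handle`."""
        self._sites_by_handle.setdefault(handle, set()).add(site)

    def copy_count(self, handle: str) -> int:
        """Return how many sites are known to hold `handle`."""
        return len(self._sites_by_handle.get(handle, set()))

    def locations(self, handle: str) -> list[str]:
        """Map a handle to the (sorted) list of sites holding it."""
        return sorted(self._sites_by_handle.get(handle, set()))

    def delete_copy(self, handle: str, site: str) -> bool:
        """Delete one copy unless that would drop below the threshold.

        Returns True if the copy was removed, False if the deletion was
        refused. Raises KeyError if `site` holds no copy of `handle`.
        """
        sites = self._sites_by_handle.get(handle, set())
        if site not in sites:
            raise KeyError(f"{site} holds no copy of {handle}")
        if len(sites) - 1 < self.min_copies:
            return False
        sites.remove(site)
        return True


registry = ReplicaRegistry(min_copies=2)
for site in ("boston", "denver", "tokyo"):
    registry.add_copy("report-42", site)

print(registry.delete_copy("report-42", "tokyo"))   # allowed: two copies remain
print(registry.delete_copy("report-42", "denver"))  # refused: would leave one copy
print(registry.locations("report-42"))
```

The sketch treats the threshold check and the deletion as a single local step. In a distributed setting that step would have to be made atomic across sites, which is part of the research problem rather than something this example solves.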

Query handling must also be improved to facilitate information services. Queries in new and larger information systems often retrieve a great many documents, and techniques for visualizing and summarizing answer sets are required. Query negotiation is going to be necessary as well, since queries will often be sent to libraries under circumstances that only allow constrained processing. The constraints may arise out of limited bandwidth, access restrictions based on charging, or other circumstances. More general browsing interfaces are also needed.

Interoperability is also a key requirement for digital libraries. It must be possible to send queries to multiple index servers and retrieve documents from multiple repositories without human intervention. This will require either standardization or automatic translation. Research into both areas is necessary.
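
The "automatic translation" option can be sketched as follows, under strong simplifying assumptions. A single neutral query is translated into the native syntax of each index server, sent to every server, and the answers are merged without human intervention. The two query syntaxes, the function names, and the in-memory stand-ins for servers are all invented for this illustration and do not correspond to any existing protocol or product.

```python
from typing import Callable

# Assumption: a neutral query is a dict of field -> value, and a "server"
# is any function that accepts a query string in its own native syntax
# and returns a list of document identifiers.
Query = dict[str, str]
Translator = Callable[[Query], str]
Server = Callable[[str], list[str]]


def to_boolean_syntax(query: Query) -> str:
    """Translate to a hypothetical 'field=value AND field=value' syntax."""
    return " AND ".join(f"{field}={value}" for field, value in sorted(query.items()))


def to_prefix_syntax(query: Query) -> str:
    """Translate to a hypothetical 'field:value field:value' syntax."""
    return " ".join(f"{field}:{value}" for field, value in sorted(query.items()))


def make_server(index: dict[str, list[str]]) -> Server:
    """Build an in-memory stand-in for a remote index server."""
    return lambda native_query: index.get(native_query, [])


def federated_search(query: Query, backends: list[tuple[Translator, Server]]) -> list[str]:
    """Send one query to every backend and merge results, dropping duplicates."""
    seen: set[str] = set()
    merged: list[str] = []
    for translate, search in backends:
        for doc_id in search(translate(query)):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


server_a = make_server({"subject=chemistry AND year=1995": ["doc1", "doc2"]})
server_b = make_server({"subject:chemistry year:1995": ["doc2", "doc3"]})

results = federated_search(
    {"subject": "chemistry", "year": "1995"},
    [(to_boolean_syntax, server_a), (to_prefix_syntax, server_b)],
)
print(results)  # ['doc1', 'doc2', 'doc3']
```

Standardization would correspond to every server accepting the same syntax, so that a single translator suffices. The per-server translators above stand for the alternative in which each repository keeps its own query language.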

Remaining topics we identified as important include:

User modeling. The maintenance of state from one session to another, and the acquisition of information about each user's goals and intents. This can be used to form models of what kind of query processing will be advantageous to each user, and to improve the performance of systems in a world in which many queries are too short and need supplementary context.

Automatic methods of assessing quality, genre, and other properties of documents. Traditional library classification systems address subject content only, and do not deal with other aspects of documents that are often important for user needs. Given the costs of manual evaluation, we need fully automatic methods, perhaps involving language processing, to extract these properties of digital documents (or other digital objects).

Economic and social models and alternate structures for publishing. One of the key bottlenecks in the development of digital libraries has been uncertainty about which organizations would perform which tasks, and how they would be able to recover their costs. Technology research is needed to simplify these tasks and to provide systems which can be used for collecting revenue. Since the entire structure is uncertain, techniques for economic evaluation, perhaps including simulation, may be needed to suggest the best organizational roles for digital library administration.

Access control and economic charging structures. This topic is related to the previous one, but deals more directly with possible charging algorithms. Authors or readers might pay for information, and they might pay by the month, byte, page, article, minute, or other measures. Practical methods for administration of cost recovery in digital libraries, both for individual users and in the context of site licenses to institutions, are necessary. One very important technical issue is downstream protection: We need technological ways to make it difficult for people to resell copies of purchased information.

Multimedia authoring and querying. Although simple text retrieval is now well understood, searching sounds, images, and video is still a difficult task. We need research on indexing, matching, and clustering of all kinds of media. In addition, there is a danger that the rise of multimedia will decrease the diversity of information sources available because of the increased cost of developing this material. Technology to improve authoring would alleviate this problem.

Individual tools to support use of combined personal and public files in a workstation library for use by one researcher. Individual information systems are going to become commonplace, and the methods by which they are connected to distributed national digital libraries are not certain. Research to improve an individual's use of information is needed to help people make the best use of digital library information.

Publicly managed cryptosystems so that businesses can use a standard form of cryptographic protection while avoiding monopolistic practices. The need for trust in cryptographic software and key servers makes it unlikely that many small vendors can serve this market, and having only one vendor will raise monopolistic risks. Public management of keys will avert these risks.

Evaluation criteria. Basically we wish to have commerce in electronic information continue to grow and thrive, and establish the US as a world leader. We need user acceptance, including institutional and individual reliance on digital libraries, and public acceptance (e.g., when use of the word "library" no longer conjures up an idea of paper books any more than use of the word "watch" implies a circular dial today). Instrumentation of our programs is also important so that we can tell how often and perhaps even how effectively they are being used.

We also thought a few mileposts along the road to acceptability should be noted. Some have already been passed: for example, there exist libraries today (typically in pharmaceutical companies) which spend more on electronics than on paper. We suggested:

  1. A single on-line source sells electronic articles from a variety of publishers.

  2. An electronic information company makes it into the Fortune 500 (almost true today for America Online).

  3. A major library devotes more space to people than to paper.

  4. People throw away books with the same ease and personal comfort that they have when they type "rm".

  5. A faculty member at a top-rate university gets tenure for papers published only electronically.

  6. An ARPA grant is given to a proposal citing only electronic references.