Digital Libraries Research Agenda Report - Appendix 3: Working Group Reports

Report of the Internet Perspective Working Group

IITA Digital Libraries Workshop

Mike Schwartz

1. Introduction: The Internet Perspective and Focus of This Working Group

2. Scenarios

2.1. Scenario 1: Unassisted Student

2.2. Scenario 2: Reference Librarian

2.3. Scenario 3: Scientist

3. Interoperation Problems

4. Recommendations

1. Introduction: The Internet Perspective and Focus of This Working Group

Our working group began by trying to elucidate successful aspects of approaches taken by current Internet information systems that could be applied to digital libraries, as well as to identify aspects that might hinder digital library efforts. We discussed three broad characteristics that we feel define essential aspects of successful Internet information technologies. First, new technologies must be easily deployed and used. Second, new technologies must interoperate with legacy systems while providing new functionality. This was seen, for example, with the introduction of the World Wide Web (which interoperates with older Gopher servers), and more recently with the introduction of Sun's HotJava browser (which interoperates with older WWW servers). Finally, we felt that a critical aspect of the Internet is that success is primarily measured by how widely and quickly a technology is adopted. There is a clear history in the Internet of information systems that have undergone exponential usage growth (e.g., the Domain Name System, Gopher, and WWW), and reaching this stage of technology transfer should be a goal of digital library projects if they are to make a serious impact on the Internet information infrastructure.

Of equal note are areas where Internet approaches and technologies do not mesh well with the future needs of society in general and digital libraries in particular. Some especially pressing problems are the lack of widely adopted support for billing, privacy, security, and related services, and the lack of global location-independent naming. Clearly, each of these problems is an area of active research and development. Even so, there are problems that need more support than that provided by (for example) the current generation of experimental electronic commerce systems, such as the ability to separate content from content-independent aspects of an information object. This has important legal ramifications for handling data with intellectual property value.

Another area of concern is the historical Internet bias towards free software and information services. We note that digital libraries must transcend this philosophy if they are to serve valuable intellectual content. However, we believe that the essential strengths of the Internet will not exclude commercially created software and services, as long as they address real problems, are easily installed and used, and are based on open standards. Indeed, it seems clear that the PC software industry will play an increasingly critical role in the development of the next generation of Internet information services.

Given this background perspective, we focused on the question of what must be done to help the digital library infrastructure blossom. At the end of the current NSF/ARPA/NASA project funding, we would like to see a suite of easily used tools and widespread adoption of these tools by the many communities that use the Internet. In short, we want to see thousands of network-accessible digital libraries, not just six prototype demonstrations whose corpora become unreachable after funding ceases. Given the research focus of the funded efforts, this is a somewhat ambitious goal, yet we believe it is possible to achieve this level of success if priorities are set appropriately.

Rather than developing a long list of needed software components or a reference architecture to foster interoperation, we felt a more promising course would be to explore issues that adversely impact interoperation and deployment, and then enumerate a modest number of recommended priorities. We felt that the priorities should span more than just research; what is needed is a combination of research, development, and standardization.

Towards this end, we decided to create a few concrete digital library scenarios, with the intent of uncovering a range of differences that could lead to problems with deployment, interoperation, and technology acceptance. When creating these scenarios we explicitly chose not to try to cover "all" possible digital library visions, or even to force the scenarios to be visionary and futuristic. We felt that such speculative efforts are not likely to yield fruit in our current task.

The remainder of this report presents the scenarios, interoperation problems, and recommendations. One comment is in order before proceeding, however. It is debatable how far the Internet will eventually reach -- to some people it is primarily a data communications network, while others feel it will eventually subsume all communications functions, including telephony, television, and other networking. Rather than trying to offer our own predictions, we simply note here that the ensuing discussions should be considered in the context of a potentially much broader communications infrastructure than today's Internet. For example, one of the working group participants noted that, in one possible future scenario, television broadcasts would include an auxiliary data stream consisting of pointers into digital library source materials, to allow users to explore items of particular interest after viewing the broadcast. Clearly, such possibilities are limited only by our imaginations.

2. Scenarios

To help set the scope, before defining the scenarios we held a brief discussion with the goal of creating a concise definition of the national digital library infrastructure. We chose the following one-sentence definition:

The national digital library infrastructure is an interactive medium in which producers and consumers can participate and which at a minimum would entail digital representations of existing and new more dynamic resources.

This definition purposefully avoids mentioning particular technologies. For example, a network is not required; a CD-ROM-based library would qualify. The definition also avoids discussing particular types or sources of data and particular information processing techniques. A digital library might encompass only search and retrieval, or it might also include facilities for evaluation and visualization, etc. What is more important is the eventual interconnectivity of digital libraries into a national infrastructure.

We created three scenarios, summarized as follows:

- A student searching reference materials for a term paper, unassisted. This scenario is intended to capture a typical library user's situation.
- A reference librarian managing a collection, through the processes of collecting, organizing, and disseminating. This scenario is intended to capture use of the more sophisticated mechanisms available in a research library setting, as might be used, for example, by a researcher working with a librarian while searching for information.
- A scientist studying a spatial problem by combining private data, model generated data, and public data. This scenario includes a number of distinctive features, including active data and different data-sharing domains.

A more detailed discussion of each scenario follows.

2.1. Scenario 1: Unassisted Student

A college student in an environmental policy class turns on her portable computer to begin writing a paper on "environmental justice" that will be based on yet-to-be-found examples of environmental impacts on Native American communities. She connects her OLE browser to the Franklin digital library system to begin her work, and Franklin greets her by name based on an invisible credential exchange.

Franklin first pops a small window up in a corner that offers to show her material on mortgage redlining that has arrived in the past few days. Last week she wrote a paper on this topic, and Franklin has kept her interest on file. She is on a deadline for her current assignment, so she does not pursue this offer.

The student types "environmental," "justice," and "native american" into Franklin subject search boxes, and presses the focus button. Franklin pauses for one second, and then displays a book cart screen that shows related terms, references, and archived queries from other users that are directly related to the conjunction of these subject areas. She quickly selects terms from the abstract of the offered article "Unequal Protection: Environmental Justice and Communities of Color" by Bullard, and requests the article. The article is displayed, and the student first browses it at ten pages per second to locate figures and tables. While viewing the article, she marks a few relevant paragraphs. The student then zooms out from the article, and examines the entire journal issue where the article appears. The student marks a few words and presses focus again.

Franklin now shows other citations related to her original query and the additional items specified by Bullard. In addition, other entries authored by Bullard are displayed. At the same time, other references that matched her original focus request have now arrived from distant libraries, and they are integrated into the book cart. The student selects the term "racism" from another article by Bullard and a previous query constructed by another user on "environmental justice" and presses focus. Franklin pauses for a second, and shows a further revised list of citations. These citations show a wide range of origins, including the Library of Congress, the Harvard library, the index "Ethnic Newswatch," the index to "News from Indian Country," and "Newspaper Abstracts Online." As she watches, the list of relevant items continues to expand. She selects the most relevant references, and finds in one of the articles a mention of a group called Native Americans for a Clean Environment. Selecting this name, she presses focus again, and finds herself at the group's Web server being solicited for a donation. After making a donation using the smart card in her purse, she scans the bulletin board at the Web server, and sends e-mail to a tribal chief to inquire about the struggle he faces.

Finally the student ends her Franklin session, and Franklin asks her to describe her recent query in a few more words to help other users find the trail she has blazed. Franklin then says good-bye, but keeps on looking in the background for relevant information, which will be there the next time she returns.

After reading this scenario, the group briefly debated aspects of the scenario. The primary concern was that the scenario might be expanded to encompass more complex situations:

- The user interface might be a Personal Digital Assistant, allowing free-hand drawing rather than keyboard-based interactions.
- The student's interaction might involve structured data types needing some translation.
- The result set selection might require an auxiliary evaluation step (e.g., as performed by librarians by considering the source and relevance of each information source).
- A for-fee vetting service might also be used to rank sources.

We then created two more scenarios, in an effort to flesh out some of the potential diversity. The lack of professional assistance, vetting, and other "real world" features in the above scenario led the group first into a discussion of the role of reference librarians in traditional libraries. This became the basis for our next scenario.

2.2. Scenario 2: Reference Librarian

Louise is a curator and reference librarian for the Materials Science Virtual Distributed Library (MSVDL), which specializes in materials science research. MSVDL is a virtual library because it does not have any holdings of its own, but catalogs and provides a uniform search and retrieval interface to a distributed set of physical repositories containing materials science resources. Together with colleagues located around the world, Louise identifies, selects, and catalogs digital resources of interest to MSVDL users, who include members of the materials science research community as well as educational and industrial users. The resources include electronic journals, technical reports, data archives, and visualizations of materials science phenomena. The results of Louise's collection management and organizational activities become part of the MSVDL catalog, which is available to members who pay a flat fee, and to other users on a pay-for-use basis. Although MSVDL does not hold copies of the resources it catalogs, the archiving subsystem of the distributed digital library system ensures that each resource remains available as long as it is part of the MSVDL collection, subject to copyright and other restrictions. Use of a globally unique location-independent naming system allows Louise and other catalogers to identify resources unambiguously with names that remain the same over the resource lifetimes. After searching the MSVDL catalog, users may resolve the names returned by a search to locate actual resource instances through use of a reliable and efficient name resolution service. Access to the instances is subject to copyright and other restrictions enforced by the physical repositories containing the instances.
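The separation between persistent names and resource instances described in this scenario can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `NameResolver` class, its method names, the name format, and the repository addresses are hypothetical and do not describe any deployed naming system. The point is only that the catalog stores the name, while the resolver tracks where instances currently live.

```python
class NameResolver:
    """Hypothetical resolver: maps persistent names to current instance locations."""

    def __init__(self):
        self._locations = {}  # name -> list of instance locations

    def register(self, name, location):
        # A resource may have several instances (for example, mirrored copies).
        locations = self._locations.setdefault(name, [])
        if location not in locations:
            locations.append(location)

    def relocate(self, name, old, new):
        # The instance moves; the name recorded in the catalog stays the same.
        locations = self._locations[name]
        locations[locations.index(old)] = new

    def resolve(self, name):
        # Return a copy so callers cannot alter the resolver's internal state.
        return list(self._locations.get(name, []))


resolver = NameResolver()
name = "example-name:msvdl/report-0001"  # hypothetical name format
resolver.register(name, "repo-a.example.org/reports/0001")
resolver.register(name, "repo-b.example.org/mirror/0001")

# A repository reorganizes its holdings; catalog entries need no change.
resolver.relocate(name, "repo-a.example.org/reports/0001",
                  "repo-c.example.org/archive/0001")
print(resolver.resolve(name))
```

A real resolution service would also need distribution, caching, and access control, none of which this sketch attempts.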

In exchange for a fee, Louise will carry out a search for a client consisting of the following steps: 1) an initial email or videoconference reference interview; 2) a preliminary search starting with relevant secondary and tertiary sources (e.g., indexes, surveys, and bibliographies) followed by retrieval of some candidate primary resources; 3) an email or videoconference meeting with the client to evaluate preliminary results; and 4) modification of the user profile and search strategy followed by further retrieval. Louise selects databases to search depending not only on the client's subject area, but also on the degree of specialization and expertise level desired by the client. She attempts to know the source of retrieved information and to evaluate search results in terms of their accuracy, timeliness, and completeness. For users who cannot afford a human intermediary or who prefer to do their own searching, Louise has constructed a hypertext search manual that guides a user through the above steps. Louise also teaches online classes for end users on effective searching.

2.3. Scenario 3: Scientist

Joe Pine is an environmental chemist with the NC Department of Resources. He is studying acid rain deposition in the Appalachians outside Asheville, NC. He's on a field trip placing and maintaining acid rain sensors, when he notices that a large number of Jack Pine trees exhibit insect damage in areas with high levels of acid rain deposition. He wonders if this observation is coincidental or significant.

He begins by identifying the beetle doing the damage. He locates a beetle and, using the video capture on his laptop, grabs its image. He accesses the Scientific Library Advisor and requests matches with the image. The Advisor returns five thumbnail images as possible matches. He selects two that look close, and requests more detail. Based on additional images that show distinguishing features, he identifies the beetle. He requests semantic links for the beetle and is presented with a semantic web. He selects "environmental factors" and then within a subweb selects "pollution effects." He notes an entry "acid rain" and finds three articles listed, none of which looked at geographic distribution, but which suggest that the chemical components in acid rain may stimulate growth in beetles. To look at the correlation, Joe returns to the beetle "root node," selects the "maps" option, and follows a link to "distribution." He is presented with a series of thumbnails, which are identified by date. He selects the one from three years ago and the most recent, which is from one month ago.

Using the map as a frame, he correlates the beetle distribution with acid rain levels. He accesses the EPA information/modeling repository, and retrieves maps for acid rain from three years ago. He overlays the beetle map interactively, varying acid rain levels, noting a tight correlation between beetle populations and "medium" levels of acid rain. He correlates his current acid rain readings for that range with the beetle map from one month ago, once again noting a good match.

He now wants to determine areas that are at risk for beetle infestation. He accesses the EPA acid rain deposition model, and requests a run for six months in the future. He "drops" a link (from his private library) to the NTIS photochemistry module onto the deposition model's parameter screen, feeling that it provides a more accurate analysis than the EPA default. He logs off, after reading the message, "You will be notified when module results are available -- about 1/2 hour."

3. Interoperation Problems

We discussed a number of interoperation problems based on the above three scenarios. Perhaps the main point is that the term "digital libraries" connotes different things to different people, spanning many different types of information technology. Some digital libraries might support search and retrieval operations against managed archival collections, while other digital libraries might support dynamic objects, visualization, and other features.

Given this diversity of technologies, we focused our discussion of interoperation problems on five particular goals:

- Reducing confusion from incompatible tools, formats, and models
- Insulating developers and users from technology instability
- Supporting increasing degrees of data complexity
- Allowing a la carte inclusion of needed technologies
- Sharing R&D technology

4. Recommendations

At this point, the group engaged in a brainstorming session to suggest recommendations for the above interoperation problems. Rather than trying to sort our recommendations by priority, we felt it best simply to present an unordered list of ideas. The recommendations are as follows.

To enhance the ability to evaluate digital library efforts and feed usage experience back into future designs, we recommend that
- A usage record keeping system be integrated into digital libraries;
- Shared analysis tools be developed; and
- The collected data be made publicly available, in a form suitable for preserving the privacy of digital library users.
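One way to picture the three items above is a usage log whose records carry a keyed pseudonym in place of the user's identity, together with a small analysis function that any project could share. This is a sketch under stated assumptions, not a prescribed design: the record fields, the function names, and the key are hypothetical, and pseudonymization by itself may not be enough to protect privacy in published data.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical key: held privately by the library and never published.
SECRET_KEY = b"replace-with-a-private-key"


def pseudonymize(user_id, key=SECRET_KEY):
    """Replace an identity with a keyed hash: records stay linkable, not attributable."""
    digest = hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def make_record(user_id, action, resource):
    """Build one usage record (field names are illustrative assumptions)."""
    return {"user": pseudonymize(user_id), "action": action, "resource": resource}


def actions_per_resource(records):
    """A simple shared analysis: count recorded actions per resource."""
    return Counter(record["resource"] for record in records)


log = [
    make_record("alice@example.edu", "search", "catalog"),
    make_record("alice@example.edu", "retrieve", "article-17"),
    make_record("bob@example.edu", "retrieve", "article-17"),
]
print(actions_per_resource(log))
```

Because the same user always maps to the same pseudonym, session-level analysis remains possible without the published records naming anyone.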

On a related note, it might be worthwhile to initiate a set of independent projects to evaluate the current digital library efforts. There are already efforts underway to consider such evaluation techniques (e.g., an upcoming workshop on the topic to be held October 29-31, 1995 in Allerton Park, IL), and the digital library community should involve itself in these efforts.

We observed that the current generation of information access tools does not support operations users would like, in part because the tools were designed based on the underlying protocol functionality (rather than vice versa). For example, in Scenario 1, the user needs to work with a tool that continues to update a set of located information asynchronously from the user's interactions, but current protocols (such as HTTP) do not provide for this type of support. Accordingly, we recommend that user-centered design techniques be applied to digital library protocols.
We recommend performing research into protocols supporting a range of interaction styles, from batch to highly interactive. The current generation of information access tools operates primarily in batch mode (e.g., retrieving Web pages in response to URL access requests), which does not suit some of the types of interactions that will be needed.
We recommend initiating some group efforts to develop shared ontologies, schemas, and vocabularies. We particularly like the model that was used by the collaboratory and digital library projects, where proposals were fielded with community collaborations already in place.
We recommend supporting user- and group-customizable digital libraries, extending and integrating earlier work on
- database/hypertext views;
- resource discovery; and
- extensible/adaptable interfaces.
We recommend developing easily used tools encouraging higher-level representations so that collections can transition easily through generations of information technology. For example, we would like to avoid the cost of performing new markup when moving to the next "network publishing" paradigm, as happened when the Internet moved from Gopher to the WWW.
We observe that digital libraries can span a number of technologies (information matchmaking, visualization, programmable information appliances, etc.), and that it would be worthwhile to allow digital libraries to be constructed from a set of components. To enable this, we recommend performing research in software engineering/architecture to support a la carte inclusion of needed technologies, and to identify digital library tool classes.
We observe that interoperable Internet information systems emerge from software artifacts rather than pre-defined reference architectures. Therefore, we feel it is important to create a base of shared, reusable software, and to get this software into widespread use and testing. For this purpose we recommend forming a center for integrating, maturing, and redistributing digital library software, similar in spirit to the Berkeley UNIX software distribution that was performed during the 1980s.
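The asynchronous interaction called for in the recommendations on user-centered protocol design and interaction styles (a result set that keeps growing while the user works, as in Scenario 1) can be sketched with Python's asyncio. This is an illustration under assumptions, not a protocol proposal: the simulated libraries, their delays, and the names `query_library`, `focus`, and `on_update` are all hypothetical.

```python
import asyncio


async def query_library(name, delay, hits):
    """Simulate a remote library that answers after `delay` seconds."""
    await asyncio.sleep(delay)
    return name, hits


async def focus(sources, on_update):
    """Issue all queries at once and merge each answer as soon as it arrives."""
    cart = []
    tasks = [asyncio.create_task(query_library(*source)) for source in sources]
    for finished in asyncio.as_completed(tasks):
        name, hits = await finished
        for hit in hits:
            if hit not in cart:  # skip citations already in the book cart
                cart.append(hit)
        on_update(name, list(cart))  # an interface could redraw at this point
    return cart


updates = []
sources = [
    ("distant-library", 0.05, ["citation C", "citation A"]),
    ("local-library", 0.01, ["citation A", "citation B"]),
]
cart = asyncio.run(
    focus(sources, lambda name, snapshot: updates.append((name, snapshot)))
)
print(cart)
```

In contrast to a batch request that returns one complete answer, the callback fires once per responding library, so the faster source appears in the cart before the slower one has answered.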