by Cynthia Manley, December 2000
The revolution in information technology has affected access to information in fundamental ways. Traditional libraries are being enhanced and extended electronically as a result of the increased amounts of information available in digital form.
For centuries, libraries have existed to preserve society's cultural artifacts and to provide access to them. The traditional role of libraries has been to collect, organize, store, retrieve and disseminate the data and information that become knowledge. If libraries are to continue to function as keepers of society's cultural artifacts, it's essential that they extend this function into the digital realm because without its cultural artifacts, society has no memory and no mechanism to learn from its successes and failures.
The proliferation of digital technology has pushed libraries to consider digital alternatives to print collections and has affected their ability to ensure the accessibility of digital materials for future users. The library community has recognized that there is an urgent need to organize, provide access to, and preserve the rapidly growing body of information that exists only in digital formats. One organization that exemplifies this heightened awareness is the Council on Library and Information Resources (CLIR), which serves as administrative home to the Digital Library Federation (DLF). The DLF is a consortium of research libraries and archives that is committed to the preservation of digital materials.
Basically, a digital library is a collection of information resources stored in electronic format. But the definition of a digital library depends on who answers the question, because the term means different things to different people. The phrase 'digital library' will have a different meaning depending on whether you are talking to a librarian, a computer scientist, an educator, a journal publisher, or a Web master.
Currently, digital libraries tend to resemble traditional libraries and digital library users tend to exhibit the typical needs and habits of traditional library customers. In spite of that resemblance, there are fundamental differences between the two kinds of libraries and those differences represent both an opportunity and a challenge for the library community in terms of being able to provide permanent access to electronic materials for future use. Archiving is only one of many issues surrounding digital libraries.
The social implications of the preservation issue are staggering. Historically, most of the material preserved in archives has been in printed form, and preservation of that material has been carried out for centuries by libraries and other cultural heritage institutions. "The present archival system has worked well for printed material. It has not however been as successful with other twentieth century media." 1 For example, a great deal of the early history of film and sound recordings is lost, although some artifacts from this period do exist. Part of the problem is technical: many of the early recordings were made with equipment that no longer exists, and the media on which they were made have deteriorated. From a cultural perspective, these 'new' media were not recognized as important parts of the social record until they had achieved a certain level of maturity, popularity, and acceptance, and thus a lot of early materials were not saved. Digital materials are on a similar track.
The combination of the explosive growth of information in digital form and the rapid pace of technological evolution, which leaves hardware and software systems obsolete, has caused librarians to worry about who will be responsible for archiving digital materials and ensuring that important scholarly publications are available to future researchers. The start of the current debate on digital archiving has been attributed to the Task Force on Archiving of Digital Information. At the end of 1994 the Commission on Preservation and Access (CPA) and the Research Libraries Group (RLG) created this task force and charged it with investigating and recommending means to ensure 'continued access indefinitely into the future of electronic records stored in digital electronic form'. In October 1999, the Council on Library and Information Resources (CLIR), the DLF, and the Coalition for Networked Information (CNI) convened a group of publishers and librarians to discuss responsibility for archiving the content of electronic journals.2
The library community is painfully aware that preserving digital documents may require substantial new investments and commitments by institutions and government agencies. Along with other institutions and organizations, libraries have begun to discuss alternative economic and administrative policies for funding and managing digital preservation and have begun to develop conceptual frameworks for metadata that are not restricted to the
The objective of their efforts is to propose a research agenda that should be of interest to all those concerned about digital preservation.
"The technical problems of managing digital information into the future are formidable." 3
The purpose of this paper is to explore the archiving issues faced by digital libraries and to examine current projects that have been developed in order to resolve those issues. It will discuss the scope of the problem and some of the proposed solutions.
The problem of digital archiving has widespread implications that extend beyond the library domain. It affects government records, scientific data, corporate data and many other kinds of materials that are available in digital format. Digital preservation is a complex issue because there are: 1) no clearly defined or accepted standards, strategies or policies for archiving digital material; 2) no agreement regarding who has responsibility for archiving digital materials; and 3) no identifiable infrastructure for digital archiving.
The preservation of digital documents is a matter of concern for more than just libraries. A 1990 House of Representatives report 4 cited a number of cases of significant digital records that had already been lost or were in serious jeopardy of being lost, and the 1997 documentary film Into the Future5 cited additional instances of lost digital materials.
The longevity of digital content is problematic for a number of complex and interrelated reasons. In addition to the technical aspects of this problem, there are administrative, procedural, organizational, and policy issues surrounding the management of digital material. Digital documents are different from traditional paper documents in ways that have significant implications for the means by which they are generated, captured, transmitted, stored, maintained, accessed, and managed. Digital information cannot survive without some form of active preservation. Digital preservation also raises issues concerning jurisdiction, funding, responsibility for successive phases of the digital document life cycle, and the development of policies requiring adherence to standard techniques and practices to prevent the loss of digital information. These issues cannot be meaningfully addressed unless a technical solution to the digital longevity problem is accepted and implemented.
Digital media are vulnerable to loss in two ways: the physical media on which they are stored are subject to physical decay and obsolescence, and the proper interpretation of the documents themselves is inherently dependent on software. It is widely recognized that digital storage media have severely limited physical lifetimes. The National Media Laboratory has published test results showing that a wide range of tapes, magnetic disks, CD-ROMs, and other media have a much more limited lifespan than expected.
Media obsolescence occurs when the medium on which information was disseminated or stored disappears from the marketplace, when equipment capable of reading the medium is no longer produced, or when media-accessing programs for deciphering the information on the medium are no longer written for new computers. Upgrading to a new computer system usually requires abandoning an old storage medium, even if an organization still has documents stored on that medium. The dual problems of short media lifetime and rapid obsolescence have led to the nearly universal recognition that digital information must be copied to new media (refreshed) on a very short cycle (every few years). Copying is a straightforward solution to these media problems, though it is not trivial: in particular, the copy process must avoid corrupting documents via compression, encryption, or changing data formats. Digital documents are usually dependent on application software to make them accessible and meaningful, so copying media correctly ensures at best that the original bit stream of a digital document will be preserved. Fundamental technical and intellectual problems are involved in extending the process of archiving beyond printed materials. Once published, printed works are static, and unless they are printed on acid paper, they will last a long time in a reasonable environment with little attention or effort. Sound, film and similar material are also fixed once published, but they are on media that have variable and often rather short lives.6
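The refresh step described above, copying a document to new media without altering its bit stream, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `sha256_of` and `refresh` are hypothetical, the choice of SHA-256 as the fixity check is an assumption rather than something any of the projects discussed here prescribes, and a real archive would also have to deal with media errors, logging, and scheduling.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source: Path, destination: Path) -> str:
    """Copy source to destination byte for byte and verify the copy.

    Hypothetical helper: raises OSError if the digests differ, which
    would mean the bit stream was altered during the copy.
    """
    before = sha256_of(source)
    # Plain byte copy: no compression, encryption, or format conversion.
    shutil.copyfile(source, destination)
    after = sha256_of(destination)
    if before != after:
        raise OSError(f"refresh corrupted {source}: {before} != {after}")
    return after
```

A successful call guarantees only that the bits survived the move to new media. Whether future software can still interpret those bits is a separate problem.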
The rapid growth of information sources that benefit large segments of society has led to an intense examination of the infrastructure that supports the provision of and access to such resources. A number of projects have been undertaken to explore and identify appropriate archival strategies for storing digital materials. Some of the institutions that are involved in this effort include the Association for Computing Machinery, Carnegie Mellon University, and NIST.
The Andrew W. Mellon Foundation has issued a request for proposals from a select number of libraries that have agreed to work in collaboration with publishers to develop approaches to digital archiving.7 The growth of information over the Internet has attracted more attention from federal, state, educational, and scientific sectors for its potential to enhance learning and research. As the biggest proprietor of archivable data, the Federal Government struggles to preserve the "uncountable" number of records it generates daily.8
A variety of research and exploratory projects seek to answer the following questions: 1) who will decide which parts of our culture are worth preserving for the future? 2) what technological configuration might provide reliability and redundancy for long-term storage? Research repositories globally are working to develop infrastructures for identifying, acquiring, managing, and accessing digital materials. Organizational models for successful digital archives being tested in Europe, Australia, and North America hold promise for institutional and collaborative approaches to a wide range of operations and facilities. Some projects that are exploring appropriate archival practices for storing digital material are the Internet Archive, the Intermemory Project, the Cedars Project, JSTOR and new initiatives proposed by OCLC and IBM.
The Internet Archive is a 501(c)(3) public nonprofit that was founded in 1996 to build an 'Internet library,' with the purpose of offering free access to historical digital collections for researchers, historians, and scholars. The Archive is working to prevent digital documents from disappearing off the Internet.
In keeping with the library tradition of open and free access to literature, the Archive opens its collections to researchers, historians, and scholars to ensure that they have free and permanent access to public materials.
In July 1997, a team of five research scientists at the NEC Research Institute in Princeton, N.J., began developing a new kind of global, distributed computer memory called the Intermemory. The Intermemory was created to "preserve information perfectly and protect it in perpetuity, invulnerable to hackers, power outages, natural disasters, war and just about everything with the possible exception of Armageddon." 9
Intermemory's creators view it as a foundation upon which data structures such as digital libraries may be built. For example, read-only file systems such as an ISO-9660 (CD-ROM) image stored in the prototype Intermemory can be mounted under Linux, enabling the files they contain to be conveniently accessed. Since the prototype Intermemory also runs an http daemon, Web browsers connecting to it can retrieve documents from such a mounted file system. There are other approaches as well to creating a gateway between Intermemory and the Web.
Intermemory has the following advantages:
The Cedars Project officially came into being in April of 1998. It is funded by the Joint Information Systems Committee through the eLib programme and administered through the Consortium of University Research Libraries (CURL). CURL is made up of three institutions - Leeds, Oxford and Cambridge. Recent years have seen a massive increase in the range and volume of digital information resources, and in their acquisition by libraries. In the United Kingdom (UK), as in the United States, there is no formal mechanism for the long-term preservation of digital material. CURL recognizes that there is a pressing need for a digital preservation strategy that addresses the urgency created by technologies that rapidly become obsolete, the current economic situation, and the complex intellectual property rights issues that arise from work in this area. The Cedars Project provides an opportunity for research libraries and other stakeholders in the UK to explore digital archiving in some depth.
Cedars stands for "CURL exemplars in digital archives" and the main objective of the project is to address strategic, methodological and practical issues and provide guidance in best practice for digital preservation. It proposes to do this by working on two levels: through practical demonstrator projects, which will provide concrete experience in preserving digital resources, and through strategic working groups based on broad concepts or concerns, which will articulate preferences and make recommendations of benefit to the wider community. The main deliverables of the project will be recommendations and guidelines as well as practical, robust and scalable models for establishing distributed digital archives. It is expected that the outcomes of Cedars will influence legislation for legal deposit of electronic materials and feed directly into the emerging national strategy for digital archives currently being developed through the National Preservation Office of the British Library.
By converting the older materials to digital formats and making them searchable, JSTOR injects new life into materials that may seem moribund. Established in August 1995, JSTOR is an independent not-for-profit organization created with the assistance of The Andrew W. Mellon Foundation to help the scholarly community take advantage of advances in information technology. JSTOR has adopted a system-wide perspective that takes into account the sometimes conflicting needs of scholars, libraries and publishers. Although its initial focus has been on core scholarly journals, the primary objectives10 of JSTOR are to:
Recently, the Research Libraries Group (RLG) and OCLC Online Computer Library Center began discussing ways the two organizations can cooperate to create infrastructures for digital archiving.11 As a first step, OCLC and RLG have begun collaboration on two working documents to establish best practices for digital archiving. The first, Attributes of a Digital Archive for Research Repositories, will outline the characteristics of reliable archiving services, and the second, Preservation Metadata for Long-Term Retention, will propose approaches for the descriptive and management metadata needed in the long-term retention of digital files. RLG and OCLC hope to bring key players together to identify common practices among those most experienced in the archiving arena. After being reviewed by key stakeholders around the world, the papers are expected to serve as a basis for further exploration of the roles and responsibilities of RLG, OCLC, and others.
Another recent project was initiated in September 2000, when the Koninklijke Bibliotheek (the National Library of the Netherlands) and IBM Nederland signed an agreement to start a project on digital archiving and preservation. This project is expected to result in the creation of a unique storage system that will be the heart of the Dutch deposit library. The National Library will build the DNEP system with the help of funds from the Department of Education, Culture and Sciences. The objective of this project is to ensure that publications will be accessible for future generations. The DNEP system will contain both electronic documents published in the Netherlands and digital publications from other sources.
The long-term digital preservation problem calls for a solution that does not require continual or repeated invention of new approaches every time formats, software or hardware paradigms, document types, or recordkeeping practices change. Since it is not possible to predict future changes, the solution must not require labor-intensive translation or examination of individual documents. It must handle current and future documents of unknown type in a uniform way, while being capable of evolving as necessary. Furthermore, it should allow flexible choices and tradeoffs among priorities such as access, fidelity, and ease of document management.
The information-storage medium of the past couple of millennia has of course been paper. Paper does decay with time, and it is fragile, but it can be read and used without a machine. In contrast, the dual problems of short media lifetime and rapid obsolescence have led to the recognition that digital information must be copied to new media (refreshed) every few years. The copying process should avoid corrupting documents via compression, encryption, or changing data formats. Digital documents are usually dependent on application software to make them accessible and meaningful, so copying media correctly ensures at best that the original bit stream of a digital document will be preserved. It is sometimes suggested that digital documents be printed and saved as hard copy. This is not a true solution to the problem, since many documents (especially those that are inherently digital, such as hypermedia) cannot meaningfully be printed at all, or would lose many of their uniquely digital attributes and capabilities if they were printed. Some companies have begun "refreshing" their aging records by continually copying them onto new storage media using new software, but most institutions have not yet realized that this strategy may be necessary.
An alternative to printing digital documents is to translate them into standard forms or to extract their contents without regard to their original nature. This approach can be dangerous because the meaning of a document may be contextual, and since meaning can depend on the user, what may be a trivial transformation to one user could be a disastrous loss to another. This approach often sacrifices elements such as format, font, footnotes, cross-references, citations, headings, numbering, shape, and color. It can also leave out entire segments such as graphics, imagery, and sound.
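The kind of loss described above can be made concrete with a small Python sketch that extracts only the text of a short HTML fragment and records what is thrown away. The class name `TextExtractor`, the function `extract_text`, and the sample fragment are all made up for illustration. Only the standard-library `html.parser` module is used, and the sketch makes no claim about how any particular archive performs extraction.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keep character data only and record every tag that is discarded."""

    def __init__(self) -> None:
        super().__init__()
        self.text_parts = []    # pieces of plain text, in document order
        self.dropped_tags = []  # names of tags whose structure is lost

    def handle_starttag(self, tag, attrs):
        # Headings, links, images, etc. are noted here and then discarded.
        self.dropped_tags.append(tag)

    def handle_data(self, data):
        self.text_parts.append(data)


def extract_text(markup):
    """Return (plain_text, dropped_tags) for an HTML fragment."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Collapse runs of whitespace left behind by the removed markup.
    text = " ".join(" ".join(parser.text_parts).split())
    return text, parser.dropped_tags


# Hypothetical sample document with a heading, a cross-reference and an image.
sample = (
    '<h1>Report</h1><p>See <a href="#n1">note 1</a> and '
    '<img src="chart.png"> the chart.</p>'
)
text, lost = extract_text(sample)
print(text)  # the words survive
print(lost)  # the heading, link target and image do not
```

The extracted text still reads as a sentence, but the heading level, the cross-reference target, and the image itself are gone, which is exactly the kind of sacrifice one user may find trivial and another disastrous.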
CLIR has supported several studies on digital archiving and has published reports12 on various approaches to the problem, including emulation and migration. With CLIR support, Cornell University Library and Cornell's computing science department have developed a tool for risk analysis that helps assess the long-term risk of migrating selected textual and numeric file formats.
In a 1999 report, Dr. Jeff Rothenberg puts forth a proposal for emulating obsolete software and hardware systems on future, unknown systems, as a means of preserving digital information far into the future. In his opinion, the emulation approach is the only approach offering a true solution to the problem of digital preservation. In the report, he explores the problem of long-term digital preservation and spells out the criteria for an ideal solution. He goes on to describe how to encapsulate a document so that it can be decoded by an emulator, the sequence of events required to preserve the document and to read it on future systems, and the techniques that need to be developed in order to make emulation work.
Digital documents are characterized by the ability to make perfect copies of digital artifacts, to publish them on a wide range of media, to distribute and disseminate them over networks, to reformat and convert them into alternate forms, to locate them, search their contents, and retrieve them, and to process them with automated and semi-automated tools.
The adoption of every new generation of software and hardware technology can result in the loss of information, as documents are translated between formats. The most serious losses are caused by those paradigm shifts which require the complete redesign of documents in order to migrate to the new paradigm. Whenever this happens, documents that are not in continuing use may be abandoned to save the cost of migration, while each document that does migrate may be turned into something unrecognizably different from its original. Migration can often result in subtle losses of context and content.
The Task Force recommended the development of a national system of digital archives to act as repositories for digital information. It suggested that "without the operation of a formal certification program and a fail-safe mechanism, preservation of the nation's cultural heritage in digital form will likely be overly dependent on marketplace forces, which may value information for too short a period and without applying broader, public interest criteria."
One option is to distribute, rather than centralize, the responsibility for archiving digital materials among specific bodies, such as the National Sound Archive, the Public Record Office, or the British Library. Other areas could be handled on a contract basis by commercial organisations with the appropriate resources and expertise, such as publishers, data storage and recovery companies and private picture libraries.
A National Office for Digital Archiving represents one possible way of carrying forward the agenda on digital archiving. The problem of preserving (or at least conserving) digital materials is growing and a coherent, coordinated national strategy needs to be put in place.
Standards can be used to create a framework of data types. Reliance on standards would appear to offer the best solution to the preservation problem by allowing digital documents to be represented in forms that will endure into the future and for which future software will always provide accessibility. Some standards, such as Standard Generalized Markup Language (SGML), have proven worthwhile within their limited scope. Since text is likely always to be a part of most documents, SGML provides a useful capability, even though it does not by itself solve the problems of nontextual representation or of representing dynamic, interactive documents. In fact, if SGML had been adopted as a common format among word processing programs, it would have greatly relieved the daily conversion problems that plague most computer users. Unfortunately, this has not occurred, implying that even well-designed standards do not necessarily sweep the marketplace. Nevertheless, converting digital documents into standard forms, and migrating to new standards if necessary, may be a useful interim approach while a true long-term solution is being developed. Standards may also play a minor role in a long-term solution by providing a way to keep metadata and annotations readable.
Metadata are another element to consider. A vital part of the work encompassed by the Cedars project is the development of a framework for data descriptions which will ensure the long-term preservation of digital materials. Meaningful access to digital objects over periods of decades or even centuries will require a sophisticated system of metadata to describe detailed technical processes, managerial and administrative activities as well as copyright and access control. 13
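To give a sense of what such a metadata record might hold, the following Python sketch pairs a handful of descriptive fields, whose names are borrowed from Dublin Core elements, with technical and administrative fields. It is an illustration only: the class `PreservationRecord`, the fields `rendering_software`, `checksum_sha256` and `events`, and every sample value are hypothetical and are not taken from the Cedars framework or from any published schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class PreservationRecord:
    # Descriptive fields; the names follow Dublin Core elements.
    title: str
    creator: str
    date: str
    format: str
    identifier: str
    rights: str
    # Hypothetical technical and administrative fields (not from a standard).
    rendering_software: str = ""
    checksum_sha256: str = ""
    events: list = field(default_factory=list)

    def log_event(self, when: str, action: str, agent: str) -> None:
        """Append one managerial or administrative event to the history."""
        self.events.append({"when": when, "action": action, "agent": agent})

    def to_json(self) -> str:
        """Serialize the record as plain text so it stays readable."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# All values below are made-up placeholders.
record = PreservationRecord(
    title="Example technical report",
    creator="Example Author",
    date="2000-12-07",
    format="text/html",
    identifier="example:0001",
    rights="Copyright held by the author",
    rendering_software="any HTML 4 browser",
)
record.log_event("2000-12-07", "accessioned", "archive staff")
print(record.to_json())
```

Keeping the record in a plain-text serialization is a deliberate choice in this sketch, since the metadata must remain readable even if the document it describes does not.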
An ideal approach should provide a long-term solution that can be applied uniformly and automatically to all types of documents and media, with minimal human intervention. It should facilitate document management by associating metadata with each document. It should retain as much as desired and feasible of the original functionality and design of each original document, while minimizing translation so as to minimize both labor and the potential for loss via corruption. The ideal approach should also offer alternatives for levels of safety and quality, volume of storage, ease of access, and other attributes at varying costs, and it should allow these alternatives to be changed for a given document, type of document, or corpus at any time in the future. It should offer up-front acceptance testing at accession time, to demonstrate that a given document will be accessible in the future.
Most of the approaches that have been suggested as solutions to the problem of digital archiving will require different players to apply their strengths and perspectives in complementary ways. More research is needed to address the complex and interrelated issues associated with the long-term preservation of digital material for future use.
Few institutions have escaped the Web's impact. This is particularly true for institutions that deliver information. The underlying technologies that grow digital libraries are changing and evolving at a rapid pace. Expectations of what libraries can offer their patrons in digital format and through electronic access are far greater than ever before. It falls to librarians and archivists to hold to the tradition which reveres history and the published heritage of our times. The social implications of archiving are staggering. Archiving is only one of many issues surrounding digital libraries. Other issues include copyright, privacy and free speech, trademark, trade secrets, import/export issues, stolen property, pornography, and the question of who will have access to the libraries.
Although offering the possibility of perfect preservation, digital information also raises many practical barriers to long life. It is often stored on media with short life spans, and it may require reading equipment that has an even shorter life span. Software may be needed to interpret or view the data. Added complexity is associated with some of the new digital works that contain dynamic, interactive digital documents: because they are not fixed in form at publication, they evolve and change. 14
Preservation activities play a critical role in society. Without archived material, it would be extremely difficult for society to exercise the "right to remember". With so many of society's records moving from paper to digital media, digital libraries should become essential players in maintaining that "right to remember".
According to Kenneth Thibodeau, "Digital information technology is creating major and serious challenges for how we're going to preserve anything of our culture and our history. It's also creating opportunities: we'll be able to preserve and use a lot more information than ever before."15
It is hoped that any solution that is implemented will be able to handle issues of corruption of information, privacy, authentication, validation, and preservation of intellectual property rights, and that the solution will be feasible in terms of the societal and institutional responsibilities and the costs required to implement it.
Presently, no single, long-term solution to the problem of digital preservation exists. However, a lot of cooperative and collaborative activity is taking place among those interested in the preservation of digital resources. Judging from the level of activity reflected in the literature, it is only a matter of time before a solution (of some sort) is found. Successfully implemented archiving strategies will allow digital libraries to change electronic content from ephemera to permanent artifacts of our society.
1. National Research Council. Computer Science and Telecommunications Board. The Digital Dilemma: Intellectual Property in the Information Age. National Academy Press, Washington, D.C. 2000, p. 113.
2. The group was asked to consider what would be required to ensure access to electronic journals for 100 years. They also suggested that language should be included in license agreements to designate responsibility for digital archiving.
3. National Research Council. 2000. p. 118.
4. United States Congress. 1990. House Committee on Government Operations. Taking a Byte out of History: The Archival Preservation of Federal Computer Records. House Report 101-987. Washington, D.C.
5. This film was produced by Terry Sanders in association with the Council on Library and Information Resources and the American Council of Learned Societies. It explores the issues behind the survival of digitally stored information into the future.
6. National Research Council. 2000. p. 116.
7. National Research Council. Developing a Digital National Library for Undergraduate Science, Mathematics, Engineering, and Technology Education: Report of a Workshop. National Academy Press: Washington, D.C. 1998.
8. The last attempt to count this data was made early in the 1990s by the National Academy of Public Administration. It identified about 12,000 major databases, but estimated that the count missed at least that many more. The count did not include the large amounts of scientific data at the space and weather agencies, and it excluded data on individual PCs.
9. Information about the goals of the Intermemory Project is taken from its web site: www.intermemory.org.
10. Information about the goals and objectives of JSTOR is taken from its web site: www.jstor.org.
11. RLG and OCLC announced their joint venture in a press release dated March 10, 2000.
12. CLIR publishes newsletters, technical reports, and other items. The full text of many of their publications is available on the CLIR web site. Technical reports covering topics of interest to the library community and others are published throughout the year.
13. See the Dublin Core Metadata Initiative at: purl.oclc.org/dc/ for more details.
14. National Research Council. 2000. p. 117.
15. Kenneth Thibodeau. "To Be or Not To Be: Archives for Electronic Records." Archival Management of Electronic Records, 1991, p. 5.
Arms, William Y. "Key concepts in the Architecture of the Digital Library." D-Lib Magazine. 1995. http://www.dlib.org/dlib/July95/07arms.html.
"The Net Never Forgets." CIO Magazine. April 1999. http://www.cio.com/archive/040199_trendlines.html.
Gleick, James. 1998. "The Digital Attic: Are We Now Amnesiacs? Or Packrats?" http://www.around.com/packrat.html
Guenther, Kim. "The Evolving Digital Library." Computers in Libraries. February 2000.
Haynes, David, David Streatfield, Tanya Jowett and Monica Blake. "Responsibility for digital archiving and long term access to digital data." http://www.ukoln.ac.uk/services/papers/bl/jisc-npo67/digital-preservation.html
Hodge, Gail. "Digital Electronic Archiving: The State of the Art, The State of the Practice." April 26, 1999.
Hodge, Gail. "Best Practices for Digital Archiving: An Information Life Cycle Approach." D-Lib Magazine. January 2000, 6, no.1. http://www.dlib.org/dlib/january00/01hodge.html
Majka, David R. "The Seven Deadly Sins of Digitization". Online. Mar/Apr. 1999, 23: 43-48.
National Research Council. Developing a Digital National Library for Undergraduate Science, Mathematics, Engineering, and Technology Education: Report of a Workshop. National Academy Press: Washington, D.C. 1998.
National Research Council. The Digital Dilemma: Intellectual Property in the Information Age. National Academy Press, Washington, D.C. 2000.
Paepcke, Andreas, Chen-Chuan K. Chang, Hector Garcia-Molina, and Terry Winograd. "Interoperability for Digital Libraries Worldwide." Communications of the ACM. 1998, 41, no. 4.
RLG and OCLC Explore Digital Archiving March 10, 2000, Press release, http://www.rlg.org/pr/pr2000-oclc.html
Rothenberg, Jeffrey. Avoiding Technological Quicksand: Finding a Viable Technical Foundation for Digital Preservation. Report to CLIR, January, 1999.
Schatz, Bruce and Chen, Hsinchun. "Digital Libraries: Technological Advances and Social Impacts." Computer. Feb. 1999.
Tanen, Ben and Anthony Oettinger. "Information Preservation in the Digital Era." January 1996, General Education 156: The Information Age http://www.netspace.org/users/btanen/electronic-records.txt
Thibodeau, Kenneth. "To Be or Not To Be: Archives for Electronic Records." Archival Management of Electronic Records. 1991:1-13.
United States Congress. 1990. House Committee on Government Operations. "Taking a Byte out of History: The Archival Preservation of Federal Computer Records". House Report 101-987. Washington, D.C.
This web site was created as part of an assignment for the Network Applications course (IS567), Fall 2000.
Comments or questions about the site can be sent to cmanley@utk.edu.
Last update: 12/7/2000