Week 8: Digital preservation strategies

No definitive answer

Librarians and archivists have studied digital preservation issues for more than twenty years, but as yet, there has been no consensus on a reliable digital preservation strategy. There is no definitive approach to the problem of maintaining digital content across multiple generations of technology.

These differences present the curators of digital materials with some fundamental challenges.

The way in which digital material is created, particularly the technologies used, will determine how conducive to long-term preservation those materials are, and will present varied challenges to curators charged with the subsequent management and preservation of the material.

Digital Asset Managers will need to have access to adequate metadata about the material if they are to successfully manage, preserve and keep it accessible. Proliferating multiple copies may also imply multiple versions and somehow the manager needs to ensure the integrity and authenticity of the material. It is important to be aware of changing technologies and the fragility of media and take these into consideration from an early stage in the life cycle of the material.

Within the digital preservation community, the concept of life-cycle (or continuum) management has emerged to describe and document the active management processes that need to take place, and the key decision-making and intervention points along the continuum. The life-cycle concept has been incorporated into the OAIS Reference Model, now adopted as an ISO standard for digital preservation.

Decisions about preservation methods might usefully take into account the following three-tiered understanding of digital preservation:

  1. Preservation of the bit stream (basic sequences of binary digits) that ultimately represent the information stored in any digital resource;
  2. Preservation of the information content (words, images, sounds etc.) stored as bits and defined by a logical data model, embodied in a file or media format;
  3. Preservation of the experience (speed, layout, display device, input device characteristics etc.) of interacting with the information content.

Techniques for achieving the first of these objectives are well understood and include environmentally controlled storage, data replication, backup, and media refreshment. In the OAIS model, much of this activity falls into the archival storage function. The second and third objectives present a far greater challenge.

Binary data remains useful only for as long as it can be correctly rendered (displayed, played back, interacted with) into meaningful content such as sound files and video clips. The process of rendering is performed by a complex mix of hardware and software, which is subject to rapid obsolescence. As a rule of thumb, it is reasonable to predict that current hardware and software will be able to correctly render a file for around ten years after its creation. By the end of this period, repositories need to have adopted a more active preservation strategy than simply preserving the bit stream of the file if they are to maintain access to the information content held in the file. Either old data must be altered to operate in a new technical environment (migration, format standardisation) or the new environment must be modified so that it can render the old data (emulation, virtual computers).

Bitstream preservation

This simply involves retaining the original data as a single sequence of binary digits (bits), i.e. the original data in an uninterpreted state. Bitstreams may be preserved in two ways: either as the original file in the data format as received, for example, an MP3 file; or in a normalised bitstream format, for example, as a sequence of bits contained inside XML wrappers. Bitstream preservation is widely viewed as a form of ‘insurance’ in that it allows for the possibility of using future techniques for making content accessible. It is also an additional form of data backup and a necessary component of any digital preservation strategy.

In either case metadata is needed to make sense of the file. Using the first method, the metadata would need to be kept separately but associated with the data content. In the second method metadata about the file and its format can be included within the XML wrappers surrounding the content bitstream and is thus always kept with the associated content data.
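The second method can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a standard wrapper format: the element names (`preservedObject`, `metadata`, `bitstream`), the sample file name and the choice of Base64 encoding are all invented for the example.

```python
import base64
import hashlib
import xml.etree.ElementTree as ET


def wrap_bitstream(data: bytes, original_name: str, media_type: str) -> str:
    """Wrap raw bytes and their metadata in one XML document.

    The element names used here are hypothetical, chosen for illustration.
    """
    root = ET.Element("preservedObject")
    meta = ET.SubElement(root, "metadata")
    ET.SubElement(meta, "originalName").text = original_name
    ET.SubElement(meta, "mediaType").text = media_type
    ET.SubElement(meta, "sizeInBytes").text = str(len(data))
    ET.SubElement(meta, "sha256").text = hashlib.sha256(data).hexdigest()
    # Base64 keeps arbitrary binary data safe inside a text-based wrapper.
    payload = ET.SubElement(root, "bitstream", encoding="base64")
    payload.text = base64.b64encode(data).decode("ascii")
    return ET.tostring(root, encoding="unicode")


def unwrap_bitstream(xml_text: str) -> bytes:
    """Recover the original bytes and verify them against the stored digest."""
    root = ET.fromstring(xml_text)
    data = base64.b64decode(root.findtext("bitstream"))
    if hashlib.sha256(data).hexdigest() != root.findtext("metadata/sha256"):
        raise ValueError("bitstream does not match its recorded checksum")
    return data


# Stand-in bytes; a real deposit would read the MP3 file from disk.
original = b"\x49\x44\x33\x03\x00 stand-in bytes for an MP3 file"
wrapped = wrap_bitstream(original, "interview.mp3", "audio/mpeg")
restored = unwrap_bitstream(wrapped)
```

Because the metadata travels inside the same wrapper as the content, the two cannot become separated, which is the advantage the second method claims over keeping metadata in a separate file.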

Content/Experience Preservation

Preservation of Technology

Preserving and maintaining hardware and software (applications and operating systems) is essentially creating an IT museum. This may be a useful short-term solution, but the costs of maintaining and storing the equipment make it prohibitive as a sustained strategy.

Migration

Migration can be used to describe both file format migration and media migration. Each has its place in a planned preservation strategy.

Media migration is more often known as ‘refreshing’ and is necessary to ensure that data is not lost through media degradation over time. The lifetime of media must be estimated and migration to new media undertaken before the point where media errors make the data unrecoverable.

Integrity between versions is maintained through the use of fixity or checksum values.
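A fixity check during media refreshing might look like the sketch below. It assumes SHA-256 as the checksum algorithm and uses temporary files as stand-ins for the old and new media; the function names `sha256_of` and `refresh` are invented for this example.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute a file's SHA-256 digest, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source: Path, target: Path) -> str:
    """Copy a file to new media; raise if the copy's checksum differs."""
    before = sha256_of(source)
    shutil.copyfile(source, target)
    after = sha256_of(target)
    if before != after:
        raise IOError(f"fixity check failed for {target}")
    return after


# Temporary files stand in for the old and the new storage media.
with tempfile.TemporaryDirectory() as workdir:
    old_copy = Path(workdir) / "old_media.bin"
    new_copy = Path(workdir) / "new_media.bin"
    old_copy.write_bytes(b"example archival content")
    recorded = refresh(old_copy, new_copy)
    copy_is_intact = sha256_of(new_copy) == recorded
```

In practice the recorded checksum would be stored as preservation metadata and recomputed at intervals, so that silent corruption is detected while a good copy still exists.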

File format migration is used to ensure the accessibility of a digital object when the software it depends on becomes obsolete or unusable. It can involve conversion of digital objects to a newer version of the same format or to a different format altogether, for example from Word 98 to Word 2000, from Word 2000 to Adobe’s Portable Document Format (PDF), or from GIF to PNG.

Some attributes of the digital object may be lost during the conversion process, so the experience may not be equivalent after migration. The level of data loss through migration depends on the number of preservation treatments applied to the record, the choice of process, the new data format, the level of human intervention, and post-migration descriptive work.

A variant of this approach involves the migration of all file formats to a limited range of formats or even a standardised file format which is chosen for its presumed longevity as a digital format, for example, XML. This migration technique is not yet widely used but has been adopted by the National Archives of Australia where it is referred to as ‘normalisation’. It too may have a place in preservation strategies but depends to some extent on the availability of tools or applications that can ‘normalise’ source formats to the preservation format, so is not suitable for all preservation institutions.

Emulation

Emulation is a technique often proposed as a solution to the hardware/software obsolescence problem. Put simply, emulation involves the development of software to replicate the behaviour of obsolete processes (such as hardware configurations or operating systems) on current hardware. Emulation aims to recreate part of the original process that interprets the data to produce a modern rendering of the original performance. Much emulation work is motivated by a belief that the original ‘look and feel’ of a digital resource must be maintained forever. ‘Look and feel’ includes not only the content of the resource, be it a moving image or a sound file, but also tangible aspects of the presentation of the content, such as structure, layout, and functionality.

The major problem with emulation is that it requires not only the retention of all relevant software applications (i.e. one for every type of file format being preserved), but also the coding of an emulator for every hardware platform and operating system in use. In addition, proponents of the emulation approach assume that the original ‘look and feel’ is of prime importance to digital preservation, yet have largely avoided discussing whether it is.
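The idea of replicating an obsolete process in software can be illustrated with a toy example. The machine below is entirely hypothetical: its three-instruction set (LOAD, ADD, PRINT) is invented here, whereas a real emulator has to reproduce the documented behaviour of actual hardware or of an actual operating system.

```python
def emulate(program, memory_size=8):
    """Interpret instructions for a hypothetical obsolete machine.

    The three-instruction set (LOAD, ADD, PRINT) is invented for this
    example; a real emulator must reproduce a documented architecture.
    """
    memory = [0] * memory_size
    output = []
    for opcode, *operands in program:
        if opcode == "LOAD":      # LOAD address, value
            address, value = operands
            memory[address] = value
        elif opcode == "ADD":     # ADD target, left, right
            target, left, right = operands
            memory[target] = memory[left] + memory[right]
        elif opcode == "PRINT":   # PRINT address
            output.append(str(memory[operands[0]]))
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return output


# The "old" program is kept unchanged; only the emulator runs on new hardware.
legacy_program = [
    ("LOAD", 0, 2),
    ("LOAD", 1, 40),
    ("ADD", 2, 0, 1),
    ("PRINT", 2),
]
rendered = emulate(legacy_program)
```

The preserved program is never altered. Only `emulate` would need to be rewritten when the host platform changes, which is also why an emulator must be maintained for every platform in use.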

Universal Virtual Computer

A development related to emulation is the so-called ‘universal virtual computer’ (UVC). This concept was proposed in 2000 by Raymond Lorie of IBM. In brief, the UVC is a virtual representation of a simplified computer that will run on any existing hardware platform. Its appeal seemingly lies in the fact that problems of hardware and software obsolescence become irrelevant, and digital objects can be retained in their original format. It is said by its proponents to have the advantages of both the emulation approach and the format migration approach, with none of the disadvantages.

The only real-world application of the concept is at the Netherlands’ Koninklijke Bibliotheek (KB), where a test implementation developed for preserving digital images (in fact PDF files, each of which is manifested as a sequence of JPEG files) has shown that there is a large software development and maintenance load on implementers. To access the file in the future, the KB will also need to develop a UVC emulator for every hardware/software configuration on which the file will be accessed throughout its usable life.

1. OAIS Reference Model

The functional requirements for the preservation of digital information have been the focus of considerable attention and the Reference Model for an Open Archival Information System (OAIS) (Consultative Committee for Space Data Systems [CCSDS], 2002) has become the accepted standard.

The OAIS functional model identifies the main tasks that any type of repository must perform in order to secure the long-term preservation of digital material. The model defines six main functional entities that describe the activity of a digital repository as a flow of digital material, from the arrival of new material in the repository, its storage and management, and through to its delivery to a user (consumer).

  • Ingest includes the physical transfer of files and the legal transfer of rights through the signing of licences or other agreements that establish the OAIS repository’s right to maintain the ingested material. During ingest, descriptive information (resource discovery metadata) should be created to describe the material, and the submitted files are checked to ensure that they are consistent with the OAIS repository’s data formatting and documentation standards. This may include tasks such as file format conversions or other changes to the technical representation and organisation of the submitted material.
  • Archival Storage is concerned with the bit storage of the submitted digital material including tasks such as backup, mirroring, security and disaster recovery.
  • Access covers all the services and functions needed for users to find and access the contents of the repository.
  • Data Management involves the collection, management and retrieval of resource discovery, administrative and preservation metadata.
  • Administration involves the entire range of administrative activities that an archival organisation should undertake. Notable tasks include managing, monitoring and developing the repository’s software systems, negotiating submission agreements with producers (authors), and establishing policies and standards for the repository.
  • Preservation Planning includes four sub-entities associated with identifying preservation risks and developing plans to address them:
    • Monitor Designated Community – refers to the community of stakeholders who have an interest in the content of the repository and the need to monitor its adoption of new technology, and other trends that may affect preservation of the community’s digital output.
    • Monitor Technology – ensures that the repository is constantly aware of technological changes that may render its current holdings obsolete or difficult to access.
    • Develop Preservation Strategies and Standards – develops strategies and standards for preservation that are informed by the current and future requirements of the producers and consumers of the repository.
    • Develop Packaging Designs and Migration Plans – accepts standards for file formats, metadata and documentation (generated as part of the administration functional entity) and creates tools or defines techniques that apply these standards to submissions.
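The flow of material through some of these entities can be sketched as a very small in-memory repository. This is only an illustration of the division of responsibilities between Ingest, Archival Storage, Data Management and Access: the class, its method names, the accepted formats and the sample record are assumptions made for the example and are not defined by the OAIS model.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Repository:
    """Toy repository; every name here is hypothetical, not part of OAIS."""
    accepted_formats: frozenset = frozenset({"xml", "tiff", "wav"})
    storage: dict = field(default_factory=dict)    # Archival Storage
    catalogue: dict = field(default_factory=dict)  # Data Management

    def ingest(self, identifier, data, file_format, title):
        """Ingest: check the submission, describe it, then store it."""
        if file_format not in self.accepted_formats:
            raise ValueError(f"format not accepted: {file_format}")
        self.storage[identifier] = data
        self.catalogue[identifier] = {
            "title": title,
            "format": file_format,
            "sha256": hashlib.sha256(data).hexdigest(),
        }

    def access(self, identifier):
        """Access: return the content only if its fixity still holds."""
        data = self.storage[identifier]
        record = self.catalogue[identifier]
        if hashlib.sha256(data).hexdigest() != record["sha256"]:
            raise IOError("stored content no longer matches its checksum")
        return record["title"], data


repo = Repository()
repo.ingest("rec-001", b"<minutes/>", "xml", "Board minutes")
title, content = repo.access("rec-001")
```

Administration and Preservation Planning are deliberately left out: they are organisational activities (policies, agreements, technology watch) rather than operations on a single object.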

The model presents six key events that may occur in the full lifecycle of digital moving images and sound:

  1. Creation
  2. Transfer/Ingest
  3. Curation/Preservation
  4. Access and Use
  5. Technical Obsolescence and media migration
  6. Withdraw/reject

At each key event a range of actions is, or should be, taken that will affect the future of the digital audio-visual resources. Many of these actions will affect the longer-term survival of the moving image and sound resources and will determine whether they are merely a collection of bits, or digital objects that remain fit for purpose and usable.

2. Performance model and fundamental nature of digital records

The performance model breaks down the concept of a digital object into a series of fundamental components.

The source contains a unique meaning that interacts with technology in order to be rendered as its creator intended. The process is the technology required to render meaning from the source. When a source is combined with a process, a performance is created, and it is this performance that provides meaning.

More specifically, the source of a digital record is a data file. This data file has a defined structure that varies according to different formats: a Microsoft Word document, a Microsoft Excel Spreadsheet, an Adobe Acrobat file and an HTML web page all use different data formats. The process is the specific combination of computer hardware and software and the configuration needed to understand the file format of a source. A Word source requires the correct version of the Word application, using a Windows operating system, which is installed on a suitable Intel computer. The performance is what is rendered to the screen or to any other output device.
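The relationship between source, process and performance can be shown in miniature. In this sketch the "source" is a short byte string and the two "processes" are plain Python functions; the function names are invented for the example and greatly simplify the hardware and software stack described above.

```python
def performance(source: bytes, process) -> str:
    """A performance is what results when a process renders a source."""
    return process(source)


def text_process(source: bytes) -> str:
    # A process that understands the format: reads the bytes as UTF-8 text.
    return source.decode("utf-8")


def hex_process(source: bytes) -> str:
    # A process that does not understand the format shows only raw bytes.
    return source.hex(" ")


source = "Minutes of the meeting".encode("utf-8")
meaningful = performance(source, text_process)
opaque = performance(source, hex_process)
```

The source is identical in both cases. Only the process differs, and with it the performance: one is readable text, the other a string of hexadecimal digits that carries no meaning for the reader.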

Storage media, such as disks, tapes and cartridges, decay relatively rapidly compared to other media. They are not designed for long term use and are therefore extremely susceptible to short and medium term decay. The short lifetime of contemporary storage media means that a constant media refreshing program is the only way to ensure the survival of digital material. More serious than the decay of storage media is the issue of technological obsolescence. New advances in computer science mean that both hardware technologies and software data formats are superseded over time. Furthermore, market-driven innovations mean that manufacturers update and release new systems, software applications and hardware technologies at a rapid rate. In terms of the performance model, the structure of the source object and the process that these structures depend on are in a constant state of development and change. As a result, without intervention by archivists to preserve the source and process, the performance cannot be guaranteed.

The performance model shows that neither the source nor the process need be retained in their original state for a future performance to be considered authentic. As long as the essential parts of the performance can be replicated over time, the source and process can be replaced.

Essence

The performance model demonstrates that digital records are not stable artefacts; instead they are a series of performances across time. Each performance is a combination of characteristics, some of which are incidental and some of which are essential to the meaning of the performance. The essential characteristics are what we call the ‘essence’ of a record.

Determining the essence of records is not a science and is open to subjectivities and archival interpretation, but it is essential to an efficient, effective and accountable preservation program.
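Once the essence of a record has been decided, checking that it survives a preservation treatment becomes a simple comparison. The sketch below assumes, purely for illustration, that the text and the order of headings are essential while the font is incidental; the characteristic names and sample values are invented.

```python
# Assumed essence of this record type; a real list comes from archival appraisal.
ESSENTIAL = ("text", "heading_order")

original_performance = {
    "text": "Budget approved.",
    "heading_order": ["Attendance", "Decisions"],
    "font": "Times New Roman",   # treated as incidental in this example
}
migrated_performance = {
    "text": "Budget approved.",
    "heading_order": ["Attendance", "Decisions"],
    "font": "Arial",
}


def essence_preserved(before, after, essential=ESSENTIAL):
    """True if every essential characteristic is identical in both performances."""
    return all(before.get(key) == after.get(key) for key in essential)


same_essence = essence_preserved(original_performance, migrated_performance)
```

Here the font has changed but the essence has not, so the migrated performance would still be considered authentic. Had the list of essential characteristics included the font, the same migration would have failed the check.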

Proprietary data formats are unsuitable for long-term preservation and accessibility of digital records, particularly for an organisation committed to free long-term access to digital records.

The digital domain is dominated by organisations that invest heavily in new product development and seek to recoup that investment by energetically protecting their intellectual property rights over their products. As a result, many of the information technologies used to create archival digital records are proprietary. Licences from the intellectual property owners are needed to use software applications, hardware components and to structure source objects. Indefinite access to technologies is not a standard condition of the licences, meaning that IT vendors can change licence conditions or withdraw products from the market without consultation in order to support other aspects of their business.

Archival data formats

The cornerstone of the performance approach is the use of archival data formats that are non-proprietary and specifically designed for long-term access across different computer platforms. Archival data formats are formats that digital data objects are converted into for preservation purposes.

Within the archival and digital library communities there have been many candidate archival formats suggested over the last decade. Adobe’s Portable Document Format (PDF), for instance, is often nominated as an archival format for typical office documents. PDF presents the digital record as if it were a printed page. This means that for any digital record saved to this format, its look and feel is fundamentally one of text and images designed to fit a particular page size. However, proposals to use formats such as PDF normally suppose that the entire range of preservation requirements for digital records can be satisfied by a single data format.

The idea of creating data formats to meet the preservation needs of many record types is not as daunting as it first seems. Mark-up language technology, and specifically XML, allows us to quickly and easily create our own non-proprietary archival formats that can preserve a record’s essence. XML is the currently preferred archival data format.

When a digital record is ingested, it undergoes a single preservation treatment, called normalisation. Normalisation is the conversion of the source object from its original data format into an XML-based archival format. The conversion work is automated by using specific software applications, called normalisers, that convert the original source object into XML. The newly created preservation master is then stored in a digital repository, along with the original transferred source object (see figure 5). The major difference between normalisation and many other forms of migration is that records are migrated only once into archival data formats, and do not enter into an ongoing, cyclical migration process.

The proposed preservation process involves two major processes: normalisation to convert the original source object into XML, and transformation to convert the XML into an accessible format.
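The two steps can be sketched for the simplest possible case, a plain-text source. This is not one of the normalisers referred to above: the XML vocabulary (`document`, `paragraph`) and the choice of HTML as the access format are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from html import escape


def normalise(plain_text: str, title: str) -> str:
    """Normalisation: convert a source object into an XML preservation master."""
    root = ET.Element("document", title=title)
    for block in plain_text.split("\n\n"):
        if block.strip():
            ET.SubElement(root, "paragraph").text = block.strip()
    return ET.tostring(root, encoding="unicode")


def transform(xml_master: str) -> str:
    """Transformation: convert the XML master into HTML for access."""
    root = ET.fromstring(xml_master)
    parts = [f"<h1>{escape(root.get('title'))}</h1>"]
    parts += [f"<p>{escape(p.text)}</p>" for p in root.findall("paragraph")]
    return "\n".join(parts)


source_object = "First point.\n\nSecond point."
master = normalise(source_object, "Meeting notes")  # done once, at ingest
access_copy = transform(master)                     # repeated whenever needed
```

Normalisation happens once and its output is kept. Transformation can be rewritten and rerun for whatever access format is current, without touching the preservation master.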

XML is not so much a data format or ‘language’ as a set of universal rules for describing data and documents. It does this by providing elements that identify unique sections of data within a digital document. These elements are separated from each other by start and end tags. Each element can also have attributes associated with it that provide further context to the element’s enclosed data.
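A short example makes elements, tags and attributes concrete. The fragment below is invented for illustration (the element names `minutes`, `title` and `item` and their attributes are not taken from any archival schema) and is read using the XML parser in Python's standard library.

```python
import xml.etree.ElementTree as ET

# <minutes> ... </minutes> are the start and end tags of one element;
# date="2005-03-14" is an attribute giving context to the enclosed data.
sample = """
<minutes date="2005-03-14">
  <title>Finance committee</title>
  <item number="1">Budget approved.</item>
</minutes>
"""

root = ET.fromstring(sample)
meeting_date = root.get("date")               # attribute of the root element
title_text = root.findtext("title")           # data enclosed by <title>
item_number = root.find("item").get("number")
item_text = root.findtext("item")
```

Nothing in the fragment says how the data should look on screen. The tags only identify what each piece of data is.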

XML is an open standard maintained by the World Wide Web Consortium (W3C). The W3C is the standards-setting organisation for web data standards such as HTML (HyperText Markup Language, the document format used to encode web pages). The W3C specially developed XML to be cross-platform, so that a document structured using XML can be read on a wide variety of computing platforms (including all Windows platforms, MacOS, and all variants of UNIX and Linux) using a wide variety of software tools.

Unlike HTML, where the display is built into the tags (for example, a tag that marks text as bold), XML separates the content from the formatting. In the minutes example, a style-sheet may indicate that all the content in one tag must be Times New Roman font, 12 point size and bolded, while the content in another may be Arial font and 10 point size. The flexibility of XML means that an archivist can tailor the style-sheet to capture the essence or ‘look and feel’ of any document.
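The separation of content and formatting can be sketched as follows. The style-sheet here is a plain Python dictionary standing in for a real style-sheet language, and the element names `title` and `item` are hypothetical, chosen only to make the example readable.

```python
import xml.etree.ElementTree as ET

# Content only: the XML says what each piece of data is, not how it looks.
content = (
    "<minutes><title>Finance committee</title>"
    "<item>Budget approved.</item></minutes>"
)

# Hypothetical style-sheet: maps element names to presentation rules.
style_sheet = {
    "title": "font-family: 'Times New Roman'; font-size: 12pt; font-weight: bold",
    "item": "font-family: Arial; font-size: 10pt",
}


def render(xml_text, styles):
    """Apply presentation rules to each child element of the document."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root:
        css = styles.get(element.tag, "")
        lines.append(f'<div style="{css}">{element.text}</div>')
    return "\n".join(lines)


rendered_page = render(content, style_sheet)
```

Changing the look of every document of this type means editing the style-sheet alone. The preserved XML content is never modified.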

Assignment 8

What do the performance approach and the idea of ‘essence’ have in common with the Guggenheim Museum’s Variable Media Initiative (see references)? How might you be able to apply this approach to the digital object you have included in our collection? (Post your answer to your blog.)

References

