XML Version
Fornaro, P., Rosenthaler, L. (2016). File Formats for Archiving: Stability and Persistence Issues. In Digital Humanities 2016: Conference Abstracts. Jagiellonian University & Pedagogical University, Kraków, pp. 507-508.
File Formats for Archiving: Stability and Persistence Issues

File Formats for Archiving: Stability and Persistence Issues

1. Introduction

A well chosen file format is an important aspect in the process of archiving digital information. A typical example are image files for archival purposes. Unfortunately it can be shown, that even a well-defined file format like TIFF can have partially proprietary, vendor-based features. Today it is inevitable to digitize analogue photographs or images because the originals are endangered by unstoppable physical decay. Because of the continuous process of scanning, digital images are an important part of our cultural heritage already and they account for a constitutive part of our contemporary multimedia output in social, scientific and economic ambits (Kuny et. al., 1998). Unfortunately any digital object must be migrated in periodic cycles, because of technological changes. Hardware migration is only one of the steps needed to ensure that a digital file can be rendered in future. The file format definition is of the same importance. If such a definition becomes obsolete, the existing file must be converted into another one that is not in danger to be outdated (Heath et. al., 2011). Therefore it is necessary to judge and optimize all relevant factors that define the sustainability of a file format definition. Even more important is the persistence of a file format if a storage solution with very long cycles of migration is used, as presented in the Monolith-Project (Fornaro et. al., 2014) or the Rosetta-Project. In those cases a lifecycle of several decades can be expected.

We discuss in this document the long-term stability of existing image file formats and derive possible new approaches. We will show in detail what weaknesses exist, that endanger the future rendering of the content. In addition an image file format definition for archival needs is proposed, based on the already existing widespread standard TIFF. The proposed approach follows the concept of the Portable Document Format (Oettler et. al.,2013), PDF and its archival derivative PDF/A. The recommended specification is called TI/A, Tagged Image for Archives.

2. Problem

If digital file formats are not well chosen, the content won’t be accessible in future because it can no longer be decoded as a consequence of one or multiple technological issues (Rothenberg, 1995). A “file format” basically defines the logical structure and meaning of the bits within the bit stream and thus it is essential for correct interpretation and proper rendering of the coded data. Unfortunately a file format or parts of its logical structure and definition can become obsolete, like hardware does. As a result the information renders useless, even-though the bit stream is still properly readable. To prevent such obsolescence the file format must be migration. In most cases this is more complex than creating a simple duplicate of a bit stream; the file has to be restructured. In addition every migration can reduce image quality or introduce artefacts. Therefore it is necessary to use a file format for long-term preservation of digital data that is stable, simple, well-documented and reliable. Unfortunately most image file formats do not fulfill the needed requirements. Even open standards like the Tagged Image File Format (TIFF) can have partially vendor specific, proprietary content that decreases long-term stability. TIFF is one of the most widespread formats used to represent high quality image data in archives. TIFF is a well known, established, flexible, adaptable file format for handling images and data within a single file, by including various header tags. TIFF offers some features that are rarely used and not supported by most applications. The TIF format is quite complex and parts of the original definition have become obsolete, while new, not formally standardized additions have been made. As a consequence it would be easily possible to create a TIFF file that is fully conformant to the TIFF Revision 6.0 specifications but would be virtually useless because no existing software is be able to open and render it, migration is the needed.

Migration is an expensive task. Therefor numerous approaches for archival storage solutions exist that do not or only rarely need to be migrated. Most of those solutions make use of a very stable carrier and a simple interface to access data. Monolith is such an example for an “eternal” storage. Monolith (Gubler et. al., 2006) is based on chromogenic optical film, that has a life expectancy of up to 500 years. The data is stored on Monolith as 2D-barcode, enriched with human readable metadata. For such a solution the data format is of very high importance because any format obsolescence reduces migration cycles drastically.

3. Approach

Since a digital archive has the goal that the file can be rendered in a indefinite but possibly far future, a simplistic approach is necessary. Therefor a TIFF suitable for long term archiving should require only a minimal set of features (tags) that are necessary to allow a correct future rendering of the data and the essential descriptive metadata. We therefore propose a subset of the full functionality of TIFF that is fully compatible with the de-facto TIFF standard itself but marks some tags as mandatory, some as optional and some as forbidden in order to guarantee the correct rendering in the future. In addition to the core functionalities, it is crucial to define a minimal set of metadata for archival applications, following standards like Dublin Core or METS (Loeffler et. al., 2007). In analogy to PDF/A format we propose to call this specification TI/A or Tagged Image for Archives.

In cooperation with the University of Girona in Spain and EASY INNOVA, a technology and innovation centre of Girona, we have started the process of specifying TI/A in co-operation with multiple memory institutions of Switzerland and Europe. 


Of course the concept of using a subset of the functionality of TIFF can be applied to any other format common for archiving digital image data like JPEG2000/A (Buckley, 2013) or even video or motion picture like DCP/A (Fornaro et. al., 2014: Goethels, 2009).

4. Results

It can be shown that a smart chosen file format is very important for successful archiving [9]. With the help of numerous institutions and experts we have drafted a recommendation, based on the existing TIFF standard (http://ti-a.org). The exchange of needs, requirements, dos and don’ts will lead us to a final draft specification of an ideal archival file format for high quality image data that is well supported by an international network of experts. Following the original standard definition of TIFF allows us to define a format that is fully compatible with existing decoders. This approach makes it not necessary to have “out on the market” software modified or enhanced by any means.

Based on that preliminary work we will try to have the document standardized by the International Standard Organisation, ISO. Such a precise definition of the functionalities and their implementation in a Tagged Image File for Archives will help to increase the sustainability of the original image format drastically.

Bibliography
  1. Kuny, T. (1998). A digital dark ages? Challenges in the preservation of electronic information. Int. Preserv. News, pp. 8–13.
  2. Rothenberg, J. (1995). Ensuring the longevity of digital documents. Sci. Amer. 272(1): 42–47.

  3. Gubler, D., Rosenthaler, L. and Fornaro, P. (2006). The obsolescence of migration: Long-Term storage of digital code on stable optical media. In Proceedings of IS&T’s Archiving Conference. IS&T, pp. 135–39.

  4. Loeffler, H. (2007). Photo Metadata White Paper. In Baranger, W. (Eds.), IPTC, http://www.iptc.org/std/photometadata/0.0/documentation/IPTC-PhotoMetadataWhitePaper2007_11.pdf (accessed 4. March 2016)
  5. Goethels, A. (2009). General Considerations for Choosing File Formats, Harvard University Library, http://library.harvard.edu/sites/default/files/general_format_considerations.pdf (accessed 4. March 2016).
  6. Heath, T. and Bizer, Ch. (2011). Linked Data: Evolving the Web into a Global Data Space, Synthesis Lectures on the Semantic Web: Theory and Technology, Morgan & Claypool.
  7. Oettler, A. (2013). PDF/A in a Nutshell 2.0, – presentation from the First International PDF/A Conference in Amsterdam, Association for Digital Document Standards e.V., Berlin.
  8. Buckley R. (2013). Using Lossy JPEG2000 Compression For Archival Master Files, Library of Congress Office of Strategic Initiatives, Version 1.1, http://www.digitizationguidelines.gov/still-image/documents/JP2LossyCompression.pdf (accessed 4. March 2016).
  9. Fornaro, P. and Gubler D. (2014). DCP/A: Discussion of an Archival Digital Cinema Package for A V-Media, IS&T Archiving Conference Proceedings, Berlin.
  10. Fornaro P., Wassmer A., Rosenthaler L. and Gschwind R. (2014)
. Monolith: Materialised Bits, the Digital Rosetta Film, DH2014 Conference, Lausanne.