Relocution translating / information / images / ideas

ERPANET seminar: Metadata in Digital Preservation

The ERPANET Training Seminar: Metadata in Digital Preservation took place at Archivschule, Marburg, Germany, 3rd-5th September 2003. The following notes focus on points made by speakers that have some bearing on current deliberations on metadata initiatives. I have added links to PDF versions of the presentations available on the ERPANET web site.

Speakers

Full presentation Wendy Duff from University of Toronto opened the proceedings with An Overview of Metadata Initiatives. She stressed the need to establish the purpose for which metadata is being collected/stored, and to define the metadata elements required: that done, we can then consider to what extent the requirement is wholly unique, or conforms to an existing metadata initiative. She reminded us that there are often political motives behind the choice of metadata standards. Wendy believes it is unwise to depend on manual entry of metadata, preferring automatic harvesting of metadata and use of ontologies (aka thesauruses). Perhaps most interesting of all was her demonstration of the excellent ‘metamap’, a diagram of metadata in the style of a Tube map.

Full presentation Steve Knight from National Library of New Zealand described recent strict legislation which defines the scope of the NLNZ mission. This even allows them to circumvent protection of online documents for the purpose of accessioning electronic documents/objects. Among other things, NLNZ has worked on auto-extract tools: converters already developed include Word 2, Word 6, TIFF, BMP, WAV. Metadata is extracted using the appropriate converter, output as XML, then further transformed as necessary. The NLNZ system is compatible with OAIS.

Full presentation Heike Neuroth from Gottingen State and University Library also discussed OAIS as a framework within which other standards can be used as appropriate. XML implementations of metadata standards are good not only for preservation, but also for interoperability between collections over the web. To this end, Gottingen is also making experimental use of SOAP. Discussing Dublin Core, Heike pointed out that it is a strictly minimal metadata set, “simple signposts for digital tourists”, supporting very general discovery needs of users. She also observed that LOC is moving away from using Dublin Core, preferring MODS instead. Heike also recommended that institutions use Collection-level descriptions to explicitly define ambiguous terms such as “digital object”.

Full presentation In a rather chaotic presentation of “unstructured thoughts”, Malcolm Todd from TNA (ex-PRO) expressed further reservations about using Dublin Core: that it addresses discovery, not preservation or records management; that “selective” compliance with this or other standards may not be compliance at all. He talked about the need to anticipate future migrations when defining and implementing metadata systems, and the fact that migration considerations are closely related to those of interoperability. He also mentioned, during a breakout session, that the PRO has developed tools for auto-extraction of metadata from certain proprietary file formats.

Full presentation Palle Aagaard from Danish IT and Telecom Agency also focused on interoperability of metadata independent of media, system, content-type. Free flow of and access to data is only possible with standardised metadata.

Full presentation Michael Day from UKOLN defined interoperability in terms of supporting reuse, exchange and migration of data and metadata. He observed that no single metadata standard is likely, possible, or desirable. What may, therefore, be necessary, are registries of metadata standards, supporting easy discovery and version control of standards, which need to be effectively managed over time. Above all, it is important to focus on implementation: conceptual specifications need to move quickly beyond proof of concept, to viable applications.

Full presentation Andrew Wilson from National Archives of Australia described NAA’s “two-pronged approach”, focusing on resource discovery and recordkeeping (aka records-management). Preservation metadata, he contended, was essentially a superset of these. NAA (like NLNZ) are advocates of DC, adding a few of their own elements to the basic DC set. He also pointed out, much to everyone’s amusement, that since all DC elements are optional, a file with no metadata at all is DC compliant. NAA is still developing its metadata standard, but they see the procedure as follows:

  1. Document data needed
  2. Investigate existing standards
  3. Develop application profile
  4. Develop schemas as necessary.

Full presentation Thomas Severiens, from the Institute for Science Networking, Oldenburg, visualised the onion-like accumulation of metadata layers by digital objects as they pass through different hands (author, department, publisher, library, archive), and described the ‘Meta-Maker’, a simple interface for creating DC metadata.

Full presentation Bill Roberts described a variety of approaches to storing the data, including that of TNA (ex-PRO): digital objects are stored in filestore; metadata is arranged in XML documents (which are stored in a database); links between objects, metadata, etc are managed by the database. The alternative approach of VERS was described: metadata and data are all encapsulated in XML files. Advantages of this approach are that the record is self-contained, and the resulting file is amenable to use of digital signatures; the disadvantage is the added complexity of accessing both object and metadata. He highlighted the flexibility of XML schemas compared with relational database schemas. Like Michael Day, Bill extolled the value of practical experience of implementing solutions rather than producing endless refinements of specifications.

Other speakers included Denis Minguillon (full presentation), talking about archiving data from space, and Lars-Erik Hansen (full presentation) from Swedish Social Insurance Administration.

Themes

Digital objects

Most approaches are based on the concept of a “digital object” with which metadata is associated. An object may be any coherent digital entity (e.g. text file, image file), or a collection of several objects (e.g. web page). Objects may have different attributes and metadata requirements depending on their type, content or origin. It is important to have a clear idea about the classes of object that are being archived, and define metadata accordingly. One interesting approach is to consider the objects as accumulating layers or wrappers of metadata over time, not only for description, but also about how they are managed and handled.

Classes of metadata

Definitions vary, but based on their purpose, the three main classes of metadata seem to be:

Some elements may appear in more than one class.

Existing standards

It was generally agreed that Dublin Core is a very general element set focused on discovery, and not sufficient on its own for other metadata requirements. Several speakers favoured using MODS instead, and this seems to be the direction favoured by the Library of Congress. Compliance with existing standards is desirable, but these may be extended where necessary to meet local needs. A requirement for metadata registries was proposed, to assist with identifying appropriate external schemas (and for version control of published schemas), but there seems to be no such thing in existence at present.

Interoperability

There are two main aspects to developing “interoperable metadata”: using published standards where possible to achieve common semantics, and implementing solutions independent of platform, media, operating system. Using common semantics means that diverse institutions can meaningfully share/exchange metadata; using a common, platform-independent implementation (i.e. an XML-based approach) enables automatic sharing of (meta)data over the web (e.g. using SOAP) regardless of individual hardware/software preferences.

Where metadata requirements go beyond existing standards, or span more than one standard, it is acceptable to use elements from different sources, as long as the standards from which elements are derived are specified. XML has clear support for this mix-and-match approach: using namespaces allows elements to be explicitly associated with different standards; it is also possible to create local (sub)schemas that modify views of external schemas.
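To make the mix-and-match idea concrete, here is a minimal Python sketch of a record that combines Dublin Core elements with one locally defined element, each tied to its source by a namespace. The Dublin Core namespace URI is the published one; the `local` namespace, the `record` wrapper element and the `retentionPeriod` element are hypothetical, invented purely for illustration and not taken from any speaker's system.

```python
import xml.etree.ElementTree as ET

# Published Dublin Core element set namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"
# Hypothetical namespace for locally defined elements (an assumption).
LOCAL_NS = "http://example.org/local-metadata/"

ET.register_namespace("dc", DC_NS)
ET.register_namespace("local", LOCAL_NS)


def build_record(title, creator, retention_period):
    """Build a record mixing Dublin Core elements with one local element."""
    record = ET.Element("record")
    ET.SubElement(record, f"{{{DC_NS}}}title").text = title
    ET.SubElement(record, f"{{{DC_NS}}}creator").text = creator
    # Local extension for a need that Dublin Core does not cover.
    ET.SubElement(record, f"{{{LOCAL_NS}}}retentionPeriod").text = retention_period
    return record


def elements_by_standard(record):
    """Group child element names by the namespace (standard) they belong to."""
    grouped = {}
    for child in record:
        # Namespaced tags look like "{namespace}name".
        namespace, _, name = child.tag[1:].partition("}")
        grouped.setdefault(namespace, []).append(name)
    return grouped


if __name__ == "__main__":
    example = build_record("Seminar notes", "A. Archivist", "P10Y")
    print(ET.tostring(example, encoding="unicode"))
    print(elements_by_standard(example))
```

Because every element carries its namespace, a receiving system can tell which standard defines each element, and can ignore the local ones it does not understand.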

Collecting metadata

There was a general feeling that as much metadata as possible should be collected automatically, and not depend on user input. Many elements are available in the system environment, or can be inferred; others are embedded in the various file formats. For text documents, at least, the need for manual entry of keywords can be obviated by combining full-text indexing with thesauruses (aka ontologies). Both NLNZ and TNA (ex-PRO) claimed to have scripts capable of extracting metadata from some proprietary file formats, e.g. MS Word.
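As a rough illustration of what can be gathered from the system environment without any user input, the following Python sketch reads a few elements from the file system and guesses the format from the file extension. It is not the NLNZ or TNA tooling, which was not shown in detail; the function name and the dictionary keys are assumptions chosen for the example.

```python
import mimetypes
import os
from datetime import datetime, timezone


def harvest_metadata(path):
    """Collect metadata the file system already knows, with no manual entry."""
    stat = os.stat(path)
    # The format is inferred from the extension only; a real extractor
    # would inspect the file contents as well.
    mime_type, _ = mimetypes.guess_type(path)
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "filename": os.path.basename(path),
        "size_bytes": stat.st_size,
        "modified": modified.isoformat(),
        "format": mime_type or "application/octet-stream",
    }


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as folder:
        sample = os.path.join(folder, "notes.txt")
        with open(sample, "w", encoding="utf-8") as handle:
            handle.write("hello")
        print(harvest_metadata(sample))
```

Metadata embedded inside proprietary formats needs format-specific converters of the kind the speakers described; this sketch covers only the elements available from outside the file.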

Approach

The NAA approach is straightforward and appealing:

  1. Document data needed
  2. Investigate existing standards
  3. Develop application profile
  4. Develop schemas as necessary.

In addition, of course, it is also necessary to have specifications for the applications, interfaces and procedures required for ingestion and management of the metadata.

Implementation

Options for physical arrangement of digital objects and metadata include:

  1. Encapsulating objects in metadata using XML (CDATA sections).
  2. Separate metadata files (XML schema) stored with objects. Metadata and objects may additionally be compressed as zip files.
  3. Separate metadata files (XML) stored separately from objects (in filestore or in database).
  4. Metadata stored in database tables (SQL schema).
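The first option can be sketched in a few lines of Python. Note one deliberate substitution: the object is Base64-encoded rather than placed in a CDATA section, because arbitrary binary data cannot safely be carried as CDATA. The element names (`encapsulatedObject`, `metadata`, `content`) are hypothetical and do not follow VERS or any other published schema.

```python
import base64
import xml.etree.ElementTree as ET


def encapsulate(content: bytes, metadata: dict) -> str:
    """Wrap an object and its metadata in a single self-contained XML document."""
    root = ET.Element("encapsulatedObject")
    meta = ET.SubElement(root, "metadata")
    for name, value in metadata.items():
        ET.SubElement(meta, name).text = value
    # Base64 keeps binary content safe inside XML text.
    payload = ET.SubElement(root, "content", encoding="base64")
    payload.text = base64.b64encode(content).decode("ascii")
    return ET.tostring(root, encoding="unicode")


def extract(xml_text: str):
    """Recover the original bytes and the metadata from the XML document."""
    root = ET.fromstring(xml_text)
    metadata = {element.tag: element.text for element in root.find("metadata")}
    content = base64.b64decode(root.find("content").text or "")
    return content, metadata


if __name__ == "__main__":
    wrapped = encapsulate(b"\x00\x01 raw bytes", {"title": "Example object"})
    print(wrapped)
    print(extract(wrapped))
```

The result is a single self-contained file, which suits signing, but every access to the object now involves parsing and decoding, which is the extra complexity noted for this kind of approach.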

Other implementation considerations include: controlling access to objects; control of metadata revisions; establishing authenticity of objects and metadata using checksums or digital signatures.
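Checksum-based checking of this kind might look like the following Python sketch: a digest is recorded in the metadata at ingest and recomputed later to confirm the object has not changed. SHA-256 and the function names are assumptions for the example; no speaker specified an algorithm. A checksum on its own detects change but does not prove who created the object, which is where digital signatures come in.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_checksum):
    """Return True if the file still matches the checksum recorded at ingest."""
    return sha256_of(path) == recorded_checksum


if __name__ == "__main__":
    import os
    import tempfile

    with tempfile.TemporaryDirectory() as folder:
        sample = os.path.join(folder, "object.bin")
        with open(sample, "wb") as handle:
            handle.write(b"archived content")
        recorded = sha256_of(sample)  # stored in the metadata at ingest
        print(is_unchanged(sample, recorded))
```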

