The ERPANET Training Seminar: Metadata in Digital Preservation took place at the Archivschule, Marburg, Germany,
3rd-5th September 2003. The following notes focus on points made by speakers that have some bearing on current
deliberations on metadata initiatives. I have added links to PDF versions of the presentations available on the
ERPANET web site.
Speakers
Wendy Duff from the University of Toronto opened the proceedings with An Overview of Metadata Initiatives. She
stressed the need to establish the purpose for which metadata is being collected and stored, and to define the
metadata elements required: that done, we can then consider to what extent the requirement is wholly unique, or
conforms to an existing metadata initiative. She reminded us that there are often political motives behind the
choice of metadata standards. Wendy believes it is unwise to depend on manual entry of metadata, preferring
automatic harvesting of metadata and use of ontologies (aka thesauruses). Perhaps most interesting of all was
her demonstration of the excellent ‘metamap’, a
diagram of metadata in the style of a Tube map.
Steve Knight from the National Library of New Zealand described recent strict legislation which defines the scope
of NLNZ's mission. This even allows them to circumvent protection of online documents for the purpose of accessioning
electronic documents/objects. Among other things, NLNZ has worked on auto-extract tools: converters already
developed include Word 2, Word 6, TIFF, BMP, WAV. Metadata is extracted using the appropriate converter, output
as XML, then further transformed as necessary. The NLNZ system is compatible with OAIS.
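The extract-then-serialise flow described above can be sketched roughly as follows. This is not NLNZ's tool: the `CONVERTERS` registry, the function names and the XML element names are all assumptions made for illustration, and only WAV is covered because Python's standard `wave` module can read its header.

```python
import wave
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_wav(path: Path) -> dict:
    """Read technical metadata from a WAV header using the stdlib wave module."""
    with wave.open(str(path), "rb") as w:
        return {
            "channels": w.getnchannels(),
            "sampleWidthBytes": w.getsampwidth(),
            "sampleRateHz": w.getframerate(),
            "frameCount": w.getnframes(),
        }


# Hypothetical registry: one converter per supported format, keyed by file suffix.
CONVERTERS = {".wav": extract_wav}


def extract_to_xml(path: Path) -> str:
    """Pick the converter for the file's suffix and serialise its output as XML."""
    converter = CONVERTERS.get(path.suffix.lower())
    if converter is None:
        raise ValueError(f"no converter registered for {path.suffix!r}")
    # Element names here are illustrative, not taken from any published schema.
    root = ET.Element("extractedMetadata", {"source": path.name})
    for name, value in converter(path).items():
        ET.SubElement(root, name).text = str(value)
    return ET.tostring(root, encoding="unicode")
```

Supporting a further format would mean adding one function and one registry entry; the XML output could then be passed to whatever later transformation is needed.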
Heike Neuroth from Gottingen State and University Library also discussed OAIS as a framework within which
other standards can be used as appropriate. XML implementations of metadata standards are good not only for
preservation, but also for interoperability between collections over the web. To this end, Gottingen is also
making experimental use of SOAP. Discussing Dublin Core, Heike pointed out that it is a strictly minimal metadata
set, “simple signposts for digital tourists”, supporting very general discovery needs of users. She also
observed that LOC is moving away from using Dublin Core, preferring MODS instead. Heike also recommended that
institutions use collection-level descriptions to explicitly define ambiguous terms such as “digital object”.
In a rather chaotic presentation of “unstructured thoughts”, Malcolm Todd from TNA (ex-PRO) expressed
further reservations about using Dublin Core: that it addresses discovery, not preservation or records
management; that “selective” compliance with this or other standards may not be compliance at all. He talked
about the need to anticipate future migrations when defining and implementing metadata systems, and the fact
that migration considerations are closely related to those of interoperability. He also mentioned, during a
breakout session, that the PRO has developed tools for auto-extraction of metadata from certain proprietary file
formats.
Palle Aagaard from the Danish IT and Telecom Agency also focused on interoperability of metadata independent of
media, system and content type. Free flow of, and access to, data is only possible with standardised metadata.
Michael Day from UKOLN defined interoperability in terms of supporting reuse, exchange and migration of
data and metadata. He observed that no single metadata standard is likely, possible, or desirable. What may
therefore be necessary are registries of metadata standards, supporting easy discovery and version control of
standards, which need to be effectively managed over time. Above all, it is important to focus on
implementation: conceptual specifications need to move quickly beyond proof of concept to viable applications.
Andrew Wilson from the National Archives of Australia described NAA’s “two-pronged approach”, focusing on
resource discovery and recordkeeping (aka records-management). Preservation metadata, he contended, was
essentially a superset of these. NAA (like NLNZ) are advocates of DC, adding a few of their own elements to the
basic DC set. He also pointed out, much to everyone’s amusement, that since all DC elements are optional, a file
with no metadata at all is DC compliant.
NAA is still developing its metadata standard, but they see the procedure as follows:
- Document data needed
- Investigate existing standards
- Develop application profile
- Develop schemas as necessary.
Thomas Severiens, from the Institute for Science Networking, Oldenburg, visualised the onion-like
accumulation of metadata layers by digital objects as they pass through different hands (author, department,
publisher, library, archive), and described the ‘Meta-Maker’, a simple interface for creating DC metadata.
Bill Roberts described a variety of approaches to storing the data, including that of TNA (ex-PRO): digital
objects are stored in a filestore; metadata is arranged in XML documents (which are stored in a database); links
between objects, metadata, etc. are managed by the database. He also described the alternative approach of VERS:
metadata and data are all encapsulated in XML files. Advantages of this approach are that the record is
self-contained, and the resulting file is amenable to use of digital signatures; the disadvantage is the added
complexity of accessing both object and metadata. He highlighted the flexibility of XML schemas compared with
relational database schemas. Like Michael Day, Bill extolled the value of practical experience of implementing
solutions rather than producing endless refinements of specifications.
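To make the first arrangement concrete, here is a minimal sketch of objects in a filestore, XML metadata documents in a database, and the database holding the links between them. The table layout and the function names (`open_catalogue`, `register`, `fetch`) are assumptions for illustration; nothing here reflects the actual TNA design, and SQLite simply stands in for whatever database is used.

```python
import sqlite3
from pathlib import Path


def open_catalogue(db_path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) tables that link stored files to XML metadata."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS object ("
        "id INTEGER PRIMARY KEY, file_path TEXT NOT NULL UNIQUE)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metadata_doc ("
        "id INTEGER PRIMARY KEY, "
        "object_id INTEGER NOT NULL REFERENCES object(id), "
        "xml TEXT NOT NULL)"
    )
    return conn


def register(conn, filestore: Path, name: str, content: bytes, metadata_xml: str) -> int:
    """Write the object to the filestore and record its path and metadata in the database."""
    path = filestore / name
    path.write_bytes(content)
    cur = conn.execute("INSERT INTO object (file_path) VALUES (?)", (str(path),))
    object_id = cur.lastrowid
    conn.execute(
        "INSERT INTO metadata_doc (object_id, xml) VALUES (?, ?)",
        (object_id, metadata_xml),
    )
    conn.commit()
    return object_id


def fetch(conn, object_id: int):
    """Follow the database link to return (object bytes, metadata XML)."""
    row = conn.execute(
        "SELECT o.file_path, m.xml FROM object o "
        "JOIN metadata_doc m ON m.object_id = o.id WHERE o.id = ?",
        (object_id,),
    ).fetchone()
    if row is None:
        raise KeyError(object_id)
    file_path, xml = row
    return Path(file_path).read_bytes(), xml
```

Retrieving a record here takes both a database lookup and a file read. The encapsulated approach avoids that split by keeping everything in one self-contained file, and accepts more complex access to the object in return.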
Other speakers included
Denis Minguillon, talking about archiving data from space, and
Lars-Erik Hansen from the Swedish Social Insurance Administration.
Themes
Digital objects
Most approaches are based on the concept of a “digital object” with which metadata is associated. An object may
be any coherent digital entity (e.g. a text file or image file), or a collection of several objects (e.g. a web page).
Objects may have different attributes and metadata requirements depending on their type, content or origin. It
is important to have a clear idea about the classes of object that are being archived, and define metadata
accordingly. One interesting approach is to consider the objects as accumulating layers or wrappers of metadata
over time, not only for description, but also about how they are managed and handled.
Classes of metadata
Definitions vary, but based on their purpose, the three main classes of metadata seem to be:
- Discovery
- Administrative (including records management)
- Preservation
Some elements may appear in more than one class.
Existing standards
It was generally agreed that Dublin Core is a very general element set focused on discovery, and not sufficient
on its own for other metadata requirements. Several speakers favoured using MODS instead, and this seems to be
the direction favoured by the Library of Congress. Compliance with existing standards is desirable, but these
may be extended where necessary to meet local needs. A requirement for metadata registries was proposed, to
assist with identifying appropriate external schemas (and for version control of published schemas), but there
seems to be no such thing in existence at present.
Interoperability
There are two main aspects to developing “interoperable metadata”: using published standards where possible to
achieve common semantics, and implementing solutions independent of platform, media, operating system. Using
common semantics means that diverse institutions can meaningfully share/exchange metadata; using a common,
platform-independent implementation (i.e. an XML-based approach) enables automatic sharing of (meta)data over
the web (e.g. using SOAP) regardless of individual hardware/software preferences.
Where metadata requirements go beyond existing standards, or span more than one standard, it is acceptable to
use elements from different sources, as long as the standards from which elements are derived are specified. XML
has clear support for this mix-and-match approach: using namespaces allows elements to be explicitly associated
with different standards; it is also possible to create local (sub)schemas that modify views of external
schemas.
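As a small illustration of the namespace mechanism, the sketch below builds a record that mixes Dublin Core elements with one locally defined element, then groups the elements by the standard they come from. The Dublin Core namespace URI is the published one; the `http://example.org/ns/local-meta` namespace, the `retentionYears` element and the function names are invented for the example.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"       # Dublin Core element set namespace
LOCAL = "http://example.org/ns/local-meta"    # hypothetical local extension namespace

# Prefixes only affect how the XML is serialised; the URI identifies the standard.
ET.register_namespace("dc", DC)
ET.register_namespace("local", LOCAL)


def build_record(title: str, creator: str, retention_years: int) -> ET.Element:
    """Build one record mixing Dublin Core elements with a local element."""
    record = ET.Element(f"{{{LOCAL}}}record")
    ET.SubElement(record, f"{{{DC}}}title").text = title
    ET.SubElement(record, f"{{{DC}}}creator").text = creator
    ET.SubElement(record, f"{{{LOCAL}}}retentionYears").text = str(retention_years)
    return record


def elements_by_standard(record: ET.Element) -> dict:
    """Group child element names by the namespace (standard) they belong to."""
    grouped = {}
    for child in record:
        # ElementTree stores qualified tags as "{namespace-uri}localname".
        namespace, _, local_name = child.tag[1:].partition("}")
        grouped.setdefault(namespace, []).append(local_name)
    return grouped
```

Because every element carries its namespace, a receiving system can pick out the Dublin Core elements it understands and ignore, or separately interpret, the local ones.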
Collecting metadata
There was a general feeling that as much metadata as possible should be collected automatically, and not depend
on user input. Many elements are available in the system environment, or can be inferred; others are embedded in
the various file formats. For text documents, at least, the need for manual entry of keywords can be obviated by
combining full-text indexing with thesauruses (aka ontologies). Both NLNZ and TNA (ex-PRO) claimed to have
scripts capable of extracting metadata from some proprietary file formats, e.g. MS Word.
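For the elements that are "available in the system environment", a minimal sketch might look like the following. The function name and the dictionary keys are assumptions. The MIME type is only guessed from the file name, so it is an inference rather than a format identification.

```python
import mimetypes
import os
from datetime import datetime, timezone


def harvest_file_metadata(path: str) -> dict:
    """Collect metadata the operating system already knows, with no user input."""
    st = os.stat(path)
    # guess_type looks only at the file name, not at the file's contents.
    mime, _ = mimetypes.guess_type(path)
    return {
        "filename": os.path.basename(path),
        "sizeBytes": st.st_size,
        "modified": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc).isoformat(),
        "mimeType": mime or "application/octet-stream",
    }
```

Anything embedded inside the file itself (author, title, and so on) would still need a format-specific extractor of the kind NLNZ and TNA described.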
Approach
The NAA approach is straightforward and appealing:
- Document data needed
- Investigate existing standards
- Develop application profile
- Develop schemas as necessary.
In addition, of course, it is also necessary to have specifications for the applications, interfaces and
procedures required for ingestion and management of the metadata.
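One way to picture the "application profile" step is as a list of elements drawn from existing standards, each given a local obligation, against which records can be checked. The sketch below is an assumption-laden illustration, not NAA's profile: the `ProfileElement` type, the chosen elements and the `local:retentionYears` extension are all made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileElement:
    name: str       # qualified element name, e.g. "dc:title"
    source: str     # standard the element is drawn from
    required: bool  # obligation set locally by the profile


# Hypothetical profile: Dublin Core elements plus one local addition.
PROFILE = [
    ProfileElement("dc:title", "Dublin Core", True),
    ProfileElement("dc:creator", "Dublin Core", True),
    ProfileElement("dc:date", "Dublin Core", False),
    ProfileElement("local:retentionYears", "local extension", True),
]


def check_record(record: dict, profile=PROFILE) -> list:
    """Return a list of problems; an empty list means the record fits the profile."""
    known = {element.name for element in profile}
    problems = [
        f"missing required element: {element.name}"
        for element in profile
        if element.required and not record.get(element.name)
    ]
    problems += [
        f"element not in profile: {key}" for key in sorted(record) if key not in known
    ]
    return problems
```

This also shows why a profile is worth having: plain Dublin Core makes every element optional, so an empty record passes, whereas `check_record({})` against this profile reports three missing elements.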
Implementation
Options for physical arrangement of digital objects and metadata include:
- Encapsulating objects in metadata using XML (CDATA sections).
- Separate metadata files (XML schema) stored with objects. Metadata and objects may additionally be
compressed as zip files.
- Separate metadata files (XML) stored separately from objects (in filestore or in database).
- Metadata stored in database tables (SQL schema).
Other implementation considerations include: controlling access to objects; control of metadata revisions;
establishing authenticity of objects and metadata using checksums or digital signatures.
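The first option, combined with a checksum, can be sketched as below. This is a simplified illustration rather than the VERS format: the element names are invented, and the object is base64-encoded instead of placed in a CDATA section, since arbitrary binary content cannot safely be embedded in XML as raw text.

```python
import base64
import hashlib
import xml.etree.ElementTree as ET


def encapsulate(content: bytes, title: str) -> str:
    """Wrap an object and its metadata, including a SHA-256 checksum, in one XML record."""
    root = ET.Element("record")
    meta = ET.SubElement(root, "metadata")
    ET.SubElement(meta, "title").text = title
    ET.SubElement(meta, "sha256").text = hashlib.sha256(content).hexdigest()
    payload = ET.SubElement(root, "object", {"encoding": "base64"})
    payload.text = base64.b64encode(content).decode("ascii")
    return ET.tostring(root, encoding="unicode")


def unpack_and_verify(xml_text: str) -> bytes:
    """Decode the object and check it against the checksum stored in the metadata."""
    root = ET.fromstring(xml_text)
    content = base64.b64decode(root.findtext("object"))
    expected = root.findtext("metadata/sha256")
    if hashlib.sha256(content).hexdigest() != expected:
        raise ValueError("checksum mismatch: object or metadata has been altered")
    return content
```

A checksum stored next to the object only detects accidental corruption, because anyone able to change the object could also recompute the checksum. Establishing authenticity against deliberate alteration would need a digital signature over the record.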

