Basic operations

Egaia will organize your collection in the BagIt archival storage format, with accompanying Dublin Core descriptive metadata and copies of your input files in distribution and preservation formats.

The BagIt format

BagIt is an ideal format for living ethnographic collections, as it allows files to be stored and accessed as ordinary files on disk. No dedicated software is needed to retrieve, transmit, and use a packaged collection. According to the BagIt specification:

BagIt is a hierarchical file packaging format designed to support disk-based or network-based storage and transfer of arbitrary digital content. A bag consists of a “payload” and “tags”. The content of the payload is the custodial focus of the bag and is treated as semantically opaque. The “tags” are metadata files intended to facilitate and document the storage and transfer of the bag. The name, BagIt, is inspired by the “enclose and deposit” method, sometimes referred to as “bag it and tag it”.

BagIt is widely used to preserve digital assets from many different domains. Organizations involved in digital preservation with BagIt include the Library of Congress, Dryad Data Repository, NSF DataONE, and the Rockefeller Archive Center. It is also used in the libraries of many universities, such as Cornell, Purdue, Stanford, Ghent University, New York University, and the University of California. Software implementations have been written in Python, Ruby, Java, Perl, and PHP.

Transformations

The following figures illustrate how egaia transforms a sample collection of two items – an image and a video file.

Input

myCollection/
├─ mySubDirectory/
│  └─ MyImg.jpg
└─ MyVideo.mp4

Output of egaia bag command

For details see the documentation for egaia bag.

myCollection/
├─ data/
│  ├─ mySubDirectory/
│  │  └─ MyImg.jpg
│  └─ MyVideo.mp4
├─ bagit.txt
├─ bag-info.txt
├─ manifest-md5.txt
├─ metadata-en.docx
└─ tagmanifest-md5.txt

Egaia applies the following basic operations to your collection.

  1. Original items are moved into a data/ subdirectory.

    This is the collection “payload”; anything in the top-level directory will be considered metadata.

  2. Top-level BagIt files are created.

    bagit.txt contains version information for BagIt, to allow recipient archives to parse the collection accurately. This file is also used by egaia to locate the root of your collection. Do not delete or modify this file, or your collection will become unusable!

    bag-info.txt provides the metadata for your collection as a whole. It is populated automatically with the values you entered in the configuration stage, but you can edit these manually if you like. You may also add arbitrary metadata fields, so long as they comply with the BagIt format specification. The first time you run the egaia bag command, you will be prompted to enter a title and description that will be stored in this file. Further details can be included in the file metadata-en.docx (described below), which uses Dublin Core fields and is intended to serve as a finding aid.

    manifest-md5.txt contains an automatically generated list of the files in the data/ directory, giving the MD5 checksum and file path of each one. This file is used to verify the integrity of the files in the collection.

    metadata-<LANGUAGE>.docx is a finding aid describing the collection as a whole. The field headings, by default, correspond to Dublin Core metadata elements; see the Dublin Core standard for details. As with item metadata documents (below), all fields are optional, but at minimum you should enter a description under the “description” field. If there is more than one value for a given element – two creators, for example – you can enter these as separate paragraphs. Unlike bag-info.txt, this document is created in word-processor format, intended to be editable by end users. The <LANGUAGE> element in the filename is the language code for your archive (“en” for “English” by default); if you wish to maintain a multilingual archive, it is possible to include several metadata documents with different language codes.

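    tagmanifest-md5.txt is another automatically generated file. It plays the same role as manifest-md5.txt, but for the bag's tag files rather than the payload, so that changes to them can be detected. It should not be edited by hand.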
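The two operations above, and the integrity check that manifest-md5.txt makes possible, can be sketched in a few lines of Python. This is an illustration of the BagIt layout only, not egaia's own code: the function names (md5_of, make_minimal_bag, verify_manifest) are hypothetical, the BagIt-Version value is an assumption, and the sketch writes neither bag-info.txt, the finding aid, nor the tag manifest. It also assumes the collection does not already contain an item named data.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_minimal_bag(root: Path) -> None:
    """Move everything under root into root/data, then write two tag files."""
    items = list(root.iterdir())  # snapshot before data/ exists
    payload = root / "data"
    payload.mkdir()
    for item in items:
        shutil.move(str(item), str(payload / item.name))

    # bagit.txt: the two declaration lines required by the BagIt specification.
    # The version number here is an assumption, not necessarily what egaia writes.
    (root / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # manifest-md5.txt: one "<checksum>  <path relative to bag root>" line per file.
    lines = [
        f"{md5_of(p)}  {p.relative_to(root).as_posix()}"
        for p in sorted(payload.rglob("*"))
        if p.is_file()
    ]
    (root / "manifest-md5.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")


def verify_manifest(root: Path) -> list[str]:
    """Return the payload paths that are missing or whose MD5 no longer matches."""
    problems = []
    manifest = (root / "manifest-md5.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        target = root / rel_path
        if not target.is_file() or md5_of(target) != expected:
            problems.append(rel_path)
    return problems


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        collection = Path(tmp) / "myCollection"
        (collection / "mySubDirectory").mkdir(parents=True)
        (collection / "mySubDirectory" / "MyImg.jpg").write_bytes(b"image bytes")
        (collection / "MyVideo.mp4").write_bytes(b"video bytes")

        make_minimal_bag(collection)
        print(verify_manifest(collection))  # no problems expected right after bagging
```

Because the manifest stores paths relative to the bag root, the check keeps working after the whole collection directory is copied or moved elsewhere.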

Output of egaia tag command

For details see the documentation for egaia tag.

myCollection/
├─ data/
│  ├─ mySubDirectory/
│  │  └─ MyImg.efd4d0e6-6628-48fd-91c7-5176f5dba5e6.jpg
│  ├─ MyVideo.89b0fab6-bf69-4d1b-928f-c1f3a0d78f90.mp4
│  └─ MyVideo.metadata-en.89b0fab6-bf69-4d1b-928f-c1f3a0d78f90.docx
├─ bagit.txt
├─ manifest-md5.txt
├─ bag-info.txt
├─ metadata-en.docx
└─ tagmanifest-md5.txt

  1. Files are tagged with UUIDs, or “Universally Unique Identifiers”.

    Each item is assigned a randomly generated, permanent unique identifier, which is inserted into the filenames of all versions of that item. This way, even if you move a file, change its content, or create new versions in different formats (see below), it should still be possible to trace the metadata associated with that item by searching for the UUID. This identifier differs from the checksums recorded in the manifest files, which are computed from each file’s contents with MD5 and will therefore change if you modify a file even slightly.

  2. Metadata documents (finding aids) are generated for each item.

    These documents take the format <FILENAME>.metadata-<LANGUAGE>.<UUID>.docx. Each document is populated with basic metadata such as file size, video duration, image dimensions, and date. See the documentation for egaia docx for further information about the format of these files.
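The naming scheme described above can be illustrated with a short Python sketch. It is not egaia's implementation: the function names (tag_filename, metadata_filename, find_uuid, group_by_uuid) are hypothetical, and the filename patterns are inferred from the example listing. It shows how a UUID embedded in each filename lets you gather every file belonging to one item, wherever the files have been moved.

```python
import re
import uuid
from collections import defaultdict
from pathlib import Path

# Matches the hyphenated text form of a UUID, as it appears in the filenames.
UUID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def tag_filename(name: str, item_id: uuid.UUID) -> str:
    """Insert the UUID between the file stem and its extension."""
    path = Path(name)
    return f"{path.stem}.{item_id}{path.suffix}"


def metadata_filename(stem: str, item_id: uuid.UUID, language: str = "en") -> str:
    """Build a name of the form <FILENAME>.metadata-<LANGUAGE>.<UUID>.docx."""
    return f"{stem}.metadata-{language}.{item_id}.docx"


def find_uuid(name: str):
    """Return the UUID embedded in a filename, or None if there is none."""
    match = UUID_PATTERN.search(name)
    return match.group(0).lower() if match else None


def group_by_uuid(names):
    """Map each UUID to the list of filenames that carry it."""
    groups = defaultdict(list)
    for name in names:
        item_id = find_uuid(name)
        if item_id is not None:
            groups[item_id].append(name)
    return dict(groups)


if __name__ == "__main__":
    new_id = uuid.uuid4()  # random (version 4) UUID, different on every run
    tagged = tag_filename("MyVideo.mp4", new_id)
    finding_aid = metadata_filename("MyVideo", new_id)
    print(tagged)
    print(finding_aid)
    print(group_by_uuid([tagged, finding_aid, "bagit.txt"]))
```

Since the identifier is carried in the name rather than derived from the contents, editing a file changes its checksum in the manifest but not its UUID.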

Output of egaia derive command

For details see the documentation for egaia derive.

myCollection/
├─ data/
│  ├─ mySubDirectory/
│  │  ├─ MyImg.efd4d0e6-6628-48fd-91c7-5176f5dba5e6.jpg
│  │  ├─ MyImg.pf-tiff.efd4d0e6-6628-48fd-91c7-5176f5dba5e6.jpg
│  │  ├─ MyImg.df-thumb-img.efd4d0e6-6628-48fd-91c7-5176f5dba5e6.jpg
│  │  └─ MyImg.df-med-img.efd4d0e6-6628-48fd-91c7-5176f5dba5e6.jpg
│  ├─ MyVideo.89b0fab6-bf69-4d1b-928f-c1f3a0d78f90.mp4
│  ├─ MyVideo.df-stills.89b0fab6-bf69-4d1b-928f-c1f3a0d78f90.dir/
│  │  ├─ MyVideo.still-0001.89b0fab6-bf69-4d1b-928f-c1f3a0d78f90.jpg
│  │  ├─ MyVideo.still-0002.89b0fab6-bf69-4d1b-928f-c1f3a0d78f90.jpg
│  │  └─ MyVideo.stills-index.89b0fab6-bf69-4d1b-928f-c1f3a0d78f90.html
│  ├─ MyVideo.df-360p-vp9-400k.89b0fab6-bf69-4d1b-928f-c1f3a0d78f90.webm
│  └─ MyVideo.metadata-en.89b0fab6-bf69-4d1b-928f-c1f3a0d78f90.docx
├─ bagit.txt
├─ manifest-md5.txt
├─ bag-info.txt
├─ metadata-en.docx
└─ tagmanifest-md5.txt

  1. Derivative files are created.

    The help text for the egaia derive command includes details about current transformation policies. Typically a file will have two kinds of derivative:

    • distribution format

      A file that is smaller than the original, and in a format that can be read using widely available software. In the case of images, for example, both a thumbnail and a medium-size version of the original image will be created. For videos, alongside a smaller video file (the .webm file in the example above), a subdirectory containing still-image thumbnails will be produced, together with an HTML index. These are the versions that will be used in HTML output. Distribution files are labelled with a pattern matching .df-xxx in the filename.

    • preservation format

      A file in a lossless, open format suited to long-term storage, so that a version of the file is likely to remain readable by software in the distant future. Preservation files are labelled with a pattern matching .pf-xxx in the filename.
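The .df-xxx and .pf-xxx labels make it possible to sort the files of a collection by role from their names alone. The following Python sketch shows one way to do so. It is an assumption-laden illustration rather than part of egaia: the names LABEL, KINDS, classify, and summarize are hypothetical, and the rule that files inside a .df-xxx directory count as distribution files is inferred from the example listing.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# A label is a dot-delimited filename segment such as ".df-thumb-img." or ".pf-tiff.".
LABEL = re.compile(r"\.(df|pf|metadata)-([^.]+)\.")
KINDS = {"df": "distribution", "pf": "preservation", "metadata": "metadata"}


def classify(path: str) -> str:
    """Classify a path as distribution, preservation, metadata, or original.

    The filename is checked first, then each parent directory, so that the
    stills inside a ".df-stills" directory are treated as distribution files.
    """
    for part in reversed(PurePosixPath(path).parts):
        match = LABEL.search(part)
        if match:
            return KINDS[match.group(1)]
    return "original"


def summarize(paths):
    """Group paths by the kind returned from classify()."""
    groups = defaultdict(list)
    for path in paths:
        groups[classify(path)].append(path)
    return dict(groups)


if __name__ == "__main__":
    img = "efd4d0e6-6628-48fd-91c7-5176f5dba5e6"
    vid = "89b0fab6-bf69-4d1b-928f-c1f3a0d78f90"
    listing = [
        f"data/mySubDirectory/MyImg.{img}.jpg",
        f"data/mySubDirectory/MyImg.pf-tiff.{img}.jpg",
        f"data/mySubDirectory/MyImg.df-thumb-img.{img}.jpg",
        f"data/MyVideo.{vid}.mp4",
        f"data/MyVideo.df-stills.{vid}.dir/MyVideo.still-0001.{vid}.jpg",
        f"data/MyVideo.df-360p-vp9-400k.{vid}.webm",
        f"data/MyVideo.metadata-en.{vid}.docx",
    ]
    for kind, members in summarize(listing).items():
        print(kind, len(members))
```

A script like this could, for instance, select only the distribution files when preparing a copy of the collection for sharing, leaving the larger originals and preservation files in place.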