egaia derive

Generate derivative formats for archival storage and for distribution.

Usage

egaia derive --help
egaia derive [ --item=ITEM ] [ --frame=N ] [ --force | --update ]

Options

--help
Show this help text and exit.
--item=ITEM
Only process this item. Either a bare UUID or a filename containing a UUID may be specified.
--frame=N
Process frame N of a video, or image/page N of a multi-page document, when generating thumbnail and medium-size images. If this option is not given, the first frame/image/page will be used by default.
--force
Overwrite all existing derivatives.
--update
Regenerate derivatives where the derived file is older than the source. This is intended for use with editable files (mainly docx and svg) which may have changed.
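
The interaction between the default behaviour, --force, and --update can be sketched as follows. This is only an illustration of the rules described above, not the tool's actual code; the function name needs_derivative and its parameters are hypothetical.

```python
from pathlib import Path


def needs_derivative(source: Path, target: Path,
                     force: bool = False, update: bool = False) -> bool:
    """Decide whether the derivative at `target` should be (re)generated.

    Hypothetical helper: mirrors the documented behaviour of the
    --force and --update options, nothing more.
    """
    # A missing derivative is always created; --force overwrites everything.
    if force or not target.exists():
        return True
    if update:
        # --update: regenerate only when the derived file is older than its source.
        return target.stat().st_mtime < source.stat().st_mtime
    # Default: existing derivatives are left alone.
    return False
```

With neither flag set, an existing derivative is never regenerated, even if its source has since been edited.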

Requirements

This command makes use of external command-line tools. If the required tools are unavailable on your system, or are not configured properly, some conversions may fail without warning!

conversion tool   media types
wget              url
wkhtmltopdf       url
ffmpeg            audio, video
libreoffice       word processor document
inkscape          vector image
imagemagick       all except plain text and audio
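
Because failed conversions may go unreported, it can be worth checking for the tools before running the command. The sketch below looks each one up on the PATH. The executable names are assumptions: the table gives tool names, not binaries, and on some systems LibreOffice is installed as soffice and ImageMagick as magick rather than convert.

```python
import shutil

# Assumed executable names mapped to the media types they serve.
# Adjust these to match your installation.
ASSUMED_EXECUTABLES = {
    "wget": "url",
    "wkhtmltopdf": "url",
    "ffmpeg": "audio, video",
    "libreoffice": "word processor document",
    "inkscape": "vector image",
    "convert": "all except plain text and audio (ImageMagick)",
}


def missing_tools(executables=ASSUMED_EXECUTABLES, which=shutil.which):
    """Return the names in `executables` that cannot be found on the PATH."""
    return [name for name in executables if which(name) is None]
```

Calling missing_tools() with no arguments returns a list of the assumed executables that are absent; an empty list means all of them were found. This checks only that each program exists, not that it is configured properly.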

Overview of transformations

The table below lists the transformations applied to various input files. Any file whose extension matches one of those listed in the “input” column will be processed to create all of the derivatives listed in the corresponding “outputs” column. The transformation rules are prefixed “pf-” for “preservation format” and “df-” for “distribution format”; only the latter are copied to the catalogue that egaia generates with the mkindex command. The transformation rule is inserted into the filename of the derived object, so for instance myfile.UUID.png would have a derivative named myfile.df-med-img.UUID.jpg.
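
The naming rule can be illustrated with a short sketch. It assumes that tagged filenames have the form name.UUID.extension and that the UUID is written in the usual hyphenated hexadecimal form; the function derived_name is hypothetical and is not part of egaia.

```python
import re

# Assumption: UUIDs appear in canonical 8-4-4-4-12 hexadecimal form.
UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def derived_name(source_name: str, rule: str, extension: str) -> str:
    """Build a derivative filename by inserting `rule` before the UUID."""
    match = UUID_RE.search(source_name)
    if match is None:
        raise ValueError(f"no UUID found in {source_name!r}")
    # Everything before the UUID, minus the separating dot.
    stem = source_name[: match.start()].rstrip(".")
    return f"{stem}.{rule}.{match.group(0)}.{extension.lstrip('.')}"
```

For a source named myfile.UUID.png, derived_name(source, "df-med-img", "jpg") gives myfile.df-med-img.UUID.jpg, matching the example above.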

media type    input                 outputs

audio         .mp3                  pf-wav (.wav)
              .wav                  df-mp3 (.mp3)
              .wma
              .ogg

doc           .odt                  df-pdf (.pdf)
              .odp                  df-med-img (.jpg)
              .doc, .docx           df-thumb-img (.jpg)
              .ppt, .pptx           df-html (.html), docx only

raster        .bmp                  pf-tiff (.tiff)
              .gif                  df-med-img (.jpg)
              .jpg, .jpeg           df-thumb-img (.jpg)
              .png
              .psd
              .tif, .tiff

vector        .eps                  pf-vector (.svg)
              .svg                  df-pdf (.pdf)
                                    df-med-img (.png)
                                    df-thumb-img (.png)

video         .avi                  pf-ffv1 (.mkv)
              .flv                  df-360p-vp9-400k (.webm)
              .mov                  df-stills (.dir)
              .mpeg                 df-contact-sheet (.html)
              .wmv                  df-thumb-vid (.jpg)
              .mp4                  df-med-img-vid (.jpg)
              .webm                 df-h264 (.mp4)
              .ogv

video clips   .vclips               df-concat-list (.txt)
                                    df-concat-offsets (.csv)
                                    pf-ffv1 (.mkv)
                                    df-h264 (.mp4)
                                    df-360p-vp9-400k (.webm)
                                    df-thumb-vid (.jpg)
                                    df-med-img-vid (.jpg)
                                    df-stills (.dir)
                                    df-contact-sheet (.html)

text          .txt                  df-txt (.txt)
              .md                   df-html (.html)
              .rst

url           .url (see below)      pf-webarc (.dir)
                                    pf-screenshot (.png)
                                    df-pdf (.pdf)
                                    df-med-img (.png)
                                    df-thumb-img (.png)

Description of output formats

pf-wav
A lossless audio file in WAV format.
df-mp3
A distribution copy of an audio file, in lossy MP3 format.
df-pdf
For document formats (doc/docx, ppt/pptx, etc.) this is a PDF version of the document as generated by LibreOffice. For web resources specified in a “url” file, this is a PDF corresponding to the print version of the resource; not all original formatting will be preserved.
df-med-img
A medium-sized image (currently 800x600), used on item description pages in HTML output. For screenshots of web resources downloaded from .url files, this will be a cropped screenshot.
df-thumb-img
A thumbnail image of the resource, used in HTML indexes and galleries.
pf-ffv1
A lossless video, using the FFV1 codec in an MKV container. (Currently disabled since the file sizes are very large.)
df-360p-vp9-400k
A distribution copy of a video, reduced to 360p resolution in webm format using the VP9 codec.
df-h264
A distribution copy of the video optimized for upload to video hosting services such as YouTube, Vimeo, or Internet Archive. The video is not scaled, but the bitrate may be reduced from the source file and the moov atom will be placed at the beginning of the file in order to enable streaming.
df-stills
The “stills” directory contains thumbnail images taken at six-second intervals throughout the video.
df-contact-sheet
An HTML index for video thumbnails. Images are embedded using “data” URIs, to create a standalone document. Clicking on an image will copy to the system clipboard the UUID, clip name, and in/out points for a six-second clip centred on the frame represented by that thumbnail, for use with the egaia roughcut tool or an external editing utility.
df-concat-list
A list of source clips in ffconcat v1.0 format, as generated from a directory of raw video footage. The source clips listed in this file are concatenated into a single file, which is then used for the remaining derivatives. (See “archiving video footage” below.)
df-concat-offsets
A list of source filenames from a container directory (<project-name>.vclips) that have been concatenated to produce a single set of derivatives, along with each clip’s offset in seconds from the beginning of the concatenated file. This is used by egaia roughcut for translating time references to a source clip into references to the concatenated version. (See “archiving video footage” below.)
df-txt
Currently, the distribution format for plain-text documents is simply a direct copy of the original resource, though this may change in future.
df-html
HTML generated from docx source documents. This will be a standalone document containing embedded copies of any required media and stylesheets, so it may be copied to other locations and still be used.
pf-webarc
The “web archive” is a directory containing a downloaded copy of the resource specified in a URL file, along with any other resources needed in order to display that resource (e.g., embedded images, stylesheets, or fonts). Links to the associated resources are converted to relative hyperlinks in order to make the resource viewable locally, but are otherwise unchanged.
pf-screenshot
A full screenshot that captures the entire page of a web resource, as viewed in a web browser.
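
The timing arithmetic behind df-stills and df-contact-sheet can be sketched as follows. This is an illustration only: it assumes that the first still is taken at the very start of the video and that clip boundaries are clamped to the length of the video, neither of which is stated above. Both function names are hypothetical.

```python
STILL_INTERVAL = 6.0  # seconds between stills, as described for df-stills
CLIP_LENGTH = 6.0     # seconds per clip, as described for df-contact-sheet


def still_times(duration, interval=STILL_INTERVAL):
    """List the timestamps (seconds) at which stills would be taken.

    Assumption: the first still is at 0 seconds.
    """
    times = []
    t = 0.0
    while t < duration:
        times.append(t)
        t += interval
    return times


def clip_points(centre, duration, length=CLIP_LENGTH):
    """Return (in, out) points for a clip of `length` seconds centred on `centre`.

    Assumption: points falling outside the video are clamped to its bounds.
    """
    half = length / 2
    start = max(0.0, centre - half)
    end = min(duration, centre + half)
    return start, end
```

For a 20-second video, still_times(20) yields stills at 0, 6, 12, and 18 seconds, and the still at 12 seconds corresponds to a clip running from 9 to 15 seconds.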

Archiving video footage

A collection of video clips from a single project, or a single day of shooting, can be treated as a single item within the archive, so that it is not necessary to create a separate finding aid for each source clip. This is particularly useful if you have a large set of relatively short clips on a single topic, which should be treated as a unit.

A set of source clips with the same format, resolution, time base, etc. can be placed in a directory labelled <project-name>.vclips. The container directory, but not the files within it, will be tagged by egaia tag.

The egaia derive tool will first generate two lists of clips – df-concat-list and df-concat-offsets – as described above. All the clips in the source directory will then be assembled to create unified derivatives; thus the file <project-name>.df-h264.<uuid>.mp4 will be a single video that includes all the footage from the <project-name>.vclips directory, and the stills (df-stills) and other derivatives will reference this concatenated video. Source clips will be included in the order of their modification time, which should normally correspond to their creation date.
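
The bookkeeping described here can be sketched in a few functions. This is a minimal illustration under stated assumptions, not the tool's implementation: clip durations are taken as given (in practice they would have to be measured, for example with ffprobe), ties in modification time are broken by filename, and filenames are assumed to contain no single quotes. All function names are hypothetical.

```python
from pathlib import Path


def order_clips(directory):
    """Return the files in `directory` sorted by modification time, oldest first."""
    files = [p for p in Path(directory).iterdir() if p.is_file()]
    # Assumption: clips with identical timestamps are ordered by name.
    return sorted(files, key=lambda p: (p.stat().st_mtime, p.name))


def concat_list(names):
    """Render clip names as an ffconcat version 1.0 list."""
    lines = ["ffconcat version 1.0"]
    lines += [f"file '{name}'" for name in names]  # no quote escaping here
    return "\n".join(lines) + "\n"


def concat_offsets(durations):
    """Map each clip name to its start offset (seconds) in the concatenated file.

    `durations` is an ordered list of (name, seconds) pairs.
    """
    offsets = {}
    elapsed = 0.0
    for name, seconds in durations:
        offsets[name] = elapsed
        elapsed += seconds
    return offsets


def to_concat_time(offsets, name, t):
    """Translate time `t` within clip `name` into concatenated-file time."""
    return offsets[name] + t
```

For example, if the first clip lasts 10 seconds, a reference to 1.5 seconds into the second clip translates to 11.5 seconds in the concatenated video.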

Archiving HTML resources

It is possible to archive web pages with egaia using a “url” file – basically a plain-text file with the extension .url, containing the address of the resource you want to archive. This input file can contain multiple URLs, one per line, but only the first URL will be fully processed. Note that egaia will not be able to process resources requiring cookies, user interaction (e.g., content revealed through “infinite scrolling”), or the submission of login credentials; such resources should be copied by other means. As indicated in the derivatives table, outputs include a downloaded copy of the resource and its supporting media, a full screenshot of the resource as it would appear in a web browser, a PDF representing the “print” view of the web page, and a cropped screenshot and thumbnail.
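
Selecting the URL to process from such a file might look like the sketch below. It assumes that blank lines and surrounding whitespace are ignored, which the description above does not specify; the function name first_url is hypothetical.

```python
def first_url(text):
    """Return the first non-blank line of a .url file's contents, or None.

    Assumption: leading blank lines and surrounding whitespace are skipped.
    """
    for line in text.splitlines():
        line = line.strip()
        if line:
            return line
    return None
```

Given a file containing two addresses on separate lines, first_url returns only the first of them.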

Re-running the command

Re-running the derive command will not overwrite existing derivatives unless the --force or --update flag is given, so it is always safe to call this command after updating a collection. New derivatives will only be created for files that have been tagged by egaia (i.e., any file whose name includes a UUID). During processing, the derived output is written to a file with the suffix .tmp, which is renamed to the target filename once the conversion has completed successfully. If processing is interrupted for any reason, you will need to remove any such temporary files and call the derive command again.
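
The write-then-rename pattern, and a way of locating leftover temporary files after an interruption, can be sketched as follows. This is an illustration of the general technique rather than egaia's own code. It assumes the .tmp suffix is appended to the full target filename; the convert argument stands in for whatever conversion is being run, and both function names are hypothetical.

```python
import os
from pathlib import Path


def write_derivative(target, convert):
    """Write output to '<target>.tmp', then rename it to `target` on success.

    `convert` is a hypothetical callable that writes the derived output to
    the path it is given. If it raises, the .tmp file is left behind.
    """
    target = Path(target)
    tmp = target.with_name(target.name + ".tmp")
    convert(tmp)
    # Reached only if the conversion completed without an exception.
    os.replace(tmp, target)


def stale_temp_files(root):
    """List .tmp files under `root`, e.g. those left by an interrupted run."""
    return sorted(Path(root).rglob("*.tmp"))
```

Since the rename happens only after the conversion finishes, a file with the final name is always complete. Review the list returned by stale_temp_files before deleting anything, since other programs may also create .tmp files.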