egaia derive
Generate derivative formats for archival storage and for distribution.
Usage
egaia derive --help
egaia derive [ --item=ITEM ] [ --frame=N ] [ --force | --update ]
Options
--help
- Show this help text and exit.
--item=ITEM
- Only process this item. Either a bare UUID or a filename containing a UUID may be specified.
--frame=N
- Process frame N of a video, or image/page N of a multi-page document, when generating thumbnail and medium-size images. If this option is not given, the first frame/image/page will be used by default.
--force
- Overwrite all existing derivatives.
--update
- Regenerate derivatives where the derived file is older than the source. This is intended for use with editable files (mainly docx and svg) which may have changed.
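The comparison behind --update can be sketched as a modification-time check. This is only an illustration of the rule described above, not egaia's actual code; the function name needs_update is hypothetical, and treating a missing derivative as stale is an assumption.

```python
import os


def needs_update(source: str, derived: str) -> bool:
    """Return True if the derivative is missing or older than its source.

    Hypothetical helper: treating a missing derivative as stale is an
    assumption, not documented egaia behaviour.
    """
    if not os.path.exists(derived):
        return True
    # Compare last-modification times, in seconds since the epoch.
    return os.path.getmtime(derived) < os.path.getmtime(source)
```

A docx file edited after its PDF derivative was generated would return True here, which matches the intended use with editable source files.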
Requirements
This command makes use of external command-line tools. If the required tools are unavailable on your system, or are not configured properly, some conversions may fail without warning!
conversion tool   media types
wget              url
wkhtmltopdf       url
ffmpeg            audio, video
libreoffice       word processor document
inkscape          vector image
imagemagick       all except plain text and audio
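Because conversions may fail silently when a tool is missing, it can help to check for the tools before running the command. The sketch below is not part of egaia. The executable names are assumptions: LibreOffice is often installed as soffice, and ImageMagick as convert or magick, depending on the platform and version.

```python
import shutil

# Assumed executable names mapped to the media types they handle.
# Adjust these for your platform; they are not taken from egaia itself.
TOOLS = {
    "wget": "url",
    "wkhtmltopdf": "url",
    "ffmpeg": "audio, video",
    "libreoffice": "word processor document",
    "inkscape": "vector image",
    "convert": "all except plain text and audio (ImageMagick)",
}


def missing_tools(tools=TOOLS, which=shutil.which):
    """Return the names of tools that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    for name in missing_tools():
        print(f"missing: {name} (needed for: {TOOLS[name]})")
```

Finding an executable on PATH does not show that it is configured properly, so this check only catches the simplest failures.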
Overview of transformations
The table below lists the transformations applied to various input files. Any
file whose extension matches one of those listed in the “input” column will be
processed to create all of the derivatives listed in the corresponding
“outputs” column. The transformation rules are prefixed “pf-” for “preservation
format” and “df-” for “distribution format”; only the latter are copied to the
catalogue that egaia generates with the mkindex command. The transformation
rule is inserted into the filename of the derived object, so for instance
myfile.UUID.png would have a derivative named myfile.df-med-img.UUID.jpg.
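The naming rule can be sketched as a small function that inserts the transformation rule before the UUID and swaps the extension. This is an illustration only, not egaia's implementation. The function name derived_name is hypothetical, and the sketch assumes UUIDs appear in the usual hyphenated 36-character form.

```python
import re

# Assumption: UUIDs are written in the standard 8-4-4-4-12 hexadecimal form.
UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def derived_name(source_name: str, rule: str, extension: str) -> str:
    """Build a derivative filename of the form <stem>.<rule>.<uuid><extension>."""
    match = UUID_RE.search(source_name)
    if match is None:
        raise ValueError(f"no UUID found in {source_name!r}")
    # Everything before the UUID, minus the separating dot, is the stem.
    stem = source_name[: match.start()].rstrip(".")
    return f"{stem}.{rule}.{match.group(0)}{extension}"
```

For example, derived_name("myfile.123e4567-e89b-12d3-a456-426614174000.png", "df-med-img", ".jpg") gives myfile.df-med-img.123e4567-e89b-12d3-a456-426614174000.jpg.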
audio
- input: .mp3, .wav, .wma, .ogg
- outputs: pf-wav (.wav), df-mp3 (.mp3)
doc
- input: .odt, .odp, .doc, .docx, .ppt, .pptx
- outputs: df-pdf (.pdf), df-med-img (.jpg), df-thumb-img (.jpg), df-html (.html, docx only)
raster
- input: .bmp, .gif, .jpg, .jpeg, .png, .psd, .tif, .tiff
- outputs: pf-tiff (.tiff), df-med-img (.jpg), df-thumb-img (.jpg)
vector
- input: .eps, .svg
- outputs: pf-vector (.svg), df-pdf (.pdf), df-med-img (.png), df-thumb-img (.png)
video
- input: .avi, .flv, .mov, .mpeg, .mwv, .mp4, .webm, .ogv
- outputs: pf-ffv1 (.mkv), df-360p-vp9-400k (.webm), df-stills (.dir), df-contact-sheet (.html), df-thumb-vid (.jpg), df-med-img-vid (.jpg), df-h264 (.mp4)
video clips
- input: .vclips
- outputs: df-concat-list (.txt), df-concat-offsets (.csv), pf-ffv1 (.mkv), df-h264 (.mp4), df-360p-vp9-400k (.webm), df-thumb-vid (.jpg), df-med-img-vid (.jpg), df-stills (.dir), df-contact-sheet (.html)
text
- input: .txt, .md, .rst
- outputs: df-txt (.txt), df-html (.html)
url
- input: .url (see below)
- outputs: pf-webarc (.dir), pf-screenshot (.png), df-pdf (.pdf), df-med-img (.png), df-thumb-img (.png)
Description of output formats
pf-wav
- A lossless audio file in WAV format.
df-mp3
- A distribution copy of an audio file, in lossy MP3 format.
df-pdf
- For document formats (doc/docx, ppt/pptx, etc.) this will simply be a PDF version of the document as generated by LibreOffice. For web documents specified in a “url” file, this will be a PDF corresponding to the print version of the resource; not all original formatting will be preserved.
df-med-img
- A medium-sized image (currently 800x600), used on item description pages in HTML output. For screenshots of web resources downloaded from .url files, this will be a cropped screenshot.
df-thumb-img
- A thumbnail image of the resource, used in HTML indexes and galleries.
pf-ffv1
- A lossless video, using the FFV1 codec in an MKV container. (Currently disabled since the file sizes are very large.)
df-360p-vp9-400k
- A distribution copy of a video, reduced to 360p resolution in webm format using the VP9 codec.
df-h264
- A distribution copy of the video optimized for upload to video hosting services such as YouTube, Vimeo, or Internet Archive. The video is not scaled, but the bitrate may be reduced from the source file and the moov atom will be placed at the beginning of the file in order to enable streaming.
df-stills
- The “stills” directory contains thumbnail images taken at six-second intervals throughout the video.
df-contact-sheet
- An HTML index for video thumbnails. Images are embedded using “data” URIs, to create a standalone document. Clicking on an image will copy to the system clipboard the UUID, clip name, and in/out points for a six-second clip centred on the frame represented by that thumbnail, for use with the egaia roughcut tool or an external editing utility.
df-concat-list
- A list of source clips in ffconcat v1.0 format, as generated from a directory of raw video footage. The source clips listed in this file are concatenated into a single file, which is then used for the remaining derivatives. (See “archiving video footage” below.)
df-concat-offsets
- A list of source filenames from a container directory (<project-name>.vclips) that have been concatenated to produce a single set of derivatives, along with each clip’s offset in seconds from the beginning of the concatenated file. This is used by egaia roughcut for translating time references to a source clip into references to the concatenated version. (See “archiving video footage” below.)
df-txt
- Currently, the distribution format for plain-text documents is simply a direct copy of the original resource, though this may change in future.
df-html
- HTML generated from docx source documents. This will be a standalone document containing embedded copies of any required media and stylesheets, so it may be copied to other locations and still be used.
pf-webarc
- The “web archive” is a directory containing a downloaded copy of the resource specified in a URL file, along with any other resources needed in order to display that resource (e.g., embedded images, stylesheets, or fonts). Links to the associated resources are converted to relative hyperlinks in order to make the resource viewable locally, but are otherwise unchanged.
pf-screenshot
- A full screenshot that captures the entire page of a web resource, as viewed in a web browser.
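The six-second spacing used by df-stills, and the six-second clip that df-contact-sheet centres on a thumbnail, can be sketched with a little arithmetic. This is an illustration under stated assumptions rather than egaia's code. The function names are hypothetical, the first still is assumed to be taken at 0 seconds, and clips are assumed to be clamped to the bounds of the video.

```python
def still_times(duration: float, interval: float = 6.0) -> list:
    """Timestamps (in seconds) at which stills would be taken.

    Assumption: the first still is at 0 s; egaia may start elsewhere.
    """
    times = []
    t = 0.0
    while t < duration:
        times.append(t)
        t += interval
    return times


def clip_around(frame_time: float, duration: float, length: float = 6.0) -> tuple:
    """In/out points of a clip of the given length centred on frame_time.

    Assumption: the clip is clamped so it never leaves the video.
    """
    start = max(0.0, frame_time - length / 2)
    end = min(float(duration), frame_time + length / 2)
    return start, end
```

For a 20-second video, a thumbnail at 12 seconds would correspond to in/out points of 9 and 15 seconds under these assumptions.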
Archiving video footage
A collection of video clips from a single project, or a single day of shooting, can be treated as a single item within the archive, so that it is not necessary to create a separate finding aid for each source clip. This is particularly useful if you have a large set of relatively short clips on a single topic, which should be treated as a unit.
A set of source clips with the same format, resolution, time base, etc. can be
placed in a directory labelled <project-name>.vclips. The container directory,
but not the files within it, will be tagged by egaia tag.
The egaia derive tool will first generate two lists of clips, df-concat-list
and df-concat-offsets, as described above. All the clips in the source
directory will then be assembled to create unified derivatives; thus the file
<project-name>.df-h264.<uuid>.mp4 will be a single video that includes all the
footage from the <project-name>.vclips directory, and the stills-list and other
derivatives will reference this concatenated video. Source clips will be
included in the order of their modification time, which should normally
correspond to their creation date.
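The two lists can be sketched as follows: clips are ordered by modification time, written out in ffconcat v1.0 syntax, and assigned cumulative offsets. This is a rough illustration, not egaia's implementation. All function names are hypothetical, and the clip durations are assumed to be known already (in practice they would have to be probed from the files).

```python
import os


def order_by_mtime(paths):
    """Sort clip paths by modification time, oldest first."""
    return sorted(paths, key=os.path.getmtime)


def ffconcat_text(paths):
    """Render an ffconcat v1.0 list for ffmpeg's concat demuxer."""
    lines = ["ffconcat version 1.0"]
    for path in paths:
        # Inside single quotes, a literal quote is written as '\''.
        escaped = path.replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def clip_offsets(durations):
    """Map ordered (name, seconds) pairs to each clip's start offset."""
    offsets = {}
    position = 0.0
    for name, seconds in durations:
        offsets[name] = position
        position += seconds
    return offsets


def to_concat_time(offsets, clip, seconds):
    """Translate a time within a source clip to the concatenated video."""
    return offsets[clip] + seconds
```

With clips of 10, 5.5, and 2 seconds, the second clip starts at offset 10, so a reference to 1.5 seconds into that clip maps to 11.5 seconds in the concatenated file. This is the kind of translation that the df-concat-offsets list makes possible.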
Archiving HTML resources
It is possible to archive web pages with egaia using a “url” file: basically a
plain-text file with the extension .url, containing the address of the resource
you want to archive. This input file can contain multiple URLs, one per line,
but only the first URL will be fully processed. Note that egaia will not be
able to process resources requiring cookies, user interaction (e.g., content
revealed through “infinite scrolling”), or the submission of login credentials;
such resources should be copied by other means. As indicated in the derivatives
table, outputs include a downloaded copy of the resource and its supporting
media, a full screenshot of the resource as it would appear in a web browser, a
PDF representing the “print” view of the web page, and a cropped screenshot and
thumbnail.
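Since only the first URL in the file is fully processed, it may be useful to confirm which address that is. The sketch below is hypothetical and is not how egaia parses the file; in particular, skipping blank lines and trimming whitespace are assumptions.

```python
from typing import Optional


def first_url(text: str) -> Optional[str]:
    """Return the first non-blank line of a .url file's contents, if any.

    Assumption: blank lines are ignored and surrounding whitespace is trimmed.
    """
    for line in text.splitlines():
        line = line.strip()
        if line:
            return line
    return None
```

Passing the contents of a file that lists https://example.org/a followed by https://example.org/b returns the first address only.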
Re-running the command
Re-running the derive command will not overwrite existing derivatives unless
the --force or --update flag is given; it is always safe to call this command
after updating a collection. New derivatives will only be created for files
that have been tagged by egaia (i.e., any file whose name includes a UUID).
During processing, the derived output is written to a file with the suffix .tmp
that is then renamed to the target filename once the conversion has completed
successfully. If processing is interrupted for any reason you will need to
remove any such temporary files and call the derive command again.
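The write-then-rename pattern, and a way to list leftover temporary files after an interrupted run, can be sketched as below. This is an illustration rather than egaia's code. The function names are hypothetical, and appending .tmp to the full target filename is an assumption about how the temporary name is formed.

```python
import os
from pathlib import Path


def write_then_rename(target, data: bytes) -> None:
    """Write data to <target>.tmp, then rename it to the target name.

    Assumption: the .tmp suffix is appended to the full target filename.
    """
    target = Path(target)
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_bytes(data)
    # The rename only happens once the write has finished successfully.
    os.replace(tmp, target)


def leftover_tmp_files(root):
    """List .tmp files under root, e.g. after an interrupted run."""
    return sorted(Path(root).rglob("*.tmp"))
```

Reviewing the output of leftover_tmp_files before deleting anything is prudent, since other programs may also use the .tmp suffix.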