egaia docx

Create and update finding aids in docx format, containing metadata for archive items. Export and import metadata to and from CSV and JSON.

Usage

egaia docx --help
egaia docx --new
egaia docx --update
egaia docx --to-csv=CSV-FILE [ DOCX ]
egaia docx --to-json=JSON-FILE [ DOCX ]
egaia docx --from-csv=CSV-FILE [ --force ]
egaia docx --from-json=JSON-STR [ --append | --force ] [ DOCX ]

Options

--help: Show this help text and exit.
--new: Create metadata files, in docx format, for tagged items from the current collection where matching metadata files are not already present. Each file will include the title, thumbnail image, and core metadata fields.
--update: Refresh the automatically extracted metadata, thumbnail image, etc., for all items in the collection, and write the updated data to all docx files, creating those files where necessary. This command is primarily intended to be used with text documents that are composed directly within the archive, and whose file size and modification date need to be updated in the associated metadata file. The command can also be useful if new thumbnail images have been generated and you wish to include those in the docx file.
--to-csv=CSV-FILE: Extract metadata from the specified items and write to a CSV file, for batch editing within a spreadsheet program. This file may be edited and then re-imported into the collection using the --from-csv command. For fields with multiple values, the values will be separated by newline characters within a single cell. This command will ONLY export the “core metadata fields” as defined in the egaia configuration settings; additional fields present in the docx file will be ignored. If DOCX is not specified, all collection items will be processed.
--to-json=JSON-FILE: Extract metadata from the specified items and write it to a JSON file, for processing by external tools. JSON may be re-imported into the collection using the --from-json command, optionally combined with --append or --force. If DOCX is not specified, all collection items will be processed.
--from-csv=CSV-FILE: Convert rows in a CSV spreadsheet to individual docx metadata documents for items in the current collection. The first row of the CSV document should contain column headers, which must correspond to localized labels defined in the egaia configuration file (e.g., the Dublin Core labels). Multiple values in a single cell should be separated by newlines, which may be entered in graphical spreadsheet editors by using a combination such as <SHIFT>+<ENTER> or <CTRL>+<ENTER>. Where matching metadata documents already exist, non-duplicate metadata from the CSV file will be appended to the existing metadata in those documents. Unlike --to-csv, this command is not limited to the core metadata elements; every metadata column present within the spreadsheet will be imported, even if empty. CAUTION: Metadata for items that are not present in the current collection will be ignored! If the “DCTERMS.type” field for a row has the value “collection”, the metadata for that row will be appended to the collection-level metadata file.
--from-json=JSON-STR: Write metadata to a docx document, which may or may not already exist. This command reads a string in json format from STDIN. If the --force flag is given, this will overwrite all the metadata values for any matching fields in the original docx document content (non-matching fields are kept intact); otherwise, by default this performs a dry run and simply prints a list of changes. Keys used in the json should be localized metadata terms (e.g., the Dublin Core labels). If DOCX is not specified, all collection items will be updated.
--append: Append new metadata to existing data in docx files. Any input data that do not match existing values will be appended to the fields in the matching docx document, leaving the original data intact. New docx documents will be created where necessary.
--force: Overwrite existing text. Otherwise, by default, a list of changes is printed to stdout for review.
DOCX: Input/target DOCX filename or globbing pattern. Filenames, without directory paths, will be matched recursively from the collection root. It is possible to update the collection metadata by using metadata-en.docx as the input value. Input should be quoted on the shell if wildcards are used, e.g., "*" (to match all metadata files in the collection) or "foo*" (to match the metadata file or files for items whose names begin with “foo”). If DOCX is a target and more than one filename matches a given wildcard pattern, the same metadata will be written to each of the matching files. If DOCX is an input pattern, metadata from all matching files will be concatenated (i.e., combined in a single csv or json document).
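
The multi-value convention described for --to-csv and --from-csv (several values in one cell, separated by newlines) can be handled outside a spreadsheet program as well. The following is a minimal Python sketch of that convention only; the column labels "title" and "creator" and the sample rows are illustrative assumptions, and the actual headers must match the localized labels in your egaia configuration.

```python
import csv
import io


def split_cell(cell):
    """Split a multi-value cell on newlines, dropping empty entries."""
    return [value.strip() for value in cell.splitlines() if value.strip()]


def join_values(values):
    """Join several values into one cell, one value per line."""
    return "\n".join(values)


# Hypothetical rows; real headers come from the egaia configuration.
rows = [
    {"title": "Field notes, 1998", "creator": join_values(["Jane Doe", "John Roe"])},
    {"title": "Interview tape 3", "creator": join_values(["Jane Doe"])},
]

# Write to an in-memory buffer; the csv module quotes cells containing
# commas or newlines automatically.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["title", "creator"])
writer.writeheader()
writer.writerows(rows)

# Read the same data back and expand each cell into a list of values.
buffer.seek(0)
parsed = [
    {label: split_cell(cell) for label, cell in row.items()}
    for row in csv.DictReader(buffer)
]
print(parsed)
```

When working with a real file rather than an in-memory buffer, open it with newline="" so that the csv module can preserve the line breaks inside quoted cells.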

Docx metadata document format

Each metadata document describes an individual item within the archive. The document is named following the format <original-file-basename>.metadata-<LANG>.<UUID>.docx, where <LANG> is the default language of the archive (to accommodate translated versions of the metadata file) and <UUID> is the identifier for the item it describes.
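
For scripts that need to work with these filenames, the naming format can be taken apart with a regular expression. This is a sketch rather than egaia's own logic: the function name is hypothetical, and it assumes that <LANG> contains no dots and that <UUID> is a standard UUID string accepted by Python's uuid module.

```python
import re
import uuid

# <original-file-basename>.metadata-<LANG>.<UUID>.docx
METADATA_NAME = re.compile(
    r"^(?P<basename>.+)\.metadata-(?P<lang>[^.]+)\.(?P<uuid>[^.]+)\.docx$"
)


def parse_metadata_name(filename):
    """Return (basename, lang, item_uuid), or None if the name does not fit."""
    match = METADATA_NAME.match(filename)
    if match is None:
        return None
    try:
        # Assumption: the identifier is a standard UUID string.
        item_uuid = uuid.UUID(match.group("uuid"))
    except ValueError:
        return None
    return match.group("basename"), match.group("lang"), item_uuid


example = "foo-bar.metadata-en.123e4567-e89b-12d3-a456-426614174000.docx"
print(parse_metadata_name(example))
```

The collection-level file metadata-en.docx does not follow the per-item pattern, so this function returns None for it.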

Within the document, each header is a metadata term (i.e., “description”, “creator”, etc.), while the paragraphs under that header will be taken as separate values for the preceding descriptor – such as a list of authors in separate paragraphs under the “creator” header. The level of headers used is not significant. The metadata terms written in the headers themselves will be the natural-language labels corresponding to the keys listed under [terms] in the configuration file; thus the label “creator” will be used for the term “DCTERMS.creator”, for instance. All labels are written to the document in lower case, but for the sake of future compatibility should be assumed to be case-sensitive.

The “title” field (i.e., the localized term for “DCTERMS.title”) will be rendered separately at the top of the document, using the document style “Title”.
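
The relationship between headers, paragraphs, and the title can be modelled in a few lines of Python. This sketch does not read docx files and is not egaia's parser; it assumes the document has already been reduced to a sequence of (style name, text) pairs, with style names such as "Title", "Heading 1", and "Normal" chosen here for illustration.

```python
def collect_fields(paragraphs):
    """Group paragraphs under the header that precedes them.

    paragraphs: iterable of (style, text) pairs in document order.
    Returns (title, fields), where fields maps each header label to the
    list of values found beneath it.
    """
    title = None
    fields = {}
    current = None
    for style, text in paragraphs:
        text = text.strip()
        if not text:
            continue  # ignore blank paragraphs
        if style == "Title":
            title = text
        elif style.startswith("Heading"):
            # The heading level is not significant, only the label.
            current = text
            fields.setdefault(current, [])
        elif current is not None:
            # Each paragraph is a separate value for the current label.
            fields[current].append(text)
    return title, fields


sample = [
    ("Title", "Field notes, 1998"),
    ("Heading 1", "creator"),
    ("Normal", "Jane Doe"),
    ("Normal", "John Roe"),
    ("Heading 2", "description"),
    ("Normal", "Handwritten notebook."),
]
title, fields = collect_fields(sample)
print(title, fields)
```

Labels are stored exactly as written, in keeping with the advice to treat them as case-sensitive.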

If a medium-scale image of the archive item is available, that image will be embedded at the beginning of the document, below the title, to guide data entry.

Any table entered in the docx metadata document will be treated as a “table of contents”, and parsed as a tabular array. Such tables can be used for item coding, as part of the research process.

A document updated with this command will always contain the fields listed in the configuration file under [archive] => core_metadata, in the order given in that list, even if those fields are empty or have been deleted by the user. (These are also the fields that will be listed in published html catalogue entries.) Any remaining metadata will be placed below the default entries. While it is possible to enter custom terms that have not been defined in the configuration file, doing so may introduce unexpected errors at the catalogue generation stage.
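
The ordering rule can be illustrated as follows. This is a sketch under stated assumptions, not egaia's implementation: the function name is hypothetical, and the core_metadata list is assumed to have been translated already into the same natural-language labels used as headers in the document.

```python
def order_fields(fields, core_metadata):
    """Return fields with core entries first, in configuration order.

    Core labels are always present, even when they have no values;
    any remaining labels follow in their original order.
    """
    ordered = {label: list(fields.get(label, [])) for label in core_metadata}
    for label, values in fields.items():
        if label not in ordered:
            ordered[label] = list(values)
    return ordered


# Hypothetical configuration and document content.
core = ["title", "creator", "description"]
existing = {"notes": ["Box 4, folder 2"], "creator": ["Jane Doe"]}

result = order_fields(existing, core)
print(list(result))
```

Because Python dictionaries preserve insertion order, the returned mapping can be written out directly in the order in which the fields should appear.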

Caution

Currently all fields are treated as plain text. Any text formatting such as bold, italic, underline, or font colour will be lost when the docx document is updated. Additionally, inline elements such as images or hyperlinks will be removed, along with any captions or link text they might contain.

It is possible to create a docx document manually and then have it updated by egaia by running egaia docx --update. Note that if the default “Title” and “Heading” styles have not been applied to any text in the input document, parsing will fail. The document template used for metadata files currently cannot be modified by end users.

Sample commands

In normal usage it will be sufficient to employ the commands egaia docx --new or egaia docx --update to create and update metadata files.

For batch editing:

$ egaia docx --to-csv="$HOME/items.csv" "*"
$ libreoffice ~/items.csv
$ egaia docx --from-csv="$HOME/items.csv"

A sample command to update the creator field for the document foo-bar.metadata-en.<UUID>.docx would be:

$ egaia docx --from-json='{"creator":["Jane Doe"]}' --force "foo-bar*"

To add the subject “ethnography” to every item in the collection:

$ egaia docx --from-json='{"subject":["ethnography"]}' --append "*"
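
When the metadata comes from another program, hand-writing the JSON string and its shell quoting is error-prone. The sketch below builds the argument list in Python instead. It follows the Usage synopsis, in which the JSON is given as the value of --from-json; the function name is hypothetical, and if your version reads the JSON from STDIN, as the option description states, the payload would need to be passed on standard input rather than in the argument list.

```python
import json
import shlex


def build_command(metadata, pattern, mode=None):
    """Build an argument list for `egaia docx --from-json`.

    metadata: dict mapping localized labels to lists of values.
    pattern:  DOCX filename or globbing pattern.
    mode:     None (dry run), "append", or "force".
    """
    if mode not in (None, "append", "force"):
        raise ValueError(f"unsupported mode: {mode!r}")
    payload = json.dumps(metadata, ensure_ascii=False)
    argv = ["egaia", "docx", f"--from-json={payload}"]
    if mode is not None:
        argv.append(f"--{mode}")
    argv.append(pattern)
    return argv


argv = build_command({"subject": ["ethnography"]}, "*", mode="append")

# shlex.join shows how the command would need to be quoted in a shell.
print(shlex.join(argv))
```

Passing the list to subprocess.run(argv) avoids shell quoting altogether, since the wildcard pattern is then handed to egaia unexpanded.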