egaia docx
Create and update finding aids in docx format, containing metadata for archive items. Export and import metadata to and from CSV and JSON.
Usage
egaia docx --help
egaia docx --new
egaia docx --update
egaia docx --to-csv=CSV-FILE [ DOCX ]
egaia docx --to-json=JSON-FILE [ DOCX ]
egaia docx --from-csv=CSV-FILE [ --force ]
egaia docx --from-json=JSON-STR [ --append | --force ] [ DOCX ]
Options
--help
- Show this help text and exit.
--new
- Create metadata files, in docx format, for tagged items from the current collection where matching metadata files are not already present. Each file will include the title, thumbnail image, and core metadata fields.
--update
- Refresh the automatically extracted metadata, thumbnail image, etc., for all items in the collection, and write the updated data to all docx files, creating those files where necessary. This command is primarily intended to be used with text documents that are composed directly within the archive, and whose file size and modification date need to be updated in the associated metadata file. The command can also be useful if new thumbnail images have been generated and you wish to include those in the docx file.
--to-csv=CSV-FILE
- Extract metadata from the specified items and write it to a CSV file for batch editing in a spreadsheet program. The file may be edited and then re-imported into the collection using the --from-csv command. For fields with multiple values, the values are separated by newline characters within a single cell. This command exports ONLY the “core metadata fields” defined in the egaia configuration settings; additional fields present in the docx file are ignored. If DOCX is not specified, all collection items are processed.
--to-json=JSON-FILE
- Extract metadata from the specified items and write it to a JSON file for processing by external tools. JSON may be re-imported into the collection using --from-json, alone or together with --append; in both cases the JSON is read from STDIN. If DOCX is not specified, all collection items are processed.
--from-csv=CSV-FILE
- Convert rows in a CSV spreadsheet to individual docx metadata documents for items in the current collection. The first row of the CSV document should contain column headers, which must correspond to localized labels defined in the egaia configuration file (e.g., the Dublin Core labels). Multiple values in a single cell should be separated by newlines, which can be entered in graphical spreadsheet editors with a key combination such as <SHIFT>+<ENTER> or <CTRL>+<ENTER>. Where matching metadata documents already exist, non-duplicate metadata from the CSV file is appended to the existing metadata in those documents. Unlike --to-csv, this command is not limited to the core metadata elements; every metadata column present in the spreadsheet is imported, even if empty. CAUTION: Metadata for items that are not present in the current collection will be ignored! If the “DCTERMS.type” field for a row has the value “collection”, the metadata for that row is appended to the collection-level metadata file.
--from-json=JSON-STR
- Write metadata to a docx document, which may or may not already exist. This command reads a string in JSON format from STDIN. If the --force flag is given, the metadata values of any matching fields in the original docx document are overwritten (non-matching fields are kept intact); otherwise, by default, this performs a dry run and simply prints a list of changes. Keys used in the JSON should be localized metadata terms (e.g., the Dublin Core labels). If DOCX is not specified, all collection items are updated.
--append
- Append new metadata to existing data in docx files. Any input data that do not match existing values will be appended to the fields in the matching docx document, leaving the original data intact. New docx documents will be created where necessary.
--force
- Overwrite existing text. Otherwise, by default, a list of changes is printed to stdout for review.
DOCX
- Input/target DOCX filename or globbing pattern. Filenames, without directory paths, are matched recursively from the collection root. The collection metadata can be updated by using metadata-en.docx as the input value. Input should be quoted on the shell if wildcards are used, e.g., "*" (to match all metadata files in the collection) or "foo*" (to match the metadata file or files for items whose names begin with “foo”). If DOCX is a target and more than one filename matches a given wildcard pattern, the same metadata is written to each of the matching files. If DOCX is an input pattern, metadata from all matching files is concatenated (i.e., combined in a single CSV or JSON document).
Docx metadata document format
Each metadata document describes an individual item within the archive. The
document is named following the format
<original-file-basename>.metadata-<LANG>.<UUID>.docx, where <LANG> is the
default language of the archive (to accommodate translated versions of the
metadata file) and <UUID> is the identifier for the item it describes.
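The naming scheme above can be taken apart with a regular expression. The following is a minimal sketch only: the pattern, the function name parse_metadata_name, and the assumption that <UUID> is a canonical hyphenated hexadecimal UUID and <LANG> a short language code are illustrative and are not taken from egaia itself.

```python
import re

# Hypothetical pattern for <original-file-basename>.metadata-<LANG>.<UUID>.docx.
# Assumes a canonical 8-4-4-4-12 hexadecimal UUID and a language code such as "en".
METADATA_NAME = re.compile(
    r"^(?P<basename>.+)\.metadata-(?P<lang>[A-Za-z]{2,3}(?:-[A-Za-z0-9]+)*)"
    r"\.(?P<uuid>[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})\.docx$"
)


def parse_metadata_name(filename):
    """Return the basename, language and UUID of a metadata filename, or None."""
    match = METADATA_NAME.match(filename)
    return match.groupdict() if match else None


parts = parse_metadata_name(
    "foo-bar.metadata-en.123e4567-e89b-12d3-a456-426614174000.docx"
)
print(parts)
```

A name that does not follow the per-item scheme, such as the collection-level metadata-en.docx, yields None in this sketch.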
Within the document, each header is a metadata term (i.e., “description”,
“creator”, etc.), while the paragraphs under that header will be taken as
separate values for the preceding descriptor – such as a list of authors in
separate paragraphs under the “creator” header. The level of headers used is
not significant. The metadata terms written in the headers themselves will be
the natural-language labels corresponding to the keys listed under [terms]
in the configuration file; thus the label “creator” will be used for the term
“DCTERMS.creator”, for instance. All labels are written to the document in
lower case, but for the sake of future compatibility should be assumed to be
case-sensitive.
The “title” field (i.e., the localized term for “DCTERMS.title”) will be rendered separately at the top of the document, using the document style “Title”.
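The heading and paragraph rules described above can be modelled in a few lines. This sketch does not read docx files and is not egaia's parser; it assumes the document has already been reduced to a hypothetical list of (style name, text) pairs, and it stores the title under the label “title”.

```python
def group_by_heading(blocks):
    """Group paragraph texts under the most recent heading.

    `blocks` is a hypothetical list of (style_name, text) pairs in document order.
    """
    fields = {}
    current = None
    for style, text in blocks:
        text = text.strip()
        if style == "Title":
            # The title is rendered separately at the top of the document.
            fields["title"] = [text]
        elif style.startswith("Heading"):
            # The heading level is not significant; labels are kept exactly as written.
            current = text
            fields.setdefault(current, [])
        elif current is not None and text:
            # Each non-empty paragraph is one value of the preceding term.
            fields[current].append(text)
    return fields


blocks = [
    ("Title", "Field notes"),
    ("Heading 1", "creator"),
    ("Normal", "Jane Doe"),
    ("Normal", "John Roe"),
    ("Heading 2", "description"),
    ("Normal", "Notes from the first season."),
]
fields = group_by_heading(blocks)
print(fields)
```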
If a medium-scale image of the archive item is available, that image will be embedded at the beginning of the document, below the title, to guide data entry.
Any table entered in the docx metadata document will be treated as a “table of contents”, and parsed as a tabular array. Such tables can be used for item coding, as part of the research process.
A document updated with this command will always contain the fields listed in
the configuration file under [archive] => core_metadata, in the order given
in that list, even if those fields are empty or have been deleted by the user.
(These are also the fields that will be listed in published html catalogue
entries.) Any remaining metadata will be placed below the default entries.
While it is possible to enter custom terms that have not been defined in the
configuration file, doing so may introduce unexpected errors at the catalogue
generation stage.
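The ordering rule for updated documents can be sketched as follows. The function order_fields and the sample core list are hypothetical, and the sketch assumes the core list holds the same labels that appear as headers in the document.

```python
def order_fields(fields, core_metadata):
    """Return core fields first (present even when empty), then all remaining fields."""
    ordered = {term: list(fields.get(term, [])) for term in core_metadata}
    for term, values in fields.items():
        if term not in ordered:
            # Non-core and custom terms keep their original relative order.
            ordered[term] = list(values)
    return ordered


core = ["title", "creator", "date"]  # hypothetical core_metadata list
fields = {"notes": ["custom term"], "creator": ["Jane Doe"], "title": ["Field notes"]}
ordered = order_fields(fields, core)
print(list(ordered))
```

In this model the “date” field is re-created as an empty entry because it is listed as a core field, while the custom “notes” term is placed after the core entries.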
Caution
Currently all fields are treated as plain text. Any text formatting such as bold, italic, underline, or font colour will be lost when the docx document is updated. Additionally, inline elements such as images or hyperlinks will be removed, along with any captions or link text they might contain.
It is possible to create a docx document manually and then have it updated by
egaia by running egaia docx --update. Note that if the default “Title” and
“Heading” styles have not been applied to any text in the input document,
parsing will fail. The document template used for metadata files currently
cannot be modified by end users.
Sample commands
In normal usage it will be sufficient to employ the commands egaia docx --new
or egaia docx --update to create and update metadata files.
For batch editing:
$ egaia docx --to-csv="$HOME/items.csv" "*"
$ libreoffice ~/items.csv
$ egaia docx --from-csv="$HOME/items.csv"
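When the exported spreadsheet is edited by a script rather than in a spreadsheet program, multi-value cells need the same newline convention. The sketch below uses only the Python standard library; the column names and the helper functions split_cell and join_values are illustrative assumptions, and an in-memory buffer stands in for the CSV file.

```python
import csv
import io


def split_cell(cell):
    """Split a multi-value cell on newlines, dropping blank entries."""
    return [value.strip() for value in cell.splitlines() if value.strip()]


def join_values(values):
    """Join several values into one cell, one value per line."""
    return "\n".join(values)


# An in-memory buffer stands in for the CSV file on disk.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["title", "creator"])
writer.writeheader()
writer.writerow(
    {"title": "Field notes", "creator": join_values(["Jane Doe", "John Roe"])}
)

# Read the row back; the csv module keeps the embedded newline inside the quoted cell.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
creators = split_cell(rows[0]["creator"])
print(creators)
```

The csv module quotes any cell containing a newline, so the two creators stay within a single cell when the file is opened in a spreadsheet editor.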
A sample command to update the creator field for the document
foo-bar.metadata-en.<UUID>.docx would be:
$ egaia docx --from-json='{"creator":["Jane Doe"]}' --force "foo-bar*"
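The JSON string in a command like this can be generated rather than typed by hand, which avoids quoting mistakes. This is a minimal sketch: build_payload is a hypothetical helper, and it assumes, following the sample above, that each key is a localized label mapped to a list of values.

```python
import json
import shlex


def build_payload(fields):
    """Serialize a mapping of localized labels to lists of values as JSON."""
    normalized = {
        # A bare string is wrapped in a list so every field holds a list of values.
        label: [values] if isinstance(values, str) else list(values)
        for label, values in fields.items()
    }
    return json.dumps(normalized, ensure_ascii=False)


payload = build_payload({"creator": "Jane Doe"})
# shlex.quote wraps the JSON so it can be pasted safely into a POSIX shell command.
print(shlex.quote(payload))
```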
To add the subject “ethnography” to every item in the collection:
$ egaia docx --from-json='{"subject":["ethnography"]}' --append "*"
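The difference between appending and overwriting, as described under the --append and --force options, can be modelled as follows. This is an illustrative sketch of the documented behaviour, not egaia's implementation; merge_metadata and its mode argument are hypothetical names.

```python
def merge_metadata(existing, incoming, mode="append"):
    """Merge incoming values into existing metadata.

    mode="append": add only values not already present, keeping the original data.
    mode="force": replace the values of matching fields; other fields stay intact.
    """
    merged = {label: list(values) for label, values in existing.items()}
    for label, values in incoming.items():
        if mode == "force":
            merged[label] = list(values)
        else:
            current = merged.setdefault(label, [])
            for value in values:
                if value not in current:
                    current.append(value)
    return merged


existing = {"subject": ["history"], "creator": ["Jane Doe"]}
incoming = {"subject": ["ethnography", "history"]}

appended = merge_metadata(existing, incoming, mode="append")
forced = merge_metadata(existing, incoming, mode="force")
print(appended)
print(forced)
```

In both modes the “creator” field is left untouched, because the incoming data contain no values for it.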