Submission guide

Terms appearing in bold are referenced in the glossary.

Archival process overview

Pocket Archive receives new contents, and updates to existing contents, via submissions. A submission is an individual contribution to the archive that can add, update, or delete resources, or perform any combination of these operations. A submission may include multiple resources, which can be related to one another but do not have to be.

The full cycle of operations for a given resource in Pocket Archive is as follows:

  1. Archivist selects and lays out digital resources to be archived on their own workstation.
  2. Archivist creates a laundry list that includes an inventory of the resources and their metadata. This, together with the files and folders previously prepared, constitutes the SIP.
  3. Archivist transfers the SIP to the Drop box: first the files and folders, then the laundry list.
  4. Upon receipt of the laundry list, Pocket Archive processes the incoming materials and archives them.
  5. Pocket Archive generates a report after the process is complete (regardless of whether it was successful or failed).
  6. Depending on setup, Pocket Archive may delete the SIP from the Drop box if the submission succeeded.
  7. Depending on setup, Pocket Archive may (re-)generate the presentation.
  8. If the archivist wants to update the archived resources, they can either request a full copy of the SIP (or, to update metadata only, just the laundry list), edit it and/or replace files, and re-submit the new SIP; or edit individual resources via the admin interface.
  9. The archivist can remove a resource and, optionally, all its members at any time.

Processing of the SIP (point 4 above) either succeeds or fails as a whole. This means that a submission will never perform only a part of the task that it is meant to complete. This is called an atomic operation and it is designed to ensure consistency of the data.

Individual steps are described in detail in the following chapters.

Submission Information Package structure

A submission is performed by preparing a Submission Information Package, or SIP, and sending it to Pocket Archive for processing. The SIP consists of data, i.e. files optionally arranged in a curator-defined folder hierarchy, and metadata, which is gathered in a single file called a laundry list.

A working laundry list example linking to local files, used for testing, is available as a quick reference. Other examples are illustrated further down in this document.

As the life cycle above shows, the SIP is a disposable asset. Once it is successfully archived, it can be deleted. The full SIP can be regenerated by the archive and retrieved at a later time.

The original files on the archivist's workstation can optionally be kept and/or copied to local storage. This is strongly recommended, at least until Pocket Archive reaches a stable status and can be exclusively relied on for long-term preservation. More copies mean more chances to recover data from corruption or loss, but also higher storage costs.

Source file & folder layout

Preparation of the SIP begins with selecting the materials to submit. Generally, it is good practice to select a group of works more or less related to one another, e.g. a small coherent collection, or a day's work within a large collection that may take a long time to complete. It is not critical to get this part perfectly right, as more can be added to the archive at a later time. It is more important to keep submissions not too large, as a single malformed element can cause the whole submission to fail, and not too small, to avoid too many iterations that can become confusing. Submissions of tens to hundreds of files are in a fairly safe range.

The arrangement of files and folders is important; the ordering of elements within a folder is less so. A file or sub-folder inside a parent folder creates a membership relationship between the two. For example, one can create the following structure:

my_collection
  |
  `- work1
  |   |
  |   `- file1.tiff
  |   |
  |   `- file2.pdf
  |
  `- work2
      |
      `- file3.mpg

This creates a collection, my_collection, with two members, work1 and work2, the former containing file1.tiff and file2.pdf, and the latter containing file3.mpg.
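The way a folder hierarchy implies membership can be illustrated with a short Python sketch. It is not part of Pocket Archive: `membership_pairs` and `build_example` are hypothetical helpers that recreate the example collection described above in a temporary folder and list the parent/member pairs that the layout implies.

```python
import tempfile
from pathlib import Path


def membership_pairs(sip_root: Path) -> list[tuple[str, str]]:
    """Return (parent, member) pairs as paths relative to the SIP root."""
    pairs = []
    for path in sorted(sip_root.rglob("*")):
        rel = path.relative_to(sip_root)
        # Top-level entries have no parent resource inside the SIP.
        if len(rel.parts) > 1:
            pairs.append((rel.parent.as_posix(), rel.as_posix()))
    return pairs


def build_example(root: Path) -> None:
    """Recreate the example collection with empty placeholder files."""
    for name in (
        "my_collection/work1/file1.tiff",
        "my_collection/work1/file2.pdf",
        "my_collection/work2/file3.mpg",
    ):
        target = root / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        build_example(Path(tmp))
        for parent, member in membership_pairs(Path(tmp)):
            print(f"{parent} has member {member}")
```

The relative paths use forward slashes on every platform, which matches the form expected in the source_path field of the laundry list.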

The ordering of the files and folders in a SIP is defined in the laundry list, as we will see further down, so using file names to force a certain order is not necessary. It can, however, provide a good starting point for large lists of files or folders under a parent.

Some aspects of the file and folder structure may be used in future versions of Pocket Archive to create more metadata.

Empty folders can be created and submitted: they can be used as placeholders for resources that have no directly related files. The same effect can also be obtained by other means with the laundry list.

Laundry list

Once the selection of files to be included in the SIP is complete, a laundry list is compiled. As the name suggests, this is essentially an inventory of all the resources that go into the submission, but it provides much more information than that by defining metadata and relationships between resources.

The laundry list is a CSV file and may be edited in any application that supports CSV reading and writing. Care must be taken to export the file to CSV. In LibreOffice, for example, "Save" writes the file in .ods format by default, which is not usable as a laundry list. The spreadsheet must instead be exported in .csv format.

The use of LibreOffice is the preferred method for editing laundry lists, and this project may provide additional utilities specific to LibreOffice to facilitate laundry list authoring.

Examples of laundry lists are in the test directory of the Pocket Archive code. Most of these are tiny examples with little interesting content, each highlighting a specific feature. Be aware that some laundry lists are intentionally invalid (the pkar_submission-bad-… ones), and that others are meant for updates only (the pkar_submission-update-… ones) and may be invalid without a prior submission.

Multi-sheet documents

Many spreadsheet applications allow grouping multiple tables or sheets in one file. CSV supports only one table per file. While some may find it convenient to keep multiple laundry lists in one spreadsheet file, one must take care to export each sheet individually as a CSV file.

Laundry list format

The first row of a laundry list is reserved for the header, which indicates the field names. These can be in any order, but following a specific order is recommended. The order used in this document and in all laundry lists automatically generated by Pocket Archive is: content_type, id, source_path, and then all ordinary fields in alphabetical order.

Each subsequent row represents a resource (except in a multi-value case, described below). The content_type field is mandatory for each resource.

Except for the cases noted in "Fields with a special meaning" below, all fields are optional for the submission to be considered well-formed. However, the content model in use may define schema constraints that make some fields mandatory, or at least strongly recommended. A submission will still fail if it is well-formed but does not satisfy a mandatory schema constraint.
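For archivists who prefer to script their laundry lists rather than export them from a spreadsheet, the header rules above can be sketched with Python's standard csv module. This is only an illustration: `header_for` and `write_laundry_list` are hypothetical helpers, the sample values are placeholders, and the sketch handles one row per resource only (no multi-valued continuation rows).

```python
import csv
import io

SPECIAL_FIELDS = ["content_type", "id", "source_path"]


def header_for(rows):
    """Special fields first, then all ordinary fields in alphabetical order."""
    ordinary = {key for row in rows for key in row} - set(SPECIAL_FIELDS)
    return SPECIAL_FIELDS + sorted(ordinary)


def write_laundry_list(rows, out):
    """Write one CSV row per resource; missing fields become empty cells."""
    for number, row in enumerate(rows, start=1):
        if not row.get("content_type"):
            raise ValueError(f"resource {number}: content_type is mandatory")
    writer = csv.DictWriter(out, fieldnames=header_for(rows), restval="")
    writer.writeheader()
    writer.writerows(rows)


# Placeholder resources used only for this illustration.
ROWS = [
    {
        "content_type": "still_image",
        "id": "Sg9hYIISjRjlkP62",
        "source_path": "my_collection/work1",
        "label": "My first deposited work",
        "creation_date": "12-07-2002",
    },
    {
        "content_type": "still_image_file",
        "id": "7hic19YTXA8Fudxo",
        "source_path": "my_collection/work1/file1.tiff",
        "creation_date": "09-22-2025",
    },
]

if __name__ == "__main__":
    buffer = io.StringIO()
    write_laundry_list(ROWS, buffer)
    print(buffer.getvalue(), end="")
```

When writing to a real file instead of a string buffer, open it with `newline=""` as the csv module documentation recommends, so that line endings are not doubled on some platforms.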

Fields with a special meaning

content_type

Mandatory, single-valued.

It defines the content type assigned to the resource. For files, it must be file or a sub-type thereof, except for inferred resources (see below). For folders, it must not be file or a sub-type thereof. Consult the content model of your archive for a list of defined type names.

id

Mandatory for resources being updated, single-valued.

For new resources it becomes the primary identifier, which is used anywhere information about the resource is retrieved.

The IDs generated by default by Pocket Archive are 16-character random strings containing only uppercase and lowercase letters and digits. The depositor is responsible for ensuring that a provided ID is unique across the system. If the field is left blank on new resources, the system generates an identifier that is guaranteed to be unique. However, re-submitting the same laundry list with the field still blank will create a duplicate resource; therefore, it is recommended to always fill this field in.
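An ID in the same shape as the default ones can be produced outside Pocket Archive, for example with the Python sketch below. This is not Pocket Archive's own generator: `new_resource_id` is a hypothetical helper, and a random string of this kind is only very unlikely, not guaranteed, to be unique, so the depositor remains responsible for uniqueness.

```python
import secrets
import string

# Uppercase and lowercase letters plus digits, as described above.
ID_ALPHABET = string.ascii_letters + string.digits
ID_LENGTH = 16


def new_resource_id() -> str:
    """Return a random 16-character string of letters and digits."""
    return "".join(secrets.choice(ID_ALPHABET) for _ in range(ID_LENGTH))


if __name__ == "__main__":
    print(new_resource_id())
```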

source_path

Mandatory for new files, single-valued.

It refers to the file or folder path relative to the package, using forward slash / characters to separate folders and subfolders or files. It can be omitted for files being updated, and for folders (descriptive resources). If it is present on a file update and the file exists in the SIP, the file is used to replace the archived file. If it is present and different from the archived file's path, and it does not correspond to a file in the SIP, Pocket Archive will only update the file path in the archived file metadata. This path is used when rebuilding the SIP from the archive.

has_member

This behaves like all normal properties, but it has a special meaning when deleting resources. If the --members option is provided, resources linked via the has_member property to the resource being deleted are also deleted, along with their own members, recursively. See the "Deleting resources" section below.

Note: when a field is defined as "mandatory" above, this is intended per-resource. If the resource spans multiple rows, as when it has multi-valued fields, a mandatory field is only required to have a value on the first row of the resource.

Example of a table representing a work with two files:

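content_type     | id               | source_path                    | creation_date | label
-----------------+------------------+--------------------------------+---------------+------------------------
still_image      | Sg9hYIISjRjlkP62 | my_collection/work1            | 12-07-2002    | My first deposited work
still_image_file | 7hic19YTXA8Fudxo | my_collection/work1/file1.tiff | 09-22-2025    |
still_image_file | Z509TdNhpTjPYDS4 | my_collection/work1/file2.pdf  | 09-23-2025    |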

Note the difference between the still_image and the still_image_file resources. We will get back to it further down.

Multi-valued fields

Some fields may allow multiple values. To provide multiple values for one or more fields, the additional values are entered in the rows directly below the first row of the resource. For these additional rows, the special fields content_type, id, and source_path must not be filled.

Example of a table with a single resource with multi-valued fields:

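content_type | id               | source_path         | alt_label                                   | description                          | label
-------------+------------------+---------------------+---------------------------------------------+--------------------------------------+------------------------------------------------
still_image  | Sg9hYIISjRjlkP62 | my_collection/work1 | An alternative label                        | A description of the work goes here. | This is the title and must have only one value.
             |                  |                     | You can have as many as you like of these   | Another description goes here.       |
             |                  |                     | FREE alt labels! (as long as supplies last) |                                      |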

The submission process checks if the content_type field is filled in a cell to determine whether a row in the table is a continuation from the previous one, adding multiple values. Having a row without content type and with id and/or source_path is considered an error.
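The continuation rule just described can be sketched in Python as follows. This is an illustration of the rule, not Pocket Archive's actual parser: `group_resources` is a hypothetical helper, and it assumes that no row has more cells than the header.

```python
import csv
import io


def group_resources(csv_text: str) -> list[dict[str, list[str]]]:
    """Group laundry list rows into resources with list-valued fields."""
    resources: list[dict[str, list[str]]] = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        if row.get("content_type"):
            # A filled content_type cell starts a new resource.
            resources.append({k: [v] if v else [] for k, v in row.items()})
            continue
        # No content_type: the row continues the previous resource.
        if row.get("id") or row.get("source_path"):
            raise ValueError(
                f"data row {row_number}: id or source_path without content_type"
            )
        if not resources:
            raise ValueError(f"data row {row_number}: nothing to continue")
        for key, value in row.items():
            if value:
                resources[-1][key].append(value)
    return resources


# The multi-valued example from this section, written as CSV.
EXAMPLE = """\
content_type,id,source_path,alt_label,description,label
still_image,Sg9hYIISjRjlkP62,my_collection/work1,An alternative label,A description of the work goes here.,This is the title and must have only one value.
,,,You can have as many as you like of these,Another description goes here.,
,,,FREE alt labels! (as long as supplies last),,
"""

if __name__ == "__main__":
    for resource in group_resources(EXAMPLE):
        for field, values in resource.items():
            print(f"{field}: {values}")
```

Run against the example, the sketch yields a single resource with three alt_label values, two description values, and one label.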

Ordering

The ordering of rows in a laundry list determines the ordering of the resources in their container. The system automatically assigns an order to the resources, using their source path and their position in the laundry list. Resources at the top level, i.e. directly under the SIP folder, are not assigned an order, as they are considered self-standing. If an order is needed for those, the next property can be set to the desired resource (see point below about relationships), or they can be put in an enclosing folder that acts as a collection.

Relationships

Relationships can be established between resources. These are stored as persistent links and appear as hyperlinks in the discovery interface. A relationship can only be set for a field that is configured as "resource" type. Consult your content model to find which properties are relationships.

To set a relationship with a resource in the same laundry list that doesn't have an explicit ID set, insert the source path of the resource. For a resource that already has an ID, either by being assigned one manually or by being already deposited, insert the ID string.

Example table with implicit and explicit relationships, some path-based and some ID-based:

content_type | id               | source_path              | has_member               | label
-------------+------------------+--------------------------+--------------------------+-------------------------------------------------------------------
collection   | p9tXQGBb9iC7xEqm | my_collection-1          |                          | This collection has implicit members from the folder hierarchy.
still_image  | KHwYidw4R7xUAEMN | my_collection-2/image001 |                          | Resource with an explicit ID. The ID can be used in a reference.
text         |                  | my_collection-2/text0001 |                          | Resource without explicit ID. It can be referenced by source_path.
collection   | EUXRg9igmU9ouzVH | my_collection-2          | p9tXQGBb9iC7xEqm         | This collection has explicit member relationships.
             |                  |                          | my_collection-2/text0001 |

When the laundry list is processed for submission, the path-based references are replaced with IDs, which are automatically generated where not provided. Therefore, a laundry list generated from archived resources may look different from the original one. The generated laundry list should be used for re-submission.
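The replacement of path-based references with IDs can be pictured with the following Python sketch. It is a simplified model, not Pocket Archive's implementation: `resolve_references` and `make_id` are hypothetical, resources are plain dictionaries, and the set of relationship fields is passed in by hand instead of being read from a content model.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def make_id() -> str:
    """Stand-in ID generator: 16 random letters and digits."""
    return "".join(secrets.choice(ALPHABET) for _ in range(16))


def resolve_references(resources: list[dict], relationship_fields: set[str]) -> list[dict]:
    """Return copies of the resources with IDs filled in and paths replaced."""
    resolved = [dict(res) for res in resources]
    path_to_id = {}
    for res in resolved:
        # Generate an ID where none was provided.
        res["id"] = res.get("id") or make_id()
        if res.get("source_path"):
            path_to_id[res["source_path"]] = res["id"]
    for res in resolved:
        for field in relationship_fields:
            if field in res:
                # A value matching a known source path becomes that resource's
                # ID; anything else is assumed to be an ID already.
                res[field] = [path_to_id.get(v, v) for v in res[field]]
    return resolved


# The relationship example from this section, as dictionaries.
EXAMPLE = [
    {"content_type": "collection", "id": "p9tXQGBb9iC7xEqm",
     "source_path": "my_collection-1"},
    {"content_type": "still_image", "id": "KHwYidw4R7xUAEMN",
     "source_path": "my_collection-2/image001"},
    {"content_type": "text", "id": "",
     "source_path": "my_collection-2/text0001"},
    {"content_type": "collection", "id": "EUXRg9igmU9ouzVH",
     "source_path": "my_collection-2",
     "has_member": ["p9tXQGBb9iC7xEqm", "my_collection-2/text0001"]},
]

if __name__ == "__main__":
    for res in resolve_references(EXAMPLE, {"has_member"}):
        print(res["id"], res.get("has_member", []))
```

After resolution, the second has_member value of the last collection is the newly generated ID of the text resource, which is why a regenerated laundry list differs from the one originally submitted.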

Resource types and sub-types

This section is a very concise introduction to content modeling in Pocket Archive, which is treated in detail in the Content modeling introduction. It is strongly recommended to read that guide before archiving resources in earnest.

The three main resource types found in a submission are: Work, File, and Brick. See the Content modeling introduction for more information about these.

These three key content types are seldom used as-is. They usually have sub-types, which are defined in the content model. See the content modeling guide for more information about sub-types.

Types provided by Pocket Archive may have similar names but different uses. For example, the still_image type, a sub-type of work, designates a visual object, e.g. a photograph. still_image_file may be the capture (e.g. scan) of that object, but also the capture of a text work if it is the scan of a book page.

See the provided sample laundry list for examples of works, files, and bricks making up a two-sided postcard.

Submission ID and submission name

Each submission gets a randomly generated ID when it starts. This ID is attached to all the resources in the submission. This makes it easier to find out later on when and how a certain resource was submitted. It also makes it possible to generate a laundry list that contains all the resources of the original submission.

The ID is automatically generated and system-controlled. It cannot be changed. A submission can also have a name, which is optional and user-defined. The submission name is determined by the file name used for the laundry list. E.g. pkar_submission-my_new_collection.csv will use my_new_collection, i.e. the text between pkar_submission- and .csv, as the submission name. Submission names are not required to be unique. Of course, the laundry list file names must be unique in the drop box they are deposited to.
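The naming rule can be expressed as a few lines of Python. This is a sketch of the rule as stated above; `submission_name` is a hypothetical helper, and returning `None` for file names that do not match the pattern is an assumption, since this guide does not say how such files are treated.

```python
from pathlib import PurePosixPath
from typing import Optional

PREFIX = "pkar_submission-"
SUFFIX = ".csv"


def submission_name(laundry_list_path: str) -> Optional[str]:
    """Return the text between the prefix and the .csv extension, if any."""
    file_name = PurePosixPath(laundry_list_path).name
    if file_name.startswith(PREFIX) and file_name.endswith(SUFFIX):
        # An empty name is treated the same as no name at all.
        return file_name[len(PREFIX):-len(SUFFIX)] or None
    return None


if __name__ == "__main__":
    print(submission_name("pkar_submission-my_new_collection.csv"))
```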

Updating resources

A submission is also used to update existing resources. Each resource update is a full replacement of all the resource's metadata, so a submission must include a full representation of each of the resources updated.

Any single submission can contain a mix of new and updated resources. If the correct fields are provided (see "Fields with a special meaning" above), Pocket Archive will know which is which.

To facilitate this task while avoiding the need to hold on to all of the archive's laundry lists, Pocket Archive can generate a laundry list for one or more selected resources. This list, which represents the current state of the resources requested, can be edited and re-submitted for an update. Read the Admin interface document for further information.

The administrative interface, if enabled, also has a facility to inspect and update an individual resource, because performing such one-off tasks using only submissions via laundry lists could become a tedious job.

Deleting resources

Although some archivists advise against deleting anything from an archive, Pocket Archive acknowledges that in real life things may actually need to be removed. The cause may be a duplicate, or something that was not supposed to be archived, etc. In any case, the resource-conservative alignment of Pocket Archive supports deleting resources immediately and irreversibly. Versioning and "soft" deletion, which keep prior states of resources, including deleted ones, are not supported.

A resource can be deleted via the pkar remove CLI command, or by uploading a special file to the drop box, named pkar_remove* (the asterisk stands for zero or more characters; note that the file name does not need an extension). The delete file must contain a list of archival IDs in the short URI form (par:<ID>), one per line.
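A delete file of this kind can be written by hand or generated, as in the Python sketch below. `write_remove_file` is a hypothetical helper, not a Pocket Archive tool; the UTF-8 encoding, the trailing newline, and the `-duplicates` suffix in the usage example are assumptions made for illustration. Since deletion is immediate and irreversible, the sketch writes to a temporary folder rather than to a real drop box.

```python
import tempfile
from pathlib import Path


def write_remove_file(drop_box: Path, resource_ids: list[str], suffix: str = "") -> Path:
    """Write a pkar_remove* file listing one par:<ID> reference per line."""
    target = drop_box / f"pkar_remove{suffix}"
    lines = [f"par:{resource_id}" for resource_id in resource_ids]
    target.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = write_remove_file(
            Path(tmp), ["7hic19YTXA8Fudxo", "Z509TdNhpTjPYDS4"], "-duplicates"
        )
        print(path.name)
        print(path.read_text(encoding="utf-8"), end="")
```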

If pkar_watch, the process watching the drop box, was started with the -r option, all members of the resources are recursively deleted (this means also members of members). This is set by the system administrator and is applied to all deletions. It cannot be overridden for individual deletion requests.

Advanced techniques

Some hidden tricks can be employed to facilitate the creation and management of larger submissions.

Implicit resources

TODO

Bulk ID generation

As mentioned before, explicitly adding IDs in a laundry list simplifies later editing and management. However, this is one of the most tedious parts of laundry list creation.

Fortunately, such repetitive and error-prone tasks can be easily automated with tools provided by most spreadsheet applications. A macro (a mini-program that runs in an application) for LibreOffice Calc is provided here to automatically generate 16-character IDs for all the cells selected in a table.
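For laundry lists that are already exported to CSV, the same task can be done outside the spreadsheet. The following Python sketch is an alternative to the macro, not a reproduction of it: `fill_blank_ids` is a hypothetical helper that fills the id cell only on rows that start a resource (those with a content_type), leaving continuation rows untouched as the format requires. Whether path-based references to a resource keep working once that resource has an explicit ID is not covered here, so review relationship fields after filling.

```python
import csv
import io
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def fill_blank_ids(source, target) -> int:
    """Copy a laundry list, generating IDs for resources that lack one."""
    reader = csv.DictReader(source)
    if "id" not in (reader.fieldnames or []):
        raise ValueError("laundry list has no id column")
    writer = csv.DictWriter(target, fieldnames=reader.fieldnames)
    writer.writeheader()
    filled = 0
    for row in reader:
        # Continuation rows have no content_type and must keep id empty.
        if row.get("content_type") and not row.get("id"):
            row["id"] = "".join(secrets.choice(ALPHABET) for _ in range(16))
            filled += 1
        writer.writerow(row)
    return filled


# Small placeholder laundry list: one resource with an ID, one without,
# and one continuation row.
SAMPLE = """\
content_type,id,source_path,alt_label
still_image,Sg9hYIISjRjlkP62,my_collection/work1,First label
still_image_file,,my_collection/work1/file1.tiff,Second label
,,,Another value
"""

if __name__ == "__main__":
    output = io.StringIO()
    count = fill_blank_ids(io.StringIO(SAMPLE), output)
    print(f"{count} ID(s) generated")
    print(output.getvalue(), end="")
```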