There are several types of bulk exports and database dumps folks might be interested in:
- raw, native-format database backups: for disaster recovery (would include volatile/unsupported schema details, user API credentials, full history, in-process edits, comments, etc)
- a sanitized version of the above: roughly per-table dumps of the full state of the database. Could use per-table SQL expressions with sub-queries to pull in small tables ("partial transform") and export JSON for each table; would be extra work to maintain, so not pursuing for now.
- full history, full public schema exports, in a form that might be used to mirror or entirely fork the project. Propose supplying the full "changelog" in API schema format, in a single file to capture all entity history, without "hydrating" any inter-entity references. Rely on separate dumps of non-entity, non-versioned tables (editors, abstracts, etc). Note that a variant of this could use the public interface, in particular to do incremental updates (though that wouldn't capture schema changes).
- transformed exports of the current state of the database (that is, without history). Useful for data analysis, search engines, etc. Propose supplying just the Release table in a fully "hydrated" state to start. It is unclear whether this should be on a work or release basis; will go with release for now. Harder to do using the public interface because of the need for transaction locking.
Identifier Snapshots
One form of bulk export is a fast, consistent (single database transaction) snapshot of all "live" entity identifiers and their current revisions. This snapshot can be used by non-blocking background scripts to generate full bulk exports that will be consistent.
These exports are generated by a script run on a primary database machine, and result in a single tarball, which gets uploaded to archive.org. The format is TSV (tab-separated). Unlike all other dumps and public formats, the fatcat identifiers in these dumps are in raw UUID format (not base32-encoded).
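Because identifiers in these snapshots are raw UUIDs, joining them against other dumps means converting between the two forms. The following is a minimal sketch, assuming the public identifier is the lowercase, unpadded RFC 4648 base32 encoding of the 16 UUID bytes (26 characters). The function names are illustrative and not part of any fatcat tooling described here.

```python
import base64
import uuid


def uuid_to_ident(raw_uuid: str) -> str:
    """Encode a UUID string as a 26-character lowercase base32 identifier."""
    raw = uuid.UUID(raw_uuid).bytes  # always 16 bytes
    # 16 bytes encode to 26 base32 characters followed by 6 '=' padding characters
    return base64.b32encode(raw)[:26].lower().decode("ascii")


def ident_to_uuid(ident: str) -> str:
    """Decode a 26-character base32 identifier back to a canonical UUID string."""
    if len(ident) != 26:
        raise ValueError("expected a 26-character identifier")
    # Restore the padding that was stripped so the standard decoder accepts it
    raw = base64.b32decode(ident.upper() + "======")
    return str(uuid.UUID(bytes=raw))
```

If the assumption about the encoding holds, the two functions are inverses of each other, so a snapshot row can be translated to the public form and back without loss.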
A variant of these dumps includes external identifiers, resulting in files that map, e.g., a release ID to its DOI, PubMed identifiers, and Wikidata QID.
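A mapping file of this kind can be loaded into a lookup table with the standard csv module. The sketch below assumes a header-less TSV whose first column is the release identifier and whose second column is the DOI. The column order is an assumption, exposed through the hypothetical ident_col and doi_col parameters, and should be checked against the actual dump.

```python
import csv
from typing import Dict, Iterable


def doi_to_release(lines: Iterable[str], ident_col: int = 0, doi_col: int = 1) -> Dict[str, str]:
    """Build a DOI -> release identifier map from tab-separated lines.

    The default column positions are assumptions, not a documented layout.
    """
    mapping: Dict[str, str] = {}
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) <= max(ident_col, doi_col):
            continue  # skip blank or short rows
        # DOIs are case-insensitive, so normalize to lowercase for lookups
        doi = row[doi_col].strip().lower()
        if doi:  # an empty field means the release has no DOI
            mapping[doi] = row[ident_col]
    return mapping
```

Any iterable of lines works as input, including a file object opened with newline="" on an extracted TSV file.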
Abstract Table Dumps
The ./extra/sql_dumps/dump_abstracts.sql file, when run from the primary
database machine, outputs all raw abstract strings in JSON format.
Abstracts are immutable and referenced by hash in the database, so the consistency of these dumps is not as much of a concern as with other exports. See the Policy page for more context around abstract exports.
"Expanded" Entity Dumps
Using the above identifier snapshots, the fatcat-export script outputs single-entity-per-line JSON files with the same schema as the HTTP API. The most useful version of these for most users is the "expanded" release export, which includes container and file metadata.
These exports are compressed and uploaded to archive.org.
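Since each line holds one complete JSON entity, these files can be streamed without loading the whole dump into memory. The following sketch assumes gzip compression (the text above only says "compressed") and a local file path chosen by the reader. The function name is illustrative.

```python
import gzip
import json
from typing import Iterator


def iter_entities(path: str) -> Iterator[dict]:
    """Yield one entity dict per non-empty line of a gzip-compressed JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)
```

Because the function is a generator, a consumer can filter or count entities in a single pass, holding only one record in memory at a time.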
Changelog Entity Dumps
The final export type is the changelog dump. Currently these are implemented in Python, and anybody can create them. They contain JSON, one line per changelog entry, with the full list of entity edits and editgroup metadata for the given changelog entry. Changelog history is immutable; the script works by iterating up the (monotonic) changelog counter until it encounters a 404.
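The iterate-until-404 approach can be sketched as below. This is not the project's actual script. The endpoint layout (API_BASE plus /changelog/{index}) and the starting index of 1 are assumptions, and the fetch parameter is a hypothetical hook that lets the loop be exercised without network access.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Iterator, Optional

# Assumed endpoint layout; verify against the API documentation before use.
API_BASE = "https://api.fatcat.wiki/v0"


def fetch_changelog_entry(index: int) -> Optional[dict]:
    """Fetch one changelog entry over HTTP, returning None on a 404."""
    url = f"{API_BASE}/changelog/{index}"
    try:
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # walked past the newest entry
        raise  # any other HTTP error is a real failure


def iter_changelog(
    start: int = 1,
    fetch: Callable[[int], Optional[dict]] = fetch_changelog_entry,
) -> Iterator[dict]:
    """Yield changelog entries in order until fetch reports a missing index."""
    index = start
    while True:
        entry = fetch(index)
        if entry is None:
            return
        yield entry
        index += 1
```

Writing each yielded entry with json.dumps on its own line reproduces the one-line-per-entry layout. Because the history is immutable, a later run can pass the last index already seen plus one as start to pick up only new entries.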