Welcome, Welcome, Welcome!

This guide you are reading contains:

  • a high-level introduction to the Fatcat catalog and software
  • a bibliographic style guide for editors, also useful for understanding metadata found in the catalog
  • technical details and guidance for use of the catalog's public REST API, for developers building bots, services, or contributing to the server software
  • policies and licensing details for all contributors and downstream users of the catalog

What is Fatcat?

Fatcat is an open bibliographic catalog of written works. The scope of works is somewhat flexible, with a focus on published research outputs like journal articles, pre-prints, and conference proceedings. Records are collaboratively editable, versioned, available in bulk form, and include URL-agnostic file-level metadata.

Both the Fatcat software and the metadata stored in the service are free (in both the libre and gratis sense) for others to share, reuse, fork, or extend. See Policies for licensing details, and Sources for attribution of the foundational metadata corpuses we build on top of.

Fatcat is currently used internally at the Internet Archive, but interested folks are welcome to contribute to its design and development, and we hope to ultimately crowd-source corrections and additions to bibliographic metadata, and receive direct automated feeds of new content.

You can contact the Archive by email at webservices@archive.org, or the author directly at bnewbold@archive.org.

High-Level Overview

This section gives an introduction to:

  • the goals of the project, and how it relates to the rest of the Open Access and archival ecosystem
  • how catalog data is represented as entities and revisions with full edit history, and how entities are referred to and cross-referenced with identifiers
  • how humans and bots propose changes to the catalog, and how these changes are reviewed
  • the major sources of bulk and continuously updated metadata that form the foundation of the catalog
  • a rough sketch of the software back-end, database, and libraries
  • roadmap for near-future work

Project Goals and Ecosystem Niche

The Internet Archive has two primary use cases for Fatcat:

  • Tracking the "completeness" of our holdings against all known published works. In particular, this allows us to monitor progress, identify gaps, and prioritize further collection work.
  • Serving as a public-facing catalog and access mechanism for our open access holdings.

In the larger ecosystem, Fatcat could also provide:

  • A work-level (as opposed to title-level) archival dashboard: what fraction of all published works are preserved in archives? KBART, CLOCKSS, Portico, and other preservation networks don't provide granular metadata
  • A collaborative, independent, non-commercial, fully-open, field-agnostic, "completeness"-oriented catalog of scholarly metadata
  • Unified (centralized) foundation for discovery and access across repositories and archives: discovery projects can focus on user experience instead of building their own catalog from scratch
  • Research corpus for meta-science, with an emphasis on availability and reproducibility (metadata corpus itself is open access, and file-level hashes control for content drift)
  • Foundational infrastructure for distributed digital preservation
  • On-ramp for non-traditional digital works (web-native and "grey literature") into the scholarly web


What types of works should be included in the catalog?

The goal is to capture the "scholarly web": the graph of written works that cite other works. Any work that is both cited more than once and cites more than one other work in the catalog is likely to be in scope. "Leaf nodes" and small islands of intra-cited works may or may not be in scope.

Fatcat does not include any fulltext content itself, even for clearly licensed open access works, but does have verified hyperlinks to fulltext content, and includes file-level metadata (hashes and fingerprints) to help identify content from any source. File-level URLs with context ("repository", "publisher", "webarchive") should make Fatcat more useful for both humans and machines to quickly access fulltext content of a given mimetype than existing redirect or landing page systems. So another factor in deciding scope is whether a work has "digital fixity" and can be contained in immutable files or can be captured by web archives.

References and Previous Work

The closest overall analog of Fatcat is MusicBrainz, a collaboratively edited music database. Open Library is a very similar existing service, which exclusively contains book metadata.

Wikidata seems to be the most successful and actively edited/developed open bibliographic database at this time (early 2018), including the wikicite conference and related Wikimedia/Wikipedia projects. Wikidata is a general purpose semantic database of entities, facts, and relationships; bibliographic metadata has become a large fraction of all content in recent years. The focus there seems to be linking knowledge (statements) to specific sources unambiguously. Potential advantages Fatcat has are a focus on a specific scope (not a general-purpose database of entities) and a goal of completeness (capturing as many works and relationships as rapidly as possible). With so much overlap, the two efforts might merge in the future.

The technical design of Fatcat is loosely inspired by the git branch/tag/commit/tree architecture, and specifically inspired by Oliver Charles' "New Edit System" blog posts from 2012.

There are a number of proprietary, for-profit bibliographic databases, including Web of Science, Google Scholar, Microsoft Academic Graph, aminer, Scopus, and Dimensions. There are excellent field-limited databases like dblp, MEDLINE, and Semantic Scholar. Large, general-purpose databases also exist that are not directly user-editable, including the OpenCitation corpus, CORE, BASE, and CrossRef. We do not know of any large (more than 60 million works), open (bulk-downloadable with permissive or no license), field agnostic, user-editable corpus of scholarly publication bibliographic metadata.

Further Reading

"From ISIS to CouchDB: Databases and Data Models for Bibliographic Records" by Luciano G. Ramalho. code4lib, 2013. https://journal.code4lib.org/articles/4893

"Representing bibliographic data in JSON". github README file, 2017. https://github.com/rdmpage/bibliographic-metadata-json

"Citation Style Language", https://citationstyles.org/

"Functional Requirements for Bibliographic Records", Wikipedia article, https://en.wikipedia.org/wiki/Functional_Requirements_for_Bibliographic_Records

OpenCitations and I4OC: http://opencitations.net/, https://i4oc.org/

Data Model

Entity Types and Ontology

Loosely following "Functional Requirements for Bibliographic Records" (FRBR), but removing the "manifestation" abstraction, and favoring files (digital artifacts) over physical items, the primary bibliographic entity types are:

  • work: representing an abstract unit of creative output. Does not contain any metadata itself; used only to group release entities. For example, a journal article could be posted as a pre-print, published on a journal website, translated into multiple languages, and then re-published (with minimal changes) as a book chapter; these would all be variants of the same work.
  • release: a specific "release" or "publicly published" version of a work. Contains traditional bibliographic metadata (title, date of publication, media type, language, etc). Has relationships to other entities:
    • child of a single work (required)
    • multiple creator entities as "contributors" (authors, editors)
    • outbound references to multiple other release entities
    • member of a single container, for example a journal or book series
  • file: a single concrete, fixed digital artifact; a manifestation of one or more releases. Machine-verifiable metadata includes file hashes, size, and detected file format. Verified URLs link to locations on the open web where this file can be found or has been archived. Has relationships:
    • multiple release entities that this file is a complete manifestation of (almost always a single release)
  • fileset: a list of multiple concrete files, together forming a complete release manifestation. Primarily intended for datasets and supplementary materials; could also contain a paper "package" (source file and figures).
  • webcapture: a single snapshot (point in time) of a webpage or small website (multiple pages) which is a complete manifestation of a release. Not a landing page or page referencing the release.
  • creator: persona (pseudonym, group, or specific human name) that has contributed to one or more releases. Not necessarily one-to-one with a human person.
  • container (aka "venue", "serial", "title"): a grouping of releases from a single publisher.

Note that, compared to many similar bibliographic ontologies, the current one does not have entities to represent:

  • physical artifacts, either generically or specific copies
  • funding sources
  • publishing entities
  • "events at a time and place"

Each entity type has its own relations and fields (captured in a schema), but there are also generic operations and fields common across all entities. The API for creating, updating, querying, and inspecting entities is roughly the same regardless of type.

Identifiers and Revisions

A specific version of any entity in the catalog is called a "revision". Revisions are generally immutable (do not change and are not editable), and are not normally referred to directly. Instead, persistent "fatcat identifiers" can be created, which "point to" a single revision at a time. This distinction means that entities referred to by an identifier can change over time (as metadata is corrected and expanded). Revision objects do not "point" back to specific identifiers, so they are not the same as a simple "version number" for an identifier.

Identifiers also have the ability to be merged (by redirecting one identifier to another) and "deleted" (by pointing the identifier to no revision at all). All changes to identifiers are captured as an "edit" object. Edit history can be fetched and inspected on a per-identifier basis, and any changes can easily be reverted (even merges/redirects and "deletion").
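The relationship between identifiers, revisions, and edits can be sketched in a few lines of Python. This is only an illustrative in-memory model, not the Fatcat implementation: the class names (`Revision`, `Edit`, `Catalog`), the method names, and the integer revision numbers are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Revision:
    """Immutable snapshot of entity metadata; never edited in place."""
    rev_id: int
    title: str


@dataclass
class Edit:
    """One change to one identifier, with enough state to undo it."""
    ident: str
    prev_rev: Optional[int]
    new_rev: Optional[int]
    prev_redirect: Optional[str] = None
    new_redirect: Optional[str] = None


@dataclass
class Catalog:
    revisions: Dict[int, Revision] = field(default_factory=dict)
    pointers: Dict[str, Optional[int]] = field(default_factory=dict)  # None = deleted
    redirects: Dict[str, str] = field(default_factory=dict)
    history: List[Edit] = field(default_factory=list)

    def _record(self, edit: Edit) -> None:
        # Move the identifier pointer and remember the edit.
        self.pointers[edit.ident] = edit.new_rev
        if edit.new_redirect is None:
            self.redirects.pop(edit.ident, None)
        else:
            self.redirects[edit.ident] = edit.new_redirect
        self.history.append(edit)

    def update(self, ident: str, title: str) -> None:
        # Corrections create a new revision; old revisions are kept.
        rev = Revision(len(self.revisions) + 1, title)
        self.revisions[rev.rev_id] = rev
        self._record(Edit(ident, self.pointers.get(ident), rev.rev_id,
                          self.redirects.get(ident)))

    def delete(self, ident: str) -> None:
        # "Deletion" points the identifier at no revision at all.
        self._record(Edit(ident, self.pointers.get(ident), None,
                          self.redirects.get(ident)))

    def merge(self, ident: str, target: str) -> None:
        # Merging redirects one identifier to another.
        self._record(Edit(ident, self.pointers.get(ident),
                          self.pointers.get(target),
                          self.redirects.get(ident), target))

    def revert_last(self, ident: str) -> None:
        # Reverting is itself an edit that restores the previous state.
        last = next(e for e in reversed(self.history) if e.ident == ident)
        self._record(Edit(ident, last.new_rev, last.prev_rev,
                          last.new_redirect, last.prev_redirect))

    def resolve(self, ident: str) -> Optional[Revision]:
        ident = self.redirects.get(ident, ident)
        rev_id = self.pointers.get(ident)
        return self.revisions.get(rev_id) if rev_id is not None else None
```

Because a revert is recorded as a new edit, the history stays append-only even when a merge or deletion is undone.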

"Work in progress" or "proposed" updates are staged as edit objects without updating the identifiers themselves.

Controlled Vocabularies

Some individual fields have additional constraints, either in the form of pattern validation ("values must be upper case, contain only certain characters"), or membership in a fixed set of values. These may include:

  • license and open access status
  • work "types" (article vs. book chapter vs. proceeding, etc)
  • contributor types (author, translator, illustrator, etc)
  • human languages
  • identifier namespaces (DOI, ISBN, ISSN, ORCID, etc; but not the identifiers themselves)
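As a rough illustration of the two kinds of constraint, the sketch below checks one field by pattern and another by set membership. The role list and the function names are hypothetical and chosen only for the example; the ISSN shape (four digits, a hyphen, three digits, then a digit or upper-case X) is the standard printed form.

```python
import re

# Hypothetical fixed vocabulary; the real catalog's allowed values may differ.
CONTRIBUTOR_ROLES = frozenset({"author", "editor", "translator", "illustrator"})

# Pattern constraint: NNNN-NNNC, where C is a digit or an upper-case X.
ISSN_PATTERN = re.compile(r"[0-9]{4}-[0-9]{3}[0-9X]")


def check_role(value: str) -> bool:
    """Membership constraint: the value must be one of a fixed set."""
    return value in CONTRIBUTOR_ROLES


def check_issn(value: str) -> bool:
    """Pattern constraint: checks shape only, not the check digit or registration."""
    return ISSN_PATTERN.fullmatch(value) is not None
```

A check like `check_issn` only confirms that a string is well formed; whether the identifier is actually registered is a separate question.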

Other fixed-set "vocabularies" become too large to easily maintain or express in code. These could be added to the backend databases, or be enforced by bots (instead of the system itself). These mostly include externally-registered identifiers or types, such as:

  • file mimetypes
  • identifiers themselves (DOI, ORCID, etc), by checking for registration against canonical APIs and databases

Global Edit Changelog

As part of the process of "accepting" an edit group, a row is written to an immutable, append-only table (which internally is a SQL table) documenting each identifier change. This changelog establishes a monotonically increasing version number for the entire corpus, and should make interaction with other systems easier (eg, search engines, replicated databases, alternative storage backends, notification frameworks, etc.).


Basic Editing Workflow and Bots

Both human editors and bots should have edits go through the same API, with humans using either the default web interface, client software, or third-party integrations.

The normal workflow is to create edits (or updates, merges, deletions) on individual entities. Individual changes are bundled into an "edit group" of related edits (eg, correcting authorship info for multiple works related to a single author). When ready, the editor "submits" the edit group for review. During the review period, human editors vote and bots can perform automated checks. During this period the editor can make tweaks if necessary. After some fixed time period (one week?) with no changes and no blocking issues, the edit group would be accepted if no merge conflicts have been created by other edits to the same entities. This process balances editing labor (reviews are easy, but optional) against quality (cool-down period makes it easier to detect and prevent spam or out-of-control bots). More sophisticated roles and permissions could allow certain humans and bots to push through edits more rapidly (eg, importing new works from a publisher API).

Bots need to be tuned to have appropriate edit group sizes (eg, daily batches, instead of millions of works in a single edit) to make human QA review and reverts manageable.

Data provenance and source references are captured in the edit metadata, instead of being encoded in the entity data model itself. In the case of importing external databases, the expectation is that special-purpose bot accounts are used, which tag timestamps and external identifiers in the edit metadata. Human editors can leave edit messages to clarify their sources.

A style guide and discussion forum are intended to be hosted as separate stand-alone services for editors to propose projects and debate process or scope changes. These services should have unified accounts and logins (OAuth?) for consistent account IDs across all services.


The core metadata bootstrap sources, by entity type, are:

  • releases: Crossref metadata, with DOIs as the primary identifier, and PubMed (central), Wikidata, and CORE identifiers cross-referenced
  • containers: munged metadata from the DOAJ, ROAD, and Norwegian journal list, with ISSN-Ls as the primary identifier. ISSN provides an "ISSN to ISSN-L" mapping to normalize electronic and print ISSN numbers.
  • creators: ORCID metadata and identifier.

Initial file metadata and matches (file-to-release) come from earlier Internet Archive matching efforts, and in particular efforts to extract bibliographic metadata from PDFs (using GROBID) and fuzzy match (with conservative settings) to Crossref metadata.

The intent is to continuously ingest and merge metadata from a small number of large (~2-3 million or more records) general-purpose aggregators and catalogs in a centralized fashion, using bots, and then support volunteers and organizations in writing bots to merge high-quality metadata from field or institution-specific catalogs.

Provenance information (where the metadata comes from, or who "makes specific claims") is stored in edit metadata in the data model. Value-level attribution can be achieved by looking at the full edit history for an entity as a series of patches.


The canonical backend datastore exposes a microservice-like HTTP API, which could be extended with gRPC or GraphQL interfaces. The initial datastore is a transactional SQL database, but this implementation detail is abstracted by the API.

As little "application logic" as possible should be embedded in this back-end; as much as possible would be pushed to bots which could be authored and operated by anybody. A separate web interface project talks to the API backend and can be developed more rapidly with less concern about data loss or corruption.

A cronjob will create periodic database dumps, both in "full" form (all tables and all edit history, removing only authentication credentials) and "flattened" form (with only the most recent version of each entity).

One design goal is to be linked-data/RDF/JSON-LD/semantic-web "compatible", but not necessarily "first". It should be possible to export the database in a relatively clean RDF form, and to fetch data in a variety of formats, but internally Fatcat is not backed by a triple-store, and is not tied to any specific third-party ontology or schema.

Microservice daemons should be able to proxy between the primary API and standard protocols like ResourceSync and OAI-PMH, and third party bots can ingest or synchronize the database in those formats.

Fatcat Identifiers

Fatcat identifiers are semantically meaningless fixed-length random numbers, usually represented in case-insensitive base32 format. Each entity type has its own identifier namespace.

128-bit (UUID size) identifiers encode as 26 characters (but note that not all such strings decode to valid UUIDs), and in the backend can be serialized in UUID columns:


In comparison, 96-bit identifiers would have 20 characters and look like:


and 64-bit:


Fatcat identifiers can be used to interlink between databases, but are explicitly not intended to supplant DOIs, ISBNs, handles, ARKs, and other "registered" persistent identifiers for general use.
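The character counts quoted above follow from base32 carrying 5 bits per character. The sketch below shows one way to convert between a UUID and a 26-character lower-case string using Python's standard library. It assumes the RFC 4648 base32 alphabet with padding stripped, which this guide does not specify, and the function names are made up for the example.

```python
import base64
import uuid


def uuid_to_ident(value: uuid.UUID) -> str:
    """Encode the 16 raw UUID bytes as 26 lower-case base32 characters."""
    raw = base64.b32encode(value.bytes).decode("ascii")
    return raw.rstrip("=").lower()  # drop the 6 trailing "=" padding characters


def ident_to_uuid(ident: str) -> uuid.UUID:
    """Decode a 26-character identifier back into a UUID (case-insensitive)."""
    if len(ident) != 26:
        raise ValueError("expected 26 base32 characters")
    padded = ident.upper() + "======"  # restore the padding b32decode expects
    return uuid.UUID(bytes=base64.b32decode(padded))


def encoded_length(bits: int) -> int:
    """Number of base32 characters needed for `bits` bits (ceiling of bits / 5)."""
    return -(-bits // 5)
```

With these definitions, `encoded_length(128)` is 26 and `encoded_length(96)` is 20, matching the lengths given in the text. Since 26 characters carry 130 bits, two more than a UUID holds, some 26-character strings have no corresponding UUID.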

Internal Schema

Internally, identifiers are lightweight pointers to "revisions" of an entity. Revisions are stored in their complete form, not as a patch or difference; if comparing to distributed version control systems (for managing changes to source code), this follows the git model, not the mercurial model.

The entity revisions are immutable once accepted; the editing process involves the creation of new entity revisions and, if the edit is approved, pointing the identifier to the new revision. Entities cross-reference between themselves by identifier not revision number. Identifier pointers also support (versioned) deletion and redirects (for merging entities).

Edit objects represent a change to a single entity; edits get batched together into edit groups (like "commits" and "pull requests" in git parlance).

SQL tables look something like this (with separate tables for each entity type, a la work_revision and work_edit):

    entity_ident
        id (uuid)
        current_revision (entity_revision foreign key)
        redirect_id (optional; points to another entity_ident)
        is_live (boolean; whether newly created entity has been accepted)

    entity_revision
        <all entity-style-specific fields>
        extra: json blob for schema evolution

    entity_edit
        editgroup_id (editgroup foreign key)
        ident (entity_ident foreign key)
        new_revision (entity_revision foreign key)
        new_redirect (optional; points to entity_ident table)
        previous_revision (optional; points to entity_revision)
        extra: json blob for provenance metadata

    editgroup
        editor_id (editor table foreign key)
        extra: json blob for provenance metadata

An individual entity can be in the following "states", from which the given actions (transition) can be made:

  • wip (not live; not redirect; has rev)
    • activate (to active)
  • active (live; not redirect; has rev)
    • redirect (to redirect)
    • delete (to deleted)
  • redirect (live; redirect; rev or not)
    • split (to active)
    • delete (to deleted)
  • deleted (live; not redirect; no rev)
    • redirect (to redirect)
    • activate (to active)
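The states and transitions listed above can be written down as a small lookup table. This is a sketch only: the function names and the three-flag signature (live, redirect target, has a revision) are assumptions for illustration, not the server's actual code.

```python
from typing import Optional

# state -> {action: resulting state}, transcribed from the list above
TRANSITIONS = {
    "wip": {"activate": "active"},
    "active": {"redirect": "redirect", "delete": "deleted"},
    "redirect": {"split": "active", "delete": "deleted"},
    "deleted": {"redirect": "redirect", "activate": "active"},
}


def classify(is_live: bool, redirect_id: Optional[str], has_rev: bool) -> str:
    """Derive the state name from the three flags used in the list above."""
    if not is_live:
        # A not-yet-accepted entity must have a revision and no redirect.
        if redirect_id is not None or not has_rev:
            raise ValueError("a WIP entity cannot be a redirect or deleted")
        return "wip"
    if redirect_id is not None:
        return "redirect"  # a redirect may or may not carry a revision
    return "active" if has_rev else "deleted"


def apply_action(state: str, action: str) -> str:
    """Return the new state, or raise if the action is not allowed."""
    try:
        return TRANSITIONS[state][action]
    except KeyError:
        raise ValueError(f"cannot {action!r} an entity in state {state!r}") from None
```

Keeping the table as plain data makes it easy to check against the list by eye.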

"WIP, redirect" or "WIP, deleted" are invalid states.

Additional entity-specific columns hold actual metadata. Additional tables (which reference both entity_revision and entity_id foreign keys as appropriate) represent things like authorship relationships (creator/release), citations between works, etc. Every revision of an entity requires duplicating all of these associated rows, which could end up being a large source of inefficiency, but is necessary to represent the full history of an object.


Core unimplemented features (as of February 2019) include:

  • rate-limiting and spam/abuse mitigation
  • actual entity creation, editing, deleting through the web interface
  • several web interface views (eg, editor-specific changelog, recent changes)
  • "work agglomeration", merging related releases under the same work
  • linking known citations (we know DOI or PMID of "target", but haven't updated reference to point to fatcat ident)

Contributions would be helpful to implement:

  • import (bulk and/or continuous updates) for more metadata sources
  • better handling of work/release distinction in, eg, search results and citation counting
  • de-duplication (via merging) for all entity types
  • matching improvements, eg, for references (citations), contributions (authorship), work grouping, and file/release matching
  • internationalization of the web interface (translation to multiple languages)
  • accessibility review of user interface

Possible breaking API and schema changes:

  • move all edit endpoints under /editgroup/<editgroup_id>/..., instead of having an editgroup_id query parameter
  • rename release_status to release_stage
  • handle retractions/withdrawals with withdrawn and withdrawn_date release fields, and retracted status
  • new entity type for research institutions, to track author affiliation. Use the new (2019) ROR identifier/registry
  • container nesting, or some method to handle conferences (event vs. series) and other "series" or "group" containers
  • include more author name metadata (display, sur, given) in contribs, and potentially references. Need this to format citations properly (CSL) when we don't have full author linkage

Other longer term projects could include:

  • full-text search over release files
  • bi-directional synchronization with other user-editable catalogs, such as Wikidata
  • alternate/enhanced backend to store full edit history without overloading traditional relational database
  • make external identifiers generic, instead of having a fixed (indexed) list. Eg, extid table for every entity rev, with string ("issn:1234-5678") or structure ('{type: "issn", value: "1234-5678"}')
  • URLs for entities. Have avoided so far, in lieu of external identifiers or web captures
  • "save paper now" feature in web interface
  • generic tagging of entities. Needs design/scoping; a separate service? editor-specific? tag by slugs, free-form text, or wikidata entities? "delicious for papers"? Something as an alternative to traditional hierarchical categorization.
  • first-class support for books: additional external identifiers, metadata tweaks, bulk import of MARC or other metadata records, matching to DOAB and other open-access book collections

Known Issues

  • changelog index may have gaps due to PostgreSQL sequence and transaction roll-back behavior
  • search is idiosyncratic: does not cover contrib names by default, and some queries cause errors (eg, "N/A" without quotes)

Unresolved Questions

How to handle translations of, eg, titles and author names? To be clear, not translations of works (which are just separate releases), these are more like aliases or "originally known as".

Should external identifiers be made generic? Eg, instead of having arxiv_id as a column, have a table of arbitrary identifiers, with either an extid_type or just use a prefix like arxiv:someid.

Should contributor/author affiliation and contact information be retained? It could be very useful for disambiguation, but we don't want to build a huge database for "marketing" and other spam.

Can general-purpose SQL databases like Postgres or MySQL scale well enough to hold several tables with billions of entity revisions? Right from the start there are hundreds of millions of works and releases, many of which have dozens of citations, many authors, and many identifiers, and then we'll have potentially dozens of edits for each of these. This multiplies out to `1e8 * 2e1 * 2e1 = 4e10`, or 40 billion rows in the citation table. If each row was 32 bytes on average (uncompressed, not including index size), that would be 1.3 TByte on its own, larger than common SSD disks. I do think a transactional SQL datastore is the right answer. In my experience locking and index rebuild times are usually the biggest scaling challenges; the largely-immutable architecture here should mitigate locking. Hopefully few indexes would be needed in the primary database, as user interfaces could rely on secondary read-only search engines for more complex queries and views.
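The back-of-the-envelope estimate can be reproduced with a few lines of Python, which also makes it easy to try other assumptions. The function and parameter names are invented for this example; the default values are the round numbers used in the paragraph above.

```python
def citation_table_estimate(
    releases: float = 1e8,             # order of magnitude for works and releases
    citations_per_release: float = 2e1,
    edits_per_release: float = 2e1,
    bytes_per_row: int = 32,           # uncompressed, not including indexes
):
    """Return (row count, size in terabytes) for the citation table."""
    rows = releases * citations_per_release * edits_per_release
    size_tb = rows * bytes_per_row / 1e12
    return rows, size_tb


rows, size_tb = citation_table_estimate()
print(f"{rows:.0e} rows, about {size_tb:.2f} TB")
```

With the defaults this gives 4e10 rows and 1.28 TB, which the text rounds to 1.3 TByte.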

There is a tension between focus and scope creep. If a central database like Fatcat doesn't support enough fields and metadata, then it will not be possible to completely import other corpuses, and this becomes "yet another" partial bibliographic database. On the other hand, accepting arbitrary data leads to other problems: sparseness increases (we have more "partial" data), potential for redundancy is high, humans will start editing content that might be bulk-replaced, etc.

There might be a need to support "stub" references between entities. Eg, when adding citations from PDF extraction, the cited works are likely to be ambiguous. Could create "stub" works to be merged/resolved later, or could leave the citation hanging. Same with authors, containers (journals), etc.

Cataloging Style Guide

Language and Translation of Metadata

The Fatcat data model does not include multiple titles or names for the same entity, or even a "native"/"international" representation as seems common in other bibliographic systems. This most notably applies to release titles, but also to container and publisher names, and likely other fields.

For now, editors must use their own judgment over whether to use the title of the release listed in the work itself.

This is not to be confused with translations of entire works, which should be treated as an entirely separate release.

External Identifiers

"Fake identifiers", which are actually registered and used in examples and documentation (such as DOI 10.5555/12345678) are allowed (and the entity should be tagged as a fake or example). Non-registered "identifier-like strings", which are semantically valid but not registered, should not exist in Fatcat metadata in an identifier column. Invalid identifier strings can be stored in "extra" metadata. Crossref has blogged about this distinction.


All DOIs stored in an entity column should be registered (aka, should be resolvable from doi.org). Invalid identifiers may be cleaned up or removed by bots.

DOIs should always be stored and transferred in lower-case form. Note that there are almost no other constraints on DOIs (and handles in general): they may contain multiple forward slashes or whitespace, be of arbitrary length, etc. Crossref has a number of examples of such "valid" but frustratingly formatted strings.
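A bot or client might apply the lower-case rule with a small helper like the one below. This is a hedged sketch: the function name and the list of prefixes it strips are assumptions, it trims only leading and trailing whitespace, and it does not check registration against doi.org.

```python
# Common ways a DOI is written in the wild; this list is illustrative, not exhaustive.
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Strip a resolver or 'doi:' prefix and return the DOI in lower case."""
    doi = raw.strip()
    lowered = doi.lower()
    for prefix in _PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    doi = doi.lower()
    # Minimal shape check only: DOIs start with "10." and have a slash after the prefix.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"does not look like a DOI: {raw!r}")
    return doi
```

For instance, `normalize_doi("https://doi.org/10.1000/ABC.Def")` returns `"10.1000/abc.def"`. The interior of the string is left untouched, since slashes and other unusual characters can be part of a valid DOI.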

In the Fatcat ontology, DOIs and release entities are one-to-one.

It is the intention to automatically (via bot) create a Fatcat release for every Crossref-registered DOI from a whitelist of media types ("journal-article" etc, but not all), and it would be desirable to auto-create entities for in-scope publications from all registrars. It is not the intention to auto-create a release for every registered DOI. In particular, "sub-component" DOIs (eg, for an individual figure or table from a publication) aren't currently auto-created, but could be stored in "extra" metadata, or on a case-by-case basis.

Human Names

Representing names of human beings in databases is a fraught subject. For some background reading, see:

Particularly difficult issues in the context of a bibliographic database include the non-universal concept of "family" vs. "given" names and their relationship to first and last names; the inclusion of honorary titles and other suffixes and prefixes to a name; the distinction between "preferred", "legal", and "bibliographic" names, or other situations where a person may not wish to be known under the name they are commonly referred to under; language and character set issues; and pseudonyms, anonymous publications, and fake personas (perhaps representing a group, like Bourbaki).

The general guidance for Fatcat is to:

  • not be a "source of truth" for representing a persona or human being; ORCID and Wikidata are better suited to this task
  • represent author personas, not necessarily 1-to-1 with human beings
  • prioritize the concerns of a reader or researcher over that of the author
  • enable basic interoperability with external databases, file formats, schemas, and style guides
  • when possible, respect the wishes of individuals

The data model for the creator entity has three name fields:

  • surname and given_name: needed for "aligning" with external databases, and to export metadata to many standard formats
  • display_name: the "preferred" representation for display of the entire name, in the context of international attribution of authorship of a written work

Names do not necessarily need to be expressed in a Latin character set, but also do not necessarily need to be in the native language of the creator or the language of their notable works.

Ideally all three fields are populated for all creators.

It seems likely that this schema and guidance will need review. "Extra" metadata can be used to store aliases and alternative representations, which may be useful for disambiguation and automated de-duplication.

Editgroups and Meta-Meta-Data

Editors are expected to group their edits in semantically meaningful editgroups of a reasonable size for review and acceptance. For example, merging two creators and updating related releases could all go in a single editgroup. Large refactors, conversions, and imports, which may touch thousands of entities, should be grouped into reasonable size editgroups; extremely large editgroups may cause technical issues, and make review unmanageable. 50 edits is a decent batch size, and 100 is a good upper limit (and may be enforced by the server).
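For bulk imports, a bot can split its pending edits into batches before creating editgroups. The helper below is a minimal sketch of that chunking step only; the name `batch_edits` and the constant are hypothetical, and creating and submitting each editgroup through the API is left out.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

# Suggested batch size from the guidance above; 100 is the suggested upper limit.
DEFAULT_BATCH_SIZE = 50


def batch_edits(edits: Sequence[T], size: int = DEFAULT_BATCH_SIZE) -> Iterator[List[T]]:
    """Yield consecutive slices of `edits`, each with at most `size` items."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(edits), size):
        yield list(edits[start:start + size])
```

Splitting 120 pending edits this way yields batches of 50, 50, and 20, each of which would become its own editgroup.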

Common Entity Fields

All entities have:

  • extra: free-form JSON metadata

The "extra" field is an "escape hatch" to include extra fields not in the regular schema. It is intended to enable gradual evolution of the schema, as well as accommodating niche or field-specific content. Reasonable care should be taken with this extra metadata: don't include large text or binary fields, hundreds of fields, duplicate metadata, etc.

Container Entity Reference


  • name (string, required): The title of the publication, as used in international indexing services. Eg, "Journal of Important Results". Not necessarily in the native language, but also not necessarily in English. Alternative titles (and translations) can be stored in "extra" metadata (see below)
  • container_type (string): eg, journal vs. conference vs. book series. Controlled vocabulary is TODO.
  • publisher (string): The name of the publishing organization. Eg, "Society of Curious Students".
  • issnl (string): an external identifier, with registration controlled by the ISSN organization. Registration is relatively inexpensive and easy to obtain (depending on world region), so almost all serial publications have one. The ISSN-L ("linking ISSN") is one of either the print ("ISSNp") or electronic ("ISSNe") identifiers for a serial publication; not all publications have both types of ISSN, but many do, which can cause confusion. The ISSN master list is not gratis/public, but the ISSN-L mapping is.
  • wikidata_qid (string): external linking identifier to a Wikidata entity.

extra Fields

  • abbrev (string): a commonly used abbreviation for the publication, as used in citations, following the ISO 4 standard. Eg, "Journal of Polymer Science Part A" -> "J. Polym. Sci. A"
  • coden (string): an external identifier, the CODEN code. 6 characters, all upper-case.
  • issnp (string): Print ISSN
  • issne (string): Electronic ISSN
  • default_license (string, slug): short name (eg, "CC-BY-SA") for the default/recommended license for works published in this container
  • original_name (string): native name (if name is translated)
  • platform (string): hosting platform: OJS, wordpress, scielo, etc
  • mimetypes (array of string): formats that this container publishes all works under (eg, 'application/pdf', 'text/html')
  • first_year (integer): first year of publication
  • last_year (integer): final year of publication (implies that container is no longer active)
  • languages (array of strings): ISO codes; the first entry is considered the "primary" language (if that makes sense)
  • country (string): ISO abbreviation (two characters) for the country this container is published in
  • aliases (array of strings): significant alternative names or abbreviations for this container (not just capitalization/punctuation)
  • region (string, slug): continent/world-region (vocabulary is TODO)
  • discipline (string, slug): highest-level subject area (vocabulary is TODO)
  • urls (array of strings): known homepage URLs for this container (first in array is default)

Additional fields used in analytics and "curation" tracking:

  • doaj (object)
    • as_of (string, ISO datetime): datetime of most recent check; if not set, not actually in DOAJ
    • seal (bool): has DOAJ seal
    • work_level (bool): whether work-level publications are registered with DOAJ
    • archive (array of strings): preservation archives
  • road (object)
    • as_of (string, ISO datetime): datetime of most recent check; if not set, not actually in ROAD
  • kbart (object)
    • lockss, clockss, portico, jstor etc (object)
      • year_spans (array of arrays of integers (pairs)): year spans (inclusive) for which the given archive has preserved this container
      • volume_spans (array of arrays of integers (pairs)): volume spans (inclusive) for which the given archive has preserved this container
  • sherpa_romeo (object):
    • color (string): the SHERPA/RoMEO "color" of the publisher of this container
  • doi: TODO: include list of prefixes and which (if any) DOI registrar is used
  • dblp (object):
    • id (string)
  • ia (object): Internet Archive specific fields
    • sim (object): same format as kbart preservation above; coverage in microfilm collection
    • longtail (bool): is this considered a "long-tail" open access venue

For KBART and other "coverage" fields, we "over-count" on the assumption that works with "in-progress" status will soon actually be preserved. Elements of these arrays are either an integer (means that single year is preserved), or an array of length two (meaning everything between the two numbers (inclusive) is preserved).
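
As a minimal sketch of how such a coverage array might be interpreted, the helper below checks whether a given year falls inside any element, accepting both the single-integer and the two-element (inclusive) forms described above. The function name `year_is_covered` is hypothetical and not part of any Fatcat library.

```python
from typing import List, Union

# One element of a coverage array: a single year, or an inclusive [start, end] pair.
Span = Union[int, List[int]]


def year_is_covered(year: int, spans: List[Span]) -> bool:
    """Return True if `year` is inside any element of a coverage array."""
    for span in spans:
        if isinstance(span, int):
            # A bare integer means that single year is preserved.
            if span == year:
                return True
        else:
            # A pair means everything between the two numbers, inclusive.
            start, end = span
            if start <= year <= end:
                return True
    return False
```

For example, `year_is_covered(1999, [[1990, 2000], 2005])` is true, while the year 2003 is not covered by the same array.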

Creator Entity Reference


  • display_name (string, required): Full name, as will be displayed in user interfaces. Eg, "Grace Hopper"
  • given_name (string): Also known as "first name". Eg, "Grace".
  • surname (string): Also known as "last name". Eg, "Hopper".
  • orcid (string): external identifier, as registered with ORCID.
  • wikidata_qid (string): external linking identifier to a Wikidata entity.

See also "Human Names" sub-section of style guide.

File Entity Reference


  • size (integer, positive, non-zero): Size of file in bytes. Eg: 1048576.
  • md5 (string): MD5 hash in lower-case hex. Eg: "d41efcc592d1e40ac13905377399eb9b".
  • sha1 (string): SHA-1 hash in lower-case hex. Not technically required, but the most-used of the hash fields and should always be included. Eg: "f013d66c7f6817d08b7eb2a93e6d0440c1f3e7f8".
  • sha256 (string): SHA-256 hash in lower-case hex. Eg: "a77e4c11a57f1d757fca5754a8f83b5d4ece49a2d28596889127c1a2f3f28832".
  • urls: An array of "typed" URLs. Order is not meaningful, and may not be preserved.
    • url (string, required): Eg: "https://example.edu/~frau/prcding.pdf".
    • rel (string, required): Eg: "webarchive".
  • mimetype (string): Format of the file. If XML, specific schema can be included after a +. Example: "application/pdf"
  • release_ids (array of string identifiers): references to release entities that this file represents a manifestation of. Note that a single file can contain multiple release references (eg, a PDF containing a full issue with many articles), and that a release will often have multiple files (differing only by watermarks, or different digitizations of the same printed work, or variant MIME/media types of the same published work).
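
The size and hash fields above can be computed in a single pass over a local file. This is a rough sketch using only the Python standard library; the function name `file_metadata` and the shape of the returned dict are illustrative assumptions, not an official client API.

```python
import hashlib


def file_metadata(path: str, chunk_size: int = 1 << 20) -> dict:
    """Compute size plus md5/sha1/sha256 (lower-case hex) for a local file."""
    hashers = {name: hashlib.new(name) for name in ("md5", "sha1", "sha256")}
    size = 0
    with open(path, "rb") as f:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            size += len(chunk)
            for hasher in hashers.values():
                hasher.update(chunk)
    meta = {"size": size}
    # hexdigest() already returns lower-case hex, as the schema expects.
    meta.update({name: hasher.hexdigest() for name, hasher in hashers.items()})
    return meta
```

The resulting dict could be merged with urls, mimetype, and release_ids to form a file entity body.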

Fileset Entity Reference


Warning: This schema is not yet stable.

  • manifest (array of objects): each entry represents a file
    • path (string, required): relative path to file (including filename)
    • size (integer, required): in bytes
    • md5 (string): MD5 hash in lower-case hex
    • sha1 (string): SHA-1 hash in lower-case hex
    • sha256 (string): SHA-256 hash in lower-case hex
    • extra (object): any extra metadata about this specific file
  • urls: An array of "typed" URLs. Order is not meaningful, and may not be preserved.
    • url (string, required): Eg: "https://example.edu/~frau/prcding.pdf".
    • rel (string, required): Eg: "webarchive".
  • release_ids (array of string identifiers): references to release entities

Web Capture Entity Reference


Warning: This schema is not yet stable.

  • cdx (array of objects): each entry represents a distinct web resource (URL). First is considered the primary/entry. Roughly aligns with CDXJ schema.
    • surt (string, required): sortable URL format
    • timestamp (string, datetime, required): ISO format, UTC timezone, with Z suffix required, with second (or finer) precision. Eg, "2016-09-19T17:20:24Z". Wayback timestamps (like "20160919172024") should be converted naively.
    • url (string, required): full URL
    • mimetype (string): content type of the resource
    • status_code (integer, signed): HTTP status code
    • sha1 (string, required): SHA-1 hash in lower-case hex
    • sha256 (string): SHA-256 hash in lower-case hex
  • archive_urls: An array of "typed" URLs where this snapshot can be found. Can be wayback/memento instances, or direct links to a WARC file containing all the capture resources. Often will only be a single archive. Order is not meaningful, and may not be preserved.
    • url (string, required): Eg: "https://example.edu/~frau/prcding.pdf".
    • rel (string, required): Eg: "wayback" or "warc"
  • original_url (string): base URL of the resource. May reference a specific CDX entry, or maybe in normalized form.
  • timestamp (string, datetime): same format as CDX line timestamp (UTC, etc). Corresponds to the overall capture timestamp. Can be the earliest of CDX timestamps if that makes sense
  • release_ids (array of string identifiers): references to release entities
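
The "naive" conversion of Wayback timestamps mentioned for the cdx timestamp field could look like the sketch below: the 14-digit string is read as UTC and re-serialized in ISO format with a trailing Z. The function name `wayback_to_iso` is a hypothetical helper.

```python
from datetime import datetime, timezone


def wayback_to_iso(wayback_ts: str) -> str:
    """Convert a 14-digit Wayback timestamp to ISO format, UTC, with Z suffix."""
    # Wayback timestamps are YYYYMMDDhhmmss; treat them as UTC without adjustment.
    dt = datetime.strptime(wayback_ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")
```

With the values from the field description, `wayback_to_iso("20160919172024")` gives "2016-09-19T17:20:24Z".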

Release Entity Reference


  • title (string, required): the display title of the release. May include subtitle.
  • subtitle (string): intended to be used primarily with books, not journal articles. Subtitle may also be appended to the title instead of populating this field.
  • original_title (string): the full original language title, if title is translated
  • work_id (fatcat identifier; required): the (single) work that this release is grouped under. If not specified in a creation (POST) action, the API will auto-generate a work.
  • container_id (fatcat identifier): a (single) container that this release is part of. When expanded the container field contains the full container entity.
  • release_type (string, controlled set): represents the medium or form-factor of this release; eg, "book" versus "journal article". Not necessarily the same across all releases of a work. See definitions below.
  • release_state (string, controlled set): represents the publishing/review lifecycle status of this particular release of the work. See definitions below.
  • release_date (string, ISO date format): when this release was first made publicly available. Blank if only year is known.
  • release_year (integer): year when this release was first made publicly available; should match release_date if both are known.
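  • withdrawn_status (string, controlled set): if set, indicates that this release has been withdrawn, retracted, or is otherwise no longer published. See definitions below.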
  • ext_ids (key/value object of string-to-string mappings): external identifiers. At least an empty ext_ids object is always required for release entities, so individual identifiers can be accessed directly.
  • volume (string): optionally, stores the specific volume of a serial publication this release was published in.
  • issue (string): optionally, stores the specific issue of a serial publication this release was published in.
  • pages (string): the pages (within a volume/issue of a publication) that this release can be looked up under. This is a free-form string, and could represent the first page, a range of pages, or even prefix pages (like "xii-xxx").
  • version (string): optionally, distinguishes this release version from others. Generally a number, software-style version, or other short/slug string, not a freeform description. Book "edition" descriptions can also go in an edition extra field. Often used in conjunction with external identifiers. If you're not certain, don't use this field!
  • number (string): an inherent identifier for this release (or work), often part of the title. For example, standards numbers, technical memo numbers, book series number, etc. Not a book chapter number however (which can be stored in extra). Depending on field or series-specific norms, the number may be stored here, in the title, or in both fields.
  • publisher (string): name of the publishing entity. This does not need to be populated if the associated container entity has the publisher field set, though it is acceptable to duplicate, as the publishing entity of a container may differ over time. Should be set for singleton releases, like books.
  • language (string, slug): the primary language used in this particular release of the work. Only a single language can be specified; additional languages can be stored in "extra" metadata (TODO: which field?). This field should be a valid RFC1766/ISO639 language code (two letters). AKA, a controlled vocabulary, not a free-form name of the language.
  • license_slug (string, slug): the license of this release. Usually a creative commons short code (eg, CC-BY), though a small number of other short names for publisher-specific licenses are included (TODO: list these).
  • contribs (array of objects): an array of authorship and other creator contributions to this release. Contribution fields include:
    • index (integer, optional): the (zero-indexed) order of this author. Authorship order has significance in many fields. Non-author contributions (illustration, translation, editorship) may or may not be ordered, depending on context, but index numbers should be unique per release (aka, there should not be "first author" and "first translator")
    • creator_id (identifier): if known, a reference to a specific creator
    • raw_name (string): the name of the contributor, as attributed in the text of this work. If the creator_id is linked, this may be different from the display_name; if a creator is not linked, this field is particularly important. Syntax and name order is not specified, but most often will be "display order", not index/alphabetical (in Western tradition, surname followed by given name).
    • role (string, of a set): the type of contribution, from a controlled vocabulary. TODO: vocabulary needs review.
    • extra (string): additional context can go here. For example, author affiliation, "this is the corresponding author", etc.
  • refs (array of objects): references (aka, citations) to other releases. References can only be linked to a specific target release (not a work), though it may be ambiguous which release of a work is being referenced if the citation is not specific enough. Reference fields include:
    • index (integer, optional): reference lists and bibliographies almost always have an implicit order. Zero-indexed. Note that this is distinct from the key field.
    • target_release_id (fatcat identifier): if known, and the release exists, a cross-reference to the Fatcat entity
    • extra (JSON, optional): additional citation format metadata can be stored here, particularly if the citation schema does not align. Common fields might be "volume", "authors", "issue", "publisher", "url", and external identifiers ("doi", "isbn13").
    • key (string): works often reference works with a short slug or index number, which can be captured here. For example, "[BROWN2017]". Keys generally supersede the index field, though both can/should be supplied.
    • year (integer): year of publication of the cited release.
    • container_title (string): if applicable, the name of the container of the release being cited, as written in the citation (usually an abbreviation).
    • title (string): the title of the work/release being cited, as written.
    • locator (string): a more specific reference into the work/release being cited, for example the page number(s). For web reference, store the URL in "extra", not here.
  • abstracts (array of objects): see below
    • sha1 (string, hex, required): reference to the abstract content (string). Example: "3f242a192acc258bdfdb151943419437f440c313"
    • content (string): The abstract raw content itself. Example: <jats:p>Some abstract thing goes here</jats:p>
    • mimetype (string): not formally required, but should effectively always get set. text/plain if the abstract doesn't have a structured format
    • lang (string, controlled set): the human language this abstract is in. See the language field of release for format and vocabulary.

External Identifiers (ext_ids)

The ext_ids object name-spaces external identifiers and makes it easier to add new identifiers to the schema in the future.

Many identifier fields must match an internal regex (string syntax constraint) to ensure they are properly formatted, though these checks aren't always complete or correct in more obscure cases.

  • doi (string): full DOI number, lower-case. Example: "10.1234/abcde.789". See the "External Identifiers" section of style guide for more notes about DOIs specifically.
  • wikidata_qid (string): external identifier for Wikidata entities. These are integers prefixed with "Q", like "Q4321". Each release entity can be associated with at most one Wikidata entity (this field is not an array), and Wikidata entities should be associated with at most a single release. In the future it may be possible to associate Wikidata entities with work entities instead.
  • isbn13 (string): external identifier for books. ISBN-10 and other formats should be converted to canonical ISBN-13.
  • pmid (string): external identifier for PubMed database. These are bare integers, but stored in a string format.
  • pmcid (string): external identifier for PubMed Central database. These are integers prefixed with "PMC" (upper case), like "PMC4321". Versioned PMCIDs can also be stored (eg, "PMC4321.1"); future clarification of whether versions should always be stored will be needed.
  • core (string): external identifier for the CORE open access aggregator. These identifiers are integers, but stored in string format.
  • arxiv (string): external identifier to a (version-specific) arxiv.org work. For releases, must always include the vN suffix (eg, v3).
  • jstor (string): external identifier for works in JSTOR.
  • ark (string): ARK identifier
  • mag (string): Microsoft Academic Graph identifier
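
As an illustration of normalizing an identifier before storing it in ext_ids, here is a small sketch that converts a ten-digit ISBN to ISBN-13 using the standard "978" prefix and recomputed check digit. The function name `isbn10_to_isbn13` is hypothetical; the catalog's own validation rules (the internal regexes mentioned above) are not reproduced here.

```python
def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert a ten-digit ISBN (hyphens/spaces allowed) to a bare ISBN-13 string."""
    chars = isbn10.replace("-", "").replace(" ", "").upper()
    if len(chars) != 10 or not chars[:9].isdigit() or not (chars[9].isdigit() or chars[9] == "X"):
        raise ValueError(f"not a ten-digit ISBN: {isbn10!r}")
    # Drop the old check digit and prepend the "978" book prefix.
    core = "978" + chars[:9]
    # ISBN-13 check digit: weights alternate 1, 3, 1, 3, ... over the 12 digits.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)
```

Note that this sketch does not verify the original check digit; it only re-derives the new one.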

extra Fields

  • crossref (object), for extra crossref-specific metadata
    • subject (array of strings) for subject/category of content
    • type (string) raw/original Crossref type
    • alternative-id (array of strings)
    • archive (array of strings), indicating preservation services deposited
    • funder (object/dictionary)
  • aliases (array of strings) for additional titles this release might be known by
  • container_name (string) if not matched to a container entity
  • group-title (string) for releases within a collection/group
  • translation_of (release identifier) if this release is a translation of another (usually under the same work)
  • superceded (boolean) if there is another release under the same work that should be referenced/indicated instead. Intended as a temporary hint until proper work-based search is implemented. As an example use, all arxiv release versions except for the most recent get this set.

release_type Vocabulary

This vocabulary is based on the CSL types, with a small number of (proposed) extensions:

  • article-magazine
  • article-journal, including pre-prints and working papers
  • book
  • chapter is allowed as they are frequently referenced and read independent of the entire book. The data model does not currently support linking a subset of a release to an entity representing the entire release. The release/work/file distinctions should not be used to group multiple chapters under a single work; a book chapter can be its own work. A paper which is republished as a chapter (eg, in a collection, or "edited" book) can have both releases under one work. The criterion for whether to "split" a book and have release entities for each chapter is whether the chapter has been cited/referenced as such.
  • dataset
  • entry, which can be used for generic web resources like question/answer site entries.
  • entry-encyclopedia
  • manuscript
  • paper-conference
  • patent
  • post-weblog for blog entries
  • report
  • review, for things like book reviews, not the "literature review" form of article-journal, nor peer reviews (see peer_review)
  • speech can be used for eg, slides and recorded conference presentations themselves, as distinct from paper-conference
  • thesis
  • webpage
  • peer_review (fatcat extension)
  • software (fatcat extension)
  • standard (fatcat extension), for technical standards like RFCs
  • abstract (fatcat extension), for releases that are only an abstract of a larger work. In particular, translations. Many are granted DOIs.
  • editorial (fatcat extension) for columns, "in this issue", and other content published along peer-reviewed content in journals. Many are granted DOIs.
  • letter for "letters to the editor", "authors respond", and sub-article-length published content. Many are granted DOIs.
  • stub (fatcat extension) for releases which have notable external identifiers, and thus are included "for completeness", but don't seem to represent a "full work".

An example of a stub might be a paper that gets an extra DOI by accident; the primary DOI should be a full release, and the accidental DOI can be a stub release under the same work. stub releases shouldn't be considered full releases when counting or aggregating (though if technically difficult this may not always be implemented). Other things that can be categorized as stubs (which seem to often end up mis-categorized as full articles in bibliographic databases):

  • commercial advertisements
  • "trap" or "honey pot" works, which are fakes included in databases to detect re-publishing without attribution
  • "This page is intentionally blank"
  • "About the author", "About the editors", "About the cover"
  • "Acknowledgments"
  • "Notices"

All other CSL types are also allowed, though they are mostly out of scope:

  • article (generic; should usually be some other type)
  • article-newspaper
  • bill
  • broadcast
  • entry-dictionary
  • figure
  • graphic
  • interview
  • legislation
  • legal_case
  • map
  • motion_picture
  • musical_score
  • pamphlet
  • personal_communication
  • post
  • review-book
  • song
  • treaty

For the purpose of statistics, the following release types are considered "papers":

  • article
  • article-journal
  • chapter
  • paper-conference
  • thesis

release_state Vocabulary

These roughly follow the DRIVER publication version guidelines, with the addition of a retraction status.

  • draft is an early version of a work which is not considered for peer review. Sometimes these are posted to websites or repositories for early comments and feedback.
  • submitted is the version that was submitted for publication. Also known as "pre-print", "pre-review", "under review". Note that this doesn't imply that the work was ever actually submitted, reviewed, or accepted for publication, just that this is the version that "would be". Most versions in pre-print repositories are likely to have this status.
  • accepted is a version that has undergone peer review and been accepted for publication, but has not gone through any publisher copy editing or re-formatting. Also known as "post-print", "author's manuscript", "publisher's proof".
  • published is the version that the publisher distributes. May include minor (grammatical, typographical, broken link, aesthetic) corrections. Also known as "version of record", "final publication version", "archival copy".
  • updated: post-publication significant updates (considered a separate release in Fatcat). Also known as "correction" (in the context of either a published "correction notice", or the full new version)
  • retraction for post-publication retraction notices (should be a release under the same work as the published release)

Note that in the case of a retraction, the original publication does not get the retraction state; only the retraction notice does. The original publication does get a withdrawn_status metadata field set.

When blank, indicates status isn't known, and wasn't inferred at creation time. Can often be interpreted as published, but be careful!

withdrawn_status Vocabulary

We don't know of an existing controlled vocabulary for things like retractions or other reasons for marking papers as removed from publication, so we invented our own. These labels should be considered experimental and subject to change.

Note that some of these will apply more to pre-print servers or publishing accidents, and don't necessarily make sense as a formal change of status for a print journal publication.

Any value at all indicates that the release should be considered "no longer published by the publisher or primary host", which could mean different things in different contexts. As some concrete examples, works are often accidentally assigned a duplicate DOI; physics papers have been taken down in response to government order under national security justifications; papers have been withdrawn for public health reasons (above and beyond any academic-style retraction); entire journals may be found to be predatory and pulled from circulation; individual papers may be retracted by authors if a serious mistake or error is found; an author's entire publication history may be retracted in cases of serious academic misconduct or fraud.

  • withdrawn is generic: the work is no longer available from the original publisher. There may be no reason, or the reason may not be known yet.
  • retracted for when a work is formally retracted, usually accompanied by a retraction notice (a separate release under the same work). Note that the retraction itself should not have a withdrawn_status.
  • concern for when publishers release an "expression of concern", often indicating that the work is not reliable in some way, but not yet formally retracted. In this case the original work is probably still available, but should be marked as suspect. This is not the same as presence of errata.
  • safety for works pulled for public health or human safety concerns.
  • national-security for works pulled over national security concerns.
  • spam for content that is considered spam (eg, bogus pre-print or repository submissions). Not to be confused with advertisements or product reviews in journals.

contribs.role Vocabulary

  • author
  • translator
  • illustrator
  • editor

All other CSL role types are also allowed, though are mostly out of scope for Fatcat:

  • collection-editor
  • composer
  • container-author
  • director
  • editorial-director
  • editortranslator
  • interviewer
  • original-author
  • recipient
  • reviewed-author

If blank, indicates that type of contribution is not known; this can often be interpreted as authorship.

Work Entity Reference

Works have no fields! They just group releases.


The Fatcat HTTP API is mostly a classic REST "CRUD" (Create, Read, Update, Delete) API, with a few twists.

A declarative specification of all API endpoints, JSON data models, and response types is available in OpenAPI 2.0 format. Code generation tools are used to generate both server-side type-safe endpoint routes and client-side libraries. Auto-generated reference documentation is, for now, available at https://api.qa.fatcat.wiki.

All API traffic is over HTTPS; there is no HTTP endpoint, even for read-only operations. All endpoints accept and return only JSON serialized content.

Entity Endpoints/Actions

Actions could, in theory, be directed at any of:

entities (ident)

Top-level entity actions (resulting in edits):

  • create (new rev)
  • update (new rev)
  • split (remove redirect)

On existing entity edits (within a group):


An edit group as a whole can be created, submitted for review, and accepted (merged).

Other per-entity endpoints:

  • lookup (by external persistent identifier)
  • match (by field/context; unimplemented)


All mutating entity operations (create, update, delete) accept a required editgroup_id query parameter. Editgroups (with contextual metadata) should be created before starting edits.

Related edits (to multiple entities) should be collected under a single editgroup, up to a reasonable size. More than 50 edits per entity type, or more than 100 edits total in an editgroup become unwieldy.

After creating and modifying the editgroup, it may be "submitted", which flags it for review by bot and human editors. The editgroup may be "accepted" (merged), or if changes are necessary the edits can be updated and re-submitted.

Sub-Entity Expansion

To reduce the need for multiple GET queries when looking for common related metadata, it is possible to include linked entities in responses using the expand query parameter. For example, by default the release model only includes an optional container_id field which points to a container entity. If the expand parameter is set to container, then the full container model will be included under the container field. Multiple expand parameters can be passed, comma-separated.
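
A request URL using expand might be built as in the sketch below. The "/v0" path prefix, the "/{entity_type}/{ident}" path layout, and the identifier placeholder are assumptions for illustration; consult the OpenAPI specification for the actual routes.

```python
from urllib.parse import urlencode

# Host taken from this guide; the "/v0" prefix is an assumption.
API_BASE = "https://api.qa.fatcat.wiki/v0"


def entity_url(entity_type: str, ident: str, expand=()) -> str:
    """Build a GET URL for one entity, optionally expanding linked sub-entities."""
    url = f"{API_BASE}/{entity_type}/{ident}"
    if expand:
        # Multiple expand values are passed as one comma-separated parameter.
        url += "?" + urlencode({"expand": ",".join(expand)}, safe=",")
    return url
```

For instance, `entity_url("release", "RELEASE_IDENT", ["container"])` ends in "?expand=container", where RELEASE_IDENT stands in for a real release identifier.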

Authentication and Authorization

There are two editor types: bots and humans. Additionally, either type of editor may have additional privileges which allow them to, eg, directly accept editgroups (as opposed to submitting edits for review).

All mutating API calls (POST, PUT, DELETE HTTP verbs) require token-based authentication using an HTTP Bearer token. New tokens can be generated in the web interface.
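
As a rough sketch, a mutating request carrying a Bearer token can be prepared with the Python standard library as follows. The request is only constructed, not sent; the helper name `authed_request`, the example URL path, and the token value used with it are placeholders, not real endpoints or credentials.

```python
import json
import urllib.request
from typing import Optional


def authed_request(method: str, url: str, token: str, body: Optional[dict] = None) -> urllib.request.Request:
    """Prepare (but do not send) a JSON request with an HTTP Bearer token."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(url, data=data, method=method)
    # Token-based authentication: "Authorization: Bearer <token>".
    req.add_header("Authorization", f"Bearer {token}")
    if data is not None:
        # All endpoints accept and return only JSON.
        req.add_header("Content-Type", "application/json")
    return req
```

The prepared request could then be passed to `urllib.request.urlopen()`; the token itself should be generated in the web interface and kept out of source control.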

Autoaccept Flag

The autoaccept flag is currently only available on batch creation (POST) of entities.

For all bulk operations, the optional 'editgroup' query parameter overrides the individual editgroup parameters of the entities in the batch.

If the autoaccept flag is set and editgroup is not, a new editgroup is automatically created and used for all entities inserted, overriding any individual editgroup parameters. Note that this is different behavior from the "use current or create new" default behavior for regular creation.

Unfortunately, "true" and "false" are the only values accepted for boolean rust/openapi2 query parameters.

QA Instance

The intent is to run a public "sandbox" QA instance of the catalog, using a subset of the full catalog, running the most recent development branch of the API specification. This instance can be used by developers for prototyping and experimentation, though note that all data is periodically wiped, and this endpoint is more likely to have bugs or be offline.

Bulk Exports

There are several types of bulk exports and database dumps folks might be interested in:

  • complete database dumps
  • changelog history with all entity revisions and edit metadata
  • identifier snapshot tables
  • entity exports

All exports and dumps get uploaded to the Internet Archive under the bibliographic metadata collection.

Complete Database Dumps

The most simple and complete bulk export. Useful for disaster recovery, mirroring, or forking the entire service. The internal database schema is not stable, so not as useful for longitudinal analysis. These dumps will include edits-in-progress, deleted entities, old revisions, etc, which are potentially difficult or impossible to fetch through the API.

Public copies may have some tables redacted (eg, API credentials).

Dumps are in PostgreSQL pg_dump "tar" binary format, and can be restored locally with the pg_restore command. See ./extra/sql_dumps/ for commands and details. Dumps are on the order of 100 GBytes (compressed) and will grow over time.

Changelog History

These are currently unimplemented; would involve "hydrating" sub-entities into changelog exports. Useful for some mirrors, and analysis that needs to track provenance information. Format would be the public API schema (JSON).

All information in these dumps should be possible to fetch via the public API, including on a feed/streaming basis using the sequential changelog index. All information is also contained in the database dumps.

Identifier Snapshots

Many of the other dump formats are very large. To save time and bandwidth, a few simple snapshot tables can be exported directly in TSV format. Because these tables can be dumped in single SQL transactions, they are consistent point-in-time snapshots.

One format is per-entity identifier/revision tables. These contain active, deleted, and redirected identifiers, with revision and redirect references, and are used to generate the entity dumps below.

Other tables contain external identifier mappings or file hashes.

Release abstracts can be dumped in their own table (JSON format), allowing them to be included only by reference from other dumps. The copyright status and usage restrictions on abstracts are different from other catalog content; see the policy page for more context. Abstracts are immutable and referenced by hash in the database, so the consistency of these dumps is not as much of a concern as with other exports.

Unlike all other dumps and public formats, the Fatcat identifiers in these dumps are in raw UUID format (not base32-encoded), though this may be fixed in the future.
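
Converting between the two identifier forms can be sketched as below. This assumes that public identifiers are the 16 UUID bytes encoded as RFC 4648 base32, lower-cased, with the padding removed (giving 26 characters); that encoding detail and the helper names are assumptions not spelled out in this section.

```python
import base64
import uuid


def uuid_to_fcid(raw_uuid: str) -> str:
    """Encode a raw UUID string as a 26-character lower-case base32 identifier."""
    raw_bytes = uuid.UUID(raw_uuid).bytes
    return base64.b32encode(raw_bytes).decode("ascii").rstrip("=").lower()


def fcid_to_uuid(fcid: str) -> str:
    """Decode a 26-character base32 identifier back to a raw UUID string."""
    # 16 bytes encode to 26 base32 characters plus 6 padding characters.
    padded = fcid.upper() + "======"
    return str(uuid.UUID(bytes=base64.b32decode(padded)))
```

Under these assumptions the all-zero UUID maps to a string of 26 "a" characters, and the two functions round-trip.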

See ./extra/sql_dumps/ for scripts and details. Dumps are on the order of a couple GBytes each (compressed).

Entity Exports

Using the above identifier snapshots, the Rust fatcat-export program outputs single-entity-per-line JSON files with the same schema as the HTTP API. These might contain the default fields, or be in "expanded" format containing sub-entities for each record.

Only "active" entities are included (not deleted, work-in-progress, or redirected entities).

These dumps can be quite large when expanded (over 100 GBytes compressed), but do not include history so will not grow as fast as other exports over time. Not all entity types are dumped at the moment; if you would like specific dumps get in touch!
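
Because each line of an entity export is a standalone JSON object, a dump can be streamed without loading it fully into memory. The sketch below iterates over a (possibly gzip-compressed) export and counts releases whose release_type is in the "papers" set from the release_type vocabulary; the function names and any file name passed in are hypothetical.

```python
import gzip
import json
from typing import Iterator

# Release types counted as "papers" for statistics, per the release_type vocabulary.
PAPER_TYPES = {"article", "article-journal", "chapter", "paper-conference", "thesis"}


def iter_entities(path: str) -> Iterator[dict]:
    """Yield one entity dict per non-empty line of a JSON-lines export."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_papers(path: str) -> int:
    """Count release entities whose release_type is considered a "paper"."""
    return sum(1 for entity in iter_entities(path) if entity.get("release_type") in PAPER_TYPES)
```

Releases with a blank release_type are not counted by this sketch.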


Updating an Existing Entity

  1. Fetch (GET) the existing entity
  2. Create (POST) a new editgroup
  3. Update (PUT) the entity, with the current revision number in the prev edit field, and the editgroup_id set
  4. Submit (PUT) the editgroup for review
  5. Somebody (human or bot) with admin privileges will Accept (POST) the editgroup.

Merging Duplicate Entities

  1. Fetch (GET) both entities
  2. Decide which will be the "primary" entity (the other will redirect to it)
  3. Create (POST) a new editgroup
  4. Update (PUT) the "primary" entity with any updated metadata merged from the other entity (optional), and the editgroup id set
  5. Update (PUT) the "other" entity with the redirect flag set to the primary's identifier, with the current revision id (of the "other" entity) in the prev field, and the editgroup id set
  6. Submit (PUT) the editgroup for review
  7. Somebody (human or bot) with admin privileges will Accept (POST) the editgroup.
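
Steps 4 and 5 above can be sketched as two update bodies. The entity keys "ident" and "revision", and the placement of prev, redirect, and editgroup_id as plain keys in the body, are assumptions made for illustration; the OpenAPI specification defines where these values actually go in a request.

```python
from typing import Optional, Tuple


def build_merge_updates(primary: dict, other: dict, editgroup_id: str,
                        merged_fields: Optional[dict] = None) -> Tuple[dict, dict]:
    """Build update (PUT) bodies for merging `other` into `primary`."""
    # Step 4: the primary keeps its metadata, plus anything merged from the other entity.
    primary_update = dict(primary)
    primary_update.update(merged_fields or {})
    primary_update["prev"] = primary["revision"]
    primary_update["editgroup_id"] = editgroup_id

    # Step 5: the other entity becomes a redirect to the primary's identifier,
    # with its own current revision in the prev field.
    other_update = {
        "redirect": primary["ident"],
        "prev": other["revision"],
        "editgroup_id": editgroup_id,
    }
    return primary_update, other_update
```

Both bodies share one editgroup, so the merge is reviewed and accepted as a single unit.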

Lookup Fulltext URLs by DOI

  1. Use release lookup endpoint (GET) with the doi query parameter in URL-escaped format, with expand=files. You may want to hide abstracts,references for faster responses if you aren't interested in those fields.
  2. If a release hit is found, iterate over the linked file entities, and create a ranked list of URLs based on mimetype, URL "rel" type, file size, or host domain.
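
Both steps can be sketched in Python as follows. The base URL, the `/release/lookup` path, the file entity field names (`mimetype`, `urls`, `url`, `rel`), and the preference order of "rel" values are all assumptions made for illustration; the ranking here uses only mimetype and "rel" type, and a real ranking could also weigh file size or host domain.

```python
from typing import Any, Dict, List
from urllib.parse import urlencode

API_BASE = "https://api.fatcat.wiki/v0"  # assumed base URL; adjust as needed

# Illustrative preference order for URL "rel" types (an assumption).
REL_ORDER = ["webarchive", "repository", "publisher", "web"]


def release_lookup_url(doi: str, base: str = API_BASE) -> str:
    """Build a lookup URL with the DOI URL-escaped and files expanded."""
    query = urlencode({
        "doi": doi,
        "expand": "files",
        "hide": "abstracts,references",
    })
    return f"{base}/release/lookup?{query}"


def rank_urls(release: Dict[str, Any]) -> List[str]:
    """Return file URLs, PDFs first, then by preferred "rel" type."""
    scored = []
    for file_entity in release.get("files") or []:
        is_pdf = file_entity.get("mimetype") == "application/pdf"
        for entry in file_entity.get("urls") or []:
            rel = entry.get("rel")
            rel_rank = REL_ORDER.index(rel) if rel in REL_ORDER else len(REL_ORDER)
            # False sorts before True, so PDFs come first.
            scored.append(((not is_pdf, rel_rank), entry["url"]))
    scored.sort(key=lambda pair: pair[0])  # stable: ties keep input order
    return [url for _, url in scored]
```

The lookup response would be fetched with any HTTP client and passed to `rank_urls` once decoded from JSON.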

Batch Insert New Entities (Bootstrapping)

When bootstrapping a blank catalog, we need to insert tens or hundreds of millions of entities as fast as possible.

  1. Batch create (POST) a set of entities, with editgroup metadata included along with the list of entities (all of a single type). The entire batch is inserted and the editgroup accepted (requiring admin bits) in a single transaction.
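
A bulk importer typically splits its input into fixed-size batches and wraps each one with editgroup metadata. The sketch below shows that chunking step only; the request body layout (`editgroup`, `entity_list`) and the function names are hypothetical and should be checked against the API specification.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def batches(entities: Iterable[Any], size: int) -> Iterator[List[Any]]:
    """Split any iterable into lists of at most `size` items."""
    iterator = iter(entities)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def build_batch_request(entity_list: List[Dict[str, Any]],
                        description: str) -> Dict[str, Any]:
    """Wrap one batch (all of a single entity type) with editgroup metadata."""
    return {
        "editgroup": {"description": description},  # hypothetical layout
        "entity_list": entity_list,
    }
```

Because `batches` consumes its input lazily, it can be fed directly from a streaming reader over a very large source file.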

Software Contributions

For now, issues and patches can be filed at https://github.com/internetarchive/fatcat.

The back-end (fatcatd, in Rust), web interface (fatcat-web, in Python), bots, and this guide are all versioned in the same git repository.

See the rust/README.md and rust/HACKING.md documents for some common tasks and gotchas when working with the rust backend.

When considering making a non-trivial contribution, it can save review time and duplicated work to post an issue with your intentions and plan. New code and features must include unit tests before being merged, though we can help with writing them.

Norms and Policies

These social norms are explicitly expected to evolve and mature if the number of contributors to the project grows. It is important to have some policies as a starting point, but also important not to set these policies in stone until they have been reviewed.

Social Norms and Conduct

Contributors (editors and software developers) are expected to treat each other excellently, to assume good intentions, and to participate constructively.

Metadata Licensing

The Fatcat catalog content license is the Creative Commons Zero ("CC-0") license, which is effectively a public domain grant. This applies to the catalog metadata itself (titles, entity relationships, citation metadata, URLs, hashes, identifiers), as well as "meta-meta-data" provided by editors (edit descriptions, provenance metadata, etc).

The core catalog is designed to contain only factual information: "this work, known by this title and with these third-party identifiers, is believed to be represented by these files and published under such-and-such venue". As a norm, sourcing metadata (for attribution and provenance) is retained for each edit made to the catalog.

A notable exception to this policy is abstracts, for which no copyright claim or license is made. Abstract content is kept separate from core catalog metadata; downstream users need to make their own decision regarding reuse and distribution of this material.

As a social norm, it is expected (and appreciated!) that downstream users of the public API and/or bulk exports provide attribution, and even transitive attribution (acknowledging the original source of metadata contributed to Fatcat). As an academic norm, researchers are encouraged to cite the corpus as a dataset (when this option becomes available). However, neither of these norms is enforced via the copyright mechanism.

As a strong norm, editors should expect full access to the full corpus and edit history, including all of their contributions.

Immutable History

All editors agree to the licensing terms, and understand that their full public history of contributions is made irrevocably public. Edits and contributions may be reverted, but the history (and content) of those edits is retained. Edit history is not removed from the corpus on the request of an editor or when an editor closes their account.

In an emergency situation, such as non-bibliographic content getting encoded in the corpus by bypassing normal filters (eg, base64 encoding hate crime content or exploitative photos, as has happened to some blockchain projects), the ecosystem may decide to collectively, in a coordinated manner, expunge specific records from their history.

Documentation Licensing

This guide ("Fatcat: The Guide") is licensed under the Creative Commons Attribution license.

Software Licensing

The Fatcat software project's licensing policy is to adopt strong copyleft licenses for server software (where the majority of software development takes place), permissive licenses for client library and bot framework software, and CC-0 (public domain grant) licensing for declarative interface specifications (such as SQL schemas and REST API specifications).

Privacy Policy

It is important to note that this section is currently aspirational: the servers hosting early deployments of Fatcat are largely in a default configuration and have not been audited to ensure that these guidelines are being followed.

It is a goal for Fatcat to conduct as little surveillance of reader and editor behavior and activities as possible. In practical terms, this means minimizing the overall amount of logging and collection of identifying information. This is in contrast to submitted edit content, which is captured, preserved, and republished as widely as possible.

The general intention is to:

  • not use third-party tracking (via extra browser-side requests or javascript)
  • collect aggregate metrics (overall hit numbers), but not log individual interactions ("this IP visited this page at this time")

Exceptions will likely be made:

  • temporary caching of IP addresses may be necessary to implement rate-limiting and debug traffic spikes
  • exception logging, abuse detection, and other exceptional situations

Some uncertain areas of privacy include:

  • should third-party authentication identities be linked to editor ids? what about the specific case of ORCID if used for login?
  • what about discussion and comments on edits? should conversations be included in full history dumps? should editors be allowed to update or remove comments?


Ito, Joichi. “Citing Blogs.” Joi Ito’s Web (2018). Accessed March 11, 2019. https://joi.ito.com/weblog/2018/05/28/citing-blogs.html.
Karaganis, Joe, ed. Shadow Libraries: Access to Knowledge in Global Higher Education. Cambridge, MA : Ottawa, ON: The MIT Press ; International Development Research Centre, 2018.
Khabsa, Madian, and C. Lee Giles. “The Number of Scholarly Documents on the Public Web.” PLOS ONE 9, no. 5 (May 9, 2014): e93949.
Knoth, Petr, and Zdenek Zdrahal. “CORE: Three Access Levels to Underpin Open Access.” D-Lib Magazine 18, no. 11/12 (November 2012). Accessed March 11, 2019. http://www.dlib.org/dlib/november12/knoth/11knoth.html.
Ortega, Jose Luis. Academic Search Engines: New Information Trends and Services for Scientists on the Web. Chandos information professional series. Philadelphia, PA: Elsevier, 2014.
Page, Roderic. “Notes on Bibliographic Metadata in JSON.” Last modified July 12, 2017. Accessed March 11, 2019. https://github.com/rdmpage/bibliographic-metadata-json.
Piwowar, Heather, Jason Priem, Vincent Larivière, Juan Pablo Alperin, Lisa Matthias, Bree Norlander, Ashley Farley, Jevin West, and Stefanie Haustein. “The State of OA: A Large-Scale Analysis of the Prevalence and Impact of Open Access Articles.” PeerJ 6 (February 13, 2018): e4375.
Ramalho, Luciano G. “From ISIS to CouchDB: Databases and Data Models for Bibliographic Records.” The Code4Lib Journal, no. 13 (April 11, 2011). Accessed March 11, 2019. https://journal.code4lib.org/articles/4893.
rclark1. “DOI-like Strings and Fake DOIs.” Website. Crossref. Accessed March 11, 2019. https://www.crossref.org/blog/doi-like-strings-and-fake-dois/.
Svenonius, Elaine. The Intellectual Foundation of Information Organization. First MIT Press paperback ed. Digital libraries and electronic publishing. Cambridge, Mass.: MIT Press, 2009.
Van de Sompel, Herbert, Robert Sanderson, Martin Klein, Michael L. Nelson, Bernhard Haslhofer, Simeon Warner, and Carl Lagoze. “A Perspective on Resource Synchronization.” D-Lib Magazine 18, no. 9/10 (September 2012). Accessed March 11, 2019. http://www.dlib.org/dlib/september12/vandesompel/09vandesompel.html.
Wright, Alex. Cataloging the World: Paul Otlet and the Birth of the Information Age. Oxford ; New York: Oxford University Press, 2014.
“Citation Style Language.” Citation Style Language. Accessed March 11, 2019. https://citationstyles.org/.
“Open Archives Initiative Protocol for Metadata Harvesting.” Accessed March 11, 2019. https://www.openarchives.org/pmh/.

About This Guide

This guide is generated from markdown text files using the mdBook tool. The source is mirrored on GitHub at https://github.com/internetarchive/fatcat.

Contributions and corrections are welcome! If you create a (free) account on GitHub you can submit comments and corrections as "Issues", or directly edit the source and submit "Pull Requests" with changes.

This guide is licensed under a Creative Commons Attribution (CC-BY) license, meaning you are free to redistribute, sell, and extend it without special permission, as long as you credit the original authors.