Welcome, Welcome, Welcome!

This guide you are reading contains:

  • a high-level introduction to the fatcat catalog and software
  • a bibliographic style guide for editors, also useful for understanding metadata found in the catalog
  • technical details and guidance for use of the catalog's public REST API, for developers building bots, services, or contributing to the server software
  • policies and licensing details for all contributors and downstream users of the catalog

What is Fatcat?

Fatcat is an open bibliographic catalog of written works. The scope of works is somewhat flexible, with a focus on published research outputs like journal articles, pre-prints, and conference proceedings. Records are collaboratively editable, versioned, available in bulk form, and include URL-agnostic file-level metadata.

Both the fatcat software and the metadata stored in the service are free (in both the libre and gratis sense) for others to share, reuse, fork, or extend. See Policies for licensing details, and Sources for attribution of the foundational metadata corpuses we build on top of.

Fatcat is currently used internally at the Internet Archive, but interested folks are welcome to contribute to its design and development, and we hope to ultimately crowd-source corrections and additions to bibliographic metadata, and receive direct automated feeds of new content.

You can contact the Archive by email at info@archive.org, or the author directly at bnewbold@archive.org.

High-Level Overview

This section gives an introduction to:

  • the goals of the project, and how it relates to the rest of the Open Access and archival ecosystem
  • how catalog data is represented as entities and revisions with full edit history, and how entities are referred to and cross-referenced with identifiers
  • how humans and bots propose changes to the catalog, and how these changes are reviewed
  • the major sources of bulk and continuously updated metadata that form the foundation of the catalog
  • a rough sketch of the software back-end, database, and libraries
  • roadmap for near-future work

Project Goals and Ecosystem Niche

The Internet Archive has two primary use cases for fatcat:

  • Track the "completeness" of our holdings against all known published works, allowing us to monitor progress, identify gaps, and prioritize further collection work.
  • Be a public-facing catalog and access mechanism for our open access holdings.

In the larger ecosystem, fatcat could also provide:

  • A work-level (as opposed to title-level) archival dashboard: what fraction of all published works are preserved in archives? KBART, CLOCKSS, Portico, and other preservation networks don't provide granular metadata
  • A collaborative, independent, non-commercial, fully-open, field-agnostic, "completeness"-oriented catalog of scholarly metadata
  • Unified (centralized) foundation for discovery and access across repositories and archives: discovery projects can focus on user experience instead of building their own catalog from scratch
  • Research corpus for meta-science, with an emphasis on availability and reproducibility (metadata corpus itself is open access, and file-level hashes control for content drift)
  • Foundational infrastructure for distributed digital preservation
  • On-ramp for non-traditional digital works ("grey literature") into the scholarly web

Scope

What types of works should be included in the catalog?

The goal is to capture the "scholarly web": the graph of written works that cite other works. Any work that is both cited more than once and cites more than one other work in the catalog is very likely to be in scope. "Leaf nodes" and small islands of intra-cited works may or may not be in scope.

Fatcat does not include any fulltext content itself, even for cleanly licensed (open access) works, but does have "strong" (verified) links to fulltext content, and includes file-level metadata (like hashes and fingerprints) to help discover and identify content from any source. File-level URLs with context ("repository", "author-homepage", "web-archive") should make fatcat more useful than existing redirect or landing page systems for both humans and machines to quickly access fulltext content of a given mimetype. So another factor in deciding scope is whether a work has "digital fixity" and can be contained in a single immutable file.

References and Previous Work

The closest overall analog of fatcat is MusicBrainz, a collaboratively edited music database. Open Library is a very similar existing service, which exclusively contains book metadata.

Wikidata seems to be the most successful and actively edited/developed open bibliographic database at this time (early 2018), including the wikicite conference and related Wikimedia/Wikipedia projects. Wikidata is a general purpose semantic database of entities, facts, and relationships; bibliographic metadata has become a large fraction of all content in recent years. The focus there seems to be linking knowledge (statements) to specific sources unambiguously. Potential advantages fatcat has are a focus on a specific scope (not a general-purpose database of entities) and a goal of completeness (capturing as many works and relationships as rapidly as possible). With so much overlap, the two efforts might merge in the future.

The technical design of fatcat is loosely inspired by the git branch/tag/commit/tree architecture, and specifically inspired by Oliver Charles' "New Edit System" blog posts from 2012.

There are a whole bunch of proprietary, for-profit bibliographic databases, including Web of Science, Google Scholar, Microsoft Academic Graph, aminer, Scopus, and Dimensions. There are excellent field-limited databases like dblp, MEDLINE, and Semantic Scholar. There are some large general-purpose databases that are not directly user-editable, including the OpenCitation corpus, CORE, BASE, and CrossRef. We do not know of any large (more than 60 million works), open (bulk-downloadable with permissive or no license), field agnostic, user-editable corpus of scholarly publication bibliographic metadata.

Further Reading

"From ISIS to CouchDB: Databases and Data Models for Bibliographic Records" by Luciano G. Ramalho. code4lib, 2013. https://journal.code4lib.org/articles/4893

"Representing bibliographic data in JSON". github README file, 2017. https://github.com/rdmpage/bibliographic-metadata-json

"Citation Style Language", https://citationstyles.org/

"Functional Requirements for Bibliographic Records", Wikipedia article, https://en.wikipedia.org/wiki/Functional_Requirements_for_Bibliographic_Records

OpenCitations and I40C http://opencitations.net/, https://i4oc.org/

Data Model

Entity Types and Ontology

Loosely following "Functional Requirements for Bibliographic Records" (FRBR), but removing the "manifestation" abstraction, and favoring files (digital artifacts) over physical items, the primary bibliographic entity types are:

  • work: representing an abstract unit of creative output. Does not contain any metadata itself; used only to group release entities. For example, a journal article could be posted as a pre-print, published on a journal website, translated into multiple languages, and then re-published (with minimal changes) as a book chapter; these would all be variants of the same work.
  • release: a specific "release" or "publicly published" (in a formal or informal sense) version of a work. Contains traditional bibliographic metadata (title, date of publication, media type, language, etc). Has relationships to other entities:
    • "variant of" a single work
    • "contributed to by" multiple creators
    • "references to" (cites) multiple releases
    • "published as part of" a single container
  • file: a single concrete, fixed digital artifact; a manifestation of one or more releases. Machine-verifiable metadata includes file hashes, size, and detected file format. Verified URLs link to locations on the open web where this file can be found or has been archived. Has relationships:
    • "manifestation of" multiple releases (though usually a single release)
  • creator: persona (pseudonym, group, or specific human name) that contributions to releases have been attributed to. Not necessarily one-to-one with a human person.
  • container (aka "venue", "serial", "title"): a grouping of releases from a single publisher.

Note that, compared to many similar bibliographic ontologies, the current one does not have entities to represent:

  • funding sources
  • publishing entities
  • "events at a time and place"
  • physical artifacts, either generically or specific copies
  • sets of files (eg, a dataset or webpage with media)

Each entity type has its own relations and fields (captured in a schema), but there are also generic operations and fields common across all entities. The process of creating, updating, querying, and inspecting entities is roughly the same regardless of type.

Identifiers and Revisions

A specific version of any entity in the catalog is called a "revision". Revisions are generally immutable (do not change and are not editable), and are not usually referred to directly by users. Instead, persistent identifiers can be created, which "point to" a specific revision at a time. This distinction means that entities referred to by an identifier can change over time (as metadata is corrected and expanded). Revision objects do not "point" back to specific identifiers, so they are not the same as a simple "version number" for an identifier.

Identifiers also have the ability to be merged (by redirecting one identifier to another) and "deleted" (by pointing the identifier to no revision at all). All changes to identifiers are captured as an "edit" object. Edit history can be fetched and inspected on a per-identifier basis, and any changes can easily be reverted (even merges/redirects and "deletion").

"Staged" or "proposed" changes are captured as edit objects without updating the identifiers themselves.

Fatcat Identifiers

Fatcat identifiers are semantically meaningless fixed-length random numbers, usually represented in case-insensitive base32 format. Each entity type has its own identifier namespace.

128-bit (UUID size) identifiers encode as 26 characters (but note that not all such strings decode to valid UUIDs), and in the backend can be serialized in UUID columns:

work_rzga5b9cd7efgh04iljk8f3jvz
https://fatcat.wiki/work/rzga5b9cd7efgh04iljk8f3jvz

In comparison, 96-bit identifiers would have 20 characters and look like:

work_rzga5b9cd7efgh04iljk
https://fatcat.wiki/work/rzga5b9cd7efgh04iljk

A 64-bit namespace would probably be large enough, and would work with database Integer columns:

work_rzga5b9cd7efg
https://fatcat.wiki/work/rzga5b9cd7efg

Fatcat identifiers can be used to interlink between databases, but are explicitly not intended to supplant DOIs, ISBNs, handles, ARKs, and other "registered" persistent identifiers.
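As a rough sketch (in python), a 128-bit UUID can be converted to and from this 26-character base32 form; the exact padding and case handling here are assumptions to confirm against the actual implementation:

    import base64
    import uuid

    def uuid_to_fcid(u: uuid.UUID) -> str:
        # 16 bytes of UUID -> 26 base32 characters (strip '=' padding, lower-case)
        return base64.b32encode(u.bytes).decode("ascii").rstrip("=").lower()

    def fcid_to_uuid(fcid: str) -> uuid.UUID:
        # restore the padding base32 needs for a 16-byte payload, then decode
        return uuid.UUID(bytes=base64.b32decode(fcid.upper() + "======"))

    ident = uuid_to_fcid(uuid.uuid4())
    print("work_" + ident)
    print("https://fatcat.wiki/work/" + ident)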

Entity States

Internal Schema

Internally, identifiers are lightweight pointers to "revisions" of an entity. Revisions are stored in their complete form, not as a patch or difference; if comparing to distributed version control systems (for managing changes to source code), this follows the git model, not the mercurial model.

The entity revisions are immutable once accepted; the editing process involves the creation of new entity revisions and, if the edit is approved, pointing the identifier to the new revision. Entities cross-reference between themselves by identifier not revision number. Identifier pointers also support (versioned) deletion and redirects (for merging entities).

Edit objects represent a change to a single entity; edits get batched together into edit groups (like "commits" and "pull requests" in git parlance).

SQL tables look something like this (with separate tables for entity type a la work_revision and work_edit):

entity_ident
    id (uuid)
    current_revision (entity_revision foreign key)
    redirect_id (optional; points to another entity_ident)
    is_live (boolean; whether newly created entity has been accepted)

entity_revision
    revision_id
    <all entity-style-specific fields>
    extra: json blob for schema evolution

entity_edit
    timestamp
    editgroup_id (editgroup foreign key)
    ident (entity_ident foreign key)
    new_revision (entity_revision foreign key)
    new_redirect (optional; points to entity_ident table)
    previous_revision (optional; points to entity_revision)
    extra: json blob for progeny metadata

editgroup
    editor_id (editor table foreign key)
    description
    extra: json blob for progeny metadata

An individual entity can be in the following "states", from which the given actions (transition) can be made:

  • wip (not live; not redirect; has rev)
    • activate (to active)
  • active (live; not redirect; has rev)
    • redirect (to redirect)
    • delete (to deleted)
  • redirect (live; redirect; rev or not)
    • split (to active)
    • delete (to deleted)
  • deleted (live; not redirect; no rev)
    • redirect (to redirect)
    • activate (to active)

"WIP, redirect" or "WIP, deleted" are invalid states.

Additional entity-specific columns hold actual metadata. Additional tables (which reference both entity_revision and entity_id foreign keys as appropriate) represent things like authorship relationships (creator/release), citations between works, etc. Every revision of an entity requires duplicating all of these associated rows, which could end up being a large source of inefficiency, but is necessary to represent the full history of an object.
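A rough python sketch of how the entity states listed above could be derived from the identifier columns (is_live, redirect_id, and the current revision pointer); this is illustrative only, not the actual backend logic:

    def entity_state(is_live: bool, redirect_id, revision_id) -> str:
        # mirrors the state list above; "WIP, redirect" and "WIP, deleted" are invalid
        if not is_live:
            assert redirect_id is None and revision_id is not None
            return "wip"
        if redirect_id is not None:
            return "redirect"  # a redirect may or may not retain a revision
        return "active" if revision_id is not None else "deleted"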

Controlled Vocabularies

Some individual fields have additional constraints, either in the form of pattern validation ("values must be upper case, contain only certain characters"), or membership in a fixed set of values. These may include:

  • subject categorization
  • license and open access status
  • work "types" (article vs. book chapter vs. proceeding, etc)
  • contributor types (author, translator, illustrator, etc)
  • human languages
  • identifier namespaces (DOI, ISBN, ISSN, ORCID, etc; but not the identifiers themselves)

Other fixed-set "vocabularies" become too large to easily maintain or express in code. These could be added to the backend databases, or be enforced by bots (instead of the core system itself). These mostly include externally-registered identifiers or types, such as:

  • file mimetypes
  • identifiers themselves (DOI, ORCID, etc), by checking for registration against canonical APIs and databases

Global Edit Changelog

As part of the process of "accepting" an edit group, a row is written to an immutable, append-only table (which internally is a SQL table) documenting each identifier change. This changelog establishes a monotonically increasing version number for the entire corpus, and should make interaction with other systems easier (eg, search engines, replicated databases, alternative storage backends, notification frameworks, etc.).

Workflow

Basic Editing Workflow and Bots

Both human editors and bots should have edits go through the same API, with humans using either the default web interface, integrations, or client software.

The normal workflow is to create edits (or updates, merges, deletions) on individual entities. Individual changes are bundled into an "edit group" of related edits (eg, correcting authorship info for multiple works related to a single author). When ready, the editor "submits" the edit group for review. During the review period, human editors vote and bots can perform automated checks. During this period the editor can make tweaks if necessary. After some fixed time period (72 hours?) with no changes and no blocking issues, the edit group would be auto-accepted if no merge conflicts have been created by other edits to the same entities. This process balances editing labor (reviews are easy, but optional) against quality (cool-down period makes it easier to detect and prevent spam or out-of-control bots). More sophisticated roles and permissions could allow certain humans and bots to push through edits more rapidly (eg, importing new works from a publisher API).

Bots need to be tuned to have appropriate edit group sizes (eg, daily batches, instead of millions of works in a single edit) to make human QA review and reverts manageable.

Data progeny and source references are captured in the edit metadata, instead of being encoded in the entity data model itself. In the case of importing external databases, the expectation is that special-purpose bot accounts will be used, and will tag timestamps and external identifiers in the edit metadata. Human editors can leave edit messages to clarify their sources.

A style guide and discussion forum are intended to be hosted as separate stand-alone services for editors to propose projects and debate process or scope changes. These services should have unified accounts and logins (OAuth?) for consistent account IDs across all services.

Sources

The core metadata bootstrap sources, by entity type, are:

  • releases: Crossref metadata, with DOIs as the primary identifier, and PubMed (central), Wikidata, and CORE identifiers cross-referenced
  • containers: munged metadata from the DOAJ, ROAD, and Norwegian journal list, with ISSN-Ls as the primary identifier. ISSN provides an "ISSN to ISSN-L" mapping to normalize electronic and print ISSN numbers.
  • creators: ORCID metadata and identifier.

Initial file metadata and matches (file-to-release) come from earlier Internet Archive matching efforts, and in particular efforts to extract bibliographic metadata from PDFs (using GROBID) and fuzzy match (with conservative settings) to Crossref metadata.

The intent is to continuously ingest and merge metadata from a small number of large (~2-3 million or more records) general-purpose aggregators and catalogs in a centralized fashion, using bots, and then support volunteers and organizations in writing bots to merge high-quality metadata from field- or institution-specific catalogs.

Progeny information (where the metadata comes from, or who "makes specific claims") is stored in edit metadata in the data model. Value-level attribution can be achieved by looking at the full edit history for an entity as a series of patches.

Implementation

The canonical backend datastore exposes a microservice-like HTTP API, which could be extended with gRPC or GraphQL interfaces. The initial datastore is a transactional SQL database, but this implementation detail is abstracted by the API.

As little "application logic" as possible should be embedded in this back-end; as much as possible would be pushed to bots which could be authored and operated by anybody. A separate web interface project talks to the API backend and can be developed more rapidly with less concern about data loss or corruption.

A cronjob will create periodic database dumps, both in "full" form (all tables and all edit history, removing only authentication credentials) and "flattened" form (with only the most recent version of each entity).

A goal is to be linked-data/RDF/JSON-LD/semantic-web "compatible", but not necessarily "first". It should be possible to export the database in a relatively clean RDF form, and to fetch data in a variety of formats, but internally fatcat will not be backed by a triple-store, and will not be bound to a rigid third-party ontology or schema.

Microservice daemons should be able to proxy between the primary API and standard protocols like ResourceSync and OAI-PMH, and third party bots could ingest or synchronize the database in those formats.

Roadmap

Major unimplemented features (as of September 2018) include:

  • backend "soundness" work to ensure corrupt data model states aren't reachable via the API
  • authentication and account creation
  • rate-limiting and spam/abuse mitigation
  • "automated update" bots to consume metadata feeds (as opposed to one-time bulk imports)
  • actual entity creation, editing, deleting through the web interface
  • updating the search index in near-real-time following editgroup merges. In particular, the cache invalidation problem is tricky for some relationships (eg, updating all releases if a container is updated)

Once a reasonable degree of schema and API stability is attained, contributions would be helpful to implement:

  • import (bulk and/or continuous updates) for more metadata sources
  • better handling of work/release distinction in, eg, search results and citation counting
  • de-duplication (via merging) for all entity types
  • matching improvements, eg, for references (citations), contributions (authorship), work grouping, and file/release matching
  • internationalization of the web interface (translation to multiple languages)
  • review of design for accessibility
  • better handling of non-PDF file formats

Longer term projects could include:

  • full-text search over release files
  • bi-directional synchronization with other user-editable catalogs, such as Wikidata
  • better representation of multi-file objects such as websites and datasets
  • alternate/enhanced backend to store full edit history without overloading traditional relational database

Known Issues

Too many right now, but this section will be populated soon.

  • changelog index may have gaps due to postgresql sequence and transaction roll-back behavior

Unresolved Questions

How to handle translations of, eg, titles and author names? To be clear, not translations of works (which are just separate releases), these are more like aliases or "originally known as".

Are bi-directional links a schema anti-pattern? Eg, should "work" point to a "primary release" (which itself points back to the work)?

Should identifier and citation be their own entities, referencing other entities by UUID instead of by revision? Not sure if this would increase or decrease database resource utilization.

Should contributor/author affiliation and contact information be retained? It could be very useful for disambiguation, but we don't want to build a huge database for spammers or "innovative" start-up marketing.

Can general-purpose SQL databases like Postgres or MySQL scale well enough to hold several tables with billions of entity revisions? Right from the start there are hundreds of millions of works and releases, many of which have dozens of citations, many authors, and many identifiers, and then we'll have potentially dozens of edits for each of these, which multiply out to `1e8 * 2e1 * 2e1 = 4e10`, or 40 billion rows in the citation table. If each row was 32 bytes on average (uncompressed, not including index size), that would be 1.3 TByte on its own, larger than common SSD disks. I do think a transactional SQL datastore is the right answer. In my experience locking and index rebuild times are usually the biggest scaling challenges; the largely-immutable architecture here should mitigate locking. Hopefully few indexes would be needed in the primary database, as user interfaces could rely on secondary read-only search engines for more complex queries and views.

There is a tension between focus and scope creep. If a central database like fatcat doesn't support enough fields and metadata, then it will not be possible to completely import other corpuses, and this becomes "yet another" partial bibliographic database. On the other hand, accepting arbitrary data leads to other problems: sparseness increases (we have more "partial" data), potential for redundancy is high, humans will start editing content that might be bulk-replaced, etc.

There might be a need to support "stub" references between entities. Eg, when adding citations from PDF extraction, the cited works are likely to be ambiguous. Could create "stub" works to be merged/resolved later, or could leave the citation hanging. Same with authors, containers (journals), etc.

Cataloging Style Guide

Language and Translation of Metadata

The Fatcat data model does not include multiple titles or names for the same entity, or even a "native"/"international" representation as seems common in other bibliographic systems. This most notably applies to release titles, but also to container and publisher names, and likely other fields.

For now, editors must use their own judgment over whether to use the title of the release as listed in the work itself.

This is not to be confused with translations of entire works, which should be treated as an entirely separate release.

External Identifiers

"Fake identifiers", which are actually registered and used in examples and documentation (such as DOI 10.5555/12345678) are allowed (and the entity should be tagged as a fake or example). Non-registered "identifier-like strings", which are semantically valid but not registered, should not exist in fatcat metadata in an identifier column. Invalid identifier strings can be stored in "extra" metadata. Crossref has blogged about this distinction.

DOI

All DOIs stored in an entity column should be registered (aka, should be resolvable from doi.org). Invalid identifiers may be cleaned up or removed by bots.

DOIs should always be stored and transferred in lower-case form. Note that there are almost no other constraints on DOIs (and handles in general): they may have multiple forward slashes, contain whitespace, be of arbitrary length, etc. Crossref has a number of examples of such "valid" but frustratingly formatted strings.

In the fatcat ontology, DOIs and release entities are one-to-one.

It is the intention to automatically (via bot) create a fatcat release for every Crossref-registered DOI from a whitelist of media types ("journal-article" etc, but not all), and it would be desirable to auto-create entities for in-scope publications from all registrars. It is not the intention to auto-create a release for every registered DOI. In particular, "sub-component" DOIs (eg, for an individual figure or table from a publication) aren't currently auto-created, but could be stored in "extra" metadata, or on a case-by-case basis.
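As an illustration, a cleanup bot might normalize DOI strings roughly like the following python sketch (lower-casing and stripping resolver prefixes); real cleanup would additionally verify registration against doi.org:

    def normalize_doi(raw: str):
        # strip common resolver prefixes and surrounding whitespace
        doi = raw.strip()
        for prefix in ("https://doi.org/", "http://doi.org/", "http://dx.doi.org/", "doi:"):
            if doi.lower().startswith(prefix):
                doi = doi[len(prefix):]
                break
        doi = doi.lower()
        # beyond the "10." prefix and a slash, few constraints can be assumed
        if not doi.startswith("10.") or "/" not in doi:
            return None
        return doi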

Human Names

Representing names of human beings in databases is a fraught subject. For some background reading, see:

Particularly difficult issues in the context of a bibliographic database include the non-universal concept of "family" vs. "given" names and their relationship to first and last names; the inclusion of honorary titles and other suffixes and prefixes to a name; the distinction between "preferred", "legal", and "bibliographic" names, or other situations where a person may not wish to be known under the name they are commonly referred to under; language and character set issues; and pseudonyms, anonymous publications, and fake personas (perhaps representing a group, like Bourbaki).

The general guidance for Fatcat is to:

  • not be a "source of truth" for representing a persona or human being; ORCID and Wikidata are better suited to this task
  • represent author personas, not necessarily 1-to-1 with human beings
  • prioritize the concerns of a reader or researcher over that of the author
  • enable basic interoperability with external databases, file formats, schemas, and style guides
  • when possible, respect the wishes of individuals

The data model for the creator entity has three name fields:

  • surname and given_name: needed for "aligning" with external databases, and to export metadata to many standard formats
  • display_name: the "preferred" representation for display of the entire name, in the context of international attribution of authorship of a written work

Names do not necessarily need to be expressed in a Latin character set, nor do they necessarily need to be in the native language of the creator or the language of their notable works.

Ideally all three fields are populated for all creators.

It seems likely that this schema and guidance will need review. "Extra" metadata can be used to store aliases and alternative representations, which may be useful for disambiguation and automated de-duplication.

Editgroups and Meta-Meta-Data

Editors are expected to group their edits in semantically meaningful editgroups of a reasonable size for review and acceptance. For example, merging two creators and updating related releases could all go in a single editgroup. Large refactors, conversions, and imports, which may touch thousands of entities, should be grouped into reasonable size editgroups; extremely large editgroups may cause technical issues, and make review unmanageable. 50 edits is a decent batch size, and 100 is a good upper limit (and may be enforced by the server).

Entity Types

TODO: entity-type-specific scope and quality guidance

Work/Release/File Distinctions

TODO: clarify distinctions and relationship between these three entity types

Schema "Alignments"

A table (CSV) of "alignments" between fatcat entity types and fields with other file formats and standards is available under the ./notes/ directory of the source git repository.

TODO: in particular, highlight alignments with:

  • citation style language (CSL)
  • bibtex
  • crossref API schema
  • dublin core (schema.org, OAI-PMH)
  • BIBFRAME
  • resourceSync
  • google scholar
  • pubmed/medline

Entity Field Reference

All entities have:

  • extra: free-form JSON metadata

The "extra" field is an "escape hatch" to include extra fields not in the regular schema. It is intended to enable gradual evolution of the schema, as well as accommodating niche or field-specific content. That being said, reasonable limits should be adhered to.

Containers

  • name: (string, required). The title of the publication, as used in international indexing services. Eg, "Journal of Important Results". Not necessarily in the native language, but also not necessarily in English. Alternative titles (and translations) can be stored in "extra" metadata (TODO: what field?).
  • publisher (string): The name of the publishing organization. Eg, "Society of Curious Students".
  • issnl (string): an external identifier, with registration controlled by the ISSN organization. Registration is relatively inexpensive and easy to obtain (depending on world region), so almost all serial publications have one. The ISSN-L ("linking ISSN") is one of either the print ("ISSNp") or electronic ("ISSNe") identifiers for a serial publication; not all publications have both types of ISSN, but many do, which can cause confusion. The ISSN master list is not gratis/public, but the ISSN-L mapping is.
  • wikidata_qid (string): external linking identifier to a Wikidata entity.
  • abbrev (string): a commonly used abbreviation for the publication, as used in citations, following the ISO 4 standard. Eg, "Journal of Polymer Science Part A" -> "J. Polym. Sci. A". Alternative abbreviations can be stored in "extra" metadata. (TODO: what field?)
  • coden (string): an external identifier, the CODEN code. 6 characters, all upper-case.

Creators

See "Human Names" sub-section of style guide.

  • display_name (string, required): Eg, "Grace Hopper".
  • given_name (string): Eg, "Grace".
  • surname (string): Eg, "Hooper".
  • orcid (string): external identifier, as registered with ORCID.
  • wikidata_qid (string): external linking identifier to a Wikidata entity.

Files

  • size (positive, non-zero integer): Eg: 1048576.
  • sha1 (string): Eg: "f013d66c7f6817d08b7eb2a93e6d0440c1f3e7f8".
  • md5: Eg: "d41efcc592d1e40ac13905377399eb9b".
  • sha256: Eg: "a77e4c11a57f1d757fca5754a8f83b5d4ece49a2d28596889127c1a2f3f28832".
  • urls: An array of "typed" URLs. Order is not meaningful, and may not be preserved.
    • url (string, required): Eg: "https://example.edu/~frau/prcding.pdf".
    • rel (string, required): Eg: "webarchive".
  • mimetype (string): example: "application/pdf"
  • releases (array of identifiers): references to release entities that this file represents a manifestation of. Note that a single file can contain multiple release references (eg, a PDF containing a full issue with many articles), and that a release will often have multiple files (differing only by watermarks, or different digitizations of the same printed work, or variant MIME/media types of the same published work). See also "Work/Release/File Distinctions".
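As a sketch, the machine-verifiable fields for a local PDF could be computed and assembled into a file entity body along these lines (field names follow the list above; the exact request schema should be confirmed against the API specification):

    import hashlib

    def file_entity_from_path(path, url, rel="web"):
        sha1, sha256, md5 = hashlib.sha1(), hashlib.sha256(), hashlib.md5()
        size = 0
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                size += len(chunk)
                sha1.update(chunk)
                sha256.update(chunk)
                md5.update(chunk)
        return {
            "size": size,
            "sha1": sha1.hexdigest(),
            "sha256": sha256.hexdigest(),
            "md5": md5.hexdigest(),
            "mimetype": "application/pdf",
            "urls": [{"url": url, "rel": rel}],
            "releases": [],  # fill in linked release identifiers if known
        }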

Releases

  • title (required): the title of the release.
  • work_id (fatcat identifier; required): the (single) work that this release is grouped under. If not specified in a creation (POST) action, the API will auto-generate a work.
  • container_id (fatcat identifier): a (single) container that this release is part of. When expanded the container field contains the full container entity.
  • release_type (string, controlled set): represents the medium or form-factor of this release; eg, "book" versus "journal article". Not necessarily consistent across all releases of a work. See definitions below.
  • release_status (string, controlled set): represents the publishing/review lifecycle status of this particular release of the work. See definitions below.
  • release_date (string, date format): when this release was first made publicly available
  • doi (string): full DOI number, lower-case. Example: "10.1234/abcde.789". See the "External Identifiers" section of style guide.
  • isbn13 (string): external identifier for books. ISBN-10 and other formats should be converted to canonical ISBN-13. See the "External Identifiers" section of style guide.
  • core_id (string): external identifier for the [CORE] open access aggregator. These identifiers are integers, but stored in string format. See the "External Identifiers" section of style guide.
  • pmid (string): external identifier for PubMed database. These are bare integers, but stored in a string format. See the "External Identifiers" section of style guide.
  • pmcid (string): external identifier for PubMed Central database. These are integers prefixed with "PMC" (upper case), like "PMC4321". See the "External Identifiers" section of style guide.
  • wikidata_qid (string): external identifier for Wikidata entities. These are integers prefixed with "Q", like "Q4321". Each release entity can be associated with at most one Wikidata entity (this field is not an array), and Wikidata entities should be associated with at most a single release. In the future it may be possible to associate Wikidata entities with work entities instead. See the "External Identifiers" section of style guide.
  • volume (string): optionally, stores the specific volume of a serial publication this release was published in.
  • issue (string): optionally, stores the specific issue of a serial publication this release was published in.
  • pages (string): the pages (within a volume/issue of a publication) that this release can be looked up under. This is a free-form string, and could represent the first page, a range of pages, or even prefix pages (like "xii-xxx").
  • publisher (string): name of the publishing entity. This does not need to be populated if the associated container entity has the publisher field set, though it is acceptable to duplicate, as the publishing entity of a container may differ over time. Should be set for singleton releases, like books.
  • language (string): the primary language used in this particular release of the work. Only a single language can be specified; additional languages can be stored in "extra" metadata (TODO: which field?). This field should be a valid RFC1766/ISO639-1 language code ("with extensions"), aka a controlled vocabulary, not a free-form name of the language.
  • contribs: an array of authorship and other creator contributions to this release. Contribution fields include:
    • index (integer, optional): the (zero-indexed) order of this author. Authorship order has significance in many fields. Non-author contributions (illustration, translation, editorship) may or may not be ordered, depending on context, but index numbers should be unique per release (aka, there should not be "first author" and "first translator")
    • creator_id (identifier): if known, a reference to a specific creator
    • raw_name (string): the name of the contributor, as attributed in the text of this work. If the creator_id is linked, this may be different from the display_name; if a creator is not linked, this field is particularly important. Syntax and name order is not specified, but most often will be "display order", not index/alphabetical (in Western tradition, surname followed by given name).
    • role (string, of a set): the type of contribution, from a controlled vocabulary. TODO: vocabulary needs review.
    • extra (string): additional context can go here. For example, author affiliation, "this is the corresponding author", etc.
  • refs: an array of references (aka, citations) to other releases. References can only be linked to a specific target release (not a work), though it may be ambiguous which release of a work is being referenced if the citation is not specific enough. Reference fields include:
    • index (integer, optional): reference lists and bibliographies almost always have an implicit order. Zero-indexed. Note that this is distinct from the key field.
    • target_release_id (fatcat identifier): if known, and the release exists, a cross-reference to the fatcat entity
    • extra (JSON, optional): additional citation format metadata can be stored here, particularly if the citation schema does not align. Common fields might be "volume", "authors", "issue", "publisher", "url", and external identifiers ("doi", "isbn13").
    • key (string): works often reference works with a short slug or index number, which can be captured here. For example, "[BROWN2017]". Keys generally supersede the index field, though both can/should be supplied.
    • year (integer): year of publication of the cited release.
    • container_title (string): if applicable, the name of the container of the release being cited, as written in the citation (usually an abbreviation).
    • title (string): the title of the work/release being cited, as written.
    • locator (string): a more specific reference into the work/release being cited, for example the page number(s). For web references, store the URL in "extra", not here.

Controlled vocabulary for release_type is derived from the Crossref type vocabulary (TODO: should it follow CSL types instead?):

  • journal-article
  • proceedings-article
  • monograph
  • dissertation
  • book (and edited-book, reference-book)
  • book-chapter (and book-part, book-section, though much rarer) is allowed as these are frequently referenced and read independent of the entire book. The data model does not currently support linking a subset of a release to an entity representing the entire release. The release/work/file distinctions should not be used to group chapters into a complete work; a book chapter can be its own work. A paper which is republished as a chapter (eg, in a collection, or "edited" book) can have both releases under one work. The criterion for whether to "split" a book and have release entities for each chapter is whether the chapter has been cited/referenced as such.
  • dataset (though representation with file entities is TBD).
  • report
  • standard
  • posted-content is allowed, but may be re-categorized. For crossref, this seems to imply a journal article or report which is not published (pre-print)
  • other matches Crossref other works, which may (and generally should) have a more specific type set.
  • web-post (custom extension) for blog posts, essays, and other individual works on websites
  • website (custom extension) for entire web sites and wikis.
  • presentation (custom extension) for, eg, slides and recorded conference presentations themselves, as distinct from proceedings-article
  • editorial (custom extension) for columns, "in this issue", and other content published alongside peer-reviewed content in journals. Can bleed into "other" or "stub".
  • book-review (custom extension)
  • letter for "letters to the editor", "authors respond", and sub-article-length published content
  • example (custom extension) for dummy or example releases that have valid (registered) identifiers. Other metadata does not need to match "canonical" examples.
  • stub (custom extension) for releases which have notable external identifiers, and thus are included "for completeness", but don't seem to represent a "full work". An example might be a paper that gets an extra DOI by accident; the primary DOI should be a full release, and the accidental DOI can be a stub release under the same work. stub releases shouldn't be considered full releases when counting or aggregating (though if technically difficult this may not always be implemented). Other things that can be categorized as stubs (which seem to often end up mis-categorized as full articles in bibliographic databases):
    • an abstract, which is only an abstract of a larger work
    • commercial advertisements
    • "trap" or "honey pot" works, which are fakes included in databases to detect re-publishing without attribution
    • "This page is intentionally blank"
    • "About the author", "About the editors", "About the cover"
    • "Acknowledgments"
    • "Notices"

Other types from Crossref (such as component, reference-entry) are valid, but are not actively solicited for inclusion, as they are not the current focus of the database.

In the future, some types (like journal, proceedings, and book-series) will probably be represented as container entities. How to represent other container-like types (like report-series or book-series) is TBD.

Controlled vocabulary for release_status:

  • published for any version of the work that was "formally published", or any variant that can be considered a "proof", "camera ready", "archival", "version of record" or "definitive" that has no meaningful differences from the "published" version. Note that "meaningful" here will need to be explored.
  • corrected for a version of a work that, after formal publication, has been revised and updated. Could be the "version of record".
  • pre-print, for versions of a work which have not been submitted for peer review or formal publication
  • post-print, often a post-peer-review version of a work that does not have publisher-supplied copy-editing, typesetting, etc.
  • draft in the context of book publication or online content (shouldn't be applied to journal articles), is an unpublished, but somehow notable version of a work.
  • If blank, indicates status isn't known, and wasn't inferred at creation time. Can often be interpreted as published.

Controlled vocabulary for role field on contribs:

  • author
  • translator
  • illustrator
  • editor
  • If blank, indicates that type of contribution is not known; this can often be interpreted as authorship.

Current "extra" fields, flags, and content:

  • crossref (object), for extra crossref-specific metadata
  • is_retracted (boolean flag) if this work has been retracted
  • translation_of (release identifier) if this release is a translation of another (usually under the same work)
  • arxiv_id (string) external identifier to a (version-specific) arxiv.org work

Abstracts

Abstract contents (in raw string form) are stored in their own table, and are immutable (not editable), but there is release-specific metadata as part of release entities.

  • sha1 (string, hex, required): reference to the abstract content (string). Example: "3f242a192acc258bdfdb151943419437f440c313"
  • content (string): The abstract raw content itself. Example: <jats:p>Some abstract thing goes here</jats:p>
  • mimetype (string): not formally required, but should effectively always get set. text/plain if the abstract doesn't have a structured format
  • lang (string, controlled set): the human language this abstract is in. See the lang field of release for format and vocabulary.
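The sha1 reference is simply the hex-encoded SHA-1 of the raw content string; a minimal sketch (assuming UTF-8 encoding of the content before hashing):

    import hashlib

    content = "Some abstract text goes here."
    abstract = {
        "sha1": hashlib.sha1(content.encode("utf-8")).hexdigest(),
        "content": content,
        "mimetype": "text/plain",
        "lang": "en",
    }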

Works

Works have no fields! They just group releases.

REST API

The fatcat HTTP API is mostly a classic REST CRUD (Create, Read, Update, Delete) API, with a few twists.

A declarative specification of all API endpoints, JSON data models, and response types is available in OpenAPI 2.0 format. Code generation tools are used to generate both server-side type-safe endpoint routes and client-side libraries. Auto-generated reference documentation is, for now, available at https://api.qa.fatcat.wiki.

All API traffic is over HTTPS; there is no insecure HTTP endpoint, even for read-only operations. To start, all endpoints accept and return only JSON serialized content.

Entity Endpoints/Actions

Actions could, in theory, be directed at any of:

entities (ident)
revision
edit

A design decision to be made is how much to abstract away the distinction between these three types (particularly the identifier/revision distinction).

Top-level entity actions (resulting in edits):

create (new rev)
redirect
split
update (new rev)
delete

On existing entity edits (within a group):

update
delete

An edit group as a whole can be:

create
submit
accept

Other per-entity endpoints:

match (by field/context)
lookup (by external persistent identifier)

Editgroups

All mutating entity operations (create, update, delete) accept an editgroup_id query parameter. If the parameter isn't set, the editor's "currently active" editgroup will be used, or a new editgroup will be created from scratch. It's generally preferable to manually create an editgroup and use its id in edit requests; this allows appropriate metadata to be set. The "currently active" editgroup behavior may be removed in the future.

Sub-Entity Expansion

To reduce the need for multiple GET queries when looking for common related metadata, it is possible to include linked entities in responses using the expand query parameter. For example, by default the release model only includes an optional container_id field which points to a container entity. If the expand parameter is set:

https://api.qa.fatcat.wiki/v0/release/aaaaaaaaaaaaarceaaaaaaaaam?expand=container

Then the full container model will be included under the container field. Multiple expand parameters can be passed, comma-separated.
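For example, using python and the requests library against the URL above (the container field and its name sub-field are as described in the entity field reference):

    import requests

    resp = requests.get(
        "https://api.qa.fatcat.wiki/v0/release/aaaaaaaaaaaaarceaaaaaaaaam",
        params={"expand": "container"},
    )
    resp.raise_for_status()
    release = resp.json()
    # with expansion, the full container entity is nested, not just container_id
    print(release["container"]["name"])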

Authentication and Authorization

There are two editor types: bots and humans. Additionally, either type of editor may have additional privileges which allow them to, eg, directly accept editgroups (as opposed to submitting edits for review).

All mutating API calls (POST, PUT, DELETE HTTP verbs) require token-based authentication using an HTTP Bearer token. If you can't generate such a token from the web interface (because that feature hasn't been implemented), look for a public demo token for experimentation, or ask an administrator for a token.

Autoaccept Flag

Currently only on batch creation (POST) for entities.

For all bulk operations, optional 'editgroup' query parameter overrides individual editgroup parameters.

If the autoaccept flag is set and editgroup is not, a new editgroup is automatically created and used for all inserted entities. Note that this is different behavior from the "use current or create new" default behavior for regular creation.

Unfortunately, "true" and "false" are the only values acceptable for boolean rust/openapi2 query parameters.

QA Instance

The intent is to run a public "sandbox" QA instance of the catalog, using a subset of the full catalog, running the most recent development branch of the API specification. This instance can be used by developers for prototyping and experimentation, though note that all data is periodically wiped, and this endpoint is more likely to have bugs or be offline.

Bulk Exports

There are several types of bulk exports and database dumps folks might be interested in:

  • raw, native-format database backups: for disaster recovery (would include volatile/unsupported schema details, user API credentials, full history, in-process edits, comments, etc)
  • a sanitized version of the above: roughly per-table dumps of the full state of the database. Could use per-table SQL expressions with sub-queries to pull in small tables ("partial transform") and export JSON for each table; would be extra work to maintain, so not pursuing for now.
  • full history, full public schema exports, in a form that might be used to mirror or entirely fork the project. Propose supplying the full "changelog" in API schema format, in a single file to capture all entity history, without "hydrating" any inter-entity references. Rely on separate dumps of non-entity, non-versioned tables (editors, abstracts, etc). Note that a variant of this could use the public interface, in particular to do incremental updates (though that wouldn't capture schema changes).
  • transformed exports of the current state of the database (aka, without history). Useful for data analysis, search engines, etc. Propose supplying just the Release table in a fully "hydrated" state to start. Unclear if this should be on a work or release basis; will go with release for now. Harder to do using the public interface because of the need for transaction locking.

Identifier Snapshots

One form of bulk export is a fast, consistent (single database transaction) snapshot of all "live" entity identifiers and their current revisions. This snapshot can be used by non-blocking background scripts to generate full bulk exports that will be consistent.

These exports are generated by the ./extra/sql_dumps/ident_table_snapshot.sh script, run on a primary database machine, and result in a single tarball, which gets uploaded to archive.org. The format is TSV (tab-separated). Unlike all other dumps and public formats, the fatcat identifiers in these dumps are in raw UUID format (not base32-encoded).

A variant of these dumps is to include external identifiers, resulting in files that map, eg, (release ID, DOI, PubMed identifiers, Wikidata QID).

Abstract Table Dumps

The ./extra/sql_dumps/dump_abstracts.sql file, when run from the primary database machine, outputs all raw abstract strings in JSON format, one-object-per-line.

Abstracts are immutable and referenced by hash in the database, so the consistency of these dumps is not as much of a concern as with other exports. See the Policy page for more context around abstract exports.

"Expanded" Entity Dumps

Using the above identifier snapshots, the fatcat-export script outputs single-entity-per-line JSON files with the same schema as the HTTP API. The most useful version of these for most users are the "expanded" (including container and file metadata) release exports.

These exports are compressed and uploaded to archive.org.

Changelog Entity Dumps

A final export type is changelog dumps. Currently these are implemented in python, and anybody can create them. They contain JSON, one-line-per-changelog-entry, with the full list of entity edits and editgroup metadata for the given changelog entry. Changelog history is immutable; this script works by iterating up the (monotonic) changelog counter until it encounters a 404.
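A sketch of that iteration pattern in python, assuming a changelog endpoint of the form /v0/changelog/{index} (the exact path should be confirmed against the API specification):

    import requests

    API = "https://api.qa.fatcat.wiki/v0"

    def iterate_changelog(start=1):
        index = start
        while True:
            resp = requests.get("{}/changelog/{}".format(API, index))
            if resp.status_code == 404:
                break  # reached the end of the (monotonic) changelog
            resp.raise_for_status()
            yield resp.json()
            index += 1

    for entry in iterate_changelog():
        print(entry)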

Cookbook

Updating an Existing Entity

  1. Fetch (GET) the existing entity
  2. Create (POST) a new editgroup
  3. Update (PUT) the entity, with the current revision number in the prev edit field, and the editgroup id set
  4. Submit (POST? TBD) the editgroup for review
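A sketch of these steps for a release entity, using python and requests; the endpoint paths, field names, and submit action shown here are assumptions to check against the API specification, and mutating calls need a Bearer token:

    import requests

    API = "https://api.qa.fatcat.wiki/v0"
    AUTH = {"Authorization": "Bearer DUMMY-TOKEN"}  # placeholder token

    # 1. fetch the existing entity
    release = requests.get(API + "/release/aaaaaaaaaaaaarceaaaaaaaaam").json()

    # 2. create a new editgroup
    eg = requests.post(API + "/editgroup",
                       json={"description": "fix title typo"}, headers=AUTH).json()

    # 3. update the entity, tagged with the editgroup
    release["title"] = "Corrected Title"
    resp = requests.put(API + "/release/" + release["ident"],
                        params={"editgroup_id": eg["editgroup_id"]},
                        json=release, headers=AUTH)
    resp.raise_for_status()

    # 4. submit the editgroup for review (exact mechanism TBD, per the list above)
    requests.post(API + "/editgroup/" + eg["editgroup_id"] + "/submit", headers=AUTH)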

Merging Duplicate Entities

  1. Fetch (GET) both entities
  2. Decide which will be the "primary" entity (the other will redirect to it)
  3. Create (POST) a new editgroup
  4. Update (PUT) the "primary" entity with any updated metadata merged from the other entity (optional), and the editgroup id set
  5. Update (PUT) the "other" entity with the redirect flag set to the primary's identifier, with the current revision id (of the "other" entity) in the prev field, and the editgroup id set
  6. Submit (POST? TBD) the editgroup for review

Lookup Fulltext URLs by DOI

  1. Use the release lookup endpoint (GET) with the DOI as a query parameter, and with expand=files
  2. If a release hit is found, iterate over the linked file entities, and create a ranked list of URLs based on mimetype, URL "rel" type, file size, or host domain.
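A sketch of this recipe in python; the lookup path and the shape of the expanded files field are assumptions to confirm against the API specification:

    import requests

    API = "https://api.qa.fatcat.wiki/v0"

    def fulltext_urls(doi):
        resp = requests.get(API + "/release/lookup",
                            params={"doi": doi.lower(), "expand": "files"})
        if resp.status_code == 404:
            return []
        resp.raise_for_status()
        ranked = []
        # prefer archived and repository copies; fall back to anything else
        rel_rank = {"webarchive": 0, "web-archive": 0, "repository": 1, "author-homepage": 2}
        for f in resp.json().get("files", []):
            if f.get("mimetype") not in (None, "application/pdf"):
                continue
            for u in f.get("urls", []):
                ranked.append((rel_rank.get(u.get("rel"), 9), u["url"]))
        return [url for _, url in sorted(ranked)]

    print(fulltext_urls("10.5555/12345678"))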

Batch Insert New Entities (Bootstrapping)

When bootstrapping a blank catalog, we need to insert 10s or 100s of millions of entities as fast as possible.

  1. Create (POST) a new editgroup, with progeny information included
  2. Batch create (POST) entities
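A sketch of this bootstrap flow, assuming a batch creation endpoint like POST /v0/release/batch and the autoaccept flag described in the REST API section (parameter names and paths should be checked against the specification):

    import requests

    API = "https://api.qa.fatcat.wiki/v0"
    AUTH = {"Authorization": "Bearer DUMMY-TOKEN"}  # placeholder token

    # 1. create an editgroup carrying progeny information in its metadata
    eg = requests.post(API + "/editgroup",
                       json={"description": "bulk import from example source",
                             "extra": {"source": "example-dump", "timestamp": "2018-09-01"}},
                       headers=AUTH).json()

    # 2. batch-create entities in modest chunks, tagged with that editgroup
    releases = [{"title": "Example Work {}".format(i)} for i in range(1000)]
    for i in range(0, len(releases), 100):
        resp = requests.post(API + "/release/batch",
                             params={"editgroup_id": eg["editgroup_id"], "autoaccept": "true"},
                             json=releases[i:i + 100], headers=AUTH)
        resp.raise_for_status()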

Software Contributions

For now, issues and patches can be filed at https://github.com/internetarchive/fatcat.

To start, the back-end (fatcatd, in rust), web interface (fatcat-web, in python), bots, and this guide are all versioned in the same git repository.

See the rust/README and rust/HACKING documents for some common tasks and gotchas when working with the rust backend.

When considering making a non-trivial contribution, it can save review time and duplicated work to post an issue with your intentions and plan. New code and features will need to include unit tests before being merged, though we can help with writing them.

Norms and Policies

These social norms are explicitly expected to evolve and mature if the number of contributors to the project grows. It is important to have some policies as a starting point, but also important not to set these policies in stone until they have been reviewed.

Social Norms and Conduct

Contributors (editors and software developers) are expected to treat each other excellently, to assume good intentions, and to participate constructively.

Metadata Licensing

The Fatcat catalog content license is the Creative Commons Zero ("CC-0") license, which is effectively a public domain grant. This applies to the catalog metadata itself (titles, entity relationships, citation metadata, URLs, hashes, identifiers), as well as "meta-meta-data" provided by editors (edit descriptions, progeny metadata, etc).

The core catalog is designed to contain only factual information: "this work, known by this title and with these third-party identifiers, is believed to be represented by these files and published under such-and-such venue". As a norm, sourcing metadata (for attribution and progeny) is retained for each edit made to the catalog.

A notable exception to this policy are abstracts, for which no copyright claims or license is made. Abstract content is kept separate from core catalog metadata; downstream users need to make their own decision regarding reuse and distribution of this material.

As a social norm, it is expected (and appreciated!) that downstream users of the public API and/or bulk exports provide attribution, and even transitive attribution (acknowledging the original source of metadata contributed to Fatcat). As an academic norm, researchers are encouraged to cite the corpus as a dataset (when this option becomes available). However, neither of these norms are enforced via the copyright mechanism.

As a strong norm, editors should expect full access to the full corpus and edit history, including all of their contributions.

Immutable History

All editors agree to the licensing terms, and understand that their full public history of contributions is made irrevocably public. Edits and contributions may be reverted, but the history (and content) of their edits are retained. Edit history is not removed from the corpus on the request of an editor or when an editor closes their account.

In an emergency situation, such as non-bibliographic content getting encoded in the corpus by bypassing normal filters (eg, base64 encoding hate crime content or exploitative photos, as has happened to some blockchain projects), the ecosystem may decide to collectively, in a coordinated manner, expunge specific records from their history.

Documentation Licensing

This guide ("Fatcat: The Guide") is licensed under the Creative Commons Attribution license.

Software Licensing

The Fatcat software project licensing policy is to adopt strong copyleft licenses for server software (where the majority of software development takes place), and permissive licenses for client library and bot framework software, and CC-0 (public grant) licensing for declarative interface specifications (such as SQL schemas and REST API specifications).

Privacy Policy

It is important to note that this section is currently aspirational: the servers hosting early deployments of fatcat are largely in a default configuration and have not been audited to ensure that these guidelines are being followed.

It is a goal for fatcat to conduct as little surveillance of reader and editor behavior and activities as possible. In practical terms, this means minimizing the overall amount of logging and collection of identifying information. This is in contrast to submitted edit content, which is captured, preserved, and republished as widely as possible.

The general intention is to:

  • not use third-party tracking (via external browser-side requests or javascript)
  • collect aggregate metrics (overall hit numbers), but not log individual interactions ("this IP visited this page at this time")

Exceptions will likely be made:

  • temporary caching of IP addresses may be necessary to implement rate-limiting and debug traffic spikes
  • exception logging, abuse detection, and other exceptional circumstances

Some uncertain areas of privacy include:

  • should third-party authentication identities be linked to editor ids? what about the specific case of ORCID if used for login?
  • what about discussion and comments on edits? should conversations be included in full history dumps? should editors be allowed to update or remove comments?

About This Guide

This guide is generated from markdown text files using the mdBook tool. The source is mirrored on Github at https://github.com/internetarchive/fatcat.

Contributions and corrections are welcome! If you create a (free) account on github you can submit comments and corrections as "Issues", or directly edit the source and submit "Pull Requests" with changes.

This guide is licensed under a Creative Commons Attribution (CC-BY) license, meaning you are free to redistribute, sell, and extend it without special permission, as long as you credit the original authors.