Web Capture Entity Reference
Warning: This schema is not yet stable.
cdx(array of objects): each entry represents a distinct web resource (URL). First is considered the primary/entry. Roughly aligns with CDXJ schema.
surt(string, required): sortable URL format
timestamp(string, datetime, required): ISO format, UTC timezone, with
Zprefix required, with second (or finer) precision. Eg, "2016-09-19T17:20:24Z". Wayback timestamps (like "20160919172024") should be converted naively.
url(string, required): full URL
mimetype(string): content type of the resource
status_code(integer, signed): HTTP status code
sha1(string, required): SHA-1 hash in lower-case hex
sha256(string): SHA-256 hash in lower-case hex
archive_urls: An array of "typed" URLs where this snapshot can be found. Can be wayback/memento instances, or direct links to a WARC file containing all the capture resources. Often will only be a single archive. Order is not meaningful, and may not be preserved.
url(string, required): Eg: "https://example.edu/~frau/prcding.pdf".
rel(string, required): Eg: "wayback" or "warc"
original_url(string): base URL of the resource. May reference a specific CDX entry, or maybe in normalized form.
timestamp(string, datetime): same format as CDX line timestamp (UTC, etc). Corresponds to the overall capture timestamp. Can be the earliest of CDX timestamps if that makes sense
content_scope(string): for situations where the webcapture does not simply contain the full representation of a work (eg, HTML fulltext, for an
article-journalrelease), describes what that scope of coverage is. Eg,
landing-pageit doesn't contain the full content. Landing pages are out-of-scope for fatcat, but if they were accidentally imported, should mark them as such so they aren't re-imported. Uses same vocabulary as File entity.
release_ids(array of string identifiers): references to