The LDES DCAT-AP Feed specification

Living Standard

This version:
https://semiceu.github.io/LDES-DCAT-AP-feeds/index.html
Issue Tracking:
GitHub
Editors:
- Pieter Colpaert
- Matthias Palmér

Abstract

Repeatedly publishing a full data dump delegates change detection, an error-prone process, to data consumers. With DCAT-AP Feeds we propose that DCAT-AP catalog maintainers publish an event source API that helps a harvester replicate the catalog and keep it in sync in the way the publisher intends. This specification therefore describes how to publish your DCAT-AP entity changes using the Activity Streams vocabulary and LDES. It also provides a specification for harvesters to provide transparency into their harvesting progress.

1. Publishing changes about DCAT-AP entities

A DCAT-AP Feed is a Linked Data Event Stream whose members are ActivityStreams Create, Update and Delete activities about the DCAT-AP entities in a catalog. DCAT-AP Feeds uses the [activitystreams-vocabulary] to indicate the type of change. Three types of activities can be described: as:Create, as:Update and as:Delete.

Each activity MUST provide, using the object property, the IRI of the DCAT-AP entity (which therefore cannot be a blank node), SHOULD come with a published property with an xsd:dateTime datatype, and SHOULD provide a type. The activity MUST be identified using an IRI. The payload of the DCAT-AP entity MUST be provided in the named graph that has the activity IRI as its name.

Note: When a harvester processes the set of quads in the named graph, it can create or replace all quads in a named graph for the DCAT-AP entity. The IRI of that graph can, for example, be a concatenation of the entity IRI with the LDES IRI, so that multiple representations of the same DCAT-AP entity from various sources can be kept side by side.
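The requirements on a single activity can be checked mechanically. The following is a minimal sketch, assuming members are available as compacted JSON-LD dictionaries using the term names of this specification's examples (@id, @type, object, published). The function names check_activity and is_iri are hypothetical and not part of this specification.

```python
from datetime import datetime
from urllib.parse import urlparse

ACTIVITY_TYPES = {"Create", "Update", "Delete"}


def is_iri(value):
    """Rough check: an absolute IRI has a scheme; blank nodes start with '_:'."""
    if not isinstance(value, str) or value.startswith("_:"):
        return False
    return bool(urlparse(value).scheme)


def check_activity(member):
    """Return (errors, warnings): MUST violations and SHOULD violations."""
    errors, warnings = [], []
    if not is_iri(member.get("@id")):
        errors.append("activity MUST be identified using an IRI")
    if not is_iri(member.get("object")):
        errors.append("object MUST be the IRI of the DCAT-AP entity")
    # @type may be a single string or a list of strings in JSON-LD
    types = member.get("@type") or []
    if isinstance(types, str):
        types = [types]
    if not ACTIVITY_TYPES.intersection(types):
        warnings.append("activity SHOULD provide a type")
    try:
        datetime.fromisoformat(member.get("published").replace("Z", "+00:00"))
    except (AttributeError, ValueError):
        warnings.append("activity SHOULD have an xsd:dateTime published value")
    return errors, warnings
```

This sketch only covers the activity itself. Validating the DCAT-AP payload in the named graph is a matter for the SHACL shape.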

Fall-backs for when one of these optional properties is not available:

All activities are immutable: once a member is published, it cannot be altered. Each activity MUST be a member of an append-only change log or event stream typed EventStream, which MUST be given an IRI. This EventStream is the DCAT-AP Feed, and it conforms to the Linked Data Event Stream specification. On a DCAT-AP Feed, the timestampPath MUST be set to published, unless the timestamp cannot be provided or the publisher has a deliberate reason to deviate. The versionOfPath MUST be set to object. This configures the property that points to the entity that is being altered.

A DCAT-AP Feed harvester SHOULD implement the full LDES specification, or re-use an existing LDES Client.
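To illustrate how timestampPath and versionOfPath drive replication, here is a minimal sketch of what a harvester might do with the members an LDES client emits. It assumes every member carries a published value, that a Create or Update replaces the stored graph for the entity, and that a Delete removes it. The in-memory dictionary store and the function names are hypothetical. The store is keyed by the pair (feed IRI, entity IRI) rather than by a concatenated graph IRI.

```python
from datetime import datetime


def published_at(member):
    """Parse the member's published value (the timestampPath) as a datetime."""
    return datetime.fromisoformat(member["published"].replace("Z", "+00:00"))


def apply_member(store, feed_iri, member):
    """Upsert or remove the entity that the member's object (versionOfPath) points to."""
    key = (feed_iri, member["object"])
    if member["@type"] == "Delete":
        store.pop(key, None)
    else:
        # Create and Update are both treated as an upsert of the whole graph
        store[key] = member.get("@graph", {})


def replicate(store, feed_iri, members):
    """Apply members in chronological order and return the updated store."""
    for member in sorted(members, key=published_at):
        apply_member(store, feed_iri, member)
    return store
```

A real harvester would persist the quads in a triple store and would normally leave ordering and bookkeeping to an existing LDES client.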

A JSON-LD example:

{
    "@context" : {
      "ldes": "https://w3id.org/ldes#",
      "tree": "https://w3id.org/tree#",
      "as": "https://www.w3.org/ns/activitystreams#",
      "dct": "http://purl.org/dc/terms/",
      "xsd":"http://www.w3.org/2001/XMLSchema#",
      "EventStream" : "ldes:EventStream",
      "shape": { "@id": "tree:shape", "@type": "@id"},
      "title": "dct:title",
      "timestampPath":  { "@id": "ldes:timestampPath", "@type": "@vocab"},
      "versionOfPath": { "@id": "ldes:versionOfPath", "@type": "@vocab"},
      "view": "tree:view",
      "member": "tree:member",
      "Create": "as:Create",
      "Delete": "as:Delete",
      "Update": "as:Update",
      "published": { "@id": "as:published", "@type": "xsd:dateTime"},
      "object": { "@id": "as:object", "@type": "@id"},
      "dcat":"http://www.w3.org/ns/dcat#"
    },
    "@id": "#Feed",
    "@type": "EventStream",
    "shape": "https://semiceu.github.io/LDES-DCAT-AP-feeds/shape.ttl#ActivityShape",
    "title": "My DCAT-AP Feed",
    "timestampPath": "published",
    "versionOfPath": "object",
    "view": {
        "@id": "",
        "comment": "This is the event source"
    },
    "member": [
        {
            "@id": "https://example.org/Dataset1#Event1",
            "@type": "Create",
            "object": "https://example.org/Dataset1",
            "published" : "2023-10-01T12:00:00Z",
            "@graph": {
                "@id": "https://example.org/Dataset1",
                "@type": "dcat:Dataset",
                "comment": "Everything in here is the actual data that needs to be upserted"
            }
        },
        {
            "@id": "https://example.org/Dataset1#Event2",
            "@type": "Delete",
            "object": "https://example.org/Dataset1",
            "published" : "2023-10-01T13:00:00Z"
        }
    ]
}

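Or the same data in TriG:

@prefix ldes: <https://w3id.org/ldes#> .
@prefix tree: <https://w3id.org/tree#> .
@prefix as: <https://www.w3.org/ns/activitystreams#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .

<#Feed> a ldes:EventStream ;
        tree:shape <https://semiceu.github.io/LDES-DCAT-AP-feeds/shape.ttl#ActivityShape> ;
        dct:title "My DCAT-AP Feed" ;
        ldes:timestampPath as:published ;
        ldes:versionOfPath as:object ;
        tree:view <> ;
        tree:member <https://example.org/Dataset1#Event1>, <https://example.org/Dataset1#Event2> .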

# This member is further described in the default graph 
<https://example.org/Dataset1#Event1> a as:Create ;
    as:object <https://example.org/Dataset1> ;
    as:published "2023-10-01T12:00:00Z"^^xsd:dateTime .

<https://example.org/Dataset1#Event1>  {
    <https://example.org/Dataset1> a dcat:Dataset ;
        ## The (updated) representation of this particular dataset
        ## ...
}
<https://example.org/Dataset1#Event2> a as:Delete ;
    as:object <https://example.org/Dataset1> ;
    as:published "2023-10-01T13:00:00Z"^^xsd:dateTime .

A DCAT-AP Feed MUST be published using either application/ld+json or application/trig and it MUST set the Content-Type header accordingly. In this spec, examples are provided for both serializations. Through content negotiation, other formats MAY be provided.
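On the consuming side, the Content-Type header tells a harvester which parser to use. The sketch below assumes the header value is available as a string. The function name serialization_for and the returned labels are hypothetical.

```python
# The two serializations a DCAT-AP Feed is required to support
SUPPORTED = {
    "application/ld+json": "json-ld",
    "application/trig": "trig",
}


def serialization_for(content_type):
    """Map a Content-Type header value to a parser label, ignoring parameters."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type not in SUPPORTED:
        raise ValueError(f"unsupported media type: {media_type!r}")
    return SUPPORTED[media_type]
```

A harvester that negotiates other formats would extend the mapping. The two entries shown are the ones this specification requires.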

This context information MUST be present:

# Typing it as an EventStream
<#Feed> a ldes:EventStream ;
        # Indicating every member will adhere to the ActivityShape defined by the DCAT-AP-Feeds specification
        tree:shape <https://semiceu.github.io/LDES-DCAT-AP-feeds/shape.ttl#ActivityShape> ;
        # Indicating the timestampPath will be as:published
        ldes:timestampPath as:published ;
        # Indicating the versionOfPath will be as:object
        ldes:versionOfPath as:object ;
        # The current page is a page of this event stream
        tree:view <> ;  # See pagination and retention policies for extra controls we will be able to describe here
        # a link to all members
        tree:member <...> .

The shape.ttl is part of this specification. A DCAT-AP Feeds provider SHOULD validate their members against this shape before adding them to the feed.

Note: The DCAT-AP Feed shapes graph extends the official DCAT-APv3 shapes, but does not fork them: we only add the concepts of how to use these shapes in a DCAT-AP Feed.

1.1. Entity types

In DCAT-AP 2.2, entity types are divided into main and supportive entity types based on their importance in the application profile. In DCAT-AP Feeds we need to make a slightly different division based on how they appear in the event stream. We will refer to the following three kinds of entity types:

  1. Standalone - these entities will appear in the event stream.

  2. Embedded - these entities will always be provided as part of standalone entities.

  3. Referenced - these entities are never described with triples, they are only referred to via their URIs.

Note: LDES feed publishers should not add references to standalone entities before they have been added. Conversely, when removing entities all references should be removed first.
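A publisher could check this ordering before appending to the feed. The following is a minimal sketch, assuming the publisher can list, per activity, the standalone entities that the payload refers to. The tuple layout and the function name find_ordering_problems are hypothetical.

```python
def find_ordering_problems(events):
    """Check an ordered list of (activity_type, entity_iri, referenced_iris).

    Reports references to standalone entities that are not yet in the feed,
    and deletions of entities that are still referenced by another entity.
    """
    live = {}  # entity IRI -> set of standalone entities it currently references
    problems = []
    for activity_type, entity, refs in events:
        if activity_type == "Delete":
            referrers = sorted(e for e, r in live.items() if entity in r and e != entity)
            if referrers:
                problems.append(("still-referenced", entity, referrers))
            live.pop(entity, None)
            continue
        missing = sorted(r for r in refs if r not in live and r != entity)
        if missing:
            problems.append(("not-yet-added", entity, missing))
        live[entity] = set(refs)
    return problems
```

For instance, creating a dataset that points to a distribution before that distribution has its own Create activity is reported as not-yet-added.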

Note: Any dcat:CatalogRecord entities can be provided as part of the dcat:Dataset entity. Alternatively, and perhaps more appropriately, the event itself could be seen as a dcat:CatalogRecord with modification date and other useful information.

1.1.1. Standalone entities

The main entity types are identified based on their class:

Note: Although it is not recommended, a standalone entity that can be part of exactly one other standalone entity can optionally be included in that parent entity instead. It then becomes an embedded entity and may be a blank node. This allows, for example, embedding a dcat:Distribution.

Standalone entities MUST be named nodes and cannot be blank nodes. A harvester SHOULD use this IRI to know where to upsert entities.

Note: Double typing entities is not explicitly disallowed. It is thus possible for something to be both a dcat:Catalog and a dcat:DataService, for example.

1.1.2. Embedded entities

The embedded entity types are identified based on their class:

1.1.3. Referenced entities

The referenced entity types are identified based on the properties that point to them:

1.2. Retention policies

Without further explanation, a server publishing a Linked Data Event Stream (LDES) is considered to keep the full history of all elements. In DCAT-AP Feeds, harvesters are generally not interested in the full history. Therefore we recommend keeping only the latest activity (the create, update, or delete) about an entity in the feed, while transparently indicating this retention policy.

Note: It may also be possible that the data catalog does not keep track of the deleted entities. In this case, it will be impossible to provide the delete activities. While it is not recommended, we will propose an implicit remove retention policy in the LDES specification. This is currently not supported though.

Note: Having to keep delete activities indefinitely will also become difficult over a long period of time. Therefore a third retention policy is foreseen that states that deletions are not kept in the feed after a certain period of time. This is also not supported at this time, but has been proposed to the LDES specification.

1.2.1. LatestVersionSubset with deletions

By adding a latest version subset retention policy, the feed indicates that only the latest activity about each object is kept.

<> ldes:retentionPolicy [
        a ldes:LatestVersionSubset ;
        ldes:amount 1    
    ] .

Or in JSON-LD:

{
    "@context" : {
      "ldes": "https://w3id.org/ldes#",
      "tree": "https://w3id.org/tree#",
      "as": "https://www.w3.org/ns/activitystreams#",
      "dct": "http://purl.org/dc/terms/",
      "xsd":"http://www.w3.org/2001/XMLSchema#",
      "EventStream" : "ldes:EventStream",
      "shape": { "@id": "tree:shape", "@type": "@id"},
      "title": "dct:title",
      "timestampPath": { "@id": "ldes:timestampPath", "@type": "@vocab"},
      "versionOfPath": { "@id": "ldes:versionOfPath", "@type": "@vocab"},
      "view": "tree:view",
      "member": "tree:member",
      "Create": "as:Create",
      "Delete": "as:Delete",
      "Update": "as:Update",
      "published": "as:published",
      "object": "as:object"
    },
    "@id": "#Feed",
    "@type": "EventStream",
    "timestampPath": "published",
    "versionOfPath": "object",
    "shape": "https://semiceu.github.io/LDES-DCAT-AP-feeds/shape.ttl#ActivityShape",
    "view": {
      "@id": "",
      "ldes:retentionPolicy": {
        "@type": "ldes:LatestVersionSubset",
        "ldes:amount": 1
      }
    }
}
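On the publishing side, the effect of this retention policy can be sketched as a filter over the full list of activities. This is a minimal illustration, assuming members are compacted JSON-LD dictionaries with object and published keys. The function name latest_version_subset is hypothetical, and amount mirrors ldes:amount.

```python
from datetime import datetime


def published_at(member):
    """Parse the member's published value as a datetime."""
    return datetime.fromisoformat(member["published"].replace("Z", "+00:00"))


def latest_version_subset(members, amount=1):
    """Keep only the `amount` most recent activities per object, in chronological order."""
    by_object = {}
    for member in sorted(members, key=published_at):
        by_object.setdefault(member["object"], []).append(member)
    kept = []
    for versions in by_object.values():
        kept.extend(versions[-amount:])
    return sorted(kept, key=published_at)
```

Applied to the two example members above (a Create followed by a Delete of the same dataset), only the Delete activity remains in the feed.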

1.3. Pagination

A DCAT-AP Feed MAY have multiple views. The main view the DCAT-AP Feed MUST publish is a DCAT-AP Feed source. A DCAT-AP Feed source MAY follow any TREE relation structure it desires. A DCAT-AP Feed source SHOULD however use a search tree based on the as:published timestamp. Depending on the number of updates the DCAT-AP Feed is expected to have, one can adjust the granularity. For example, a search tree could be created in which the first level selects the year, the second the month, then the day, and then the hour.

A link to a lower level can be achieved using two relations to the same node, one specifying the lower bound and the other the upper bound, as follows:

@prefix : <https://data.example.org/feed> .
@prefix ldes: <https://w3id.org/ldes#>.
@prefix tree: <https://w3id.org/tree#>.
@prefix as:  <https://www.w3.org/ns/activitystreams#>.
@prefix dcterms: <http://purl.org/dc/terms/>.
@prefix dcat: <http://www.w3.org/ns/dcat#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.


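# Written as a full IRI: in Turtle, ":#Stream" would be read as ":" followed by a comment
<https://data.example.org/feed#Stream> a ldes:EventStream ;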
    tree:member <Dataset1#Event1>, <DataService1#Event1> ;
    ldes:timestampPath as:published ;
    ldes:versionOfPath as:object ;
    tree:view : .

: tree:viewDescription [
        ldes:retentionPolicy [
            a ldes:LatestVersionSubset ;
            ldes:amount 1
        ]
    ] ;
    # Recommended: multiple pages in a chronological search-tree fragmentation
    tree:relation [
        a tree:GreaterThanOrEqualToRelation ;
        tree:path as:published ;
        tree:value "2020-01-01T00:00:00Z"^^xsd:dateTime ;
        tree:node :2020
    ] ,
    [
        a tree:LessThanRelation ;
        tree:path as:published ;
        tree:value  "2021-01-01T00:00:00Z"^^xsd:dateTime ;
        tree:node :2020
    ]
    #... More relations
    .
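A harvester that only needs members published at or after a given moment can use these relations to skip whole subtrees. The sketch below assumes the relations of a page have already been parsed into dictionaries with type, value and node keys. That layout and the function name nodes_to_follow are hypothetical. It also relies on the TREE convention that several relations to the same node all apply together.

```python
from datetime import datetime


def parse(timestamp):
    """Parse an xsd:dateTime lexical value such as 2020-01-01T00:00:00Z."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def nodes_to_follow(relations, since):
    """Return the nodes that may contain members published at or after `since`."""
    candidates, prunable = [], set()
    for relation in relations:
        node = relation["node"]
        if node not in candidates:
            candidates.append(node)
        # Every member behind a LessThanRelation is published before its value,
        # so the node can be skipped when that value is not later than `since`.
        if relation["type"] == "LessThanRelation" and parse(relation["value"]) <= since:
            prunable.add(node)
    return [node for node in candidates if node not in prunable]
```

Existing LDES clients implement this kind of pruning, so a harvester would normally not need to write it by hand.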

A DCAT-AP Feed view SHOULD provide an accurate Cache-Control header for every page. If the page can still change (such as the root and the pages on the right of the search tree), an ETag header SHOULD be provided and conditional requests SHOULD be supported. For pages that will no longer change, a Cache-Control: public, max-age=604800, immutable header SHOULD be set.
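A server could select these headers per page roughly as follows. This is a minimal sketch: the function names are hypothetical, the ETag is derived from a hash of the response body as one possible choice, and the max-age=60 for mutable pages is an assumed value that this specification does not prescribe.

```python
import hashlib


def cache_headers(body: bytes, can_still_change: bool) -> dict:
    """Pick caching headers for one page of the feed."""
    if can_still_change:
        # Assumption: a short max-age plus an ETag for conditional requests
        etag = '"' + hashlib.sha256(body).hexdigest()[:16] + '"'
        return {"Cache-Control": "public, max-age=60", "ETag": etag}
    # Pages that will no longer change get the header recommended above
    return {"Cache-Control": "public, max-age=604800, immutable"}


def is_not_modified(request_headers: dict, etag: str) -> bool:
    """True when the client's If-None-Match matches, so a 304 can be returned."""
    return request_headers.get("If-None-Match") == etag
```

A production server would also handle weak validators and lists of entity tags in If-None-Match, which this sketch leaves out.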

2. Publishing a harvester’s event log

A DCAT-AP Feeds harvester consumes one or more DCAT-AP Feeds. In order to do so, it SHOULD use an LDES-compliant client. The harvester can rely on the members emitted by such an LDES client validating against the official SHACL shape. The payload of an update is contained within the named graph that has the same IRI as the member.

A harvester SHOULD publish the status of their logging on a page.

Note: Currently there is no further text on what this status log should look like or how it should be described. We are waiting for consensus on this in the general LDES specification that should be a topic in the SEMIC LDES standardization activity starting September 2024.

Conformance

Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in RFC 2119. However, for readability, these words do not appear in all uppercase letters in this specification.

All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. [RFC2119]

Examples in this specification are introduced with the words “for example” or are set apart from the normative text with class="example", like this:

This is an example of an informative example.

Informative notes begin with the word “Note” and are set apart from the normative text with class="note", like this:

Note, this is an informative note.

References

Normative References

[ACTIVITYPUB]
Christopher Webber; Jessica Tallon. ActivityPub. URL: https://w3c.github.io/activitypub/
[ACTIVITYSTREAMS-VOCABULARY]
James Snell; Evan Prodromou. Activity Vocabulary. URL: https://w3c.github.io/activitystreams/vocabulary/
[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://datatracker.ietf.org/doc/html/rfc2119