Data shape conventions

Data shape language choice

Title: Data shape language choice

Identifier DSC-R1

Statement:

The data shapes shall be expressed in SHACL.

Description:

Various languages allow data shapes to be expressed and thereby define validation rules for RDF data. Among the executable ones, the best known are SHACL [shacl] and ShEx [shex]. We are aware that a strong community of practice exists for ShEx. However, as SHACL is a W3C standard, we strongly recommend using it.

Loose versus rigid constraints

Title: Loose versus rigid constraints

Identifier DSC-R2

Statement:

In a Core Vocabulary, the data shape constraints shall be as loose as possible, i.e., permissive, while in an Application Profile, the data shape constraints shall be as rigid as necessary, i.e., restrictive.

Description:

The Core Vocabularies are intended for broad reuse and are concerned with establishing the concepts. This is addressed by the ontology artefact. Optionally, the ontology may be accompanied by a minimal set of data shape constraints expressing general domain limitations. For example, the Core Vocabulary may already define that a class cannot be instantiated without certain mandatory properties or that a declared property shall be used only with values from a pre-established controlled vocabulary. If a data shape artefact is provided, then any Application Profile reusing the Core Vocabulary shall import not only the ontology but the data shapes as well.

The Application Profiles are usually intended for narrower use cases and are mostly concerned with restricting the way the reused vocabulary ought to be used. It is good practice to make such restrictions as restrictive as necessary while still leaving room for the variation that typically occurs in practice. This is even more important when an Application Profile is conceived as reusable in other Application Profiles.

Examples:

Consider DCAT-AP, for example. It is an Application Profile on top of the DCAT vocabulary, but other, more specific Application Profiles are still created on top of it (e.g., DCAT-AP CH, DCAT-AP DE, DCAT-AP NO).
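The contrast between a permissive Core Vocabulary shape and a more restrictive Application Profile shape can be sketched in SHACL as follows. This is a minimal illustration only: the cv:, cvs: and aps: namespaces, the class cv:Person, the properties cv:name and cv:gender, and the gender code IRIs are all hypothetical and do not come from any published vocabulary.

```turtle
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix cv:  <http://example.org/core-vocabulary#> .             # hypothetical ontology namespace
@prefix cvs: <http://example.org/core-vocabulary-shape#> .       # hypothetical core shapes namespace
@prefix aps: <http://example.org/application-profile-shape#> .   # hypothetical profile shapes namespace

# Core Vocabulary shape: only a general domain limitation (a person has a name).
cvs:PersonShape a sh:NodeShape ;
    sh:targetClass cv:Person ;
    sh:property [
        sh:path cv:name ;
        sh:minCount 1 ;
    ] .

# Application Profile shapes graph: imports the core shapes, then adds restrictions.
<http://example.org/application-profile-shape> a owl:Ontology ;
    owl:imports <http://example.org/core-vocabulary-shape> .

aps:PersonShape a sh:NodeShape ;
    sh:targetClass cv:Person ;
    sh:property [
        sh:path cv:name ;
        sh:datatype xsd:string ;
        sh:maxCount 1 ;
    ] ;
    sh:property [
        sh:path cv:gender ;
        sh:maxCount 1 ;
        # hypothetical controlled vocabulary, enumerated explicitly
        sh:in ( <http://example.org/code/gender/F>
                <http://example.org/code/gender/M>
                <http://example.org/code/gender/X> ) ;
    ] .
```

In this sketch the profile tightens the core constraints without contradicting them, so a more specific profile built on top of it could, in turn, import both shapes graphs and add further restrictions of its own.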

Open and Closed world assumptions

Title: Open and Closed world assumptions

Identifier DSC-R3

Statement:

The data instantiating a core model may be fragmented across information systems. For representation purposes, the open-world assumption shall be adopted, and for validation purposes, the closed-world assumption shall be adopted.

Description:

Operations on the data, such as automated reasoning and validation, are impacted by what is assumed about its completeness. If the data is considered complete, then we say that we operate under the closed-world assumption; otherwise, as in the case of continually evolving and distributed data (which is more in line with reality), the data is considered possibly incomplete, and we say that we operate under the open-world assumption.

Under the closed-world assumption, it is presumed that a true statement is also known to be true. Therefore, conversely, what is not currently known to be true is false [reiter81]. The opposite of the closed-world assumption is the open-world assumption, which states that lack of knowledge does not imply falsity: a statement may be true regardless of whether it is known to be true.

This means that when validating, the data must be considered complete to assess whether it fulfils the imposed constraints or recommended shapes. Therefore, everything that is not known to be true must be considered false.

This may have an impact, especially for larger vocabularies (such as the eProcurement ontology), on how the data shapes are organised, as data shapes may be used to suggest how the data may be fragmented and how it shall not.
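The difference between the two assumptions can be illustrated with a small sketch. The ex:, exs: and exd: namespaces, the class ex:Organisation and the property ex:legalName are hypothetical, and the shapes graph and the data graph are shown in one listing only for brevity.

```turtle
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix ex:  <http://example.org/ontology#> .   # hypothetical ontology namespace
@prefix exs: <http://example.org/shape#> .      # hypothetical shapes namespace
@prefix exd: <http://example.org/data#> .       # hypothetical data namespace

# Shapes graph: every organisation must have at least one legal name.
exs:OrganisationShape a sh:NodeShape ;
    sh:targetClass ex:Organisation ;
    sh:property [
        sh:path ex:legalName ;
        sh:minCount 1 ;
    ] .

# Data graph: the fragment held by one information system.
# The legal name may exist in another system, but it is absent here.
exd:org1 a ex:Organisation .
```

When this data graph is validated, exd:org1 is reported as violating sh:minCount, because the missing ex:legalName is treated as non-existent. Under the open-world assumption, a comparable OWL minimum cardinality axiom would not make the same data inconsistent: a reasoner would merely conclude that a legal name exists but is not yet known. The validated graph therefore needs to contain everything the shapes expect to find together.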

Shape definitions

Title: Shape definitions

Identifier DSC-R4

Statement:

Each ontology class shall be mirrored by a sh:NodeShape. The property constraints may be embedded (contextualised) within node shapes or defined as freestanding.

Description:

When generating SHACL shapes, it is a good practice to create a node shape for each class in the Conceptual Model and, within it, a property shape for each property of that class. If the property shapes are defined within the node shapes (rather than freestanding), then we mirror the structure depicted in the Conceptual Model. Doing so is intuitive for the readers, the resulting constraint definitions are organically integrated, and the maintenance costs are decreased.

However, sometimes it may be useful to define the property shapes as freestanding, thereby making them reusable independently of the node shape context. Freestanding property shapes are universal constraints on the property: they apply to all usages, in all possible data exchanges. Therefore, their use should be limited to those cases that support the broadest reuse context and do not encode a specific application usage context. As such, freestanding property shapes play more of a role in validating the correct usage of the property in an application context than in validating a correct data exchange. The above considerations do not exclude that property shapes used within a node shape could be named with a unique URI, the reuse of which can facilitate a coherent network of constraints.
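The two styles can be sketched as follows, assuming hypothetical ex: and exs: namespaces, a hypothetical class ex:Person, and hypothetical properties ex:birthDate and ex:identifier. The first property shape is embedded in its node shape but named with its own URI; the second is freestanding and declares its own target.

```turtle
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/ontology#> .   # hypothetical ontology namespace
@prefix exs: <http://example.org/shape#> .      # hypothetical shapes namespace

# Node shape mirroring the class, with a property shape attached to it.
exs:PersonShape a sh:NodeShape ;
    sh:targetClass ex:Person ;
    sh:property exs:Person-birthDate .

# Contextualised property shape: it has no target of its own, so it is
# only evaluated for the focus nodes of the node shapes that reference it.
exs:Person-birthDate a sh:PropertyShape ;
    sh:path ex:birthDate ;
    sh:datatype xsd:date ;
    sh:maxCount 1 .

# Freestanding property shape: it targets every subject that uses
# ex:identifier, whatever the class of that subject.
exs:identifierShape a sh:PropertyShape ;
    sh:targetSubjectsOf ex:identifier ;
    sh:path ex:identifier ;
    sh:nodeKind sh:Literal .
```

Because exs:identifierShape applies wherever ex:identifier is used, any constraint placed in it should hold in every context of use; anything more specific belongs in a contextualised property shape such as exs:Person-birthDate.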

The shapes shall be created in a separate namespace in order to allow dereferencing to either OWL 2 or SHACL definitions.

In some communities, the URIs used in the OWL 2 specifications are reused in SHACL: the class definitions are extended as node shapes. In SHACL semantics, this is interpreted as an implicit class target, which is equivalent to specifying sh:targetClass with the class itself as the value. This approach allows the use of a single reference in both the ontology and the data shape definition, yet specifies separate concerns in separate artefacts. Its advantage is increased consistency and readability, together with lower maintenance costs. However, this approach does not allow the OWL 2 and SHACL representations to be dereferenced separately. Hence, it is recommended to avoid this practice.
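The two approaches can be compared in a short sketch. The ex: and exs: namespaces, the classes ex:Person and ex:Organisation, and the property ex:name are hypothetical.

```turtle
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/ontology#> .   # hypothetical ontology namespace
@prefix exs:  <http://example.org/shape#> .      # hypothetical shapes namespace

# Recommended: the shape has its own URI in a separate namespace and
# points to the class explicitly.
exs:PersonShape a sh:NodeShape ;
    sh:targetClass ex:Person ;
    sh:property [
        sh:path ex:name ;
        sh:minCount 1 ;
    ] .

# Discouraged: the class URI doubles as the node shape. SHACL treats a
# resource typed as both rdfs:Class and sh:NodeShape as an implicit
# class target, so no sh:targetClass statement is needed.
ex:Organisation a rdfs:Class, sh:NodeShape ;
    sh:property [
        sh:path ex:name ;
        sh:minCount 1 ;
    ] .
```

In the first case, ex:Person can dereference to the OWL 2 definition and exs:PersonShape to the SHACL definition. In the second case, the single URI ex:Organisation has to serve both purposes.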

Data shape severity

Title: Data shape severity

Identifier DSC-R5

Statement:

It is recommended to use constraint severity for distinguishing critical violations from non-critical recommendation warnings and from nice-to-have information.

Description:

Data shapes can be assigned a level of severity. The specific values of severity have no impact on the validation but may be used by user interface tools to categorise validation results. The severity levels are:

Info - A non-critical constraint violation indicating an informative message.
Warning - A non-critical constraint violation indicating a warning.
Violation - A constraint violation.
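As an illustration, the three levels might be assigned as in the sketch below. The ex: and exs: namespaces, the class ex:Person, the properties ex:name, ex:birthDate and ex:homepage, and the message texts are hypothetical. In SHACL the levels are written sh:Info, sh:Warning and sh:Violation, and a shape without an explicit sh:severity defaults to sh:Violation.

```turtle
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix ex:  <http://example.org/ontology#> .   # hypothetical ontology namespace
@prefix exs: <http://example.org/shape#> .      # hypothetical shapes namespace

exs:PersonShape a sh:NodeShape ;
    sh:targetClass ex:Person ;
    # Critical: no sh:severity is given, so the default sh:Violation applies.
    sh:property [
        sh:path ex:name ;
        sh:minCount 1 ;
    ] ;
    # Recommended: a missing value is reported as a warning.
    sh:property [
        sh:path ex:birthDate ;
        sh:minCount 1 ;
        sh:severity sh:Warning ;
        sh:message "A birth date is recommended."@en ;
    ] ;
    # Nice to have: a missing value is reported as information only.
    sh:property [
        sh:path ex:homepage ;
        sh:minCount 1 ;
        sh:severity sh:Info ;
        sh:message "A homepage may be provided."@en ;
    ] .
```

A validator still reports all three kinds of results in the validation report. The severity value only labels each result, so it is up to the consuming tool to decide, for example, that warnings and informative messages do not block a data exchange.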