Conceptual framework
This section delves into the conceptual framework of semantic data specifications. Understanding this framework allows stakeholders to use the semantic data specifications effectively and to align expectations and practices, ensuring consistent and effective communication.
The section is organised into several subsections, each focusing on a different element of semantic data specifications. It begins with a broad overview of the specifications and establishes what the artefacts are. It then progressively narrows down to specific artefact types, namely data models and documentation. Further subsections explore how data models interact across the layers of data interoperability, and the types of semantic data specification. This sequential approach helps readers build a comprehensive understanding, from general concepts to specific explanations.
Semantic data specifications
Semantic data specifications are composite standards designed to facilitate data exchange and interoperability among diverse systems, and they are both descriptive and prescriptive in nature. These specifications are realised through a suite of interrelated, mutually consistent artefacts that address different interoperability scopes and use cases, ranging from semantic to technical concerns. The artefacts are designed to be both machine-readable and human-understandable, ensuring consistent interpretation and use.
Figure 1 depicts a conceptualisation of how various components that make up a complete semantic data specification interconnect. At the top of the diagram is the "Semantic data specification" indicating its overarching role. It serves a "Purpose/Goal" which frames various specific "Concern/Need". The semantic data specification comprises various "Artefacts" denoting the different elements that make up the specification.
Figure 1
Beneath this, the framework branches into two main types of artefacts: "Data models" and "Documentation". The most relevant data models are "Vocabulary", "Ontology", "Data shape" and "UML Class model". Each data model is expressed in a "Modelling language" appropriate for the concern or the need addressed. The next section introduces the relevant artefact types.
Artefacts
Integral to semantic data specifications is the intrinsic consistency and coherence among the artefacts. Each represents a facet of the same domain knowledge, but is tailored to address specific concerns—such as human understandability, semantic underpinning, formal definition, and data serialisation (addressed in the next section). This alignment ensures that each artefact, while distinct in function, contributes to a unified view of the domain, making the entire specification accessible and actionable. Such consistency is pivotal in maintaining semantic integrity, leading to robust technical interoperability and seamless information exchange.
The artefacts are harmonised yet distinct: each represents a facet of the same domain and is created to address a specific concern, such as:
-
Semantic Underpinning: The semantic data specification needs to encapsulate formally the domain knowledge, capturing the essence of its concepts and the possible relationships between them. Ontologies play a key role here, offering a structured and logical framework that lays out the domain knowledge in a way that is both comprehensive and actionable.
-
Formal Definition: Using formal languages such as OWL or RDFS for semantic representation enables precise interpretation and inference over the ontologies and instance data. Moreover, data shapes facilitate data structuring on top of the ontology, defining precise constraints and conditions under which the data can be instantiated. This formalisation ensures that the data adheres to the standard, facilitating automated validation and processing.
-
Human Understandability: This aspect ensures that individuals, regardless of their technical expertise, can comprehend and engage with the semantic data specifications. The reference documentation, along with visual representations such as UML class diagrams, provides clarity and guidance so that human users can grasp the meaning of the semantic data specification and its intended use.
-
Visual Representation: The semantic data specification is much easier to understand once it is presented in a visual format. Typically, class diagrams are the most suitable way to capture the concepts and the relations between them, significantly boosting comprehension.
-
Data Serialisation: The technical artefacts of the specification, such as information exchange data models for various serialisation formats (e.g., JSON-LD, XML), ensure that the data can be correctly serialised, deserialised, and exchanged across systems and platforms. They cater for the technical requirements of data transport (and storage).
The coherence among these artefacts ensures that despite their different purposes and audiences, they all align in their representation of the domain knowledge. This alignment guarantees that whether a stakeholder is interpreting the model conceptually, engaging with it through documentation, or implementing it technically, they are presented with a unified and consistent view of the semantic data specification. This cohesive approach is pivotal for maintaining semantic integrity across various applications and systems.
Data models
The key data model artefacts of a specification include:
-
Vocabulary: An established list of preferred terms that signify concepts or relationships within a domain of discourse. All terms must have an unambiguous and non-redundant definition. Optionally, it may include synonyms, notes, and translations into multiple languages. It is represented informally, for example, as a spreadsheet or SKOS thesaurus.
-
Ontology: An ontology is a formal, machine-readable specification of a conceptual model [Harpring2016]. It encompasses a representation, formal naming using URIs, and definition of the categories, properties, and relations between the concepts, data, and entities that substantiate one, many, or all domains of discourse, effectively enabling a shared and common understanding of data [wiki-onto]. It is usually expressed in OWL and RDFS.
-
Data Shape: Constraints or patterns that describe how instantiations of an ontology should be structured. Data shape artefacts can be used not only to ensure that RDF data adheres to predefined structure and validation rules, but also as a blueprint for information exchange data models, preserving semantics and ensuring consistency in data exchange. It is usually expressed in SHACL.
-
Controlled list (of values): A value vocabulary used to express concepts or values in instance data. It defines resources (such as instances of topics, languages, countries, or authors) that are used as values for elements in metadata records. Typically, the value vocabularies serve as reference data and constitute "building blocks" with which metadata records can be populated.
-
UML class model: A static structure UML model, with associated diagrams, that describes the structure of data by showing the classes, their attributes, and the relationships among objects. It may include documentation, descriptions and various annotations. Such a data model shall conform to the SEMIC Style Guide in order to be fit for purpose in semantic data specifications.
-
Human-readable documentation: This artefact elucidates the specifications for stakeholders of varying business and technical backgrounds, detailing the structure, intent, and practical application of the semantic data specification.
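To make the role of a data shape and a controlled list more concrete, here is a minimal sketch in plain Python. It is not SHACL and uses no RDF library. The shape, the property names (`title`, `language`) and the list of language codes are all hypothetical; they only illustrate how constraints such as cardinality, datatype and allowed values might be checked against instance data.

```python
# Illustrative only: a hand-rolled stand-in for a data shape, not SHACL.
# All names below (properties, codes) are hypothetical.

# A controlled list: the allowed values for the "language" property.
LANGUAGE_CODES = {"en", "fr", "de"}

# A "shape": per property, the minimum count, expected type and allowed values.
DATASET_SHAPE = {
    "title": {"min_count": 1, "datatype": str, "allowed": None},
    "language": {"min_count": 1, "datatype": str, "allowed": LANGUAGE_CODES},
}


def validate(record, shape):
    """Return a list of human-readable violations (empty if the record conforms)."""
    violations = []
    for prop, rule in shape.items():
        values = record.get(prop, [])
        if len(values) < rule["min_count"]:
            violations.append(f"{prop}: expected at least {rule['min_count']} value(s)")
        for value in values:
            if not isinstance(value, rule["datatype"]):
                violations.append(f"{prop}: {value!r} has the wrong datatype")
            elif rule["allowed"] is not None and value not in rule["allowed"]:
                violations.append(f"{prop}: {value!r} is not in the controlled list")
    return violations


conforming = {"title": ["Air quality 2023"], "language": ["en"]}
broken = {"title": [], "language": ["xx"]}

print(validate(conforming, DATASET_SHAPE))
print(validate(broken, DATASET_SHAPE))
```

In a real specification these constraints would normally be written declaratively in a shape language and evaluated by a validator, rather than coded by hand as above.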
Beyond these foundational elements, semantic data specifications may also incorporate artefacts designed for the technical interoperability layer called information exchange data models. They define and describe in a technology-specific manner the structure and content of information that is exchanged between organisations in a specific information exchange context. They detail the syntax, structure, data types, and constraints necessary for effective data communication between systems. These artefacts are necessary to realise the technical interoperability. If the chosen technology for exchange is of semantic nature (e.g. RDF), then a perfect syntax-semantics conflation is readily available through ontologies and data shapes. Otherwise, if a more traditional technology is selected due to popularity or legacy reasons, such as XML/XSD or JSON, then a mapping that acts as a syntax-semantics interface needs to be established, binding the physical model and the semantic specification. These can include various information exchange data models like:
-
JSON-LD context definitions: Facilitating the mapping of JSON to RDF linked data representation.
-
XML Schemas (XSD): Defining the structure and validating the constraints of XML documents.
-
API (Component) Specifications (REST, WSDL or GraphQL): Outlining the request, response and parameters for web-based data access and manipulation [swagger]. Such components are generally reusable building blocks that ease the maintenance of APIs.
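As a rough illustration of the syntax-semantics interface mentioned above, the sketch below shows how a JSON-LD context can bind plain JSON keys to IRIs. It is a simplified, hand-written term lookup in Python, not a JSON-LD processor. The payload and the choice of Dublin Core Terms IRIs are assumptions made for the example.

```python
import json

# Hypothetical JSON-LD context: maps short JSON keys to IRIs.
CONTEXT = {
    "title": "http://purl.org/dc/terms/title",
    "publisher": "http://purl.org/dc/terms/publisher",
}

# A plain JSON payload as a legacy system might exchange it (made-up data).
payload = json.loads('{"title": "Air quality 2023", "publisher": "Example Agency"}')


def expand_keys(record, context):
    """Replace each key that the context defines with its IRI.

    Keys without a mapping are dropped, which loosely mirrors how a JSON-LD
    processor ignores terms that the context does not define.
    """
    return {context[key]: value for key, value in record.items() if key in context}


expanded = expand_keys(payload, CONTEXT)
print(json.dumps(expanded, indent=2))
```

The point of the sketch is that the exchanged JSON keeps its familiar shape, while the context carries the link to the semantic layer.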
Artefacts across interoperability layers
In the framework depicted [below], the artefacts are organised within the semantic and technical interoperability layers, with each layer focusing on different but complementary aspects of data interoperability. As can be seen, some artefacts belong to the semantic layer and others to the technical layer, while the data shapes are present in both, in line with their multipurpose nature.
The Semantic Layer encapsulates artefacts associated with the conceptual understanding of data. It is focused on defining the vocabulary and ontology that provide the foundational elements for data interoperability. These artefacts ensure that the meaning of data is clearly defined and shared across different systems, establishing the semantic rules that govern data exchange.
The Technical Layer is concerned with the practical aspects of data handling, such as data representation formats, communication protocols, and interface specifications. Artefacts in this layer address the technical requirements necessary for data to be physically exchanged and processed by information systems.
Documentation and UML Class models are depicted as orthogonal to these layers, as they facilitate human understanding and transcend the semantic-technical divide. These artefacts provide clarity and guidance, helping stakeholders visualise and comprehend the data structures and relationships without being confined to the constraints of either layer.
Data specification types
We can discern three interconnected layers, each representing a different level of abstraction in semantic data specifications. The arrangement signifies the gradation from the abstract to the specific.
The Upper Layer accommodates the most abstract form of semantic data specifications. These specifications are context-free, meaning that they are not tied to any particular domain or application, and can be universally applied across various fields. They provide the broadest concepts, which can be reused in numerous contexts. Here we generally find upper-level ontologies (defining highly abstract foundational concepts such as “object”, “property”, “event” etc.), but also the core semantic data specifications, which, although more specific, can also be applied across multiple domains. The main objective of the Core Vocabularies is to provide terms to be reused in the broadest possible context [sem-sg-wsds].
Upper ontologies and core semantic data specifications serve as a scaffolding for domain ontologies, offering a hierarchy where the more general terms of the upper ontology act as superclasses (in some cases even as metaclasses) to the more specific classes of domain ontologies. This arrangement supports the structuring and integration of knowledge by providing common reference points that enhance understanding and data processing across different systems.
Notable examples:
-
Upper ontologies: DOLCE, Gist, BFO, etc.
-
Core Vocabularies: Dublin Core Terms, Data Catalog Vocabulary (DCAT), The Organization Ontology (ORG), European Legislation Identifier (ELI)
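The scaffolding role of upper ontologies described above can be sketched as a small subclass hierarchy. The following Python fragment is only an illustration. The class names (`Event`, `Procedure`, `ProcurementProcedure`) are hypothetical and not taken from any of the ontologies listed, and the dictionary stands in for subclass statements that an ontology would express formally.

```python
# Hypothetical subclass statements: child class -> parent class.
# "Event" plays the role of an upper-ontology class; the others are
# imagined domain-level classes.
SUBCLASS_OF = {
    "Procedure": "Event",
    "ProcurementProcedure": "Procedure",
}


def superclasses(cls, subclass_of):
    """Return all ancestors of cls, nearest first, by following the chain."""
    ancestors = []
    while cls in subclass_of:
        cls = subclass_of[cls]
        ancestors.append(cls)
    return ancestors


def is_a(cls, candidate, subclass_of):
    """True if cls is the candidate class or one of its (transitive) subclasses."""
    return cls == candidate or candidate in superclasses(cls, subclass_of)


print(superclasses("ProcurementProcedure", SUBCLASS_OF))
print(is_a("ProcurementProcedure", "Event", SUBCLASS_OF))
```

Because every domain class ultimately resolves to a shared upper-level class, two systems that use different domain classes can still agree on the common reference point.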
The Domain Layer sits at the intersection of the upper and application layers. It contains specifications that are more specific than the upper layer, but not as narrowly focused as the application layer. The semantic data specifications in this layer incorporate concepts relevant to a domain or sector (e.g. the justice domain, the public procurement domain, the healthcare domain) and represent the most specific knowledge from the perspective of that domain.
The domain layer is visually overlapped by both the upper and application layers, symbolising that some domain-specific semantic data specifications can inherit traits from, or lend characteristics to, both the more abstract upper layer and the more concrete application layer.
Notable examples:
-
DCAT-AP
-
eProcurement Ontology
The Application Layer is the most concrete and context-specific, containing semantic data specifications tailored for particular applications or families of applications. Application Profiles are detailed in constraints and data shapes, addressing explicit needs and constraints of a specific system or use case, and generally provide precise technical artefacts that can be used in data exchange.
Notable examples:
-
GeoDCAT-AP
-
BRegDCAT-AP
-
Stat-DCAT-AP
Terminological Clarification: The level of abstraction of a semantic data specification (core, domain, or application) can be applied as an adjective to describe its constituent artefacts. Thus, for a "core semantic data specification", the included components would be referred to as "core vocabulary", "core ontology", "core data shape", "core exchange data model", and so on. Similarly, for a "domain semantic data specification", the elements would be denoted as "domain vocabulary", "domain ontology", "domain data shape" and "domain exchange data model", respectively.
Documentation
In the semantic data specification framework depicted [below], the documentation artefacts are organised into three distinct types, as illustrated in Figure 2, each catering to different aspects of user engagement with the data model. For effective documentation practices, we recommend the principles laid out in the Diátaxis framework, which is a systematic approach to understanding the needs of documentation users [dtx]. It identifies four distinct needs (learning, achieving goals, consulting reference, understanding) and four corresponding forms of documentation: tutorials, handbooks, reference documentation and textbooks. It places them in a systematic relationship, and proposes that documentation should itself be organised around the structures of those needs.
In addition, we mention diagrams, which are usually embedded in documents, to underline the importance of the visual depiction of models and to recognise them as artefacts distinct from the models themselves (e.g. UML class diagrams). In the context of semantic data specifications, the following kinds of documentation are relevant.
Figure 2
The Handbook (or usage manual) is a how-to guide and serves as introductory reading for users new to the semantic data specification. It can also take the form of a tutorial for achieving predefined goals. It typically comprises use-case descriptions, examples and practical, step-by-step instructions designed to help users acquire the skills needed to use the semantic data specification effectively.
Examples: This document
The Textbook (or explanatory manual) is an explanatory type of documentation and focuses on deepening the users’ understanding of the underlying concepts and principles incorporated into the semantic data specification. It aims to inform cognition, enhancing the user’s theoretical knowledge and conceptual insight, which is critical for those looking to gain a more profound grasp of the specification’s rationale, decisions, strengths and limitations.
Examples: SEMIC Style Guide [sem-sg]
The Reference document is a technical type of documentation and provides concise, detailed information about various elements of the semantic data specification. It serves users who are already familiar with the theoretical framework and need to apply their knowledge to specific tasks. This artefact is a go-to resource for factual and objective data about the semantic specifications, such as semantics, syntax, entities, properties, relationships, and constraints within the data model.
These documentation artefacts are designed to collectively support the user’s journey from novice to expert within the semantic data specification domain. The Handbook aids initial skill acquisition, the Textbook supports deeper learning and understanding, and the Reference document acts as a reliable resource for informed application and use. Together, they ensure that users at different stages of learning and practice have access to the appropriate materials to meet their needs.