The SEMIC Style Guide for Semantic Engineers
1. Core Vocabularies Handbook
This Handbook aims to explain the role of Core Vocabularies in enabling semantic interoperability at the EU level and to practically guide public administrations in using Core Vocabularies to achieve this goal. This section introduces what interoperability is, what makes it semantic, and how Core Vocabularies contribute to it.
1.1. SEMIC Core Vocabularies
This section contains a brief overview of the Core Vocabularies, indicating how they were developed and how they are maintained.
Since 2011, the European Commission has facilitated international working groups to forge consensus on and maintain the SEMIC Core Vocabularies. A short description of these vocabularies is included in the table below. The latest release of the Core Vocabularies can be retrieved via the SEMIC Support Center [semic] or directly from the GitHub repository [semic-gh].
Vocabulary | Description
---|---
Core Person Vocabulary (CPV) | A simplified, reusable and extensible data model that captures the fundamental characteristics of a person, e.g. the name, gender, date of birth, and location. This specification enables interoperability among registers and any other ICT-based solutions exchanging and processing person-related information.
Core Business Vocabulary (CBV) | A simplified, reusable and extensible data model that captures the fundamental characteristics of a legal entity, e.g. the legal name, activity, and address. It includes a minimal number of classes and properties modelled to capture the typical details recorded by business registers, and facilitates information exchange between business registers despite differences in what they record and publish.
Core Location Vocabulary (CLV) | A simplified, reusable and extensible data model that provides a minimum set of classes and properties for describing a location, represented as an address, a geographic name, or a geometry. This specification enables interoperability among land registers and any other ICT-based solutions exchanging and processing location information.
Core Criterion and Core Evidence Vocabulary (CCCEV) | Supports the exchange of information between organisations that define criteria and organisations that respond to these criteria by means of evidence. The CCCEV addresses specific needs of businesses, public administrations and citizens across the European Union.
Core Public Organisation Vocabulary (CPOV) | Provides a common data model for describing public organisations in the European Union. The CPOV addresses specific needs of businesses, public administrations and citizens across the European Union.
Core Public Event Vocabulary (CPEV) | A simplified, reusable and extensible data model that captures the fundamental characteristics of a public event, e.g. the title, the date, the location, and the organiser. It aspires to become a common data model for describing public events (conferences, summits, etc.) in the European Union, and enables interoperability among registers and any other ICT-based solutions exchanging and processing information related to public events.
1.1.1. Representation formats
The Core Vocabularies are semantic data specifications that are disseminated as the following artefacts:
- lightweight ontology [sem-sg-wio] for vocabulary definition, expressed in OWL [owl2];
- loose data shape specification [sem-sg-wds], expressed in SHACL [shacl];
- human-readable reference documentation [sem-sg-wdsd] in HTML (based on ReSpec [respec]);
- conceptual model specification [sem-sg-wcm], expressed in UML [uml].
1.1.2. Licensing conditions
The Core Vocabularies are published under the CC-BY 4.0 licence [cc-by].
1.1.3. Core Vocabularies lifecycle
The Core Vocabularies have been developed following the ‘Process and methodology for developing Core Vocabularies’ [cv-met], an open change and release management process, supported by SEMIC, that ensures continuous improvement and relevance to evolving user needs.
This process begins with the identification of needs from stakeholders or issues raised in existing implementations. The Working Group members, SEMIC team or community of users propose changes that are thoroughly assessed for their impact and feasibility. Once a change is deemed necessary, it undergoes a drafting phase where the technical details are fleshed out, followed by public consultations to gather wider input and ensure transparency.
Following consultations, the changes are refined and prepared for implementation. This stage may involve further iteration based on feedback or additional insights from ongoing discussions. The finalised changes are then formally approved and documented, ensuring they are well-understood and agreed upon by all relevant parties.
The release management of Core Vocabularies follows a structured timeline that includes pre-announced releases and public consultation periods to allow users to prepare for changes. Each release includes detailed documentation to support implementation, ensuring users can integrate new versions with minimal disruption. This process not only maintains the quality and relevance of the Core Vocabularies, but also supports a dynamic and responsive framework for semantic interoperability within digital public services.
1.1.4. Claiming conformance
Claiming conformance to Core Vocabularies is an integral part of validating (a) how well a new or mapped data model or semantic data specification aligns with the principles and practices established in the SEMIC Style Guide [sem-sg] and (b) to what degree the Core Vocabularies are reused (fully or partially) [sem-sg-reuse]. The conformance assessment is voluntary; when performed, it shall be published as a self-conformance statement. This statement must assert which requirements are met by the data model or semantic specification.
The conformance statement highlights various levels of adherence, ranging from basic implementation to more complex semantic representations. At the basic level, conformance might simply involve ensuring that data usage is consistent with the terms (and structure, but no formal semantics) defined by the Core Vocabularies. Moving to a more advanced level of conformance, data may be easily transformed into formats like RDF or JSON-LD, which are conducive to richer semantic processing and integration. This level of conformance signifies a deeper integration of the Core Vocabularies, facilitating a more robust semantic interoperability across systems. Ultimately, the highest level of conformance is achieved when the data is represented in RDF and fully leverages the semantic capabilities of the Core Vocabularies. This includes using a range of semantic technologies, adhering to the SEMIC Style Guide, fully reusing the Core Vocabularies, and respecting the associated data shapes.
1.2. Conceptual framework
This section delves into the conceptual framework of semantic data specifications. Understanding this framework allows stakeholders to use semantic data specifications effectively and to align expectations and practices, ensuring consistent and effective communication.
The section is methodically organised into several subsections, each focusing on a different element of semantic data specifications. It begins with a broad overview of the specifications and establishes what the artefacts are. It then progressively narrows down to specific artefact types, namely data models and documentation. Further subsections explore how data models interact across the different layers of data interoperability, and the types of semantic data specifications. This sequential approach helps readers build a comprehensive understanding from general concepts to specific explanations.
1.2.1. Semantic data specifications
Semantic data specifications are composite standards designed to facilitate data exchange and interoperability among diverse systems, characterised by their descriptive and prescriptive nature. These specifications are realised through a suite of artefacts that are harmoniously interrelated and address different interoperability scopes and use cases—ranging from semantic to technical concerns. The artefacts are fashioned to be both machine-readable and human-understandable, ensuring consistent interpretation and utilisation.
Figure 1 depicts a conceptualisation of how the various components that make up a complete semantic data specification interconnect. At the top of the diagram is the "Semantic data specification", indicating its overarching role. It serves a "Purpose/Goal" which frames various specific "Concerns/Needs". The semantic data specification comprises various "Artefacts", denoting the different elements that make up the specification.
Figure 1
Beneath this, the framework branches into two main types of artefacts: "Data models" and "Documentation". The most relevant data models are "Vocabulary", "Ontology", "Data shape" and "UML class model". Each data model is expressed in a "Modelling language" appropriate for the concern or need addressed. The next section introduces the relevant artefact types.
1.2.2. Artefacts
Integral to semantic data specifications is the intrinsic consistency and coherence among the artefacts. Each represents a facet of the same domain knowledge, but is tailored to address specific concerns—such as human understandability, semantic underpinning, formal definition, and data serialisation (addressed in the next section). This alignment ensures that each artefact, while distinct in function, contributes to a unified view of the domain, making the entire specification accessible and actionable. Such consistency is pivotal in maintaining semantic integrity, leading to robust technical interoperability and seamless information exchange.
Each artefact, while unique in its form and function, represents different facets of the same domain. They are harmonised, yet distinct, with each created to address specific concerns, such as:
- Semantic Underpinning: The semantic data specification needs to formally encapsulate the domain knowledge, capturing the essence of its concepts and the possible relationships between them. Ontologies play a key role here, offering a structured and logical framework that lays out the domain knowledge in a way that is both comprehensive and actionable.
- Formal Definition: Using formal languages such as OWL or RDFS for semantic representation enables precise interpretation and inference over the ontologies and instance data. Moreover, data shapes facilitate data structuring on top of the ontology, defining precise constraints and conditions under which the data can be instantiated. This formalisation ensures that the data adheres to the standard, facilitating automated validation and processing.
- Human Understandability: This aspect ensures that individuals, regardless of their technical expertise, can comprehend and engage with the semantic data specifications. The reference documentation, along with the visual representation of UML class diagrams, brings clarity and guidance, helping human users grasp the meaning of the semantic data specification and its intended use.
- Visual Representation: The semantic data specification is much easier to understand once it is presented in a visual format. Typically, class diagrams are the most suitable to encapsulate the concepts and the relations between them, significantly boosting comprehension.
- Data Serialisation: The technical artefacts of the specification, such as information exchange data models for various serialisation formats (e.g. JSON-LD, XML), ensure that the data can be correctly serialised, deserialised, and exchanged across systems and platforms. They cater for the technical requirements of data transport (and storage).
The coherence among these artefacts ensures that despite their different purposes and audiences, they all align in their representation of the domain knowledge. This alignment guarantees that whether a stakeholder is interpreting the model conceptually, engaging with it through documentation, or implementing it technically, they are presented with a unified and consistent view of the semantic data specification. This cohesive approach is pivotal for maintaining semantic integrity across various applications and systems.
1.2.3. Data models
The key data model artefacts of a specification include:
- Vocabulary: An established list of preferred terms that signify concepts or relationships within a domain of discourse. All terms must have an unambiguous and non-redundant definition. Optionally, it may include synonyms, notes and translations into multiple languages. It is represented informally, for example, as a spreadsheet or SKOS thesaurus.
- Ontology: A formal, machine-readable specification of a conceptual model [Harpring2016]. It encompasses a representation, formal naming using URIs, and definition of the categories, properties, and relations between the concepts, data, and entities that substantiate one, many, or all domains of discourse, effectively enabling a shared and common understanding of data [wiki-onto]. It is usually expressed in OWL and RDFS.
- Data Shape: Constraints or patterns that describe how instantiations of an ontology should be structured. Data shape artefacts can be used not only to ensure that RDF data adheres to a predefined structure and validation rules, but also as a blueprint for information exchange data models, preserving semantics and ensuring consistency in data exchange. It is usually expressed in SHACL.
- Controlled list (of values): A value vocabulary used to express concepts or values in instance data. It defines resources (such as instances of topics, languages, countries, or authors) that are used as values for elements in metadata records. Typically, value vocabularies serve as reference data and constitute "building blocks" with which metadata records can be populated.
- UML class model: A static-structure UML model and associated diagrams that describe the structure of data by showing the classes, their attributes, and the relationships among objects. It may include documentation, descriptions and various annotations. Such a data model shall conform to the SEMIC Style Guide in order to be fit for the purpose of semantic data specifications.
- Human-readable documentation: This artefact elucidates the specification for stakeholders of varying business and technical backgrounds, detailing the structure, intent, and practical application of the semantic data specification.
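To give a flavour of how a data shape constrains instance data, the following minimal SHACL sketch requires every person to have exactly one family name. The shape IRI and the property name are illustrative assumptions for this example, not terms taken from an actual Core Vocabulary release:

```turtle
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix cv:  <http://data.europa.eu/m8g/> .
@prefix ex:  <http://example.org/shapes#> .

# A node shape targeting all instances of cv:Person
ex:PersonShape a sh:NodeShape ;
    sh:targetClass cv:Person ;
    sh:property [
        sh:path cv:familyName ;   # illustrative property name
        sh:datatype xsd:string ;
        sh:minCount 1 ;           # mandatory
        sh:maxCount 1 ;           # at most one value
    ] .
```

A SHACL validation engine would report a violation for any cv:Person instance that lacks a family name or carries more than one.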
Beyond these foundational elements, semantic data specifications may also incorporate artefacts designed for the technical interoperability layer, called information exchange data models. They define and describe, in a technology-specific manner, the structure and content of information that is exchanged between organisations in a specific information exchange context. They detail the syntax, structure, data types, and constraints necessary for effective data communication between systems, and are necessary to realise technical interoperability. If the chosen technology for exchange is of a semantic nature (e.g. RDF), then a perfect syntax-semantics conflation is readily available through ontologies and data shapes. Otherwise, if a more traditional technology is selected for popularity or legacy reasons, such as XML/XSD or JSON, then a mapping that acts as a syntax-semantics interface needs to be established, binding the physical model and the semantic specification. These can include various information exchange data models such as:
- JSON-LD context definitions: Facilitating the mapping of JSON to an RDF linked data representation.
- XML Schemas (XSD): Defining the structure and validating the constraints of XML documents.
- API (Component) Specifications (REST, WSDL or GraphQL): Outlining the requests, responses and parameters for web-based data access and manipulation [swagger]. Such components are generally reusable blocks that facilitate the reusability and maintenance of APIs.
1.2.4. Artefacts across interoperability layers
In the framework depicted [below], the artefacts are strategically organised within the semantic and technical interoperability layers, with each layer focusing on different but complementary aspects of data interoperability. As can be seen, some artefacts belong to the semantic layer and others to the technical layer, while the data shapes are present in both, in line with their multipurpose nature.
The Semantic Layer encapsulates artefacts associated with the conceptual understanding of data. It is focused on defining the vocabulary and ontology that provide the foundational elements for data interoperability. These artefacts ensure that the meaning of data is clearly defined and shared across different systems, establishing the semantic rules that govern data exchange.
The Technical Layer is concerned with the practical aspects of data handling, such as data representation formats, communication protocols, and interface specifications. Artefacts in this layer address the technical requirements necessary for data to be physically exchanged and processed by information systems.
Documentation and UML Class models are depicted as orthogonal to these layers, as they facilitate human understanding and transcend the semantic-technical divide. These artefacts provide clarity and guidance, helping stakeholders visualise and comprehend the data structures and relationships without being confined to the constraints of either layer.
1.2.5. Data specification types
We can discern three interconnected layers, each representing a different level of abstraction in semantic data specifications. The arrangement signifies the gradation from the abstract to the specific.
The Upper Layer accommodates the most abstract form of semantic data specifications. These specifications are context-free, meaning that they are not tied to any particular domain or application and can be universally applied across various fields. These semantic data specifications provide the broadest concepts that can be reused in numerous contexts. Here we generally find upper-level ontologies (defining highly abstract foundational concepts such as “object”, “property”, “event”, etc.), but also the core semantic data specifications, which, although more specific, can also be applied across multiple domains. The main objective of the Core Vocabularies is to provide terms to be reused in the broadest possible context [sem-sg-wsds].
Upper ontologies and core semantic data specifications serve as a scaffolding for domain ontologies, offering a hierarchy where the more general terms of the upper ontology act as superclasses (in some cases even as metaclasses) to the more specific classes of domain ontologies. This arrangement supports the structuring and integration of knowledge by providing common reference points that enhance understanding and data processing across different systems.
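For example, a domain ontology might anchor one of its classes to a core term along the following lines. This is a sketch: the domain class ex:EconomicOperator and the prefix bindings are hypothetical, introduced only to illustrate the superclass relationship.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix cv:   <http://data.europa.eu/m8g/> .
@prefix ex:   <http://example.org/procurement#> .

# A domain-specific class declares a core class as its superclass,
# inheriting its shared semantics as a common reference point.
ex:EconomicOperator rdfs:subClassOf cv:LegalEntity .
```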
Notable examples:
- Upper ontologies: DOLCE, Gist, BFO, etc.
- Core Vocabularies: Dublin Core Terms, Data Catalog Vocabulary (DCAT), The Organization Ontology (ORG), European Legislation Identifier (ELI)
The Domain Layer sits at the intersection of the upper and application layers. It contains specifications that are more specific than the upper layer, but not as narrowly focused as the application layer. The semantic data specifications in this layer incorporate concepts relevant to a domain or sector (e.g. the justice domain, the public procurement domain, the healthcare domain) and represent the most specific knowledge from the perspective of that domain.
The domain layer is visually overlapped by both the upper and application layers, symbolising that some domain-specific semantic data specifications can inherit traits from, or lend characteristics to, both the more abstract upper layer and the more concrete application layer.
Notable examples:
- DCAT-AP
- eProcurement Ontology
The Application Layer is the most concrete and context-specific, containing semantic data specifications tailored for particular applications or families of applications. Application Profiles provide detailed constraints and data shapes, addressing the explicit needs and constraints of a specific system or use case, and generally provide precise technical artefacts that can be used in data exchange.
Notable examples:
- GeoDCAT-AP
- BRegDCAT-AP
- Stat-DCAT-AP
Terminological Clarification: The level of abstraction pertaining to a semantic data specification—be it core, domain, or application—can be applied as an adjective to describe its constituent artefacts. Thus, for a "core semantic data specification" the included components would be referred to as "core vocabulary", "core ontology", "core data shape" and "core exchange data model" and so on. Similarly, for a "domain semantic data specification," the elements would be denoted as "domain vocabulary", "domain ontology", "domain data shape" and "domain exchange data model", respectively.
1.2.6. Documentation
In the semantic data specification framework depicted [below], the documentation artefacts are organised into three distinct types, as illustrated in Figure 2, each catering to different aspects of user engagement with the data model. For effective documentation practices, we recommend the principles laid out in the Diátaxis framework, which is a systematic approach to understanding the needs of documentation users [dtx]. It identifies four distinct needs (learning, understanding, consulting reference, achieving goals) and four corresponding forms of documentation: tutorials, how-to guides, reference documentation and explanation. It places them in a systematic relationship and proposes that documentation should itself be organised around the structure of those needs.
In addition, we mention diagrams, which are usually embedded into documents, to underline the importance of the visual depiction of models and to recognise them as artefacts distinct from the models themselves (e.g. UML class diagrams). In the context of semantic data specifications, the following documentation kinds are relevant.
Figure 2
The Handbook (or usage manual) is a how-to guide and serves as an introductory reading to users new to the semantic data specification. It can also take the form of a tutorial to achieve predefined goals. It typically comprises use-case descriptions, examples and practical, step-by-step instructions designed to help users acquire the necessary skills to effectively use the semantic data specification.
Examples: This document
The Textbook (or explanatory manual) is an explanatory type of documentation and focuses on deepening the users’ understanding of the underlying concepts and principles incorporated into the semantic data specification. It aims to inform cognition, enhancing the user’s theoretical knowledge and conceptual insight, which is critical for those looking to gain a more profound grasp of the specification’s rationale, decisions, strengths and limitations.
Examples: SEMIC Style Guide [sem-sg]
The Reference document is a technical type of documentation and provides concise, detailed information about various elements of the semantic data specification. It serves users who are already familiar with the theoretical framework and need to apply their knowledge to specific tasks. This artefact is a go-to resource for factual and objective data about the semantic specifications, such as semantics, syntax, entities, properties, relationships, and constraints within the data model.
These documentation artefacts are designed to collectively support the user’s journey from novice to expert within the semantic data specification domain. The Handbook aids in initial skill acquisition, the Textbook supports deeper learning and understanding, and the Reference document acts as a reliable resource for informed application and use. Together, they ensure that users at different stages of learning and practice have access to the appropriate materials to meet their needs.
1.3. Use cases
This handbook serves as a practical guide for using Core Vocabularies in various common situations. To provide clear and actionable insights, we have categorised potential use cases into two groups:
- Primary Use Cases: These are the most common, interesting, and/or challenging scenarios, all thoroughly covered within this handbook.
- Additional Use Cases: These briefly introduce other relevant scenarios but are not elaborated on in detail in this handbook.
Within both groups, we differentiate between use cases focused on the creation of NEW artefacts and those involving the mapping of EXISTING artefacts to Core Vocabularies.
For a better overview, we numbered the use cases and organised them into two diagrams, followed by the description of these use cases in two separate subsections, one dedicated to the addressed use cases and one to the use cases that are not addressed in this handbook.
The use cases provided in this handbook are written in a white-box style, oriented towards user goals, following the classification of [Cockburn99].
We will use the following template to describe the relevant use cases:
Use Case <UC>: Title of the use case
Goal: A succinct sentence describing the goal of the use case
Primary Actor: The primary actor or actors of this use case
Actors: (Optional) Other actors involved in the use case
Description: Short description of the use case providing relevant information for its understanding
Example: An example to illustrate the application of this use case
Note: (Optional) notes about this use case, especially related to its coverage in this handbook
1.3.1. Primary use cases

Use Case UC1: Create a new information exchange data model
Goal: Create a new standalone data schema that uses terms from Core Vocabularies.
Primary Actors: Semantic Engineer, Software Engineer
Description: The goal is to design and create a new data schema or information exchange data model that is not part of a more comprehensive semantic data specification, relying on terms from existing CVs as much as possible.
Note: As this is a more generic use case, it will be broken down into concrete use cases that focus on specific data formats.
Use Case UC1.1: Create a new XSD schema
Goal: Create a new standalone XSD schema that uses terms from Core Vocabularies.
Primary Actors: Semantic Engineer, Software Engineer
Description: The goal is to design and create a new XSD schema that is not part of a more comprehensive semantic data specification, relying on terms from existing CVs as much as possible. As an information exchange data model, an XSD schema can be used to create and validate XML data to be exchanged between information systems.
Example: OOTS XML schema mappings [oots]
Note: A detailed methodology to be applied for this use case is provided in the Create a new XSD schema section.
Use Case UC1.2: Create a new JSON-LD context definition
Goal: Create a new standalone JSON-LD context definition that uses terms from Core Vocabularies.
Primary Actors: Semantic Engineer, Software Engineer
Description: The goal is to design and create a new JSON-LD context definition that is not part of a more comprehensive semantic data specification, relying on terms from existing CVs as much as possible. As an information exchange data model, a JSON-LD context definition can be integrated in describing data, building APIs, and other operations involved in information exchange.
Example: Core Person Vocabulary [cpv-json-ld], Core Business Vocabulary [cbv-json-ld]
Note: A detailed methodology to be applied for this use case is provided in the Create a new JSON-LD context definition section.
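A minimal sketch of what such a context definition might look like is shown below. The term-to-IRI mappings are illustrative assumptions for this example, not the published Core Person context:

```json
{
  "@context": {
    "cv": "http://data.europa.eu/m8g/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "Person": "cv:Person",
    "familyName": "cv:familyName",
    "dateOfBirth": { "@id": "cv:dateOfBirth", "@type": "xsd:date" }
  }
}
```

Plain JSON documents using the keys Person, familyName and dateOfBirth can then be interpreted as RDF by any JSON-LD processor that applies this context.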
Use Case UC2: Map an existing data model to a Core Vocabulary
Goal: Create a mapping of an existing (information exchange) data model to terms from Core Vocabularies.
Primary Actors: Semantic Engineer
Actors: Domain Expert, Software Engineer
Description: The goal is to design and create a mapping of an ontology, vocabulary, or some kind of data schema or information exchange data model that is not part of a more comprehensive semantic data specification, to terms from CVs. Such a mapping can be done at a conceptual level, or formally, e.g. in the form of transformation rules, and most often will include both.
Note: Since this is a more generic use case, it will be broken down into concrete use cases that focus on specific data models and/or data formats. Some of those use cases are described in detail below, while others are included in the Appendix, which is dedicated to the additional use cases.
Use Case UC2.1: Map an existing Ontology to a Core Vocabulary
Goal: Create a mapping between the terms of an existing ontology and the terms of Core Vocabularies.
Primary Actors: Semantic Engineer
Actors: Domain Expert, Business Analyst, Software Engineer
Description: The goal is to create a formal mapping expressed in Semantic Web terminology (for example using the rdfs:subClassOf, rdfs:subPropertyOf, owl:equivalentClass, owl:equivalentProperty and owl:sameAs properties), associating the terms of an existing ontology that defines relevant concepts in a given domain with terms defined in one or more CVs. This activity is usually performed by a semantic engineer based on input received from domain experts and/or business analysts, who can assist with the creation of a conceptual mapping. The result of the formal mapping can later be used by software engineers to build information exchange systems.
Example: Mapping Core Person to Schema.org [map-cp2org], Core Business to Schema.org [map-cb2org], etc.
Note: A detailed methodology to be applied for this use case is provided in the Map an existing Model section.
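Such a formal mapping might look like the following Turtle sketch. The specific alignments shown are illustrative assumptions; the published mappings (e.g. [map-cp2org]) should be consulted for authoritative correspondences.

```turtle
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix cv:     <http://data.europa.eu/m8g/> .
@prefix schema: <https://schema.org/> .

# Illustrative alignments between Core Vocabulary and schema.org terms
cv:Person     owl:equivalentClass  schema:Person .
cv:familyName rdfs:subPropertyOf   schema:familyName .
```

Declaring such triples lets a reasoner infer, for instance, that every cv:Person instance is also a schema:Person.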
Use Case UC2.2: Map an existing XSD Schema to a Core Vocabulary
Goal: Define the data transformation rules that map an existing XSD schema to terms from Core Vocabularies, so that XML data conforming to the schema can be transformed into an RDF representation that conforms to a Core Vocabulary.
Primary Actors: Semantic Engineer
Actors: Domain Expert, Business Analyst, Software Engineer
Description: The goal is to create a formal mapping using Semantic Web technologies (e.g. RML or other languages) to allow automated translation of XML data conforming to a certain XSD schema into RDF data expressed in terms defined in one or more SEMIC Core Vocabularies. This use case requires the definition of an Application Profile for a Core Vocabulary, because the CV alone does not specify sufficient instantiation constraints to be precisely mappable.
Example: ISA2core SAWSDL mapping [isa2-map]
Note: A detailed methodology to be applied for this use case is provided in the Map an existing XSD schema section.
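As a flavour of such transformation rules, the following minimal RML sketch maps Person elements of a hypothetical XML file to cv:Person instances. The file name, XPath expressions and target terms are illustrative assumptions:

```turtle
@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:  <http://semweb.mmlab.be/ns/ql#> .
@prefix cv:  <http://data.europa.eu/m8g/> .

<#PersonMapping> a rr:TriplesMap ;
    rml:logicalSource [
        rml:source "persons.xml" ;             # XML input conforming to the XSD
        rml:referenceFormulation ql:XPath ;
        rml:iterator "/Persons/Person"         # one subject per Person element
    ] ;
    rr:subjectMap [
        rr:template "http://example.org/person/{@id}" ;
        rr:class cv:Person
    ] ;
    rr:predicateObjectMap [
        rr:predicate cv:familyName ;
        rr:objectMap [ rml:reference "FamilyName" ]
    ] .
```

An RML processor executes these rules to emit one cv:Person resource, with its family name, for every matching XML element.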
The additional use cases are described in the Appendix.
1.4. How to create new data models
1.4.1. Create a new XSD schema
This section provides detailed instructions for addressing use case UC1.1.
To create a new XSD schema, the following steps need to be observed:
- Import or define elements
- Shape the XML document structure

Import or define elements
When working with XML schemas, particularly in relation to semantic artefacts like ontologies or data shapes, managing imports and namespaces is a vital consideration that ensures clarity, reusability, and proper integration of the various data models.
When a core vocabulary has defined an associated XSD schema, it is not only easy but also advisable to directly import this schema using the xsd:import statement. This enables seamless reuse and guarantees that any complex types or elements defined within the core vocabulary are integrated correctly and transparently within new schemas.
The imported elements are then employed in the definition of a specific document structure. For example, the Core Vocabularies are based on DCTERMS [ref], which provides an XML schema, so Core Person could import the DCTERMS XML schema to use AgentType:
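A minimal sketch of such an import is shown below; the schemaLocation URL and the reuse of AgentType for a Person element are illustrative assumptions, not the normative Core Person schema:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:dcterms="http://purl.org/dc/terms/"
            targetNamespace="http://data.europa.eu/m8g/"
            elementFormDefault="qualified">

  <!-- Import the DCTERMS schema; the schemaLocation is illustrative -->
  <xsd:import namespace="http://purl.org/dc/terms/"
              schemaLocation="https://dublincore.org/schemas/xmls/qdc/dcterms.xsd"/>

  <!-- Reuse the imported type (hypothetical usage of AgentType) -->
  <xsd:element name="Person" type="dcterms:AgentType"/>

</xsd:schema>
```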
In cases where the Core Vocabulary does not provide an XSD schema, it is necessary to create the corresponding XML element definitions for the reused URIs in the new XSD schema. Crucially, these new elements must adhere to the namespace defined by the Core Vocabulary to maintain consistency. For example, “AgentType” must be defined within the “http://data.europa.eu/m8g/” namespace of the Core Vocabularies.
Furthermore, when integrating these elements into a new schema, it is essential to reflect the constraints from the Core Vocabulary’s data shape, specifically which properties are optional and which are mandatory, within the XSD schema element definitions.
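Where no schema can be imported, defining the type in the Core Vocabulary namespace might be sketched as follows; the property names and cardinalities are illustrative assumptions, mirroring a data shape in which the name is mandatory and alternative names are optional and repeatable:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:cv="http://data.europa.eu/m8g/"
            targetNamespace="http://data.europa.eu/m8g/"
            elementFormDefault="qualified">

  <!-- AgentType is defined in the Core Vocabularies namespace -->
  <xsd:complexType name="AgentType">
    <xsd:sequence>
      <!-- minOccurs/maxOccurs reflect the (assumed) data shape constraints:
           exactly one name, any number of alternative names -->
      <xsd:element name="name" type="xsd:string" minOccurs="1" maxOccurs="1"/>
      <xsd:element name="alternativeName" type="xsd:string"
                   minOccurs="0" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>

</xsd:schema>
```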
Shape XML document structure
In designing XML schemas, the selection of a design pattern has implications for the reusability and extension of the schema. The Venetian Blind and Garden of Eden patterns stand out as preferable for their ability to allow complex types to be reused by different elements [sem-map].
The Venetian Blind pattern is characterised by having a single global element that serves as the entry point for the XML document, from which all the elements can be reached. This pattern implies a certain directionality and starting point, analogous to choosing a primary class in an ontology that has direct relationships to other classes, and from which one can navigate to the rest of the classes.
For instance, in the Core Business Vocabulary, if one were to select the "Legal Entity" class as the starting point, it would shape the XML schema in such a way that all other classes could be reached from this entry point, reflecting its central role within the ontology. A possible implementation with Venetian Blind with “Legal Entity” as the root element would be:
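One such sketch is given below, with illustrative property names rather than the full Core Business model: a single global LegalEntity element serves as the entry point, while all other constructs are named complex types whose elements are local — the defining trait of Venetian Blind:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:cv="http://data.europa.eu/m8g/"
            targetNamespace="http://data.europa.eu/m8g/"
            elementFormDefault="qualified">

  <!-- The single global element: every document starts at LegalEntity -->
  <xsd:element name="LegalEntity" type="cv:LegalEntityType"/>

  <!-- Named complex types with local elements: reusable,
       but reachable only from the root -->
  <xsd:complexType name="LegalEntityType">
    <xsd:sequence>
      <xsd:element name="legalName" type="xsd:string"/>
      <xsd:element name="registeredAddress" type="cv:AddressType" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>

  <xsd:complexType name="AddressType">
    <xsd:sequence>
      <xsd:element name="fullAddress" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>

</xsd:schema>
```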
Adopting the Venetian Blind pattern reduces variability in its application and makes the schema usable in specific scenarios by providing not only well-defined elements but also a rigid and predictable structure.
On the other hand, the Garden of Eden pattern allows for multiple global elements, providing various entry points into the XML document. This pattern accommodates ontologies where no single class is inherently central, mirroring the flexibility of graph representations in ontologies that do not have a strict hierarchical starting point.
Adopting the Garden of Eden pattern provides a less constrained approach, enabling users to represent information starting from different elements that may hold significance in different contexts. This approach has been adopted by standardisation initiatives such as NIEM and UBL, which recommend such flexibility for broader applicability and ease of information representation.
However, the Garden of Eden pattern does not by itself lead to a schema that can be used in final application scenarios, because it does not ensure a single stable document structure but leaves room for variations. This schema pattern therefore requires an additional composition specification. For example, if it is used in a SOAP API, the developers can decide to use multiple starting points to facilitate the exchange of granular messages, specific per API endpoint. This way the XSD schema remains reusable across different API endpoints and even API implementations.
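For contrast, the same fragment can be sketched in Garden of Eden style (names again illustrative): every element and type is global, so LegalEntity, Address, or any other element can serve as a document root:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:cv="http://data.europa.eu/m8g/"
            targetNamespace="http://data.europa.eu/m8g/"
            elementFormDefault="qualified">

  <!-- Every element is global: each one is a potential entry point -->
  <xsd:element name="LegalEntity" type="cv:LegalEntityType"/>
  <xsd:element name="Address" type="cv:AddressType"/>
  <xsd:element name="legalName" type="xsd:string"/>
  <xsd:element name="fullAddress" type="xsd:string"/>

  <!-- Types compose the global elements by reference -->
  <xsd:complexType name="LegalEntityType">
    <xsd:sequence>
      <xsd:element ref="cv:legalName"/>
      <xsd:element ref="cv:Address" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>

  <xsd:complexType name="AddressType">
    <xsd:sequence>
      <xsd:element ref="cv:fullAddress"/>
    </xsd:sequence>
  </xsd:complexType>

</xsd:schema>
```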
Overall, the choice between these patterns should be informed by the intended use of the schema, the level of abstraction of the ontology it represents, and the needs of the end-users, aiming to strike a balance between structure and flexibility.
Recommendation: We consider the Garden of Eden pattern suitable for designing XSD schemas at the level of core or domain semantic data specifications, and the Venetian Blind pattern suitable for XSD schemas at the level of specific Application Profiles.
1.4.2. Create a new JSON-LD context definition
This section provides detailed instructions for addressing use case UC1.2.
JSON-LD combines the simplicity, power, and web ubiquity of JSON with the concepts of Linked Data. Creating JSON-LD context definitions facilitates this synergy. This ensures that when data is shared or integrated across systems, it maintains its meaning and can be understood in the same way across different contexts. Here’s a guide on how to create new JSON-LD contexts for existing CVs, using the Core Person Vocabulary as an example.
- Import or define elements
- Shape structure
Import or define elements
When a CV has defined an associated JSON-LD context, it is not only easy but also advisable to directly import this context using the @import keyword (available since JSON-LD 1.1). This enables seamless reuse and guarantees that any terms defined within the vocabulary are integrated correctly and transparently within new contexts.
"@context": {"@import": "https://json-ld.org/contexts/remote-context.jsonld"}
In cases where the CV does not provide a JSON-LD context, it is necessary to create the corresponding field definitions for the reused URIs. To start, gather all the terms from the Core Person Vocabulary that you want to include in your JSON-LD context. Terms can include properties like given name, family name, date of birth, and relationships like residency or contact point.
Then, decide the desired structure of the JSON-LD file by defining the corresponding keys, for example Person.givenName, Person.familyName, Person.dateOfBirth, Person.residency, Person.contactPoint. These new fields must adhere to the naming defined by the CV to maintain consistency.
Finally, assign URIs to keys. Each term in your JSON-LD context must be associated with a URI from an ontology that defines its meaning in a globally unambiguous way. Associate the URIs established in CVs to JSON keys using the same CV terms. For example:
"Person.contactPoint": {"@id": "http://data.europa.eu/m8g/contactPoint"}.
Terms that the CVs themselves import shall be used as originally defined, for example from FOAF:
"Person.givenName": {"@id": "http://xmlns.com/foaf/0.1/givenName"}.
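Putting these pieces together, a minimal context covering such keys could look as follows; the key names follow the examples above, and the mapping of Person.familyName to foaf:familyName is an assumption for illustration:

```json
{
  "@context": {
    "Person.givenName":    {"@id": "http://xmlns.com/foaf/0.1/givenName"},
    "Person.familyName":   {"@id": "http://xmlns.com/foaf/0.1/familyName"},
    "Person.contactPoint": {"@id": "http://data.europa.eu/m8g/contactPoint"}
  }
}
```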
Shape structure
Start defining the structure of the context by relating class terms with property terms and then, if necessary, property terms with other classes.
Commence by creating a JSON structure that starts with a @context field. This field will contain mappings from your vocabulary terms to their respective URIs. Continue by defining fields for classes and subfields for their properties.
If the JSON-LD context is developed with the aim of being used directly in exchange specific to an application scenario, then aim to establish a complete tree structure that starts with a single root class. To do so, specify precise @type references linking to the specific class. For example:
"Person.contactPoint" : {"@id": "http://data.europa.eu/m8g/contactPoint", "@type": "ContactPoint"}.
If the aim of the developed JSON-LD context is rather to ensure semantic correspondences, without any structural constraints, which is the case for core or domain semantic data specifications, then definitions of structures specific to each entity type and its properties suffice, using only loose references to other objects. For example:
"Person.contactPoint": {"@id": "http://data.europa.eu/m8g/contactPoint", "@type": "@id"}
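For illustration, a small hypothetical JSON-LD document using such a loosely typed context; because of "@type": "@id", Person.contactPoint carries only a reference to another object rather than a nested structure (the values are placeholders):

```json
{
  "@context": {
    "Person.givenName":    {"@id": "http://xmlns.com/foaf/0.1/givenName"},
    "Person.contactPoint": {"@id": "http://data.europa.eu/m8g/contactPoint",
                            "@type": "@id"}
  },
  "Person.givenName": "Maria",
  "Person.contactPoint": "http://example.org/contact-points/1"
}
```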
1.5. Glossary
1.5.1. Application Profile
Alternative names: AP, context-specific semantic data specification
Definition: A semantic data specification aimed at facilitating data exchange in a well-defined application context.
Additional information: It re-uses concepts from one or more semantic data specifications, while adding more specificity, by identifying mandatory, recommended, and optional elements, addressing particular application needs, and providing recommendations for controlled vocabularies to be used.
Source/Reference: SEMIC Style Guide
1.5.2. Conceptual model
Alternative names: conceptual model specification
Definition: An abstract representation of a system that comprises well-defined concepts, their qualities or attributes, and their relationships to other concepts.
Additional information: A system is a group of interacting or interrelated elements that act according to a set of rules to form a unified whole.
Source/Reference: SEMIC Style Guide
1.5.3. Constraint
Alternative names: restriction, axiom, shape
Definition: Restriction to which an entity or relation must adhere.
Additional information: Models normally consist not only of the entity types and relationships between them, but also contain constraints that hold over them. The types of constraints that can be declared depend on the type of model. For instance, a SQL schema for a relational database has, among others, a data type constraint for each column specification and referential integrity constraints, a UML Class diagram has multiplicity constraints declared on an association to specify the amount of relations each instance is permitted to have, and an ontology may contain an axiom that declares an object property to be, e.g., symmetric or transitive. The list of permissible constraints typically is part of the modelling language, but it also may be an associated additional constraint language, such as SHACL for RDF and OCL for UML.
1.5.4. Core Vocabulary
Alternative names: CV
Definition: A basic, reusable and extensible semantic data specification that captures the fundamental characteristics of an entity in a context-neutral fashion.
Additional information: Its main objective is to provide terms to be reused in the broadest possible context.
Source/Reference: SEMIC Style Guide
1.5.5. Data model
Definition: A structured representation of data elements and relationships used to facilitate semantic interoperability within and across domains.
Additional information: Data models are represented in common languages to facilitate semantic interoperability in a data space, including ontologies, data models, schema specifications, mappings and API specifications that can be used to annotate and describe data sets and data services. They are often domain-specific.
Source/Reference: Data Spaces Blueprint
1.5.6. Information exchange data model
Alternative names: data schema
Definition: An information exchange data model is a technology-specific framework for data exchange, detailing the syntax, structure, data types, and constraints necessary for effective data communication between systems. It serves as a practical blueprint for implementing an application profile in specific data exchange contexts.
Additional information: An ontology and an exchange data model serve distinct yet complementary roles across different abstraction levels within data management systems. While a Data Schema specifies the technical structure for storing and exchanging data, primarily concerned with the syntactical and structural aspects of data, it is typically articulated using metamodel standards such as JSON Schema and XML Schema.
In contrast, ontologies and data shapes operate at a higher conceptual level, outlining the knowledge and relational dynamics within a particular domain without delving into the specifics of data storage or structural implementations. Although a Data Schema can embody certain elements of an ontology or application profile—particularly attributes related to data structure and cardinalities necessary for data exchange—it does not encapsulate the complete semantics of the domain as expressed in an ontology.
Thus, while exchange data models are essential for the technical realisation of data storage and exchange, they do not replace the broader, semantic understanding provided by ontologies. The interplay between these layers ensures that data schemas contribute to a holistic data management strategy by providing the necessary structure and constraints for data exchange, while ontologies offer the overarching semantic framework that guides the meaningful interpretation and utilisation of data across systems. Together, they facilitate a structured yet semantically rich data ecosystem conducive to advanced data interoperability and effective communication.
Source/Reference: Data Spaces Blueprint
1.5.7. Data specification artefact
Alternative names: specification artefact, artefact
Definition: A materialisation of a semantic data specification in a concrete representation that is appropriate for addressing one or more concerns (e.g. use cases, requirements).
Source/Reference: SEMIC Style Guide
1.5.8. Data specification document
Alternative names: specification document
Definition: The human-readable representation of an ontology, a data shape, or a combination of both.
Additional information: A semantic data specification document is created with the objective of making it simple for the end-user to understand (a) how a model encodes knowledge of a particular domain, and (b) how this model can be technically adopted and used for a purpose. It is to serve as technical documentation for anyone interested in using (e.g. adopting or extending) a semantic data specification.
Source/Reference: SEMIC Style Guide
1.5.9. Data shape specification
Alternative names: data shape constraint specification, data shape constraint, data shape
Definition: A set of conditions on top of an ontology, limiting how the ontology can be instantiated.
Additional information: The conditions and constraints that apply to a given ontology are provided as shapes and other constructs expressed in the form of an RDF graph. We assume that the data shapes are expressed in SHACL language.
Source/Reference: SEMIC Style Guide
1.5.10. Model
Additional information: A generic term for any of the entries in this glossary. Models do not require the existence of data, and they may also serve purposes other than facilitating interoperability. SEMIC’s usage of ‘model’ refers to structured information or knowledge represented in a suitable representation language, rather than to individual objects or to the notion of ‘model’ in the model-theoretic semantics of a logic.
1.5.11. Ontology
Definition: A formal specification describing the concepts and relationships that can formally exist for an agent or a community of agents (e.g. domain experts).
Additional information: It encompasses a representation, formal naming, and definition of the categories, properties, and relations between the concepts, data, and entities that substantiate one, many, or all domains of discourse.
Source/Reference: SEMIC Style Guide
1.5.12. Semantic data specification
Alternative names: data specification
Definition: A union of machine- and human-readable artefacts addressing clearly defined concerns, interoperability scope, and use cases.
Additional information: A semantic data specification comprises at least an ontology and a data shape (or either of them individually) accompanied by a human-readable data specification document.
Source/Reference: SEMIC Style Guide
1.5.13. Vocabulary
Definition: An established list of preferred terms that signify concepts or relationships within a domain of discourse. All terms must have an unambiguous and non-redundant definition. Optionally, it may include synonyms, notes, and translations into multiple languages.
1.5.14. Upper Ontology
Alternative names: top-level ontology, foundational ontology
Definition: An upper ontology is a highly generalised ontology that includes entities considered useful across all subject domains, such as “endurant”, “independent continuant”, “process”, and “participates in”.
Additional information: Its primary role is to facilitate broad semantic interoperability among numerous domain ontologies by offering a standardised foundational/top level hierarchy and relations together with its underlying philosophical commitments. This framework assists in harmonising diverse domain ontologies, allowing for consistent data interpretation and efficient information exchange.