The SEMIC Style Guide for Semantic Engineers
1. Introduction
1.1. What is interoperability?
The term "interoperability" comprising of ‘inter’ (Latin for between), ‘opera’ (Latin for work), and ‘ability’, refers to the intrinsic nature of systems or entities working together to achieve shared goals.
Interoperability in the EU context refers to the capacity of systems or organisations, including public administrations, businesses, and citizens, to collaborate effectively and pursue common objectives across borders. This capability is crucial for providing efficient digital public services, facilitating economic transactions, and supporting the free movement of goods, services, people, and data. The European Interoperability Framework (EIF)[eif], [eif2] and Interoperable Europe Act (IEA) [reg24-903] emphasise that interoperability involves the seamless exchange of information and trusted data sharing across sectors and administrative layers, which is essential for improving policy-making and public service delivery.
1.2. Interoperability through semantic specifications
Semantic interoperability ensures that the precise meaning of exchanged data is maintained throughout its transmission, adhering to the principle that "what is sent is what is understood", encompassing both the semantic and syntactic aspects of data. The semantic aspect focuses on the meaning of data elements and their relationships, whereas the syntactic aspect deals with the structure or format of the data as it is exchanged. On the other hand, technical interoperability covers the infrastructures and applications that facilitate the linkage between systems and services. This includes aspects such as data representation, transmission methods, API design, access rights management, security, and overall system performance.
Semantic data specifications are detailed, standardised data modelling descriptions that help manage how data is defined, represented, and communicated across different systems. They comprise various artefacts that are both machine-readable and human-understandable, thus supporting consistent interpretation and utilisation across diverse IT environments and stakeholders (e.g. developers, business experts, end users, administrators, etc.).
The SEMIC Style Guide [sem-sg] provides essential guidelines for creating and managing such specifications, covering naming conventions, syntax, and the organisation of artefacts into two critical types of semantic data specifications: Core Vocabularies and Application Profiles.
The Core Vocabularies are semantic data specifications that enable public administrations to standardise data exchange processes, thus enhancing the clarity and consistency of data across different systems and sectors. By leveraging these standards, administrations can effectively bridge the gap between differing data practices, ensuring seamless service delivery that meets the needs of citizens and businesses alike.
1.3. What are the Core Vocabularies?
Core Vocabularies are simplified, reusable and extensible data models that capture the fundamental characteristics of a data entity in a context-neutral and syntax-neutral fashion [cv-hb]. The SEMIC Style Guide explains how the Core Vocabularies [sem-sg-cvs] are context-neutral semantic building blocks that can be extended into context-specific semantic data specifications to ensure semantic consistency. When the Core Vocabularies are extended to create domain specifications and information exchange models, additional meaning (semantics) is added to the specifications through this contextualisation.
2. SEMIC Core Vocabularies
This section contains a brief overview of the Core Vocabularies, indicating how they were developed and how they are maintained.
Since 2011, the European Commission has facilitated international working groups to forge consensus on and maintain the SEMIC Core Vocabularies. A short description of these vocabularies is included in the table below. The latest release of the Core Vocabularies can be retrieved via the SEMIC Support Center [semic], or directly from the GitHub repository [semic-gh].
Vocabulary | Description |
---|---|
Core Person Vocabulary | A simplified, reusable and extensible data model that captures the fundamental characteristics of a person, e.g. the name, the gender, the date of birth, the location, etc. This specification enables interoperability among registers and any other ICT-based solutions exchanging and processing person-related information. |
Core Business Vocabulary | A simplified, reusable and extensible data model that captures the fundamental characteristics of a legal entity, e.g. the legal name, the activity, the address, etc. It includes a minimal number of classes and properties modelled to capture the typical details recorded by business registers, and facilitates information exchange between business registers despite differences in what they record and publish. |
Core Location Vocabulary | A simplified, reusable and extensible data model that provides a minimum set of classes and properties for describing a location, represented as an address, a geographic name, or a geometry. This specification enables interoperability among land registers and any other ICT-based solutions exchanging and processing location information. |
Core Criterion and Core Evidence Vocabulary (CCCEV) | Supports the exchange of information between organisations that define criteria and organisations that respond to these criteria by means of evidences. CCCEV addresses specific needs of businesses, public administrations and citizens across the European Union. |
Core Public Organisation Vocabulary (CPOV) | Provides a common data model for describing public organisations in the European Union. CPOV addresses specific needs of businesses, public administrations and citizens across the European Union. |
Core Public Event Vocabulary | A simplified, reusable and extensible data model that captures the fundamental characteristics of a public event, e.g. the title, the date, the location, the organiser, etc. It aspires to become a common data model for describing public events (conferences, summits, etc.) in the European Union, and enables interoperability among registers and any other ICT-based solutions exchanging and processing information related to public events. |
2.1. Representation formats
The Core Vocabularies are semantic data specifications that are disseminated as the following artefacts:
- lightweight ontology [sem-sg-wio] for vocabulary definition expressed in OWL [owl2],
- loose data shape specification [sem-sg-wds] expressed in SHACL [shacl],
- human-readable reference documentation [sem-sg-wdsd] in HTML (based on ReSpec [respec]),
- conceptual model specification [sem-sg-wcm] expressed in UML [uml].
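To make the relationship between the first two artefacts concrete, the sketch below shows how a single term might appear in the lightweight ontology and in the loose data shape. This is an illustrative sketch only, not an official artefact: the PublicOrganisation class is assumed to live in the http://data.europa.eu/m8g/ namespace used by the Core Vocabularies, and the dct:title constraint is an assumption.

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix cv:   <http://data.europa.eu/m8g/> .

# lightweight ontology: declares the term and its meaning
cv:PublicOrganisation a owl:Class ;
    rdfs:label "Public Organisation"@en .

# loose data shape: constrains how instances of the term are used
cv:PublicOrganisationShape a sh:NodeShape ;
    sh:targetClass cv:PublicOrganisation ;
    sh:property [
        sh:path dct:title ;     # assumed property, for illustration
        sh:minCount 1
    ] .
```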
2.2. Licensing conditions
The Core Vocabularies are published under the CC-BY 4.0 licence [cc-by].
2.3. Core Vocabularies lifecycle
The Core Vocabularies have been developed following the ‘Process and methodology for developing Core Vocabularies’ [ec11a]. The Core Vocabularies have an open change and release management process [cv-met], supported by SEMIC, that ensures continuous improvement and relevance to evolving user needs.
This process begins with the identification of needs from stakeholders or issues raised in existing implementations. The Working Group members, SEMIC team or community of users propose changes that are thoroughly assessed for their impact and feasibility. Once a change is deemed necessary, it undergoes a drafting phase where the technical details are fleshed out, followed by public consultations to gather wider input and ensure transparency.
Following consultations, the changes are refined and prepared for implementation. This stage may involve further iteration based on feedback or additional insights from ongoing discussions. The finalised changes are then formally approved and documented, ensuring they are well-understood and agreed upon by all relevant parties.
The release management of Core Vocabularies follows a structured timeline that includes pre-announced releases and public consultation periods to allow users to prepare for changes. Each release includes detailed documentation to support implementation, ensuring users can integrate new versions with minimal disruption. This process not only maintains the quality and relevance of the Core Vocabularies, but also supports a dynamic and responsive framework for semantic interoperability within digital public services.
2.4. Claiming conformance
Claiming conformance to the Core Vocabularies is an integral part of validating (a) how well a new or a mapped data model or semantic data specification aligns with the principles and practices established in the SEMIC Style Guide [sem-sg] and (b) to what degree the Core Vocabularies are reused (fully or partially) [sem-sg-reuse]. The conformance assessment is voluntary and takes the form of a published self-conformance statement. This statement must assert which requirements are met by the data model or semantic specification.
The conformance statement highlights various levels of adherence, ranging from basic implementation to more complex semantic representations. At the basic level, conformance might simply involve ensuring that data usage is consistent with the terms (and structure, but no formal semantics) defined by the Core Vocabularies. Moving to a more advanced level of conformance, data may be easily transformed into formats like RDF or JSON-LD, which are conducive to richer semantic processing and integration. This level of conformance signifies a deeper integration of the Core Vocabularies, facilitating a more robust semantic interoperability across systems. Ultimately, the highest level of conformance is achieved when the data is represented in RDF and fully leverages the semantic capabilities of the Core Vocabularies. This includes using a range of semantic technologies, adhering to the SEMIC Style Guide, fully reusing the Core Vocabularies, and respecting the associated data shapes.
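As a hedged illustration of the intermediate conformance level described above, the sketch below shows plain JSON that uses Core Vocabulary term names being lifted to an RDF-ready representation simply by attaching a JSON-LD context. The property URIs follow namespaces used elsewhere in this handbook, while the Person class URI and the overall context are assumptions for illustration.

```json
{
  "@context": {
    "givenName": {"@id": "http://xmlns.com/foaf/0.1/givenName"},
    "contactPoint": {"@id": "http://data.europa.eu/m8g/contactPoint", "@type": "@id"}
  },
  "@type": "http://www.w3.org/ns/person#Person",
  "givenName": "Maria",
  "contactPoint": "https://example.org/contact/123"
}
```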
3. Conceptual framework
This section delves into the conceptual framework of semantic data specifications. Understanding this framework allows stakeholders to use the semantic data specifications effectively and align expectations and practices ensuring consistent and effective communication.
The structure of the section is methodically organised into several subsections, each focusing on a different element of semantic data specifications. It begins with a broad overview of the specifications and establishes what the artefacts are. It then progressively narrows down to specific artefact types, namely data models and documentation. Further subsections explore how data models interact across the different layers of data interoperability, and the types of semantic data specifications. This sequential approach helps readers build a comprehensive understanding, from general concepts to specific explanations.
3.1. Semantic data specifications
Semantic data specifications are composite standards designed to facilitate data exchange and interoperability among diverse systems, characterised by their descriptive and prescriptive nature. These specifications are realised through a suite of artefacts that are harmoniously interrelated and address different interoperability scopes and use cases—ranging from semantic to technical concerns. The artefacts are fashioned to be both machine-readable and human-understandable, ensuring consistent interpretation and utilisation.
Figure 1 depicts a conceptualisation of how various components that make up a complete semantic data specification interconnect. At the top of the diagram is the "Semantic data specification" indicating its overarching role. It serves a "Purpose/Goal" which frames various specific "Concern/Need". The semantic data specification comprises various "Artefacts" denoting the different elements that make up the specification.
Figure 1
Beneath this, the framework branches into two main types of artefacts: "Data models" and "Documentation". The most relevant data models are "Vocabulary", "Ontology", "Data shape" and "UML Class model". Each data model is expressed in a "Modelling language" appropriate for the concern or the need addressed. The next section introduces the relevant artefact types.
3.2. Artefacts
Integral to semantic data specifications is the intrinsic consistency and coherence among the artefacts. Each represents a facet of the same domain knowledge, but is tailored to address specific concerns—such as human understandability, semantic underpinning, formal definition, and data serialisation (addressed in the next section). This alignment ensures that each artefact, while distinct in function, contributes to a unified view of the domain, making the entire specification accessible and actionable. Such consistency is pivotal in maintaining semantic integrity, leading to robust technical interoperability and seamless information exchange.
Each artefact, while unique in its form and function, represents different facets of the same domain. They are harmonised, yet distinct, with each created to address specific concerns, such as:
- Semantic Underpinning: The semantic data specification needs to formally encapsulate the domain knowledge, capturing the essence of its concepts and the possible relationships between them. Ontologies play a key role here, offering a structured and logical framework that lays out the domain knowledge in a way that is both comprehensive and actionable.
- Formal Definition: Using formal languages such as OWL or RDFS for semantic representation enables precise interpretation and inference over the ontologies and instance data. Moreover, data shapes facilitate data structuring on top of the ontology, defining precise constraints and conditions under which the data can be instantiated. This formalisation ensures that the data adheres to the standard, facilitating automated validation and processing.
- Human Understandability: This aspect ensures that individuals, regardless of their technical expertise, can comprehend and engage with the semantic data specifications. The reference documentation, along with the visual representation of UML class diagrams, brings clarity and guidance for human users to grasp the meaning of the semantic data specification and its intended use.
- Visual Representation: The semantic data specification is much easier to understand once it is presented in a visual format. Typically, class diagrams are the most suitable for encapsulating the concepts and the relations between them, significantly boosting comprehension.
- Data Serialisation: The technical artefacts of the specification, such as information exchange data models for various serialisation formats (e.g. JSON-LD, XML), ensure that the data can be correctly serialised, deserialised, and exchanged across systems and platforms. They cater to the technical requirements of data transport (and storage).
The coherence among these artefacts ensures that despite their different purposes and audiences, they all align in their representation of the domain knowledge. This alignment guarantees that whether a stakeholder is interpreting the model conceptually, engaging with it through documentation, or implementing it technically, they are presented with a unified and consistent view of the semantic data specification. This cohesive approach is pivotal for maintaining semantic integrity across various applications and systems.
3.3. Data models
The key data model artefacts of a specification include:
- Vocabulary: An established list of preferred terms that signify concepts or relationships within a domain of discourse. All terms must have an unambiguous and non-redundant definition. Optionally, it may include synonyms, notes and translations into multiple languages. It is represented informally, for example as a spreadsheet or a SKOS thesaurus.
- Ontology: An ontology is a formal, machine-readable specification of a conceptual model [Harpring2016]. It encompasses a representation, formal naming using URIs, and the definition of the categories, properties, and relations between the concepts, data, and entities that substantiate one, many, or all domains of discourse, effectively enabling a shared and common understanding of data [wiki-onto]. It is usually expressed in OWL and RDFS.
- Data Shape: Constraints or patterns that describe how instantiations of an ontology should be structured. Data shape artefacts can be used not only to ensure that RDF data adheres to a predefined structure and validation rules, but also as a blueprint for information exchange data models, preserving semantics and ensuring consistency in data exchange. It is usually expressed in SHACL.
- Controlled list (of values): A value vocabulary used to express concepts or values in instance data. It defines resources (such as instances of topics, languages, countries, or authors) that are used as values for elements in metadata records. Typically, value vocabularies serve as reference data and constitute "building blocks" with which metadata records can be populated.
- UML class model: A static structure UML model and associated diagrams that describe the structure of data by showing the classes, their attributes, and the relationships among objects. It may include documentation, descriptions and various annotations. Such a data model shall conform to the SEMIC Style Guide in order to be fit for purpose as part of a semantic data specification.
- Human-readable documentation: This artefact elucidates the specification for stakeholders of varying business and technical backgrounds, detailing the structure, intent, and practical application of the semantic data specification.
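To illustrate the "controlled list" artefact above, a minimal SKOS sketch is shown below. The concept scheme, concept and URIs are invented for illustration and do not correspond to an official EU authority table.

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/authority/country/> .

# an illustrative value vocabulary: a tiny country authority table
ex:CountryScheme a skos:ConceptScheme ;
    skos:prefLabel "Country authority table (illustrative)"@en .

ex:BEL a skos:Concept ;
    skos:inScheme ex:CountryScheme ;
    skos:prefLabel "Belgium"@en , "Belgique"@fr ;
    skos:notation "BEL" .
```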
Beyond these foundational elements, semantic data specifications may also incorporate artefacts designed for the technical interoperability layer, called information exchange data models. They define and describe, in a technology-specific manner, the structure and content of information that is exchanged between organisations in a specific information exchange context. They detail the syntax, structure, data types, and constraints necessary for effective data communication between systems. These artefacts are necessary to realise technical interoperability. If the chosen technology for exchange is of a semantic nature (e.g. RDF), then a perfect syntax-semantics conflation is readily available through ontologies and data shapes. Otherwise, if a more traditional technology is selected due to popularity or legacy reasons, such as XML/XSD or JSON, then a mapping that acts as a syntax-semantics interface needs to be established, binding the physical model to the semantic specification. These can include various information exchange data models such as:
- JSON-LD context definitions: Facilitating the mapping of JSON to RDF linked data representation.
- XML Schemas (XSD): Defining the structure and validating the constraints of XML documents.
- API (Component) Specifications (REST, WSDL or GraphQL): Outlining the request, response and parameters for web-based data access and manipulation [swagger]. Such components are generally reusable blocks to facilitate reusability and maintenance of APIs.
3.4. Artefacts across interoperability layers
In the framework depicted below, the artefacts are strategically organised within the semantic and technical interoperability layers, with each layer focusing on different but complementary aspects of data interoperability. As can be seen, some artefacts belong to the semantic layer and others to the technical layer, while the data shapes are present in both, reflecting their multipurpose nature.
The Semantic Layer encapsulates artefacts associated with the conceptual understanding of data. It is focused on defining the vocabulary and ontology that provide the foundational elements for data interoperability. These artefacts ensure that the meaning of data is clearly defined and shared across different systems, establishing the semantic rules that govern data exchange.
The Technical Layer is concerned with the practical aspects of data handling, such as data representation formats, communication protocols, and interface specifications. Artefacts in this layer address the technical requirements necessary for data to be physically exchanged and processed by information systems.
Documentation and UML Class models are depicted as orthogonal to these layers, as they facilitate human understanding and transcend the semantic-technical divide. These artefacts provide clarity and guidance, helping stakeholders visualise and comprehend the data structures and relationships without being confined to the constraints of either layer.
3.5. Data specification types
We can discern three interconnected layers each representing a different level of abstraction in semantic data specifications. The arrangement signifies the gradation from the abstract to the specific.
The Upper Layer accommodates the most abstract form of semantic data specifications. These specifications are context-free, meaning that they are not tied to any particular domain or application and can be universally applied across various fields. These semantic data specifications provide the broadest concepts that can be reused in numerous contexts. Here we generally find upper-level ontologies (defining highly abstract foundational concepts such as “object”, “property”, “event”, etc.), but also the core semantic data specifications, which, although more specific, can also be applied across multiple domains. The main objective of the Core Vocabularies is to provide terms to be reused in the broadest possible context [sem-sg-wsds].
Upper ontologies and core semantic data specifications serve as a scaffolding for domain ontologies, offering a hierarchy where the more general terms of the upper ontology act as superclasses (in some cases even as metaclasses) to the more specific classes of domain ontologies. This arrangement supports the structuring and integration of knowledge by providing common reference points that enhance understanding and data processing across different systems.
Notable examples:
- Upper ontologies: DOLCE, Gist, BFO, etc.
- Core Vocabularies: Dublin Core Terms, Data Catalog Vocabulary (DCAT), The Organization Ontology (ORG), European Legislation Identifier (ELI)
The Domain Layer sits at the intersection of the upper and application layers. It contains specifications that are more specific than the upper layer, but not as narrowly focused as the application layer. The semantic data specifications in this layer incorporate concepts relevant to a domain or sector (e.g. the justice domain, the public procurement domain, the healthcare domain) and represent the most specific knowledge from the perspective of that domain.
The domain layer is visually overlapped by both the upper and application layers, symbolising that some domain-specific semantic data specifications can inherit traits from, or lend characteristics to, both the more abstract upper layer and the more concrete application layer.
Notable examples:
- DCAT-AP
- eProcurement Ontology
The Application Layer is the most concrete and context-specific, containing semantic data specifications tailored for particular applications or families of applications. Application Profiles are detailed in constraints and data shapes, addressing explicit needs and constraints of a specific system or use case, and generally provide precise technical artefacts that can be used in data exchange.
Notable examples:
- GeoDCAT-AP
- BRegDCAT-AP
- Stat-DCAT-AP
Terminological Clarification: The level of abstraction pertaining to a semantic data specification—be it core, domain, or application—can be applied as an adjective to describe its constituent artefacts. Thus, for a "core semantic data specification" the included components would be referred to as "core vocabulary", "core ontology", "core data shape" and "core exchange data model" and so on. Similarly, for a "domain semantic data specification," the elements would be denoted as "domain vocabulary", "domain ontology", "domain data shape" and "domain exchange data model", respectively.
3.6. Documentation
In the semantic data specification framework depicted [below], the documentation artefacts are organised into three distinct types, as illustrated in Figure 2, each catering to different aspects of user engagement with the data model. For effective documentation practices, we recommend principles laid out in the Diátaxis framework, which is a systematic approach to understanding the needs of documentation users [dtx]. It identifies four distinct needs (learning, understanding, consulting reference, achieving goals), and four corresponding forms of documentation - tutorials, handbooks, reference documentation and textbooks. It places them in a systematic relationship, and proposes that documentation should itself be organised around the structures of those needs.
In addition, we mention diagrams, which are usually embedded into documents, to underline the importance of the visual depiction of models and to recognise them as artefacts distinct from the models themselves (e.g. UML class diagrams). In the context of semantic data specifications, the following documentation kinds are relevant.
Figure 2
The Handbook (or usage manual) is a how-to guide and serves as an introductory reading to users new to the semantic data specification. It can also take the form of a tutorial to achieve predefined goals. It typically comprises use-case descriptions, examples and practical, step-by-step instructions designed to help users acquire the necessary skills to effectively use the semantic data specification.
Examples: This document
The Textbook (or explanatory manual) is an explanatory type of documentation and focuses on deepening the users’ understanding of the underlying concepts and principles incorporated into the semantic data specification. It aims to inform cognition, enhancing the user’s theoretical knowledge and conceptual insight, which is critical for those looking to gain a more profound grasp of the specification’s rationale, decisions, strengths and limitations.
Examples: SEMIC Style Guide [sem-sg]
The Reference document is a technical type of documentation and provides concise, detailed information about various elements of the semantic data specification. It serves users who are already familiar with the theoretical framework and need to apply their knowledge to specific tasks. This artefact is a go-to resource for factual and objective data about the semantic specifications, such as semantics, syntax, entities, properties, relationships, and constraints within the data model.
These documentation artefacts are designed to collectively support the user’s journey from novice to expert within the semantic data specification domain. The Usage Manual aids in initial skill acquisition, the Explanatory Textbook supports deeper learning and understanding, and the Reference Documentation acts as a reliable resource for informed application and use. Together, they ensure that users at different stages of learning and practice have access to the appropriate materials to meet their needs.
4. Use cases
This handbook serves as a practical guide for using Core Vocabularies in various common situations. To provide clear and actionable insights, we have categorised potential use cases into two groups:
- Primary Use Cases: the most common, interesting, and/or challenging scenarios, all thoroughly covered within this handbook.
- Additional Use Cases: other relevant scenarios that are briefly introduced but, for the sake of brevity, not elaborated on in detail.
Within both groups, we differentiate between use cases focused on the creation of NEW artefacts and those involving the mapping of EXISTING artefacts to Core Vocabularies.
For a better overview, we numbered the use cases and organised them into two tables, followed by the description of these use cases in two separate subsections, one dedicated to the addressed use cases and one to the use cases that are not addressed in this handbook.
ID | Goal | Data specification / Artefact |
---|---|---|
UC1 | Create a NEW | Information exchange data model |
UC1.1 | Create a NEW | XSD schema |
UC1.2 | Create a NEW | JSON-LD context definition |
UC2 | Map to a Core Vocabulary an EXISTING | Data model |
UC2.1 | Map to a Core Vocabulary an EXISTING | Ontology |
UC2.2 | Map to a Core Vocabulary an EXISTING | XSD schema |
Table: Listing of addressed use cases
ID | Goal | Data specification / Artefact |
---|---|---|
UC3 | Create a NEW | Semantic data specification |
UC3.1 | Create a NEW | Core Vocabulary |
UC3.2 | Create a NEW | Application Profile |
UC4 | Create a NEW | Data model |
UC4.1 | Create a NEW | Ontology |
UC4.2 | Create a NEW | Data shape |
UC2.3 | Map to a Core Vocabulary an EXISTING | JSON schema |
Table: Listing of unaddressed use cases
The use cases provided in this handbook are written in a white-box style, oriented towards user goals [weuc].
We will use the following template to describe the relevant use cases that were listed above:
Use Case <UC>: Title of the use case |
Goal: A succinct sentence describing the goal of the use case |
Primary Actor: The primary actor or actors of this use case |
Actors: (Optional) Other actors involved in the use case |
Description: Short description of the use case providing relevant information for its understanding |
Example: An example to illustrate the application of this use case |
Note: (Optional) notes about this use case, especially related to its coverage in this handbook |
4.1. Addressed use cases
Use Case UC1: Create a new information exchange data model |
Goal: Create a new standalone data schema that uses terms from Core Vocabularies. |
Primary Actors: Semantic Engineer, Software Engineer |
Description: The goal is to design and create a new data schema or information exchange data model that is not part of a more comprehensive semantic data specification, relying on terms from existing CVs as much as possible. |
Note: As this is a more generic use case it will be broken down into concrete use cases that focus on specific data formats. |
Use Case UC1.1: Create a new XSD schema |
Goal: Create a new standalone XSD schema that uses terms from Core Vocabularies. |
Primary Actors: Semantic Engineer, Software Engineer |
Description: The goal is to design and create a new XSD schema that is not part of a more comprehensive semantic data specification, relying on terms from existing CVs as much as possible. As an information exchange data model, an XSD Schema can be used to create and validate XML data to be exchanged between information systems. |
Example: OOTS XML schema mappings [oots] |
Note: A detailed methodology to be applied for this use case will be provided in the Create a new XSD schema section. |
Use Case UC1.2: Create a new JSON-LD context definition |
Goal: Create a new standalone JSON-LD context definition that uses terms from Core Vocabularies. |
Primary Actors: Semantic Engineer, Software Engineer |
Description: The goal is to design and create a new JSON-LD context definition that is not part of a more comprehensive semantic data specification, relying on terms from existing CVs as much as possible. As an information exchange data model, a JSON-LD context definition can be integrated in describing data, building APIs, and other operations involved in information exchange. |
Example: Core Person Vocabulary [cpv-json-ld], Core Business Vocabulary [cbv-json-ld] |
Note: A detailed methodology to be applied for this use case will be provided in the Create a new JSON-LD context definition section. |
Use Case UC2: Map an existing data model to a Core Vocabulary |
Goal: Create a mapping of an existing (information exchange) data model, to terms from Core Vocabularies. |
Primary Actors: Semantic Engineer |
Actors: Domain Expert, Software Engineer |
Description: The goal is to design and create a mapping of an ontology, vocabulary, or some kind of data schema or information exchange data model that is not part of a more comprehensive semantic data specification, to terms from CVs. Such a mapping can be done at a conceptual level, or formally, e.g. in the form of transformation rules, and most often will include both. |
Note: Since this is a more generic use case it will be broken down into concrete use cases that focus on specific data models and/or data formats. Some of those use cases will be described in detail below, while others will be included in the next section, which is dedicated to the unaddressed use cases. |
Use Case UC2.1: Map an existing Ontology to a Core Vocabulary |
Goal: Create a mapping between the terms of an existing ontology and the terms of Core Vocabularies. |
Primary Actors: Semantic Engineer |
Actors: Domain Expert, Business Analyst, Software Engineer |
Description: The goal is to create a formal mapping expressed in Semantic Web terminology (for example using the rdfs:subClassOf, rdfs:subPropertyOf, owl:equivalentClass, owl:equivalentProperty and owl:sameAs properties), associating the terms of an existing ontology that defines relevant concepts in a given domain with terms defined in one or more CVs. This activity is usually performed by a semantic engineer based on input received from domain experts and/or business analysts, who can assist with the creation of a conceptual mapping that associates the terms of the existing ontology with terms defined in one or more SEMIC Core Vocabularies. The result of the formal mapping can be used later by software engineers to build information exchange systems. |
Example: Mapping Core Person to Schema.org [map-cp2org], Core Business to Schema.org [map-cb2org], etc. |
Note: A detailed methodology to be applied for this use case will be provided in the Map an existing Ontology section. |
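To give a feel for the output of this use case, below is a hedged sketch of the kind of formal mapping statements it produces, expressed in Turtle. The class and property correspondences shown are illustrative assumptions; the published SEMIC alignments [map-cp2org], [map-cb2org] remain the authoritative mappings.

```turtle
@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix person: <http://www.w3.org/ns/person#> .
@prefix locn:   <http://www.w3.org/ns/locn#> .
@prefix schema: <https://schema.org/> .

# illustrative equivalences between Core Vocabulary terms and Schema.org terms
person:Person  owl:equivalentClass  schema:Person .
locn:Address   owl:equivalentClass  schema:PostalAddress .

# a narrower correspondence would use rdfs:subClassOf / rdfs:subPropertyOf instead
locn:postCode  rdfs:subPropertyOf   schema:postalCode .
```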
Use Case UC2.2: Map an existing XSD Schema to a Core Vocabulary |
Goal: Define the data transformation rules for the mapping of an XSD schema to terms from Core Vocabularies. Create a mapping of XML data that conforms to an existing XSD schema to an RDF representation that conforms to a Core Vocabulary for formal data transformation. |
Primary Actors: Semantic Engineer |
Actors: Domain Expert, Business Analyst, Software Engineer |
Description: The goal is to create a formal mapping using Semantic Web technologies (e.g. RML or other languages), to allow automated translation of XML data conforming to a certain XSD schema into RDF data expressed in terms defined in one or more SEMIC Core Vocabularies. This use case requires the definition of an Application Profile for a Core Vocabulary, because the CV alone does not specify sufficient instantiation constraints to be precisely mappable. |
Example: ISA2core SAWSDL mapping [isa2-map] |
Note: A detailed methodology to be applied for this use case will be provided in the Map an existing XSD schema section. |
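To illustrate what such a formal mapping could look like, below is a minimal RML sketch that maps XML elements to Core Vocabulary and FOAF terms. The source file name, XPath expressions, identifier template and chosen target properties are assumptions made purely for illustration.

```turtle
@prefix rr:     <http://www.w3.org/ns/r2rml#> .
@prefix rml:    <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:     <http://semweb.mmlab.be/ns/ql#> .
@prefix person: <http://www.w3.org/ns/person#> .
@prefix foaf:   <http://xmlns.com/foaf/0.1/> .

<#PersonMapping> a rr:TriplesMap ;
    rml:logicalSource [
        rml:source "persons.xml" ;                      # assumed input document
        rml:referenceFormulation ql:XPath ;
        rml:iterator "/Persons/Person"                  # assumed XML structure
    ] ;
    rr:subjectMap [
        rr:template "http://example.org/person/{Id}" ;  # assumed identifier element
        rr:class person:Person
    ] ;
    rr:predicateObjectMap [
        rr:predicate foaf:givenName ;
        rr:objectMap [ rml:reference "GivenName" ]      # assumed XML element
    ] .
```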
4.2. Unaddressed use cases
Use Case UC3: Create a new Semantic data specification |
Goal: Create a new semantic data specification that reuses terms from Core Vocabularies. |
Primary Actor: Semantic Engineer |
Description: The goal is to design and create a semantic data specification that represents the concepts in a particular domain, while reusing terms from existing CVs as much as possible for concepts that are already covered by CVs. Creating semantic data specifications using this approach will support better interoperability. |
Example: The eProcurement Ontology [epo] is a domain-specific semantic data specification built by reusing terms from multiple Core Vocabularies. |
Note: Recommendation on how to address this use case can be found in the Clarification on “reuse” section of the SEMIC Style Guide, and therefore will not be addressed in this handbook. |
Use Case UC3.1: Create a new Core Vocabulary |
Goal: Create a new Core Vocabulary that reuses terms from other Core Vocabularies. |
Primary Actor: Semantic Engineer |
Description: The goal is to design and create a new Core Vocabulary that represents the concepts of a generic domain of high potential reusability, while reusing terms from existing CVs as much as possible for concepts that are already covered by those CVs. |
Example: The Core Business Vocabulary (CBV) [cbv] is built reusing terms from the Core Location Vocabulary (CLV) [clv] and Core Public Organisation Vocabulary (CPOV) [cpov]. |
Note: Recommendation on how to address this use case can be found in the Clarification on “reuse” section of the SEMIC Style Guide, and therefore will not be addressed in this handbook. |
Use Case UC3.2: Create a new Application Profile |
Goal: Create a new Application Profile that reuses terms from other Core Vocabularies and specifies how they should be used. |
Primary Actor: Semantic Engineer |
Description: The goal is to design and create a new Application Profile that represents all the concepts and restrictions on those concepts that are relevant in a particular application domain, while reusing terms from existing CVs as much as possible. |
Example: The Core Public Service Vocabulary Application Profile (CPSV-AP) [cpsv-ap] is built reusing terms from the Core Location Vocabulary (CLV) [clv] and Core Public Organisation Vocabulary (CPOV) [cpov]. |
Note: Recommendation on how to address this use case can be found in the Clarification on “reuse” section of the SEMIC Style Guide, and therefore will not be addressed in this handbook. |
Use Case UC4: Create a new data model |
Goal: Create a new standalone data model artefact that reuses terms from Core Vocabularies. |
Primary Actor: Semantic Engineer |
Description: The goal is to design and create a new data model artefact that is not part of a more comprehensive semantic data specification, describing the concepts that are relevant in a particular domain or application context, while reusing terms from existing CVs as much as possible. Such artefacts can be of a different nature both according to their interoperability layer (ranging from vocabulary and ontology, to data shape and data schema) and according to their abstraction level (ranging from the upper layer, through the domain layer, to the application layer). |
Note: Since this is a more generic use case it will be broken down into more concrete use cases that focus on specific data models. See also some related use cases (UC1, UC1.1 and UC1.2) discussed in the Addressed use cases section. |
Use Case UC4.1: Create a new ontology |
Goal: Create a new standalone ontology that reuses terms from Core Vocabularies. |
Primary Actor: Semantic Engineer |
Description: The goal is to design and create a new ontology that is not part of a more comprehensive semantic data specification, describing the concepts that are relevant in a particular domain or application context, while reusing terms from existing CVs as much as possible. |
Example: The eProcurement Ontology (ePO) [epo] is built reusing terms from multiple CVs, including the Core Location Vocabulary (CLV) [clv], Core Public Organisation Vocabulary (CPOV) [cpov] and Core Criterion and Core Evidence Vocabulary (CCCEV) [cccev]. |
Note: Recommendation on how to address this use case can be found in the SEMIC Style Guide (more specifically in the Clarification on “reuse” section and the various Guidelines and conventions subsections), and therefore will not be addressed in this handbook. |
Use Case UC4.2: Create a new data shape |
Goal: Create a new standalone data shape that specifies restrictions on the use of terms from Core Vocabularies. |
Primary Actor: Semantic Engineer |
Description: The goal is to design and create a new data shape that is not part of a more comprehensive semantic data specification, describing the expected use of concepts that are relevant in a particular domain or application context, including the use of terms from existing CVs. |
Note: Recommendation on how to address this use case can be found in the SEMIC Style Guide (more specifically in the Clarification on “reuse” and Data shape conventions sections), and therefore will not be addressed in this handbook. |
Use Case UC2.3: Map an existing JSON Schema to a Core Vocabulary |
Goal: Define data transformation rules from a JSON schema to terms from Core Vocabularies. Create a mapping of JSON data that was created according to an existing JSON schema to an RDF representation that conforms to a Core Vocabulary for formal data transformation. |
Primary Actors: Semantic Engineer |
Actors: Domain Expert, Business Analyst, Software Engineer |
Description: The goal is to create a formal mapping using Semantic Web technology (e.g. RML or other languages), to allow automated translation of JSON data conforming to a certain JSON schema into RDF data expressed in terms defined in one or more SEMIC Core Vocabularies. Such activity can be done by semantic engineers, based on input from domain experts and/or business analysts, who can assist with the creation of a conceptual mapping. The conceptual mapping is usually used as the basis for the formal mapping, and can be a simple correspondence table associating the JSON data model elements defined in a JSON schema with terms defined in one or more SEMIC Core Vocabularies. In some cases the creation of the conceptual mapping can be done by the semantic engineers themselves, or even by the software engineers building information exchange systems. |
5. How to create new data models
5.1. Create a new XSD schema
This section provides detailed instructions for addressing use case UC1.1.
To create a new XSD schema, the following steps need to be observed:
- Import or define elements
- Shape structure with patterns
Import or define elements
When working with XML schemas, particularly in relation to semantic artefacts like ontologies or data shapes, managing imports and namespaces is a vital consideration that ensures clarity, reusability, and the proper integration of various data models.
When a core vocabulary has defined an associated XSD schema, it is not only easy but also advisable to directly import this schema using the xsd:import statement. This enables seamless reuse and guarantees that any complex types or elements defined within the core vocabulary are integrated correctly and transparently within new schemas.
The imported elements are then employed in the definition of a specific document structure. For example, the Core Vocabularies are based on DCTERMS [ref], which provides an XML schema, so Core Person could import the DCTERMS XML schema in order to use AgentType:
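Since no concrete schema is reproduced here, the following is a minimal, hedged sketch of what such an import could look like. The target namespace, the schemaLocation and the element that reuses AgentType are assumptions for illustration only.

```xml
<!-- Illustrative sketch: importing a DCTERMS XSD and reusing its AgentType.
     The schemaLocation and element names are placeholders. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dct="http://purl.org/dc/terms/"
           xmlns:cp="http://example.org/core-person"
           targetNamespace="http://example.org/core-person"
           elementFormDefault="qualified">

  <xs:import namespace="http://purl.org/dc/terms/"
             schemaLocation="dcterms.xsd"/>

  <xs:element name="Person">
    <xs:complexType>
      <xs:sequence>
        <!-- hypothetical element that reuses the imported complex type -->
        <xs:element name="registeredBy" type="dct:AgentType" minOccurs="0"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```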
In cases where the Core Vocabulary does not provide an XSD schema, it is necessary to create for the reused URIs the corresponding XML element definitions in the new XSD schema. Crucially, these new elements must adhere to the namespace defined by the Core Vocabulary to maintain consistency. For example “AgentType” must be defined within the “http://data.europa.eu/m8g/” namespace of the Core Vocabularies.
Furthermore, when integrating these elements into a new schema, it is essential to reflect the constraints from the Core Vocabulary’s data shape (specifically, which properties are optional and which are mandatory) within the XSD schema element definitions.
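The sketch below illustrates this case: an AgentType complex type defined directly in the Core Vocabularies namespace, with cardinalities mirroring hypothetical data shape constraints. The element names and cardinalities are assumptions for illustration, not the official definitions.

```xml
<!-- Illustrative sketch: defining AgentType ourselves, in the Core Vocabularies
     namespace, when no official XSD is provided. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:cv="http://data.europa.eu/m8g/"
           targetNamespace="http://data.europa.eu/m8g/"
           elementFormDefault="qualified">

  <xs:complexType name="AgentType">
    <xs:sequence>
      <!-- mandatory here, mirroring an assumed sh:minCount 1 constraint -->
      <xs:element name="name" type="xs:string" maxOccurs="unbounded"/>
      <!-- optional here, mirroring an assumed unconstrained property -->
      <xs:element name="identifier" type="xs:string" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
```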
Shape XML document structure
In designing XML schemas, the selection of a design pattern has implications for the reusability and extension of the schema. The Venetian Blind and Garden of Eden patterns stand out as preferable for their ability to allow complex types to be reused by different elements [sem-map].
The Venetian Blind pattern is characterised by having a single global element that serves as the entry point for the XML document, from which all the elements can be reached. This pattern implies a certain directionality and starting point, analogous to choosing a primary class in an ontology that has direct relationships to other classes, and from which one can navigate to the rest of the classes.
For instance, in the Core Business Vocabulary, if one were to select the "Legal Entity" class as the starting point, it would shape the XML schema in such a way that all other classes could be reached from this entry point, reflecting its central role within the ontology. A possible Venetian Blind implementation with “Legal Entity” as the root element would be:
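A minimal, hedged sketch of such a schema is shown below: a single global LegalEntity element, with all other content reachable through globally defined, reusable complex types. The target namespace, element names and types are assumptions for illustration.

```xml
<!-- Illustrative Venetian Blind sketch: one global element, global named types,
     local elements. Names and namespace are placeholders. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:cb="http://example.org/core-business"
           targetNamespace="http://example.org/core-business"
           elementFormDefault="qualified">

  <!-- the single global element: the entry point of every document -->
  <xs:element name="LegalEntity" type="cb:LegalEntityType"/>

  <xs:complexType name="LegalEntityType">
    <xs:sequence>
      <xs:element name="legalName" type="xs:string"/>
      <xs:element name="registeredAddress" type="cb:AddressType" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="AddressType">
    <xs:sequence>
      <xs:element name="fullAddress" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
```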
Adopting the Venetian Blind pattern reduces variability in its application and makes the schema usable in specific scenarios by providing not only well-defined elements, but also a rigid and predictable structure.
On the other hand, the Garden of Eden pattern allows for multiple global elements, providing various entry points into the XML document. This pattern accommodates ontologies where no single class is inherently central, mirroring the flexibility of graph representations in ontologies that do not have a strict hierarchical starting point.
Adopting the Garden of Eden pattern provides a less constrained approach, enabling users to represent information starting from different elements that may hold significance in different contexts. This approach has been adopted by standardisation initiatives such as NIEM and UBL, which recommend such flexibility for broader applicability and ease of information representation.
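For contrast, a hedged Garden of Eden sketch is shown below: every element and type is global, so an exchange may start from LegalEntity or directly from Address. As before, the namespace and names are assumptions for illustration.

```xml
<!-- Illustrative Garden of Eden sketch: all elements and types are global and
     referenced by name, providing multiple possible entry points. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:cb="http://example.org/core-business"
           targetNamespace="http://example.org/core-business"
           elementFormDefault="qualified">

  <xs:element name="legalName" type="xs:string"/>
  <xs:element name="fullAddress" type="xs:string"/>
  <xs:element name="Address" type="cb:AddressType"/>
  <xs:element name="LegalEntity" type="cb:LegalEntityType"/>

  <xs:complexType name="LegalEntityType">
    <xs:sequence>
      <xs:element ref="cb:legalName"/>
      <xs:element ref="cb:Address" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="AddressType">
    <xs:sequence>
      <xs:element ref="cb:fullAddress"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
```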
However, the Garden of Eden pattern does not lead to a schema that can be used in final application scenarios, because it does not ensure a single stable document structure but leaves the possibility for variations. This schema pattern requires an additional composition specification. For example, if it is used in a SOAP API, the developers can decide on using multiple starting points to facilitate exchange of granular messages specific per API endpoint. This way the XSD schema remains reusable for different API endpoints and even API implementations.
Overall, the choice between these patterns should be informed by the intended use of the schema, the level of abstraction of the ontology it represents, and the needs of the end-users, aiming to strike a balance between structure and flexibility.
Recommendation: We consider the Garden of Eden pattern suitable for designing XSD schemas at the level of core or domain semantic data specifications, and the Venetian Blind pattern suitable for XSD schemas at the level of specific Application Profiles.
5.2. Create a new JSON-LD context definition
This section provides detailed instructions for addressing use case UC1.2.
JSON-LD combines the simplicity, power, and web ubiquity of JSON with the concepts of Linked Data. Creating JSON-LD context definitions facilitates this synergy. This ensures that when data is shared or integrated across systems, it maintains its meaning and can be understood in the same way across different contexts. Here’s a guide on how to create new JSON-LD contexts for existing CVs, using the Core Person Vocabulary as an example.
- Import or define elements
- Shape structure
Import or define elements
When a CV has defined an associated JSON-LD context, it is not only easy, but also advisable to directly import this context using the @import keyword. This enables seamless reuse and guarantees that any complex types or elements defined within the vocabulary are integrated correctly and transparently within new schemas.
"@context": {"@import": "https://json-ld.org/contexts/remote-context.jsonld"}
In cases where the CV does not provide a JSON-LD context, it is necessary to create the corresponding field definitions for the reused URIs. To start, gather all the terms from the Core Person Vocabulary that you want to include in your JSON-LD context. Terms can include properties like given name, family name, date of birth, and relationships like residency or contact point.
Then, decide the desired structure of the JSON-LD file by defining the corresponding keys, for example Person.givenName, Person.familyName, Person.dateOfBirth, Person.residency, Person.contactPoint. These new fields must adhere to the naming defined by the CV to maintain consistency.
Finally, assign URIs to keys. Each term in your JSON-LD context must be associated with a URI from an ontology that defines its meaning in a globally unambiguous way. Associate the URIs established in CVs to JSON keys using the same CV terms. For example:
"Person.contactPoint": {"@id": "http://data.europa.eu/m8g/contactPoint"}.
The terms that are imported by the CVs shall be used as originally defined, for example from FOAF:
"Person.givenName": {"@id": "http://xmlns.com/foaf/0.1/givenName"}.
Shape structure
Start defining the structure of the context by relating class terms with property terms and then, if necessary, property terms with other classes.
Commence by creating a JSON structure that starts with a @context field. This field will contain mappings from your vocabulary terms to their respective URIs. Continue by defining fields for Classes and subfields for their properties.
If the JSON-LD context is developed with the aim of being used directly in exchange specific to an application scenario, then aim to establish a complete tree structure that starts with a single root class. To do so, specify precise @type references linking to the specific Class. For example:
"Person.contactPoint" : {"@id": "http://data.europa.eu/m8g/contactPoint", "@type": "ContactPoint"}.
If the aim of the developed JSON-LD context is rather to ensure semantic correspondences, without any structural constraints, which is the case for core or domain semantic data specifications, then definitions of structures specific to each entity type and its properties suffice, using only loose references to other objects. For example:
"Person.contactPoint": {"@id": "http://data.europa.eu/m8g/contactPoint", "@type": "@id"}
6. How to map existing data models
In the vast and diverse landscape of knowledge representation and data modelling, sources of knowledge are articulated through various means, each adopting its distinct methodology for expression. This diversity manifests not only in the choice of technology and representation languages, but also in the vocabularies used and the specific models created. Such heterogeneity, while enriching, introduces significant challenges for semantic interoperability—the ability of different systems to understand and use the information seamlessly across various contexts.
The idea of unifying this rich spectrum of knowledge under a single model and a single representation language, though conceptually appealing, is pragmatically unfeasible and, arguably, undesirable. The diversity of knowledge sources is not merely a by-product of historical development, but a reflection of the varied domains, perspectives, and requirements these sources serve.
To navigate this complexity, a more nuanced approach is required—one that seeks to establish connections across models without imposing uniformity. This is where the concepts of ontology mapping and the broader spectrum of model alignment methodologies come into play. Moreover, the mapping endeavour encompasses not only ontological artefacts, but also various technical artefacts—ranging from data shapes defined in SHACL or ShEx, XSD schemas for XML, to JSON Schemas for JSON data. Each of these artefacts represents a different facet of knowledge modelling. Thus, mapping in this broader sense involves creating links between these semantic and technical artefacts and Core Vocabularies.
The past couple of decades have witnessed extensive efforts in ontology and data mapping, resulting in a plethora of tools, methods, and technologies aimed at enhancing semantic interoperability. These endeavours underscore the vast landscape of potential strategies available for mapping. These strategies range from conceptual methodologies that explore the semantic congruence and contextual relevance of entities and relationships, to formal methodologies that operationalise these conceptual mappings as technical data transformation rules. It is important to acknowledge that there is no one-size-fits-all method; instead, the field offers a spectrum of approaches suited to various needs and contexts.
The subsequent sections will delve into the specific methodologies of mapping —both conceptual and formal—, providing a blueprint for navigating and bridging the world of semantic and technical artefacts, empowering stakeholders to make informed decisions that best suit their interoperability needs.
6.1. Map an existing Ontology
This section provides detailed instructions for addressing use case UC2.1.
In this section we adopt the following definitions from the ontology matching literature:
- Ontology matching: the process of finding relationships or correspondences between entities of different ontologies.
- Ontology alignment: a set of correspondences between two or more ontologies, the outcome of the ontology matching process.
- Ontology mapping: the oriented, or directed, version of an alignment, i.e. it maps the entities of one ontology to at most one entity of another ontology.
To create an ontology alignment, the following steps need to be observed:
- Staging: defining the requirements
- Characterisation: defining source and target data and performing data analysis
- Reuse: discover, evaluate, and reuse existing alignments
- Matching: execute and evaluate matching
- Align and map: prepare, create the alignment, and render mappings
- Application: make the alignment available for use in applications
This methodology has been used in mapping the Core Vocabularies to Schema.org. This work is available on the SEMIC GitHub repository dedicated to Semantic Mappings, where it is also documented. The next sections describe this methodology in more detail.
Staging
This initial phase involves a comprehensive understanding of the project’s scope, identifying the specific goals of the mapping exercise and the key requirements it must fulfil. Stakeholders collaborate to articulate the purpose of the ontology or data model alignment, setting clear objectives that will guide the entire process. Defining these requirements upfront ensures that subsequent steps are aligned with the mapping exercise’s overarching goals and stakeholder expectations, and fit the use cases.
Inputs: Stakeholder knowledge, project goals, available resources, domain expertise.
Outputs: Mapping project specification document comprising a defined mapping project scope and comprehensive list of requirements.
Characterisation
In this stage, a thorough analysis of both source and target ontologies is conducted to ascertain their structures, vocabularies, and the semantics they encapsulate. This involves an in-depth examination of the conceptual frameworks, data representation languages, and any existing constraints within both models. Understanding the nuances of both the source and target is critical for identifying potential challenges and opportunities in the mapping process, ensuring that the alignment is both feasible and meaningful.
The following is an indicative, but not exhaustive, list of aspects to consider in this analysis: specifications documentation, representation language and representation formats, deprecation mechanism, inheritance policy (single inheritance only or multiple inheritance are also allowed), natural language(s) used, label specification, label conventions, definition specification, definition conventions, version management and release cycles, etc.
Inputs: Source and target ontologies, initial requirements, domain constraints.
Outputs: Analysis reports comprising a comparative characterisation table, identified difficulties, risks and amenability assessments, selected source and target for mapping.
Reuse
In the ontology mapping lifecycle, the reuse stage is pivotal, facilitating the integration of pre-existing alignments into the project’s workflow. Following the initial characterisation, this stage entails discovery and a rigorous evaluation of available alignments against the project’s defined requirements. These requirements are instrumental in appraising whether an existing alignment can be directly adopted, necessitates modifications for reuse, or if a new alignment should be constructed from the ground up.
Ontology alignments are often expressed in the Alignment Format (AF) or EDOAL. An example statement in a Turtle file representing an ontology alignment (taken from the Core Business Vocabulary to Schema.org alignment) could look something like this:
<http://mapping.semic.eu/business/sdo/cell/1> a align:Cell ;
    align:entity1 <http://www.w3.org/ns/locn#Address> ;
    align:entity2 <https://schema.org/PostalAddress> ;
    align:relation "=" ;
    align:measure "1"^^xsd:float ;
    owl:annotatedProperty owl:equivalentClass ;
    sssom:mapping_justification semapv:MappingReview .
The outcome of this stage splits into three distinct pathways:
- direct reuse of alignments that are immediately applicable,
- adaptive reuse where existing alignments provide a partial fit and serve as a basis for refinement (i.e. we can improve certain alignment statements, or add new statements in the alignment map), and
- the initiation of a new alignment when existing resources are not suitable.
This structured approach to reuse optimises resource utilisation, promotes efficiency, and tailors the mapping process to the project’s unique objectives.
Inputs: Repository of existing alignments (for the source and target ontologies), evaluation criteria based on requirements.
Outputs: Assessment report on existing alignments, decisions on reuse, adaptation, or creation of a new alignment.
Matching
This section delves into automatic and semi-automatic approaches to finding alignment candidates. For small vocabularies and ontologies, however, fully manual efforts are likely more efficient.
Utilising both automated tools and manual expertise, this phase focuses on identifying potential correspondences between entities in the source and target models. The matching process may employ various methodologies, including semantic similarity measures, pattern recognition, or lexical analysis, to propose candidate alignments. These candidates are then critically evaluated for their accuracy, relevance, and completeness, ensuring they meet the predefined requirements and are logically sound. This stage is delineated into three main activities: planning, execution, and evaluation.
In the planning activity, the approach to ontology matching is meticulously strategised. The planning encompasses selecting appropriate algorithms and methods, fine-tuning parameters, determining thresholds for similarity and identity functions, and setting evaluative criteria. These preparations are informed by a thorough understanding of the project’s requirements and the outcomes of previous reuse evaluations.
Numerous well-established ontology matching algorithms have been extensively reviewed in the literature (for in-depth discussions, see paper). The main classes of ontology matching techniques are listed below in the order of their relevance to this handbook:
- Terminological techniques draw on the textual content within ontologies, such as entity labels and comments, employing methods from natural language processing and information retrieval, including string distances and statistical text analysis.
- Structural techniques analyse the relationships and constraints between ontology entities, using methods like graph matching to explore the topology of ontology structures.
- Semantic techniques apply formal logic and inference to deduce the implications of proposed alignments, aiding in the expansion of alignments or detection of conflicts.
- Extensional techniques compare entity sets, or instances, potentially involving analysis of shared resources across ontologies to establish similarity measures.
Following planning, the execution activity implements the chosen matchers. Automated or semi-automated tools are deployed to carry out the matching process, resulting in a list of candidate correspondences. This list typically includes suggested links between elements of the source and target ontologies, each with an associated confidence level computed by the algorithms. EDOAL, a representation framework for expressing such correspondences, is commonly utilised to encapsulate these potential alignments.
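As an illustration, a candidate correspondence produced by a matcher could be recorded as an Alignment Format cell in Turtle, using the same prefixes as the example in the Reuse section; the cell URI and the confidence value below are hypothetical:

<http://example.org/alignment/cell/42> a align:Cell ;
    align:entity1 <http://www.w3.org/ns/org#Organization> ;
    align:entity2 <https://schema.org/Organization> ;
    align:relation "=" ;
    align:measure "0.87"^^xsd:float .

The measure expresses the matcher’s confidence; whether the correspondence is kept, refined, or discarded is decided only in the subsequent evaluation activity.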
Finally, in the evaluation activity, the candidate correspondences are rigorously assessed for their suitability. The evaluation measures the candidates against the project’s specific needs, scrutinising their accuracy, relevance, and alignment with the predefined requirements. This assessment ensures that only the most suitable correspondences are carried forward for the creation of an alignment, thereby upholding the integrity and logical soundness of the mapping process.
Tools: Silk, for example, can be used in this stage.
Inputs: Matcher configurations, additional resources (if any), candidate correspondences from previous matching iterations.
Outputs: Generated candidate correspondences, evaluation reports, finalised list of potential alignments.
Align and Map
Following the identification of suitable matches, this step involves the formal creation of the alignment and the rendering (generation) of specific mappings between the source and target models. This phase encompasses preparation, creation, and rendering activities that solidify the relationships between ontology entities into a coherent alignment and actionable mappings. The resulting alignment is then documented, detailing the rationale, methods used, and any assumptions made during the mapping process.
The alignment process should be considered as part of the governance of a vocabulary or ontology that would include engaging communication with third parties to validate the alignment. Furthermore, the process has technical implications that should be evaluated upfront such as the machine interpretation and execution of the mapping.
Preparation involves stakeholder consensus on the Alignment Plan. This plan guides stakeholders through the systematic refinement of candidate correspondences, considering not only the relevance of the matches, but also the type of relationship between the elements. This plan might include the removal of irrelevant correspondences or strategic amendments to existing relationships. The chosen candidate correspondences are those that have been determined to be an adequate starting point for the alignment. The type of asset—be it an ontology, controlled list, or data shape—dictates the nature of the relationship that can be rendered from the alignment. The table below elucidates potential relationship types that can be established:
Relation / Element type | Property | Concept | Class | Individual
---|---|---|---|---
= | owl:equivalentProperty; owl:sameAs | skos:exactMatch; skos:closeMatch | owl:equivalentClass; owl:sameAs | owl:sameAs
> | | skos:narrowMatch | |
< | rdfs:subPropertyOf | skos:broadMatch | rdfs:subClassOf |
% | owl:propertyDisjointWith | | owl:disjointWith | owl:differentFrom
instanceOf | rdf:type | skos:broadMatch; rdf:type | rdf:type | rdf:type
hasInstance | | skos:narrowMatch | |
This table is indicative of the variety of semantic connections that can be realised, ranging from equivalence and subclass relations to disjointness and type instantiation. This nuanced approach to the preparation stage is essential in ensuring that the eventual alignment and rendered mapping accurately represent the semantic intricacies of the relationships defined in the project scope, thereby fulfilling the project’s defined requirements.
The Creation step is the execution of the Alignment Plan: through human intervention, the candidate correspondences are selected and refined into a deliberate alignment. The selection is conducted manually, according to the project’s objectives.
Rendering translates the refined alignment into a mapping—a directed version that can be interpreted and executed by software agents. This process is straightforward, producing a machine-executable artefact. Most often this is a simple export of the alignment statements from the editing tool or the materialisation of the alignment in a triple store. Multiple renderings may be created from the same alignment, accommodating the need for various formalisms.
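For example, the alignment cell shown in the Reuse section, annotated with owl:equivalentClass, could be rendered as a single, directly usable OWL statement:

<http://www.w3.org/ns/locn#Address> owl:equivalentClass <https://schema.org/PostalAddress> .

Such a rendered triple can be loaded into a triple store alongside the instance data, or exported as a standalone mapping file.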
Tools: VocBench3 can be used in this stage, but more generic tools, such as MS Excel or Google Sheets spreadsheets, can be used as well.
Inputs: Evaluated candidate correspondences, stakeholders' amendment plans, requirements for the formalism of the mapping.
Outputs: Created alignment and mapping, alignment amendment strategy, stored versions in an alignment repository.
Application
The final stage focuses on operationalising the created alignment, ensuring it is accessible and usable by applications that require semantic interoperability between the mapped models. This involves publishing the alignment in a standardised, machine-readable format and integrating it within ontology management or data integration tools.
Additionally, mechanisms for maintaining, updating, and governing the alignment are established, facilitating its long-term utility and relevance.
Moreover, this stage involves the creation of maintenance protocols to preserve the alignment’s relevance over time. This includes procedures for regular updates in response to changes in ontology structures or evolving requirements, as well as governance mechanisms to oversee these adaptations. As the mapping is applied, new insights may emerge, prompting discussions within the stakeholder community about potential refinements or the development of a new iteration of the mapping. The dynamic nature of data sources means that the application stage is both an endpoint and a starting point for continuous improvement. Some processes may be automated to enhance efficiency, such as the monitoring of ontologies for changes that would necessitate updates to the mapping.
Inputs: Finalised mappings, application context, feedback mechanisms.
Outputs: Applied mappings in use, insights from application, triggers for potential updates, governance actions for lifecycle management.
6.2. Map an existing XSD schema
This section provides detailed instructions for addressing use case UC2.2.
To create an XSD schema mapping, one first needs to decide on its purpose and level of specificity, which can range from a lightweight alignment at the vocabulary level down to a fully fledged, executable set of rules for data transformation.
In this section we describe a methodology that covers both the conceptual mapping and the technical mapping for data transformation.
Figure 3 depicts a workflow for creating an XSD schema mapping, segmented into four distinct phases:
- Create a Conceptual Mapping, so that business and domain experts can validate the correspondences;
- Create a Technical Mapping, so that the data can be automatically transformed;
- Validate the mapping rules to ensure consistency and accuracy;
- Disseminate the mapping rules to be applied in the foreseen use cases.
Figure 3. XSD schema mapping workflow
Before initiating the mapping development process, it is crucial to construct a representative test dataset. This dataset should consist of a carefully selected set of XML files that cover the important scenarios and use cases encountered in the production data. It should be comprehensive yet sufficiently compact to facilitate rapid transformation cycles, enabling effective testing iterations.
Conceptual Mapping development
Conceptual Mapping in semantic data integration can be established at two distinct levels: the vocabulary level and the application profile level. These levels differ primarily in their complexity and specificity regarding the data context they address.
Vocabulary Level mapping is established using basic XML elements. This form of mapping aims for a terminological alignment, meaning that an XML element or attribute is directly mapped to an ontology class or property. For example, an XML element <PostalAddress> could be mapped to the locn:Address class, or an element <surname> could be mapped to the property foaf:familyName in the FOAF ontology. Such a mapping can be established as a simple spreadsheet. This approach results in a simplistic and direct alignment, which lacks contextual depth and specificity; for this reason, the subsequent steps of this methodology cannot be applied to it.
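For illustration, such a spreadsheet could be as minimal as a two-column table pairing XML elements with target terms, using the examples above:

XML element | Target term
---|---
PostalAddress | locn:Address
surname | foaf:familyName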
A more advanced approach would be to embed semantic annotations into XSD schemas using standards such as SAWSDL. Such an approach is appropriate in the context of WSDL services.
Application Profile Level of conceptual mapping utilises XPath to guide access to data in XML structures, enabling precise extraction and contextualization of data before mapping it to specific ontology fragments. An ontology fragment is usually expressed as a SPARQL Property Path (or simply Property Path). This Property Path facilitates the description of instantiation patterns specific to the Application Profile. This advanced approach allows for context-sensitive semantic representations, crucial for accurately reflecting the nuances in interpreting the meaning of data structures.
The tables below show two examples of mapping the organisation’s address: its city and postal code. They show where the data can be extracted from, and how it can be mapped to targeted ontology properties such as locn:postName and locn:postCode. To ensure that this address is not mapped in a vacuum, but is linked to an organisation instance (and not, for example, to a person), the mapping is anchored in an instance ?this of an org:Organization. Optionally, a class path can be provided to complement the property path and explicitly state the class sequence, which otherwise can be deduced from the Application Profile definition.
Source XPath | */efac:Company/cac:PostalAddress/cbc:PostalZone
---|---
Target Property Path | ?this cv:registeredAddress / locn:postCode ?value .
Target Class Path | org:Organization / locn:Address / rdf:PlainLiteral
Source XPath | */efac:Company/cac:PostalAddress/cbc:CityName
---|---
Target Property Path | ?this cv:registeredAddress / locn:postName ?value .
Target Class Path | org:Organization / locn:Address / rdf:PlainLiteral
Inputs: XSD Schemas, Ontologies, SHACL Data Shapes, Source and Target Documentation, Sample XML data
Outputs: Conceptual Mapping Spreadsheet
Technical Mapping development
The technical mapping step is a critical phase in the mapping process, serving as the bridge between conceptual design and practical, machine-executable implementation. This step takes as input the conceptual mapping, which has been crafted and validated by domain experts or data-savvy business stakeholders, and which establishes correspondences between XPath expressions and ontology fragments.
When it comes to representing these mappings technically, several technology options are available (see paper), such as XSLT, RML, SPARQLAnything, etc. Among these, the RDF Mapping Language (RML) stands out for its effectiveness and straightforward approach. RML allows for the representation of mappings from heterogeneous data formats like XML, JSON, relational databases and CSV into RDF, supporting the creation of semantically enriched data models. The mappings can be expressed in Turtle or in the YARRRML dialect, a user-friendly text-based format based on YAML, making them accessible to both machines and humans. RML is well supported by implementations such as RMLMapper and RMLStreamer, which provide robust platforms for executing these mappings. RMLMapper is adept at handling batch processing of data, transforming large datasets efficiently. RMLStreamer, on the other hand, excels in streaming data scenarios, where data needs to be processed in real time, providing flexibility and scalability in dynamic environments.
The development of the mapping rules is straightforward thanks to the conceptual mapping that is already available. The Conceptual Mapping (CM) clarifies to which class and property each XML element should be mapped, and how. RML mapping statements are then created for each class of the target ontology, coupled with the property-object mapping statements specific to that class. To implement the mappings effectively, it is essential to master RML along with XML technologies such as XSD, XPath, and XQuery (rml-gen).
An additional step involves deciding on a URI creation policy and designing a uniform scheme for use in the generated data, ensuring consistency and coherence in the data output.
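The following minimal sketch, in Turtle, illustrates how the address-related rules from the Conceptual Mapping above might be expressed in RML. The source file name, the iterator, and the URI template are illustrative assumptions; the URI template in particular would be derived from the URI creation policy just discussed:

@prefix rr:   <http://www.w3.org/ns/r2rml#> .
@prefix rml:  <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:   <http://semweb.mmlab.be/ns/ql#> .
@prefix locn: <http://www.w3.org/ns/locn#> .

<#AddressMapping>
  rml:logicalSource [
    rml:source "sample-notice.xml" ;                    # illustrative file name
    rml:referenceFormulation ql:XPath ;
    rml:iterator "//efac:Company/cac:PostalAddress"     # matches the source XPath above
  ] ;
  rr:subjectMap [
    rr:template "http://example.org/address/{cbc:PostalZone}" ;   # illustrative URI policy
    rr:class locn:Address
  ] ;
  rr:predicateObjectMap [
    rr:predicate locn:postCode ;
    rr:objectMap [ rml:reference "cbc:PostalZone" ]     # postal code rule
  ] ;
  rr:predicateObjectMap [
    rr:predicate locn:postName ;
    rr:objectMap [ rml:reference "cbc:CityName" ]       # city name rule
  ] .

Linking each address to the corresponding organisation instance via cv:registeredAddress would be expressed in a second triples map following the same pattern.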
A viable alternative to RML is XSLT technology, which offers a powerful, but low-level method for defining technical mappings. While this method allows for high expressiveness and complex transformations, it also increases the potential for errors due to its intricate syntax and operational complexity. This technology excels in scenarios requiring detailed manipulation and parameterization of XML documents, surpassing the capabilities of RML in terms of flexibility and depth of transformation rules that can be implemented. However, the detailed control it affords means that developers must have a high level of expertise in semantic technologies and exercise caution and precision to avoid common pitfalls associated with its use.
A pertinent example of XSLT’s application is the tool for transforming ISO-19139 metadata to the DCAT-AP geospatial profile (GeoDCAT-AP) in the framework of INSPIRE and the EU ISA Programme. This XSLT script is configurable to accommodate transformation with various operational parameters such as the selection between core or extended GeoDCAT-AP profiles and specific spatial reference systems for geometry encoding, showcasing its utility in precise and tailored data manipulation tasks.
Inputs: Conceptual Mapping spreadsheet, sample XML data
Outputs: Technical Mapping source code, sample data transformed into RDF
Validation
After transforming the sample XML data into RDF, two primary methods of validation are employed to ensure the integrity and accuracy of the data transformation: SPARQL-based validation and SHACL-based validation. They offer two fundamental methodologies for ensuring data integrity and conformity within semantic technologies, each serving distinct but complementary functions.
The SPARQL-based validation method utilises SPARQL ASK queries, which are derived from the SPARQL Property Path expressions (and complementary Class paths) outlined in the conceptual mapping. These expressions serve as assertions that test specific conditions or patterns within the RDF data corresponding to each conceptual mapping rule. By executing these queries, it is possible to confirm whether certain data elements and relationships have been correctly instantiated according to the mapping rules. The ASK queries return a boolean value indicating whether the RDF data meets the conditions specified in the query, thus providing a straightforward mechanism for validation. This confirms that the conceptual mapping is implemented correctly in a technical mapping rule.
For example, for the mapping rules above the following assertions can be derived:
ASK {
  ?this a org:Organization .
  ?this cv:registeredAddress / locn:postName ?value .
}

ASK {
  ?this a org:Organization .
  ?this cv:registeredAddress / locn:postCode ?value .
}
The SHACL-based validation method provides a more comprehensive framework for validating RDF data. In this approach, data shapes are defined according to the constraints and structures expected in the RDF output, as specified by the mapped Application Profile. These shapes act as templates that the RDF data must conform to, covering various aspects such as data types, relationships, cardinality, and more. A SHACL validation engine processes the RDF data against these shapes, identifying any deviations or errors that indicate non-conformity with the expected data model.
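As a minimal sketch, and not taken from an actual Application Profile, a SHACL node shape constraining the address data produced by the mappings above could look like this:

@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix locn: <http://www.w3.org/ns/locn#> .

<#AddressShape>
  a sh:NodeShape ;
  sh:targetClass locn:Address ;          # applies to every instance of locn:Address
  sh:property [
    sh:path locn:postCode ;
    sh:nodeKind sh:Literal ;
    sh:maxCount 1                        # at most one postal code per address
  ] ;
  sh:property [
    sh:path locn:postName ;
    sh:nodeKind sh:Literal
  ] .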
SHACL is an ideal choice for ensuring adherence to broad data standards and interoperability requirements. This form of validation is independent of the manner in which data mappings are constructed, focusing instead on whether the data conforms to established semantic models at the end-state. It provides a high-level assurance that data structures and content meet the specifications designed to facilitate seamless data integration and interactions across various systems.
Conversely, SPARQL-based validation is tightly linked to the mapping process itself, offering a granular, rule-by-rule validation that ensures each data transformation aligns precisely with the expert-validated mappings. It is particularly effective in confirming the accuracy of complex mappings and ensuring that the implemented data transformations faithfully reflect the intended semantic interpretations, thus providing a comprehensive check on the fidelity of the mapping process.
Inputs: Sample data transformed into RDF, Conceptual Mapping, SHACL data shapes
Outputs: Validation reports
Dissemination
Once the conceptual and technical mappings have been completed and validated, they can be packaged for dissemination and deployment. The purpose of disseminating mapping packages is to facilitate their controlled use for data transformation, ensure the ability to trace the evolution of mapping rules, and standardise the exchange of such rules. This structured approach allows for efficient and reliable data transformation processes across different systems.
A comprehensive mapping package typically includes:
- Conceptual Mapping Files: Serve as the core documentation, outlining the rationale and structure behind the mappings to ensure transparency and ease of understanding.
- Technical Mapping Files: Contain all the mapping code files (XSLT[ref], RML[ref], SPARQLAnything[ref], etc., depending on the chosen mapping technology) for data transformation, allowing for the practical application of the conceptual designs.
- Additional Mapping Resources: Controlled lists, value mappings, or correspondence tables, which are crucial for the correct interpretation and application of the mapping code. These are stored in a dedicated resources subfolder.
- Test Data Sets: Carefully selected and representative XML files that cover various scenarios and cases. These test datasets are crucial for ensuring that the mappings perform as expected across a range of real-world data.
- Factory Acceptance Testing (FAT) Reports: Document the testing outcomes based on the SPARQL and SHACL validations to guarantee that the mappings meet the expected standards before deployment. The generation of these reports should be supported by automation, as manual generation would involve too much effort and cost.
- Tests Used for FAT Reports: The actual SPARQL assertions and SHACL shapes used in generating the FAT reports, providing a complete view of the validation process.
- Descriptive Metadata: Essential data about the mapping package, such as identification, title, description, and versions of the mapping, ontology, and source schemas. This metadata aids in the management and application of the package.
This package is designed to be self-contained, ensuring that it can be immediately integrated and operational within various data transformation pipelines. The included components support not only the application, but also the governance of the mappings, ensuring they are maintained and utilised correctly in diverse IT environments. This systematic packaging addresses critical needs for usability, maintainability, and standardisation, which are essential for widespread adoption and operational success in data transformation initiatives.
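One possible way to organise such a package on disk is shown below; the folder and file names are illustrative, not prescribed by this guide:

mapping-package/
  conceptual-mapping/    conceptual mapping spreadsheet
  technical-mapping/     RML or XSLT mapping files
  resources/             controlled lists, value mappings, correspondence tables
  test-data/             representative sample XML files
  validation/            SPARQL assertions and SHACL shapes used for FAT
  reports/               FAT reports
  metadata.json          descriptive metadata (identification, title, versions, etc.)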
Inputs: Conceptual Mapping spreadsheet, Technical Mapping source code, Ontologies, SHACL data shapes, Sample XML data, Sample data transformed into RDF, Validation reports
Outputs: Comprehensive Mapping Package
7. Glossary
7.1. Application Profile
Alternative names: AP, context-specific semantic data specification
Definition: Semantic data specification aimed to facilitate the data exchange in a well-defined application context.
Additional info: It re-uses concepts from one or more semantic data specifications, while adding more specificity, by identifying mandatory, recommended, and optional elements, addressing particular application needs, and providing recommendations for controlled vocabularies to be used.
Source/Reference: SEMIC Style Guide
7.2. Conceptual model
Alternative names: conceptual model specification
Definition: An abstract representation of a system that comprises well-defined concepts, their qualities or attributes, and their relationships to other concepts.
Additional info: A system is a group of interacting or interrelated elements that act according to a set of rules to form a unified whole.
Source/Reference: SEMIC Style Guide
7.3. Core Vocabulary
Alternative names: CV
Definition: A basic, reusable and extensible semantic data specification that captures the fundamental characteristics of an entity in a context-neutral fashion.
Additional info: Its main objective is to provide terms to be reused in the broadest possible context.
Source/Reference: SEMIC Style Guide
7.4. Data model
Definition: A structured representation of data elements and relationships used to facilitate semantic interoperability within and across domains.
Additional info: Data models represent common languages to facilitate semantic interoperability in a data space, including ontologies, data models, schema specifications, mappings and API specifications that can be used to annotate and describe data sets and data services. They are often domain-specific.
Source/Reference: Data Spaces Blueprint
7.5. Information exchange data model
Alternative names: data schema
Definition: A technology-specific framework for data exchange, detailing the syntax, structure, data types, and constraints necessary for effective data communication between systems. It serves as a practical blueprint for implementing an application profile in specific data exchange contexts.
Additional info: An ontology and an exchange data model serve distinct yet complementary roles across different abstraction levels within data management systems. While a Data Schema specifies the technical structure for storing and exchanging data, primarily concerned with the syntactical and structural aspects of data, it is typically articulated using metamodel standards such as JSON Schema and XML Schema.
In contrast, ontologies and data shapes operate at a higher conceptual level, outlining the knowledge and relational dynamics within a particular domain without delving into the specifics of data storage or structural implementations. Although a Data Schema can embody certain elements of an ontology or application profile—particularly attributes related to data structure and cardinalities necessary for data exchange—it does not encapsulate the complete semantics of the domain as expressed in an ontology.
Thus, while exchange data models are essential for the technical realisation of data storage and exchange, they do not replace the broader, semantic understanding provided by ontologies. The interplay between these layers ensures that data schemas contribute to a holistic data management strategy by providing the necessary structure and constraints for data exchange, while ontologies offer the overarching semantic framework that guides the meaningful interpretation and utilisation of data across systems. Together, they facilitate a structured yet semantically rich data ecosystem conducive to advanced data interoperability and effective communication.
Source/Reference: Data Spaces Blueprint
7.6. Data specification artefact
Alternative names: specification artefact, artefact
Definition: A materialisation of a semantic data specification in a concrete representation that is appropriate for addressing one or more concerns (e.g. use cases, requirements).
Source/Reference: SEMIC Style Guide
7.7. Data specification document
Alternative names: specification document
Definition: The human-readable representation of an ontology, a data shape, or a combination of both.
Additional info: A semantic data specification document is created with the objective of making it simple for the end-user to understand (a) how a model encodes knowledge of a particular domain, and (b) how this model can be technically adopted and used for a purpose. It is to serve as technical documentation for anyone interested in using (e.g. adopting or extending) a semantic data specification.
Source/Reference: SEMIC Style Guide
7.8. Data shape specification
Alternative names: data shape constraint specification, data shape constraint, data shape
Definition: A set of conditions on top of an ontology, limiting how the ontology can be instantiated.
Additional info: The conditions and constraints that apply to a given ontology are provided as shapes and other constructs expressed in the form of an RDF graph. We assume that the data shapes are expressed in SHACL language.
Source/Reference: SEMIC Style Guide
7.9. Ontology
Alternative names: ontology specification
Definition: A formal specification describing the concepts and relationships that can formally exist for an agent or a community of agents (e.g. domain experts).
Additional info: It encompasses a representation, formal naming, and definition of the categories, properties, and relations between the concepts, data, and entities that substantiate one, many, or all domains of discourse.
Source/Reference: SEMIC Style Guide
7.10. Semantic data specification
Alternative names: data specification
Definition: A union of machine- and human-readable artefacts addressing clearly defined concerns, interoperability scope and use cases.
Additional info: A semantic data specification comprises at least an ontology and a data shape (or either of them individually) accompanied by a human-readable data specification document.
Source/Reference: SEMIC Style Guide
7.11. Vocabulary
Definition: An established list of preferred terms that signify concepts or relationships within a domain of discourse. All terms must have an unambiguous and non-redundant definition. Optionally, it may include synonyms, notes, and translations into multiple languages.
7.12. Upper Ontology
Definition: An upper ontology is a highly generalised ontology that includes very abstract concepts applicable across all domains, such as "object," "property," and "relation." Its primary role is to facilitate broad semantic interoperability among numerous domain-specific ontologies by offering a standardised foundational framework. This framework assists in harmonising diverse domain ontologies, allowing for consistent data interpretation and efficient information exchange.