How to map existing data models

In the vast and diverse landscape of knowledge representation and data modelling, sources of knowledge are articulated through various means, each adopting its distinct methodology for expression. This diversity manifests not only in the choice of technology and representation languages, but also in the vocabularies used and the specific models created. Such heterogeneity, while enriching, introduces significant challenges for semantic interoperability—the ability of different systems to understand and use the information seamlessly across various contexts.

The idea of unifying this rich spectrum of knowledge under a single model and a single representation language, though conceptually appealing, is pragmatically unfeasible and, arguably, undesirable. The diversity of knowledge sources is not merely a by-product of historical development, but a reflection of the varied domains, perspectives, and requirements these sources serve.

To navigate this complexity, a more nuanced approach is required—one that seeks to establish connections across models without imposing uniformity. This is where the concepts of ontology mapping and the broader spectrum of model alignment methodologies come into play. Moreover, the mapping endeavour encompasses not only ontological artefacts, but also various technical artefacts—ranging from data shapes defined in SHACL or ShEx, XSD schemas for XML, to JSON Schemas for JSON data. Each of these artefacts represents a different facet of knowledge modelling. Thus, mapping in this broader sense involves creating links between these semantic and technical artefacts and Core Vocabularies.

The past couple of decades have witnessed extensive efforts in ontology and data mapping, resulting in a plethora of tools, methods, and technologies aimed at enhancing semantic interoperability. These endeavours underscore the vast landscape of potential strategies available for mapping. These strategies range from conceptual methodologies that explore the semantic congruence and contextual relevance of entities and relationships, to formal methodologies that operationalise these conceptual mappings as technical data transformation rules. It’s important to acknowledge that there is no one-size-fits-all method; instead, the field offers a spectrum of approaches suited to various needs and contexts.

The subsequent sections delve into the specific methodologies of mapping, both conceptual and formal, providing a blueprint for navigating and bridging the world of semantic and technical artefacts, and empowering stakeholders to make informed decisions that best suit their interoperability needs.

Map an existing Ontology

This section provides detailed instructions for addressing use case UC2.1.

In this section we adopt the definitions of the following concepts from this paper:

  • Ontology matching: the process of finding relationships or correspondences between entities of different ontologies.

  • Ontology alignment: a set of correspondences between two or more ontologies, produced as the outcome of an ontology matching process.

  • Ontology mapping: the oriented, or directed, version of an alignment, i.e., it maps the entities of one ontology to at most one entity of another ontology.

To create an ontology alignment, the following steps need to be observed:

  1. Staging: defining the requirements

  2. Characterisation: defining source and target data and performing data analysis

  3. Reuse: discover, evaluate, and reuse existing alignments

  4. Matching: execute and evaluate matching

  5. Align and map: prepare, create the alignment, and render mappings

  6. Application: make the alignment available for use in applications

This methodology has been used to map the Core Vocabularies to Schema.org. This work is available on the SEMIC GitHub repository dedicated to Semantic Mappings, and documented here. The next sections describe this methodology in more detail.

Staging

This initial phase involves developing a comprehensive understanding of the project’s scope, identifying the specific goals of the mapping exercise, and defining the key requirements it must fulfil. Stakeholders collaborate to articulate the purpose of the ontology or data model alignment, setting clear objectives that will guide the entire process. Defining these requirements upfront ensures that subsequent steps are aligned with the mapping exercise’s overarching goals and stakeholder expectations, and that they fit the intended use cases.

Inputs: Stakeholder knowledge, project goals, available resources, domain expertise.

Outputs: Mapping project specification document comprising a defined mapping project scope and comprehensive list of requirements.

Characterisation

In this stage, a thorough analysis of both source and target ontologies is conducted to ascertain their structures, vocabularies, and the semantics they encapsulate. This involves an in-depth examination of the conceptual frameworks, data representation languages, and any existing constraints within both models. Understanding the nuances of both the source and target is critical for identifying potential challenges and opportunities in the mapping process, ensuring that the alignment is both feasible and meaningful.

The following is an indicative, but not exhaustive, list of aspects to consider in this analysis: specifications documentation, representation language and representation formats, deprecation mechanism, inheritance policy (single inheritance only, or multiple inheritance also allowed), natural language(s) used, label specification, label conventions, definition specification, definition conventions, version management and release cycles, etc.

Inputs: Source and target ontologies, initial requirements, domain constraints.

Outputs: Analysis reports comprising a comparative characterisation table, identified difficulties, risks and amenability assessments, selected source and target for mapping.

Reuse

In the ontology mapping lifecycle, the reuse stage is pivotal, facilitating the integration of pre-existing alignments into the project’s workflow. Following the initial characterisation, this stage entails discovery and a rigorous evaluation of available alignments against the project’s defined requirements. These requirements are instrumental in appraising whether an existing alignment can be directly adopted, necessitates modifications for reuse, or if a new alignment should be constructed from the ground up.

Ontology alignments are often expressed in Alignment Format (AF) or EDOAL. An example statement in a Turtle file representing an ontology alignment (taken from the Core Business Vocabulary to Schema.org alignment), could look something like this:

@prefix align:  <http://knowledgeweb.semanticweb.org/heterogeneity/alignment#> .
@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .
@prefix sssom:  <https://w3id.org/sssom/> .
@prefix semapv: <https://w3id.org/semapv/vocab/> .

<http://mapping.semic.eu/business/sdo/cell/1> a align:Cell;
  align:entity1 <http://www.w3.org/ns/locn#Address>;
  align:entity2 <https://schema.org/PostalAddress>;
  align:relation "=";
  align:measure "1"^^xsd:float;
  owl:annotatedProperty owl:equivalentClass;
  sssom:mapping_justification semapv:MappingReview .

The outcome of this stage splits into three distinct pathways:

  • direct reuse of alignments that are immediately applicable,

  • adaptive reuse where existing alignments provide a partial fit and serve as a basis for refinement (i.e. we can improve certain alignment statements, or add new statements in the alignment map), and

  • the initiation of a new alignment when existing resources are not suitable.

This structured approach to reuse optimises resource utilisation, promotes efficiency, and tailors the mapping process to the project’s unique objectives.

Inputs: Repository of existing alignments (for the source and target ontologies), evaluation criteria based on requirements.

Outputs: Assessment report on existing alignments, decisions on reuse, adaptation, or creation of a new alignment.

Matching

This section delves into automatic and semi-automatic approaches to finding alignment candidates. However, for small vocabularies and ontologies, a fully manual effort is likely more efficient.

Utilising both automated tools and manual expertise, this phase focuses on identifying potential correspondences between entities in the source and target models. The matching process may employ various methodologies, including semantic similarity measures, pattern recognition, or lexical analysis, to propose candidate alignments. These candidates are then critically evaluated for their accuracy, relevance, and completeness, ensuring they meet the predefined requirements and are logically sound. This stage is delineated into three main activities: planning, execution, and evaluation.

In the planning activity, the approach to ontology matching is meticulously strategised. The planning encompasses selecting appropriate algorithms and methods, fine-tuning parameters, determining thresholds for similarity and identity functions, and setting evaluative criteria. These preparations are informed by a thorough understanding of the project’s requirements and the outcomes of previous reuse evaluations.

Numerous well-established ontology matching algorithms have been extensively reviewed in the literature (for in-depth discussions, see paper). The main classes of ontology matching techniques are listed below in the order of their relevance to this handbook:

  • Terminological techniques draw on the textual content within ontologies, such as entity labels and comments, employing methods from natural language processing and information retrieval, including string distances and statistical text analysis.

  • Structural techniques analyse the relationships and constraints between ontology entities, using methods like graph matching to explore the topology of ontology structures.

  • Semantic techniques apply formal logic and inference to deduce the implications of proposed alignments, aiding in the expansion of alignments or detection of conflicts.

  • Extensional techniques compare entity sets, or instances, potentially involving analysis of shared resources across ontologies to establish similarity measures.

Following planning, the execution activity implements the chosen matchers. Automated or semi-automated tools are deployed to carry out the matching process, resulting in a list of candidate correspondences. This list typically includes suggested links between elements of the source and target ontologies, each with an associated confidence level computed by the algorithms. EDOAL, a representation framework for expressing such correspondences, is commonly utilised to encapsulate these potential alignments.
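For illustration, a single candidate correspondence produced by a matcher could be expressed in the Alignment Format as follows (prefixes as declared in the earlier excerpt); the cell URI, the matched entities and the confidence value are purely hypothetical:

<http://mapping.example.org/candidate/cell/42> a align:Cell;
  align:entity1 <http://www.w3.org/ns/org#Organization>;
  align:entity2 <https://schema.org/Organization>;
  align:relation "=";
  # confidence computed by the matcher; not yet reviewed by a human expert
  align:measure "0.87"^^xsd:float .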

Finally, in the evaluation activity, the candidate correspondences are rigorously assessed for their suitability. The evaluation measures the candidates against the project’s specific needs, scrutinising their accuracy, relevance, and alignment with the predefined requirements. This assessment ensures that only the most suitable correspondences are carried forward for the creation of an alignment, thereby upholding the integrity and logical soundness of the mapping process.

Tools: Tools such as Silk can be used in this stage.

Inputs: Matcher configurations, additional resources (if any), candidate correspondences from previous matching iterations.

Outputs: Generated candidate correspondences, evaluation reports, finalised list of potential alignments.

Align and Map

Following the identification of suitable matches, this step involves the formal creation of the alignment and the rendering (generation) of specific mappings between the source and target models. This phase encompasses preparation, creation, and rendering activities that solidify the relationships between ontology entities into a coherent alignment and actionable mappings. The resulting alignment is then documented, detailing the rationale, methods used, and any assumptions made during the mapping process.

The alignment process should be considered as part of the governance of a vocabulary or ontology that would include engaging communication with third parties to validate the alignment. Furthermore, the process has technical implications that should be evaluated upfront such as the machine interpretation and execution of the mapping.

Preparation involves stakeholder consensus on the Alignment Plan. This plan guides stakeholders through the systematic refinement of candidate correspondences, considering not only the relevance of the matches, but also the type of relationship between the elements. This plan might include the removal of irrelevant correspondences or strategic amendments to existing relationships. The chosen candidate correspondences are those that have been determined to be an adequate starting point for the alignment. The type of asset—be it an ontology, controlled list, or data shape—dictates the nature of the relationship that can be rendered from the alignment. The table below elucidates potential relationship types that can be established:

Relation / Element type | Property | Concept | Class | Individual
= | owl:equivalentProperty; owl:sameAs | skos:exactMatch; skos:closeMatch | owl:equivalentClass; owl:sameAs | owl:sameAs
> | | skos:narrowMatch | |
< | rdfs:subPropertyOf | skos:broadMatch | rdfs:subClassOf |
% | owl:propertyDisjointWith | | owl:disjointWith | owl:differentFrom
instanceOf | rdf:type | skos:broadMatch; rdf:type | rdf:type | rdf:type
hasInstance | | skos:narrowMatch | |

This table is indicative of the variety of semantic connections that can be realised, ranging from equivalence and subclass relations to disjointness and type instantiation. This nuanced approach to the preparation stage is essential in ensuring that the eventual alignment and rendered mapping accurately represent the semantic intricacies of the relationships defined in the project scope, thereby fulfilling the project’s defined requirements.

The Creation step is the execution of the Alignment Plan: human experts select and refine the candidate correspondences, turning them into a deliberate alignment. The selection is conducted manually, according to the project’s objectives.

Rendering translates the refined alignment into a mapping—a directed version that can be interpreted and executed by software agents. This process is straightforward, producing a machine-executable artefact. Most often this is a simple export of the alignment statements from the editing tool or the materialisation of the alignment in a triple store. Multiple renderings may be created from the same alignment, accommodating the need for various formalisms.
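As a minimal illustration, the alignment cell shown in the Reuse section (relating locn:Address to schema.org's PostalAddress with the "=" relation) could be rendered as the following directed, machine-interpretable statement in Turtle (assuming the usual owl: prefix); other renderings of the same alignment, targeting different formalisms, remain possible:

<http://www.w3.org/ns/locn#Address> owl:equivalentClass <https://schema.org/PostalAddress> .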
Tools: Tools such as VocBench3 can be used in this stage, but more generic ones, such as MS Excel or Google Sheets spreadsheets, can be used as well.

Inputs: Evaluated candidate correspondences, stakeholders' amendment plans, requirements for the formalism of the mapping.

Outputs: Created alignment and mapping, alignment amendment strategy, stored versions in an alignment repository.

Application

The final stage focuses on operationalising the created alignment, ensuring it is accessible and usable by applications that require semantic interoperability between the mapped models. This involves publishing the alignment in a standardised, machine-readable format and integrating it within ontology management or data integration tools.

Additionally, mechanisms for maintaining, updating, and governing the alignment are established, facilitating its long-term utility and relevance.

Moreover, this stage involves the creation of maintenance protocols to preserve the alignment’s relevance over time. This includes procedures for regular updates in response to changes in ontology structures or evolving requirements, as well as governance mechanisms to oversee these adaptations. As the mapping is applied, new insights may emerge, prompting discussions within the stakeholder community about potential refinements or the development of a new iteration of the mapping. The dynamic nature of data sources means that the application stage is both an endpoint and a starting point for continuous improvement. Some processes may be automated to enhance efficiency, such as the monitoring of ontologies for changes that would necessitate updates to the mapping.

Inputs: Finalised mappings, application context, feedback mechanisms.

Outputs: Applied mappings in use, insights from application, triggers for potential updates, governance actions for lifecycle management.

Map an existing XSD schema

This section provides detailed instructions for addressing use case UC2.2.

To create an XSD mapping, one first needs to decide on the purpose and level of specificity of the XSD schema mapping. It can range from a lightweight alignment at the vocabulary level down to a fully fledged, executable set of data transformation rules.

In this section we describe a methodology that covers both the conceptual mapping and the technical mapping for data transformation.

Figure 3 depicts a workflow for creating an XSD schema mapping, segmented into four distinct phases:

  1. Create a Conceptual Mapping, so that business and domain experts can validate the correspondences;

  2. Create a Technical Mapping, so that the data can be automatically transformed;

  3. Validate the mapping rules to ensure consistency and accuracy;

  4. Disseminate the mapping rules to be applied in the foreseen use cases.

Figure 3: Workflow for creating an XSD schema mapping

Before initiating the mapping development process, it is crucial to construct a representative test dataset. This dataset should consist of a carefully selected set of XML files that cover the important scenarios and use cases encountered in the production data. It should be comprehensive yet sufficiently compact to facilitate rapid transformation cycles, enabling effective testing iterations.

Conceptual Mapping development

Conceptual Mapping in semantic data integration can be established at two distinct levels: the vocabulary level and the application profile level. These levels differ primarily in their complexity and specificity regarding the data context they address.

Vocabulary Level mapping is established using basic XML elements. This form of mapping aims for a terminological alignment, meaning that an XML element or attribute is directly mapped to an ontology class or property. For example, an XML element <PostalAddress> could be mapped to the locn:Address class, or an element <surname> could be mapped to the foaf:familyName property in the FOAF ontology. Such a mapping can be established as a simple spreadsheet, as shown below. This approach results in a simplistic and direct alignment, which lacks contextual depth and specificity; for this reason, the remaining steps of this methodology cannot be applied to it.
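An indicative vocabulary-level mapping spreadsheet, using the examples above, could be as simple as:

XML element / attribute | Target ontology term
<PostalAddress> | locn:Address
<surname> | foaf:familyName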

A more advanced approach would be to embed semantic annotations into XSD schemas using standards such as SAWSDL. Such an approach is appropriate in the context of WSDL services.

Application Profile Level of conceptual mapping utilises XPath to guide access to data in XML structures, enabling precise extraction and contextualisation of data before mapping it to specific ontology fragments. An ontology fragment is usually expressed as a SPARQL Property Path (or simply Property Path). This Property Path facilitates the description of instantiation patterns specific to the Application Profile. This advanced approach allows for context-sensitive semantic representations, crucial for accurately reflecting the nuances in interpreting the meaning of data structures.

The two examples below show how to map an organisation’s address, specifically its city and postal code. They show where the data can be extracted from, and how it can be mapped to target ontology properties such as locn:postName and locn:postCode. To ensure that this address is not mapped in a vacuum, but is linked to an organisation instance (and not, for example, a person), the mapping is anchored in an instance ?this of org:Organization. Optionally, a class path can be provided to complement the property path and explicitly state the class sequence, which can otherwise be deduced from the Application Profile definition.

Source XPath: */efac:Company/cac:PostalAddress/cbc:PostalZone
Target Property Path: ?this cv:registeredAddress / locn:postCode ?value .
Target Class Path: org:Organization / locn:Address / rdf:PlainLiteral

Source XPath: */efac:Company/cac:PostalAddress/cbc:CityName
Target Property Path: ?this cv:registeredAddress / locn:postName ?value .
Target Class Path: org:Organization / locn:Address / rdf:PlainLiteral

Inputs: XSD Schemas, Ontologies, SHACL Data Shapes, Source and Target Documentation, Sample XML data

Outputs: Conceptual Mapping Spreadsheet

Technical Mapping development

The technical mapping step is a critical phase in the mapping process, serving as the bridge between conceptual design and practical, machine-executable implementation. This step takes as input the conceptual mapping, which has been crafted and validated by domain experts or data-savvy business stakeholders, and which establishes correspondences between XPath expressions and ontology fragments.

When it comes to representing these mappings technically, several technology options are available (paper), such as XSLT, RML, and SPARQLAnything, but the RDF Mapping Language (RML) stands out for its effectiveness and straightforward approach. RML allows for the representation of mappings from heterogeneous data formats such as XML, JSON, relational databases and CSV into RDF, supporting the creation of semantically enriched data models. RML mappings can be expressed in Turtle or in the YARRRML dialect, a user-friendly text-based format based on YAML, making the mappings accessible to both machines and humans. RML is supported by robust implementations such as RMLMapper and RMLStreamer. RMLMapper is adept at handling batch processing of data, transforming large datasets efficiently. RMLStreamer, on the other hand, excels in streaming scenarios, where data needs to be processed in real time, providing flexibility and scalability in dynamic environments.

The development of the mapping rules is straightforward thanks to the conceptual mapping that is already available. The Conceptual Mapping (CM) clarifies to which class and property each XML element should be mapped, and how. RML mapping statements are then created for each class of the target ontology, coupled with the property-object mapping statements specific to that class. Furthermore, it is essential to master RML along with XML technologies such as XSD, XPath, and XQuery to implement the mappings effectively (rml-gen).

An additional step involves deciding on a URI creation policy and designing a uniform scheme for use in the generated data, ensuring consistency and coherence in the data output.
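To make this concrete, a minimal RML sketch (in Turtle) covering the two conceptual mapping rules above might look as follows. The source file name, the subject URI templates, the identifier XPath used in those templates, and the cv: prefix binding are illustrative assumptions; namespace bindings for the XPath prefixes (efac, cac, cbc) must additionally be configured for the RML processor and are omitted here.

@prefix rr:   <http://www.w3.org/ns/r2rml#> .
@prefix rml:  <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:   <http://semweb.mmlab.be/ns/ql#> .
@prefix org:  <http://www.w3.org/ns/org#> .
@prefix locn: <http://www.w3.org/ns/locn#> .
@prefix cv:   <http://data.europa.eu/m8g/> .   # assumed prefix binding for the Core Vocabularies terms

<#OrganisationMap> a rr:TriplesMap ;
  rml:logicalSource [
    rml:source "test-data/sample-notice.xml" ;                 # hypothetical test file
    rml:referenceFormulation ql:XPath ;
    rml:iterator "/*/efac:Company"                             # one iteration per company record
  ] ;
  rr:subjectMap [
    # hypothetical URI policy: mint the organisation URI from a company identifier
    rr:template "http://data.example.org/organisation/{cac:PartyIdentification/cbc:ID}" ;
    rr:class org:Organization
  ] ;
  rr:predicateObjectMap [
    rr:predicate cv:registeredAddress ;
    # the address URI is minted from the same identifier, keeping the two maps linked
    rr:objectMap [ rr:template "http://data.example.org/address/{cac:PartyIdentification/cbc:ID}" ]
  ] .

<#AddressMap> a rr:TriplesMap ;
  rml:logicalSource [
    rml:source "test-data/sample-notice.xml" ;
    rml:referenceFormulation ql:XPath ;
    rml:iterator "/*/efac:Company"
  ] ;
  rr:subjectMap [
    rr:template "http://data.example.org/address/{cac:PartyIdentification/cbc:ID}" ;
    rr:class locn:Address
  ] ;
  rr:predicateObjectMap [
    rr:predicate locn:postCode ;
    rr:objectMap [ rml:reference "cac:PostalAddress/cbc:PostalZone" ]   # XPath relative to the iterator
  ] ;
  rr:predicateObjectMap [
    rr:predicate locn:postName ;
    rr:objectMap [ rml:reference "cac:PostalAddress/cbc:CityName" ]
  ] .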

A viable alternative to RML is XSLT technology, which offers a powerful, but low-level method for defining technical mappings. While this method allows for high expressiveness and complex transformations, it also increases the potential for errors due to its intricate syntax and operational complexity. This technology excels in scenarios requiring detailed manipulation and parameterisation of XML documents, surpassing the capabilities of RML in terms of flexibility and depth of transformation rules that can be implemented. However, the detailed control it affords means that developers must have a high level of expertise in semantic technologies and exercise caution and precision to avoid common pitfalls associated with its use.

A pertinent example of XSLT’s application is the tool for transforming ISO-19139 metadata to the DCAT-AP geospatial profile (GeoDCAT-AP) in the framework of INSPIRE and the EU ISA Programme. This XSLT script is configurable to accommodate transformation with various operational parameters such as the selection between core or extended GeoDCAT-AP profiles and specific spatial reference systems for geometry encoding, showcasing its utility in precise and tailored data manipulation tasks.

Inputs: Conceptual Mapping spreadsheet, sample XML data

Outputs: Technical Mapping source code, sample data transformed into RDF

Validation

After transforming the sample XML data into RDF, two primary validation methods are employed to ensure the integrity and accuracy of the data transformation: SPARQL-based validation and SHACL-based validation. These are distinct but complementary approaches to ensuring data integrity and conformity within semantic technologies, each serving a different function.

The SPARQL-based validation method utilises SPARQL ASK queries, which are derived from the SPARQL Property Path expressions (and complementary Class paths) outlined in the conceptual mapping. These expressions serve as assertions that test specific conditions or patterns within the RDF data corresponding to each conceptual mapping rule. By executing these queries, it is possible to confirm whether certain data elements and relationships have been correctly instantiated according to the mapping rules. The ASK queries return a boolean value indicating whether the RDF data meets the conditions specified in the query, thus providing a straightforward mechanism for validation. This confirms that the conceptual mapping is implemented correctly in a technical mapping rule.

For example, for the mapping rules above the following assertions can be derived:

ASK {
 ?this a org:Organization .
 ?this cv:registeredAddress / locn:postName ?value .
}

ASK {
  ?this a org:Organization .
  ?this cv:registeredAddress / locn:postCode ?value .
}

The SHACL-based validation method provides a more comprehensive framework for validating RDF data. In this approach, data shapes are defined according to the constraints and structures expected in the RDF output, as specified by the mapped Application Profile. These shapes act as templates that the RDF data must conform to, covering various aspects such as data types, relationships, cardinality, and more. A SHACL validation engine processes the RDF data against these shapes, identifying any deviations or errors that indicate non-conformity with the expected data model.
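For instance, a small SHACL shapes sketch corresponding to the mapping rules used throughout this section could look as follows; the cardinalities and datatypes shown are illustrative assumptions rather than normative Core Vocabulary constraints:

@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix org:  <http://www.w3.org/ns/org#> .
@prefix locn: <http://www.w3.org/ns/locn#> .
@prefix cv:   <http://data.europa.eu/m8g/> .   # assumed prefix binding
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

<#OrganisationShape> a sh:NodeShape ;
  sh:targetClass org:Organization ;
  sh:property [
    sh:path cv:registeredAddress ;
    sh:class locn:Address ;
    sh:minCount 1                       # every organisation must have a registered address
  ] .

<#AddressShape> a sh:NodeShape ;
  sh:targetClass locn:Address ;
  sh:property [
    sh:path locn:postName ;
    sh:datatype xsd:string ;
    sh:minCount 1
  ] ;
  sh:property [
    sh:path locn:postCode ;
    sh:datatype xsd:string ;
    sh:minCount 1
  ] .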

SHACL is an ideal choice for ensuring adherence to broad data standards and interoperability requirements. This form of validation is independent of the manner in which data mappings are constructed, focusing instead on whether the data conforms to established semantic models at the end-state. It provides a high-level assurance that data structures and content meet the specifications designed to facilitate seamless data integration and interactions across various systems.

Conversely, SPARQL-based validation is tightly linked to the mapping process itself, offering a granular, rule-by-rule validation that ensures each data transformation aligns precisely with the expert-validated mappings. It is particularly effective in confirming the accuracy of complex mappings and ensuring that the implemented data transformations faithfully reflect the intended semantic interpretations, thus providing a comprehensive check on the fidelity of the mapping process.

Inputs: Sample data transformed into RDF, Conceptual Mapping, SHACL data shapes

Outputs: Validation reports

Dissemination

Once the conceptual and technical mappings have been completed and validated, they can be packaged for dissemination and deployment. The purpose of disseminating mapping packages is to facilitate their controlled use for data transformation, ensure the ability to trace the evolution of mapping rules, and standardise the exchange of such rules. This structured approach allows for efficient and reliable data transformation processes across different systems.

A comprehensive mapping package typically includes:

  • Conceptual Mapping Files: Serve as the core documentation, outlining the rationale and structure behind the mappings to ensure transparency and ease of understanding.

  • Technical Mapping Files: Contain all the mapping code files (XSLT[ref], RML[ref], SPARQLAnything[ref], etc., depending on the chosen mapping technology) for data transformation, allowing for the practical application of the conceptual designs.

  • Additional Mapping Resources: Such as controlled lists, value mappings, or correspondence tables, which are crucial for the correct interpretation and application of the mapping code. These are stored in a dedicated resources subfolder.

  • Test Data Sets: Carefully selected and representative XML files that cover various scenarios and cases. These test datasets are crucial for ensuring that the mappings perform as expected across a range of real-world data.

  • Factory Acceptance Testing (FAT) Reports: Document the testing outcomes based on the SPARQL and SHACL validations to guarantee that the mappings meet the expected standards before deployment. The generation of these reports should be supported by automation, as manual generation would require excessive effort and cost.

  • Tests Used for FAT Reports: The package also incorporates the actual SPARQL assertions and SHACL shapes used in generating the FAT reports, providing a complete view of the validation process.

  • Descriptive Metadata: Contains essential data about the mapping package, such as identification, title, description, and versions of the mapping, ontology, and source schemas. This metadata aids in the management and application of the package.

This package is designed to be self-contained, ensuring that it can be immediately integrated and operational within various data transformation pipelines. The included components support not only the application, but also the governance of the mappings, ensuring they are maintained and utilised correctly in diverse IT environments. This systematic packaging addresses critical needs for usability, maintainability, and standardisation, which are essential for widespread adoption and operational success in data transformation initiatives.
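An indicative layout for such a package, with purely illustrative folder and file names, could be:

mapping-package/
  metadata.json                descriptive metadata: identification, title, description, versions
  conceptual-mapping/
    conceptual-mapping.xlsx    the Conceptual Mapping spreadsheet
  technical-mapping/
    mappings/                  RML, XSLT or other mapping code files
    resources/                 controlled lists, value mappings, correspondence tables
  test-data/
    source/                    representative sample XML files
    expected-output/           sample data transformed into RDF
  validation/
    sparql-assertions/         ASK queries derived from the Conceptual Mapping
    shacl-shapes/              data shapes for the mapped Application Profile
    fat-reports/               Factory Acceptance Testing reports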

Inputs: Conceptual Mapping spreadsheet, Technical Mapping source code, Ontologies, SHACL data shapes, Sample XML data, Sample data transformed into RDF, Validation reports

Outputs: Comprehensive Mapping Package