SEMIC

MLDCAT-AP

Status
Working Draft
Published at
2023-01-13
This version
https://semiceu.github.io/MLDCAT-AP/releases/1.0.0

Summary

This application profile extends the use of DCAT-AP. It was originally envisaged for the description of machine learning processes and was developed in collaboration with OpenML.

Status of this document

This specification has the status of Working Draft and was published on 2023-01-13.

Entities

Agent

Definition
Any entity carrying out actions with respect to the (Core) entities Catalogue, Datasets, Data Services and Distributions. If the Agent is an organisation, the use of the Organization Ontology is recommended.
Properties
For this entity the following properties are defined: name.
Property Expected Range Cardinality Definition Usage Codelist
name Literal 1..* This property contains a name of the agent. This property can be repeated for different versions of the name (e.g. the name in different languages)

Catalogue

Definition
A catalogue or repository that hosts the Datasets or Data Services being described.
Properties
For this entity the following properties are defined: dataset, description, publisher, record, service, title.
Property Expected Range Cardinality Definition Usage Codelist
dataset Dataset 0..* This property links the Catalogue with a Dataset that is part of the Catalogue. As empty Catalogues are usually indications of problems, this property should be combined with the next property service to implement an empty Catalogue check.
description Literal 1..* This property contains a free-text account of the Catalogue. This property can be repeated for parallel language versions of the description.
publisher Agent 1 This property refers to an entity (organisation) responsible for making the Catalogue available.
record Catalogue Record 0..* This property refers to a Catalogue Record that is part of the Catalogue.
service Data Service 0..* This property refers to a site or end-point (Data Service) that is listed in the Catalogue. As empty Catalogues are usually indications of problems, this property should be combined with the previous property dataset to implement an empty Catalogue check.
title Literal 1..* This property contains a name given to the Catalogue. This property can be repeated for parallel language versions of the name.
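The empty Catalogue check mentioned for the dataset and service properties can be sketched as follows. This is a minimal illustration, not part of the specification: it assumes a Catalogue is held as a plain Python dictionary keyed by the property labels used above, and the function name `is_empty_catalogue` is hypothetical.

```python
from typing import Dict, List


def is_empty_catalogue(catalogue: Dict[str, List[str]]) -> bool:
    """Return True when the Catalogue lists neither Datasets nor Data Services.

    Both properties have cardinality 0..*, so each may be absent on its own.
    Only the combination of both being empty signals a likely problem.
    """
    datasets = catalogue.get("dataset", [])
    services = catalogue.get("service", [])
    return not datasets and not services


# Illustrative records; the identifiers are made up for this sketch.
print(is_empty_catalogue({"title": ["Demo"], "dataset": [], "service": []}))
print(is_empty_catalogue({"title": ["Demo"], "service": ["svc-1"]}))
```

Checking the two properties together avoids flagging a Catalogue that only lists Data Services, or only Datasets, as empty.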

Catalogue Record

Definition
A description of an entry of a Dataset in the Catalogue.
Properties
For this entity the following properties are defined: description, description version, modification date, primary topic.
Property Expected Range Cardinality Definition Usage Codelist
description Literal 0..* This property contains a free-text account of the record. This property can be repeated for parallel language versions of the description.
description version Literal 0..1 This property refers to the version of the description. Use '1' for the original version.
modification date Literal 1 This property contains the most recent date on which the Catalogue entry was changed or modified.
primary topic Dataset 1 This property links the Catalogue Record to the Dataset, Data Service or Catalogue described in the record.

Checksum

Definition
A value that allows the contents of a file to be authenticated. This class allows the results of a variety of checksum and cryptographic message digest algorithms to be represented.
Properties
For this entity the following properties are defined: algorithm, checksum value.
Property Expected Range Cardinality Definition Usage Codelist
algorithm ChecksumAlgorithm 1 This property identifies the algorithm used to produce the subject Checksum. The members of this property are the supported checksum algorithms.
checksum value HexBinary 1 This property provides a lower case hexadecimal encoded digest value produced using a specific algorithm.
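As an illustration of the two Checksum properties, the sketch below computes a lower case hexadecimal digest with Python's standard `hashlib` module and pairs it with the algorithm name. The dictionary layout and the helper names `make_checksum` and `verify_checksum` are assumptions made for this example; the specification itself only fixes the properties and their ranges.

```python
import hashlib
from typing import Dict


def make_checksum(content: bytes, algorithm: str = "sha256") -> Dict[str, str]:
    """Build a Checksum record for the given file content.

    hashlib's hexdigest() already returns lower case hexadecimal,
    which matches the encoding required for 'checksum value'.
    """
    digest = hashlib.new(algorithm, content).hexdigest()
    return {"algorithm": algorithm, "checksum value": digest}


def verify_checksum(content: bytes, checksum: Dict[str, str]) -> bool:
    """Recompute the digest and compare it with the stored value."""
    expected = checksum["checksum value"].lower()
    return make_checksum(content, checksum["algorithm"])["checksum value"] == expected


record = make_checksum(b"a,b\n1,2\n")
print(record["algorithm"], record["checksum value"])
print(verify_checksum(b"a,b\n1,2\n", record))   # unchanged content
print(verify_checksum(b"a,b\n1,3\n", record))   # modified content
```

The algorithm name is passed as a plain string here. In a real description it would be a value drawn from the supported ChecksumAlgorithm members.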

Collection

Definition
A collection is a group of tasks or runs, gathered so that they can easily be referred to and shared with others.
Properties
For this entity the following properties are defined: creation date, description, has uploader, id, name, visibility.
Property Expected Range Cardinality Definition Usage Codelist
creation date Literal 1
description Literal 0..1
has uploader Agent 1
id Literal 1
name Literal 1
visibility Concept 1 The recommended controlled vocabulary is the Collection visibility code list.

Cost Matrix

Definition
Properties
No properties have been defined for this entity.

Data Quality List

Definition
The list of qualities to be uploaded.
Properties
For this entity the following properties are defined: did, evaluation engine id, includes.
Property Expected Range Cardinality Definition Usage Codelist
did Literal 1 Pointer to the did.
evaluation engine id Literal 0..1 The engine responsible for extracting the features.
includes Quality 0..* The qualities that need to be set.

Data Service

Definition
A collection of operations that provides access to one or more datasets or data processing functions.
Properties
For this entity the following properties are defined: endpoint URL, serves dataset, title.
Property Expected Range Cardinality Definition Usage Codelist
endpoint URL Resource 1..* The root location or primary endpoint of the service (an IRI).
serves dataset Dataset 0..* This property refers to a collection of data that this data service can distribute.
title Literal 1..* This property contains a name given to the Data Service. This property can be repeated for parallel language versions of the name.

Dataset

Definition
A conceptual entity that represents the information published.
Properties
For this entity the following properties are defined: access rights, collectionDate, contributor, creator, description, distribution, has version, identifier, is referenced by, is version of, issued, keyword, landing page, publisher, status, title, version info, versionLabel, visibility.
Property Expected Range Cardinality Definition Usage Codelist
access rights RightsStatement 0..1 This property refers to information that indicates whether the Dataset is open data, has access restrictions or is not public. The recommended controlled vocabulary is the Access Rights Named Authority List.
collectionDate Literal 1 The date the data was originally collected, given by the uploader.
contributor Agent 0..* People who contributed to the current version of the dataset (e.g. reformatting).
creator Agent 0..1 This property refers to the entity responsible for producing the dataset.
description Literal 1..* This property contains a free-text account of the Dataset. This property can be repeated for parallel language versions of the description.
distribution Distribution 0..* This property links the Dataset to an available Distribution.
has version Dataset 0..* This property refers to a related Dataset that is a version, edition, or adaptation of the described Dataset.
identifier Literal 0..* This property contains the main identifier for the Dataset, e.g. the URI or other unique identifier in the context of the Catalogue.
is referenced by Resource 0..* This property is about a related resource, such as a publication, that references, cites, or otherwise points to the dataset.
is version of Dataset 0..* This property refers to a related Dataset of which the described Dataset is a version, edition, or adaptation.
issued Literal 0..1 This property contains the date of formal issuance (e.g., publication) of the Dataset.
keyword Literal 0..* This property contains a keyword or tag describing the Dataset.
landing page Document 0..* This property refers to a web page that provides access to the Dataset, its Distributions and/or additional information. It is intended to point to a landing page at the original data provider, not to a page on a site of a third party, such as an aggregator.
publisher Agent 0..1 This property refers to an entity (organisation) responsible for making the Dataset available.
status Concept 0..1 The status of the dataset in the context of the publication process. The recommended controlled vocabulary is the Dataset status code list.
title Literal 1..* This property contains a name given to the Dataset. This property can be repeated for parallel language versions of the name.
version info Literal 0..1 This property contains a version number or other version designation of the Dataset.
versionLabel Literal 0..1 Version label provided by user, something relevant to the user. Can also be a date, hash, or some other type of id.
visibility Concept 0..1 Who can see the dataset. Typical values: 'Everyone','All my friends','Only me'. Can also be any of the user's circles. The recommended controlled vocabulary is the Dataset visibility code list.
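The Cardinality column can be read as a minimum and maximum number of values per property. The sketch below shows one way to check a Dataset description against a subset of the cardinalities listed above. It assumes the description is a dictionary mapping property labels to lists of values; `parse_cardinality` and `cardinality_violations` are hypothetical helper names, and a production setup would more likely rely on the SHACL template.

```python
from typing import Dict, List, Optional, Tuple

# Cardinalities copied from a subset of the Dataset table above.
DATASET_CARDINALITIES: Dict[str, str] = {
    "collectionDate": "1",
    "description": "1..*",
    "title": "1..*",
    "creator": "0..1",
    "keyword": "0..*",
}


def parse_cardinality(spec: str) -> Tuple[int, Optional[int]]:
    """Turn '1', '0..1' or '1..*' into (min, max); max None means unbounded."""
    if ".." in spec:
        low, high = spec.split("..", 1)
    else:
        low = high = spec  # a bare number means exactly that many values
    return int(low), (None if high == "*" else int(high))


def cardinality_violations(record: Dict[str, List[str]]) -> List[str]:
    """Return the labels of properties whose value count is out of range."""
    problems = []
    for prop, spec in DATASET_CARDINALITIES.items():
        low, high = parse_cardinality(spec)
        count = len(record.get(prop, []))
        if count < low or (high is not None and count > high):
            problems.append(prop)
    return problems


# Illustrative record: two language versions of the title, no collectionDate.
record = {
    "title": ["Example dataset", "Voorbeelddataset"],
    "description": ["An illustrative dataset."],
    "keyword": ["example"],
}
print(cardinality_violations(record))  # ['collectionDate']
```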

Distribution

Definition
A physical embodiment of the Dataset in a particular format.
Properties
For this entity the following properties are defined: access service, access URL, byte size, checksum, checksum, default target attribute, download URL, format, has feature, has policy, has quality, identifier, ignore attribute, language, licence, processing error, processing warning, processingDate, row ID attribute, title.
Property Expected Range Cardinality Definition Usage Codelist
access service Data Service 0..* This property refers to a data service that gives access to the distribution of the dataset.
access URL Resource 1..* This property contains a URL that gives access to a Distribution of the Dataset. The resource at the access URL may contain information about how to get the Dataset.
byte size Literal 0..1 This property contains the size of a Distribution in bytes.
checksum Checksum 0..* This property provides a mechanism that can be used to verify that the contents of a distribution have not changed.
checksum Checksum 0..1 This property provides a mechanism that can be used to verify that the contents of a distribution have not changed. The checksum is related to the downloadURL.
default target attribute Literal 0..1 The default target attribute, if it exists. Can also have multiple values (comma-separated). Of course, tasks can be defined that use another attribute as target.
download URL Resource 0..* This property contains a URL that is a direct link to a downloadable file in a given format.
format MediaTypeOrExtent 0..1 This property refers to the file format of the Distribution. The recommended controlled vocabulary is the EU Vocabularies File Type Named Authority List.
has feature Feature 1..* The attribute or column being part of a distribution.
has policy Policy 0..1 This property refers to the policy expressing the rights associated with the distribution if using the ODRL vocabulary.
has quality Quality 1..* A computed characteristic that describes a distribution.
identifier Literal 0..1 File identifier
ignore attribute Literal 0..* Attributes that should be excluded in modelling, such as identifiers and indexes.
language LinguisticSystem 0..* This property refers to a language used in the Distribution. This property can be repeated if the metadata is provided in multiple languages. The recommended controlled vocabulary is the EU Vocabularies Languages Named Authority List.
licence Licence document 0..1 This property refers to the licence under which the Distribution is made available.
processing error Literal 0..1 Errors discovered while processing the dataset.
processing warning Literal 0..1 Warnings while processing the dataset.
processingDate Literal 0..1 Date of processing.
row ID attribute Literal 0..1 The attribute that represents the row-id column, if present in the dataset.
title Literal 0..* This property contains a name given to the Distribution. This property can be repeated for parallel language versions of the name.
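The default target attribute, ignore attribute and row ID attribute properties together determine which columns are candidates for modelling. The sketch below shows one possible interpretation: it splits a comma-separated default target attribute and removes targets, ignored attributes and the row ID from the column list. The function name `modelling_columns` and the sample column names are assumptions for illustration only.

```python
from typing import List, Optional, Tuple


def modelling_columns(
    columns: List[str],
    default_target: Optional[str] = None,
    ignore: Optional[List[str]] = None,
    row_id: Optional[str] = None,
) -> Tuple[List[str], List[str]]:
    """Return (input features, target attributes) for a distribution.

    default_target may hold several comma-separated attribute names.
    """
    targets = []
    if default_target:
        targets = [t.strip() for t in default_target.split(",") if t.strip()]
    excluded = set(ignore or []) | set(targets)
    if row_id:
        excluded.add(row_id)
    features = [c for c in columns if c not in excluded]
    return features, targets


features, targets = modelling_columns(
    ["row", "name", "age", "income", "class"],
    default_target="class",
    ignore=["name"],
    row_id="row",
)
print(features)  # ['age', 'income']
print(targets)   # ['class']
```

As the definition of default target attribute notes, a task may still choose a different target, so the returned targets are only a default.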

Estimation Procedure

Definition
It defines how models should be evaluated, typically by defining specific kinds of train and test splits.
Subclass of
Measure
Properties
For this entity the following properties are defined: data splits URL, description, has parameter, id, name, type.
Property Expected Range Cardinality Definition Usage Codelist
data splits URL Resource 1
description Literal 1
has parameter Parameter 1..*
id Literal 1
name Literal 1
type Concept 1 The recommended controlled vocabulary is the Estimation Procedure type code list.
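To make the idea of train and test splits concrete, the sketch below generates repeated k-fold cross-validation splits over row indices. K-fold cross-validation is only one possible kind of estimation procedure and is chosen here as an assumption. The `repeat` and `fold` counters echo the property names of the Evaluation entity, but that correspondence is not stated in the specification.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_splits(
    n_rows: int, folds: int, repeats: int = 1, seed: int = 0
) -> Iterator[Tuple[int, int, List[int], List[int]]]:
    """Yield (repeat, fold, train_indices, test_indices) tuples.

    Within one repeat every row appears in exactly one test set.
    """
    rng = random.Random(seed)  # fixed seed keeps the splits reproducible
    for repeat in range(repeats):
        indices = list(range(n_rows))
        rng.shuffle(indices)
        for fold in range(folds):
            test = sorted(indices[fold::folds])  # every folds-th shuffled row
            test_set = set(test)
            train = [i for i in range(n_rows) if i not in test_set]
            yield repeat, fold, train, test


for repeat, fold, train, test in k_fold_splits(n_rows=6, folds=3):
    print(repeat, fold, train, test)
```

Publishing the resulting index lists, for example behind the data splits URL, lets different parties evaluate models on identical splits.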

Evaluation

Definition
Properties
For this entity the following properties are defined: array data, fold, has flow, id, interval end, interval start, label, name, repeat, sample, sample size, stdev, value.
Property Expected Range Cardinality Definition Usage Codelist
array data Literal 0..1
fold Literal 0..1
has flow Flow 0..*
id Literal 0..1
interval end Literal 0..1
interval start Literal 0..1
label Literal 0..1
name Literal 1
repeat Literal 0..1
sample Literal 0..1
sample size Literal 0..1
stdev Literal 0..1
value Literal 0..1

Evaluation Measure

Definition
A way to score the outputs of machine learning models (e.g. predictions).
Subclass of
Measure
Properties
For this entity the following properties are defined: description, implementation, name, value, value bad, value good.
Property Expected Range Cardinality Definition Usage Codelist
description Literal 0..1
implementation Literal 0..1
name Literal 1
value Literal 1
value bad Literal 0..1
value good Literal 0..1
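As a simple instance of scoring predictions, the sketch below computes predictive accuracy and stores it under the name and value properties. Accuracy is used purely as an example measure; the function name and the dictionary layout are assumptions and are not prescribed by this profile.

```python
from typing import Dict, Sequence, Union


def predictive_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that equal the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one prediction is required")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Illustrative labels; three of the four predictions are correct.
y_true = ["a", "b", "a", "b"]
y_pred = ["a", "b", "b", "b"]

measure: Dict[str, Union[str, float]] = {
    "name": "predictive_accuracy",
    "value": predictive_accuracy(y_true, y_pred),
}
print(measure)  # {'name': 'predictive_accuracy', 'value': 0.75}
```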

Feature

Definition
The attribute or column in which a distribution is structured.
Properties
For this entity the following properties are defined: description, name, type.
Property Expected Range Cardinality Definition Usage Codelist
description Literal 0..1 An explanation of the attribute.
name Literal 1 The label of the attribute.
type Concept 1 A classification of the attribute. The recommended controlled vocabulary is the Feature type code list.

Flow

Definition
Machine learning pipelines, neural architectures, or (untrained) machine learning models in general.
Usage
Flows contain all the information necessary to build a model, including its exact structure and any software dependencies. Given a flow, supported machine learning libraries can reproduce the model exactly.
Properties
For this entity the following properties are defined: class name, custom name, description, external version, has dependency, has flow parameter, has uploader, id, name, status, tag, uploaded, version.
Property Expected Range Cardinality Definition Usage Codelist
class name Literal 0..1
custom name Literal 0..1
description Literal 0..1
external version Literal 0..1
has dependency Library 1..*
has flow parameter Flow Parameter 0..*
has uploader Agent 0..1
id Literal 0..1
name Literal 1
status Concept 1 The recommended controlled vocabulary is the Flow status code list.
tag Literal 0..*
uploaded Literal 1
version Literal 0..1

Flow Parameter

Definition
Properties
For this entity the following properties are defined: default value, description, name, recommended range, type.
Property Expected Range Cardinality Definition Usage Codelist
default value Literal 0..1
description Literal 0..1
name Literal 1
recommended range Literal 0..1
type Concept 0..1 The recommended controlled vocabulary is the Flow Parameter type code list.

Library

Definition
Properties
For this entity the following properties are defined: name, version.
Property Expected Range Cardinality Definition Usage Codelist
name Literal 1
version Literal 1

Licence document

Definition
A legal document giving official permission to do something with a resource.
Properties
No properties have been defined for this entity.

Measure

Definition
Properties
No properties have been defined for this entity.

Output File Description

Definition
The degree or extent of something, as determined by measurement or calculation.
Properties
For this entity the following properties are defined: id, name, url.
Property Expected Range Cardinality Definition Usage Codelist
id Literal 1
name Literal 1
url Literal 1

Output File Prediction

Definition
Properties
For this entity the following properties are defined: id, name, url.
Property Expected Range Cardinality Definition Usage Codelist
id Literal 1
name Literal 1
url Resource 1

Parameter

Definition
Properties
For this entity the following properties are defined: name, value.
Property Expected Range Cardinality Definition Usage Codelist
name Literal 1
value Literal 1

Parameter Setting

Definition
Properties
For this entity the following properties are defined: component, name, value.
Property Expected Range Cardinality Definition Usage Codelist
component Literal 0..1
name Literal 1
value Literal 1

Prediction

Definition
Properties
For this entity the following properties are defined: format, has prediction feature.
Property Expected Range Cardinality Definition Usage Codelist
format Concept 1
has prediction feature Prediction Feature 1..*

Prediction Feature

Definition
Properties
For this entity the following properties are defined: name, type.
Property Expected Range Cardinality Definition Usage Codelist
name Literal 1
type Concept 1 The recommended controlled vocabulary is the Prediction Feature type code list.

Quality

Definition
A quality measured on a dataset.
Properties
For this entity the following properties are defined: feature index, interval end, interval start, type, value.
Property Expected Range Cardinality Definition Usage Codelist
feature index Literal 0..1 The index of the quality that is set.
interval end Literal 0..1
interval start Literal 0..1
type Quality Type 1 A classification for a quality.
value Literal 1 The value of the quality.

Quality Type

Definition
A measurable property of datasets.
Usage
Examples are size, shape, statistical properties, benchmarks, and the presence of missing values.
Subclass of
Measure
Properties
For this entity the following properties are defined: description, id, name.
Property Expected Range Cardinality Definition Usage Codelist
description Literal 0..1 An explanation of the quality type.
id Literal 1 An unambiguous identifier for the quality type.
name Literal 1 The name assigned to the quality type.
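The sketch below computes a few simple qualities of the kind listed in the usage note (size and missing values) and returns them as Quality records with a type and a value. The quality type names used here are illustrative placeholders rather than entries of an official list, and the in-memory table layout (a list of dictionaries, with `None` for a missing value) is an assumption.

```python
from typing import Dict, List, Optional, Union

Row = Dict[str, Optional[Union[int, float, str]]]


def compute_qualities(rows: List[Row]) -> List[Dict[str, Union[str, int]]]:
    """Return Quality records describing the size and missing values of a table."""
    n_instances = len(rows)
    n_features = len(rows[0]) if rows else 0
    n_missing = sum(1 for row in rows for value in row.values() if value is None)
    # The type names below are hypothetical labels chosen for this sketch.
    return [
        {"type": "NumberOfInstances", "value": n_instances},
        {"type": "NumberOfFeatures", "value": n_features},
        {"type": "NumberOfMissingValues", "value": n_missing},
    ]


rows = [
    {"age": 31, "income": 42000, "class": "yes"},
    {"age": None, "income": 38000, "class": "no"},
    {"age": 45, "income": None, "class": "no"},
]
for quality in compute_qualities(rows):
    print(quality["type"], quality["value"])
```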

Run

Definition
An evaluation of machine learning models (flows) trained on a given task.
Usage
Runs can be created and shared automatically from supported machine learning libraries. They contain the exact hyperparameters used, all detailed results, and potentially the trained models.
Properties
For this entity the following properties are defined: error message, has evaluation, has flow, has output file description, has output file prediction, has parameter, has task, has uploader, id, run details, setup Id, setup string, tag.

Run Collection

Definition
A group of runs to easily refer to them and share them with others.
Subclass of
Collection
Properties
For this entity the following properties are defined: has run.
Property Expected Range Cardinality Definition Usage Codelist
has run Run 1..*

Split

Definition
Properties
For this entity the following properties are defined: description, id, is applied to, name.
Property Expected Range Cardinality Definition Usage Codelist
description Literal 0..1
id Literal 1
is applied to Distribution 1..*
name Literal 1

Task

Definition
A task defines a specific problem to be solved using a given dataset.
Usage
Tasks specify train and test sets, which target feature(s) to predict for supervised problems, and possibly which evaluation measure to optimize. They make the problem reproducible and machine-readable.
Properties
For this entity the following properties are defined: has cost matrix, has estimation procedure, has evaluation measure, has output, has task type, id, name, source data, tag, target feature.
Property Expected Range Cardinality Definition Usage Codelist
has cost matrix Cost Matrix 0..1
has estimation procedure Estimation Procedure 1..*
has evaluation measure Evaluation Measure 1..*
has output Prediction 1
has task type Task Type 1 The recommended controlled vocabulary is the Task type code list.
id Literal 0..1
name Literal 1
source data Dataset 1..*
tag Literal 0..*
target feature Feature 1..*

Task Collection

Definition
A group of tasks to easily refer to them and share them with others.
Subclass of
Collection
Properties
For this entity the following properties are defined: has task.
Property Expected Range Cardinality Definition Usage Codelist
has task Task 1..*
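Run Collection and Task Collection are both subclasses of Collection, so each inherits the Collection properties and adds one mandatory link property. A minimal sketch of that relationship with Python dataclasses follows. The field names are snake_case renderings of the property labels, and representing linked entities by their identifiers as strings is an assumption made to keep the example short.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Collection:
    id: str
    name: str
    creation_date: str
    has_uploader: str                  # identifier of the uploading Agent
    visibility: str                    # value from the visibility code list
    description: Optional[str] = None  # cardinality 0..1


@dataclass
class TaskCollection(Collection):
    has_task: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # 'has task' has cardinality 1..*, so an empty list is rejected.
        if not self.has_task:
            raise ValueError("a Task Collection needs at least one task")


@dataclass
class RunCollection(Collection):
    has_run: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # 'has run' has cardinality 1..*, so an empty list is rejected.
        if not self.has_run:
            raise ValueError("a Run Collection needs at least one run")


# Illustrative values only.
suite = TaskCollection(
    id="c1",
    name="Demo suite",
    creation_date="2023-01-13",
    has_uploader="agent-1",
    visibility="public",
    has_task=["task-1", "task-2"],
)
print(isinstance(suite, Collection), suite.has_task)
```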

Task Type

Definition
Properties
For this entity the following properties are defined: description, id, name.
Property Expected Range Cardinality Definition Usage Codelist
description Literal 1
id Literal 1
name Literal 1

Examples

Example 1 - Dataset

Changelog w.r.t. previous version

(non-normative)

This is the first release of the MLDCAT-AP.

UML representation

(non-normative)

The UML representation from which this specification has been built is available here.

RDF representation

(non-normative)

A reusable RDF representation (in Turtle) for this specification is retrievable here.

JSON-LD context

(non-normative)

A reusable JSON-LD context definition for this specification is retrievable here.

SHACL template

(non-normative)

A reusable SHACL template for this specification is retrievable here.