Summary
This application profile extends the use of DCAT-AP. It was originally envisaged for describing machine learning processes and was developed in collaboration with OpenML.
Status of this document
This specification has the status of draft published on 2023-01-13.
Overview
This document describes the usage of the following entities, as required for correct use of the Application Profile:
- Agent
- Catalogue
- Catalogue Record
- Checksum
- Collection
- Cost Matrix
- Data Quality List
- Data Service
- Dataset
- Distribution
- Estimation Procedure
- Evaluation
- Evaluation Measure
- Feature
- Flow
- Flow Parameter
- Library
- Licence document
- Machine Learning Model
- Measure
- Output File Description
- Output File Prediction
- Parameter
- Parameter Setting
- Prediction
- Prediction Feature
- Quality
- Quality Type
- Run
- Run Collection
- Split
- Task
- Task Collection
- Task Type
Entities
Agent
- Definition
- Any entity carrying out actions with respect to the (Core) entities Catalogue, Datasets, Data Services and Distributions. If the Agent is an organisation, the use of the Organization Ontology is recommended.
- Properties
- For this entity the following properties are defined: name.
Catalogue
Property | Expected Range | Cardinality | Definition | Usage | Codelist |
---|---|---|---|---|---|
dataset | Dataset | 0..* | This property links the Catalogue with a Dataset that is part of the Catalogue. | As empty Catalogues are usually indications of problems, this property should be combined with the service property to implement an empty Catalogue check. | |
description | Literal | 1..* | This property contains a free-text account of the Catalogue. This property can be repeated for parallel language versions of the description. | | |
publisher | Agent | 1 | This property refers to an entity (organisation) responsible for making the Catalogue available. | | |
record | Catalogue Record | 0..* | This property refers to a Catalogue Record that is part of the Catalogue. | | |
service | Data Service | 0..* | This property refers to a site or end-point (Data Service) that is listed in the Catalogue. | As empty Catalogues are usually indications of problems, this property should be combined with the dataset property to implement an empty Catalogue check. | |
title | Literal | 1..* | This property contains a name given to the Catalogue. This property can be repeated for parallel language versions of the name. | | |
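The empty Catalogue check mentioned for the dataset and service properties could be sketched as follows. This is a minimal illustration in Python, assuming a Catalogue is held as a plain dictionary whose keys mirror the property names in the table; the function name `is_empty_catalogue`, the dictionary layout and the example identifiers are assumptions made for this sketch and are not defined by this specification.

```python
from typing import Any, Mapping


def is_empty_catalogue(catalogue: Mapping[str, Any]) -> bool:
    """Return True when the Catalogue lists neither a Dataset nor a Data Service.

    Both `dataset` and `service` have cardinality 0..*, so either key may be
    missing or hold an empty list.
    """
    datasets = catalogue.get("dataset") or []
    services = catalogue.get("service") or []
    return len(datasets) == 0 and len(services) == 0


# Hypothetical example records (identifiers are made up for illustration).
empty = {"title": ["Demo catalogue"], "description": ["No content yet"]}
filled = {"title": ["Demo catalogue"], "dataset": ["https://example.org/dataset/1"]}
```

Checking both properties together avoids flagging a Catalogue that lists only Data Services, or only Datasets, as empty.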
Catalogue Record
- Definition
- A description of an entry of a Dataset in the Catalogue.
- Properties
- For this entity the following properties are defined: description, description version, modification date, primary topic.
Property | Expected Range | Cardinality | Definition | Usage | Codelist |
---|---|---|---|---|---|
description | Literal | 0..* | This property contains a free-text account of the record. This property can be repeated for parallel language versions of the description. | | |
description version | Literal | 0..1 | This property refers to the version of the description; '1' denotes the original version. | | |
modification date | Literal | 1 | This property contains the most recent date on which the Catalogue entry was changed or modified. | | |
primary topic | Dataset | 1 | This property links the Catalogue Record to the Dataset, Data Service or Catalogue described in the record. | | |
Checksum
- Definition
- A value that allows the contents of a file to be authenticated. This class allows the results of a variety of checksum and cryptographic message digest algorithms to be represented.
- Properties
- For this entity the following properties are defined: algorithm, checksum value.
Property | Expected Range | Cardinality | Definition | Usage | Codelist |
---|---|---|---|---|---|
algorithm | ChecksumAlgorithm | 1 | This property identifies the algorithm used to produce the subject Checksum. | The members of this property are the supported checksum algorithms. | |
checksum value | HexBinary | 1 | This property provides a lower case hexadecimal encoded digest value produced using a specific algorithm. | | |
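A checksum value of the shape described above can be produced with Python's standard `hashlib` module. The sketch below assumes SHA-256 purely for illustration, since this document does not enumerate the supported algorithms; the helper names are hypothetical.

```python
import hashlib
import re

# Lower-case hexadecimal digits only, matching the "checksum value" definition.
_LOWER_HEX = re.compile(r"[0-9a-f]+")


def checksum_value(content: bytes, algorithm: str = "sha256") -> str:
    """Return the digest of `content` as a lower-case hexadecimal string."""
    return hashlib.new(algorithm, content).hexdigest()


def is_valid_checksum_value(value: str) -> bool:
    """Check that `value` is non-empty lower-case hex with an even digit count."""
    return bool(_LOWER_HEX.fullmatch(value)) and len(value) % 2 == 0


def matches(content: bytes, expected: str, algorithm: str = "sha256") -> bool:
    """Verify that a file's contents still match a recorded checksum value."""
    return checksum_value(content, algorithm) == expected
```

A consumer would typically download the file, call `matches` with the recorded checksum value, and reject the file if the result is false.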
Collection
- Definition
- A group of tasks or runs, collected so that they can be easily referenced and shared with others.
- Properties
- For this entity the following properties are defined: creation date, description, has uploader, id, name, visibility.
Property | Expected Range | Cardinality | Definition | Usage | Codelist |
---|---|---|---|---|---|
creation date | Literal | 1 | | | |
description | Literal | 0..1 | | | |
has uploader | Agent | 1 | | | |
id | Literal | 1 | | | |
name | Literal | 1 | | | |
visibility | Concept | 1 | The recommended controlled vocabulary is the Collection visibility code list. | | |
Cost Matrix
- Definition
- Properties
- No properties have been defined for this entity.
Data Quality List
- Definition
- The list of qualities to be uploaded.
- Properties
- For this entity the following properties are defined: did, evaluation engine id, includes.
Data Service
- Definition
- A collection of operations that provides access to one or more datasets or data processing functions.
- Properties
- For this entity the following properties are defined: endpoint URL, serves dataset, title.
Property | Expected Range | Cardinality | Definition | Usage | Codelist |
---|---|---|---|---|---|
endpoint URL | Resource | 1..* | The root location or primary endpoint of the service (an IRI). | | |
serves dataset | Dataset | 0..* | This property refers to a collection of data that this data service can distribute. | | |
title | Literal | 1..* | This property contains a name given to the Data Service. This property can be repeated for parallel language versions of the name. | | |
Dataset
- Definition
- A conceptual entity that represents the information published.
- Properties
- For this entity the following properties are defined: access rights, collectionDate, contributor, creator, description, distribution, has version, identifier, is referenced by, is version of, issued, keyword, landing page, publisher, status, title, version info, versionLabel, visibility.
Property | Expected Range | Cardinality | Definition | Usage | Codelist |
---|---|---|---|---|---|
access rights | RightsStatement | 0..1 | This property refers to information that indicates whether the Dataset is open data, has access restrictions or is not public. | The recommended controlled vocabulary is the Access Rights Named Authority List. | |
collectionDate | Literal | 1 | The date the data was originally collected, given by the uploader. | | |
contributor | Agent | 0..* | People who contributed to the current version of the dataset (e.g. reformatting). | | |
creator | Agent | 0..1 | This property refers to the entity responsible for producing the dataset. | | |
description | Literal | 1..* | This property contains a free-text account of the Dataset. This property can be repeated for parallel language versions of the description. | | |
distribution | Distribution | 0..* | This property links the Dataset to an available Distribution. | | |
has version | Dataset | 0..* | This property refers to a related Dataset that is a version, edition, or adaptation of the described Dataset. | | |
identifier | Literal | 0..* | This property contains the main identifier for the Dataset, e.g. the URI or other unique identifier in the context of the Catalogue. | | |
is referenced by | Resource | 0..* | This property is about a related resource, such as a publication, that references, cites, or otherwise points to the dataset. | | |
is version of | Dataset | 0..* | This property refers to a related Dataset of which the described Dataset is a version, edition, or adaptation. | | |
issued | Literal | 0..1 | This property contains the date of formal issuance (e.g., publication) of the Dataset. | | |
keyword | Literal | 0..* | This property contains a keyword or tag describing the Dataset. | | |
landing page | Document | 0..* | This property refers to a web page that provides access to the Dataset, its Distributions and/or additional information. It is intended to point to a landing page at the original data provider, not to a page on a site of a third party, such as an aggregator. | | |
publisher | Agent | 0..1 | This property refers to an entity (organisation) responsible for making the Dataset available. | | |
status | Concept | 0..1 | The status of the dataset in the context of the publication process. | The recommended controlled vocabulary is the Dataset status code list. | |
title | Literal | 1..* | This property contains a name given to the Dataset. This property can be repeated for parallel language versions of the name. | | |
version info | Literal | 0..1 | This property contains a version number or other version designation of the Dataset. | | |
versionLabel | Literal | 0..1 | Version label provided by the user, something relevant to the user. Can also be a date, hash, or some other type of id. | | |
visibility | Concept | 0..1 | Who can see the dataset. Typical values: 'Everyone', 'All my friends', 'Only me'. Can also be any of the user's circles. | The recommended controlled vocabulary is the Dataset visibility code list. | |
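The Cardinality column in this document uses four notations: `1`, `0..1`, `1..*` and `0..*`. The sketch below shows one way these could be interpreted when checking a Dataset description. It assumes a record is a dictionary mapping property names to lists of values; the function names and the choice of properties in `DATASET_CARDINALITIES` (a subset copied from the table above) are illustrative assumptions, and the SHACL template referenced at the end of this document remains the authoritative way to validate.

```python
from typing import Dict, List, Mapping, Optional, Sequence, Tuple


def parse_cardinality(text: str) -> Tuple[int, Optional[int]]:
    """Turn '1', '0..1', '1..*' or '0..*' into (minimum, maximum); None means unbounded."""
    if ".." in text:
        low, high = text.split("..", 1)
        return int(low), None if high == "*" else int(high)
    return int(text), int(text)


# A few Dataset properties with the cardinalities from the table above.
DATASET_CARDINALITIES: Dict[str, str] = {
    "title": "1..*",
    "description": "1..*",
    "collectionDate": "1",
    "creator": "0..1",
    "keyword": "0..*",
}


def cardinality_violations(
    record: Mapping[str, Sequence[object]],
    cardinalities: Mapping[str, str] = DATASET_CARDINALITIES,
) -> List[str]:
    """List the properties whose number of values falls outside the allowed range."""
    problems = []
    for prop, text in cardinalities.items():
        minimum, maximum = parse_cardinality(text)
        count = len(record.get(prop, []))
        if count < minimum or (maximum is not None and count > maximum):
            problems.append(f"{prop}: found {count}, expected {text}")
    return problems
```

A record with one title, one description and one collectionDate yields an empty list; omitting description or supplying two creators is reported as a violation.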
Distribution
- Definition
- A physical embodiment of the Dataset in a particular format.
- Properties
- For this entity the following properties are defined: access service, access URL, byte size, checksum, checksum, default target attribute, download URL, format, has feature, has policy, has quality, identifier, ignore attribute, language, licence, processing error, processing warning, processingDate, row ID attribute, title.
Property | Expected Range | Cardinality | Definition | Usage | Codelist |
---|---|---|---|---|---|
access service | Data Service | 0..* | This property refers to a data service that gives access to the distribution of the dataset. | | |
access URL | Resource | 1..* | This property contains a URL that gives access to a Distribution of the Dataset. The resource at the access URL may contain information about how to get the Dataset. | | |
byte size | Literal | 0..1 | This property contains the size of a Distribution in bytes. | | |
checksum | Checksum | 0..* | This property provides a mechanism that can be used to verify that the contents of a distribution have not changed. | | |
checksum | Checksum | 0..1 | This property provides a mechanism that can be used to verify that the contents of a distribution have not changed. The checksum is related to the downloadURL. | | |
default target attribute | Literal | 0..1 | The default target attribute, if it exists. | Can also have multiple values (comma-separated). Tasks can still be defined that use another attribute as target. | |
download URL | Resource | 0..* | This property contains a URL that is a direct link to a downloadable file in a given format. | | |
format | MediaTypeOrExtent | 0..1 | This property refers to the file format of the Distribution. | The recommended controlled vocabulary is the EU Vocabularies File Type Named Authority List. | |
has feature | Feature | 1..* | The attribute or column being part of a distribution. | | |
has policy | Policy | 0..1 | This property refers to the policy expressing the rights associated with the distribution if using the ODRL vocabulary. | | |
has quality | Quality | 1..* | A computed characteristic that describes a distribution. | | |
identifier | Literal | 0..1 | File identifier. | | |
ignore attribute | Literal | 0..* | Attributes that should be excluded in modelling, such as identifiers and indexes. | | |
language | LinguisticSystem | 0..* | This property refers to a language used in the Distribution. This property can be repeated if the metadata is provided in multiple languages. | The recommended controlled vocabulary is the EU Vocabularies Languages Named Authority List. | |
licence | Licence document | 0..1 | This property refers to the licence under which the Distribution is made available. | | |
processing error | Literal | 0..1 | Errors discovered while processing the dataset. | | |
processing warning | Literal | 0..1 | Warnings while processing the dataset. | | |
processingDate | Literal | 0..1 | Date of processing. | | |
row ID attribute | Literal | 0..1 | The attribute that represents the row-id column, if present in the dataset. | | |
title | Literal | 0..* | This property contains a name given to the Distribution. This property can be repeated for parallel language versions of the name. | | |
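The usage note for default target attribute allows several comma-separated names in one value, and ignore attribute lists columns to leave out of modelling. A minimal sketch of how a consumer might apply both is given below. It assumes attribute names contain no commas themselves; the function names and the example column names are hypothetical.

```python
from typing import Iterable, List, Sequence


def split_target_attributes(default_target_attribute: str) -> List[str]:
    """Split a comma-separated 'default target attribute' value into names."""
    return [name.strip() for name in default_target_attribute.split(",") if name.strip()]


def input_features(
    feature_names: Sequence[str],
    targets: Iterable[str],
    ignored: Iterable[str] = (),
) -> List[str]:
    """Keep the features that are neither targets nor listed in 'ignore attribute'."""
    excluded = set(targets) | set(ignored)
    return [name for name in feature_names if name not in excluded]


# Hypothetical Distribution with four columns.
targets = split_target_attributes("class, grade")
inputs = input_features(["id", "age", "class", "grade"], targets, ignored=["id"])
```

Here `targets` holds the two target names and `inputs` keeps only the `age` column, since `id` is ignored and the other two columns are targets.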
Estimation Procedure
- Definition
- It defines how models should be evaluated, typically by defining specific kinds of train and test splits.
- Subclass of
- Measure
- Properties
- For this entity the following properties are defined: data splits URL, description, has parameter, id, name, type.
Property | Expected Range | Cardinality | Definition | Usage | Codelist |
---|---|---|---|---|---|
data splits URL | Resource | 1 | | | |
description | Literal | 1 | | | |
has parameter | Parameter | 1..* | | | |
id | Literal | 1 | | | |
name | Literal | 1 | | | |
type | Concept | 1 | The recommended controlled vocabulary is the Estimation Procedure type code list. | | |
Evaluation
- Definition
- Properties
- For this entity the following properties are defined: array data, fold, has flow, id, interval end, interval start, label, name, repeat, sample, sample size, stdev, value.
Property | Expected Range | Cardinality | Definition | Usage | Codelist |
---|---|---|---|---|---|
array data | Literal | 0..1 | | | |
fold | Literal | 0..1 | | | |
has flow | Flow | 0..* | | | |
id | Literal | 0..1 | | | |
interval end | Literal | 0..1 | | | |
interval start | Literal | 0..1 | | | |
label | Literal | 0..1 | | | |
name | Literal | 1 | | | |
repeat | Literal | 0..1 | | | |
sample | Literal | 0..1 | | | |
sample size | Literal | 0..1 | | | |
stdev | Literal | 0..1 | | | |
value | Literal | 0..1 | | | |
Evaluation Measure
- Definition
- A way to score the outputs of machine learning models (e.g. predictions).
- Subclass of
- Measure
- Properties
- For this entity the following properties are defined: description, implementation, name, value, value bad, value good.
Property | Expected Range | Cardinality | Definition | Usage | Codelist |
---|---|---|---|---|---|
description | Literal | 0..1 | | | |
implementation | Literal | 0..1 | | | |
name | Literal | 1 | | | |
value | Literal | 1 | | | |
value bad | Literal | 0..1 | | | |
value good | Literal | 0..1 | | | |
Feature
- Definition
- An attribute or column that forms part of the structure of a distribution.
- Properties
- For this entity the following properties are defined: description, name, type.
Property | Expected Range | Cardinality | Definition | Usage | Codelist |
---|---|---|---|---|---|
description | Literal | 0..1 | An explanation of the attribute. | | |
name | Literal | 1 | The label of the attribute. | | |
type | Concept | 1 | A classification of the attribute. | The recommended controlled vocabulary is the Feature type code list. | |
Flow
- Definition
- A machine learning pipeline, a neural architecture, or an (untrained) machine learning model in general.
- Usage
- Flows contain all the information necessary to build a model, including its exact structure and any software dependencies. Given a flow, supported machine learning libraries can reproduce the model exactly.
- Properties
- For this entity the following properties are defined: class name, custom name, description, external version, has dependency, has flow parameter, has uploader, id, name, status, tag, uploaded, version.
Property | Expected Range | Cardinality | Definition | Usage | Codelist |
---|---|---|---|---|---|
class name | Literal | 0..1 | | | |
custom name | Literal | 0..1 | | | |
description | Literal | 0..1 | | | |
external version | Literal | 0..1 | | | |
has dependency | Library | 1..* | | | |
has flow parameter | Flow Parameter | 0..* | | | |
has uploader | Agent | 0..1 | | | |
id | Literal | 0..1 | | | |
name | Literal | 1 | | | |
status | Concept | 1 | The recommended controlled vocabulary is the Flow status code list. | | |
tag | Literal | 0..* | | | |
uploaded | Literal | 1 | | | |
version | Literal | 0..1 | | | |
Flow Parameter
- Definition
- Properties
- For this entity the following properties are defined: default value, description, name, recommended range, type.
Property | Expected Range | Cardinality | Definition | Usage | Codelist |
---|---|---|---|---|---|
default value | Literal | 0..1 | | | |
description | Literal | 0..1 | | | |
name | Literal | 1 | | | |
recommended range | Literal | 0..1 | | | |
type | Concept | 0..1 | The recommended controlled vocabulary is the Flow Parameter type code list. | | |
Licence document
- Definition
- A legal document giving official permission to do something with a resource.
- Properties
- No properties have been defined for this entity.
Machine Learning Model
- Definition
- Properties
- For this entity the following properties are defined: contributor, creator, description, ethical considerations, evaluation results, has output file prediction, has uploader, intended use, license, name, tag, trained on, training process, version.
Property | Expected Range | Cardinality | Definition | Usage | Codelist |
---|---|---|---|---|---|
contributor | Agent | 0..* | | | |
creator | Agent | 0..* | | | |
description | Literal | 0..1 | | | |
ethical considerations | Literal | 0..1 | | | |
evaluation results | Literal | 0..1 | | | |
has output file prediction | Output File Prediction | 1 | | | |
has uploader | Agent | 0..1 | | | |
intended use | Literal | 0..1 | | | |
license | Licence document | | | | |
name | Literal | 1 | | | |
tag | Literal | 0..* | | | |
trained on | Dataset | 1 | | | |
training process | Literal | 0..1 | | | |
version | Literal | 1 | | | |
Measure
- Definition
- Properties
- No properties have been defined for this entity.
Prediction
- Definition
- Properties
- For this entity the following properties are defined: format, has prediction feature.
Property | Expected Range | Cardinality | Definition | Usage | Codelist |
---|---|---|---|---|---|
format | Concept | 1 | | | |
has prediction feature | Prediction Feature | 1..* | | | |
Prediction Feature
Property | Expected Range | Cardinality | Definition | Usage | Codelist |
---|---|---|---|---|---|
name | Literal | 1 | | | |
type | Concept | 1 | The recommended controlled vocabulary is the Prediction Feature type code list. | | |
Quality
- Definition
- A quality measured on a dataset.
- Properties
- For this entity the following properties are defined: feature index, interval end, interval start, type, value.
Property | Expected Range | Cardinality | Definition | Usage | Codelist |
---|---|---|---|---|---|
feature index | Literal | 0..1 | The index of the quality that is set. | | |
interval end | Literal | 0..1 | | | |
interval start | Literal | 0..1 | | | |
type | Quality Type | 1 | A classification for a quality. | | |
value | Literal | 1 | The value of the quality. | | |
Quality Type
- Definition
- A measurable property of datasets.
- Usage
- Examples are size, shape, statistical properties, benchmarks, and the presence of missing values.
- Subclass of
- Measure
- Properties
- For this entity the following properties are defined: description, id, name.
Run
- Definition
- An evaluation of machine learning models (flows) trained on a given task.
- Usage
- Runs can be created and shared automatically from supported machine learning libraries. They contain the exact hyperparameters used, all detailed results, and potentially the trained models.
- Properties
- For this entity the following properties are defined: error message, has evaluation, has flow, has output file description, has output file prediction, has parameter, has task, has uploader, id, run details, setup Id, setup string, tag.
Property | Expected Range | Cardinality | Definition | Usage | Codelist |
---|---|---|---|---|---|
error message | Literal | 0..1 | | | |
has evaluation | Evaluation | 0..* | | | |
has flow | Flow | 1 | | | |
has output file description | Output File Description | 1 | | | |
has output file prediction | Output File Prediction | 1 | | | |
has parameter | Parameter Setting | 0..1 | | | |
has task | Task | 0..1 | | | |
has uploader | Agent | 0..1 | | | |
id | Literal | 0..1 | | | |
run details | Literal | 0..1 | | | |
setup Id | Literal | 0..1 | | | |
setup string | Literal | 0..1 | | | |
tag | Literal | 0..* | | | |
Run Collection
- Definition
- A group of runs, collected so that they can be easily referenced and shared with others.
- Subclass of
- Collection
- Properties
- For this entity the following properties are defined: has run.
Split
- Definition
- Properties
- For this entity the following properties are defined: description, id, is applied to, name.
Property | Expected Range | Cardinality | Definition | Usage | Codelist |
---|---|---|---|---|---|
description | Literal | 0..1 | | | |
id | Literal | 1 | | | |
is applied to | Distribution | 1..* | | | |
name | Literal | 1 | | | |
Task
- Definition
- A task defines a specific problem to be solved using a given dataset.
- Usage
- Tasks specify train and test sets, which target feature(s) to predict for supervised problems, and possibly which evaluation measure to optimize. They make the problem reproducible and machine-readable.
- Properties
- For this entity the following properties are defined: has cost matrix, has estimation procedure, has evaluation measure, has output, has task type, id, name, source data, tag, target feature.
Property | Expected Range | Cardinality | Definition | Usage | Codelist |
---|---|---|---|---|---|
has cost matrix | Cost Matrix | 0..1 | | | |
has estimation procedure | Estimation Procedure | 1..* | | | |
has evaluation measure | Evaluation Measure | 1..* | | | |
has output | Prediction | 1 | | | |
has task type | Task Type | 1 | The recommended controlled vocabulary is the Task type code list. | | |
id | Literal | 0..1 | | | |
name | Literal | 1 | | | |
source data | Dataset | 1..* | | | |
tag | Literal | 0..* | | | |
target feature | Feature | 1..* | | | |
Task Collection
- Definition
- A group of tasks, collected so that they can be easily referenced and shared with others.
- Subclass of
- Collection
- Properties
- For this entity the following properties are defined: has task.
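The subclass relationship between Collection, Task Collection and Run Collection could be mirrored in code roughly as follows. This is only a sketch using Python dataclasses: the field names are snake_case renderings of the property names, Agents, Tasks and Runs are represented by plain string identifiers, and the cardinality of has task and has run is not stated in this document, so a list is assumed for both.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Collection:
    # Properties with cardinality 1 in the Collection table.
    id: str
    name: str
    creation_date: str
    has_uploader: str  # identifier of an Agent (assumed representation)
    visibility: str    # a concept from the Collection visibility code list
    # Property with cardinality 0..1.
    description: Optional[str] = None


@dataclass
class TaskCollection(Collection):
    # 'has task'; cardinality not stated in this document, a list is assumed.
    has_task: List[str] = field(default_factory=list)


@dataclass
class RunCollection(Collection):
    # 'has run'; same assumption as for 'has task'.
    has_run: List[str] = field(default_factory=list)
```

Because both subclasses inherit every Collection property, code that only needs the name or visibility of a collection can accept either kind.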
Task Type
- Definition
- Properties
- For this entity the following properties are defined: description, id, name.
Examples
Example 1 - Dataset
Changelog w.r.t. previous version
(non-normative) This is the first release of the MLDCAT-AP.
UML representation
(non-normative) The UML representation from which this specification has been built is available here.
RDF representation
(non-normative) A reusable RDF representation (in turtle) for this specification is retrievable here.
JSON-LD context
(non-normative) A reusable JSON-LD context definition for this specification is retrievable here.
SHACL template
(non-normative) A reusable SHACL template for this specification is retrievable here.