Record provenance of metadata

Status: accepted
Deciders: sdruskat, jkelling, led02, poikilotherm, skernchen
Date: 2023-11-15

Technical story: https://github.com/hermes-hmc/hermes/pull/40

Context and Problem Statement

To enable traceability of metadata and to resolve conflicts (e.g., duplicates) based on metadata sources, we need to record the provenance of metadata values. This requires specifying how that provenance is recorded.

Considered Options

Internal comment field
Dedicated metadata field
Use PROV standard
Separate internal provenance model
Create wrapped JSON-LD entities and add our metadata (json-ld/json-ld.org#744)
Create non-standard JSON-LD extension with custom keywords

Decision Outcome

Chosen option: “Create non-standard JSON-LD extension with custom keywords”, because it comes out best in the comparison of pros and cons below.

Pros and Cons of the Options

Positive Consequences

We have a unified data model and keep both provenance and actual data in the same place
Unifying the data model requires less complex handling in the implementation, as there is no need to recombine data
It’s not possible to lose the provenance information, as it is added to any attribute stored in the model

Negative Consequences

We lose the ability to reuse the serialized internal data model as output files (CodeMeta) without further processing
We have to document these added keywords very well for plugin developers
We need to define an ontology for this (internal) provenance metadata

Internal comment field

This would be a comment attached to single metadata fields to record provenance.

Bad, because Non-standard way to record this kind of information, i.e., non-reusable
Bad, because Extra documentation effort

Dedicated metadata field

This would be an extra metadata field to be attached to each field (?), e.g., with a URI (source: https://repo.org/user/project/codemeta.json or similar)

Good, because Very generic way to specify the source of information
Bad, because Very generic way to specify the source of information
Bad, because Non-standard way to record provenance

Use PROV standard

This attaches provenance information following PROV-O to metadata fields

Good, because Standardized for provenance information
Good, because Not much extra documentation needed
Bad, because More roundabout way to describe relatively constrained cases (we would probably only use a few entities and prov:wasInformedBy or similar)

Separate internal metadata about metadata model

This would create a valid JSON-LD file serializing our internal data model and an auxiliary file with the provenance data

Good, because keeping things separate enables direct reuse and validation of the data model file
Good, because serialization of the provenance data is free form and simple to do
Bad, because we need to re-combine provenance and metadata
Bad, because we have more files in the output which might confuse people
Bad, because not easy to debug when recombination fails
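The recombination step mentioned above can be sketched roughly as follows. This is only an illustration under assumptions: the layout of the auxiliary provenance data (a mapping from field name to source), the sample values, and the helper name `recombine` are all hypothetical and not defined by this record.

```python
# Content of the valid JSON-LD data model file (sample values are made up).
metadata = {
    "@type": "SoftwareSourceCode",
    "name": "example-project",
    "version": "1.0.0",
}

# Content of the auxiliary provenance file. Its layout is free form;
# a flat "field name -> source" mapping is assumed here.
provenance = {
    "name": "CITATION.cff",
    "version": "git",
    "license": "codemeta.json",  # refers to a field missing from the metadata
}


def recombine(metadata, provenance):
    """Pair each provenance entry with its metadata value.

    Returns the combined mapping and the list of provenance keys
    that could not be matched to any metadata field.
    """
    combined = {}
    orphans = []
    for key, source in provenance.items():
        if key in metadata:
            combined[key] = {"value": metadata[key], "source": source}
        else:
            orphans.append(key)
    return combined, orphans


combined, orphans = recombine(metadata, provenance)
print(combined)
print("unmatched provenance entries:", orphans)
```

The unmatched entry shows why failures are hard to debug: the two files can drift apart without either of them being invalid on its own.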

Create wrapped JSON-LD entities and add our metadata

This would work around the limitation of RDF and JSON-LD that value objects are non-extensible

Good, because standard compliant, still validates using standard validators
Bad, because very noisy in the output files
Bad, because still needs back references to the object when using @id in graph objects
Bad, because would require our own ontology and repeating any field ever needed (when keeping the original fields and not using a graph object)
Bad, because would require our own objects to keep the type and value separated, requiring reparsing when writing output files
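A rough sketch of what wrapping could look like when the graph is held as Python dictionaries. The `ex:` terms, the wrapper type, and the helpers `wrap` and `unwrap` are hypothetical stand-ins for the "own ontology" mentioned above; the `@context` is omitted for brevity.

```python
# Hypothetical terms from a custom ontology; the ex: names are assumptions.
WRAPPER_TYPE = "ex:ProvenancedValue"


def wrap(value, source):
    """Wrap a plain value in a node object that can carry extra properties."""
    return {"@type": WRAPPER_TYPE, "ex:value": value, "ex:source": source}


def unwrap(node):
    """Recover the plain values, as needed before writing output files."""
    if isinstance(node, list):
        return [unwrap(item) for item in node]
    if isinstance(node, dict):
        if node.get("@type") == WRAPPER_TYPE:
            return unwrap(node["ex:value"])
        return {key: unwrap(value) for key, value in node.items()}
    return node


wrapped = {
    "@type": "SoftwareSourceCode",
    "name": wrap("example-project", "CITATION.cff"),
    "version": wrap("1.0.0", "git"),
}

plain = unwrap(wrapped)
print(wrapped)
print(plain)
```

Every field turns into a three-key object, which illustrates the noise in the output, and `unwrap` is the reparsing step needed to get back to the original fields.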

Create non-standard JSON-LD extension with custom keywords

This would work around the limitation of RDF and JSON-LD that value objects are non-extensible

Good, because easy to implement in our custom handling of the graph as Python dictionaries
Good, because not very noisy
Good, because keywords are the JSON-LD way to provide metadata already
Good, because very light extension and not touching definitions from other ontologies
Good, because we can still make use of an ontology for the metadata objects to provide an open/closed principle compliant structure
Bad, because not standard compliant
Bad, because needs filtering when writing output files
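A minimal sketch of this option, assuming the graph is handled as Python dictionaries. The keyword name `@hermes-prov`, the shape of its value, and the helper `strip_custom_keywords` are hypothetical; this record does not specify the actual keywords.

```python
import json

# Hypothetical custom keyword; not part of JSON-LD and not defined in this record.
PROV_KEYWORD = "@hermes-prov"

# Value objects extended with the custom keyword to carry their provenance.
metadata = {
    "@type": "SoftwareSourceCode",
    "name": {
        "@value": "example-project",
        PROV_KEYWORD: {"source": "CITATION.cff"},
    },
    "author": [
        {
            "@type": "Person",
            "givenName": {"@value": "Jane", PROV_KEYWORD: {"source": "git"}},
        }
    ],
}


def strip_custom_keywords(node):
    """Return a copy of node without the custom provenance keyword."""
    if isinstance(node, list):
        return [strip_custom_keywords(item) for item in node]
    if isinstance(node, dict):
        cleaned = {
            key: strip_custom_keywords(value)
            for key, value in node.items()
            if key != PROV_KEYWORD
        }
        # Collapse value objects that only hold "@value" back to a plain value.
        if set(cleaned) == {"@value"}:
            return cleaned["@value"]
        return cleaned
    return node


output = strip_custom_keywords(metadata)
print(json.dumps(output, indent=2))
```

The filter is the extra processing needed before writing output files; the internal model keeps the provenance next to each value, while the filtered copy contains only standard JSON-LD.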
