Adds metadata source specification #484

Open: wants to merge 1 commit into base: main

Conversation

@jsheunis (Member) commented Jul 23, 2024

The source specification defines how to structure a collection of metadata records that together form the source material for a catalog instance. It separates metadata source files and formats from tooling, ensuring that users can provide and maintain a metadata collection without depending on datalad-catalog tools, while providing a validated structure from which automated tools can generate datalad-catalog-compatible records to be rendered.

This commit adds the specification as part of the project docs. Future commits should update the 'Pipeline description' section of the docs to suggest the use of tools that understand the metadata source specification, and should also remove or update the 'Metadata formats' section of the docs accordingly.

Closes #482


netlify bot commented Jul 23, 2024

Deploy Preview for datalad-catalog canceled.

Name Link
🔨 Latest commit 002c8f3
🔍 Latest deploy log https://app.netlify.com/sites/datalad-catalog/deploys/669f860ca2ae8a0008c778c7

Comment on lines +47 to +49
├── config/
│   └── <config-version-id>/
│       └── config.json
Member Author

One thing I'm uncertain about here, wrt versioned configs, is how the ingestion pipeline will know which config version to use to create the catalog entries. It will have to be parameterized somehow, but ideally the agent that created the metadata collection should be the one to specify which config version to use. I.e. that argument should be part of the collection somehow?
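One way to make the config version part of the collection is a small manifest at the collection root that names it. The sketch below is only an illustration of that idea: the `collection.json` file and its `config_version` key are hypothetical and are not defined by the specification in this PR; only the `config/<config-version-id>/config.json` layout comes from the diff.

```python
import json
from pathlib import Path


def resolve_config_path(collection_root, manifest_name="collection.json"):
    """Return the config.json path for the version named in the manifest.

    Assumption: the manifest file and its `config_version` key are
    hypothetical; the specification under review does not define them.
    """
    root = Path(collection_root)
    manifest = json.loads((root / manifest_name).read_text(encoding="utf-8"))
    version_id = manifest["config_version"]

    # Layout taken from the proposed spec: config/<config-version-id>/config.json
    config_path = root / "config" / version_id / "config.json"
    if not config_path.is_file():
        raise FileNotFoundError(
            f"No config found for version {version_id!r}: {config_path}"
        )
    return config_path
```

With something like this, the agent that creates the collection chooses the config version, and the ingestion pipeline only needs the collection root as a parameter.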

Comment on lines +60 to +61
This directory should contain the catalog-level configuration file(s), one per version,
with the name ``config.json``.
Member Author

Technically, datalad-catalog can also read YAML config files. Should we allow all possibilities (.json, .yml, .yaml), or just specify a single option?
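If all three extensions were allowed, a reader would need a rule for picking the file and for rejecting ambiguous directories. The following is a minimal sketch of one such rule, assuming the file stem stays `config` and that exactly one config file may exist per version directory; the function name and the "exactly one" rule are illustrative assumptions, not part of the proposed specification.

```python
from pathlib import Path

# Assumption: these are the only accepted names if YAML were allowed too.
ALLOWED_CONFIG_NAMES = ("config.json", "config.yml", "config.yaml")


def find_config_file(version_dir):
    """Return the single config file in a <config-version-id> directory.

    Raises ValueError if none or more than one of the allowed names exist,
    so that a version directory never carries two competing configs.
    """
    candidates = [Path(version_dir) / name for name in ALLOWED_CONFIG_NAMES]
    found = [path for path in candidates if path.is_file()]
    if len(found) != 1:
        names = [path.name for path in found]
        raise ValueError(
            f"Expected exactly one config file in {version_dir}, found: {names}"
        )
    return found[0]
```

Specifying a single option (`config.json`) would make this check unnecessary, which is the trade-off behind the question above.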

Comment on lines +47 to +49
├── config/
│ └── <config-version-id>/
│ └── config.json
Member Author

Another point about the config: it can also include a logo path (specified relative to the location of the config, within the context of the environment running the datalad-catalog code). For the purposes of the collection, this logo will either have to be provided as an image file in the collection itself (likely alongside the config.json file) or as a downloadable URL. Thoughts?
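Both options could be supported by the same reader if the logo value is inspected before use. The sketch below assumes the logo value is a plain string that is either an `http(s)` URL or a path relative to the config file; the function name and the returned `(kind, location)` tuple are hypothetical and not part of datalad-catalog.

```python
from pathlib import Path
from urllib.parse import urlparse


def resolve_logo(logo_value, config_path):
    """Classify a logo value as a downloadable URL or a file in the collection.

    Returns ("url", <str>) for http(s) URLs, otherwise ("file", <Path>) with
    the path resolved relative to the directory containing the config file.
    """
    if urlparse(logo_value).scheme in ("http", "https"):
        return ("url", logo_value)

    # Relative paths are interpreted relative to the config's own location.
    logo_path = (Path(config_path).parent / logo_value).resolve()
    return ("file", logo_path)
```

Resolving against the config's directory keeps the collection self-contained when the image sits alongside `config.json`.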

Comment on lines +94 to +97
This should be a unique filename of a single record, with identifying characters that
can be parsed in order to match the specific file format with a specific reader or processing
tool. There is no restriction on the number of files contained in a given ``<dataset-version-id>``
directory, they should just all be unique.
Member Author

It just occurred to me that it might not always be individual files, e.g. a tabby collection might be included here as a directory containing all the related tabby files?
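If directories were allowed as entries, a tool walking a `<dataset-version-id>` directory would have to distinguish the two cases before choosing a reader. A minimal sketch, assuming every top-level entry is one record (either a single file or a directory of related files, such as a tabby collection); the function name and the `"file"`/`"directory"` labels are illustrative only.

```python
from pathlib import Path


def list_record_entries(version_dir):
    """List the top-level entries of a <dataset-version-id> directory.

    Each entry is returned as (name, kind), where kind is "directory" for
    a multi-file record and "file" for a single-file record. Names are
    unique by construction, since they share one parent directory.
    """
    entries = []
    for entry in sorted(Path(version_dir).iterdir()):
        kind = "directory" if entry.is_dir() else "file"
        entries.append((entry.name, kind))
    return entries
```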

Successfully merging this pull request may close these issues.

A catalog metadata source format to support automatic ingestion
1 participant