Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define ideal user experience and workflow for catalog-translate #324

Closed
jsheunis opened this issue Jul 5, 2023 · 3 comments
Closed

Define ideal user experience and workflow for catalog-translate #324

jsheunis opened this issue Jul 5, 2023 · 3 comments

Comments

@jsheunis
Copy link
Member

jsheunis commented Jul 5, 2023

Relevant existing issues:

As part of #309 I am creating a new interface for Translate, with one of the goals being to fix existing issues related to translation functionality. I am writing a summary here of the existing functionality and the proposed changes, to get feedback on it before I waste time on implementing the wrong stuff.

Old

The current approach is:

datalad catalog translate --metadata <metalad-metadata-to-be-translated>

  • the name and version of the extractor used to generate a provided metadata item is determined from the metadata item itself
  • these values are used to match a specific translator from all translators that are available via datalad's entrypoint mechanism
  • this process (getting the name and version, running the matching method of all available translators, loading the matched translator) is done for each metadata item. this is where the performance can be improved.

New

The envisioned approach v1:

datalad catalog-translate --metadata <metalad-metadata-to-be-translated>

  • the same as old procedure, but we use a simple approach to keep track of already matched translators (and their loaded entrypoints)
  • the factory of matched translators will be checked first for every new metadata item being translated, to prevent an unnecessary matching procedure

The envisioned approach v2:

datalad catalog-translate --metadata <metalad-metadata-to-be-translated> --translator <name(s) of translator(s)>

  • same as the new procedure v1, but with the added --translator argument which is introduced here in response to NF: add batch functionality to translation #264
  • if a translator is supplied (or a list of translators) then only these supplied translators will be used during the rest of the catalog-translate process (if they are available as entrypoints)
  • if metadata items are provided that do not result from an extractor for which a related translator has been passed as an argument, then these items are ignored.

The problem with the v2 approach is that the manner in which the translators are supplied as arguments to datalad catalog-translate would have to include the same information that is used by the matching method of these translators, which could vary across translators since these are implemented on the inherited translator classes themselves (and not on the base class). So a call would have to be something like:

datalad catalog-translate --metadata <metalad-metadata-to-be-translated> --translator metalad_core_v0.2.2

and that would have to be parsed in order to allow complete matching. I'm thinking more and more this is unnecessarily complicated and that we should stick with new v1 approach.

Opinions?

@jsheunis jsheunis changed the title Define ideal user experience and workflow Define ideal user experience and workflow for catalog-translate Jul 5, 2023
@mslw
Copy link
Collaborator

mslw commented Jul 5, 2023

For v1, you'd need the factory to index translators by all properties given to match(). IIRC these are: name, version, and (optional) UUID. This should be doable. And if successful, would not introduce two "modes" of processing.

I am not able to predict what performance gain that would have in practice (I have been wrong once on this topic already), but probably it's worth a try to write a basic factory, and try it on a synthetically generated, large-ish set of metadata (either from one extractor, or from multiple, interleaved).

I don't currently understand the argument why the v2 specification must be complex - why not entrypoint name alone? But if you feel v1 may be better, it won't hurt to try v1 first.

@jsheunis
Copy link
Member Author

jsheunis commented Jul 5, 2023

I am not able to predict what performance gain that would have in practice (I have been wrong once on this topic already), but probably it's worth a try to write a basic factory, and try it on a synthetically generated, large-ish set of metadata (either from one extractor, or from multiple, interleaved).

good idea

I don't currently understand the argument why the v2 specification must be complex - why not entrypoint name alone? But if you feel v1 may be better, it won't hurt to try v1 first.

An example of when entrypoint alone would not be enough would be if there are two translators compatible with an extractor of the same name, but different versions. One would either have to make some assumptions to e.g. select the latest, or the user would have to specify both name and version, or both translators would have to be deliberately instantiated and made part of the factory store and matched correctly at a later stage. All of these requires some decision trees that will turn into additional code, and I don't think this adds much benefit to the user in terms of improved UX (the v1 alternative actually feels easier because it's fewer arguments to be concerned with) nor in terms of improved performance over the v1 alternative (in both cases the matching and instantiation for a to-be-used translator would only happen once)

@jsheunis
Copy link
Member Author

Pragmatic approach implemented in #309. This might not be ideal forever, but closing the issue until such time as this becomes insufficient.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants