Define ideal user experience and workflow for `catalog-translate` #324

jsheunis · 2023-07-05T15:25:55Z

Relevant existing issues:

As part of #309 I am creating a new interface for Translate, with one of the goals being to fix existing issues related to translation functionality. I am writing a summary here of the existing functionality and the proposed changes, to get feedback on it before I waste time on implementing the wrong stuff.

Old

The current approach is:

datalad catalog translate --metadata <metalad-metadata-to-be-translated>

the name and version of the extractor used to generate a provided metadata item is determined from the metadata item itself
these values are used to match a specific translator from all translators that are available via datalad's entrypoint mechanism
this process (getting the name and version, running the matching method of all available translators, loading the matched translator) is done for each metadata item. this is where the performance can be improved.

New

The envisioned approach v1:

datalad catalog-translate --metadata <metalad-metadata-to-be-translated>

the same as old procedure, but we use a simple approach to keep track of already matched translators (and their loaded entrypoints)
the factory of matched translators will be checked first for every new metadata item being translated, to prevent an unnecessary matching procedure

The envisioned approach v2:

datalad catalog-translate --metadata <metalad-metadata-to-be-translated> --translator <name(s) of translator(s)>

same as the new procedure v1, but with the added --translator argument which is introduced here in response to NF: add batch functionality to translation #264
if a translator is supplied (or a list of translators) then only these supplied translators will be used during the rest of the catalog-translate process (if they are available as entrypoints)
if metadata items are provided that do not result from an extractor for which a related translator has been passed as an argument, then these items are ignored.

The problem with the v2 approach is that the manner in which the translators are supplied as arguments to datalad catalog-translate would have to include the same information that is used by the matching method of these translators, which could vary across translators since these are implemented on the inherited translator classes themselves (and not on the base class). So a call would have to be something like:

datalad catalog-translate --metadata <metalad-metadata-to-be-translated> --translator metalad_core_v0.2.2

and that would have to be parsed in order to allow complete matching. I'm thinking more and more this is unnecessarily complicated and that we should stick with new v1 approach.

Opinions?

The text was updated successfully, but these errors were encountered:

mslw · 2023-07-05T16:13:16Z

For v1, you'd need the factory to index translators by all properties given to match(). IIRC these are: name, version, and (optional) UUID. This should be doable. And if successful, would not introduce two "modes" of processing.

I am not able to predict what performance gain that would have in practice (I have been wrong once on this topic already), but probably it's worth a try to write a basic factory, and try it on a synthetically generated, large-ish set of metadata (either from one extractor, or from multiple, interleaved).

I don't currently understand the argument why the v2 specification must be complex - why not entrypoint name alone? But if you feel v1 may be better, it won't hurt to try v1 first.

jsheunis · 2023-07-05T20:46:02Z

I am not able to predict what performance gain that would have in practice (I have been wrong once on this topic already), but probably it's worth a try to write a basic factory, and try it on a synthetically generated, large-ish set of metadata (either from one extractor, or from multiple, interleaved).

good idea

I don't currently understand the argument why the v2 specification must be complex - why not entrypoint name alone? But if you feel v1 may be better, it won't hurt to try v1 first.

An example of when entrypoint alone would not be enough would be if there are two translators compatible with an extractor of the same name, but different versions. One would either have to make some assumptions to e.g. select the latest, or the user would have to specify both name and version, or both translators would have to be deliberately instantiated and made part of the factory store and matched correctly at a later stage. All of these requires some decision trees that will turn into additional code, and I don't think this adds much benefit to the user in terms of improved UX (the v1 alternative actually feels easier because it's fewer arguments to be concerned with) nor in terms of improved performance over the v1 alternative (in both cases the matching and instantiation for a to-be-used translator would only happen once)

jsheunis · 2023-07-28T20:36:49Z

Pragmatic approach implemented in #309. This might not be ideal forever, but closing the issue until such time as this becomes insufficient.

jsheunis changed the title ~~Define ideal user experience and workflow~~ Define ideal user experience and workflow for catalog-translate Jul 5, 2023

jsheunis mentioned this issue Jul 11, 2023

RF: move from single entrypoint API to entrypoint per command #309

Merged

mih mentioned this issue Jul 28, 2023

Implement pyld-based metadata homogenization tooling mih/datalad-mihextras#40

Closed

jsheunis closed this as completed Jul 28, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Define ideal user experience and workflow for `catalog-translate` #324

Define ideal user experience and workflow for `catalog-translate` #324

jsheunis commented Jul 5, 2023

mslw commented Jul 5, 2023

jsheunis commented Jul 5, 2023

jsheunis commented Jul 28, 2023

Define ideal user experience and workflow for catalog-translate #324

Define ideal user experience and workflow for catalog-translate #324

Comments

jsheunis commented Jul 5, 2023

Old

New

mslw commented Jul 5, 2023

jsheunis commented Jul 5, 2023

jsheunis commented Jul 28, 2023

Define ideal user experience and workflow for `catalog-translate` #324

Define ideal user experience and workflow for `catalog-translate` #324