
Collecting feature requests around a developmental feature for RAMP #250

Open
kegl opened this issue Oct 16, 2020 · 24 comments

Comments

@kegl
Contributor

kegl commented Oct 16, 2020

When RAMP is used for developing models for a problem, we may want to tag certain versions of a submission, and even problem.py, together with the scores. One idea is to use git tags. For example, after running ramp-test ... --save-output, one could run another script that git adds problem.py, the submission files, and the scores in training_output/fold_<i>, commit and tag with a user-defined tag (plus maybe a prefix indicating that it is a scoring tag, so later we may automatically search for all such tags).
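A rough sketch of what such a tagging script could look like (just an illustration, not an existing rampwf command; the helper name and the exact output paths under submissions/<submission>/training_output/ are assumptions):

# hypothetical helper to run after `ramp-test ... --save-output`
import subprocess
import sys
from glob import glob


def tag_scored_submission(submission, tag, prefix='score'):
    """git-add problem.py, the submission files and the saved fold scores, then tag."""
    paths = ['problem.py']
    paths += glob(f'submissions/{submission}/*.py')
    # assuming --save-output writes under submissions/<submission>/training_output/
    paths += glob(f'submissions/{submission}/training_output/fold_*/*')
    subprocess.run(['git', 'add'] + paths, check=True)
    subprocess.run(
        ['git', 'commit', '-m', f'scores for {submission} ({tag})'], check=True)
    # the prefix makes scoring tags easy to list later: git tag --list 'score/*'
    subprocess.run(['git', 'tag', f'{prefix}/{tag}'], check=True)


if __name__ == '__main__':
    tag_scored_submission(sys.argv[1], sys.argv[2])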

@zhangJianfeng

  1. When loading the data in RAMP, it seems the training data is read twice. When the data is big, this is a bit slow.
  2. Is it possible to parallelize the CV process?

@gabriel-hurtado
Collaborator

gabriel-hurtado commented Nov 9, 2020

Adding one feature that would be useful, at least to me: it would be great to be able to import more code from elsewhere in a submission, allowing multiple submissions to share some code. Right now this can be done by creating a library and importing it, which is a bit tedious.
@albertcthomas mentioned this could perhaps be done in a similar way to pytest, which has a conftest.py file for code that you want to reuse across different test modules.
#181

@albertcthomas
Collaborator

@albertcthomas mentioned this could perhaps be done in a similar way to pytest, which has a conftest.py file for code that you want to reuse across different test modules.

Well, it is more like "this makes me think of conftest.py, which can be used to share fixtures", but I don't know what happens when you run pytest and I am not sure the comparison goes very far :). As written in the pytest doc: "The next example puts the fixture function into a separate conftest.py file so that tests from multiple test modules in the directory can access the fixture function".
This feature is discussed in issue #181.

@albertcthomas albertcthomas changed the title Save scores for different versions of a submission during development Collecting feature requests around a developmental feature of RAMP Nov 9, 2020
@albertcthomas albertcthomas changed the title Collecting feature requests around a developmental feature of RAMP Collecting feature requests around a developmental feature for RAMP Nov 9, 2020
@illyyne

illyyne commented Nov 12, 2020

1- I find that the data-reading step takes too much time: it is slower than reading the data without RAMP.
2- It would be great if the mean result were also saved along with the bagged one.
3- Propose a LaTeX syntax for the results.
4- When the output is saved, it would be better to also save the experiment conditions (data label, tested hyperparameters, etc.) and keep everything somewhere, either locally or in the cloud, to check later.

@LudoHackathon

LudoHackathon commented Nov 19, 2020

Here are some features that could help:

  • Model selection: "early killing" (e.g. successive halving or even simpler schemes), which implies sharing information during hyperopt, or at least a way to compare the current model to the best one so far (either a global Python variable or saving it somehow on disk...).
  • Experimental protocol: having a parametrized problem.py. I'm keen on JSON (which could also be saved each time you launch a ramp-test); a sketch of what I mean follows this list. I'm not a big fan of using commit tags, since I may launch 10 different batches of experiments on different servers without wanting to commit each time just for an experiment's configuration file.
  • Logging:
    • Model saving and loading (path, hyperopt, ...)
    • Possibility to rename the output score folders. E.g., depending on the task and the models I've implemented, I'd rather save the results with a different directory hierarchy, say w.r.t. hyperparameters or more global options. It helps regex search (useful with TensorBoard, for example) or plotting when dealing with tens of thousands of experiments (and looking at parameter sensitivity).
  • Other:
    • Being able to modify the submissions while some experiments are running (it looks like the .py submission file is loaded several times; I'm in the habit of loading the class somewhere, which lets me do whatever I want while my experiments are running).
    • Same as Gabriel: ease the imports in a submission. Maybe I didn't find the right way to do it, but there is a lot of duplicated code in my submissions even though I've implemented a Pytorch2RAMP class. [RFC] importing files in submissions #181
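For the parametrized problem.py point, here is a rough sketch of what I mean (the RAMP_CONFIG variable and the config keys are made up for illustration; this is not an existing rampwf mechanism):

# problem.py (excerpt) -- hypothetical parametrization through a JSON file
import json
import os

from sklearn.model_selection import KFold

# read experiment options from a file chosen via an environment variable,
# so different batches can run with different configs without any commit
_config_path = os.environ.get('RAMP_CONFIG', 'ramp_config.json')
with open(_config_path) as f:
    _config = json.load(f)

problem_title = _config.get('problem_title', 'My problem')
_n_folds = _config.get('n_folds', 8)


def get_cv(X, y):
    cv = KFold(n_splits=_n_folds, shuffle=True, random_state=57)
    return cv.split(X, y)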

@LudoHackathon

From my (little) experience with RAMP, what made people a bit reluctant to use it was that it was too high level, meaning that we don't see the classical sequential process we are used to seeing in an ML script (load data, instantiate the model, train it, test it). As an example, Keras (not the same purpose as RAMP) embeds some parts of the script to minimize the main script, but keeps the overall spirit of the classical script, making it as understandable as the original one. Using ramp-test on the command line may make RAMP more obscure to new users. Maybe having a small script (like the one already in the documentation, for example) giving the user a more pythonic way to play with it, without having to use ramp-test as a command line, could make machine learners more willing to use it.

@agramfort
Contributor

agramfort commented Nov 23, 2020 via email

@kegl
Contributor Author

kegl commented Nov 23, 2020

Calling ramp-test from a notebook is as simple as

from rampwf.utils import assert_submission
assert_submission(submission='starting_kit')

This page https://paris-saclay-cds.github.io/ramp-docs/ramp-workflow/advanced/scoring.html now contains two code snippets that you can use to call lower-level elements of the workflow and emulate a simple train/test and cross-validation loop. @LudoHackathon, do you have a suggestion for what else would be useful? E.g. an example notebook in the library?

@agramfort
Contributor

agramfort commented Nov 23, 2020 via email

@albertcthomas
Collaborator

albertcthomas commented Nov 23, 2020

this should be explained in the kits to save some pain to students

wasn't this the purpose of the "Working in the notebook" section of the old titanic notebook starting kit?

@kegl
Contributor Author

kegl commented Nov 23, 2020

Yes, @albertcthomas is right, but the snippet in the doc is cleaner now. I'm doing this decomposition in every kit now, see for example line 36 here https://github.com/ramp-kits/optical_network_modelling/blob/master/optical_network_modelling_starting_kit.ipynb. This snippet is even simpler than the one in the doc but less general: it only works when the Predictions class does nothing with the input numpy array, which is the case most of the time (regression and classification). Feel free to reuse it.
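For reference, the decomposition looks roughly like this (a sketch assuming a standard problem.py that exposes get_train_data, get_test_data, workflow and score_types; the real snippets are in the doc page and the notebook linked above):

import problem  # the kit's problem.py, importable from the kit directory

X_train, y_train = problem.get_train_data()
X_test, y_test = problem.get_test_data()

# train and predict through the workflow; scoring directly on the arrays
# only works when Predictions does nothing with the input numpy array
trained_workflow = problem.workflow.train_submission(
    'submissions/starting_kit', X_train, y_train)
y_pred = problem.workflow.test_submission(trained_workflow, X_test)

score_type = problem.score_types[0]
print(score_type.name, score_type(y_test, y_pred))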

@albertcthomas
Collaborator

albertcthomas commented Nov 23, 2020

This page https://paris-saclay-cds.github.io/ramp-docs/ramp-workflow/advanced/scoring.html now contains two code snippets that you can use to call lower-level elements of the workflow and emulate a simple train/test and cross-validation loop. @LudoHackathon, do you have a suggestion for what else would be useful? E.g. an example notebook in the library?

The page does a good job of showing how you can call the different elements (and thus play with them, do plots, ...).

  1. For better visibility we might clearly say that there is a command-line interface based on ramp-test and a way of calling the needed functions easily in a Python script (or notebook). Of course we could add an example showing the Python script interface.

  2. More importantly, maybe think of what can break when you go from one interface to the other. For instance, imports from other modules located in the current working directory. This still forces us/the students to work with submission files. I think that using the "scikit-learn kits" eases the transfer of your scikit-learn estimator from your exploratory Python script/notebook to a submission file and makes sure that this works in most cases. I let @agramfort confirm this :)

  3. Instead of

from rampwf.utils import assert_submission
assert_submission(submission='starting_kit')

we could have something like

from rampwf import ramp_test
ramp_test(submission='starting_kit')

Debugging is a pain etc.

For debugging with the command line, I have to say that I rely a lot on adding a breakpoint where I want to enter the debugger. However, this cannot be done post-mortem, unlike %debug in IPython or Jupyter. For this we could have a --pdb or --trace flag as in pytest. But it's true that it's easier to try things and play with your models/pipelines when not using the command line.
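Roughly, what a --pdb-style flag could do (just a sketch around assert_submission; no such flag exists today):

# hypothetical post-mortem wrapper, mirroring what pytest --pdb does
import pdb
import sys

from rampwf.utils import assert_submission


def run_with_post_mortem(**kwargs):
    try:
        assert_submission(**kwargs)
    except Exception:
        # drop into the debugger at the frame that raised, like %debug in IPython
        pdb.post_mortem(sys.exc_info()[2])
        raise


run_with_post_mortem(submission='starting_kit')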

@albertcthomas
Collaborator

albertcthomas commented Nov 23, 2020

use your favorite env to inspect / debug / run (vscode, notebook, google colab etc.)
giving the user a more pythonic way to play with it, without having to use ramp-test as a command line

This is an important point. 2 or 3 years ago I was rarely using the command line and I always preferred staying in a Python environment. Users should be able to use their favorite tool to play with their models, and we should make sure that at the end it will work when calling ramp-test on the command line.

@kegl
Contributor Author

kegl commented Nov 23, 2020

  1. OK
  2. no comment
  3. OK. In fact we may put the focus on the Python call and tell users to use the command-line ramp-test as a final unit test, the same way one would use pytest. I think the cleanest way would be to have ramp_test defined in https://github.com/paris-saclay-cds/ramp-workflow/blob/advanced/rampwf/utils/cli/testing.py and main would just call ramp_test with the exact same signature (see the sketch below). In this way it's certain that the two calls do the same thing.
  4. I prefer not adding the command line feature if everything can be done from the python call.
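Sketch of what I mean for 3 (names and options are indicative only, assuming the CLI stays click-based):

# rampwf/utils/testing.py (sketch): the logic lives in a plain, importable function
def ramp_test(submission='starting_kit', ramp_kit_dir='.', ramp_data_dir='.',
              save_output=False):
    """Train and test one submission; callable from a script or a notebook."""
    # the body of the current command would move here unchanged
    ...


# rampwf/utils/cli/testing.py (sketch): the command only forwards its arguments
import click


@click.command()
@click.option('--submission', default='starting_kit')
@click.option('--ramp-kit-dir', default='.')
@click.option('--ramp-data-dir', default='.')
@click.option('--save-output', is_flag=True)
def main(submission, ramp_kit_dir, ramp_data_dir, save_output):
    ramp_test(submission=submission, ramp_kit_dir=ramp_kit_dir,
              ramp_data_dir=ramp_data_dir, save_output=save_output)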

@albertcthomas
Collaborator

albertcthomas commented Nov 23, 2020

3. I prefer not adding the command line feature if everything can be done from the python call.

is this for 4. and --pdb?

@agramfort
Contributor

agramfort commented Nov 23, 2020 via email

@kegl
Contributor Author

kegl commented Nov 24, 2020

import imp
feature_extractor = imp.load_source(
    '', 'submissions/starting_kit/feature_extractor.py')
fe = feature_extractor.FeatureExtractor()
classifier = imp.load_source(
    '', 'submissions/starting_kit/classifier.py')
clf = classifier.Classifier()

is to me too complex and should be avoided. We have a way suggested by @kegl based on the rampwf function.

I'm not sure what you mean here. We're using import_module_from_source now.
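For reference, the import_module_from_source way looks roughly like this (a sketch from memory; check the actual import path and signature in rampwf.utils.importing):

from rampwf.utils.importing import import_module_from_source

# load a submission file as a module and instantiate its class
classifier = import_module_from_source(
    'submissions/starting_kit/classifier.py', 'classifier')
clf = classifier.Classifier()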

@agramfort
Contributor

agramfort commented Nov 24, 2020 via email

@kegl
Contributor Author

kegl commented Nov 24, 2020

3. I prefer not adding the command line feature if everything can be done from the python call.

is this for 4. and --pdb?

yes

@gabriel-hurtado
Collaborator

Another feature that would be nice to have: an option to separate what is saved from what is printed to the console.
This would make it possible to save extensive metrics without flooding the terminal.
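As an illustration of the idea with the standard library only (not an existing rampwf option):

# sketch: send everything to a file, only the short summary to the terminal
import logging

logger = logging.getLogger('ramp')
logger.setLevel(logging.DEBUG)

file_handler = logging.FileHandler('full_metrics.log')
file_handler.setLevel(logging.DEBUG)        # extensive metrics go to the file

console_handler = logging.StreamHandler()
console_handler.setLevel(logging.INFO)      # the terminal only sees the summary

logger.addHandler(file_handler)
logger.addHandler(console_handler)

logger.debug('per-epoch metrics: %s', {'epoch': 3, 'loss': 0.12})  # file only
logger.info('bagged test score: %.3f', 0.87)                       # both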

@kegl
Contributor Author

kegl commented Jan 28, 2021

Partial fit for models where e.g. the number of trees or the number of epochs is a hyperparameter. This would mainly be a feature used by hyperopt (killing trainings early), but it may also be useful as a CLI parameter.
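As an illustration, the kind of loop this would enable (a sketch with scikit-learn's warm_start, not a rampwf API):

# grow a forest a few trees at a time so a hyperopt loop can stop it early
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)
clf = RandomForestClassifier(n_estimators=10, warm_start=True, random_state=0)
clf.fit(X, y)

for n_estimators in (20, 40, 80):
    clf.n_estimators = n_estimators
    clf.fit(X, y)                 # only the additional trees are trained
    if clf.score(X, y) > 0.99:    # an early-killing criterion would go here
        break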

@kegl
Contributor Author

kegl commented Feb 4, 2021

Standardized LaTeX tables computed from saved scores. Probably two steps: first collect all scores (of selected submissions and data labels) into a well-designed pandas table, then provide a set of tools to create LaTeX tables, scores with confidence intervals, and also paired tests. I especially like the plots and score presentation in https://link.springer.com/article/10.1007/s10994-018-5724-2.
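A rough sketch of the two steps with pandas (the column names and values are made up):

import pandas as pd

# step 1: collect saved scores into one tidy table
scores = pd.DataFrame([
    # hypothetical rows read from training_output/fold_<i>/scores.csv
    {'submission': 'starting_kit', 'fold': 0, 'rmse': 0.52},
    {'submission': 'starting_kit', 'fold': 1, 'rmse': 0.55},
    {'submission': 'my_model', 'fold': 0, 'rmse': 0.41},
    {'submission': 'my_model', 'fold': 1, 'rmse': 0.44},
])

# step 2: aggregate and export to LaTeX
table = scores.groupby('submission')['rmse'].agg(['mean', 'std'])
print(table.to_latex(float_format='%.3f'))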

@albertcthomas
Collaborator

albertcthomas commented Feb 26, 2021

When RAMP is used for developing models for a problem, we may want to tag certain versions of a submission, and even problem.py, together with the scores. One idea is to use git tags. For example, after running ramp-test ... --save-output, one could run another script that git adds problem.py, the submission files, and the scores in training_output/fold_<i>, commit and tag with a user-defined tag (plus maybe a prefix indicating that it is a scoring tag, so later we may automatically search for all such tags).

It would be great to have a look at MLflow; @agramfort pointed it out to me. There are some parts that we could use, for instance the tracking one.

@martin1tab
Contributor

martin1tab commented Mar 11, 2021

  1. When loading the data in RAMP, it seems the training data is read twice. When the data is big, this is a bit slow.
  2. Is it possible to parallelize the CV process?

  1. Yes, for the moment the training data is read twice, since X_train, y_train, X_test, y_test = assert_data(
    ramp_kit_dir, ramp_data_dir, data_label) is called twice in the testing.py module.
    The same issue appears with the 'problem' variable, which is loaded 5 times.
    It would be possible to fix this by making the testing module object oriented: attributes corresponding to each of these variables (X_train, X_test, ...) could be created and we would not need to repeat the calls (a lighter caching alternative is sketched below).
    But do we agree to add more object-oriented code?

  2. Yes, it is.
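Sketch of the caching alternative mentioned in 1 (the import paths are to be checked; this is not how testing.py currently works):

# memoize the expensive calls so repeated uses read the data and problem once
from functools import lru_cache

from rampwf.utils.testing import assert_data, assert_read_problem


@lru_cache(maxsize=None)
def get_data(ramp_kit_dir, ramp_data_dir, data_label=None):
    return assert_data(ramp_kit_dir, ramp_data_dir, data_label)


@lru_cache(maxsize=None)
def get_problem(ramp_kit_dir):
    return assert_read_problem(ramp_kit_dir)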

@rth rth mentioned this issue Jun 25, 2021