Inform submission about evaluation step #719
Comments
Hi Niccolo, |
Hi @priyakasimbeg, thanks for looking into this! I'll explain better. In classical weight averaging (Polyak Averaging, SWA, LAWA...), we collect a copy of model_parameters once in a while, aggregate those checkpoints, and use them for evaluation only, while training keeps progressing. Under the current code infrastructure, we do not know when an eval is going to occur, so, if we want to implement weight averaging, we need to always return the average in This is problematic because we also have to keep the current model state in memory to resume training from it at the next iteration. This requires a deepcopy of the model plus storing it on the CPU at each iteration (since CUDA memory is limited), which introduces a big overhead for this kind of submission. Pseudocode of weight avg under current codebase:
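A minimal sketch of the pattern described above (not the original pseudocode). The class name, the `avg_every` parameter, and the simplified `update_params(params, step)` signature are hypothetical, and plain dicts of floats with `copy.deepcopy` stand in for model weights and the GPU-to-CPU transfer:

```python
import copy


class AveragingSubmission:
    """Hypothetical submission state; not part of the benchmark API."""

    def __init__(self, avg_every=2):
        self.live_params = None   # weights that training must resume from
        self.avg_params = None    # running average of the collected checkpoints
        self.num_collected = 0
        self.avg_every = avg_every
        self.num_copies = 0       # counts the per-step copies (the overhead)

    def train_step(self, params):
        # Stand-in for a real optimizer step.
        return {k: v - 1.0 for k, v in params.items()}

    def update_params(self, params, step):
        # The harness passes back whatever was returned last time (possibly
        # the average), so the live weights have to be restored first.
        if self.live_params is not None:
            params = self.live_params
        params = self.train_step(params)

        # Once in a while, fold the current weights into the running average.
        if step % self.avg_every == 0:
            self.num_collected += 1
            if self.avg_params is None:
                self.avg_params = copy.deepcopy(params)
            else:
                n = self.num_collected
                self.avg_params = {
                    k: self.avg_params[k] + (params[k] - self.avg_params[k]) / n
                    for k in params
                }

        # Any step might be evaluated, so a copy of the live weights is
        # stashed on every single call.
        self.live_params = copy.deepcopy(params)
        self.num_copies += 1
        return self.avg_params if self.avg_params is not None else params


sub = AveragingSubmission(avg_every=2)
params = {"w": 10.0}
for step in range(1, 5):
    params = sub.update_params(params, step)
```

After four steps the harness holds the averaged weights, while the submission has copied the live weights four times, once per step, although only some of those steps would actually be evaluated.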
My suggestion is to decide whether an evaluation is going to occur before calling This expands the actions that a submission is allowed to take, enlarging the space of possible submissions and allowing for a situation that is common in practice (in a typical ML task, we do know when eval is going to occur, and we can exploit this information). |
Notice that the pull request modifies current submissions adding the extra argument We can make the code backward compatible by putting the call to |
Hi Niccolo, we believe that figuring out before the call of But since you are free to "re-interpret" what a step means, you could also make your step longer (e.g., by doing 10 iterations of updating the parameters as an inner loop). In that way, you can reduce some of the model-transfer costs, right? We are happy to discuss this topic more in our WG meeting. Could you join this Thursday (19:35 – 20:30 German time)? |
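The "longer step" idea above could look roughly like the following sketch. `make_long_step`, `single_step`, and `inner_iters` are hypothetical names used only for illustration; the real `update_params` takes more arguments:

```python
def make_long_step(single_step, inner_iters=10):
    """Wrap a one-iteration update so that one harness step runs several."""

    def long_step(params):
        for _ in range(inner_iters):
            params = single_step(params)
        return params

    return long_step


iterations = {"count": 0}


def single_step(params):
    # Stand-in for one optimizer update.
    iterations["count"] += 1
    return {k: v - 1.0 for k, v in params.items()}


long_step = make_long_step(single_step, inner_iters=10)
params = long_step({"w": 100.0})
```

With this wrapper, any per-step cost such as swapping or copying weights is paid once per ten iterations instead of once per iteration. The price is coarser steps, so the final step may run longer past the time budget.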
Sure, happy to join the WG meeting on Thursday. Making a step longer is indeed a good solution, thanks! Regarding the problem of an overly long final step, we could solve it by checking again after update_params whether there is time remaining, and proceeding to eval only in that case. The only drawback is that we might inform the submission of an imminent eval step that is not going to occur (this would happen at most once). Something like:
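A toy version of that control flow, with a simulated clock instead of real timing. The function `run`, its parameters, and the boolean passed to `update_params` are assumptions made for this sketch and do not reflect the actual benchmark code:

```python
def run(step_cost, eval_period, budget, update_params):
    """Toy harness loop; all times are in (simulated) seconds."""
    elapsed = 0      # accumulated submission time
    last_eval = 0
    evals_at = []
    while elapsed < budget:
        # Decide before the step whether an eval should follow it.
        eval_expected = (elapsed - last_eval) >= eval_period
        update_params(eval_expected)
        elapsed += step_cost
        # Check again after the step: evaluate only if time is left.
        if eval_expected and elapsed < budget:
            evals_at.append(elapsed)
            last_eval = elapsed
    return evals_at


flags = []
evals_at = run(step_cost=3, eval_period=5, budget=17,
               update_params=flags.append)
```

In this run the submission is told twice that an eval is coming (`flags` contains two `True` entries), but only the first eval happens, at t = 9. The second announcement falls on the final step, which exhausts the budget, so that eval is skipped.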
|
From the offline WG discussion, we have decided to leave it as is for now, but in future iterations we could have a prepare_for_eval function or similar to account for this. Moving the check for whether an eval is due to before update_params (so that the submission knows an eval is coming up) is difficult: the benchmark code doesn’t “know” how long the update_params function will take, so a submission might have exceeded the wall-clock budget by then. Some suggestions for workarounds:
|
We plan to discuss feature requests like these in the benchmark code during the WG meeting on Thursday, 9/5. |
While thinking about this issue and the possible changes it required, I created this skeleton version of our current
# Bookkeeping Train State
train_state['is_time_remaining'] = True
train_state['validation_goal_reached'], train_state['test_goal_reached'] = False, False
train_state['last_eval_time'], train_state['accumulated_submission_time'] = 0, 0
train_state['training_complete'] = False # Can be set to true by the submission via the spec.TrainingCompleteError
goals_reached = (
    train_state['validation_goal_reached'] and
    train_state['test_goal_reached'])
# [...]
# Training loop
# Only start training if time remaining and training is not complete
while train_state['is_time_remaining'] and \
    not goals_reached and \
    not train_state['training_complete']:
  # [...]
  update_params()
  # [...]
  # Update submission time and compute (but not check) if time is remaining
  train_state['accumulated_submission_time'] += # [...]
  train_state['is_time_remaining'] = train_state['accumulated_submission_time'] < max_allowed_runtime_sec
  # [...]
  # Check if the submission is eligible for an untimed eval.
  if (time_since_last_eval >= workload.eval_period_time_sec or train_state['training_complete']):
    eval_model()
    # Check if targets are reached.
    train_state["validation_goal_reached"], train_state["test_goal_reached"] = # [...]
I believe the issue is that we don't check whether there is still time left directly before doing the eval. @priyakasimbeg, this is the bug I mentioned in our call yesterday. We currently only check for the time remaining at the beginning of the while loop, then do the submission's To fix this, we can perform the following modifications. This would also rather easily allow a submission's
# Training loop
# Only start training if time remaining and training is not complete
while train_state['is_time_remaining'] and \
    not goals_reached and \
    not train_state['training_complete']:
  # [...]
  update_params()
  # [...]
  # Check if the submission is eligible for an untimed eval.
  if (time_since_last_eval >= workload.eval_period_time_sec or train_state['training_complete']):
    prepare_for_eval()
    # Update submission time and compute if time is remaining
    train_state['accumulated_submission_time'] += # [...]
    train_state['is_time_remaining'] = train_state['accumulated_submission_time'] < max_allowed_runtime_sec
    # Only eval if time is remaining
    if train_state['is_time_remaining']:
      eval_model()
      # Check if targets are reached.
      train_state["validation_goal_reached"], train_state["test_goal_reached"] = # [...]
I guess this logic is needed irrespective of whether we allow a |
tl;dr: We should let the submission know if an evaluation is going to happen at the current step or not.
Description
Currently, there is no easy way for the submission to know if the model returned by update_params is going to be evaluated on the train/valid/test set.
This limits the space of possible submissions, or at least forces them to apply workarounds to infer whether the current step is going to be an evaluation step. (A possible workaround is to keep track of the time since the last evaluation inside the submission, but this adds non-negligible overhead to the submission and deviates from its original goal.)
An example of a submission where it is crucial to know when evaluation is going to occur is Stochastic Weight Averaging.
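The time-tracking workaround mentioned above might be sketched as follows. The class `EvalGuess` and its method are hypothetical names. The submission can only guess here, because the harness keeps its own time accounting, which this timer does not see:

```python
import time


class EvalGuess:
    """Hypothetical helper: guesses inside the submission when an eval is due."""

    def __init__(self, eval_period_sec, clock=time.monotonic):
        self.eval_period_sec = eval_period_sec
        self.clock = clock              # injectable so the sketch can be tested
        self.last_eval_guess = clock()  # start timing from construction

    def eval_probably_due(self):
        now = self.clock()
        if now - self.last_eval_guess >= self.eval_period_sec:
            # Assume the harness evaluates now and restart the local timer.
            self.last_eval_guess = now
            return True
        return False


# Fake clock for illustration: the constructor reads 0, later calls read 3, 6, 9, 12.
ticks = iter([0, 3, 6, 9, 12])
guess = EvalGuess(eval_period_sec=5, clock=lambda: next(ticks))
results = [guess.eval_probably_due() for _ in range(4)]
```

Besides the extra bookkeeping on every step, the guess can drift from what the harness actually does, which is part of why passing the information explicitly looks cleaner.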
Possible solutions
A straightforward solution is to decide if the submission is eligible for an untimed eval before calling update_params, and add an argument to update_params that passes this information to the submission.
The only drawback with this approach is that we don't evaluate every workload.eval_period_time_sec, but a little less frequently (we evaluate every workload.eval_period_time_sec + submission_time_per_step). Assuming that workload.eval_period_time_sec >> submission_time_per_step, this is hopefully not a big difference.
I think this is an important feature, and it would be nice to implement it.
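The change in evaluation frequency can be checked with a small simulation. This is a toy model with a constant step time and no time budget; `eval_times` and its parameters are names invented for this sketch:

```python
def eval_times(step_cost, eval_period, num_steps, decide_before_step):
    """Return the simulated times at which evals happen."""
    elapsed, last_eval, times = 0, 0, []
    for _ in range(num_steps):
        due_before = (elapsed - last_eval) >= eval_period
        elapsed += step_cost  # the update_params call
        due_after = (elapsed - last_eval) >= eval_period
        # Proposed: decision taken before the step; current: after it.
        due = due_before if decide_before_step else due_after
        if due:
            times.append(elapsed)
            last_eval = elapsed
    return times


current = eval_times(step_cost=1, eval_period=10, num_steps=40,
                     decide_before_step=False)
proposed = eval_times(step_cost=1, eval_period=10, num_steps=40,
                      decide_before_step=True)
```

With a 10-second period and 1-second steps, the current ordering evaluates every 10 seconds, while deciding before the step evaluates every 11 seconds, that is, eval_period_time_sec + submission_time_per_step in this idealized setting.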