
[WIP] Add e2e test for tune api with LLM hyperparameter optimization #2420

Open · helenxie-bit wants to merge 49 commits into master
Conversation

helenxie-bit
Contributor

What this PR does / why we need it:
This PR adds an e2e test for the tune API, specifically for the scenario of importing external models and datasets for LLM hyperparameter optimization.
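For context, the test is expected to exercise a call along these lines. This is a rough sketch, not a copy of the PR's test code: the HuggingFace parameter classes and their import path are assumptions based on the Kubeflow Training SDK of this period, and the concrete model, dataset, and hyperparameter values are placeholders.

```python
# Sketch of a tune-API call with an external model and dataset.
# The storage_initializer import path and HuggingFace* classes are
# assumptions based on the Kubeflow Training SDK; values are placeholders.
import transformers
import kubeflow.katib as katib
from kubeflow.katib import KatibClient
from kubeflow.storage_initializer.hugging_face import (  # assumed import path
    HuggingFaceModelParams,
    HuggingFaceDatasetParams,
    HuggingFaceTrainerParams,
)

client = KatibClient(namespace="default")
client.tune(
    name="tune-example",
    # Import an external model and dataset from Hugging Face.
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://google-bert/bert-base-cased",
        transformer_type=transformers.AutoModelForSequenceClassification,
    ),
    dataset_provider_parameters=HuggingFaceDatasetParams(
        repo_id="yelp_review_full",
    ),
    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            output_dir="results",
            # The hyperparameter to optimize, expressed as a Katib search space.
            learning_rate=katib.search.double(min=1e-05, max=5e-05),
            num_train_epochs=1,
        ),
    ),
    objective_metric_name="train_loss",
    objective_type="minimize",
    max_trial_count=2,
    parallel_trial_count=1,
)
```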

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

Signed-off-by: helenxie-bit <[email protected]>

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign johnugeorge for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@helenxie-bit
Contributor Author

/area gsoc

@helenxie-bit
Contributor Author

Ref: #2339

@helenxie-bit helenxie-bit changed the title [GSoC] Add e2e test for tune api with LLM hyperparameter optimization [WIP] Add e2e test for tune api with LLM hyperparameter optimization Sep 3, 2024
@google-oss-prow google-oss-prow bot added size/L and removed size/M labels Sep 3, 2024
@helenxie-bit
Contributor Author

helenxie-bit commented Sep 24, 2024

The e2e test for the tune API has been consistently failing due to a "Timeout Error," and I have been investigating the root cause. I set the retain_trials parameter to True and retrieved the logs from the trial pod of the Experiment. The logs revealed that both the pytorch container and the metrics-logger-and-collector container exited with code 137 (128 + 9, i.e., killed with SIGKILL).
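For reference, one way to confirm those exit codes programmatically is the Kubernetes Python client; a minimal sketch, with the trial pod name as a placeholder:

```python
# Sketch: read container exit codes for a trial pod.
# "POD_NAME" is a placeholder for the actual trial pod name.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod(name="POD_NAME", namespace="default")
for cs in pod.status.container_statuses:
    # A killed container surfaces in state.terminated or last_state.terminated.
    term = cs.state.terminated or cs.last_state.terminated
    if term:
        print(cs.name, term.exit_code, term.reason)
```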

When I ran kubectl describe pod $POD_NAME -n default, I noticed the following events. One specific event, "SandboxChanged," stood out as potentially problematic:

Events:
  Type    Reason          Age                    From               Message
  ----    ------          ----                   ----               -------
  ...
  Normal  SandboxChanged  3m (x2 over 3m43s)     kubelet            Pod sandbox changed, it will be killed and re-created.
  ...

However, when I checked the pod logs using kubectl logs $POD_NAME -n default --all-containers, everything appeared normal, and the logs confirmed that "Training is complete."

I also examined the kubelet and container runtime logs. While the kubelet logs provided no additional insights, the container runtime logs displayed the following error, which I believe may be related to the issue:

Sep 29 19:59:04 fv-az1986-610 dockerd[3342]: time="2024-09-29T19:59:04.631799544Z" level=info msg="Container failed to exit within 30s of signal 15 - using the force" container=bfff1b5f24d7ebcdc51d0dabe807e391053c4a4065a404203e266c5341bbfbbe spanID=6c2dd21dc1394346 traceID=738eda3fc86a653490d5534deb664c93

Signal 15 is SIGTERM, so it looks like the runtime force-killed the containers with SIGKILL after the 30-second grace period expired, which would explain the exit code 137 above.

@andreyvelich @tenzen-y Do you have any thoughts on how to resolve this issue?
