[GSoC] Project6: Push-based Metrics Collection for Katib #2340

Electronic-Waste · 2024-06-03T05:10:46Z

Goal

The project aims to provide a Python SDK API interface for users to push metrics to Katib DB directly.

The current Metrics Collector is pull-based, which raises several problems: design questions such as how often to scrape the metrics, performance overhead from running too many sidecar containers, and the restriction that the development environment must support sidecar containers and admission webhooks.

Thus, we decided to implement a new API in the Katib Python SDK that offers users a push-based way to store metrics directly in the Katib DB and resolves the issues raised by pull-based metrics collection.
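
To illustrate the push-based idea, here is a minimal, self-contained sketch. It does not use the Katib SDK: `InMemoryMetricsStore` and `push_metrics` are hypothetical stand-ins for the Katib DB and for a reporting call, invented only for this example. The point is that the training loop itself decides when metrics are written, so no sidecar has to scrape logs at some chosen frequency.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union


@dataclass
class InMemoryMetricsStore:
    """Hypothetical stand-in for the Katib DB (not part of Katib)."""

    rows: List[Tuple[str, str, str, float]] = field(default_factory=list)

    def write(self, trial_name: str, name: str, value: str, timestamp: float) -> None:
        self.rows.append((trial_name, name, value, timestamp))


def push_metrics(
    store: InMemoryMetricsStore,
    trial_name: str,
    metrics: Dict[str, Union[int, float, str]],
) -> None:
    """Hypothetical push-style reporter: the caller sends metrics itself."""
    if not metrics:
        raise ValueError("metrics must not be empty")
    now = time.time()
    for name, value in metrics.items():
        # Reject values that cannot be interpreted as numbers.
        float(value)
        store.write(trial_name, name, str(value), now)


# The training loop pushes metrics at the moment they are produced.
store = InMemoryMetricsStore()
for epoch in range(3):
    loss = 1.0 / (epoch + 1)  # placeholder for a real training step
    push_metrics(store, "trial-0", {"loss": loss, "epoch": epoch})
```

In the real feature, the new `report_metrics` interface in the Python SDK plays the role of `push_metrics`; see the PRs listed below for its actual signature and behavior.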

What I did in GSoC Project & Ongoing Works

This issue tracks the progress of developing push-based metrics collection for Katib during the GSoC coding phase.

I raised numerous PRs for the Katib and Training-Operator projects. Some of them are related to my GSoC project, and others improve the completeness of unit tests, simplify package dependencies, fix the compatibility of UI components, etc.

Also, I raised some issues, not only to describe the problems and bugs I met during the coding period, but also to suggest future enhancement directions for Katib and Training-Operator.

PRs concerned with the project:

Convert GSoC proposal to KEP: [GSoC] KEP for Project 6: Push-based Metrics Collection for Katib #2328
Add new parameter in tune function: [GSoC] Add New Parameter in tune #2369
New interface report_metrics in Python SDK: [GSoC] New Interface report_metrics in Python SDK #2371
Compatibility changes in trial controller: [GSoC] Compatibility Changes in Trial Controller #2394
Perform unit tests and e2e tests:
    [SDK] fix grpc related bugs in Python SDK #2398
    [SDK] test: Add e2e test for tune function. #2399
Create documents for the new feature, including API specification, usage, examples, etc.:
    doc(katib): update push-based metrics collector. website#3844
    [GSoC] Provide a PyTorch MNIST Example for Push-based Metrics Collection #2437
    [GSoC] Summary for Project 6: Push-based Metrics Collection blog#155

Other PRs:

Issues I raised:

Please let me know if you have any suggestions @kubeflow/wg-automl-leads !

The Lesson I learned during the Project

Think Twice, Code Once: @andreyvelich taught me that we should think through the API specification and all the related details before coding. This significantly reduces the workload of the coding period and avoids big refactors of the project. Meanwhile, my understanding of Katib gradually became clearer through repeated rounds of rethinking and redesigning the architecture.
Dive into the Source Code: Engineering projects nowadays are extremely complex and take much effort to understand. The best way to get familiar with a project is to dive into the source code and run several examples.
Communication: Communication is the most important thing when we collaborate with others. Expressing your ideas precisely and making yourself easy to understand are important skills, not only in the open source community but also in other settings such as companies and group work.

In the End

Special Thanks:

To my mentors @andreyvelich @johnugeorge @tenzen-y, especially @andreyvelich. Your deep knowledge of the code base and the industry impressed me a lot. Thanks for your timely responses to my PRs and for always attending the weekly meetings to resolve my pending problems. I benefited a lot from your precious guidance.
To @gaocegege. You recommended me to the Kubeflow Community. Thanks for your patient answers to my endless silly questions.
To Google. Thanks for offering such a precious opportunity for me to begin my journey in the open source world!

I hold a firm belief that every small step counts, and that everybody in the community is unique and of great significance. There is no doubt that our joint efforts will contribute to the flourishing of our Kubeflow Community, make it the world's best community for managing the AI lifecycle on Kubernetes, and attract much more attention from the industry. Then more and more newcomers will pour in and work alongside us.

Again, I'll continue to contribute to Kubeflow.

Electronic-Waste · 2024-06-03T05:28:32Z

/assign
/area gsoc

google-oss-prow bot assigned Electronic-Waste Jun 3, 2024

google-oss-prow bot added the area/gsoc label Jun 3, 2024

Electronic-Waste mentioned this issue Jun 5, 2024

[GSoC] KEP for Project 6: Push-based Metrics Collection for Katib #2328

Merged

1 task

This was referenced Jun 23, 2024

[GSoC] Add New Parameter in tune #2369

Merged

[GSoC] New Interface report_metrics in Python SDK #2371

Merged

Electronic-Waste mentioned this issue Jul 25, 2024

[SDK] grpc-related bugs in Python SDK #2395

Closed

This was referenced Aug 1, 2024

[SDK] test: Add e2e test for tune function. #2399

Merged

[GSoC] Compatibility Changes in Trial Controller #2394

Merged

[SDK] fix grpc related bugs in Python SDK #2398

Merged

This was referenced Sep 4, 2024

doc(katib): update push-based metrics collector. kubeflow/website#3844

Open

[SDK] Add Some Checks for metrics Field in report_metrics() Interface #2421

Closed

Electronic-Waste mentioned this issue Sep 28, 2024

[GSoC] Summary for Project 6: Push-based Metrics Collection kubeflow/blog#155

Open

andreyvelich mentioned this issue Oct 9, 2024

[Enhancement Request] Metrics Collector Push-based Implementation #577

Closed
