diff --git a/_posts/2024-07-22-kubeflow-1.9-release.md b/_posts/2024-07-22-kubeflow-1.9-release.md
new file mode 100644
index 0000000..5ddd5bd
--- /dev/null
+++ b/_posts/2024-07-22-kubeflow-1.9-release.md
@@ -0,0 +1,232 @@
+---
+title: "Kubeflow 1.9: New Tools for Model Management and Training Optimization"
+layout: post
+toc: false
+comments: true
+image: images/logo.png
+hide: false
+categories: [release]
+permalink: /kubeflow-1.9-release/
+author: "Kubeflow 1.9 Release Team, Stefano Fioravanzo"
+---
+
+Kubeflow 1.9 significantly simplifies the development, tuning and management of secure machine learning models and LLMs. Highlights include:
+
+- **Model Registry**: Centralized management for ML models, versions, and artifacts.
+- **Fine-Tune APIs for LLMs**: Simplifies fine-tuning of LLMs with custom datasets.
+- **Pipelines**: Consolidation of Tekton and Argo Workflows backends for improved flexibility.
+- **Security Enhancements**: Network policies, Oauth2-proxy, and CVE scanning.
+- **Integration Upgrades**: Improved integrations with Ray, Seldon, BentoML, and KServe for LLM GPU optimizations.
+- **Installation and Documentation**: Streamlined installation, updated platform dependencies, and enhanced documentation.
+
+These updates aim to simplify workflows, improve integration dependencies, and provide Kubernetes-native operational efficiencies for enterprise scale, security, and isolation.
+
+
+## Model Registry
+
+A model registry provides a central catalog for ML model developers to index and manage models, versions, and ML artifacts metadata. It fills a gap between model experimentation and production activities. It provides a central interface for all stakeholders in the ML lifecycle to collaborate on ML models.
+A model registry has been [requested by the community](https://blog.kubeflow.org/kubeflow-user-survey-2023/#:~:text=lifecycle%2C%20followed%20by-,model%20registry%20(44%25),-and%20initial%20setup) for a long time, and we are delighted to introduce it to the Kubeflow ecosystem.
+
+This initial release includes REST APIs and a Python SDK to track model artifacts and model metadata in a standardized format that can be reused across Kubeflow components, for example when deploying inference servers. You can get started by following the [Model Registry tutorial on the Kubeflow website](https://www.kubeflow.org/docs/components/model-registry/overview/), or see a short [demo video](https://www.youtube.com/watch?v=JVxUTkAKsMU) of the Model Registry in action.
+
+We are just getting started. This is an Alpha version and we look forward to feedback. The [model registry working group](https://docs.google.com/document/d/1DmMhcae081SItH19gSqBpFtPfbkr9dFhSMCgs-JKzNo/edit) meets biweekly: you can provide feedback by joining the meeting or directly on the [repository](https://github.com/kubeflow/model-registry/issues).
+
+## Fine-Tune APIs for LLMs
+
+In the rapidly evolving ML/AI landscape, the ability to fine-tune pre-trained models represents a significant leap towards achieving custom solutions with less effort and time. Fine-tuning with custom datasets allows practitioners to adapt large language models (LLMs) to their specific needs.
+
+However, fine-tuning tasks often require extensive manual intervention, including the configuration of training environments and the distribution of data across nodes. The new [Fine-Tune API](https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/) aims to simplify this process, offering an easy-to-use Python interface that abstracts away the complexity involved in setting up and executing fine-tuning tasks on distributed systems.
+
+By providing this API, Training Operator not only simplifies the user experience for ML practitioners but also leverages its existing infrastructure for distributed training. You can take advantage of Kubernetes’ ability to dynamically schedule GPUs, thus saving on compute resources and cost. Training Operator also gives you fault-tolerance guarantees, protecting your training jobs from cluster node failures.
+
+## Pipelines
+
+### v1 Feature Parity
+
+We made significant progress towards KFP v1 feature parity by adding more Kubernetes resources to the Pipelines code with the new kfp-kubernetes 1.2.0 Python package. We encourage every KFP user to test the new V2 functionality and plan their migration from V1 to V2. Some features still need to be ported over to V2; please help us identify what's missing by opening a new issue in the KFP repository.
+
+### Argo Workflows and Tekton Backends Consolidation
+
+The Pipelines Tekton backend has been [merged](https://github.com/kubeflow/pipelines/pull/10678) into the main Kubeflow Pipelines repository. You can now choose which workflow engine to use from the same Pipelines version. This demonstrates the extensibility and flexibility of the KFP v2 architecture, which should encourage other contributors to bring support for other workflow engines.
+
+Both Argo Workflows and Tekton provide unique advantages. Argo Workflows is known for its simplicity and ease of use, making it a popular choice for many users. Tekton offers extensive customization options with its pipeline definitions and reusable components, which can be advantageous for integrating into various CI/CD systems. Depending on your specific requirements and preferences, you can leverage the strengths of either Argo Workflows or Tekton to optimize your machine learning workflows.
+
+In this [blog post](https://developer.ibm.com/blogs/awb-tekton-optimizations-for-kubeflow-pipelines-2-0/), you can find more details about the benefits of running KFP with either Tekton or Argo Workflows.
+
+### Argo Workflows Upgrade
+
+The Kubeflow Pipelines Argo Workflows backend has been [upgraded to 3.4.16](https://github.com/kubeflow/pipelines/issues/10469). This upgrade moves the supported version closer to the latest upstream version and resolves many CVEs. The previous minor version was no longer being patched by the Argo community, so many security issues had accumulated over time.
+
+## Katib
+
+Kubeflow 1.9 ships with Katib 0.17, which brings [official support](https://github.com/kubeflow/katib/pull/2315) for ARM64, getting us one step closer to full ARM64 coverage.
+
+Data scientists who submit training jobs with the Python SDK can now set the [algorithm settings](https://github.com/kubeflow/katib/pull/2227) and [environment variables](https://github.com/kubeflow/katib/pull/2235) from the tune method. Previously, you had to rely directly on Kubernetes CRD submission for these. You can also take advantage of the latest features from TensorFlow 2.16 and PyTorch 2.2. The team also worked to resolve [environmental conflicts](https://github.com/kubeflow/katib/issues/2346) that prevented the Katib Python SDK from being installed alongside the Kubeflow Python SDK.
+
+There are many additional improvements and bug fixes. Check out the full changelog [here](https://github.com/kubeflow/katib/blob/master/CHANGELOG.md).
+
+## Central Dashboard
+
+This release brings several improvements to the Kubeflow Central Dashboard, including:
+
+- Styling improvements to the sidebar, including [grouping](https://github.com/kubeflow/kubeflow/pull/7583) all Kubeflow Pipelines links to reduce clutter
+- Significant [improvements](https://github.com/kubeflow/kubeflow/pull/7582) to the “manage contributors” page, including the ability to manage contributors for all profiles that you own, and to see which profiles you have access to even when you are not the owner
+- Support for external services to [parse](https://github.com/kubeflow/kubeflow/pull/7138) the current profile (namespace), by sending the namespace selector value to non-iframed applications
+- Significant [updates](https://github.com/kubeflow/kubeflow/pull/7578) to dependencies to reduce CVEs
+
+![Kubeflow notebook images](../images/2024-07-22-kubeflow-1.9-release/dashboard.png)
+
+## Notebooks
+
+With this release, we provide [significant updates](https://github.com/kubeflow/kubeflow/pull/7590) to all example notebook images, including PyTorch 2.3.0, TensorFlow 2.15.1 and many other library updates. While you can continue to use the old images, we recommend updating to get the latest ML libraries.
+
+Additionally, notebook images now run with a [non-root SecurityContext](https://github.com/kubeflow/kubeflow/pull/7622), which improves security.
+
+Take a look at the [changelog](https://github.com/kubeflow/kubeflow/releases/tag/v1.9.0) for a full list of bug fixes and improvements.
+
+While this release was light on new Notebooks features, the Working Group is hard at work on an exciting new project: we are actively developing Notebooks V2, with contributions from various companies, in the [new repository](https://github.com/kubeflow/notebooks/tree/notebooks-v2). Take a look [here](https://github.com/kubeflow/kubeflow/issues/7156) and join our Working Group meetings to get involved!
+
+
+## Kubeflow Platform (Security and Manifests)
+
+### Security
+
+#### Network Policies
+
+Network policies are enabled for the Kubeflow core services as a second layer of defense before Istio authorization policies. This gives administrators a better network overview and segmentation while also enforcing common enterprise security guidelines.
+You can read more about the current implementation and architecture [here](https://github.com/kubeflow/manifests/tree/master/common/networkpolicies).
+
+#### Authentication
+
+Oauth2-proxy replaces oidc-authservice, which brings improved token-based authentication. Machine learning engineers can now use tokens instead of insecure passwords for CI/CD automation of Kubeflow deployment and maintenance (e.g. using GitHub Actions).
+You can read more about the current implementation and architecture [here](https://github.com/kubeflow/manifests/tree/master/common/oidc-client/oauth2-proxy).
+
+#### CVE Scanning
+
+With this release we are introducing [automated CVE scanning](https://github.com/kubeflow/manifests/blob/master/hack/trivy_scan.py) with [Trivy](https://github.com/aquasecurity/trivy) on the manifests [master branch](https://github.com/kubeflow/manifests/blob/master/.github/workflows/trivy.yaml). We appreciate contributions to reduce the number of CVEs; the Security Working Group needs help to build a more secure platform. You can find more details about our security scanning process and disclosure policy here. Here is a summary from June 25th:
+
+![Kubeflow notebook images](../images/2024-07-22-kubeflow-1.9-release/CVE_table.png)
+
+You can find a detailed Security WG roadmap [here](https://github.com/kubeflow/manifests/issues/2598).
+
+### Manifests
+
+#### Installation and documentation improvements
+
+The [documentation](https://github.com/kubeflow/manifests?tab=readme-ov-file#upgrading-and-extending) has been improved and now contains guidelines for administrators on upgrading and extending the Kubeflow Platform. New users can now install Kubeflow on their laptop in just a few minutes.
+
+Platform dependency updates:
+
+| Component | Kubernetes | Kustomize | Istio | Dex | Cert-Manager | Knative |
+| --- | --- | --- | --- | --- | --- | --- |
+| KF 1.9 Version | 1.27 - 1.29 | 5.2.1+ | 1.22.1 | 2.39.1 | 1.14.5 | 1.12.4 |