From Zero to One: Quickly Building a Local Kubeflow AI/Machine Learning Platform on K3s

shida_csdn 2024-09-04 09:01:02 Views: 85

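Background

Kubeflow is an open-source, Kubernetes-native framework for developing, managing, and running machine learning workloads. It supports many popular machine learning frameworks, such as PyTorch and TensorFlow. This article shows how to set up a local Kubeflow machine learning platform on a Mac.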

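Note: this article installs the deployKF distribution. The local environment is intended for development and testing only and must not be used in production!

For other Kubeflow distributions, see the official documentation: https://www.kubeflow.org/docs/started/installing-kubeflow/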

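Environment:

OS: macOS 13.1 (amd64)

Docker Desktop: v4.15.0

Although K3s itself needs few resources, the Kubeflow suite has many components, so raise Docker's resource allocation to keep Pods from getting stuck in Pending during installation.

Recommended Docker resource settings: 8 CPU cores, 10 GB memory, 40 GB disk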
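To check the allocation from a terminal rather than the Docker Desktop settings page, something like the following may help. It is only a sketch: `resources_sufficient` is a made-up helper name, the thresholds are the recommendations above, and disk size is not checked.

```shell
#!/usr/bin/env bash
# Succeeds when the given CPU count and memory (in bytes) meet the given minimums.
# Usage: resources_sufficient <cpus> <mem_bytes> <min_cpus> <min_mem_gib>
# `resources_sufficient` is a hypothetical helper, not part of Docker.
resources_sufficient() {
  local cpus="$1" mem_bytes="$2" min_cpus="$3" min_mem_gib="$4"
  # Round to the nearest GiB, because the VM may report a little less than the configured amount.
  local mem_gib=$(( (mem_bytes + 536870912) / 1073741824 ))
  [ "$cpus" -ge "$min_cpus" ] && [ "$mem_gib" -ge "$min_mem_gib" ]
}

# Example, assuming the Docker daemon is running:
#   read -r cpus mem < <(docker info --format '{{.NCPU}} {{.MemTotal}}')
#   resources_sufficient "$cpus" "$mem" 8 10 || echo "increase Docker's CPU/memory allocation"
```

The comparison is kept separate from the `docker info` call so it can be tried with plain numbers first.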

Installation and deployment steps

1. Install the required CLI tools

brew install bash argocd jq k3d kubectl kustomize
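Before moving on, it can be worth confirming that every tool really is on the PATH. A minimal sketch, where `missing_tools` is a name invented for this example:

```shell
#!/usr/bin/env bash
# Print the name of every tool that cannot be found on PATH.
# Returns non-zero if at least one tool is missing.
# `missing_tools` is a hypothetical helper, not part of any of the installed CLIs.
missing_tools() {
  local tool missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
      missing=1
    fi
  done
  return "$missing"
}

# Example, using the same tool list as the brew command:
#   missing_tools bash argocd jq k3d kubectl kustomize || echo "install the tools listed above first"
```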

2. Create the Kubernetes cluster

To keep resource usage as low as possible, run the local cluster on K3s, using k3d:

k3d cluster create "kubeflow" --image "rancher/k3s:v1.27.10-k3s2"

Check whether the cluster is ready with the following command:

kubectl get -A pods

A healthy cluster produces output similar to this:

NAMESPACE     NAME                                     READY   STATUS      RESTARTS   AGE
kube-system   local-path-provisioner-957fdf8bc-cj9l5   1/1     Running     0          2m30s
kube-system   coredns-77ccd57875-xzzz4                 1/1     Running     0          2m30s
kube-system   metrics-server-648b5df564-gwnhq          1/1     Running     0          2m30s
kube-system   helm-install-traefik-crd-49l4k           0/1     Completed   0          2m31s
kube-system   helm-install-traefik-xrjtd               0/1     Completed   2          2m31s
kube-system   svclb-traefik-a79cf0ef-lj4td             2/2     Running     0          89s
kube-system   traefik-768bdcdcdd-mr8z8                 1/1     Running     0          89s
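Reading the list by eye works for a handful of Pods, but the same check can be scripted. The sketch below treats a Pod as healthy when it is either Completed or Running with all containers ready. It assumes the six-column layout that `kubectl get -A pods` prints, and `pods_all_ready` is a name made up for this example.

```shell
#!/usr/bin/env bash
# Read `kubectl get -A pods` output on stdin and succeed only if every Pod is
# Completed, or Running with READY showing n/n. Unhealthy Pods go to stderr.
# `pods_all_ready` is a hypothetical helper, not a kubectl feature.
pods_all_ready() {
  awk '
    $1 == "NAMESPACE" { next }   # skip the header row if present
    NF < 4 { next }              # skip blank or malformed lines
    {
      n++
      split($3, r, "/")
      ok = ($4 == "Completed") || ($4 == "Running" && r[1] == r[2])
      if (!ok) {
        bad = 1
        print "not ready: " $1 "/" $2 " (" $4 ", " $3 ")" > "/dev/stderr"
      }
    }
    END { exit ((bad || n == 0) ? 1 : 0) }   # no Pods at all also counts as not ready
  '
}

# Example:
#   kubectl get -A pods | pods_all_ready && echo "cluster is ready"
```

A plain `kubectl wait --for=condition=Ready pods --all` would not work here, because the two `helm-install-*` Pods finish as Completed and never report Ready.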

3. Deploy ArgoCD

ArgoCD is a GitOps continuous delivery tool for Kubernetes; deployKF uses it to automate the Kubeflow deployment.

git clone -b main https://github.com/deployKF/deployKF.git

cd deployKF/argocd-plugin

chmod +x ./install_argocd.sh

bash ./install_argocd.sh

Check whether ArgoCD is ready with the following command:

kubectl get pod -n argocd

A healthy installation produces output similar to this:

NAME                                                READY   STATUS    RESTARTS   AGE
argocd-redis-69f8795dbd-7v4nn                       1/1     Running   0          106s
argocd-applicationset-controller-7b9c4dfb77-7gsf2   1/1     Running   0          106s
argocd-notifications-controller-756764ddd5-jw92c    1/1     Running   0          106s
argocd-server-86f64667bc-7nt7d                      1/1     Running   0          105s
argocd-application-controller-0                     1/1     Running   0          105s
argocd-dex-server-9b5c6dccd-2p779                   1/1     Running   0          106s
argocd-repo-server-5b55578f7c-sfzf4                 2/2     Running   0          105s
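Instead of polling `kubectl get pod` by hand, `kubectl wait` can block until the ArgoCD Deployments are available. The wrapper below is a sketch: `wait_for_deployments` and the 300s default timeout are choices made for this example.

```shell
#!/usr/bin/env bash
# Block until every Deployment in a namespace reports the Available condition,
# or until the timeout expires.
# `wait_for_deployments` is a hypothetical wrapper around `kubectl wait`.
wait_for_deployments() {
  local namespace="$1" timeout="${2:-300s}"
  kubectl wait deployment --all \
    --namespace "$namespace" \
    --for=condition=Available \
    --timeout="$timeout"
}

# Example:
#   wait_for_deployments argocd 600s && kubectl get pod -n argocd
```

This covers only Deployments. `argocd-application-controller-0` has the ordinal name typical of a StatefulSet Pod, so it would still need a separate check.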

4. Install the Kubeflow suite

Prepare the following file: deploykf-app-of-apps.yaml

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: deploykf-app-of-apps
  namespace: argocd
  labels:
    app.kubernetes.io/name: deploykf-app-of-apps
    app.kubernetes.io/part-of: deploykf
spec:
  project: "default"
  source:
    ## source git repo configuration
    ##  - we use the 'deploykf/deploykf' repo so we can read its 'sample-values.yaml'
    ##    file, but you may use any repo (even one with no files)
    ##
    repoURL: "https://github.com/deployKF/deployKF.git"
    targetRevision: "v0.1.4"
    path: "."

    ## plugin configuration
    ##
    plugin:
      name: "deploykf"
      parameters:

        ## the deployKF generator version
        ##  - available versions: https://github.com/deployKF/deployKF/releases
        ##
        - name: "source_version"
          string: "0.1.4"

        ## paths to values files within the `repoURL` repository
        ##  - the values in these files are merged, with later files taking precedence
        ##  - we strongly recommend using 'sample-values.yaml' as the base of your values
        ##    so you can easily upgrade to newer versions of deployKF
        ##
        - name: "values_files"
          array:
            - "./sample-values.yaml"

        ## a string containing the contents of a values file
        ##  - this parameter allows defining values without needing to create a file in the repo
        ##  - these values are merged with higher precedence than those defined in `values_files`
        ##
        - name: "values"
          string: |
            ##
            ## This demonstrates how you might structure overrides for the 'sample-values.yaml' file.
            ## For a more comprehensive example, see the 'sample-values-overrides.yaml' in the main repo.
            ##
            ## Notes:
            ##  - YAML maps are RECURSIVELY merged across values files
            ##  - YAML lists are REPLACED in their entirety across values files
            ##  - Do NOT include empty/null sections, as this will remove ALL values from that section.
            ##    To include a section without overriding any values, set it to an empty map: `{}`
            ##

            ## --------------------------------------------------------------------------------
            ## argocd
            ## --------------------------------------------------------------------------------
            argocd:
              namespace: argocd
              project: default

            ## --------------------------------------------------------------------------------
            ## kubernetes
            ## --------------------------------------------------------------------------------
            kubernetes:
              { } # <-- REMOVE THIS, IF YOU INCLUDE VALUES UNDER THIS SECTION!

            ## --------------------------------------------------------------------------------
            ## deploykf-dependencies
            ## --------------------------------------------------------------------------------
            deploykf_dependencies:

              ## --------------------------------------
              ## cert-manager
              ## --------------------------------------
              cert_manager:
                { } # <-- REMOVE THIS, IF YOU INCLUDE VALUES UNDER THIS SECTION!

              ## --------------------------------------
              ## istio
              ## --------------------------------------
              istio:
                { } # <-- REMOVE THIS, IF YOU INCLUDE VALUES UNDER THIS SECTION!

              ## --------------------------------------
              ## kyverno
              ## --------------------------------------
              kyverno:
                { } # <-- REMOVE THIS, IF YOU INCLUDE VALUES UNDER THIS SECTION!

            ## --------------------------------------------------------------------------------
            ## deploykf-core
            ## --------------------------------------------------------------------------------
            deploykf_core:

              ## --------------------------------------
              ## deploykf-auth
              ## --------------------------------------
              deploykf_auth:
                { } # <-- REMOVE THIS, IF YOU INCLUDE VALUES UNDER THIS SECTION!

              ## --------------------------------------
              ## deploykf-istio-gateway
              ## --------------------------------------
              deploykf_istio_gateway:
                { } # <-- REMOVE THIS, IF YOU INCLUDE VALUES UNDER THIS SECTION!

              ## --------------------------------------
              ## deploykf-profiles-generator
              ## --------------------------------------
              deploykf_profiles_generator:
                { } # <-- REMOVE THIS, IF YOU INCLUDE VALUES UNDER THIS SECTION!

            ## --------------------------------------------------------------------------------
            ## deploykf-opt
            ## --------------------------------------------------------------------------------
            deploykf_opt:

              ## --------------------------------------
              ## deploykf-minio
              ## --------------------------------------
              deploykf_minio:
                { } # <-- REMOVE THIS, IF YOU INCLUDE VALUES UNDER THIS SECTION!

              ## --------------------------------------
              ## deploykf-mysql
              ## --------------------------------------
              deploykf_mysql:
                { } # <-- REMOVE THIS, IF YOU INCLUDE VALUES UNDER THIS SECTION!

            ## --------------------------------------------------------------------------------
            ## kubeflow-tools
            ## --------------------------------------------------------------------------------
            kubeflow_tools:

              ## --------------------------------------
              ## katib
              ## --------------------------------------
              katib:
                { } # <-- REMOVE THIS, IF YOU INCLUDE VALUES UNDER THIS SECTION!

              ## --------------------------------------
              ## notebooks
              ## --------------------------------------
              notebooks:
                { } # <-- REMOVE THIS, IF YOU INCLUDE VALUES UNDER THIS SECTION!

              ## --------------------------------------
              ## pipelines
              ## --------------------------------------
              pipelines:
                { } # <-- REMOVE THIS, IF YOU INCLUDE VALUES UNDER THIS SECTION!

  destination:
    server: "https://kubernetes.default.svc"
    namespace: "argocd"
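The manifest names the deployKF version twice: once as `targetRevision` (the Git tag that `sample-values.yaml` is read from) and once as the `source_version` parameter. When bumping versions it is easy to change one and forget the other. The sketch below compares the two. Treating a mismatch as an error is an assumption of this example rather than a documented deployKF rule, and `manifest_versions_match` is an invented name.

```shell
#!/usr/bin/env bash
# Succeed when `targetRevision` (minus a leading "v") equals the `source_version`
# parameter in the given manifest. Prints both values for inspection.
# `manifest_versions_match` is a hypothetical helper; the "must match" rule is an assumption.
manifest_versions_match() {
  awk '
    $1 == "targetRevision:" { rev = $2; gsub(/"/, "", rev); sub(/^v/, "", rev) }
    /name: *"?source_version"?/ { in_param = 1; next }   # the value follows on a later "string:" line
    in_param && $1 == "string:" { ver = $2; gsub(/"/, "", ver); in_param = 0 }
    END {
      printf "targetRevision=%s source_version=%s\n", rev, ver
      exit ((rev != "" && rev == ver) ? 0 : 1)
    }
  ' "$1"
}

# Example:
#   manifest_versions_match ./deploykf-app-of-apps.yaml || echo "versions differ"
```

This is a line-oriented text check, not a YAML parser, so it only understands the layout used in the manifest above.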

Apply the manifest to create the ArgoCD application:

kubectl apply -f ./deploykf-app-of-apps.yaml

To view the ArgoCD status in the web UI, forward the server port:

kubectl port-forward --namespace "argocd" svc/argocd-server 8090:https

Open https://localhost:8090/ in a browser. The username is admin, and the password can be retrieved with the following command:

echo $(kubectl -n argocd get secret/argocd-initial-admin-secret \
  -o jsonpath="{.data.password}" | base64 -d)
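The `base64 -d` at the end is needed because Kubernetes returns the values under a Secret's `.data` base64-encoded. A tiny illustration with a made-up value (`decode_secret_field` is a name invented here):

```shell
#!/usr/bin/env bash
# Decode one base64-encoded Secret field, such as the value returned by
# `kubectl get secret ... -o jsonpath="{.data.password}"`.
# `decode_secret_field` is a hypothetical helper; the sample value is made up.
decode_secret_field() {
  printf '%s' "$1" | base64 -d
}

decode_secret_field "c2VjcmV0"   # prints: secret
echo
```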

Because the applications depend on one another, use the following script to sync them in order:

git clone -b main https://github.com/deployKF/deployKF.git

cd deployKF/scripts

chmod +x ./sync_argocd_apps.sh

bash ./sync_argocd_apps.sh

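The script is idempotent, so if it fails you can re-run it until the deployment succeeds. After a successful deployment, the list of running Pods looks similar to this: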

NAMESPACE                 NAME                                                 READY   STATUS    RESTARTS       AGE
argocd                    argocd-redis-69f8795dbd-x5wtv                        1/1     Running   5 (17m ago)    105m
argocd                    argocd-server-86f64667bc-zfm7m                       1/1     Running   4 (17m ago)    73m
argocd                    argocd-repo-server-5b55578f7c-x26zz                  2/2     Running   10 (17m ago)   91m
argocd                    argocd-notifications-controller-756764ddd5-2fqbr     1/1     Running   5 (17m ago)    89m
argocd                    argocd-dex-server-9b5c6dccd-bl86m                    1/1     Running   5 (17m ago)    91m
argocd                    argocd-application-controller-0                      1/1     Running   5 (17m ago)    91m
argocd                    argocd-applicationset-controller-7b9c4dfb77-hph2r    1/1     Running   5 (17m ago)    105m
cert-manager              cert-manager-c688c56f-w4jts                          1/1     Running   5 (17m ago)    109m
cert-manager              trust-manager-78766fd9bd-zd5zf                       1/1     Running   5 (17m ago)    90m
cert-manager              cert-manager-webhook-d45447457-q6cf8                 1/1     Running   6 (17m ago)    109m
cert-manager              cert-manager-cainjector-59d694bcc7-mrcvg             1/1     Running   6 (17m ago)    109m
deploykf-auth             oauth2-proxy-5fd9888b79-tpnrt                        2/2     Running   11 (16m ago)   73m
deploykf-auth             dex-68c8bf56b9-78d5g                                 2/2     Running   8 (17m ago)    73m
deploykf-dashboard        profile-controller-5575767c76-vshp2                  2/2     Running   8 (17m ago)    73m
deploykf-dashboard        kfam-api-75b64c9645-sjfcq                            2/2     Running   10 (17m ago)   98m
deploykf-dashboard        central-dashboard-6b5d9574dc-fmlt4                   2/2     Running   10 (17m ago)   98m
deploykf-istio-gateway    deploykf-gateway-6ddf8947cc-qz55g                    1/1     Running   5 (17m ago)    98m
deploykf-minio            deploykf-minio-568b877668-w2wct                      2/2     Running   5 (17m ago)    52m
deploykf-mysql            deploykf-mysql-0                                     1/1     Running   5 (17m ago)    109m
istio-system              istiod-7b9b6df595-jbztw                              1/1     Running   5 (17m ago)    91m
kube-system               svclb-deploykf-gateway-7f7cba3a-kkskn                3/3     Running   15 (17m ago)   100m
kube-system               metrics-server-648b5df564-gwnhq                      1/1     Running   9 (17m ago)    5h43m
kube-system               local-path-provisioner-957fdf8bc-cj9l5               1/1     Running   7 (17m ago)    5h43m
kube-system               coredns-77ccd57875-xzzz4                             1/1     Running   7 (17m ago)    5h43m
kube-system               traefik-768bdcdcdd-mr8z8                             1/1     Running   7 (17m ago)    5h42m
kube-system               svclb-traefik-a79cf0ef-6ksjm                         2/2     Running   10 (17m ago)   100m
kubeflow                  katib-controller-75858c4ddf-hwvkx                    1/1     Running   8 (17m ago)    95m
kubeflow                  ml-pipeline-ui-68b7f6586d-qtjp5                      2/2     Running   15 (17m ago)   94m
kubeflow                  ml-pipeline-persistenceagent-68bbd65f98-tsnqn        2/2     Running   10 (17m ago)   94m
kubeflow                  katib-ui-d4df8bdb6-2x75p                             2/2     Running   10 (17m ago)   95m
kubeflow                  ml-pipeline-6445d9fb77-dxgv4                         2/2     Running   24 (16m ago)   94m
kubeflow                  admission-webhook-deployment-789dc56fbf-z7cj8        1/1     Running   5 (17m ago)    94m
kubeflow                  metadata-writer-6f95b9588c-fmx4s                     2/2     Running   8 (17m ago)    73m
kubeflow                  notebook-controller-deployment-649cf9b976-vnvwd      2/2     Running   10 (17m ago)   95m
kubeflow                  training-operator-7cf5c66858-jf5sr                   1/1     Running   3 (17m ago)    43m
kubeflow                  tensorboards-web-app-deployment-778466f5f6-dmrks     2/2     Running   2 (17m ago)    43m
kubeflow                  tensorboard-controller-deployment-644f57dd7c-zlxnw   3/3     Running   24 (17m ago)   92m
kubeflow                  ml-pipeline-scheduledworkflow-578475988-kwz27        2/2     Running   10 (17m ago)   94m
kubeflow                  volumes-web-app-deployment-588d46bb75-95g6b          2/2     Running   2 (17m ago)    42m
kubeflow                  ml-pipeline-viewer-crd-6857ccc85c-zl895              2/2     Running   10 (17m ago)   94m
kubeflow                  metadata-grpc-deployment-566d54d578-wwj9n            2/2     Running   23 (16m ago)   94m
kubeflow                  ml-pipeline-visualizationserver-7b45b7fd56-s4pxh     2/2     Running   15 (17m ago)   94m
kubeflow                  cache-server-66d7586749-prmkq                        2/2     Running   10 (17m ago)   94m
kubeflow                  jupyter-web-app-deployment-9c8c779c-hcqvr            2/2     Running   15 (17m ago)   91m
kubeflow                  katib-db-manager-6998f5bdd8-lrs77                    1/1     Running   5 (17m ago)    95m
kubeflow                  metadata-envoy-deployment-b48db5966-542nh            1/1     Running   5 (17m ago)    94m
kubeflow-argo-workflows   argo-workflow-controller-79fc5c6895-2g26t            2/2     Running   10 (17m ago)   98m
kubeflow-argo-workflows   argo-server-6d97fb7649-lsfdw                         2/2     Running   5 (16m ago)    73m
kyverno                   kyverno-cleanup-controller-6cb4d5848-hh8nm           1/1     Running   5 (17m ago)    109m
kyverno                   kyverno-admission-controller-964c74c7d-frknb         1/1     Running   5 (17m ago)    109m
kyverno                   kyverno-background-controller-796f77c79f-nwhrs       1/1     Running   5 (17m ago)    109m
kyverno                   kyverno-reports-controller-6d6d98fc96-z7qjv          1/1     Running   5 (17m ago)    109m
kyverno                   kyverno-admission-controller-964c74c7d-hgtc2         1/1     Running   4 (17m ago)    109m
kyverno                   kyverno-admission-controller-964c74c7d-x744h         1/1     Running   5 (17m ago)    109m
team-1                    ml-pipeline-visualizationserver-677c86b748-nbrr5     2/2     Running   2 (17m ago)    73m
team-1                    ml-pipeline-ui-artifact-7749b4f5f6-ld7kl             2/2     Running   10 (17m ago)   94m
team-1-prod               ml-pipeline-visualizationserver-677c86b748-hqwsh     2/2     Running   2 (17m ago)    73m
team-1-prod               ml-pipeline-ui-artifact-7749b4f5f6-hl6gk             2/2     Running   10 (17m ago)   94m
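Because `sync_argocd_apps.sh` can safely be re-run after a failure, the retrying can be automated with a small wrapper. This is a generic sketch: `retry`, the attempt count, and the delay are choices made for this example and are not part of deployKF.

```shell
#!/usr/bin/env bash
# Run a command until it succeeds, at most <max_attempts> times,
# sleeping <delay_seconds> between attempts.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
# `retry` is a hypothetical helper for illustration.
retry() {
  local max="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example: try the sync script up to 5 times, waiting 30 seconds between runs.
#   retry 5 30 bash ./sync_argocd_apps.sh
```

Retrying blindly is only reasonable because the script is idempotent; a non-idempotent command should not be wrapped this way.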

Once synchronization completes, the ArgoCD UI shows all 20 applications as synced.

5. Access the dashboard

Set up port forwarding:

kubectl port-forward \
  --namespace "deploykf-istio-gateway" \
  svc/deploykf-gateway 8080:http 8443:https

Because the Istio gateway routes requests to target services based on the Host header, add the following entries to your local /etc/hosts file:

127.0.0.1 deploykf.example.com

127.0.0.1 argo-server.deploykf.example.com

127.0.0.1 minio-api.deploykf.example.com

127.0.0.1 minio-console.deploykf.example.com
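If you re-create the environment often, appending these lines by hand tends to leave duplicates behind. The sketch below adds an entry only when the hostname is not already present. `ensure_hosts_entry` is an invented helper, and it takes the file path as an argument so it can be tried on a copy before touching /etc/hosts.

```shell
#!/usr/bin/env bash
# Append "<ip> <host>" to a hosts file unless <host> already appears as a
# hostname on a non-comment line.
# Usage: ensure_hosts_entry <file> <ip> <host>
# `ensure_hosts_entry` is a hypothetical helper for illustration.
ensure_hosts_entry() {
  local file="$1" ip="$2" host="$3"
  if awk -v h="$host" '
       $1 !~ /^#/ {
         for (i = 2; i <= NF; i++) {
           if ($i ~ /^#/) break        # stop at an inline comment
           if ($i == h) found = 1
         }
       }
       END { exit (found ? 0 : 1) }
     ' "$file"; then
    return 0                           # already present, nothing to do
  fi
  printf '%s %s\n' "$ip" "$host" >> "$file"
}

# Example (writing to /etc/hosts requires running the script with sudo):
#   for h in deploykf.example.com argo-server.deploykf.example.com \
#            minio-api.deploykf.example.com minio-console.deploykf.example.com; do
#     ensure_hosts_entry /etc/hosts 127.0.0.1 "$h"
#   done
```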

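Open https://deploykf.example.com:8443/ in a browser and log in with one of the following accounts:

Administrator: username admin@example.com, password admin

User 1: username user1@example.com, password user1

User 2: username user2@example.com, password user2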

6. Run Jupyter

More features remain to be explored…

References

https://www.deploykf.org/guides/local-quickstart/


