Deploying a Chinese llama3 Model on Kubernetes
Deploy a custom llama3 model on Kubernetes with Ollama and LobeChat.
In this post, we download a Chinese llama3 model from Hugging Face and deploy it to a Kubernetes cluster with Ollama and LobeChat.
Environment
The test environment is a single-node k8s cluster with no GPU; thanks to GGML, llama3 runs quite smoothly on CPU alone.
$ kubectl get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
maao-dev Ready control-plane 224d v1.28.2 192.168.2.2 <none> Ubuntu 20.04.4 LTS 5.15.0-75-generic containerd://1.6.18
$ ollama -v
ollama version is 0.1.31
$ uname -a
Linux maao-dev 5.15.0-75-generic #82~20.04.1-Ubuntu SMP Wed Jun 7 19:37:37 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Building and uploading the Ollama model
Before building the Ollama model, install Ollama by following the official documentation.
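On Linux this boils down to the one-line install script from the Ollama docs (shown here for convenience; check the official documentation for other platforms):
$ curl -fsSL https://ollama.com/install.sh | sh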
Next, download the model from zhouzr/Llama3-8B-Chinese-Chat-GGUF on Hugging Face. I chose the Llama3-8B-Chinese-Chat.q5_k_m.GGUF quantization to save memory.
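One way to fetch just that file is with huggingface-cli; this is a sketch assuming the huggingface_hub package is installed, and you can equally download the file from the Hugging Face web UI:
$ pip install -U huggingface_hub
$ huggingface-cli download zhouzr/Llama3-8B-Chinese-Chat-GGUF Llama3-8B-Chinese-Chat.q5_k_m.GGUF --local-dir .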
Next, build the Ollama model. Create the following Modelfile and put it in the same directory as Llama3-8B-Chinese-Chat.q5_k_m.GGUF:
$ ls
Llama3-8B-Chinese-Chat.q5_k_m.GGUF Modelfile
$ cat Modelfile
FROM ./Llama3-8B-Chinese-Chat.q5_k_m.GGUF
TEMPLATE """
{{ if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>
{{ .Response }}<|eot_id|>
"""
SYSTEM """你是一个疯狂的科学家大卫,你总是为了毁灭宇宙而努力。"""
The TEMPLATE simply reuses the llama3 template from the ollama library, and SYSTEM adds a system message (roughly: "You are a mad scientist named David, always working to destroy the universe.").
Next, create the llama3-Chinese:8B model:
$ ollama create llama3-Chinese:8B -f Modelfile
Once it has been created successfully, list the models:
$ ollama list
NAME ID SIZE MODIFIED
llama3-Chinese:8B 1cfcd2becfd0 5.7 GB 17 hours ago
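Before uploading, it is worth a quick local sanity check, for example:
$ ollama run llama3-Chinese:8B "你好,请介绍一下你自己"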
Next, upload the model to a repository. Ollama models are registry-compatible: you can store them the way you store container images and push or pull models from a registry, although Ollama currently cannot store models in Harbor.
Start a Docker registry:
$ docker run -d -p 5000:5000 --name registry registry:2
Then copy and push the model to the registry:
$ ollama cp llama3-Chinese:8B localhost:5000/ollama/llama3-chinese:8b
$ ollama push localhost:5000/ollama/llama3-chinese:8b
The ollama cp command fails for me on macOS with
Error: destination "llama3-Chinese:8B" is invalid
but works fine on Linux; the cause is unclear for now.
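You can confirm the push succeeded with the standard Docker Registry v2 catalog API; it should return something like the following:
$ curl http://localhost:5000/v2/_catalog
{"repositories":["ollama/llama3-chinese"]}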
Deploying Ollama
Create a namespace for testing:
$ kubectl create ns ollama
Create the PVC used by the Ollama deployment to store models. Since this is a single-node cluster, I use local-storage for the demo. The StorageClass, PV, and PVC are as follows:
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ollama-pv
spec:
  capacity:
    storage: 30Gi
  storageClassName: local-storage
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /tmp/ollama-pv
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
  namespace: ollama
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: "30Gi"
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
  name: local-storage
provisioner: kubernetes.io/no-provisioner
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
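Save these manifests to a file and apply them (the file name here is my own choice):
$ kubectl apply -f ollama-storage.yaml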
Check the PVC:
$ kubectl get pvc -A
NAMESPACE NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
ollama ollama-pvc Bound ollama-pv 30Gi RWO local-storage 53s
Create the Ollama Deployment and Service with the following YAML:
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-svc
  namespace: ollama
spec:
  type: NodePort
  ports:
    - port: 11434
      targetPort: 11434
      protocol: TCP
      name: http
  selector:
    app.kubernetes.io/name: ollama
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-deploy
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: ollama
  template:
    metadata:
      labels:
        app.kubernetes.io/name: ollama
    spec:
      containers:
        - name: ollama
          image: "ollama/ollama:0.1.32"
          imagePullPolicy: IfNotPresent
          ports:
            - name: http
              containerPort: 11434
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /
              port: http
          readinessProbe:
            httpGet:
              path: /
              port: http
          resources:
            limits:
              cpu: 8
              memory: 8Gi
            requests:
              cpu: 100m
              memory: 128Mi
          env:
            # Allow requests to the ollama server from any origin
            - name: OLLAMA_ORIGINS
              value: "*"
            # Listen address of the ollama server
            - name: OLLAMA_HOST
              value: "0.0.0.0"
          # Ollama stores models under ~/.ollama by default, so mount the volume created above at .ollama
          volumeMounts:
            - name: llm-data
              mountPath: /root/.ollama
      volumes:
        - name: llm-data
          persistentVolumeClaim:
            claimName: ollama-pvc
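Apply the manifest and wait for the rollout to finish (again, the file name is arbitrary):
$ kubectl apply -f ollama-deploy.yaml
$ kubectl -n ollama rollout status deploy/ollama-deploy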
Check the Ollama pod and service:
$ kubectl get po -n ollama
NAME READY STATUS RESTARTS AGE
ollama-deploy-6f69784c57-8bx84 1/1 Running 0 104s
$ kubectl get svc -n ollama
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ollama-svc NodePort 10.97.150.120 <none> 11434:32532/TCP 111s
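As a quick health check, Ollama's root endpoint can be hit through the NodePort (your node IP and port will differ):
$ curl http://192.168.2.2:32532/
Ollama is running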
Next, create a Job that, once the Ollama pod is up, pulls the model we just pushed:
apiVersion: batch/v1
kind: Job
metadata:
  name: ollama-llm-puller
  namespace: ollama
  labels:
    app.kubernetes.io/name: ollama
spec:
  ttlSecondsAfterFinished: 100
  template:
    spec:
      containers:
        - name: llm-puller
          image: alpine
          command:
            - /bin/sh
            - -c
            - |
              set -e
              apk add --no-cache curl
              ollama_service="http://ollama-svc:11434"
              # Alpine's /bin/sh is BusyBox ash, so use POSIX [ ] rather than bash's [[ ]]
              while [ "$(curl -s -o /dev/null -w '%{http_code}' ${ollama_service})" != "200" ]; do
                echo "Waiting for Ollama service to be ready..."
                sleep 5
              done
              echo "Pulling model: llama3-chinese"
              curl -s ${ollama_service}/api/pull -d '{"name": "192.168.2.2:5000/ollama/llama3-chinese:8b", "insecure": true}'
      restartPolicy: Never
Since the pull runs inside the Ollama pod, we replace localhost with the host's IP address, i.e. we pull 192.168.2.2:5000/ollama/llama3-chinese:8b.
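Apply the Job and follow its logs to watch the pull progress (the file name is arbitrary):
$ kubectl apply -f llm-puller-job.yaml
$ kubectl logs -n ollama -f job/ollama-llm-puller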
Check the puller pod:
$ kubectl get po -n ollama -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
ollama-deploy-6f69784c57-8bx84 1/1 Running 0 2m41s 10.0.0.112 maao-dev <none> <none>
ollama-llm-puller-6fp28 1/1 Running 0 15s 10.0.0.187 maao-dev <none> <none>
Once the pull completes, exec into ollama-deploy-6f69784c57-8bx84 and the model is there:
$ kubectl exec -it ollama-deploy-6f69784c57-8bx84 -n ollama bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
root@ollama-deploy-6f69784c57-8bx84:/# ollama list
NAME ID SIZE MODIFIED
192.168.2.2:5000/ollama/llama3-chinese:8b 4c2c30771859 5.7 GB 5 minutes ago
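You can also test generation directly against the Ollama API before wiring up a UI; a minimal non-streaming request looks like this:
$ curl http://192.168.2.2:32532/api/generate -d '{"model": "192.168.2.2:5000/ollama/llama3-chinese:8b", "prompt": "你好", "stream": false}'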
Deploying LobeChat
Create the LobeChat Deployment and Service:
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-webui-svc
  namespace: ollama
  labels:
    app.kubernetes.io/name: ollama-webui
spec:
  type: NodePort
  ports:
    - port: 80
      targetPort: http
      protocol: TCP
      name: http
  selector:
    app.kubernetes.io/name: ollama-webui
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-webui-deploy
  namespace: ollama
  labels:
    app.kubernetes.io/name: ollama-webui
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: ollama-webui
  template:
    metadata:
      labels:
        app.kubernetes.io/name: ollama-webui
    spec:
      containers:
        - name: webui
          image: "lobehub/lobe-chat:latest"
          imagePullPolicy: IfNotPresent
          ports:
            - name: http
              containerPort: 3210
              protocol: TCP
          resources:
            requests:
              cpu: "100m"
              memory: "50Mi"
            limits:
              cpu: "1000m"
              memory: "1Gi"
          env:
            # Address of the backend ollama service
            - name: OLLAMA_PROXY_URL
              value: http://ollama-svc:11434
            # Show only the llama3-chinese model
            - name: OLLAMA_MODEL_LIST
              value: -all,+192.168.2.2:5000/ollama/llama3-chinese:8b=llama3-chinese
          livenessProbe:
            httpGet:
              path: /
              port: http
          readinessProbe:
            httpGet:
              path: /
              port: http
Check the created LobeChat resources:
$ kubectl get svc -n ollama
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ollama-svc NodePort 10.97.150.120 <none> 11434:32532/TCP 122m
ollama-webui-svc NodePort 10.102.119.33 <none> 80:31668/TCP 16s
$ kubectl get pod -n ollama
NAME READY STATUS RESTARTS AGE
ollama-deploy-6f69784c57-8bx84 1/1 Running 0 122m
ollama-webui-deploy-f6f48c59f-n962k 1/1 Running 0 23s
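The UI is exposed via the NodePort below; if your machine cannot reach the node directly, kubectl port-forward works as well, after which the UI is available at http://localhost:8080:
$ kubectl port-forward -n ollama svc/ollama-webui-svc 8080:80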
Open the NodePort in a browser to reach the LobeChat UI, then click the Ollama model to configure it. Choose reset, and our llama3-chinese model appears. (According to this issue, the latest versions should no longer require this model configuration step.)
Finally, we can chat happily with llama3-chinese.
The answer is quite impressive: not a word about destroying the universe. Has it already learned to lie?
Please credit the original source when reposting.