Error scenario
Run the following command to start a container with GPU access enabled:
docker run -d -p 6008:6008 --gpus all registry.cn-hangzhou.aliyuncs.com/fastgpt_docker/m3e-large-api:latest
Starting the container fails with the following error:
Error response from daemon: could not select device driver "" with capabilities: [[gpu]]
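When the container is launched from a script, this particular failure can be recognized from Docker's error output. The following is a minimal sketch, assuming the error text matches the message shown above; the function name `is_gpu_driver_error` is hypothetical and not part of Docker.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) when the given text contains
# the "could not select device driver" error reported by the Docker daemon.
is_gpu_driver_error() {
    printf '%s\n' "$1" | grep -q 'could not select device driver'
}

# Sample input copied from the error message in this post.
msg='Error response from daemon: could not select device driver "" with capabilities: [[gpu]]'
if is_gpu_driver_error "$msg"; then
    echo "GPU runtime not available to Docker"
fi
```

In a real script the message would come from the captured stderr of `docker run` rather than a literal string.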
Root cause analysis
1. Check the GPU device driver:
(chatglm3) root@master:~/work/# nvidia-smi
Tue Feb 27 01:38:39 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100S-PCIE-32GB          Off | 00000000:23:00.0 Off |                    0 |
| N/A   68C    P0              50W / 250W |  12352MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    592394      C   .../envs/langchain-chatchat/bin/python    12348MiB |
+---------------------------------------------------------------------------------------+
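For scripted checks, the driver version can be pulled out of the `nvidia-smi` banner instead of being read by eye. The sketch below assumes the banner contains the text `Driver Version:` as real `nvidia-smi` output does; the function name `driver_version` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read nvidia-smi output on stdin and print the
# driver version from the banner line, or nothing if it is absent.
driver_version() {
    sed -n 's/.*Driver Version: *\([0-9][0-9.]*\).*/\1/p'
}

# Usage on a GPU host: nvidia-smi | driver_version
# Demonstrated here with a banner line like the one in this post.
echo '| NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.2 |' | driver_version
```

An empty result suggests that the driver is not loaded or that `nvidia-smi` printed an error instead of the usual table.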
2. Update Docker:
sudo apt update
sudo apt upgrade docker-ce
sudo systemctl restart docker
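The `--gpus` option of `docker run` was introduced in Docker 19.03, so it can be worth confirming the installed version before and after the upgrade. The following is a rough sketch; the function name `supports_gpus_flag` is hypothetical, and it relies on `sort -V` from GNU coreutils.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when version $1 is at least 19.03,
# the release that added the --gpus option to docker run.
supports_gpus_flag() {
    min="19.03"
    # sort -V (GNU coreutils) orders version strings; the smaller one comes first.
    lowest=$(printf '%s\n%s\n' "$min" "$1" | sort -V | head -n 1)
    [ "$lowest" = "$min" ]
}

# Usage on a Docker host:
#   supports_gpus_flag "$(docker version --format '{{.Server.Version}}')"
if supports_gpus_flag "24.0.7"; then
    echo "--gpus is supported"
fi
```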
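3. Likely cause: the GPU driver is working and Docker is up to date, so the most probable cause is that the NVIDIA Container Toolkit is not installed, or that Docker has not been configured to use the NVIDIA runtime.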
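Solution
Steps to install and configure the NVIDIA Container Toolkit:
1. Install the NVIDIA container runtime by following the guide in the official NVIDIA documentation.
2. Configure the Docker daemon to use the NVIDIA runtime. Follow the official documentation to edit the Docker daemon configuration file so that Docker correctly recognizes GPU resources.
Installation
1. Configure the production repository: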
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
Optionally, enable the experimental packages in the repository:
sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list
2. Update the package list from the repository:
sudo apt-get update
3. Install the NVIDIA Container Toolkit package:
sudo apt-get install -y nvidia-container-toolkit
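After the package is installed, a quick check that its binaries are on the PATH can catch a failed or partial install. The sketch below is only illustrative: the function name `missing_commands` is hypothetical, and it assumes that `nvidia-ctk` and `nvidia-container-runtime` (both referenced later in this post) are what the package provides.

```shell
#!/bin/sh
# Hypothetical helper: print the names of the given commands that are
# NOT found on PATH, one per line; succeed only if none are missing.
missing_commands() {
    status=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
            status=1
        fi
    done
    return "$status"
}

# Both binaries should be found after a successful install.
if missing_commands nvidia-ctk nvidia-container-runtime; then
    echo "NVIDIA Container Toolkit binaries found"
else
    echo "The commands listed above are missing"
fi
```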
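Configuration
Prerequisites:
A supported container engine is installed (Docker, containerd, CRI-O, Podman)
The NVIDIA Container Toolkit is installed
1. Configure the container runtime with the nvidia-ctk command: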
sudo nvidia-ctk runtime configure --runtime=docker
Check the Docker configuration file (/etc/docker/daemon.json) and make sure GPU support is enabled:
{
    "registry-mirrors": [
        "https://xxxx.mirror.aliyuncs.com"
    ],
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
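The manual inspection above can also be scripted. This is a minimal sketch that assumes the configuration lives at Docker's default path, /etc/docker/daemon.json; the function name `has_nvidia_runtime` is hypothetical, and the check is a plain text match, not a full JSON parse.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the file $1 mentions both a
# "runtimes" section and the nvidia-container-runtime binary.
# This is a plain text match, not a full JSON parse.
has_nvidia_runtime() {
    grep -q '"runtimes"' "$1" && grep -q 'nvidia-container-runtime' "$1"
}

# Path assumed to be Docker's default configuration file.
config=/etc/docker/daemon.json
if [ -f "$config" ] && has_nvidia_runtime "$config"; then
    echo "nvidia runtime is registered in $config"
else
    echo "nvidia runtime not found in $config"
fi
```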
2. Restart the Docker daemon:
sudo systemctl restart docker
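Once Docker has restarted, running `nvidia-smi` inside a throwaway container is a simple way to confirm that the original error is gone. The sketch below is an assumption-laden example: the function name `verify_gpu_container` is hypothetical, and the default image `ubuntu` is chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: run nvidia-smi inside a temporary container.
# The image defaults to "ubuntu" (an assumption for illustration);
# pass another image name as the first argument to override it.
verify_gpu_container() {
    docker run --rm --gpus all "${1:-ubuntu}" nvidia-smi
}

# Usage on the configured host:
#   verify_gpu_container && echo "GPU is visible inside containers"
```

If the container prints the same GPU table as the host, the m3e-large-api container from the beginning of this post should start with `--gpus all` as well.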