Failed to pull image "nvcr.io/nvidia/driver:550.90.07-debian12" in nvidia-driver-daemonset Pod #1212

Open
utsumi-fj opened this issue Jan 20, 2025 · 0 comments


In an A100 GPU environment, installing the GPU Operator fails with the error Failed to pull image "nvcr.io/nvidia/driver:550.90.07-debian12" in the nvidia-driver-daemonset Pod.

The nvidia-driver-daemonset Pod is stuck in ImagePullBackOff, as shown below:

# kubectl get pod -n gpu-operator
NAME                                                              READY   STATUS             RESTARTS   AGE
gpu-feature-discovery-sz7tw                                       0/1     Init:0/1           0          34m
gpu-operator-1737329855-node-feature-discovery-gc-54495f58tffkd   1/1     Running            0          34m
gpu-operator-1737329855-node-feature-discovery-master-77876hrzc   1/1     Running            0          34m
gpu-operator-1737329855-node-feature-discovery-worker-dcsjb       1/1     Running            0          34m
gpu-operator-7b8f77b698-rp5q8                                     1/1     Running            0          34m
nvidia-container-toolkit-daemonset-z6pzk                          0/1     Init:0/1           0          34m
nvidia-dcgm-exporter-vpdqp                                        0/1     Init:0/1           0          34m
nvidia-device-plugin-daemonset-gqvpn                              0/1     Init:0/1           0          34m
nvidia-driver-daemonset-pwnhc                                     0/1     ImagePullBackOff   0          34m
nvidia-operator-validator-gz7j4                                   0/1     Init:0/4           0          34m

The image nvcr.io/nvidia/driver:550.90.07-debian12 cannot be resolved because the tag is not found in the registry:

# kubectl describe pod nvidia-driver-daemonset-pwnhc -n gpu-operator
(...snip...)
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  36m                  default-scheduler  Successfully assigned gpu-operator/nvidia-driver-daemonset-pwnhc to kubeflow-control-plane
  Normal   Pulled     36m                  kubelet            Container image "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.10" already present on machine
  Normal   Created    36m                  kubelet            Created container k8s-driver-manager
  Normal   Started    36m                  kubelet            Started container k8s-driver-manager
  Normal   Pulling    34m (x4 over 36m)    kubelet            Pulling image "nvcr.io/nvidia/driver:550.90.07-debian12"
  Warning  Failed     34m (x4 over 36m)    kubelet            Failed to pull image "nvcr.io/nvidia/driver:550.90.07-debian12": rpc error: code = NotFound desc = failed to pull and unpack image "nvcr.io/nvidia/driver:550.90.07-debian12": failed to resolve reference "nvcr.io/nvidia/driver:550.90.07-debian12": nvcr.io/nvidia/driver:550.90.07-debian12: not found
  Warning  Failed     34m (x4 over 36m)    kubelet            Error: ErrImagePull
  Warning  Failed     34m (x5 over 35m)    kubelet            Error: ImagePullBackOff
  Normal   BackOff    71s (x152 over 35m)  kubelet            Back-off pulling image "nvcr.io/nvidia/driver:550.90.07-debian12"
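The NotFound error from the kubelet suggests the tag itself is absent from nvcr.io rather than an authentication or network problem. A quick way to confirm this from any machine with registry access (a sketch; assumes docker with manifest support, or skopeo, is installed — neither is part of the original report):

```shell
# Returns non-zero and prints a "not found"-style error if the tag
# does not exist in the registry:
docker manifest inspect nvcr.io/nvidia/driver:550.90.07-debian12

# Alternatively, list all published tags for the driver repository and
# look for one matching the node OS (requires skopeo):
skopeo list-tags docker://nvcr.io/nvidia/driver
```

If no debian12-suffixed tag appears for 550.90.07, the pull failure is expected: the GPU Operator derives the tag from the detected node OS, and only OS/version combinations actually published on nvcr.io will resolve.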

The installation command:

# helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --version=v24.6.2
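One possible workaround, if the NVIDIA driver is (or can be) installed directly on the host rather than via the driver container, is to disable the containerized driver with the chart's driver.enabled value. This is a sketch, not a verified fix for this environment; whether it applies depends on the host driver being present and compatible:

```shell
# Sketch: skip the driver DaemonSet entirely and rely on a
# host-installed driver (driver.enabled is a standard
# gpu-operator chart value):
helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --version=v24.6.2 \
    --set driver.enabled=false
```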

Environment:

  • kind cluster with single node
  • k8s version: v1.31.0
  • node OS: Ubuntu 22.04.5 LTS
  • Linux Kernel version: 5.15.0-130-generic
  • A100 GPU passthrough
  • GPU Operator version: v24.6.2
# kubectl get node
NAME                     STATUS   ROLES           AGE     VERSION
kubeflow-control-plane   Ready    control-plane   2d22h   v1.31.0
# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
# uname -r
5.15.0-130-generic