Failed to pull image "nvcr.io/nvidia/driver:550.90.07-debian12" in nvidia-driver-daemonset Pod #1212

Open
utsumi-fj opened this issue Jan 20, 2025 · 0 comments


In an A100 GPU environment, installing the GPU Operator fails with the error Failed to pull image "nvcr.io/nvidia/driver:550.90.07-debian12" in the nvidia-driver-daemonset Pod.

The nvidia-driver-daemonset Pod is stuck in ImagePullBackOff, as shown below:

# kubectl get pod -n gpu-operator
NAME                                                              READY   STATUS             RESTARTS   AGE
gpu-feature-discovery-sz7tw                                       0/1     Init:0/1           0          34m
gpu-operator-1737329855-node-feature-discovery-gc-54495f58tffkd   1/1     Running            0          34m
gpu-operator-1737329855-node-feature-discovery-master-77876hrzc   1/1     Running            0          34m
gpu-operator-1737329855-node-feature-discovery-worker-dcsjb       1/1     Running            0          34m
gpu-operator-7b8f77b698-rp5q8                                     1/1     Running            0          34m
nvidia-container-toolkit-daemonset-z6pzk                          0/1     Init:0/1           0          34m
nvidia-dcgm-exporter-vpdqp                                        0/1     Init:0/1           0          34m
nvidia-device-plugin-daemonset-gqvpn                              0/1     Init:0/1           0          34m
nvidia-driver-daemonset-pwnhc                                     0/1     ImagePullBackOff   0          34m
nvidia-operator-validator-gz7j4                                   0/1     Init:0/4           0          34m

The image nvcr.io/nvidia/driver:550.90.07-debian12 cannot be resolved because the tag is not found in the registry:

# kubectl describe pod nvidia-driver-daemonset-pwnhc -n gpu-operator
(...snip...)
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  36m                  default-scheduler  Successfully assigned gpu-operator/nvidia-driver-daemonset-pwnhc to kubeflow-control-plane
  Normal   Pulled     36m                  kubelet            Container image "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.10" already present on machine
  Normal   Created    36m                  kubelet            Created container k8s-driver-manager
  Normal   Started    36m                  kubelet            Started container k8s-driver-manager
  Normal   Pulling    34m (x4 over 36m)    kubelet            Pulling image "nvcr.io/nvidia/driver:550.90.07-debian12"
  Warning  Failed     34m (x4 over 36m)    kubelet            Failed to pull image "nvcr.io/nvidia/driver:550.90.07-debian12": rpc error: code = NotFound desc = failed to pull and unpack image "nvcr.io/nvidia/driver:550.90.07-debian12": failed to resolve reference "nvcr.io/nvidia/driver:550.90.07-debian12": nvcr.io/nvidia/driver:550.90.07-debian12: not found
  Warning  Failed     34m (x4 over 36m)    kubelet            Error: ErrImagePull
  Warning  Failed     34m (x5 over 35m)    kubelet            Error: ImagePullBackOff
  Normal   BackOff    71s (x152 over 35m)  kubelet            Back-off pulling image "nvcr.io/nvidia/driver:550.90.07-debian12"
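The NotFound error from the kubelet suggests the tag itself is absent from nvcr.io rather than an authentication or network problem. A quick way to confirm this from any machine with registry access (a sketch; assumes docker with manifest support, or skopeo, is installed — neither is part of the original report):

```shell
# Returns non-zero and prints a "not found"-style error if the tag
# does not exist in the registry:
docker manifest inspect nvcr.io/nvidia/driver:550.90.07-debian12

# Alternatively, list all published tags for the driver repository and
# look for one matching the node OS (requires skopeo):
skopeo list-tags docker://nvcr.io/nvidia/driver
```

If no debian12-suffixed tag appears for 550.90.07, the pull failure is expected: the GPU Operator derives the tag from the detected node OS, and only OS/version combinations actually published on nvcr.io will resolve.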

The installation command:

# helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --version=v24.6.2
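One possible workaround, if the NVIDIA driver is (or can be) installed directly on the host rather than via the driver container, is to disable the containerized driver with the chart's driver.enabled value. This is a sketch, not a verified fix for this environment; whether it applies depends on the host driver being present and compatible:

```shell
# Sketch: skip the driver DaemonSet entirely and rely on a
# host-installed driver (driver.enabled is a standard
# gpu-operator chart value):
helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --version=v24.6.2 \
    --set driver.enabled=false
```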

Environment:

  • kind cluster with single node
  • k8s version: v1.31.0
  • node OS: Ubuntu 22.04.5 LTS
  • Linux Kernel version: 5.15.0-130-generic
  • A100 GPU passthrough
  • GPU Operator version: v24.6.2
# kubectl get node
NAME                     STATUS   ROLES           AGE     VERSION
kubeflow-control-plane   Ready    control-plane   2d22h   v1.31.0
# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
# uname -r
5.15.0-130-generic