
nvidia-operator-validator always in Init:CrashLoopBackOff, but the rest of the components are installed and working correctly #1190

spiner-z opened this issue Dec 31, 2024 · 2 comments

spiner-z commented Dec 31, 2024

HOST INFORMATION

  1. OS and Architecture: Ubuntu 22.04, x86_64
  2. Kubernetes Distribution: Vanilla Kubernetes
  3. Kubernetes Version: v1.31.2
  4. Host Node GPUs: NVIDIA V100, A100
  5. GPU Operator Installation Method: Helm

Steps to reproduce the issue

$ kubectl get pods -n nvidia-gpu-operator

gpu-feature-discovery-8g4pc                           2/2     Running                     
gpu-feature-discovery-j8797                           2/2     Running                     
gpu-feature-discovery-st644                           2/2     Running                     
nvdp-node-feature-discovery-worker-96gzj              1/1     Running                     
nvdp-node-feature-discovery-worker-xxl65              1/1     Running                     
nvdp-node-feature-discovery-worker-zt882              1/1     Running                     
nvidia-container-toolkit-daemonset-5vlk2              1/1     Running                     
nvidia-container-toolkit-daemonset-6chcr              1/1     Running                     
nvidia-container-toolkit-daemonset-rgdxz              1/1     Running                     
nvidia-cuda-validator-6hbzq                           0/1     Completed                   
nvidia-cuda-validator-b6thh                           0/1     Completed                   
nvidia-cuda-validator-wls5c                           0/1     Completed                   
nvidia-dcgm-exporter-589kn                            1/1     Running                     
nvidia-dcgm-exporter-hr66q                            1/1     Running                     
nvidia-dcgm-exporter-phrrd                            1/1     Running                     
nvidia-device-plugin-daemonset-88mbq                  2/2     Running                     
nvidia-device-plugin-daemonset-fm5dn                  2/2     Running                     
nvidia-device-plugin-daemonset-nz2st                  2/2     Running                     
nvidia-operator-validator-s8tfk                       0/1     Init:CrashLoopBackOff       
nvidia-operator-validator-vp6nk                       0/1     Init:CrashLoopBackOff       
nvidia-operator-validator-xdvt4                       0/1     Init:CrashLoopBackOff       

All components except nvidia-operator-validator are healthy: DCGM works normally and I can assign GPUs to pods without any issues.

However, nvidia-operator-validator is stuck in Init:CrashLoopBackOff.
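
For reference, which init container is failing (and its last recorded state) can be read straight from the pod status, e.g. with something like the following (pod name taken from the list above):

$ kubectl get pod nvidia-operator-validator-s8tfk -n nvidia-gpu-operator \
    -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'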

$ kubectl logs nvidia-operator-validator-27lns -n nvidia-gpu-operator -c plugin-validation

time="2024-12-31T06:33:02Z" level=info msg="version: 65c864c1, commit: 65c864c"
time="2024-12-31T06:33:02Z" level=info msg="GPU resources are not yet discovered by the node, retry: 1"
time="2024-12-31T06:33:07Z" level=info msg="GPU resources are not yet discovered by the node, retry: 2"
time="2024-12-31T06:33:12Z" level=info msg="GPU resources are not yet discovered by the node, retry: 3"
time="2024-12-31T06:33:17Z" level=info msg="GPU resources are not yet discovered by the node, retry: 4"
time="2024-12-31T06:33:22Z" level=info msg="GPU resources are not yet discovered by the node, retry: 5"
time="2024-12-31T06:33:27Z" level=info msg="GPU resources are not yet discovered by the node, retry: 6"
time="2024-12-31T06:33:32Z" level=info msg="GPU resources are not yet discovered by the node, retry: 7"
time="2024-12-31T06:33:37Z" level=info msg="GPU resources are not yet discovered by the node, retry: 8"
time="2024-12-31T06:33:42Z" level=info msg="GPU resources are not yet discovered by the node, retry: 9"
time="2024-12-31T06:33:47Z" level=info msg="GPU resources are not yet discovered by the node, retry: 10"
time="2024-12-31T06:33:52Z" level=info msg="GPU resources are not yet discovered by the node, retry: 11"
time="2024-12-31T06:33:57Z" level=info msg="GPU resources are not yet discovered by the node, retry: 12"
time="2024-12-31T06:34:02Z" level=info msg="GPU resources are not yet discovered by the node, retry: 13"
time="2024-12-31T06:34:07Z" level=info msg="GPU resources are not yet discovered by the node, retry: 14"
time="2024-12-31T06:34:12Z" level=info msg="GPU resources are not yet discovered by the node, retry: 15"
time="2024-12-31T06:34:17Z" level=info msg="GPU resources are not yet discovered by the node, retry: 16"
time="2024-12-31T06:34:22Z" level=info msg="GPU resources are not yet discovered by the node, retry: 17"
time="2024-12-31T06:34:27Z" level=info msg="GPU resources are not yet discovered by the node, retry: 18"
time="2024-12-31T06:34:32Z" level=info msg="GPU resources are not yet discovered by the node, retry: 19"
time="2024-12-31T06:34:37Z" level=info msg="GPU resources are not yet discovered by the node, retry: 20"
time="2024-12-31T06:34:42Z" level=info msg="GPU resources are not yet discovered by the node, retry: 21"
time="2024-12-31T06:34:47Z" level=info msg="GPU resources are not yet discovered by the node, retry: 22"
time="2024-12-31T06:34:52Z" level=info msg="GPU resources are not yet discovered by the node, retry: 23"
time="2024-12-31T06:34:57Z" level=info msg="GPU resources are not yet discovered by the node, retry: 24"
time="2024-12-31T06:35:02Z" level=info msg="GPU resources are not yet discovered by the node, retry: 25"
time="2024-12-31T06:35:07Z" level=info msg="GPU resources are not yet discovered by the node, retry: 26"
time="2024-12-31T06:35:12Z" level=info msg="GPU resources are not yet discovered by the node, retry: 27"
time="2024-12-31T06:35:17Z" level=info msg="GPU resources are not yet discovered by the node, retry: 28"
time="2024-12-31T06:35:22Z" level=info msg="GPU resources are not yet discovered by the node, retry: 29"
time="2024-12-31T06:35:27Z" level=info msg="GPU resources are not yet discovered by the node, retry: 30"
time="2024-12-31T06:35:32Z" level=info msg="Error: error validating plugin installation: GPU resources are not discovered by the node"

cdesiniotis (Contributor) commented

Can you provide logs from one of the nvidia-device-plugin pods? The plugin-validation logs suggest that no GPUs are allocatable.
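
For reference, what the node actually advertises can be checked directly, e.g. (node name is a placeholder):

$ kubectl get node <node-name> -o jsonpath='{.status.allocatable}'

Presumably the plugin validator is waiting for an allocatable nvidia.com/gpu resource to show up there.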

spiner-z (Author) commented Jan 3, 2025

Here are logs from one of the nvidia-device-plugin pods:

$ kubectl logs nvidia-device-plugin-daemonset-88mbq -n nvidia-gpu-operator -c nvidia-device-plugin

IS_HOST_DRIVER=true
NVIDIA_DRIVER_ROOT=/
DRIVER_ROOT_CTR_PATH=/host
NVIDIA_DEV_ROOT=/
DEV_ROOT_CTR_PATH=/host
Starting nvidia-device-plugin
I1231 03:16:20.612133      33 main.go:235] "Starting NVIDIA Device Plugin" version=<
	N/A
	commit: unknown
 >
I1231 03:16:20.612219      33 main.go:238] Starting FS watcher for /var/lib/kubelet/device-plugins
I1231 03:16:20.612789      33 main.go:245] Starting OS watcher.
I1231 03:16:20.613228      33 main.go:260] Starting Plugins.
I1231 03:16:20.613265      33 main.go:317] Loading configuration.
I1231 03:16:20.616013      33 main.go:342] Updating config with default resource matching patterns.
I1231 03:16:20.616138      33 main.go:353] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "single",
    "failOnInitError": true,
    "mpsRoot": "/run/nvidia/mps",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": true,
      "deviceListStrategy": [
        "volume-mounts"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/host"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*V100*",
        "name": "nvidia.com/v100"
      },
      {
        "pattern": "*A100*",
        "name": "nvidia.com/a100"
      },
      {
        "pattern": "*T4*",
        "name": "nvidia.com/t4"
      },
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ],
    "mig": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  },
  "imex": {}
}
I1231 03:16:20.616158      33 main.go:356] Retrieving plugins.
I1231 03:16:20.764488      33 server.go:195] Starting GRPC server for 'nvidia.com/v100'
I1231 03:16:20.767158      33 server.go:139] Starting to serve 'nvidia.com/v100' on /var/lib/kubelet/device-plugins/nvidia-v100.sock
I1231 03:16:20.772384      33 server.go:146] Registered device plugin for 'nvidia.com/v100' with Kubelet

$ kubectl logs nvidia-device-plugin-daemonset-88mbq -n nvidia-gpu-operator -c config-manager

W1231 03:16:19.780299      54 client_config.go:659] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I1231 03:16:19.780784      54 main.go:246] Waiting for change to 'nvidia.com/device-plugin.config' label
I1231 03:16:19.797660      54 main.go:248] Label change detected: nvidia.com/device-plugin.config=
I1231 03:16:19.797884      54 main.go:360] No value set. Selecting default name: default
I1231 03:16:19.797899      54 main.go:304] Updating to config: default
I1231 03:16:19.798065      54 main.go:312] Already configured. Skipping update...
I1231 03:16:19.798078      54 main.go:246] Waiting for change to 'nvidia.com/device-plugin.config' label

I suspect the reason no GPUs are allocatable, as far as the validator is concerned, is that we renamed the device plugin resources to report different GPU models (see the resource patterns in the config above), so the node no longer advertises the default nvidia.com/gpu resource:

$ kubectl describe node xxx
...
Capacity:
  cpu:                56
  ephemeral-storage:  767792904Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             528235656Ki
  nvidia.com/v100:    6
  pods:               110
...
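
If the goal is simply for the validator to pass, one option would be to stop renaming the resources so the node advertises the default name again. A minimal sketch of such a config, using the same fields as the dump above (and assuming reverting the rename is acceptable):

version: v1
flags:
  migStrategy: single
resources:
  gpus:
    - pattern: "*"
      name: nvidia.com/gpu

Whether the validator can instead be configured to look for renamed resources, I'm not sure.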
