
nvidia-operator-validator always in Init:CrashLoopBackOff, but the rest of the components are installed and working correctly #1190

spiner-z opened this issue Dec 31, 2024 · 2 comments

spiner-z commented Dec 31, 2024

HOST INFORMATION

  1. OS and Architecture: Ubuntu 22.04, x86_64
  2. Kubernetes Distribution: Vanilla Kubernetes
  3. Kubernetes Version: v1.31.2
  4. Host Node GPUs: NVIDIA V100, A100
  5. GPU Operator Installation Method: Helm

Steps to reproduce the issue

$ kubectl get pods -n nvidia-gpu-operator

gpu-feature-discovery-8g4pc                           2/2     Running                     
gpu-feature-discovery-j8797                           2/2     Running                     
gpu-feature-discovery-st644                           2/2     Running                     
nvdp-node-feature-discovery-worker-96gzj              1/1     Running                     
nvdp-node-feature-discovery-worker-xxl65              1/1     Running                     
nvdp-node-feature-discovery-worker-zt882              1/1     Running                     
nvidia-container-toolkit-daemonset-5vlk2              1/1     Running                     
nvidia-container-toolkit-daemonset-6chcr              1/1     Running                     
nvidia-container-toolkit-daemonset-rgdxz              1/1     Running                     
nvidia-cuda-validator-6hbzq                           0/1     Completed                   
nvidia-cuda-validator-b6thh                           0/1     Completed                   
nvidia-cuda-validator-wls5c                           0/1     Completed                   
nvidia-dcgm-exporter-589kn                            1/1     Running                     
nvidia-dcgm-exporter-hr66q                            1/1     Running                     
nvidia-dcgm-exporter-phrrd                            1/1     Running                     
nvidia-device-plugin-daemonset-88mbq                  2/2     Running                     
nvidia-device-plugin-daemonset-fm5dn                  2/2     Running                     
nvidia-device-plugin-daemonset-nz2st                  2/2     Running                     
nvidia-operator-validator-s8tfk                       0/1     Init:CrashLoopBackOff       
nvidia-operator-validator-vp6nk                       0/1     Init:CrashLoopBackOff       
nvidia-operator-validator-xdvt4                       0/1     Init:CrashLoopBackOff       

All components except nvidia-operator-validator are healthy: DCGM works normally and I can assign GPUs to pods without any issues.

However, nvidia-operator-validator is stuck in Init:CrashLoopBackOff.
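
For reference, which init container is failing (and its last recorded state) can be read straight from the pod status, e.g. with something like the following (pod name taken from the list above):

$ kubectl get pod nvidia-operator-validator-s8tfk -n nvidia-gpu-operator \
    -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'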

$ kubectl logs nvidia-operator-validator-27lns -n nvidia-gpu-operator -c plugin-validation

time="2024-12-31T06:33:02Z" level=info msg="version: 65c864c1, commit: 65c864c"
time="2024-12-31T06:33:02Z" level=info msg="GPU resources are not yet discovered by the node, retry: 1"
time="2024-12-31T06:33:07Z" level=info msg="GPU resources are not yet discovered by the node, retry: 2"
time="2024-12-31T06:33:12Z" level=info msg="GPU resources are not yet discovered by the node, retry: 3"
time="2024-12-31T06:33:17Z" level=info msg="GPU resources are not yet discovered by the node, retry: 4"
time="2024-12-31T06:33:22Z" level=info msg="GPU resources are not yet discovered by the node, retry: 5"
time="2024-12-31T06:33:27Z" level=info msg="GPU resources are not yet discovered by the node, retry: 6"
time="2024-12-31T06:33:32Z" level=info msg="GPU resources are not yet discovered by the node, retry: 7"
time="2024-12-31T06:33:37Z" level=info msg="GPU resources are not yet discovered by the node, retry: 8"
time="2024-12-31T06:33:42Z" level=info msg="GPU resources are not yet discovered by the node, retry: 9"
time="2024-12-31T06:33:47Z" level=info msg="GPU resources are not yet discovered by the node, retry: 10"
time="2024-12-31T06:33:52Z" level=info msg="GPU resources are not yet discovered by the node, retry: 11"
time="2024-12-31T06:33:57Z" level=info msg="GPU resources are not yet discovered by the node, retry: 12"
time="2024-12-31T06:34:02Z" level=info msg="GPU resources are not yet discovered by the node, retry: 13"
time="2024-12-31T06:34:07Z" level=info msg="GPU resources are not yet discovered by the node, retry: 14"
time="2024-12-31T06:34:12Z" level=info msg="GPU resources are not yet discovered by the node, retry: 15"
time="2024-12-31T06:34:17Z" level=info msg="GPU resources are not yet discovered by the node, retry: 16"
time="2024-12-31T06:34:22Z" level=info msg="GPU resources are not yet discovered by the node, retry: 17"
time="2024-12-31T06:34:27Z" level=info msg="GPU resources are not yet discovered by the node, retry: 18"
time="2024-12-31T06:34:32Z" level=info msg="GPU resources are not yet discovered by the node, retry: 19"
time="2024-12-31T06:34:37Z" level=info msg="GPU resources are not yet discovered by the node, retry: 20"
time="2024-12-31T06:34:42Z" level=info msg="GPU resources are not yet discovered by the node, retry: 21"
time="2024-12-31T06:34:47Z" level=info msg="GPU resources are not yet discovered by the node, retry: 22"
time="2024-12-31T06:34:52Z" level=info msg="GPU resources are not yet discovered by the node, retry: 23"
time="2024-12-31T06:34:57Z" level=info msg="GPU resources are not yet discovered by the node, retry: 24"
time="2024-12-31T06:35:02Z" level=info msg="GPU resources are not yet discovered by the node, retry: 25"
time="2024-12-31T06:35:07Z" level=info msg="GPU resources are not yet discovered by the node, retry: 26"
time="2024-12-31T06:35:12Z" level=info msg="GPU resources are not yet discovered by the node, retry: 27"
time="2024-12-31T06:35:17Z" level=info msg="GPU resources are not yet discovered by the node, retry: 28"
time="2024-12-31T06:35:22Z" level=info msg="GPU resources are not yet discovered by the node, retry: 29"
time="2024-12-31T06:35:27Z" level=info msg="GPU resources are not yet discovered by the node, retry: 30"
time="2024-12-31T06:35:32Z" level=info msg="Error: error validating plugin installation: GPU resources are not discovered by the node"

cdesiniotis (Contributor) commented

Can you provide logs from one of the nvidia-device-plugin pods? The plugin-validation logs suggest that no GPUs are allocatable.
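
For reference, what the node actually advertises can be checked directly, e.g. (node name is a placeholder):

$ kubectl get node <node-name> -o jsonpath='{.status.allocatable}'

Presumably the plugin validator is waiting for an allocatable nvidia.com/gpu resource to show up there.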

spiner-z (Author) commented Jan 3, 2025

Here are logs from one of the nvidia-device-plugin pods:

$ kubectl logs nvidia-device-plugin-daemonset-88mbq -n nvidia-gpu-operator -c nvidia-device-plugin

IS_HOST_DRIVER=true
NVIDIA_DRIVER_ROOT=/
DRIVER_ROOT_CTR_PATH=/host
NVIDIA_DEV_ROOT=/
DEV_ROOT_CTR_PATH=/host
Starting nvidia-device-plugin
I1231 03:16:20.612133      33 main.go:235] "Starting NVIDIA Device Plugin" version=<
	N/A
	commit: unknown
 >
I1231 03:16:20.612219      33 main.go:238] Starting FS watcher for /var/lib/kubelet/device-plugins
I1231 03:16:20.612789      33 main.go:245] Starting OS watcher.
I1231 03:16:20.613228      33 main.go:260] Starting Plugins.
I1231 03:16:20.613265      33 main.go:317] Loading configuration.
I1231 03:16:20.616013      33 main.go:342] Updating config with default resource matching patterns.
I1231 03:16:20.616138      33 main.go:353] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "single",
    "failOnInitError": true,
    "mpsRoot": "/run/nvidia/mps",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": true,
      "deviceListStrategy": [
        "volume-mounts"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/host"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*V100*",
        "name": "nvidia.com/v100"
      },
      {
        "pattern": "*A100*",
        "name": "nvidia.com/a100"
      },
      {
        "pattern": "*T4*",
        "name": "nvidia.com/t4"
      },
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ],
    "mig": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  },
  "imex": {}
}
I1231 03:16:20.616158      33 main.go:356] Retrieving plugins.
I1231 03:16:20.764488      33 server.go:195] Starting GRPC server for 'nvidia.com/v100'
I1231 03:16:20.767158      33 server.go:139] Starting to serve 'nvidia.com/v100' on /var/lib/kubelet/device-plugins/nvidia-v100.sock
I1231 03:16:20.772384      33 server.go:146] Registered device plugin for 'nvidia.com/v100' with Kubelet

$ kubectl logs nvidia-device-plugin-daemonset-88mbq -n nvidia-gpu-operator -c config-manager

W1231 03:16:19.780299      54 client_config.go:659] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I1231 03:16:19.780784      54 main.go:246] Waiting for change to 'nvidia.com/device-plugin.config' label
I1231 03:16:19.797660      54 main.go:248] Label change detected: nvidia.com/device-plugin.config=
I1231 03:16:19.797884      54 main.go:360] No value set. Selecting default name: default
I1231 03:16:19.797899      54 main.go:304] Updating to config: default
I1231 03:16:19.798065      54 main.go:312] Already configured. Skipping update...
I1231 03:16:19.798078      54 main.go:246] Waiting for change to 'nvidia.com/device-plugin.config' label

I suspect the reason no GPUs are allocatable, as far as the validator is concerned, is that we renamed the device plugin resources to report different GPU models (see the resource patterns in the config above), so the node no longer advertises the default nvidia.com/gpu resource:

$ kubectl describe node xxx
...
Capacity:
  cpu:                56
  ephemeral-storage:  767792904Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             528235656Ki
  nvidia.com/v100:    6
  pods:               110
...
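
If the goal is simply for the validator to pass, one option would be to stop renaming the resources so the node advertises the default name again. A minimal sketch of such a config, using the same fields as the dump above (and assuming reverting the rename is acceptable):

version: v1
flags:
  migStrategy: single
resources:
  gpus:
    - pattern: "*"
      name: nvidia.com/gpu

Whether the validator can instead be configured to look for renamed resources, I'm not sure.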
