nvidia-operator-validator Always Init:CrashLoopBackOff, but the rest of the components are installed and working correctly. #1190
Can you provide logs from one of the device plugin pods?

Here are logs from one of the device plugin pods:

$ kubectl logs nvidia-device-plugin-daemonset-88mbq -n nvidia-gpu-operator -c nvidia-device-plugin
IS_HOST_DRIVER=true
NVIDIA_DRIVER_ROOT=/
DRIVER_ROOT_CTR_PATH=/host
NVIDIA_DEV_ROOT=/
DEV_ROOT_CTR_PATH=/host
Starting nvidia-device-plugin
I1231 03:16:20.612133 33 main.go:235] "Starting NVIDIA Device Plugin" version=<
N/A
commit: unknown
>
I1231 03:16:20.612219 33 main.go:238] Starting FS watcher for /var/lib/kubelet/device-plugins
I1231 03:16:20.612789 33 main.go:245] Starting OS watcher.
I1231 03:16:20.613228 33 main.go:260] Starting Plugins.
I1231 03:16:20.613265 33 main.go:317] Loading configuration.
I1231 03:16:20.616013 33 main.go:342] Updating config with default resource matching patterns.
I1231 03:16:20.616138 33 main.go:353]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "single",
"failOnInitError": true,
"mpsRoot": "/run/nvidia/mps",
"nvidiaDriverRoot": "/",
"nvidiaDevRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"useNodeFeatureAPI": null,
"deviceDiscoveryStrategy": "auto",
"plugin": {
"passDeviceSpecs": true,
"deviceListStrategy": [
"volume-mounts"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/host"
}
},
"resources": {
"gpus": [
{
"pattern": "*V100*",
"name": "nvidia.com/v100"
},
{
"pattern": "*A100*",
"name": "nvidia.com/a100"
},
{
"pattern": "*T4*",
"name": "nvidia.com/t4"
},
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
],
"mig": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
},
"imex": {}
}
I1231 03:16:20.616158 33 main.go:356] Retrieving plugins.
I1231 03:16:20.764488 33 server.go:195] Starting GRPC server for 'nvidia.com/v100'
I1231 03:16:20.767158 33 server.go:139] Starting to serve 'nvidia.com/v100' on /var/lib/kubelet/device-plugins/nvidia-v100.sock
I1231 03:16:20.772384 33 server.go:146] Registered device plugin for 'nvidia.com/v100' with Kubelet

$ kubectl logs nvidia-device-plugin-daemonset-88mbq -n nvidia-gpu-operator -c config-manager
W1231 03:16:19.780299 54 client_config.go:659] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I1231 03:16:19.780784 54 main.go:246] Waiting for change to 'nvidia.com/device-plugin.config' label
I1231 03:16:19.797660 54 main.go:248] Label change detected: nvidia.com/device-plugin.config=
I1231 03:16:19.797884 54 main.go:360] No value set. Selecting default name: default
I1231 03:16:19.797899 54 main.go:304] Updating to config: default
I1231 03:16:19.798065 54 main.go:312] Already configured. Skipping update...
I1231 03:16:19.798078 54 main.go:246] Waiting for change to 'nvidia.com/device-plugin.config' label

I guess the reason no GPUs are allocatable under the default nvidia.com/gpu resource is that we've renamed the devices to report different GPU models (link).

$ kubectl describe node xxx
...
Capacity:
cpu: 56
ephemeral-storage: 767792904Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 528235656Ki
nvidia.com/v100: 6
pods: 110
...
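A quick way to confirm this (a minimal sketch; xxx is the placeholder node name used above) is to dump the node's allocatable resources and check that the GPUs show up only under the renamed resource:

# Illustrative check: the renamed resource nvidia.com/v100 should be listed,
# while the default nvidia.com/gpu, which the validator workload typically
# requests, is absent.
$ kubectl get node xxx -o jsonpath='{.status.allocatable}'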
HOST INFORMATION
Steps to reproduce the issue
All components except nvidia-operator-validator are in a normal state and functioning properly. I can use DCGM normally and assign GPUs to pods without any issues. However, nvidia-operator-validator is stuck in Init:CrashLoopBackOff.
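To narrow down the failure, a hedged starting point is to find which init container is crashing and read its logs from the last restart. The label and init container names below (driver-validation, toolkit-validation, cuda-validation, plugin-validation) are typical for the GPU operator but may differ by version:

# Find the validator pod (label is an assumption; adjust if needed).
$ kubectl get pods -n nvidia-gpu-operator -l app=nvidia-operator-validator
# See which init container is in CrashLoopBackOff.
$ kubectl describe pod <validator-pod-name> -n nvidia-gpu-operator
# Read the failing init container's logs from its previous run.
$ kubectl logs <validator-pod-name> -n nvidia-gpu-operator -c plugin-validation --previous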