Operator fails to render a valid daemonset on OCP when using 64K kernel page size. #1207

Open
mvazquezc opened this issue Jan 16, 2025 · 2 comments · May be fixed by #1210
Comments

@mvazquezc

The relevant code for the version we are using (v24.6.1):

https://github.com/NVIDIA/gpu-operator/blob/release-24.6/controllers/object_controls.go#L2809

The function above renders the "sanitized" string as 5.14.0-427.37.1.el9.4..64k-rhcos4.16, which is not a valid name: the two consecutive dots leave an empty DNS label between them.

The resulting DaemonSet name is nvidia-driver-daemonset-5.14.0-427.37.1.el9.4..64k-rhcos4.16, and the API server rejects it when the controller tries to create it:

{"level":"info","ts":"2025-01-16T09:45:22Z","logger":"controllers.ClusterPolicy","msg":"DaemonSet not found, creating","DaemonSet":"nvidia-driver-daemonset","Namespace":"nvidia-gpu-operator","Name":"nvidia-driver-daemonset-5.14.0-427.37.1.el9.4..64k-rhcos4.16"}
.
.
.
{"level":"info","ts":"2025-01-16T09:45:00Z","logger":"controllers.ClusterPolicy","msg":"Couldn't create DaemonSet","DaemonSet":"nvidia-driver-daemonset","Namespace":"nvidia-gpu-operator","Name":"nvidia-driver-daemonset-5.14.0-427.37.1.el9.4..64k-rhcos4.16","Error":"DaemonSet.apps \"nvidia-driver-daemonset-5.14.0-427.37.1.el9.4..64k-rhcos4.16\" is invalid: metadata.name: Invalid value: \"nvidia-driver-daemonset-5.14.0-427.37.1.el9.4..64k-rhcos4.16\": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')"}
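
For anyone who wants to reproduce the rejection without a cluster, here is a minimal sketch that checks a name against the regex quoted in the error message above. The helper name `isValidSubdomain` is made up for illustration and is not part of the operator or of Kubernetes. The 253-character limit is the usual maximum for DNS subdomain names. The second name in `main` is just the same name with the double dot collapsed, for comparison.

```go
package main

import (
	"fmt"
	"regexp"
)

// Pattern copied from the API server error message, anchored at both ends.
var rfc1123Subdomain = regexp.MustCompile(
	`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)

// isValidSubdomain is a hypothetical helper: it reports whether name looks
// like a lowercase RFC 1123 subdomain (at most 253 characters, no empty labels).
func isValidSubdomain(name string) bool {
	return len(name) <= 253 && rfc1123Subdomain.MatchString(name)
}

func main() {
	names := []string{
		// Name from the log above: contains "..", i.e. an empty label.
		"nvidia-driver-daemonset-5.14.0-427.37.1.el9.4..64k-rhcos4.16",
		// Same name with the double dot collapsed (illustrative only).
		"nvidia-driver-daemonset-5.14.0-427.37.1.el9.4.64k-rhcos4.16",
	}
	for _, n := range names {
		fmt.Printf("%s valid=%t\n", n, isValidSubdomain(n))
	}
}
```

The first name fails because every dot must be followed by a label that starts with an alphanumeric character, and the second dot in `..` is followed by nothing.
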
@mvazquezc linked a pull request Jan 17, 2025 that will close this issue
@tariq1890
Contributor

Hey @mvazquezc, thanks for raising this issue and for the PR to fix it. For my reference, can you share the value of the following node label on the node running the 64k kernel?

feature.node.kubernetes.io/kernel-version.full

@mvazquezc
Author

@tariq1890:

feature.node.kubernetes.io/kernel-version.full: 5.14.0-427.37.1.el9_4.aarch64_64k
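
To show how that label value could end up as the string with the double dot, here is a rough sketch. It assumes the sanitizer strips the architecture string, replaces underscores with dots, and trims a trailing dot. That is an assumption that happens to reproduce the reported output, not a copy of the operator's code. `sanitizeNaive` and `sanitizeCollapsed` are hypothetical names, and collapsing runs of dots is only one possible workaround, not necessarily what the linked PR does.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	archRe  = regexp.MustCompile(`x86_64|aarch64`) // architecture strings to drop
	dotRuns = regexp.MustCompile(`\.{2,}`)         // two or more consecutive dots
)

// sanitizeNaive is an assumed reconstruction of the behavior described in
// this issue: drop the arch, turn "_" into ".", trim one trailing ".".
func sanitizeNaive(kernel string) string {
	s := archRe.ReplaceAllString(kernel, "")
	s = strings.ReplaceAll(s, "_", ".")
	return strings.ToLower(strings.TrimSuffix(s, "."))
}

// sanitizeCollapsed additionally collapses runs of dots into a single dot
// and trims leading/trailing dots, so no empty DNS label remains.
func sanitizeCollapsed(kernel string) string {
	s := dotRuns.ReplaceAllString(sanitizeNaive(kernel), ".")
	return strings.Trim(s, ".")
}

func main() {
	label := "5.14.0-427.37.1.el9_4.aarch64_64k" // value reported above
	fmt.Println(sanitizeNaive(label))
	fmt.Println(sanitizeCollapsed(label))
}
```

With the 64k kernel, removing `aarch64` from `el9_4.aarch64_64k` leaves `el9_4._64k`, and the underscore replacement then turns `._` into `..`. On a kernel without the `_64k` suffix the arch string sits at the end, so only a trailing dot is left over and trimming it is enough.
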
