The resource requests and limits are not being applied to the pod as expected. #1145
Comments
Any update on this issue?
@badaldavda8 @IndhumithaR Privileged Pods have direct access to the host's devices: a privileged container can see and use every device node under the host's /dev directory. This effectively bypasses the container's device isolation.
Thanks a lot for your response. The thing is, our pods had the same configuration before. The only change is that the NVIDIA driver is now installed directly on the EC2 instance instead of by the GPU Operator, and that is when we started seeing this difference.
When we install the NVIDIA driver with the GPU Operator, it works as expected: the requests and limits are applied properly. Can we enforce the requests and limits despite running a privileged pod?
The driver installation method itself isn't likely the primary cause of resource limitation differences. The key factor is whether the Pod is running in privileged mode, which fundamentally impacts resource control and isolation.
Direct GPU driver installation requires manual configuration on each machine, suitable for small-scale environments but complex to manage. GPU Operator provides automated, cluster-wide driver management, offering more efficient and consistent resource deployment for large-scale Kubernetes environments.
Privileged pods are granted extensive host capabilities, including direct access to the devices under the host's /dev directory. This broad range of permissions can bypass certain Kubernetes security restrictions, making it difficult to enforce standard resource limits and management mechanisms effectively.
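To make the point above concrete, here is a minimal sketch that scans a pod manifest (already loaded as a Python dict) and flags containers that are both privileged and carry a GPU limit, since those are the ones where the limit may not restrict what the container can see. The function name `privileged_gpu_containers` and the example manifest are hypothetical and for illustration only; the manifest is not the spec from this issue.

```python
from typing import Any

# Resource name advertised by the NVIDIA device plugin.
GPU_RESOURCE = "nvidia.com/gpu"


def privileged_gpu_containers(pod: dict[str, Any]) -> list[str]:
    """Return the names of containers that are privileged and also set a GPU limit."""
    flagged = []
    for container in pod.get("spec", {}).get("containers", []):
        security = container.get("securityContext") or {}
        limits = (container.get("resources") or {}).get("limits") or {}
        if security.get("privileged") is True and GPU_RESOURCE in limits:
            flagged.append(container["name"])
    return flagged


# Hypothetical manifest for illustration; not the pod spec from this issue.
example_pod = {
    "metadata": {"name": "gpu-test"},
    "spec": {
        "containers": [
            {
                "name": "trainer",
                "securityContext": {"privileged": True},
                "resources": {"limits": {GPU_RESOURCE: 5}},
            },
            {
                "name": "sidecar",
                "resources": {"limits": {"cpu": "500m"}},
            },
        ]
    },
}

if __name__ == "__main__":
    print(privileged_gpu_containers(example_pod))
```

A check like this only identifies pods worth a closer look. It does not explain why the two driver installation methods behave differently.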
Context: We wanted to set ecc=0, for which we already had to install the NVIDIA driver on the host EC2 instance. To avoid the time it takes to uninstall that driver and reinstall it through the GPU Operator, we no longer install drivers with the GPU Operator. The strange part is that when we did uninstall and reinstall with the GPU Operator, the pods were allocated GPUs correctly even with privileged mode enabled. Now that the driver is not installed by the GPU Operator, pods in privileged mode get all the GPUs.
The pod was running in privileged mode before as well, with the GPU Operator.
We are using EKS and an Image Builder pipeline to create an AMI with the driver preinstalled, so scalability is handled there. We want to understand what the GPU Operator installation changes so that, even with privileged pods, not all GPUs are allocated, and why installing the driver directly on the AMI causes all GPUs to be allocated to a single pod.
GPU Operator version: v24.6.1
driver.version: 535.154.05
Device plugin version: v0.16.2-ubi8
Kubernetes distribution: EKS
Kubernetes version: v1.27.0
Hi,
We installed the NVIDIA driver directly on our node's base image instead of using the GPU Operator. However, after doing so, the resource requests and limits set for the pods are no longer effective, and all containers within the pods can access all the GPUs.
Sample pod spec
Here I am trying to set the GPU request and limit to 5.
But when I exec into the container and check, I can see all 8 GPUs.
However, we tested running the same pod in a different environment where the same driver version was installed using the GPU operator (instead of directly in the base image), and it worked as expected.
What could be the problem? Is there a way to fix it?
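For anyone trying to reproduce the observation above (a limit of 5 but 8 GPUs visible), the check inside the container can be scripted. This is a minimal sketch that assumes the text being parsed comes from `nvidia-smi -L`, which prints one line per GPU beginning with `GPU <index>:`. The helper names and the sample output are made up for illustration and are not taken from the reporter's cluster.

```python
import re

# Matches lines such as "GPU 0: <model name> (UUID: GPU-...)" printed by `nvidia-smi -L`.
GPU_LINE = re.compile(r"^GPU (\d+):", re.MULTILINE)


def visible_gpu_indices(nvidia_smi_list_output: str) -> list[int]:
    """Extract the GPU indices listed in `nvidia-smi -L` output."""
    return [int(index) for index in GPU_LINE.findall(nvidia_smi_list_output)]


def exceeds_limit(nvidia_smi_list_output: str, gpu_limit: int) -> bool:
    """True when the container sees more GPUs than its pod spec limit allows."""
    return len(visible_gpu_indices(nvidia_smi_list_output)) > gpu_limit


# Made-up sample output with 8 GPUs; model name and UUIDs are placeholders.
sample_output = "\n".join(
    f"GPU {i}: Example GPU (UUID: GPU-00000000-0000-0000-0000-00000000000{i})"
    for i in range(8)
)

if __name__ == "__main__":
    print(visible_gpu_indices(sample_output))
    print(exceeds_limit(sample_output, gpu_limit=5))
```

Inside a real container, the input string could be obtained with `subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True, check=True).stdout` and compared against the limit from the pod spec.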