What is the correct way to enable MIG for the GPU card via GPU operator #1197

Open
okyspace opened this issue Jan 10, 2025 · 3 comments

Comments

@okyspace

Hi, I have an OpenShift cluster with the NVIDIA GPU operator installed.

In this OpenShift cluster, my requirement is to disable the mig-manager and configure the GPU cards in the node with MIG enabled, while the MIG profiles themselves are managed by a 3rd party application. It was advised to run nvidia-smi -i <gpu-id> -mig 1 directly on the node. I tried this and it seems to work, but after the node is rebooted, the NVIDIA pods on the node are stuck in the init stage. From the logs it appears that the necessary labels are not added to the node by the GPU operator, so the validation pods cannot complete their validations and all the other pods remain stuck at init.

Given the above, I suspect the issue might be how we configure MIG, so I am trying to find the correct way to do so.
I am also looking for advice on other possible causes.
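
For context, a rough sketch of what was done on the node and how I am checking the labels afterwards (the GPU index and node name below are placeholders, not my exact values):

  # Enable MIG mode on GPU 0 directly on the node; this takes effect
  # after a GPU reset or a node reboot.
  sudo nvidia-smi -i 0 -mig 1

  # After the reboot, check which MIG-related labels the GPU operator has
  # (or has not) applied to the node -- the validation pods seem to wait on these.
  kubectl get node <nodename> -o yaml | grep 'nvidia.com/mig'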

@shan100github

shan100github commented Jan 11, 2025

Hope the following is helpful:

  1. While installing the gpu-operator through the helm chart, make sure to set the parameter "mig.strategy=single" (a sample helm command is included after these steps).

  2. Modify or create the mig-parted-config ConfigMap in the gpu-operator namespace with content similar to https://github.com/NVIDIA/gpu-operator/blob/main/assets/state-mig-manager/0400_configmap.yaml, for example:

data:
  config.yaml: |
    version: v1
    mig-configs:
      all-disabled:
        - devices: all
          mig-enabled: false
      # A100
      all-1g.5gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.5gb": 7
      all-2g.10gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "2g.10gb": 3

Refer to the manual and create the config profiles based on your GPU model.

  3. Make sure the above ConfigMap is referenced in the mig-manager DaemonSet:
    kubectl edit ds nvidia-mig-manager -n gpu-operator-resources
      volumes:
        - name: mig-parted-config
          configMap:
            name: <configmap name>
            defaultMode: 420

To enable MIG on a node: kubectl label nodes <nodename> nvidia.com/mig.config=all-1g.5gb --overwrite
To disable it: kubectl label nodes <nodename> nvidia.com/mig.config=all-disabled --overwrite
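
For reference, something along these lines for step 1 and for checking the result after relabelling a node (the release name, namespace and chart version are placeholders, adjust them to your setup):

  # Step 1: install the gpu-operator with the single MIG strategy.
  helm install gpu-operator nvidia/gpu-operator \
    -n gpu-operator --create-namespace \
    --set mig.strategy=single

  # After setting nvidia.com/mig.config on a node, the mig-manager reports
  # progress through the nvidia.com/mig.config.state label (pending -> success/failed).
  kubectl get node <nodename> -o yaml | grep 'nvidia.com/mig.config'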

@okyspace

Thanks @shan100github. I was trying to add the following so that I can use this configuration to enable MIG on all the cards without defining any profiles. But this method requires the mig-manager to be enabled, right?

data:
  config.yaml: |
    version: v1
    mig-configs:
      mig-enable-only:
        - devices: all
          mig-enabled: true

For my use case, the exact MIG profiles are to be managed by a 3rd party application, which requires the mig-manager to be disabled. So in this case, this method will not work, right?
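
To illustrate, these are roughly the helm values I intend to use (a sketch only; I believe migManager.enabled is the chart value that controls this):

  # Sketch of the intended helm values: the mig-manager is disabled so that a
  # 3rd party application can manage the MIG profiles on the MIG-enabled cards.
  migManager:
    enabled: false
  mig:
    strategy: single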

@shan100github

@okyspace could you please share the 3rd party application name?
I am not sure how it will work without enabling MIG in the gpu-operator helm chart.
