What is the correct way to enable MIG for the GPU card via GPU operator #1197

Open
okyspace opened this issue Jan 10, 2025 · 3 comments

Comments

@okyspace

Hi, I have an OpenShift cluster with the NVIDIA GPU operator installed.

In this OpenShift cluster, my requirement is to disable the mig-manager and configure the GPU cards in the node with MIG enabled, while the MIG profiles themselves are managed by a 3rd party application. It was advised to run nvidia-smi -i <gpu-id> -mig 1 directly on the node. I tried this and it seems to work, but after the node is rebooted, the NVIDIA pods on the node are stuck in the init stage. From the logs it appears that the necessary labels are not added to the node by the GPU operator, so the validation pods cannot complete their validations and all the other pods remain stuck at init.

Given the above, I suspect the issue might be how we configure MIG, so I am trying to find the correct way to do so.
I am also looking for advice on other possible causes.
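
For context, a rough sketch of what was done on the node and how I am checking the labels afterwards (the GPU index and node name below are placeholders, not my exact values):

  # Enable MIG mode on GPU 0 directly on the node; this takes effect
  # after a GPU reset or a node reboot.
  sudo nvidia-smi -i 0 -mig 1

  # After the reboot, check which MIG-related labels the GPU operator has
  # (or has not) applied to the node -- the validation pods seem to wait on these.
  kubectl get node <nodename> -o yaml | grep 'nvidia.com/mig'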

@shan100github

shan100github commented Jan 11, 2025

Hope the following is helpful:

  1. While installing the gpu-operator through the helm chart, make sure to set the parameter "mig.strategy=single" (a sample helm command is included after these steps).

  2. Modify or create the mig-parted-config ConfigMap in the gpu-operator namespace with content similar to https://github.com/NVIDIA/gpu-operator/blob/main/assets/state-mig-manager/0400_configmap.yaml, for example:

data:
  config.yaml: |
    version: v1
    mig-configs:
      all-disabled:
        - devices: all
          mig-enabled: false
      # A100
      all-1g.5gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.5gb": 7
      all-2g.10gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "2g.10gb": 3

Refer to the manual and create the config profiles based on your GPU model.

  3. Make sure the above ConfigMap is referenced in the mig-manager DaemonSet:
    kubectl edit ds nvidia-mig-manager -n gpu-operator-resources
      volumes:
        - name: mig-parted-config
          configMap:
            name: <configmap name>
            defaultMode: 420

To enable MIG on a node: kubectl label nodes <nodename> nvidia.com/mig.config=all-1g.5gb --overwrite
To disable it: kubectl label nodes <nodename> nvidia.com/mig.config=all-disabled --overwrite
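
For reference, something along these lines for step 1 and for checking the result after relabelling a node (the release name, namespace and chart version are placeholders, adjust them to your setup):

  # Step 1: install the gpu-operator with the single MIG strategy.
  helm install gpu-operator nvidia/gpu-operator \
    -n gpu-operator --create-namespace \
    --set mig.strategy=single

  # After setting nvidia.com/mig.config on a node, the mig-manager reports
  # progress through the nvidia.com/mig.config.state label (pending -> success/failed).
  kubectl get node <nodename> -o yaml | grep 'nvidia.com/mig.config'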

@okyspace

Thanks @shan100github. I was trying to add the following so that I can use this configuration to enable MIG on all the cards without defining any profiles. But this method requires the mig-manager to be enabled, right?

data:
  config.yaml: |
    version: v1
    mig-configs:
      mig-enable-only:
        - devices: all
          mig-enabled: true

For my use case, the exact MIG profiles are to be managed by a 3rd party application, which requires the mig-manager to be disabled. So in this case, this method will not work, right?
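
To illustrate, these are roughly the helm values I intend to use (a sketch only; I believe migManager.enabled is the chart value that controls this):

  # Sketch of the intended helm values: the mig-manager is disabled so that a
  # 3rd party application can manage the MIG profiles on the MIG-enabled cards.
  migManager:
    enabled: false
  mig:
    strategy: single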

@shan100github

@okyspace could you please share the 3rd party application name?
I am not sure how it will work without enabling MIG in the gpu-operator helm chart.
