@yuanchen8911 yuanchen8911 commented Oct 16, 2024

This PR adds a new MIG example, gpu-test-mig.yaml, to the demo/quickstart folder. The example uses the new apiVersion: resource.k8s.io/v1alpha3.

Validated on an A100 machine with the following MIG configuration.

$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4cf8db2d-06c0-7d70-1a51-e59b25b2c16c)
  MIG 3g.20gb     Device  0: (UUID: MIG-ba67e81d-d10e-5b3f-9bba-1af4b97b4b18)
  MIG 2g.10gb     Device  1: (UUID: MIG-c22ee1da-dd8f-57be-8f0a-d951c67ad3f3)
  MIG 1g.5gb      Device  2: (UUID: MIG-2f18f1a5-2ea8-5c05-a674-aee0e69e22ca)
  MIG 1g.5gb      Device  3: (UUID: MIG-a4bbead5-b0b1-5339-af1b-6239e2e6b4bd)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
  MIG 3g.20gb     Device  0: (UUID: MIG-a4a1fc88-103b-5c5b-ba1c-8c5c617fba7d)
  MIG 2g.10gb     Device  1: (UUID: MIG-6eb2e7a3-4440-562d-98e8-536a814b5ffd)
  MIG 1g.5gb      Device  2: (UUID: MIG-2286c62b-847f-5aaa-85b2-21b147544503)
  MIG 1g.5gb      Device  3: (UUID: MIG-0f47e714-65d5-5e11-b1bb-0d49bdaa5b29)
GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54)
  MIG 3g.20gb     Device  0: (UUID: MIG-8b4bd806-4301-5a29-8bca-a603a52e7192)
  MIG 2g.10gb     Device  1: (UUID: MIG-c0c44878-d0a8-5c7e-9386-df111231427d)
  MIG 1g.5gb      Device  2: (UUID: MIG-ba4a9d35-943c-5eac-b7a9-513b58c39ae0)
  MIG 1g.5gb      Device  3: (UUID: MIG-d1ac450c-f47e-53a9-bbbe-9cb23a589a6c)
GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-662077db-fa3f-0d8f-9502-21ab0ef058a2)
  MIG 3g.20gb     Device  0: (UUID: MIG-7a6eeb45-6baa-5c5e-8366-b58486c7748a)
  MIG 2g.10gb     Device  1: (UUID: MIG-74b6741b-0bca-5cf4-b6b8-364fbd875d7f)
  MIG 1g.5gb      Device  2: (UUID: MIG-cb6813d1-f566-5bd4-ab58-e3f7016e28c1)
  MIG 1g.5gb      Device  3: (UUID: MIG-f7c166a5-e33a-579c-ac74-5fcd27ecd212)
GPU 4: NVIDIA A100-SXM4-40GB (UUID: GPU-ec9d53cc-125d-d4a3-9687-304df8eb4749)
GPU 5: NVIDIA A100-SXM4-40GB (UUID: GPU-3eb87630-93d5-b2b6-b8ff-9b359caf4ee2)
GPU 6: NVIDIA A100-SXM4-40GB (UUID: GPU-8216274a-c05d-def0-af18-c74647300267)
GPU 7: NVIDIA A100-SXM4-40GB (UUID: GPU-b1028956-cfa2-0990-bf4a-5da9abb51763)
$ kubectl get resourceclaim -n gpu-test-mig
NAME                                    STATE                AGE
pod-646f7467bc-6kt6k-mig-ts-gpu-22fh2   allocated,reserved   7m38s
pod-646f7467bc-bb8lp-mig-ts-gpu-jmdrh   allocated,reserved   7m38s
pod-646f7467bc-lf5tg-mig-ts-gpu-kmxgv   allocated,reserved   7m38s
pod-646f7467bc-tt989-mig-ts-gpu-ck85v   allocated,reserved   7m38s

$ kubectl get pods -n gpu-test-mig
NAME                   READY   STATUS    RESTARTS   AGE
pod-646f7467bc-6kt6k   1/1     Running   0          7m45s
pod-646f7467bc-bb8lp   1/1     Running   0          7m45s
pod-646f7467bc-lf5tg   1/1     Running   0          7m45s
pod-646f7467bc-tt989   1/1     Running   0          7m45s

Signed-off-by: Yuan Chen <[email protected]>

Create a new example for MIG

Signed-off-by: Yuan Chen <[email protected]>

Update comment

Signed-off-by: Yuan Chen <[email protected]>
@yuanchen8911 yuanchen8911 requested a review from klueska October 16, 2024 16:26
@yuanchen8911 yuanchen8911 changed the title Add an MIG example Add a MIG example Oct 16, 2024
Comment on lines +25 to +27
constraints:
- requests: []
  matchAttribute: "gpu.nvidia.com/parentUUID"

There is no need for this with just a single MIG device request. This constraint ties the different requests together and ensures that their allocations come from the same underlying GPU.
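For reference, a minimal sketch of a case where the constraint would matter: a claim template with two MIG requests that must land on the same parent GPU. The request names, the `mig.nvidia.com` device class, and the CEL selector on `profile` are assumptions for illustration and are not taken from this PR; check them against the driver before use.

```yaml
# Sketch only: request names, device class, and selector are illustrative assumptions.
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test-mig
  name: mig-devices
spec:
  spec:
    devices:
      requests:
      - name: mig-1g-0   # hypothetical request name
        deviceClassName: mig.nvidia.com
        selectors:
        - cel:
            expression: "device.attributes['gpu.nvidia.com'].profile == '1g.5gb'"
      - name: mig-1g-1   # hypothetical request name
        deviceClassName: mig.nvidia.com
        selectors:
        - cel:
            expression: "device.attributes['gpu.nvidia.com'].profile == '1g.5gb'"
      constraints:
      # An empty requests list applies the constraint to every request in the claim,
      # so both MIG devices must report the same parentUUID.
      - requests: []
        matchAttribute: "gpu.nvidia.com/parentUUID"
```

With only one request there is nothing to match against, which is why the constraint can be dropped from the single-device example.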

args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
resources:
  claims:
  - name: mig-ts-gpu

I wouldn't use the name *-ts-* here unless you put an explicit timeSlicing config on the request.

Comment on lines +46 to +48
resourceClaims:
- name: mig-ts-gpu
  resourceClaimTemplateName: mig-devices

Can we move this below the list of containers?
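A rough sketch of the requested ordering, with the pod-level `resourceClaims` placed after `containers`. The container name, image, and command are placeholders and are not taken from the PR; the claim and template names are the ones under review at this point.

```yaml
# Sketch only: pod template fragment; container name, image, and command are placeholders.
spec:
  containers:
  - name: ctr                # placeholder name
    image: ubuntu:22.04      # placeholder image
    command: ["bash", "-c"]  # placeholder; the PR's actual command is not shown here
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: mig-ts-gpu     # must match an entry under resourceClaims below
  # Pod-level claims listed after the containers, as requested in the review.
  resourceClaims:
  - name: mig-ts-gpu
    resourceClaimTemplateName: mig-devices
```

The ordering has no effect on behavior; it only keeps the container definition ahead of the claim wiring it refers to.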

kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test-mig
  name: mig-devices

Suggested change
-  name: mig-devices
+  name: mig-device
