[FEAT] Add stateful actor context and set CUDA_VISIBLE_DEVICES #3002

kevinzwang · 2024-10-05T03:00:33Z

Resolves #2896

Some details about this PR:

I moved the actor-local singleton out of PyActorPool into a specialized class, PyStatefulActorSingleton.
I changed GPU resources to be accounted for at a per-device level. This led to a new data class, AcquiredResources, which stores the resources used by a task or actor. The runner's resources now include not only the amount of CPU and memory, but also the exact GPUs that each task or actor is using, which enables setting CUDA_VISIBLE_DEVICES in actors.
I also moved the resource acquisition and release logic into a PyRunnerResources class.
I added validation that GPU requests must be integers if greater than 1. This means it is no longer accurate to request actor_resource_requests * num_workers in a single request, so the actor pool context now requests resources for each actor individually.
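
To illustrate the per-device accounting described above, here is a minimal sketch. It is not the code from this PR: the class names AcquiredResourcesSketch and RunnerResourcesSketch, their fields, and the helper functions are all hypothetical, and the real AcquiredResources and PyRunnerResources classes may be structured differently. The sketch assumes each GPU has a capacity of 1.0 and that a fractional request must fit on a single device.

```python
from __future__ import annotations

import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class AcquiredResourcesSketch:
    """Hypothetical record of what a single task or actor currently holds."""

    num_cpus: float
    gpus: dict[str, float]  # device id -> share of that device in use
    memory_bytes: int


def validate_num_gpus(num_gpus: float) -> None:
    """Above one GPU, only whole devices can be requested (1.5 is rejected)."""
    if num_gpus < 0:
        raise ValueError("num_gpus must be non-negative")
    if num_gpus > 1 and num_gpus != int(num_gpus):
        raise ValueError("num_gpus greater than 1 must be an integer")


class RunnerResourcesSketch:
    """Hypothetical pool that tracks free capacity per GPU device."""

    def __init__(self, num_cpus: float, gpu_ids: list[str], memory_bytes: int):
        self._lock = threading.Lock()
        self._cpus = float(num_cpus)
        self._memory = memory_bytes
        self._gpus = {device: 1.0 for device in gpu_ids}

    def try_acquire(self, num_cpus: float, num_gpus: float, memory_bytes: int):
        """Return the acquired resources, or None if the request cannot be met."""
        validate_num_gpus(num_gpus)
        with self._lock:
            if num_cpus > self._cpus or memory_bytes > self._memory:
                return None
            chosen: dict[str, float] = {}
            if num_gpus >= 1:
                # Whole-device request: only completely free devices qualify.
                free = [d for d, avail in self._gpus.items() if avail == 1.0]
                if len(free) < int(num_gpus):
                    return None
                chosen = {d: 1.0 for d in free[: int(num_gpus)]}
            elif num_gpus > 0:
                # Fractional request: take the first device with enough room.
                for device, avail in self._gpus.items():
                    if avail >= num_gpus:
                        chosen = {device: num_gpus}
                        break
                else:
                    return None
            # Nothing is deducted until every part of the request is satisfiable.
            self._cpus -= num_cpus
            self._memory -= memory_bytes
            for device, amount in chosen.items():
                self._gpus[device] -= amount
            return AcquiredResourcesSketch(num_cpus, chosen, memory_bytes)

    def release(self, acquired: AcquiredResourcesSketch) -> None:
        with self._lock:
            self._cpus += acquired.num_cpus
            self._memory += acquired.memory_bytes
            for device, amount in acquired.gpus.items():
                self._gpus[device] += amount


def cuda_visible_devices(acquired: AcquiredResourcesSketch) -> str:
    """Value an actor process could export as CUDA_VISIBLE_DEVICES."""
    return ",".join(acquired.gpus)
```

Under these assumptions, the pool would call try_acquire once per actor rather than once for the whole pool. For example, three actors at 0.5 GPU each form three valid requests, whereas a single combined request for 1.5 GPUs would fail validation.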

codspeed-hq · 2024-10-05T03:13:11Z

CodSpeed Performance Report

Merging #3002 will not alter performance

Comparing kevin/stateful-udf-rank (0a30f41) with main (e4c6f3f)

Summary

✅ 17 untouched benchmarks

daft/context.py

daft/internal/gpu.py

daft/runners/pyrunner.py

jaychia · 2024-10-07T07:21:54Z

Took a quick first pass, but I'll probably need to give it a much more thorough review again given how much logic is changing in the PyRunner

codecov · 2024-10-07T23:38:02Z

Codecov Report

Attention: Patch coverage is 80.55556% with 35 lines in your changes missing coverage. Please review.

Project coverage is 78.34%. Comparing base (272163f) to head (0a30f41).
Report is 30 commits behind head on main.

Files with missing lines	Patch %	Lines
src/common/resource-request/src/lib.rs	70.58%	15 Missing ⚠️
daft/runners/pyrunner.py	89.71%	11 Missing ⚠️
daft/internal/gpu.py	57.14%	9 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3002      +/-   ##
==========================================
- Coverage   78.39%   78.34%   -0.05%     
==========================================
  Files         603      611       +8     
  Lines       71443    72515    +1072     
==========================================
+ Hits        56005    56813     +808     
- Misses      15438    15702     +264

Files with missing lines	Coverage Δ
daft/dependencies.py	`58.33% <ø> (+0.64%)`	⬆️
...al_optimization/rules/split_actor_pool_projects.rs	`95.00% <100.00%> (-0.07%)`	⬇️
daft/internal/gpu.py	`56.00% <57.14%> (-5.54%)`	⬇️
daft/runners/pyrunner.py	`85.87% <89.71%> (+0.61%)`	⬆️
src/common/resource-request/src/lib.rs	`65.06% <70.58%> (-2.03%)`	⬇️

... and 106 files with indirect coverage changes

daft/context.py

daft/internal/gpu.py

daft/runners/ray_runner.py

src/common/resource-request/src/lib.rs

daft/runners/pyrunner.py

tests/actor_pool/test_actor_context.py

daft/runners/pyrunner.py

jaychia

Still not 100% sure about the PR yet; I saw quite a few gaps on this pass. Let's find some time to do a live review in person?

tests/actor_pool/test_actor_context.py

daft/runners/pyrunner.py

daft/context.py

daft/internal/gpu.py

daft/runners/pyrunner.py

jaychia · 2024-10-17T22:53:55Z

daft/runners/pyrunner.py


-                            future.add_done_callback(lambda _: self._release_resources(resource_request))
+                            future.add_done_callback(create_resource_release_callback(resources))


I don't quite understand why this needs to be an inline function.

Could we not do:

def _release_resources_callback(self, resources):
    return lambda _: self._resources.release(resources)

...

future.add_done_callback(self._release_resources_callback(resources))
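
The method-based variant suggested above can be sketched as a small runnable example. The thread does not say why a callback factory was introduced in the first place. One common motivation, assumed here rather than confirmed, is that a factory binds `resources` at creation time, whereas a lambda written inline in a loop reads the loop variable only when the callback fires. The classes ResourcePoolSketch and RunnerSketch are hypothetical stand-ins, not Daft's PyRunner.

```python
from concurrent.futures import ThreadPoolExecutor


class ResourcePoolSketch:
    """Hypothetical stand-in that only records what has been released."""

    def __init__(self):
        self.released = []

    def release(self, resources):
        self.released.append(resources)


class RunnerSketch:
    """Hypothetical runner showing the callback-factory method."""

    def __init__(self):
        self._resources = ResourcePoolSketch()

    def _release_resources_callback(self, resources):
        # `resources` is a parameter of this call, so each returned lambda
        # keeps its own value even if the caller's variable changes later.
        return lambda _future: self._resources.release(resources)

    def run(self, batches):
        with ThreadPoolExecutor(max_workers=2) as pool:
            for resources in batches:
                future = pool.submit(lambda: None)
                future.add_done_callback(self._release_resources_callback(resources))
        # Leaving the `with` block waits for all work items, and their
        # done-callbacks, to finish.
        return sorted(self._resources.released)
```

Calling `RunnerSketch().run(["a", "b", "c"])` releases each of the three resource sets exactly once, regardless of the order in which the futures complete.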

kevinzwang added 3 commits October 4, 2024 18:06

[FEAT] Add rank and CUDA_VISIBLE_DEVICES to stateful UDFs

2cd49d5

add rank

4d24cac

change wording

798119e

github-actions bot added the enhancement New feature or request label Oct 5, 2024

kevinzwang added 2 commits October 4, 2024 20:47

add tests

cf871d9

fix tests

2e7f0b2

kevinzwang requested a review from jaychia October 5, 2024 04:21

kevinzwang marked this pull request as ready for review October 5, 2024 04:26

kevinzwang added 4 commits October 4, 2024 21:32

fix other tests

df2507f

add resource request num_gpu check

127807e

fix test

6db6229

fix resource freeing when unable to acquire

821b8e1

jaychia reviewed Oct 7, 2024

jaychia mentioned this pull request Oct 7, 2024

[FEAT] Assign CUDA_VISIBLE_DEVICES to actor pools in PyRunner #2882

Closed

kevinzwang added 3 commits October 7, 2024 14:42

fix done callback

4d86eda

Merge branch 'main' into kevin/stateful-udf-rank

25a077c

reduce test actor concurrency to less than CI machine cores lol

65e2ca1

kevinzwang requested a review from jaychia October 7, 2024 23:20

jaychia reviewed Oct 8, 2024

kevinzwang added 3 commits October 8, 2024 16:41

partial review changes

81aa3cc

add tests and other review fixes

da17746

fix sql tests

9a89ba8

kevinzwang requested a review from jaychia October 9, 2024 19:47

make test cleaner

2ae44c1

jaychia reviewed Oct 17, 2024

kevinzwang commented Oct 17, 2024

some review changes

8943ef1

kevinzwang added 4 commits October 17, 2024 14:35

fix test

3c69ea1

undo ray runner change

ded9d6f

fix tests again

5026714

small fixes

d1519a4

kevinzwang requested a review from jaychia October 17, 2024 22:47

jaychia approved these changes Oct 17, 2024

kevinzwang added 2 commits October 17, 2024 15:55

add tests

a356755

move callback creator to method

0a30f41

kevinzwang enabled auto-merge (squash) October 18, 2024 21:22

kevinzwang merged commit 5795adc into main Oct 18, 2024
40 checks passed

kevinzwang deleted the kevin/stateful-udf-rank branch October 18, 2024 21:38

