rocfft: fix callbacks for single-proc multi-GPU transforms #2350
base: develop
Conversation
The std::optional is only present on TreeNodes for the first load from and last store to global memory of the FFT, and is std::nullopt otherwise. This information lets us know which nodes need to run callback functions, even if no other load/store ops need to be run.
These commits can be reviewed independently. Multi-GPU CI on this PR is not expected to pass until #2328 is merged, as that PR modifies the tests to supply the multi-device callbacks correctly. MPI callbacks are probably also fixed by this, but we need some additional work in the test infrastructure to be able to confirm this.

Codecov Report

❌ Your project status has failed because the head coverage (48.85%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

@@             Coverage Diff             @@
##           develop    #2350      +/-   ##
===========================================
+ Coverage    49.04%   50.16%   +1.12%     
===========================================
  Files          124      125       +1     
  Lines        32149    32224      +75     
  Branches      4230     4250      +20     
===========================================
+ Hits         15766    16163     +397     
+ Misses       15237    14897     -340     
- Partials      1146     1164      +18     
Just need to remove the temp file, and should be good.
auto params = param_generator_complex(test_prob,
                                      multi_gpu_sizes,
                                      precision_range_sp_dp,
                                      {4, 1},
Could we keep using multi_gpu_batch_range, ioffset_range_zero and ooffset_range_zero (declared above)? I find that more readable. If not, let's remove them, as they'd become unused.
                                      {{0, 0}},
                                      {fft_placement_inplace, fft_placement_notinplace},
                                      false,
                                      run_callbacks);
Let's add auto_alloc_setting (otherwise unused) to make sure we don't have misleading results the day these instances are re-enabled.
if(load)
{
    if(src_fn)
        callbacks[b.location.device].load_fn = src_fn[src_idx];
    if(src_data)
        callbacks[b.location.device].load_data = src_data[src_idx];
}
else
{
    if(src_fn)
        callbacks[b.location.device].store_fn = src_fn[src_idx];
    if(src_data)
        callbacks[b.location.device].store_data = src_data[src_idx];
}
I understand that bricks of a given field must be assigned to different devices, but I'm not clear on two bricks of two different fields: can those be assigned to the same device, conceptually? If so, aren't we at risk of overwriting the contents of a callbacks[b.location.device] entry at some point above?
// we have at most one load callback
if(exec_info.load_cb_fns)
{
    callbacks.resize(local_device + 1);
Suggestion: defining callbacks as an actual std::map<int, device_callback_t> would avoid holding local_device irrelevant entries in it. It might also help identify possible conflicts, if the other comment above turns out to be relevant.
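A minimal sketch of the suggested shape; device_callback_t here is a hypothetical aggregate inferred from the fields assigned in the snippet above, not the actual rocFFT type:

#include <map>

// Hypothetical aggregate matching the fields assigned in the hunk above.
struct device_callback_t
{
    void* load_fn    = nullptr;
    void* load_data  = nullptr;
    void* store_fn   = nullptr;
    void* store_data = nullptr;
};

// Keyed by device id: no resize(local_device + 1), and no
// default-constructed entries for devices that never get a callback.
std::map<int, device_callback_t> callbacks;

void assign_load(int device, void* fn, void* data)
{
    // operator[] default-constructs the entry on first use; a prior
    // callbacks.count(device) check could flag the overwrite case
    // raised in the earlier comment.
    auto& cb     = callbacks[device];
    cb.load_fn   = fn;
    cb.load_data = data;
}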
const std::optional<LoadOps>  loadOps,
const std::optional<StoreOps> storeOps,
Suggested change:
-    const std::optional<LoadOps>  loadOps,
-    const std::optional<StoreOps> storeOps,
+    const std::optional<LoadOps>&  loadOps,
+    const std::optional<StoreOps>& storeOps,
if it can be done.
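For what it's worth: passing a std::optional by value copies the contained LoadOps/StoreOps whenever the optional is engaged, while a const reference avoids that copy and still binds to temporaries and to std::nullopt at the call site.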
const std::optional<LoadOps>  loadOps,
const std::optional<StoreOps> storeOps,
Suggested change:
-    const std::optional<LoadOps>  loadOps,
-    const std::optional<StoreOps> storeOps,
+    const std::optional<LoadOps>&  loadOps,
+    const std::optional<StoreOps>& storeOps,
(cf. above).
Motivation
Fix load/store callbacks on single-process multi-GPU transforms.
Technical Details
Callbacks were not being attached to the correct TreeNodes during multi-GPU plan execution. Even though callbacks are only known at execute time, we can piggyback on the load/store ops infrastructure, which handles extra operations that are known at plan creation time.
Modified the load/store ops to be a std::optional on plans. If the optional load/store op is present, that means the node is doing the first load from global memory or the last store to global memory for the FFT. The load/store op will be std::nullopt for internal nodes of a multi-kernel FFT, or for communication steps.
At execution time, we can then look at the presence or absence of the load/store op to know whether load/store callbacks should be applied to that node of the transform.
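A minimal sketch of this gating, with illustrative names (Node, UserCallbacks, and apply_callbacks are stand-ins for the actual rocFFT internals):

#include <optional>

// Illustrative stand-ins for the real plan-time op types.
struct LoadOps  { };
struct StoreOps { };

struct UserCallbacks
{
    void* load_fn  = nullptr;
    void* store_fn = nullptr;
};

struct Node
{
    // Engaged only on the node doing the first load from (or last store
    // to) global memory; std::nullopt on internal multi-kernel nodes
    // and on communication steps.
    std::optional<LoadOps>  loadOps;
    std::optional<StoreOps> storeOps;
};

// Presence of the optional, not the callback pointers themselves,
// decides whether this node runs the user's load/store callbacks.
void apply_callbacks(Node& node, const UserCallbacks& cbs)
{
    if(node.loadOps)
    {
        // attach cbs.load_fn to this node's kernel launch
    }
    if(node.storeOps)
    {
        // attach cbs.store_fn to this node's kernel launch
    }
}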
We also need to build up an internal device->callback map, so that we know exactly which callback to invoke for the device a given node executes on. This map can only be constructed once we have both the plan description (which knows the field/brick layout, and specifically the ordering of bricks and the devices they reside on) and the execution info (which holds the callbacks supplied by the user).
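A sketch of how such a map might be assembled, assuming (from the brick loop quoted earlier in the review) that user callbacks arrive as per-brick arrays ordered like the field's bricks; Brick, DeviceCallback, and build_callback_map are illustrative names, not the library's API:

#include <cstddef>
#include <map>
#include <vector>

// Illustrative: a brick records the device it lives on.
struct Brick
{
    int device;
};

struct DeviceCallback
{
    void* fn   = nullptr;
    void* data = nullptr;
};

// Combine plan-time layout (bricks) with execute-time state (fns, data)
// so a node can look up the callback for the device it runs on.
std::map<int, DeviceCallback> build_callback_map(const std::vector<Brick>& bricks,
                                                 void* const*              fns,
                                                 void* const*              data)
{
    std::map<int, DeviceCallback> out;
    for(size_t i = 0; i < bricks.size(); ++i)
    {
        // The i-th callback belongs to the device holding the i-th brick.
        if(fns)
            out[bricks[i].device].fn = fns[i];
        if(data)
            out[bricks[i].device].data = data[i];
    }
    return out;
}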
Test Plan
Enabled callback test cases in rocFFT and hipFFT for single-proc multi-GPU transforms.
Test Result
Tests pass.
Submission Checklist