-
Notifications
You must be signed in to change notification settings - Fork 293
BUG: create duplicate of MPI communicator, to avoid double comm.Free #4875
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hi! Welcome, and thanks for opening this pull request. We have some guidelines for new pull requests, and soon you'll hear back about the results of our tests and continuous integration checks. Thank you for your contribution! |
Thanks for the PR! By any chance, is there a minimal working example that could be used as a test? |
The following minimal code reproduces the issue: #!/usr/bin/env python3
from mpi4py import MPI
import yt
yt.enable_parallelism()
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
print('rank ', rank) This probably needs an HPC environment with MPI. On an Infiniband cluster with OpenMPI we get from each MPI rank at the very end of the program:
|
Spent a bit of time trying to reproduce the failure locally and unsurprisingly I wasn't really able to... closest I got was this very contrived example where I pop out the comm, which triggers the garbage collector and #!/usr/bin/env python3
import yt
from yt.utilities.parallel_tools.parallel_analysis_interface import communication_system
yt.enable_parallelism()
communication_system.communicators.pop() on main this errors, on each rank you get: Exception ignored in: <function Communicator.__del__ at 0x1686a5bd0>
Traceback (most recent call last):
File "/Users/chavlin/src/yt_/yt_dev/yt/yt/utilities/parallel_tools/parallel_analysis_interface.py", line 734, in __del__
self.comm.Free()
File "mpi4py/MPI/Comm.pyx", line 229, in mpi4py.MPI.Comm.Free
mpi4py.MPI.Exception: MPI_ERR_COMM: invalid communicator With the changes in this PR, there is no error as only the copy is freed. |
@reuterk thank you for submitting it. I'm going to go ahead and accept it (even with the outstanding minor style issue @neutrinoceros raised) so that we can get it in. Thank you! |
Hooray! Congratulations on your first merged pull request! We hope we keep seeing you around! 🎆 |
…r, to avoid double comm.Free
…codes
PR Summary
This PR modifies
yt.enable_parallelism()
to create a duplicate of the MPI communicator it uses. That communicator is either passed in via the named argument, or the communicatorMPI.COMM_WORLD
is used internally by default. To create the duplicate, the MPI callcomm.Dup()
is used.Issue description: In the original implementation, a copy of the handle of the original MPI communicator was kept. Internally an instance of
class Communicator()
is created. However, this class has a destructor which explicitly calls the methodFree()
of the communicator. This is clearly not wanted in case the global communicatorMPI.COMM_WORLD
is used. In that casempi4py
should do the cleanup internally. In case a communicator was passed in by the user, the call toFree()
is also not wanted because the user might want to use the communicator elsewhere in the program.By creating a local duplicate of the communicator we sidestep these issues. The cleanup of the original communicator should better be handled by the destructor implemented by
mpi4py
when the communicator object goes out of scope.Background: This PR also fixes a double
free()
memory issue which we observed using OpenMPI (versions 4 and 4.1) on an InfiniBand HPC system, causing any code usingmpi4py
explicitly in combination withyt
to crash at the end of the program. I could trace this back to the issue previously described and can confirm that the doublefree()
vanishes with the proposed fix.PR Checklist