reuterk
Contributor

@reuterk reuterk commented Apr 12, 2024

…codes

PR Summary

This PR modifies yt.enable_parallelism() so that it duplicates the MPI communicator it uses. That communicator is either passed in by the caller via the named argument or, by default, MPI.COMM_WORLD. The duplicate is created with the MPI call comm.Dup().

Issue description: The original implementation kept a copy of the handle of the original MPI communicator and wrapped it in an instance of the class Communicator. That class has a destructor that explicitly calls the communicator's Free() method. This is clearly unwanted when the global communicator MPI.COMM_WORLD is used; in that case mpi4py should do the cleanup internally. It is also unwanted when the user passed in a communicator, because the user might want to use that communicator elsewhere in the program.
Creating a local duplicate of the communicator sidesteps both issues. Cleanup of the original communicator is better left to the destructor implemented by mpi4py, which runs when the communicator object goes out of scope.
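
The ownership problem can be sketched with a pure-Python stand-in, so it runs without an MPI installation. This is an illustration under assumptions, not the yt code: `FakeComm`, `Wrapper`, `run`, and the `duplicate` flag are hypothetical names, and only the method names `Dup()` and `Free()` mirror the mpi4py communicator API.

```python
class FakeComm:
    """Stand-in for an MPI communicator handle (hypothetical, not mpi4py)."""

    def __init__(self, name):
        self.name = name
        self.freed = False

    def Dup(self):
        # A duplicate is an independent handle with its own lifetime.
        return FakeComm(self.name + "-dup")

    def Free(self):
        # Freeing the same handle twice is an error, as in MPI.
        if self.freed:
            raise RuntimeError(f"double free of {self.name}")
        self.freed = True


class Wrapper:
    """Hypothetical wrapper that frees whatever handle it holds."""

    def __init__(self, comm, duplicate):
        # With duplicate=True the wrapper owns a private copy.
        self.comm = comm.Dup() if duplicate else comm

    def close(self):
        self.comm.Free()


def run(duplicate):
    user_comm = FakeComm("user")
    wrapper = Wrapper(user_comm, duplicate)
    wrapper.close()           # the wrapper cleans up the handle it holds
    try:
        user_comm.Free()      # the real owner cleans up the original later
        return "ok"
    except RuntimeError as exc:
        return str(exc)


print(run(duplicate=False))   # double free of user
print(run(duplicate=True))    # ok
```

Without the duplicate, both the wrapper and the owner free the same handle. With it, each side frees only what it owns.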

Background: This PR also fixes a double free() memory error that we observed with OpenMPI (versions 4 and 4.1) on an InfiniBand HPC system. It caused any code that uses mpi4py explicitly together with yt to crash at the end of the program. I traced the crash back to the issue described above and can confirm that the double free() vanishes with the proposed fix.

PR Checklist

  • New features are documented, with docstrings and narrative docs
  • Adds a test for any bugs fixed. Adds tests for new features.


welcome bot commented Apr 12, 2024

Hi! Welcome, and thanks for opening this pull request. We have some guidelines for new pull requests, and soon you'll hear back about the results of our tests and continuous integration checks. Thank you for your contribution!

@reuterk reuterk changed the title from "create duplicate of MPI communicator, to avoid double comm.Free" to "BUG: create duplicate of MPI communicator, to avoid double comm.Free" Apr 12, 2024
@cphyc cphyc added bug backport-stable on-merge: backport to stable labels Apr 12, 2024
@cphyc
Member

cphyc commented Apr 12, 2024

Thanks for the PR! By any chance, is there a minimal working example that could be used as a test?

@reuterk
Contributor Author

reuterk commented Apr 12, 2024

The following minimal code reproduces the issue:

#!/usr/bin/env python3
from mpi4py import MPI
import yt

yt.enable_parallelism()

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
print('rank ', rank)

This probably needs an HPC environment with MPI. On an InfiniBand cluster with OpenMPI, each MPI rank prints the following at the very end of the program:

free(): invalid pointer
[node3140:59252] *** Process received signal ***
[node3140:59252] Signal: Aborted (6)
[node3140:59252] Signal code:  (-6)
[node3140:59252] [ 0] /lib64/libpthread.so.0(+0x16910)[0x14667ec82910]
[node3140:59252] [ 1] /lib64/libc.so.6(gsignal+0x10d)[0x14667e968d2b]
[node3140:59252] [ 2] /lib64/libc.so.6(abort+0x177)[0x14667e96a3e5]
[node3140:59252] [ 3] /lib64/libc.so.6(+0x90c27)[0x14667e9aec27]
[node3140:59252] [ 4] /lib64/libc.so.6(+0x98cca)[0x14667e9b6cca]
[node3140:59252] [ 5] /lib64/libc.so.6(+0x9a774)[0x14667e9b8774]
[node3140:59252] [ 6] /soft/skylake/openmpi/gcc_11-11.2.0/4.1.6/lib/libmpi.so.40(ompi_comm_free+0x1e4)[0x14666a435664]
[node3140:59252] [ 7] /soft/skylake/openmpi/gcc_11-11.2.0/4.1.6/lib/libmpi.so.40(PMPI_Comm_free+0x16)[0x14666a464706]
[node3140:59252] [ 8] /soft/skylake/mpi4py/gcc_11-11.2.0-anaconda_3_2023.03-2023.03-openmpi_4.1-4.1.6/3.1.5/lib/python3.10/site-packages/mpi4py/MPI.cpython-310-x86_64-linux-gnu.so(+0x9d861)[0x14666a89d861]
[node3140:59252] [ 9] /soft/x86_64/anaconda/3/2023.03/bin/python3[0x500416]

@chrishavlin
Contributor

Spent a bit of time trying to reproduce the failure locally and, unsurprisingly, I wasn't really able to. The closest I got was this very contrived example where I pop out the comm, which triggers the garbage collector and the free() call:

#!/usr/bin/env python3
import yt
from yt.utilities.parallel_tools.parallel_analysis_interface import communication_system 

yt.enable_parallelism()
communication_system.communicators.pop()

On main this errors; on each rank you get:

Exception ignored in: <function Communicator.__del__ at 0x1686a5bd0>
Traceback (most recent call last):
  File "/Users/chavlin/src/yt_/yt_dev/yt/yt/utilities/parallel_tools/parallel_analysis_interface.py", line 734, in __del__
    self.comm.Free()
  File "mpi4py/MPI/Comm.pyx", line 229, in mpi4py.MPI.Comm.Free
mpi4py.MPI.Exception: MPI_ERR_COMM: invalid communicator

With the changes in this PR, there is no error as only the copy is freed.
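
For anyone without MPI at hand, the same mechanism can be modelled roughly in plain Python. This is a sketch under assumptions: `FakeComm` is a hypothetical stand-in for an mpi4py communicator, and `Communicator` here is a heavily simplified imitation of a wrapper whose `__del__` frees its handle, not the actual yt class. Popping the wrapper drops its last reference, the destructor runs, and only the duplicate is affected.

```python
import gc


class FakeComm:
    """Stand-in for an MPI communicator handle (hypothetical, not mpi4py)."""

    def __init__(self):
        self.valid = True

    def Dup(self):
        # Return an independent handle.
        return FakeComm()

    def Free(self):
        if not self.valid:
            raise RuntimeError("invalid communicator")
        self.valid = False


class Communicator:
    """Hypothetical, simplified wrapper whose destructor frees its handle."""

    def __init__(self, comm):
        self.comm = comm

    def __del__(self):
        self.comm.Free()


world = FakeComm()                    # plays the role of MPI.COMM_WORLD
dup = world.Dup()                     # the wrapper stores a duplicate
communicators = [Communicator(dup)]

communicators.pop()                   # drops the last reference to the wrapper
gc.collect()                          # not needed on CPython; helps other interpreters

print("duplicate still valid:", dup.valid)
print("original still valid:", world.valid)
```

On CPython the duplicate is reported as freed while the original stays valid. Had the wrapper held `world` directly, the destructor would have freed the original instead.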

@matthewturk
Member

@reuterk thank you for submitting it. I'm going to go ahead and accept it (even with the outstanding minor style issue @neutrinoceros raised) so that we can get it in. Thank you!

@matthewturk matthewturk merged commit fbe8cc5 into yt-project:main Apr 26, 2024

welcome bot commented Apr 26, 2024

Hooray! Congratulations on your first merged pull request! We hope we keep seeing you around! 🎆

meeseeksmachine pushed a commit to meeseeksmachine/yt that referenced this pull request Apr 26, 2024
@neutrinoceros neutrinoceros added this to the 4.4.0 milestone Apr 27, 2024
@neutrinoceros neutrinoceros removed the backport-stable on-merge: backport to stable label Apr 27, 2024