Distributed multiprocessing #439
Conversation
We can merge after testing on the DAQ. I'm worried that RabbitMQ won't scale, since things like Celery just use it as a message backend and then use Redis/Mongo as a data backend. If it runs on the DAQ, then I'm sure it's fine. Revisit the ROOT lockfile crap if it ever has a problem, since you're reinventing lockfiles.
I'll review. Want to go through it for an hour on Monday? Will fix master too. Sent from phone
Conflicts: requirements.txt
This adds distributed processing capabilities to pax. The main motivation is parallelizing the event builder to achieve a higher throughput of events, but we can probably make broader use of it.
Introduction
Pax multiprocessing works like this:
Each blue box is a pax process. Arrows indicate the flow of events (which actually goes in blocks, usually of 10 events apiece) and green circles indicate queues in which events wait until we are ready to work on them.
As you can see, an "input pax" produces events and puts them on a processing queue. Several workers fetch these events, process them, and put them on an output queue. Finally, an "output pax" fetches the events from this queue, puts them back in the original order, and writes them to a file.
You might call such a series of paxes in communication with each other via queues a chain. Other chain layouts are imaginable (e.g. several paxes simulating waveforms that are combined into one dataset by an output pax), but this is the most important one. Input/output operations are often intrinsically serial (e.g. writing to a file) or very hard to parallelize (triggering a batch of data), but most of the CPU-intensive work can easily be parallelized, as each event is independent.
Inside pax, access to the queues is handled by the input plugin PullFromQueue and the output plugin PushToQueue. (The pax core used to handle this itself, but it turned into a big mess.) Worker paxes use both; input paxes only use PushToQueue (plus some input plugin, e.g. ReadZipped or the trigger) and output paxes only PullFromQueue (plus some output plugin, e.g. WriteROOTClass or WriteZipped).
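To make the division of labour concrete, here is a minimal sketch in plain Python of the idea behind these plugins. It is not pax's actual plugin code (the class names below are deliberately suffixed with "Sketch"); it only illustrates how hiding the queue behind a plugin lets the same processing code run whether the queue lives in local shared memory or on a message broker.

```python
# Minimal sketch (not pax's actual plugin code) of the idea behind PushToQueue
# and PullFromQueue: the plugins hide the queue, so the same worker code runs
# whether the queue lives in shared memory or on a message broker.
import multiprocessing as mp


class PushToQueueSketch:
    """Output-plugin-like object: put blocks of events on a queue."""
    def __init__(self, queue):
        self.queue = queue

    def write(self, block):
        self.queue.put(block)


class PullFromQueueSketch:
    """Input-plugin-like object: fetch blocks of events from a queue."""
    def __init__(self, queue):
        self.queue = queue

    def read(self):
        return self.queue.get()


if __name__ == "__main__":
    processing_queue = mp.Queue()
    # The "input pax" pushes a block of (here) 10 events...
    PushToQueueSketch(processing_queue).write(list(range(10)))
    # ...and a "worker pax" pulls it off again.
    print(PullFromQueueSketch(processing_queue).read())
```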
Distributed multiprocessing
There are three challenges in multiprocessing: knowing when we are done, ordering the events, and watching for crashes (each is discussed in its own section below).
When we wrote the pax multiprocessing code, we assumed the processes would all run locally. We addressed these challenges with a master process that periodically polls all the child processes and a global status variable maintained in shared memory (I won't go into more detail on the current implementation; see #172 and #298). For remote processing we must revisit this; I'll discuss how below.
This is the distributed multiprocessing layout introduced here:
There are two new concepts here:
When you start pax with --cpus 10, you activate local multiprocessing (pax.parallel.local_multiprocessing). This starts a host process to do "old-style" pax multiprocessing using shared-memory queues. There is no message broker involved.

When you start pax with --cpus 10 --remote, you activate remote multiprocessing (pax.parallel.remote_multiprocessing), shown on the left of the figure. Locally, you will host two paxes -- one input, one output. These are configured to put their events in / get their events from two newly made queues on the message broker. Simultaneously, 10 requests to start worker paxes are sent to a "startup watch" queue on the message broker.

The paxmaker script is a host process (small gray boxes on the right) that fetches these startup requests from the message broker. When it gets one, it starts a pax with the indicated configuration (a rough sketch of such a host loop is shown below). In remote multiprocessing, this will start the workers, which are again configured to use the right queues on the message broker for their events. You could also use a listening paxmaker like a batch queue, e.g. to remotely start a pax to plot a few 1000 waveforms you want to look at later.

I've drawn a single pax chain using the message broker here, but running multiple independent chains with the same message broker and the same set of paxmakers is fully supported. The paxmaker processes can then end up hosting paxes belonging to different chains. There will be one crash fanout and one startup channel shared by all (since it is the host processes, not the paxes themselves, that communicate with these).
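For illustration, here is a rough sketch of what such a paxmaker-style host loop could look like on top of RabbitMQ, using pika. The queue name, the JSON message format, and the start_pax helper are assumptions made for this sketch, not pax's actual startup protocol.

```python
# Rough sketch of a paxmaker-style host loop on RabbitMQ, via pika.
# The queue name, JSON message format and start_pax() helper are assumptions
# made for this illustration; pax's actual startup protocol may differ.
import json
import multiprocessing as mp

import pika


def start_pax(config):
    # Hypothetical stand-in: start a pax process with the requested configuration.
    print("starting pax with config:", config)


def main():
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="pax_startup_requests")

    def on_request(ch, method, properties, body):
        config = json.loads(body)
        # Host the requested pax in a child process; the host itself stays alive.
        mp.Process(target=start_pax, args=(config,)).start()
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue="pax_startup_requests", on_message_callback=on_request)
    channel.start_consuming()   # block forever, handling startup requests as they arrive


if __name__ == "__main__":
    main()
```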
When are we done?
When the source of events in the input pax has dried up, it sends a message NO_MORE_EVENTS to the processing queue. A worker process receiving this message knows it can end, but before it does so, it will pass the NO_MORE_EVENTS message back up the processing queue (not the output queue!) so other workers will receive it. (when all of the workers have terminated, there will actually still be this one NO_MORE_EVENTS message on the queue)
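In code, the worker-side shutdown rule could look roughly like this (an illustrative sketch, not pax's actual PullFromQueue logic):

```python
# Illustrative sketch of the worker-side shutdown rule described above (not
# pax's actual plugin code): on NO_MORE_EVENTS, put the message back on the
# *processing* queue so the other workers see it too, then stop.
NO_MORE_EVENTS = "NO_MORE_EVENTS"


def worker_loop(processing_queue, output_queue, process_block):
    while True:
        msg = processing_queue.get()
        if msg == NO_MORE_EVENTS:
            processing_queue.put(NO_MORE_EVENTS)   # hand it back for the next worker
            break                                  # this worker is done
        output_queue.put(process_block(msg))       # results go to the output queue
```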
For the output pax, things are more complicated. The workers know when they themselves are done (when they ask for the next event block but instead see a no-more-events message), but not when all of them are done. Other workers may still be busy processing event blocks.
The solution I chose here is to have each worker push a "register" message onto the output queue when it starts. When the worker is done, it pushes an "unregister" message. The output plugin knows it can end when the last worker has unregistered and there are no events left for it to write. The output plugin keeps checking for new register/unregister messages, not just at startup, since workers may have very different startup times.
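A sketch of the corresponding output-side bookkeeping, again illustrative rather than pax's actual code; the REGISTER/UNREGISTER message names are placeholders:

```python
# Illustrative sketch of the output-side bookkeeping: keep counting REGISTER /
# UNREGISTER messages while running, and stop only once every worker that ever
# registered has unregistered and no blocks remain to be written.
REGISTER, UNREGISTER = "REGISTER", "UNREGISTER"


def output_loop(output_queue, write_block):
    workers_registered = 0
    seen_any_worker = False
    while True:
        msg = output_queue.get()
        if msg == REGISTER:
            workers_registered += 1
            seen_any_worker = True
        elif msg == UNREGISTER:
            workers_registered -= 1
        else:
            write_block(msg)
        if seen_any_worker and workers_registered == 0 and output_queue.empty():
            break   # all workers gone and (as far as we can tell) nothing left to write
```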
Ordering the events
The output worker usually receives blocks out of order (as they don't all take the same time to process) so it needs some way to re-order the blocks. This is not trivial:
Event numbers are not necessarily consecutive (e.g. when you process only selected events with --event 0 2 7 38 ...), which means we can't use schemes relying on predicting what event numbers the next block will have (e.g. start writing the block with event 0; I know blocks are 10 events long, so the next block I can write will have event 10, etc.).

The solution I chose is to tag each block with an incremental "block id" when it is created. All events in the block are tagged with the same id (in the new field Event.block_id) so we don't lose the information when we pass the events to other plugins. With this id, the output worker knows which of the blocks it has received is the next one to write, or whether that block hasn't arrived yet.
Most of this was already implemented before. The only change is annotating the events themselves with the block id, whereas previously we used a shortcut in which the pax core kept track of the block id.
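The reordering itself boils down to a small buffer keyed by block id. A minimal sketch, assuming blocks arrive as (block_id, events) pairs:

```python
# Illustrative sketch of the reordering step: blocks arrive out of order, each
# tagged with its block id (cf. the new Event.block_id field); buffer them and
# write a block only when it is the next one in line.
def write_in_order(incoming_blocks, write_block):
    buffer = {}
    next_block_id = 0
    for block_id, events in incoming_blocks:
        buffer[block_id] = events
        while next_block_id in buffer:          # flush every block we can now write
            write_block(buffer.pop(next_block_id))
            next_block_id += 1


# Blocks 0..3 arriving out of order are still written as 0, 1, 2, 3:
write_in_order([(2, "c"), (0, "a"), (1, "b"), (3, "d")], print)
```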
Watching for crashes
Life is tough; things break. Ideally you want them to crash clearly and not just cause an infinite hang. You also don't want to ssh to every machine after every crash to figure out what happened and get things back up. I included two levels of protection for this:
Again, the host processes themselves stay alive. The timeout exceptions get propagated to the other pax hosts connected to the broker and will terminate the other paxes in the chain.
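For illustration, a rough sketch of how crash propagation over a RabbitMQ fanout exchange could look using pika; the exchange name, the message format, and the terminate_my_paxes hook are assumptions for this sketch, not pax's actual crash-fanout protocol:

```python
# Rough sketch of crash propagation over a RabbitMQ fanout exchange, via pika.
# The exchange name, message format and terminate_my_paxes() hook are
# assumptions for illustration; pax's actual crash-fanout protocol may differ.
import pika

CRASH_EXCHANGE = "pax_crash_fanout"   # hypothetical name


def announce_crash(chain_id, reason):
    """Called by a host whose pax hit a timeout or other fatal error."""
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.exchange_declare(exchange=CRASH_EXCHANGE, exchange_type="fanout")
    channel.basic_publish(exchange=CRASH_EXCHANGE, routing_key="",
                          body="%s:%s" % (chain_id, reason))
    connection.close()


def watch_for_crashes(chain_id, terminate_my_paxes):
    """Run by every host: terminate local paxes of a chain that crashed elsewhere."""
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.exchange_declare(exchange=CRASH_EXCHANGE, exchange_type="fanout")
    # Each host gets its own throwaway queue bound to the fanout exchange.
    queue = channel.queue_declare(queue="", exclusive=True).method.queue
    channel.queue_bind(exchange=CRASH_EXCHANGE, queue=queue)

    def on_crash(ch, method, properties, body):
        crashed_chain, reason = body.decode().split(":", 1)
        if crashed_chain == chain_id:
            terminate_my_paxes(reason)

    channel.basic_consume(queue=queue, on_message_callback=on_crash, auto_ack=True)
    channel.start_consuming()
```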
ROOT compilation lockfiles
I changed the lock file handling for ROOT class compilation to support multiple processes trying to compile the class in the same directory. In case you want to know the details:
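As an illustration only (not pax's actual implementation), a minimal sketch of this kind of lock handling: atomic lock-file creation, waiting while another process holds the lock, and breaking locks older than 90 seconds (the figure mentioned under Future directions below):

```python
# Illustrative sketch only (not pax's actual implementation) of compile-lock
# handling: create the lock file atomically, wait while another process holds
# it, and break locks older than 90 seconds (e.g. left behind by a segfaulted
# compilation; see also "Future directions" below).
import os
import time

LOCK_MAX_AGE = 90   # seconds; compilation should never take this long


def acquire_compile_lock(lockfile, poll_interval=1):
    while True:
        try:
            # O_EXCL makes creation atomic: exactly one process can win the lock.
            fd = os.open(lockfile, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return
        except FileExistsError:
            try:
                # Someone else is compiling; break the lock if it looks stale.
                if time.time() - os.path.getmtime(lockfile) > LOCK_MAX_AGE:
                    os.remove(lockfile)
                    continue
            except FileNotFoundError:
                continue     # the lock vanished between checks; just try again
            time.sleep(poll_interval)


def release_compile_lock(lockfile):
    try:
        os.remove(lockfile)
    except FileNotFoundError:
        pass
```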
Future directions
The main feature currently missing is propagation of log messages. This doesn't matter for the event builder, where all of the action happens in the input plugin, and the rest is dumb but CPU-intensive data fetching and transcoding. If we want to use distributed processing for... well, actual processing, it would be nice to get the log messages in one place rather than in the several STDOUTs of remote machines. Perhaps it would be best to send the log messages along with the events that generated them?
We might want to add a feature that starts up more workers when the processing queue is full. This would just require a small change to remote_multiprocessing.
Another small feature that would be good to add is auto-breaking of a ROOT compilation lock that has lingered due to a segfault in a previous compilation. Previously pax would crash if this happens, forcing people to manually remove the lock -- now it is actually worse, as it will hang forever waiting for the lock file to be removed. Breaking the lock if it is more than 1 or 2 minutes old will resolve this. I added breaking of locks more than 90 seconds old; the compilation should not take that long, and people will want to retry it rather soon after it crashes.

It would be interesting to figure out how RabbitMQ actually handles messages: does it store the actual content on the broker machine, or does it just store a reference and, when a worker fetches it, tell that worker to talk to whoever has the content? The latter would probably be preferable for the event builder, though maybe it will work fine either way.
And of course we need to actually test this on the DAQ. ;-) I did include a few unit tests for the new queue plugins and tested the remote functionality manually.
Please do not pull master in yet as it's currently failing (@tunnell will clean this up soon).