Summary
- Change the underlying concurrency model of Terragrunt so that a pool of runners is leveraged instead of run groups.
- Add Units to the pool of runners when they are ready.
- When all Units have completed their runs, end the run.
Motivation
Terragrunt currently runs Units in a concurrency model where Units are grouped based on their dependencies: the Units within a group run in parallel, and a group only runs once every group it depends on has completed.
e.g.

```
$ terragrunt run-all plan
14:09:52.456 INFO The stack at . will be processed in the following order for command plan:
Group 1
- Module ./unit-a
- Module ./unit-b
Group 2
- Module ./unit-depends-on-unit-a
- Module ./unit-depends-on-unit-b
```
This is a simple concurrency model, and is easy to display in logs.
A single Unit failing during a run can cause its entire group, and all dependent groups, to fail, meaning that one failing Unit can cause widespread failure across a Stack.
In addition, time is wasted during a run, because a group only starts once every group it depends on has finished. A group will therefore only start running when the slowest Unit in its dependency groups completes, even if the Units it actually depends on finished long before.
```
$ terragrunt run-all plan
14:09:52.456 INFO The stack at . will be processed in the following order for command plan:
Group 1
- Module ./slow-unit
- Module ./fast-unit
Group 2
- Module ./unit-depends-on-slow-unit
- Module ./unit-depends-on-fast-unit
```
Proposal
When Terragrunt starts a run, create a runner pool and a Unit queue, then add unblocked Units to the pool from the queue and run them until the queue is empty.
This allows runs to complete more efficiently, and prevents an individual failing Unit from failing the entire run.
Some users may prefer the current behavior, where Terragrunt fails early if an individual Unit fails. To support that use-case, a --terragrunt-fail-fast
flag will be introduced.
Technical Details
Algorithm
The algorithm for working with the runner pool will be as follows:
- Discover Units and add them to a queue with metadata regarding their dependencies. Store that metadata as a slice of Units the discovered Unit is *blocked by*. In addition, store a slice of *dependencies* for each Unit. Set the status of Units with an empty `blocked by` list to `ready`, and set the status of the rest to `blocked`. Sort the queue by:
  - Units with `ready` status first.
  - Units with more dependencies before Units with fewer dependencies.
- Create a runner pool with a size equal to the minimum of:
  - The total number of Units.
  - The configured maximum concurrency.
- Do the following in a loop:
  - Add `ready` Units to the pool based on precedence (more dependencies go first) and set their status to `pending` until either:
    - There is no space in the pool.
    - There are no more `ready` Units in the queue.
  - Concurrently run all `pending` Units. This changes their status to `running`. When a run completes, it does the following:
    - If the run was successful (exit code of 0):
      - Change the status to `succeeded`.
      - Remove the Unit from any `blocked by` list in the queue.
        - If any Unit has an empty `blocked by` list as a consequence, set its status to `ready`.
      - Remove the Unit from the pool.
    - If the run failed (exit code ≠ 0):
      - Change the status to `failed`.
      - For any Units in the queue that were `blocked by` the Unit:
        - Set their status to `ancestor failed`.
        - Remove them from the queue.
        - Recursively repeat for any Unit `blocked by` a removed Unit.
      - If a user sets the `--terragrunt-fail-fast` flag, do the following for all remaining Units:
        - Set their status to `fail fast`.
        - Evict them from the queue.
      - Remove the Unit from the pool.
  - Poll the queue for one of the following:
    - The queue is empty; break the loop.
    - The pool has space and one or more Units in the queue are `ready`.
Diagrams
Simple diagram of how units run in the current groups approach vs. in runner pools:

*(diagram: side-by-side unit timelines, "In Groups" vs. "In Pools")*

Worst case scenario of how this would impact performance:

*(diagrams: worst-case timelines, "In Groups" and "In Pools")*

The worst case scenario for the change to Runner Pools is that the total runtime for everything in the run queue is the same between the two approaches. You can see that by the blue 8s and green 6s units combining to slow down the total execution in both.
Even in this scenario, however, the purple units that depend on red complete their runs faster. This is one of the main advantages of this approach: more concurrency is used on average, making better use of the available hardware.
Compare this to a best case scenario where the blue unit has dropped from an 8s to a 4s runtime:

*(diagrams: best-case timelines, "In Groups" and "In Pools")*

As you can see, because the slower green unit is no longer blocked by the entirety of group 1, the total run completes faster, and the purple units finish at the same timestamp.
Press Release
Introducing Terragrunt Runner Pools!
Starting with release x.y.z, Terragrunt now ships with an additional experimental concurrency model referred to as Runner Pools.
This new concurrency model allows users to perform large `run-all` invocations without individual failures impacting the success of the overall run, and allows runs to finish faster, on average.
To enable runner pools, set the following environment variable to opt in:
TERRAGRUNT_EXPERIMENTAL_RUNNER_POOL=1
Drawbacks
This is a more complicated model than is currently used by Terragrunt, and may be more difficult to display in logs.
e.g.

```
$ terragrunt run-all plan
14:09:52.456 INFO The stack at . has been added to the runner queue in the following order:
| ./unit-a                                   |
| ./unit-b                                   |
| ./unit-depends-on-unit-a                   |
|   --> Depends on: [ ./unit-a ]             |
| ./unit-depends-on-unit-b                   |
|   --> Depends on: [ ./unit-b ]             |
| ./unit-depends-on-unit-a-and-b             |
|   --> Depends on: [ ./unit-a, ./unit-b ]   |
```
There are also more opportunities to accidentally deadlock Terragrunt, as checks have to be done at multiple stages to proceed with a run.
A big change like this should also be opt-in initially, as any significant issue with the new concurrency model could make Terragrunt unusable for affected users. We'll also want to give users a chance to validate that the new model improves performance in real production use-cases before forcing everyone to switch over. Supporting both models can be a significant maintenance burden, however, and might make it hard to keep development velocity up.
Alternatives
Don't do it
The current model works, and works at fairly large scale. It is simpler to reason about and display. Recent additions like the error
block also make it easier to ignore failures, so users with flaky Units can simply ignore those errors.
Don't release Runner Pools as an experiment
It adds quite a bit of complexity to simultaneously support two concurrency models, and risks introducing bugs into either. The value of releasing Runner Pools as an experiment is that users can test them in the wild, but it might be better to undergo extensive testing preemptively, then make this the only mechanism for running Terragrunt.
Migration Strategy
Start with this new concurrency model being an experimental opt-in feature. Users may prefer the old concurrency model, and they should be given time to try out the new model before it is the default.
This will also give time for the new concurrency model to be tested on production infrastructure for early adopters before all users are forced to adopt it when it becomes the default.
Unresolved Questions
- What are the hidden risks to changing the concurrency model in this way?
- This is a more complex concurrency model, and may require additional design to convey information about the way in which Terragrunt is going to run units in the terminal. How should Terragrunt help users understand what is happening, visually?
- How long should the experiment run, and what determines success?