Runner Pool #3629

### Summary

Change the underlying concurrency model of Terragrunt so that a pool of runners is used instead of run groups.

Add Units to the pool of runners when they are ready.

When all Units have completed their runs, end the run.

### Motivation

Terragrunt currently uses a concurrency model in which Units are grouped by their dependencies, and a group runs once no group it depends on is still pending.

For example:

```bash
$ terragrunt run-all plan
14:09:52.456 INFO   The stack at . will be processed in the following order for command plan:
Group 1
- Module ./unit-a
- Module ./unit-b

Group 2
- Module ./unit-depends-on-unit-a
- Module ./unit-depends-on-unit-b
```

This is a simple concurrency model, and is easy to display in logs.

A single failing Unit can cause its entire group, and every dependent group, to fail, so one failure can spread across the whole Stack.

In addition, time is wasted during a run, because a group only starts once every group it depends on has finished. A dependent group therefore starts only when the slowest Unit in the group it depends on completes. In the run below, `./unit-depends-on-fast-unit` has to wait for `./slow-unit`, even though it does not depend on it:

```bash
$ terragrunt run-all plan
14:09:52.456 INFO   The stack at . will be processed in the following order for command plan:
Group 1
- Module ./slow-unit
- Module ./fast-unit

Group 2
- Module ./unit-depends-on-slow-unit
- Module ./unit-depends-on-fast-unit
```
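To make the cost concrete, here is a small sketch that computes finish times for the example above under both models. The runtimes are invented for illustration (the example gives none), the function names are hypothetical, and the pooled calculation assumes the concurrency limit is never reached.

```python
# Hypothetical runtimes in seconds; these numbers are assumptions.
durations = {
    "slow-unit": 8,
    "fast-unit": 2,
    "unit-depends-on-slow-unit": 3,
    "unit-depends-on-fast-unit": 3,
}
deps = {
    "slow-unit": [],
    "fast-unit": [],
    "unit-depends-on-slow-unit": ["slow-unit"],
    "unit-depends-on-fast-unit": ["fast-unit"],
}
groups = [
    ["slow-unit", "fast-unit"],
    ["unit-depends-on-slow-unit", "unit-depends-on-fast-unit"],
]


def finish_times_grouped(groups, durations):
    """A group starts only when the slowest Unit of the previous group is done."""
    finish, start = {}, 0
    for group in groups:
        for unit in group:
            finish[unit] = start + durations[unit]
        start = max(finish[unit] for unit in group)
    return finish


def finish_times_pooled(deps, durations):
    """A Unit starts as soon as its own dependencies are done."""
    finish = {}

    def done_at(unit):
        if unit not in finish:
            start = max((done_at(dep) for dep in deps[unit]), default=0)
            finish[unit] = start + durations[unit]
        return finish[unit]

    for unit in deps:
        done_at(unit)
    return finish


grouped = finish_times_grouped(groups, durations)
pooled = finish_times_pooled(deps, durations)
```

With these assumed runtimes, `unit-depends-on-fast-unit` finishes at 11s in the grouped model and at 5s when it only waits for its own dependency, while the total runtime stays at 11s in both.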

### Proposal

When Terragrunt starts a run, create a runner pool and a Unit queue, then add unblocked Units to the pool from the queue and run them until the queue is empty.

This will make it so that runs can complete more efficiently and so that individual failing Units do not cause the entire run to fail.

Some users may prefer the current behavior, where Terragrunt fails early if an individual Unit fails. To support that use case, a `--terragrunt-fail-fast` flag will be introduced.
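The following is a minimal sketch of the proposed behavior, not Terragrunt's implementation (Terragrunt itself is written in Go). It assumes each Unit can be modeled as a callable that returns an exit code. The names `run_with_pool`, `units`, `deps`, and `max_workers` are hypothetical, and the `fail_fast` parameter stands in for the proposed flag.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_with_pool(units, deps, max_workers=4, fail_fast=False):
    """Run Units through a bounded pool and return a status for every Unit.

    units: maps a Unit name to a zero-argument callable returning an exit code.
    deps:  maps a Unit name to the names of the Units it depends on.
    """
    # Queue of Units that have not started yet, with what still blocks them.
    blocked_by = {name: set(deps.get(name, ())) for name in units}
    status = {}
    running = {}  # future -> Unit name

    def evict_descendants(failed):
        # Repeatedly drop queued Units blocked by a failed or evicted Unit.
        doomed = {failed}
        changed = True
        while changed:
            changed = False
            for name in list(blocked_by):
                if blocked_by[name] & doomed:
                    del blocked_by[name]
                    status[name] = "ancestor failed"
                    doomed.add(name)
                    changed = True

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while blocked_by or running:
            # Move unblocked Units from the queue into free pool slots.
            ready = [name for name, blockers in blocked_by.items() if not blockers]
            for name in ready[: max_workers - len(running)]:
                del blocked_by[name]
                running[pool.submit(units[name])] = name
            if not running:
                break  # nothing can ever start, e.g. a dependency cycle

            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                if future.exception() is None and future.result() == 0:
                    status[name] = "succeeded"
                    for blockers in blocked_by.values():
                        blockers.discard(name)
                    continue
                status[name] = "failed"
                evict_descendants(name)
                if fail_fast:
                    for queued in list(blocked_by):
                        status[queued] = "fail fast"
                    blocked_by.clear()

    for name in blocked_by:  # only reached after the cycle break above
        status[name] = "blocked"
    return status
```

In this sketch, a failing Unit only takes down the Units that depend on it, directly or transitively. Unrelated Units keep running unless `fail_fast` is set, in which case everything still waiting in the queue is evicted while Units already in the pool are allowed to finish.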

### Technical Details

### Algorithm

The algorithm for working with the runner pool will be as follows:

1. Discover Units and add them to a queue with metadata regarding their dependencies.

    Store that metadata as a slice of the Units that the discovered Unit is `blocked by`.

    In addition, store a slice of `dependencies` for each Unit.

    Set the status of a Unit to `ready` if its `blocked by` slice is empty, and to `blocked` otherwise.

    Sort the queue by:

    1. Units with `ready` status first.
    2. Units with more dependencies before Units with fewer dependencies.
2. Create a runner pool with a size equal to the minimum of:
    1. The total number of Units.
    2. The configured maximum concurrency.
3. Do the following in a loop:
    1. Add `ready` Units to the pool based on precedence (more dependencies go first) and set their status to `pending` until either:   
        1. There is no space in the pool.    
        2. There are no more `ready` Units in the queue.
    2. Concurrently run all `pending` Units. This changes their status to `running`. When a run completes, it does the following:
        1. If the run was successful (exit code of 0):
            1. Change the status to `succeeded`.
            2. Remove the Unit from any `blocked by` in the queue.
                1. If any Unit has an empty `blocked by` as a consequence, set the status to `ready`.
            3. Remove the Unit from the pool.
        2. If the run failed (exit code ≠ 0):
            1. Change the status to `failed`.
            2. For any Units in the queue that were `blocked by` the Unit:
                1. Set their status to `ancestor failed`.
                2. Remove them from the queue.
                3. Recursively repeat for any Unit `blocked by` a removed Unit.
            3. If a user sets the `--terragrunt-fail-fast` flag, do the following for all remaining Units:
                1. Set their status to `fail fast`.
                2. Evict them from the queue.
            4. Remove the Unit from the pool.
    3. Poll the queue until one of the following is true:
        1. The queue is empty. Break the loop.
        2. The pool has space and one or more Units in the queue are `ready`. Continue with the next iteration of the loop.
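The queue bookkeeping described above can be sketched as follows. This is an illustration under assumptions rather than the actual design: `QueueEntry`, `build_queue`, `pool_size`, and `on_success` are hypothetical names, and the sketch assumes that `blocked by` starts out as a copy of `dependencies` and shrinks as dependencies succeed.

```python
from dataclasses import dataclass, field


@dataclass
class QueueEntry:
    path: str
    dependencies: list                              # fixed, used for sorting
    blocked_by: list = field(default_factory=list)  # shrinks as dependencies succeed
    status: str = "blocked"


def build_queue(dependency_map):
    """Create entries, set initial statuses, and sort the queue."""
    queue = []
    for path, deps in dependency_map.items():
        entry = QueueEntry(path, list(deps), list(deps))
        entry.status = "blocked" if entry.blocked_by else "ready"
        queue.append(entry)
    # `ready` first, then more dependencies first. list.sort is stable,
    # so ties keep their discovery order.
    queue.sort(key=lambda e: (e.status != "ready", -len(e.dependencies)))
    return queue


def pool_size(queue, max_concurrency):
    """Never create more runners than there are Units."""
    return min(len(queue), max_concurrency)


def on_success(queue, finished_path):
    """Success branch of the loop: unblock Units waiting on a succeeded Unit."""
    for entry in queue:
        if finished_path in entry.blocked_by:
            entry.blocked_by.remove(finished_path)
            if not entry.blocked_by and entry.status == "blocked":
                entry.status = "ready"


queue = build_queue({
    "unit-a": [],
    "unit-depends-on-unit-a": ["unit-a"],
    "unit-b": [],
    "unit-depends-on-unit-a-and-b": ["unit-a", "unit-b"],
})
```

Here the two `ready` Units sort to the front, followed by the blocked Unit with two dependencies and then the one with a single dependency. Calling `on_success(queue, "unit-a")` would flip `unit-depends-on-unit-a` to `ready` while `unit-depends-on-unit-a-and-b` stays `blocked` on `unit-b`.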

### Diagrams

Simple diagram of how units run in the current groups approach vs. in runner pools:

![groups-vs-runner-pools](https://github.com/user-attachments/assets/15c0dab6-41e9-4852-b46f-69e6043529b8)

Worst case scenario of how this would impact performance:

In Groups

![worst-case-scenario-groups](https://github.com/user-attachments/assets/5a55830f-669e-4b82-8167-f12955ab84d5)

In Pools

![worst-case-scenario-pools](https://github.com/user-attachments/assets/38e55df4-0462-4bca-8b48-b47e733c32da)

In the worst case for Runner Pools, the total runtime for everything in the run queue is the same under both approaches. You can see this in the way the blue 8s Unit and the green 6s Unit combine to determine the total execution time in both diagrams.

Even in this scenario, however, the purple Units that depend on red complete their runs sooner. This is one of the main advantages of the approach: on average, more Units run concurrently, making better use of the available hardware.

Compare this to a best case scenario where the 8s blue unit has dropped down to a 4s runtime:

In Groups

![best-case-scenario-groups](https://github.com/user-attachments/assets/d416b8b6-7b60-4ae3-85d1-afd67f37c734)

In Pools

![best-case-scenario-pools](https://github.com/user-attachments/assets/869e4333-5dc1-4228-b6df-b03f7e42b847)

As you can see, because the slower green unit is no longer blocked by the entirety of group 1, the total run completes faster, and the purple units finish at the same timestamp.

### Press Release

Introducing Terragrunt Runner Pools!

Starting with release x.y.z, Terragrunt now ships with an additional experimental concurrency model referred to as Runner Pools.

This new concurrency model allows users to perform large `run-all` invocations without individual failures impacting the success of the overall run, and allows runs to finish faster, on average.

To enable runner pools, set the following environment variable to opt in:

`TERRAGRUNT_EXPERIMENTAL_RUNNER_POOL=1`

### Drawbacks

This is a more complicated model than is currently used by Terragrunt, and may be more difficult to display in logs.

e.g.

```bash
$ terragrunt run-all plan
14:09:52.456 INFO   The stack at . has been added to the runner queue in the following order:
| ./unit-a                               |
| ./unit-b                               |
| ./unit-depends-on-unit-a               |
| --> Depends on: [ ./unit-a ]           |
| ./unit-depends-on-unit-b               |
| --> Depends on: [ ./unit-b ]           |
| ./unit-depends-on-unit-a-and-b         |
| --> Depends on: [ ./unit-a, ./unit-b ] |
```

There are also more opportunities to accidentally deadlock Terragrunt, as checks have to be done at multiple stages to proceed with a run.
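One check that could guard against a stalled run is to verify, before any Unit starts, that every Unit can eventually become `ready`. The sketch below is only one possible approach and is not taken from Terragrunt. `find_unrunnable` is a hypothetical helper that repeatedly resolves Units with no outstanding dependencies and reports whatever is left over.

```python
from collections import deque


def find_unrunnable(deps):
    """Return the Units that can never become ready.

    deps maps a Unit name to the names of the Units it depends on. A Unit is
    unrunnable if it sits on a dependency cycle, depends on a Unit that does,
    or depends on a Unit that is not in `deps` at all.
    """
    remaining = {unit: set(blockers) for unit, blockers in deps.items()}
    ready = deque(unit for unit, blockers in remaining.items() if not blockers)
    while ready:
        finished = ready.popleft()
        del remaining[finished]
        for unit, blockers in remaining.items():
            if finished in blockers:
                blockers.discard(finished)
                if not blockers:
                    ready.append(unit)
    return sorted(remaining)


stuck = find_unrunnable({"a": [], "b": ["a"], "c": ["d"], "d": ["c"], "e": ["d"]})
```

For this input, `stuck` contains `c` and `d`, which depend on each other, and `e`, which waits on `d`. An empty result means every Unit can eventually be scheduled.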

A big change like this should also be opt-in initially, as any significant issue with the new concurrency model will probably make Terragrunt unusable for users. We'll also want to give users a chance to validate that the new model does improve performance in real production use cases before forcing everyone to switch over. Supporting both models during that period can be a significant maintenance burden, and might make it hard to keep development velocity up.

### Alternatives

### Don't do it

The current model works, and works at fairly large scale. It is simpler to reason about and display. Recent additions like the `errors` block also make it easier to ignore failures, so users can already ignore errors from flaky Units.

### Don't release Runner Pools as an experiment

Simultaneously supporting two concurrency models adds quite a bit of complexity and increases the risk of introducing bugs into either one. The value of releasing runner pools as an experiment is that users can test them in the wild, but it might be better to test extensively up front and then make this the only mechanism for running Terragrunt.

### Migration Strategy

Start with this new concurrency model being an experimental opt-in feature. Users may prefer the old concurrency model, and they should be given time to try out the new model before it is the default.

This will also give time for the new concurrency model to be tested on production infrastructure for early adopters before all users are forced to adopt it when it becomes the default.

### Unresolved Questions

- What are the hidden risks to changing the concurrency model in this way?
- This is a more complex concurrency model, and may require additional design to convey information about the way in which Terragrunt is going to run units in the terminal. How should Terragrunt help users understand what is happening, visually?
- How long should the experiment run, and what determines success?

### References

_No response_

### Proof of Concept Pull Request

_No response_

### Support Level

- [ ] I have Terragrunt Enterprise Support
- [ ] I am a paying Gruntwork customer

### Customer Name

_No response_
