
Conversation

@markples
Contributor

@markples markples commented Jul 25, 2024

A few problems can occur with workloads that only allocate huge objects:

  • Regions can be pushed to the global decommit list faster than we can process them. This can lead to OOM (more easily reproducible using a hard heap limit).
  • BGCs don't currently call distribute_free_regions, so if only those occur, then the freelist processing never occurs at all.

This change addresses those by doing the following:

  • Call distribute_free_regions during background_mark_phase (while we initially have the VM suspended to avoid complications).
  • Currently, distribute_free_regions can't hit the move_highest_free_regions code path when background_running_p(). Change that check to also require settings.condemned_generation != max_generation.
  • Slow the movement of huge regions from the freelist to the global decommit list.
    • Add aging to regions in global_free_huge_regions.
    • Require an age of MIN_AGE_TO_DECOMMIT_HUGE (2) before including them in a surplus balance.
    • Factor aging code into age_free_regions.
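The aging scheme above can be sketched roughly as follows. This is a simplified stand-in, not the actual gc.cpp code: only `MIN_AGE_TO_DECOMMIT_HUGE` and the name `age_free_regions` come from the PR; the `region` struct and `count_decommittable_huge` are hypothetical (the real GC tracks age on `heap_segment` entries in its free-region lists).

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a free region; the real GC keeps this
// state on heap_segment entries in its free-region lists.
struct region
{
    int age = 0;
    bool huge = false;
};

// Constant introduced by the PR: a huge region must survive this many
// GCs on the free list before it is eligible for decommit.
const int MIN_AGE_TO_DECOMMIT_HUGE = 2;

// Factored aging step (the PR factors this into age_free_regions):
// each GC, every region still on the free list gets one GC older.
void age_free_regions(std::vector<region>& free_list)
{
    for (region& r : free_list)
        r.age++;
}

// Only huge regions that have aged enough count toward a decommittable
// surplus; younger ones stay committed and cached for reuse.
int count_decommittable_huge(const std::vector<region>& free_list)
{
    int n = 0;
    for (const region& r : free_list)
        if (r.huge && r.age >= MIN_AGE_TO_DECOMMIT_HUGE)
            n++;
    return n;
}
```

The effect is that a huge region freed just before a GC is not immediately pushed toward the global decommit list; it has to sit unused through `MIN_AGE_TO_DECOMMIT_HUGE` GCs first, which slows the flow of regions onto the decommit list.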

Fixes #94175

@markples markples added this to the 9.0.0 milestone Jul 25, 2024
@markples markples self-assigned this Jul 25, 2024
@ghost ghost added the area-GC-coreclr label Jul 25, 2024
@Maoni0
Member

Maoni0 commented Jul 26, 2024

actually there is an optimization we should do here to avoid calling distribute_free_regions at the beginning of a BGC if we just did an ephemeral GC (there's no need to call it again as that ephemeral GC would've called it).

@markples
Contributor Author

> actually there is an optimization we should do here to avoid calling distribute_free_regions at the beginning of a BGC if we just did an ephemeral GC (there's no need to call it again as that ephemeral GC would've called it).

I did this for distribute_free_regions and age_free_regions. The latter prevents the basic regions from being aged twice but requires the ephemeral GC to age the large and huge regions.
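A minimal sketch of that "skip if an ephemeral GC just ran" optimization, under stated assumptions: the `gc_state` struct and its flag are hypothetical illustrations, since the real GC tracks this internally as part of starting a BGC.

```cpp
#include <cassert>

// Hypothetical model of the optimization: a flag stands in for
// "an ephemeral GC ran as part of starting this BGC".
struct gc_state
{
    bool ephemeral_gc_just_ran = false;
    int distribute_calls = 0;

    void distribute_free_regions() { distribute_calls++; }

    // An ephemeral GC always distributes (and ages) free regions.
    void do_ephemeral_gc()
    {
        distribute_free_regions();
        ephemeral_gc_just_ran = true;
    }

    // The BGC only repeats the work if no ephemeral GC preceded it,
    // making the new call a nop in the common combined case.
    void start_background_gc()
    {
        if (!ephemeral_gc_just_ran)
            distribute_free_regions();
        ephemeral_gc_just_ran = false;
    }
};
```

This matches the observation later in the thread that the change is a nop for BGCs that run an ephemeral GC, while BGCs without one pick up the extra `distribute_free_regions` call during suspension.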

Member

@Maoni0 Maoni0 left a comment


LGTM!

@markples
Contributor Author

Verified that the original test case in #94175 is fixed by this PR.

@markples
Contributor Author

markples commented Aug 5, 2024

I used our LowVolatilityRuns (which has normal, soh_pinning, loh, and poh workloads) and didn't see any impact from this change. Overall workload time and memory usage looked the same. Any time I saw a long first BGC pause it was due to running an ephemeral GC, and when I removed those, any blip I saw ended up being due to suspension time, which occurs with or without this change. The last optimization added has the nice effect of making this a nop for BGCs that run an ephemeral GC (though it should be noted that the cases -without- an ephemeral GC would be the ones where the first BGC pause could be increased the most percentage-wise).

Methodology for this was to fix the runtime to emit the BGC events, change TraceEvent to store the BGCStart and BGC1stNonCondStop event times, and view summary tables and charts in our analysis notebook.

@markples markples merged commit b27a808 into dotnet:main Aug 5, 2024
markples added a commit that referenced this pull request Aug 16, 2024

In #105521, the number of regions to be decommitted can be reduced, but the budgets weren't updated to include the new regions. This was fine for huge regions, which just sit in the global free list anyway, and it (sort of) works in release builds (though some regions may end up decommitted anyway if they are still in the surplus list at the end of distribution), but it isn't the intended behavior and can trigger a debug assertion that the surplus list is empty.

This change (a subset of #106168) restructures distribute_free_regions so that instead of "decommit or adjust budgets", we first decommit and adjust the remaining balance. Then we adjust budgets based on the new value.
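The two-phase restructuring can be sketched like this. The function name, parameters, and `distribution_result` struct are hypothetical simplifications of the real `distribute_free_regions` logic, which operates on per-heap region lists rather than plain counts.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical simplified model: counts stand in for region lists.
struct distribution_result
{
    int decommitted; // regions actually decommitted
    int budgeted;    // regions folded back into heap budgets
};

distribution_result distribute(int surplus, int max_decommit)
{
    // Step 1: decommit up to the cap and shrink the remaining balance.
    int decommitted = std::min(surplus, max_decommit);
    int balance = surplus - decommitted;

    // Step 2: derive budgets from the new balance, so every region not
    // decommitted is accounted for and none is stranded on a surplus
    // list (which would trip the debug assertion mentioned above).
    return { decommitted, balance };
}
```

The design point is ordering: deciding the decommit amount first and then budgeting the remainder keeps the two bookkeeping views consistent, whereas the old "decommit or adjust budgets" branching could leave regions counted by neither.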
@github-actions github-actions bot locked and limited conversation to collaborators Sep 7, 2024

Successfully merging this pull request may close these issues.

Unexpected OOM with GCHeapLimit