
Conversation

@markples
Contributor

@markples markples commented Jul 25, 2024

A few problems can occur with workloads that only allocate huge objects:

  • Regions can be pushed to the global decommit list faster than we can process them. This can lead to OOM (more easily reproducible using a hard heap limit).
  • BGCs don't currently call distribute_free_regions, so if only those occur, then the freelist processing never occurs at all.

This change addresses those by doing the following:

  • Call distribute_free_regions during background_mark_phase (while we initially have the VM suspended to avoid complications).
  • Currently, distribute_free_regions can't hit the move_highest_free_regions code path when background_running_p(). Change that check to also require settings.condemned_generation != max_generation.
  • Slow the movement of huge regions from the freelist to the global decommit list.
    • Add aging to regions in global_free_huge_regions.
    • Require an age of MIN_AGE_TO_DECOMMIT_HUGE (2) before including them in a surplus balance.
    • Factor aging code into age_free_regions.
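The aging scheme above can be sketched roughly as follows. This is a simplified stand-in, not the actual gc.cpp code: only `MIN_AGE_TO_DECOMMIT_HUGE` and the name `age_free_regions` come from the PR; the `region` struct and `count_decommittable_huge` are hypothetical (the real GC tracks age on `heap_segment` entries in its free-region lists).

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a free region; the real GC keeps this
// state on heap_segment entries in its free-region lists.
struct region
{
    int age = 0;
    bool huge = false;
};

// Constant introduced by the PR: a huge region must survive this many
// GCs on the free list before it is eligible for decommit.
const int MIN_AGE_TO_DECOMMIT_HUGE = 2;

// Factored aging step (the PR factors this into age_free_regions):
// each GC, every region still on the free list gets one GC older.
void age_free_regions(std::vector<region>& free_list)
{
    for (region& r : free_list)
        r.age++;
}

// Only huge regions that have aged enough count toward a decommittable
// surplus; younger ones stay committed and cached for reuse.
int count_decommittable_huge(const std::vector<region>& free_list)
{
    int n = 0;
    for (const region& r : free_list)
        if (r.huge && r.age >= MIN_AGE_TO_DECOMMIT_HUGE)
            n++;
    return n;
}
```

The effect is that a huge region freed just before a GC is not immediately pushed toward the global decommit list; it has to sit unused through `MIN_AGE_TO_DECOMMIT_HUGE` GCs first, which slows the flow of regions onto the decommit list.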

Fixes #94175

@markples markples added this to the 9.0.0 milestone Jul 25, 2024
@markples markples self-assigned this Jul 25, 2024
@ghost ghost added the area-GC-coreclr label Jul 25, 2024
@Maoni0
Member

Maoni0 commented Jul 26, 2024

actually there is an optimization we should do here to avoid calling distribute_free_regions at the beginning of a BGC if we just did an ephemeral GC (there's no need to call it again as that ephemeral GC would've called it).

@markples
Contributor Author

> actually there is an optimization we should do here to avoid calling distribute_free_regions at the beginning of a BGC if we just did an ephemeral GC (there's no need to call it again as that ephemeral GC would've called it).

I did this for distribute_free_regions and age_free_regions. The latter prevents the basic regions from being aged twice but requires the ephemeral GC to age the large and huge regions.
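A minimal sketch of that "skip if an ephemeral GC just ran" optimization, under stated assumptions: the `gc_state` struct and its flag are hypothetical illustrations, since the real GC tracks this internally as part of starting a BGC.

```cpp
#include <cassert>

// Hypothetical model of the optimization: a flag stands in for
// "an ephemeral GC ran as part of starting this BGC".
struct gc_state
{
    bool ephemeral_gc_just_ran = false;
    int distribute_calls = 0;

    void distribute_free_regions() { distribute_calls++; }

    // An ephemeral GC always distributes (and ages) free regions.
    void do_ephemeral_gc()
    {
        distribute_free_regions();
        ephemeral_gc_just_ran = true;
    }

    // The BGC only repeats the work if no ephemeral GC preceded it,
    // making the new call a nop in the common combined case.
    void start_background_gc()
    {
        if (!ephemeral_gc_just_ran)
            distribute_free_regions();
        ephemeral_gc_just_ran = false;
    }
};
```

This matches the observation later in the thread that the change is a nop for BGCs that run an ephemeral GC, while BGCs without one pick up the extra `distribute_free_regions` call during suspension.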

Member

@Maoni0 Maoni0 left a comment


LGTM!

@markples
Contributor Author

Verified that the original test case in #94175 is fixed by this PR.

@markples
Contributor Author

markples commented Aug 5, 2024

I used our LowVolatilityRuns (which has normal, soh_pinning, loh, and poh workloads) and didn't see any impact from this change. Overall workload time and memory usage looked the same. Any time I saw a long first BGC pause it was due to running an ephemeral GC, and when I removed those, any blip I saw ended up being due to suspension time, which occurs with or without this change. The last optimization added has the nice effect of making this a nop for BGCs that run an ephemeral GC (though it should be noted that the cases -without- an ephemeral GC would be the ones where the first BGC pause could be increased the most percentage-wise).

Methodology for this was to fix the runtime to emit the BGC events, change TraceEvent to store the BGCStart and BGC1stNonCondStop event times, and view summary tables and charts in our analysis notebook.

@markples markples merged commit b27a808 into dotnet:main Aug 5, 2024
markples added a commit that referenced this pull request Aug 16, 2024

In #105521, the number of regions to be decommitted can be reduced, but the budgets weren't updated to include the new regions. This was fine for huge regions, which just sit in the global free list anyway, and it (sort of) works in release builds (though some regions may end up decommitted anyway if they are still in the surplus list at the end of distribution), but it isn't the intended behavior and can trigger a debug assertion that the surplus list is empty.

This change (a subset of #106168) restructures distribute_free_regions so that instead of "decommit or adjust budgets", we first decommit and adjust the remaining balance. Then we adjust budgets based on the new value.
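The two-phase restructuring can be sketched like this. The function name, parameters, and `distribution_result` struct are hypothetical simplifications of the real `distribute_free_regions` logic, which operates on per-heap region lists rather than plain counts.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical simplified model: counts stand in for region lists.
struct distribution_result
{
    int decommitted; // regions actually decommitted
    int budgeted;    // regions folded back into heap budgets
};

distribution_result distribute(int surplus, int max_decommit)
{
    // Step 1: decommit up to the cap and shrink the remaining balance.
    int decommitted = std::min(surplus, max_decommit);
    int balance = surplus - decommitted;

    // Step 2: derive budgets from the new balance, so every region not
    // decommitted is accounted for and none is stranded on a surplus
    // list (which would trip the debug assertion mentioned above).
    return { decommitted, balance };
}
```

The design point is ordering: deciding the decommit amount first and then budgeting the remainder keeps the two bookkeeping views consistent, whereas the old "decommit or adjust budgets" branching could leave regions counted by neither.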
@github-actions github-actions bot locked and limited conversation to collaborators Sep 7, 2024

Successfully merging this pull request may close these issues.

Unexpected OOM with GCHeapLimit