[GC] Avoid OOM in large-allocation-only workloads #105521
actually there is an optimization we should do here to avoid calling |
I did this for |
LGTM!
Verified that the original test case in #94175 is fixed by this PR.
I used our LowVolatilityRuns (which has normal, soh_pinning, loh, and poh workloads) and didn't see any impact from this change. Overall workload time and memory usage looked the same. Any time I saw a long first BGC pause, it was due to running an ephemeral GC, and when I removed those, any blip that I saw ended up being due to suspension time, which occurs with or without this change. The last optimization added has the nice effect of making this a nop for BGCs that run an ephemeral GC (though it should be noted that the cases -without- an ephemeral GC would be the ones where the first BGC pause could be increased the most percentage-wise). Methodology for this was to fix the runtime to emit the BGC events, change TraceEvent to store the BGCStart and BGC1stNonCondStop event times, and view summary tables and charts in our analysis notebook.
In #105521, the number of regions to be decommitted can be reduced, but the budgets weren't updated to include the new regions. This was fine for huge regions, which just sit in the global free list anyway, and it (sort of) works in release builds (though some regions may end up decommitted anyway if they are still in the surplus list at the end of distribution), but it isn't the intended behavior and can trigger a debug assertion that the surplus list is empty. This change (a subset of #106168) restructures distribute_free_regions so that instead of "decommit or adjust budgets", we first decommit and adjust the remaining balance. Then we adjust budgets based on the new value.
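The "decommit first, then adjust budgets from the remaining balance" ordering can be illustrated with a toy model. This is only a sketch under stated assumptions: `DistributionResult`, `distribute_sketch`, and the `decommit_limit` parameter are hypothetical names invented here, regions are reduced to plain counts, and the even split across heaps is an assumption rather than the policy the runtime uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical model: plain region counts, not the GC's real data structures.
struct DistributionResult {
    std::size_t decommitted = 0;       // regions handed off for decommit
    std::vector<std::size_t> budgets;  // per-heap budgets after adjustment
    std::size_t surplus_left = 0;      // regions still in the surplus list
};

// Assumed ordering: (1) decommit and compute the remaining balance,
// (2) adjust budgets from that balance, instead of doing one or the other.
DistributionResult distribute_sketch(std::size_t surplus,
                                     std::size_t decommit_limit,
                                     std::vector<std::size_t> budgets)
{
    assert(!budgets.empty());  // the sketch needs at least one heap

    DistributionResult result;

    // Step 1: decommit up to the (possibly reduced) limit.
    result.decommitted = std::min(surplus, decommit_limit);
    const std::size_t balance = surplus - result.decommitted;

    // Step 2: fold everything that was not decommitted into the budgets.
    // An even split is an assumption made for illustration only.
    const std::size_t heap_count = budgets.size();
    std::size_t remaining = balance;
    for (std::size_t i = 0; i < heap_count; ++i) {
        const std::size_t share =
            balance / heap_count + (i < balance % heap_count ? 1 : 0);
        budgets[i] += share;
        remaining -= share;
    }

    // Mirrors the debug assertion mentioned above: nothing stays in surplus.
    assert(remaining == 0);

    result.budgets = std::move(budgets);
    result.surplus_left = remaining;
    return result;
}
```

In this model, lowering `decommit_limit` automatically raises the budgets, because every region that is not decommitted is accounted for in step 2.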
A few problems can occur with workloads that only allocate huge objects:
- distribute_free_regions, so if only those occur, then the freelist processing never occurs at all.

This change addresses those by doing the following:

- distribute_free_regions during background_mark_phase (while we initially have the VM suspended to avoid complications).
- distribute_free_regions can't hit the move_highest_free_regions code path when background_running_p(). Change that check to also require settings.condemned_generation != max_generation.
- global_free_huge_regions.
- age_free_regions.

Fixes #94175
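One way to read the change to the move_highest_free_regions check is sketched below. It is an interpretation, not the runtime's code: `may_move_highest_free_regions` is a hypothetical helper, the two inputs stand in for `background_running_p()` and `settings.condemned_generation`, and `max_generation` is assumed to be 2.

```cpp
#include <cassert>

// Assumption: generations are 0, 1, 2, so the oldest generation is 2.
constexpr int max_generation = 2;

// Hypothetical helper, not a function in the runtime.
// background_running stands in for background_running_p();
// condemned_generation stands in for settings.condemned_generation.
bool may_move_highest_free_regions(bool background_running,
                                   int condemned_generation)
{
    assert(condemned_generation >= 0 && condemned_generation <= max_generation);

    // Before (as described): a running BGC alone blocked the code path.
    // After (as described): the path is blocked only when a BGC is running
    // AND the current GC is not condemning max_generation.
    const bool blocked =
        background_running && (condemned_generation != max_generation);
    return !blocked;
}
```

Under this reading, a call made from the background GC itself (condemning max_generation) can reach the code path, while an ephemeral GC that runs during a BGC still skips it.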