Optimise Random.Shuffle #119890
Conversation
Implement optimisation from dotnet#119860 for structs of size 16 or less.
I think you'll get more efficient register allocation by inlining the local `n`.
@EgorBot -intel

```csharp
using BenchmarkDotNet.Attributes;

public class Bench
{
    public static string[] s_a = GenerateStrings();

    static string[] GenerateStrings()
    {
        var a = new string[8];
        for (int i = 0; i < a.Length; i++) a[i] = i.ToString();
        return a;
    }

    [Benchmark]
    public void Shuffle() => Random.Shared.Shuffle(s_a);
}
```
Not sure what you mean by this. Also, it seems like your benchmark broke somehow @jkotas (it usually takes at most ~20 minutes for something that small, iirc); you might want to re-run it, or check with Egor whether it's just stuck in a queue. It would be nice if the bot posted a message when it actually started running, and also offered some way to cancel (maybe via a thumbs-down reaction or something).
@EgorBo Could you please check whether #119890 (comment) got stuck?
I mean replace the variable `n` with `values.Length` directly.
I see - that does seem to slightly improve codegen - https://godbolt.org/z/KW8E5qdh5
Move value of `n` (values.Length) inline to improve register allocation.
Change limit to 2*pointer rather than 16 (still 16 on 64-bit).
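Concretely, the register-allocation tweak looks roughly like this (a simplified sketch of the shape, not the actual source; the standalone signature is illustrative, since the real method is an instance method on `Random`):

```csharp
using System;

static class ShuffleSketch
{
    // Sketch: read values.Length at each use instead of caching it in a local
    // (previously something like `int n = values.Length;` and `Next(i, n)`).
    // The span's length is already cheaply available, so dropping the local
    // frees a register across the loop body.
    public static void Shuffle<T>(Random random, Span<T> values)
    {
        for (int i = 0; i < values.Length - 1; i++)
        {
            int j = random.Next(i, values.Length);
            if (i != j)
            {
                (values[i], values[j]) = (values[j], values[i]);
            }
        }
    }
}
```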
Ok, for #119890 (comment): based on some analysis of the problem, I think we can increase the size to 64 (or at least 32) for x64 and arm64 (and probably any platform with SIMD). Here's why. We are considering what to do with the approximately ln(n) iterations where the chosen index `j` equals `i`, i.e. where the swap is redundant: keep the `i != j` branch, or just perform the redundant swap.

The branch misprediction is generally much more expensive than the extra instructions are likely to be, and critically, the two occur in the same quantity. Moreover, whichever we do should have little consequence at large input sizes, because the count scales logarithmically; even the benefit of having no branch at all versus a branch that is correctly predicted becomes essentially unmeasurable at sizes large enough to be limited by memory bandwidth.

This means that as long as reading and writing the value doesn't expand into too many instructions, we should be able to get away with up to a whole aligned cache line (or two) of unnecessary writes without any consequence, since the memory is already loaded from the previous swap. If the value contains managed references, we may need to split the reads/writes into more pieces than usual, so that needs to be considered; similarly, on a platform without SIMD, a sizeable struct might take a large number of instructions. However, on a platform with SIMD, writing unmanaged values, we should be able to get close to a whole cache line's worth of unnecessary operations and see essentially only benefit.

A cache line is generally 64 bytes (or 128 on ARM macOS). From testing on my machines (linked below), no substantial performance loss occurs up to 64-byte structs, which can span up to 2 cache lines (since they are not necessarily aligned). There are a handful of outliers at large input sizes (in both directions; they look like random variation to me, probably background activity on my laptop), and across the board there are substantial improvements at small input sizes for structs up to at least 64 bytes.

However, managed structs cannot be handled with SIMD to the same degree as unmanaged structs, so for those, and for platforms without SIMD (which I'll just check for at runtime), I propose keeping the smaller limit.

Does this idea sound reasonable to you, @jkotas? Measurements: https://gist.github.com/hamarb123/3c1896b991cde4830dd23a47f61a3f91
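As a rough sanity check of the "same quantity, scaling logarithmically" point above (assuming `j` is drawn uniformly from `[i, n)` at step `i`), the expected number of iterations where `i == j` is

$$\sum_{i=0}^{n-2} \frac{1}{n-i} = \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n} = H_n - 1 \approx \ln n,$$

so whichever way those iterations are handled (a potentially mispredicted branch, or a redundant self-swap), only about ln n of them occur. At the final iteration the probability is 1/2, which is also why that branch is the hardest to predict.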
The core libraries are not optimized for large structs that do not follow the guidelines at https://learn.microsoft.com/en-us/dotnet/standard/design-guidelines/choosing-between-class-and-struct. To shuffle a span of large structs in the most efficient way, you would want to use an algorithm that does ~N element writes, not ~2N element writes like the current algorithm. Providing a second implementation like that in the core libraries would be over-engineering, inconsistent with the general approach.

There are two variables in play: the cost of a mispredicted branch (which varies a lot between different types of hardware) and the cost of reading and writing an element of a specific type (which depends on more than just the size). The cost of the mispredicted branch grows as the amount left to shuffle shrinks; the branch is a 50/50 coin flip in the last iteration of the loop. I think it is hard to come up with a simple formula that approximates the comparison of these two costs and holds well across a broad variety of hardware. It may be best to just delete the mispredicted branch to keep this simple.

You may want to look at other tweaks to make this method faster. For example, call …
Yeah cool, I'll just delete the check entirely then.
This didn't seem to be faster (although I didn't try inverting the loop order in its entirety, nor did I try on …). I might add this change and then run an EgorBot benchmark to compare all of the changes so far to the original. Also, can you add the tenet-performance label to this PR? Thanks :)
- Unconditionally remove the `i != j` check
- Unroll the loop by 2 to improve performance of small cases in particular
- Remove now-unnecessary unsafe
Remove the (incorrect) unrolled logic, and just do one element at a time like before.
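With the unrolling reverted and the check gone, the loop's final shape is roughly the following (again a simplified sketch assuming the standard Fisher–Yates form, not the exact source):

```csharp
using System;

static class ShuffleFinalSketch
{
    public static void Shuffle<T>(Random random, Span<T> values)
    {
        // values.Length is read directly (no cached local `n`), per the
        // earlier register-allocation feedback.
        for (int i = 0; i < values.Length - 1; i++)
        {
            int j = random.Next(i, values.Length);

            // No i != j check: when i == j this writes the element back to
            // itself, a rare (~ln n expected) and cheap redundant copy,
            // instead of a branch that can mispredict.
            (values[i], values[j]) = (values[j], values[i]);
        }
    }
}
```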
Fixes #119860

This change does the following:
- Removes the `i != j` check
- Removes the local `n` (which caused extra assembly)

Benchmarks on AMD CPU (thanks EgorBot):
