
Conversation

hamarb123
Contributor

@hamarb123 hamarb123 commented Sep 19, 2025

Fixes #119860

This change does the following:

  • Remove the i != j check
  • Remove unnecessary local n (which caused extra assembly)
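
For context, the loop in question looks roughly like this (a sketch of the Fisher-Yates shuffle behind Random.Shuffle, not the exact runtime source):

```csharp
// Sketch of the Fisher-Yates loop behind Random.Shuffle (illustrative, not the
// exact runtime source). This PR removes the `if (i != j)` guard around the
// swap and avoids caching values.Length in a local `n`.
static void Shuffle<T>(Random random, Span<T> values)
{
    for (int i = values.Length - 1; i > 0; i--)
    {
        int j = random.Next(i + 1);
        // Previously: if (i != j) { swap } - an often-mispredicted branch.
        // Now the swap runs unconditionally; a no-op swap hits cached memory.
        (values[i], values[j]) = (values[j], values[i]);
    }
}
```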

Benchmarks on AMD CPU (thanks EgorBot):
[image: EgorBot benchmark results]

Implement optimisation from dotnet#119860 for structs of size 16 or less.
@github-actions bot added the needs-area-label label Sep 19, 2025
@dotnet-policy-service bot added the community-contribution label Sep 19, 2025
@xtqqczze
Contributor

> not sure if we prefer Unsafe.SizeOf or sizeof+unsafe

sizeof is preferred over Unsafe.SizeOf, see #78741
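
For context, the trade-off being discussed (a minimal illustration, not the PR's code): `sizeof(T)` requires an unsafe context (and, on a type parameter, an `unmanaged` constraint), while `Unsafe.SizeOf<T>()` works on any `T`; both fold to a JIT-time constant.

```csharp
using System.Runtime.CompilerServices;

static class SizeDemo
{
    // Requires unsafe and the unmanaged constraint, but is the preferred
    // spelling in the runtime repo per dotnet/runtime#78741.
    public static unsafe int ViaSizeof<T>() where T : unmanaged => sizeof(T);

    // Works for any T, including types containing managed references.
    public static int ViaUnsafeSizeOf<T>() => Unsafe.SizeOf<T>();
}
```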

@xtqqczze
Contributor

I think you'll get more efficient register allocation by inlining the local n.

@jkotas
Member

jkotas commented Sep 19, 2025

@EgorBot -intel

using BenchmarkDotNet.Attributes;

public class Bench
{
    public static string[] s_a = GenerateStrings();

    static string[] GenerateStrings()
    {
        var a = new string[8];
        for (int i = 0; i < a.Length; i++) a[i] = i.ToString();
        return a;
    }

    [Benchmark]
    public void Shuffle() => Random.Shared.Shuffle(s_a);
}

@hamarb123
Contributor Author

hamarb123 commented Sep 19, 2025

> I think you'll get more efficient register allocation by inlining the local n.

Not sure what you're meaning by this.

Also, it seems like your benchmark broke somehow @jkotas (it usually takes at most ~20 mins for something that small, iirc). You might want to re-run it, or check with Egor whether it's just waiting in a queue. It would be nice if the bot posted a message when it actually started running, and offered a cancel option somehow (maybe via a thumbs-down reaction).

@jkotas
Member

jkotas commented Sep 19, 2025

@EgorBo Could you please check whether #119890 (comment) got stuck?

@hamarb123

This comment was marked as resolved.

@xtqqczze
Contributor

> > I think you'll get more efficient register allocation by inlining the local n.
>
> Not sure what you're meaning by this.

I mean replace the variable n with values.Length

@hamarb123
Contributor Author

hamarb123 commented Sep 19, 2025

I see - that does seem to slightly improve codegen - https://godbolt.org/z/KW8E5qdh5

Move value of `n` (values.Length) inline to improve register allocation.
Change limit to 2*pointer rather than 16 (still 16 on 64-bit).
@hamarb123 hamarb123 changed the title: "Optimise Random.Shuffle for structs of size 16 or less" → "Optimise Random.Shuffle for structs of size <= 2 * sizeof(pointer) or less" (Sep 19, 2025)
@hamarb123
Contributor Author

hamarb123 commented Sep 21, 2025

Ok, for #119890 (comment): based on some analysis of the problem, I think we can increase the size limit to 64 (or at least 32) for x64 and arm64 (and probably any platform with SIMD). Here's why:

We are considering what to do with the approximately f(x) = ln(x + 0.5) - 0.42 iterations (for an input of length x) where, on average, the swap would be a pointless no-op (i == j). The options are:

  • Keep a branch to skip the pointless writes - this will be mispredicted for essentially all of these f(x) cases when x > 2, and doesn't help the other cases at all
  • Do the pointless reads & writes - these operate entirely on cached memory for the read and do not substantially impact the store buffer for the writes* (* = for suitably small structs / suitably few instructions)

A branch misprediction is generally much more expensive than the extra instructions are likely to be, and critically, the two occur in the same quantity. Additionally, whichever option we pick should have little consequence at large input sizes, because the count of these cases scales logarithmically (beyond the benefit of having no branch at all vs. a branch that is predicted, which should become essentially non-measurable at sizes large enough to be memory-bandwidth bound). This means that, as long as our read/write of the value doesn't expand into too many instructions, we should be able to get away with up to a whole aligned cache line (or two) of unnecessary writes without any consequence, since we already have the memory loaded from the previous swap.
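
As a sanity check on the f(x) ≈ ln(x + 0.5) - 0.42 figure: in iteration i the index j is drawn uniformly from [0, i], so P(i == j) = 1/(i + 1), and the expected number of no-op swaps over a shuffle of n elements is a harmonic-style sum (a quick check, not part of the PR):

```csharp
// Expected number of i == j iterations when shuffling n elements:
// sum_{i=1}^{n-1} 1/(i+1) = H(n) - 1, which is approximately ln(n) - 0.42.
static double ExpectedNoOpSwaps(int n)
{
    double sum = 0;
    for (int i = 1; i < n; i++)
        sum += 1.0 / (i + 1);
    return sum;
}
// ExpectedNoOpSwaps(1000) and Math.Log(1000 + 0.5) - 0.42 both come out near 6.49.
```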

If the value contains managed references, we may need to split up our reads/writes much more than usual, so that needs to be considered; similarly, on a platform without SIMD, a sizeable struct might take a large number of instructions. However, on a platform with SIMD, doing writes on unmanaged values, we should be able to get close to a whole cache line's worth of unnecessary operations and see essentially only benefit. A cache line is generally 64 bytes (or 128 on arm macOS). From testing on my machines (measurements below), no substantial performance loss appears up to 64-byte structs, which can span up to 2 cache lines (since they are not necessarily aligned). There are a handful of outliers at large input sizes (in both directions; these look like random variation to me, probably background activity on my laptop), and across the board there are substantial improvements at small input sizes for structs up to at least 64 bytes.

However, managed structs cannot be handled with SIMD to the same degree that unmanaged structs can. So for those, and for platforms without SIMD (which I'll just check with Vector.IsHardwareAccelerated), I propose we only go up to 2 * sizeof(nint), to set a reasonable limit on the number of operations we end up emitting.
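
A hypothetical sketch of the gate being proposed here (names and thresholds are from this discussion, not the final PR):

```csharp
using System.Numerics;
using System.Runtime.CompilerServices;

static class ShuffleHeuristic
{
    // Hypothetical sketch of the proposed size gate: allow up to ~a cache line
    // for SIMD-capable hardware and reference-free structs, otherwise limit to
    // 2 * pointer size.
    public static bool AllowBranchlessSwap<T>() =>
        Vector.IsHardwareAccelerated && !RuntimeHelpers.IsReferenceOrContainsReferences<T>()
            ? Unsafe.SizeOf<T>() <= 64
            : Unsafe.SizeOf<T>() <= 2 * IntPtr.Size;
}
```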

Does this idea sound reasonable to you @jkotas?

Measurements: https://gist.github.com/hamarb123/3c1896b991cde4830dd23a47f61a3f91.

@jkotas
Member

jkotas commented Sep 21, 2025

The core libraries are not optimized for large structs that do not follow the guidelines at https://learn.microsoft.com/en-us/dotnet/standard/design-guidelines/choosing-between-class-and-struct. To shuffle a span of large structs in the most efficient way, you would want an algorithm that does ~N element writes, not ~2N element writes like the current algorithm. Providing a second implementation like that in the core libraries would be over-engineering, inconsistent with the general approach.

There are two variables in play: the cost of a mispredicted branch (this varies a lot between different types of hardware) and the cost of reading and writing an element of a specific type (this depends on more than just the size). The cost of the mispredicted branch grows as the amount left to shuffle shrinks; the branch is a 50/50 coin flip in the last iteration of the loop. I think it is hard to come up with a simple formula that approximates the comparison of these two costs and holds well across a broad variety of hardware. It may be best to just delete the mis-predicted branch to keep this simple.

You may want to look at other tweaks to make this method faster. For example, call Next(int maxValue) instead of the two-argument Next overload.
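
For reference, the suggested overload switch is just (a sketch, not the PR's code):

```csharp
static void SwapStep<T>(Random random, Span<T> values, int i)
{
    // Before (sketch): int j = random.Next(0, i + 1);  // two-argument overload
    // After: Next(int maxValue) draws from [0, i + 1) directly, skipping the
    // lower-bound arithmetic in Next(int minValue, int maxValue).
    int j = random.Next(i + 1);
    (values[i], values[j]) = (values[j], values[i]);
}
```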

@hamarb123
Contributor Author

hamarb123 commented Sep 21, 2025

> It may be best to just delete the mis-predicted branch to keep this simple.

Yeah cool, I'll just delete the check entirely then.

> You may want to look at other tweaks to make this method faster. For example, call Next(int maxValue) instead of the two-argument Next overload.

This didn't seem to be faster (although I didn't try inverting the loop order entirely, nor did I try it on the new Random(int) version).

I might add this change & then run an egorbot benchmark to compare all of the changes so far to the original.

Also, can you add tenet-performance label to this PR? Thanks :)

- Unconditionally remove the i != j check
- Unroll loop by 2 to improve performance of small cases in particular
@hamarb123 hamarb123 changed the title: "Optimise Random.Shuffle for structs of size <= 2 * sizeof(pointer) or less" → "Optimise Random.Shuffle" (Sep 21, 2025)
- Remove now-unnecessary unsafe
@hamarb123

This comment was marked as outdated.

@hamarb123

This comment was marked as outdated.

@hamarb123

This comment was marked as outdated.

Remove (incorrectly) unrolled logic, and just do it one at a time like before.
@hamarb123

This comment was marked as outdated.

@jkotas jkotas merged commit cf8b0d6 into dotnet:main Sep 22, 2025
144 checks passed

Labels

community-contribution · needs-area-label · tenet-performance

Linked issue: Optimization idea for Random.Shuffle