Optimise Random.Shuffle #119890
Conversation
Implement optimisation from dotnet#119860 for structs of size 16 or less.
I think you'll get more efficient register allocation by inlining the local `n`.
@EgorBot -intel

```csharp
using BenchmarkDotNet.Attributes;

public class Bench
{
    public static string[] s_a = GenerateStrings();

    static string[] GenerateStrings()
    {
        var a = new string[8];
        for (int i = 0; i < a.Length; i++) a[i] = i.ToString();
        return a;
    }

    [Benchmark]
    public void Shuffle() => Random.Shared.Shuffle(s_a);
}
```
Not sure what you mean by this. Also, it seems like your benchmark broke somehow @jkotas (it usually takes at most ~20 minutes for something that small, iirc); you might want to re-run it, or check with Egor whether it's just stuck in a queue. It would be nice if the bot posted a message when it actually started running, and also offered some way to cancel (maybe via a thumbs-down reaction or something).
@EgorBo Could you please check whether #119890 (comment) got stuck?
I mean replace the variable `n` with `values.Length` directly.
I see - that does seem to slightly improve codegen - https://godbolt.org/z/KW8E5qdh5
Move value of `n` (values.Length) inline to improve register allocation.
Change limit to 2*pointer rather than 16 (still 16 on 64-bit).
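Concretely, the register-allocation tweak looks roughly like this (a simplified sketch of the shape, not the actual source; the standalone signature is illustrative, since the real method is an instance method on `Random`):

```csharp
using System;

static class ShuffleSketch
{
    // Sketch: read values.Length at each use instead of caching it in a local
    // (previously something like `int n = values.Length;` and `Next(i, n)`).
    // The span's length is already cheaply available, so dropping the local
    // frees a register across the loop body.
    public static void Shuffle<T>(Random random, Span<T> values)
    {
        for (int i = 0; i < values.Length - 1; i++)
        {
            int j = random.Next(i, values.Length);
            if (i != j)
            {
                (values[i], values[j]) = (values[j], values[i]);
            }
        }
    }
}
```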
Ok, for #119890 (comment): based on some analysis of the problem, I think we can increase the size to 64 (or at least 32) for x64 and arm64 (and probably any platform with SIMD). Here's why. We are considering what to do with the approximately ln(n) iterations where the chosen index `j` equals `i`, i.e. where the swap is redundant: keep the `i != j` branch, or just perform the redundant swap.

The branch misprediction is generally much more expensive than the extra instructions are likely to be, and critically, the two occur in the same quantity. Moreover, whichever we do should have little consequence at large input sizes, because the count scales logarithmically; even the benefit of having no branch at all versus a branch that is correctly predicted becomes essentially unmeasurable at sizes large enough to be limited by memory bandwidth.

This means that as long as reading and writing the value doesn't expand into too many instructions, we should be able to get away with up to a whole aligned cache line (or two) of unnecessary writes without any consequence, since the memory is already loaded from the previous swap. If the value contains managed references, we may need to split the reads/writes into more pieces than usual, so that needs to be considered; similarly, on a platform without SIMD, a sizeable struct might take a large number of instructions. However, on a platform with SIMD, writing unmanaged values, we should be able to get close to a whole cache line's worth of unnecessary operations and see essentially only benefit.

A cache line is generally 64 bytes (or 128 on ARM macOS). From testing on my machines (linked below), no substantial performance loss occurs up to 64-byte structs, which can span up to 2 cache lines (since they are not necessarily aligned). There are a handful of outliers at large input sizes (in both directions; they look like random variation to me, probably background activity on my laptop), and across the board there are substantial improvements at small input sizes for structs up to at least 64 bytes.

However, managed structs cannot be handled with SIMD to the same degree as unmanaged structs, so for those, and for platforms without SIMD (which I'll just check for at runtime), I propose keeping the smaller limit.

Does this idea sound reasonable to you, @jkotas? Measurements: https://gist.github.com/hamarb123/3c1896b991cde4830dd23a47f61a3f91
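As a rough sanity check of the "same quantity, scaling logarithmically" point above (assuming `j` is drawn uniformly from `[i, n)` at step `i`), the expected number of iterations where `i == j` is

$$\sum_{i=0}^{n-2} \frac{1}{n-i} = \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n} = H_n - 1 \approx \ln n,$$

so whichever way those iterations are handled (a potentially mispredicted branch, or a redundant self-swap), only about ln n of them occur. At the final iteration the probability is 1/2, which is also why that branch is the hardest to predict.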
The core libraries are not optimized for large structs that do not follow the guidelines at https://learn.microsoft.com/en-us/dotnet/standard/design-guidelines/choosing-between-class-and-struct. To shuffle a span of large structs in the most efficient way, you would want to use an algorithm that does ~N element writes, not ~2N element writes like the current algorithm. Providing a second implementation like that in the core libraries would be over-engineering, inconsistent with the general approach.

There are two variables in play: the cost of a mispredicted branch (which varies a lot between different types of hardware) and the cost of reading and writing an element of a specific type (which depends on more than just the size). The cost of the mispredicted branch grows as the amount left to shuffle shrinks; the branch is a 50/50 coin flip in the last iteration of the loop. I think it is hard to come up with a simple formula that approximates the comparison of these two costs and holds well across a broad variety of hardware. It may be best to just delete the mispredicted branch to keep this simple.

You may want to look at other tweaks to make this method faster. For example, call …
Yeah cool, I'll just delete the check entirely then.
This didn't seem to be faster (although I didn't try inverting the loop order in its entirety, nor did I try on …). I might add this change and then run an EgorBot benchmark to compare all of the changes so far to the original. Also, can you add the tenet-performance label to this PR? Thanks :)
- Unconditionally remove the `i != j` check
- Unroll the loop by 2 to improve performance of small cases in particular
- Remove now-unnecessary unsafe
Remove the (incorrect) unrolled logic, and just do one element at a time like before.
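With the unrolling reverted and the check gone, the loop's final shape is roughly the following (again a simplified sketch assuming the standard Fisher–Yates form, not the exact source):

```csharp
using System;

static class ShuffleFinalSketch
{
    public static void Shuffle<T>(Random random, Span<T> values)
    {
        // values.Length is read directly (no cached local `n`), per the
        // earlier register-allocation feedback.
        for (int i = 0; i < values.Length - 1; i++)
        {
            int j = random.Next(i, values.Length);

            // No i != j check: when i == j this writes the element back to
            // itself, a rare (~ln n expected) and cheap redundant copy,
            // instead of a branch that can mispredict.
            (values[i], values[j]) = (values[j], values[i]);
        }
    }
}
```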
Fixes #119860

This change does the following:
- Removes the `i != j` check
- Removes the local `n` (which caused extra assembly)

Benchmarks on AMD CPU (thanks EgorBot):
