[RFC] Utilize shared memory to deduplicate the network system-wide #6173
base: master
Conversation
clang-format 20 needs to be run on this PR. (execution 16851640258 / attempt 1)
nice, i can take a look at linux/macos impl again..
b) the shm file needs an adapter depending on the os, shm_open which we'll probably use later only works up to
c) was the issue fixed where it didn't work when multiple instances were spawned too quickly? @vondele and I thought about this in the past and were maybe thinking about some sort of mutex which might be required to sync the shm file?
d) probably needs some numa testing later again, @vondele and I were in the past thinking about creating multiple shm files which are then mapped and each process in a numa domain would just point to a different mmapped file, this would get rid of most numa code except the affinity binding, no?
e) there were some quirks on linux to get huge pages to work with shm, because they had to be enabled for the underlying shm fs, not sure if windows really respects the system level large page support here..? do large pages for shm now require admin privileges? was that also the case before?
at least on my end it was due to large pages. Don't have other hardware to test on.
could always use a hash of it if the length is a problem
yes this has been fixed with the addition of proper content hashing - the issue was caused by two different networks being considered identical
yes that would be wise
i mean, yea, but that would push all the responsibility to the user? or do I misunderstand who's responsible for creating these allocations on specific nodes?
there's no filesystem, so... idk, I'd think that it's pretty much identical to normal allocations
It uses the same machinery as VirtualAlloc AFAIK, requiring the same permissions. I don't remember how I have it configured but I never ran the terminal with administrator rights for Stockfish (edit. seems to be a local group policy option).
shm objects are created in virtual fs (tmpfs) and hugepages depend on how that tmpfs is created..
ah forgot to ask, do we have a fallback in place if shared memory is not working? i.e. in docker the default shm size is too small for our network
I see, you're right. https://man7.org/linux/man-pages/man5/tmpfs.5.html. It appears
it also says that by default swap is allowed, so in some scenarios I guess we might end up swapping the network..? this is really not ideal.. and mounting our own tmpfs feels weird
worst case we can make this a UCI option
I think this is a nice direction, happy to test as soon as we have a linux version. I think Disservin's question higher up can maybe be clarified with an example.. if we have a 2 socket NUMA system, and start a fishtest instance on each of these sockets (with taskset properly matching the two sockets), will the SF instances on the one socket need to access the memory of the other socket for the network? Ideally, each numa domain has one copy of the network, and processes/threads always use the nearest copy.
There are 4 discriminators
for each unique tuple of these discriminators one and only one network instance exists. So if there are 2 fishtest instances, each on a different socket, they will have different NumaPolicy, and will therefore NOT interfere with each other. While it won't actually detect that two nodes are physically the same, I think a logical distinction is more desirable from a user perspective, bar some weird edge cases.
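(Illustration only, not the PR's exact scheme: the thread mentions the executable path, the stringified NumaPolicy and the network's content hash as discriminators, so a shared-object name derived from such a tuple might look roughly like the hypothetical sketch below; instances differing in any one component then get distinct, non-interfering mappings.)

```cpp
// Hypothetical sketch, not the PR's actual code: derive a system-wide object
// name from the discriminator tuple so that only identical tuples share memory.
#include <cstdint>
#include <functional>
#include <string>

std::string sharedObjectName(const std::string& exePath,      // assumed discriminator
                             const std::string& numaPolicy,   // stringified, full form
                             std::uint64_t      contentHash)  // hash of the network content
{
    std::uint64_t key = std::hash<std::string>{}(exePath);
    key = key * 1000003u ^ std::hash<std::string>{}(numaPolicy);
    key = key * 1000003u ^ contentHash;
    return "sf_net_" + std::to_string(key);
}
```

Two fishtest instances pinned to different sockets would then produce two different names and never touch each other's copy, matching the behaviour described above.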
ok, so that will work, I think that case will be matched by the executable path, I assume the NumaPolicy would be none on all instances.
it uses the stringified policy which is always expanded to full form https://github.com/official-stockfish/Stockfish/pull/6173/files#diff-d3deced7cf7b187a18c7c5cac6a88895f2e10502ff710ac4d51e6cbd555d0d04R1331-R1335, same as visible in speedtest
that's weird at least if I understand it correctly..
hmm, you're right actually, I didn't verify this correctly, looking at RAMMap now it indeed does not show. However, even after making the change (now pushed) it still does not show in RAMMap, so I'm confused. Would probably have to check it programmatically somehow
https://github.com/Disservin/Stockfish/commits/shared_memory_sopel/ here is my branch with the linux code, i removed the exe path discriminator for now but this is the branch where i get the slowdown, no MADV_HUGEPAGE flag for madvise/mmap worked for me..
I need |
okay so when I mount a tmpfs with
again really hacky and this requires root..
I'll try to rewrite this as a UCI option when I find some time. The speed loss would be negligible to those who benefit from this, but we can't have a regression like this.
there might still be a way around without uci options? Was there no memfd_create which allows for bypassing /dev/shm? Anyway, first local testing of this looks good. Still want to do some more testing.
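(For context, a hedged sketch of what the memfd_create route could look like, assuming Linux >= 3.17 / glibc >= 2.27; the function and names below are made up. It sidesteps /dev/shm and its size limit, but, as the next comments discuss, it leaves open how the descriptor reaches the other processes.)

```cpp
// Hedged sketch: memfd_create returns an anonymous shared-memory fd that does
// not live under /dev/shm, so the tmpfs size limit (e.g. docker's small default)
// does not apply. The catch: the fd is process-local and still has to be passed
// to the other Stockfish instances somehow.
#include <cstddef>
#include <sys/mman.h>
#include <unistd.h>

int makeAnonymousShm(std::size_t size)
{
    int fd = memfd_create("sf-network", MFD_CLOEXEC);  // the name is cosmetic, only visible in /proc
    if (fd < 0)
        return -1;
    if (ftruncate(fd, static_cast<off_t>(size)) != 0)
    {
        close(fd);
        return -1;
    }
    return fd;  // map with mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0)
}
```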
I'm still not so sure about a uci option, it really shifts the burden on the user/framework. We don't know which machines might have support, so it would require some additional checks to disable/enable the option again.. and same for other environments. So if shm fails it should use normal allocations, and regarding the hugetlb idk
When I told you about that I didn't realize that we'd have to share the file descriptor between the processes, and there isn't a really good way to do this either other than writing it to a file I guess
two shared memories
one shared file/memory in tmpfs to share the descriptor, I guess that's also what sopel is saying?
mh.. i have to think about how to do this properly even.. fds aren't a global thing, they are bound to the process from which they were created, so other processes would have to read
sometimes i had a positive effect with
As I understand it these descriptors can actually be shared with other processes.
so they could just be shared via shm shared memory |
That only works with sockets, although there is pidfd_getfd to properly duplicate this, still doesn't help me in the case the pid process is no longer alive.. would require sharing all alive pids in a file and then taking one from the list I guess. See the second answer
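(For reference, a hedged sketch of the pidfd_getfd route mentioned above, assuming Linux >= 5.6 and recent kernel headers; there are no glibc wrappers, so raw syscalls are used, and it fails exactly in the case pointed out: once the donor process is gone.)

```cpp
// Hedged sketch, not the PR's code: duplicate a file descriptor out of another
// live process via pidfd_open + pidfd_getfd (raw syscalls). Requires
// ptrace-level permission over the donor and fails once the donor has exited.
#include <sys/syscall.h>
#include <unistd.h>

int stealFd(pid_t donorPid, int donorFd)
{
    int pidfd = static_cast<int>(syscall(SYS_pidfd_open, donorPid, 0));
    if (pidfd < 0)
        return -1;
    int fd = static_cast<int>(syscall(SYS_pidfd_getfd, pidfd, donorFd, 0));
    close(pidfd);
    return fd;  // -1 if the donor died or permission was denied
}
```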
@Sopel97 in your implementation on windows I get the following output
network gets init twice with same size but different hash? i dont quite understand this
EvalFile has different content at first initialization I believe. edit:
I think it might just generally be the default-initialized version that's there before actually loading a network but I cba digging through the code
…ed to allocate the whole network on the shared memory but it uses LargePagePtr
ill try to figure out the semaphore naming later..
BTW, on linux this gives a good overview of how the memory is being handled:
It looks like the large speedups reported for this are largely an artifact of how it has been measured. Using the above script, if each of the processes executes a
So, we would need to update how fishtest measures nps (currently similar to bench), otherwise we reduce TC by 2x instead of 1.15 or less.
That makes sense, it essentially allows higher reuse in L3, but is still limited by it. Might be problematic to calibrate to be fully consistent across machines. Is there actually a measurable slowdown on 1 instance?
you mean running a single process, single threaded ? yes, there is a small slowdown.
https://discord.com/channels/435943710472011776/735707599353151579/1396504227592798241 I've said as much before on the measurements being inflatable. A cheap, partial solution is to just randomize the order of the benched positions, and make sure there is no functional diff with the TT age. But you still get some extra sharing you would not normally get. I see something like a > 100% gain in a bench, 30-60% gain in a randomized bench, 20-40% gain in gameplay. It's important to note that for gameplay, you have to measure the speedup in terms of (master vs master) vs (patch vs patch), not (master vs patch). This patch speeds up all other engines that are playing. I posted this data once before:
right, so here, and in several more instances, you are sharing the graph of inflated numbers, knowing they are a measurement artifact? https://discord.com/channels/435943710472011776/813919248455827515/1396393406086778982 Edit: and this one is plain wrong https://discord.com/channels/435943710472011776/882956631514689597/1392309734715166832
I provided the above information about speedups to help you guys have a productive conversation about what to do for Fishtest. Not every time I post that graphic do I have the time and energy to do a deep dive on the context and the nuance. However, when going into details in conversations with Disservin and Sopel, I do just that. Both of them are aware of the fickle nature of the measurements. Disservin from the link I posted -- Sopel from our shared experience when testing the original NUMA network-duplication patches on CCC. I don't have any interest in continuing to play out you not liking me. Take that somewhere else instead of detracting from the productive work being done here.
I see a +60% NPS improvement in gameplay, taking total nodes divided by total time for (verbatim vs verbatim) compared to (self vs self) on my 7950x. pgns.zip. Not even that far off of the biased measurements. Your base NPS gain beats Torch's 40% NPS gain, probably in part due to having an L1=3072 network instead of L1=2560. You probably would get even more % if not for the mini network for imbalances. Gain a bit more too if I was not overclocking this memory.
Master average NPS: 387,482
I have just pointed out that the graphs you have posted were based on flawed measurements and the statement of 2x throughput increase on fishtest was wrong. Even on hardware where this change has most effect, the measured result is below that. I'm not exactly sure what the format of your pgns is, but I get more like 40% ...
I might be wrong though, maybe you can post what you used to analyze it. For my own tests on similar hardware, I do get 42% and somewhat beefier hardware 37%:
Emphasis on "up to", if you really care to have such a semantic argument about nothing. Not to mention that the data provided is for a different chess engine... Although I imagine you could find an example without a qualifier, but such is a result of a campaign to get you guys to make this change. Which worked. And it is hard to see how one can be unhappy with a 60% gain. The grep command you provided does not correctly parse all data in the PGNs. It only parses positions with a score of
ah, that clarifies. I'm obviously still interested to have the 10-40% we might get from this.
thanks, now I can reproduce the numbers you provided.
So what's currently the best script to accurately measure realistic speedup of this patch in game play conditions on 1, 2, 4, 8 etc threads? Could someone post it here?
probably the fastchess match in #6173 (comment) Edit: adjusted for the thread count of the machine used for testing, as well as the number of threads used for the games
Thanks for the quick reply. Will try to produce some numbers tomorrow. Will only do 1-thread games, but with increasing concurrency. Does the grep command in that script need to be changed as well?
no the grep + awk just extracts from the pgn the total number of nodes searched, that ought to remain the same. |
So I played with this a bit. I am trying to improve on the above script (on a mobile!), and hence often interrupted the fastchess tournaments with ctrl-c. Sadly, I have now reached a stage where the patch is unresponsive to UCI commands on startup. So each new tournament fails. Also tried the compiler command from the cli. This now enters an interactive shell rather than printing the info and quitting. Also the quit command no longer works.
Any idea how I can get out of this? Without rebooting the machine. Of course, something to fix before merge as well, I think.
yeah, you probably have a lock file that can't get cleaned up because the process that owns it dies (and so you have a deadlock). The workaround right now is to remove files in
This is a problem with posix semaphores. Probably requires a change to use pthread_mutexattr_setrobust and no longer posix semaphores.
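(A hedged sketch of the pthread_mutexattr_setrobust direction, not the PR's code: a process-shared robust mutex placed inside the shared mapping, so a crashed owner can no longer leave everyone else deadlocked.)

```cpp
// Hedged sketch: a robust, process-shared mutex living inside the shm region.
// If the owner dies while holding it, the next locker gets EOWNERDEAD instead
// of blocking forever, and can mark the mutex consistent again.
#include <cerrno>
#include <pthread.h>

struct SharedLock {    // assumed to be placed inside the shared mapping
    pthread_mutex_t mtx;
};

void initSharedLock(SharedLock* l)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    pthread_mutex_init(&l->mtx, &attr);
    pthread_mutexattr_destroy(&attr);
}

int lockShared(SharedLock* l)
{
    int rc = pthread_mutex_lock(&l->mtx);
    if (rc == EOWNERDEAD)
    {
        pthread_mutex_consistent(&l->mtx);  // previous owner died; recover the lock
        rc = 0;
    }
    return rc;
}
```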
Yeah, thanks. That worked for me. I deleted two files, one of them a lock file I think. |
not fully, but hardcoding SEM_NAME_LEN would fix the issue we see in CI. Having said that, to avoid the issue @robertnurnberg (and I at an earlier stage) experienced, we would have to drop posix semaphores...
right, I missed it, sorry. So file locking?
Can I get credit somewhere for the initial identification of this problem: official-stockfish/fishtest#2077 and the subsequent demonstration of the solution through the first impl of net sharing in a chess engine: official-monty/Monty#62 (as well as discovering the behaviour of elo gain in multi-process SPRT conditions etc.). AGE's work built on top of this
My exploration is indeed predicated on Viren's observations. |
well in /dev/shm... this is the example chatgpt cooked up... though I didn't test it... the key thing is
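(The snippet referenced above isn't reproduced here; as a stand-in, a hedged sketch of the file-locking idea with flock on a lock file, path and names purely illustrative. The kernel drops the lock when the holder exits, so a killed instance cannot leave a stale lock behind the way a posix semaphore can.)

```cpp
// Hedged sketch of advisory file locking, not the snippet referenced above.
// flock() locks are released by the kernel when the process exits (or crashes),
// so an interrupted instance cannot leave a stale lock behind.
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

bool withExclusiveLock(void (*criticalSection)())
{
    int fd = open("/dev/shm/sf_net.lock", O_CREAT | O_RDWR, 0600);  // path is illustrative
    if (fd < 0)
        return false;
    if (flock(fd, LOCK_EX) != 0)   // blocks until the lock is available
    {
        close(fd);
        return false;
    }
    criticalSection();             // e.g. create and populate the shared network exactly once
    flock(fd, LOCK_UN);
    close(fd);
    return true;
}
```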
I have some numbers now. Edited the above script to run multiple concurrency values and compute the speedup automatically.
#!/bin/bash
for i in $(seq 0 6)
do
concurrency=$(( 1 << i ))
rounds=$(( 10 << i ))
echo " ===== concurrency: $concurrency ====="
oldnps=
for bin in sopel master
do
echo " ===== $bin ====="
start=$(date +%s)
pgn=${bin}${concurrency}.pgn
rm -f ${pgn}
./fastchess -concurrency ${concurrency} -rounds ${rounds} -games 2 -repeat \
-srand 42 -openings file=UHO_Lichess_4852_v1.epd format=epd order=random \
-engine name=${bin}1 cmd=./stockfish.${bin} \
-engine name=${bin}2 cmd=./stockfish.${bin} \
-each proto=uci option.Threads=1 option.Hash=16 tc=10+0.1 \
-pgnout file=${pgn} nodes=true nps=true >& out.${bin}${concurrency}
nodes=$(grep -o 'n=[0-9]*' ${pgn} | cut -d= -f2- | awk '{s=s+$1}END{printf "%.0f\n", s}')
end=$(date +%s)
echo "$((end - start)) seconds for $nodes nodes"
nps=$(echo "$nodes/($end - $start)" | bc -l)
printf "nps: %.5e\n" $nps
if [[ $oldnps ]]; then
speedup=$(echo "$oldnps/$nps" | bc -l)
printf "concurrency %d speedup: %.5f\n" $concurrency $speedup
else
oldnps=$nps
fi
done
done
The numbers on my machine are as follows.
So below 32 concurrency sadly no speedup. The machine is a
Edit01: A second run on the same machine gives this:
Edit02: Repeat with fastchess option
Edit03: Repeat with
thanks for the numbers and the script (I fixed a small pasto in the script).
Since the Monty merge request was pointed out, the numbers there are:
see also https://discord.com/channels/435943710472011776/813919248455827515/1405555036880113664
I have updated #6173 (comment) with some more numbers. I would be very curious to know if others with AMD Ryzen chips see similarly underwhelming numbers. Edit: See also https://discord.com/channels/435943710472011776/1405613835020406815 for a discussion of these results.
This PR introduces a SystemWideSharedConstant<T> class that allows placing a content-addressable constant in shared memory. It is deduplicated system-wide (for a single user). Compared to @Disservin's approach here https://github.com/Disservin/Stockfish/tree/wip-mmap-2 it maintains a single network structure and is less intrusive. Fully compatible and trivially integrated with our NUMA machinery (assuming shared memory allocation follows normal allocation rules with respect to NUMA node placement). Aside from performance improvements it also results in significant memory footprint reduction. This issue was explored and put into light by @AndyGrant verbatim.pdf. Discord discussion: https://discord.com/channels/435943710472011776/813919248455827515/1397183730635772024
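(To make the mechanism concrete, a hedged sketch of the underlying Windows technique rather than the actual SystemWideSharedConstant code: a named file mapping whose name encodes the content hash, so identical content resolves to one system-wide object; the first creator populates it, later openers just map it. Publishing the "ready" state to later openers is the synchronization problem discussed in the conversation above.)

```cpp
// Hedged illustration only, not the PR's SystemWideSharedConstant code:
// a content-addressed named mapping on Windows. All processes asking for the
// same name share one physical copy; only the first creator fills it in.
#include <windows.h>
#include <cstdint>
#include <cstring>
#include <string>

void* mapSharedConstant(const std::wstring& name, const void* data, std::size_t size)
{
    HANDLE mapping = CreateFileMappingW(INVALID_HANDLE_VALUE, nullptr, PAGE_READWRITE,
                                        static_cast<DWORD>(std::uint64_t(size) >> 32),
                                        static_cast<DWORD>(size & 0xFFFFFFFFu), name.c_str());
    if (!mapping)
        return nullptr;

    bool isCreator = (GetLastError() != ERROR_ALREADY_EXISTS);

    void* view = MapViewOfFile(mapping, FILE_MAP_ALL_ACCESS, 0, 0, size);
    if (view && isCreator)
        std::memcpy(view, data, size);  // real code must also signal "ready" to later openers

    return view;  // the handle is intentionally kept open for the lifetime of the mapping
}
```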
The current state of the code is that it works on Windows and is mostly complete, with a few ugly things to improve. Adding linux support should be a minimal amount of work at this point.
What it involves:
Test on windows 11 with large pages, 7800x3d, via script provided by @Disservin https://pastebin.com/89syg9mu:
