[RFC] Utilize shared memory to deduplicate the network system-wide #6173

Open · wants to merge 57 commits into master

Conversation

@Sopel97 (Member) commented Jul 23, 2025

This PR introduces a SystemWideSharedConstant<T> class that allows placing a content-addressable constant in shared memory, deduplicated system-wide (for a single user). Compared to @Disservin's approach in https://github.com/Disservin/Stockfish/tree/wip-mmap-2 it maintains a single network structure and is less intrusive. It is fully compatible and trivially integrated with our NUMA machinery (assuming shared memory allocation follows normal allocation rules with respect to NUMA node placement). Aside from performance improvements it also results in a significant memory footprint reduction. This issue was explored and brought to light by @AndyGrant in verbatim.pdf. A rough sketch of the core idea is shown after the list below.

Discord discussion: https://discord.com/channels/435943710472011776/813919248455827515/1397183730635772024

The current state of the code is that it works on Windows and is mostly complete, with a few ugly things left to improve. Adding Linux support should be a minimal amount of work at this point.

What it involves:

  1. The network structure can no longer use indirection; it needs to be trivial, i.e. the data needs to be stored in contiguous memory. This is required to make it work from shared memory without complex allocators and pointer->offset conversion.
  2. Following the above, in some places dynamic allocations are required because the object is too large to fit on the stack. This is only for intermediate copies, though.
  3. Additional copies are created throughout the initialization process. This can be improved somewhat; not much care has been taken to address it yet, as the old code just assumed the network object is cheap to move.
  4. Writing the network to file now creates a copy of the feature transformer to avoid in-place modification and preserve const-correctness.
  5. Proper hashing functions are added for network-related structures; they are used to derive the content hash of the network for shared memory addressing.
  6. A FixedString class that stores a string with fixed capacity without heap allocations, necessary for EvalFile as it is part of the network structure.
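
A rough, hedged sketch of the core mechanism (a content-addressed constant placed in a named Windows file mapping). The class name, naming scheme and builder callback here are illustrative assumptions, not the PR's actual code, and error handling is omitted:

// Sketch only: identical content hashes map to the same named file mapping,
// so every process on the machine ends up sharing the same physical pages.
#include <windows.h>
#include <cstdint>
#include <new>
#include <string>

template<typename T>
class SharedConstantSketch {
   public:
    // 'contentHash' is assumed to be derived from the network bytes plus the
    // other discriminators discussed later (executable path, NUMA policy, ...).
    template<typename Builder>
    SharedConstantSketch(std::uint64_t contentHash, Builder&& build) {
        const std::string name = "Local\\sf_net_" + std::to_string(contentHash);
        mapping = CreateFileMappingA(INVALID_HANDLE_VALUE, nullptr, PAGE_READWRITE,
                                     DWORD(std::uint64_t(sizeof(T)) >> 32),
                                     DWORD(sizeof(T)), name.c_str());
        const bool existed = GetLastError() == ERROR_ALREADY_EXISTS;
        view = static_cast<T*>(MapViewOfFile(mapping, FILE_MAP_ALL_ACCESS, 0, 0, sizeof(T)));
        if (!existed)
            build(*new (view) T{});  // first process constructs and fills the object
        // A real implementation also needs cross-process synchronization so that
        // readers wait until the creator has finished initializing the object.
    }
    const T& operator*() const { return *view; }
    ~SharedConstantSketch() {
        UnmapViewOfFile(view);
        CloseHandle(mapping);
    }

   private:
    HANDLE mapping = nullptr;
    T*     view    = nullptr;
};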

Test on Windows 11 with large pages, 7800X3D, via the script provided by @Disservin (https://pastebin.com/89syg9mu):
(graph)

FINAL SUMMARY
==================================================

Command 1 Results:
Threads | Avg Nodes/s | Min | Max | Median | StdDev
-------------------------------------------------------
      1 |  1344021.00 | 1335100.00 | 1352942.00 | 1344021.00 | 12616.20
      2 |  1327394.00 | 1321839.00 | 1333741.00 | 1326998.00 | 5014.58
      4 |  1303861.75 | 1297324.00 | 1311121.00 | 1303747.00 | 4867.75
      8 |  1139116.00 | 1104799.00 | 1174991.00 | 1137885.50 | 20481.99
     16 |   770158.22 | 759475.00 | 778693.00 | 770336.00 | 4745.81

Command 2 Results:
Threads | Avg Nodes/s | Min | Max | Median | StdDev
-------------------------------------------------------
      1 |  1329460.00 | 1326731.00 | 1332189.00 | 1329460.00 | 3859.39
      2 |  1317041.00 | 1313630.00 | 1318601.00 | 1317966.50 | 2301.10
      4 |  1277356.75 | 1266281.00 | 1287642.00 | 1276741.00 | 7691.70
      8 |  1081963.94 | 1047183.00 | 1116268.00 | 1077491.00 | 19623.02
     16 |   690055.75 | 678430.00 | 707478.00 | 688052.50 | 7302.30

More benchmarks from machines with higher core counts are welcome.

github-actions bot commented Jul 23, 2025

clang-format 20 needs to be run on this PR.
If you do not have clang-format installed, the maintainer will run it when merging.
For the exact version please see https://packages.ubuntu.com/plucky/clang-format-20.

(execution 16851640258 / attempt 1)

@Disservin (Member) commented Jul 23, 2025

Nice, I can take a look at the Linux/macOS impl again.
a) Was the single-core regression fixed now? (Maybe run a longer speedup command.)

b) The shm file needs an adapter depending on the OS: shm_open, which we'll probably use later, only works up to NAME_MAX, which is 255 chars, though on macOS NAME_MAX was something really low like 20 IIRC. I'll try to write something for this when I give the Linux impl a try.

c) Was the issue fixed where it didn't work when multiple instances were spawned too quickly? @vondele and I thought about this in the past and were maybe thinking about some sort of mutex, which might be required to sync the shm file?

d) It probably needs some NUMA testing later again. @vondele and I were in the past thinking about creating multiple shm files which are then mapped, and each process in a NUMA domain would just point to a different mmapped file; this would get rid of most NUMA code except the affinity binding, no?

e) There were some quirks on Linux getting huge pages to work with shm, because they had to be enabled for the underlying shm fs. Not sure if Windows really respects the system-level large page support here? Do large pages for shm now require admin privileges? Was that also the case before?

@Sopel97 (Member, Author) commented Jul 23, 2025

a) Was the single-core regression fixed now? (Maybe run a longer speedup command.)

At least on my end it was due to large pages. I don't have other hardware to test on.

b) The shm file needs an adapter depending on the OS: shm_open, which we'll probably use later, only works up to NAME_MAX, which is 255 chars, though on macOS NAME_MAX was something really low like 20 IIRC. I'll try to write something for this when I give the Linux impl a try.

We could always use a hash of it if the length is a problem.
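
(For illustration, a minimal sketch of that idea, assuming a POSIX shm_open backend; the hash choice and name format are made up for the example:)

// Derive a short, fixed-length shm object name from an arbitrarily long
// logical key, so it always fits NAME_MAX (and the much smaller macOS limit).
// FNV-1a is used purely for illustration.
#include <cstdint>
#include <cstdio>
#include <string>

inline std::string shortShmName(const std::string& logicalKey) {
    std::uint64_t h = 14695981039346656037ull;  // FNV-1a offset basis
    for (unsigned char c : logicalKey)
    {
        h ^= c;
        h *= 1099511628211ull;  // FNV-1a prime
    }
    char buf[32];
    std::snprintf(buf, sizeof(buf), "/%016llx", (unsigned long long) h);
    return buf;  // always 17 characters, e.g. "/9e3779b97f4a7c15"
}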

c) Was the issue fixed where it didn't work when multiple instances were spawned too quickly?

Yes, this has been fixed with the addition of proper content hashing - the issue was caused by two different networks being considered identical.

d) It probably needs some NUMA testing later again.

Yes, that would be wise.

and I were in the past thinking about creating multiple shm files which are then mapped, and each process in a NUMA domain would just point to a different mmapped file; this would get rid of most NUMA code except the affinity binding, no?

I mean, yes, but wouldn't that push all the responsibility onto the user? Or do I misunderstand who's responsible for creating these allocations on specific nodes?

e) There were some quirks on Linux getting huge pages to work with shm, because they had to be enabled for the underlying shm fs.

There's no filesystem, so... idk, I'd think that it's pretty much identical to normal allocations.

Not sure if Windows really respects the system-level large page support here? Do large pages for shm now require admin privileges? Was that also the case before?

It uses the same machinery as VirtualAlloc AFAIK, requiring the same permissions. I don't remember how I have it configured, but I never ran the terminal with administrator rights for Stockfish (edit: seems to be a local group policy option).

@Disservin (Member)

There's no filesystem, so... idk, I'd think that it's pretty much identical to normal allocations.

Shm objects are created in a virtual fs (tmpfs), and hugepages depend on how that tmpfs is created.
"Accessing shared memory objects via the filesystem": https://man7.org/linux/man-pages/man7/shm_overview.7.html

@Disservin (Member) commented Jul 23, 2025

Ah, forgot to ask: do we have a fallback in place if shared memory is not working? E.g. in Docker the default shm size is too small for our network.

@Sopel97 (Member, Author) commented Jul 23, 2025

There's no filesystem, so... idk, I'd think that it's pretty much identical to normal allocations.

Shm objects are created in a virtual fs (tmpfs), and hugepages depend on how that tmpfs is created. "Accessing shared memory objects via the filesystem": https://man7.org/linux/man-pages/man7/shm_overview.7.html

I see, you're right: https://man7.org/linux/man-pages/man5/tmpfs.5.html. It appears never is the default, though it's unclear to me how it differs from deny and whether it actually ignores MAP_HUGETLB/SHM_HUGETLB or not. Quite messy compared to Windows, admittedly, and I'd have to test it in practice.

@Disservin (Member)

It also says that by default swap is allowed, so in some scenarios I guess we might end up swapping the network? This is really not ideal, and mounting our own tmpfs feels weird.

@Sopel97 (Member, Author) commented Jul 23, 2025

Worst case, we can make this a UCI option.

@vondele (Member) commented Jul 23, 2025

I think this is a nice direction, happy to test as soon as we have a Linux version.
Windows clang still seems to fail, with a -Werror.

I think Disservin's question higher up can maybe be clarified with an example: if we have a 2-socket NUMA system and start a fishtest instance on each of these sockets (with taskset properly matching the two sockets), will the SF instances on the one socket need to access the memory of the other socket for the network? Ideally, each NUMA domain has one copy of the network, and processes/threads always use the nearest copy.

@Sopel97 (Member, Author) commented Jul 23, 2025

I think Disservin's question higher up can maybe be clarified with an example: if we have a 2-socket NUMA system and start a fishtest instance on each of these sockets (with taskset properly matching the two sockets), will the SF instances on the one socket need to access the memory of the other socket for the network? Ideally, each NUMA domain has one copy of the network, and processes/threads always use the nearest copy.

There are 4 discriminators:

  1. Network content
  2. Executable path
  3. The resolved NumaPolicy, i.e. the string (e.g. "0-7:8-15") that describes how processors are mapped to NUMA nodes
  4. The NUMA node index (in the policy string above)

For each unique tuple of these discriminators one and only one network instance exists.

So if there are 2 fishtest instances, each on a different socket, they will have different NumaPolicy values, and will therefore NOT interfere with each other.

While it won't actually detect that two nodes are physically the same, I think a logical distinction is more desirable from the user's perspective, bar some weird edge cases.
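
(Illustrative only, not the PR's exact code: one way those four discriminators could be combined into a single shared-memory object name. The struct and function names below are assumptions made for this example.)

#include <cstdint>
#include <functional>
#include <string>

// Hypothetical key type for the four discriminators described above.
struct NetworkKey {
    std::uint64_t networkContentHash;  // hash of the network weights
    std::string   executablePath;      // path of the running binary
    std::string   numaPolicy;          // expanded policy string, e.g. "0-7:8-15"
    std::size_t   numaNodeIndex;       // index of the node within that policy
};

inline std::string sharedObjectName(const NetworkKey& k) {
    // One shared instance exists per unique tuple, so equal tuples must
    // always resolve to the same object name.
    const std::size_t h1 = std::hash<std::uint64_t>{}(k.networkContentHash);
    const std::size_t h2 = std::hash<std::string>{}(k.executablePath);
    const std::size_t h3 = std::hash<std::string>{}(k.numaPolicy);
    const std::size_t h4 = std::hash<std::size_t>{}(k.numaNodeIndex);
    return std::to_string(h1) + "$" + std::to_string(h2) + "$"
         + std::to_string(h3) + "$" + std::to_string(h4);
}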

@vondele (Member) commented Jul 23, 2025

OK, so that will work. I think that case will be matched by the executable path; I assume the NumaPolicy would be none on all instances.

@Sopel97 (Member, Author) commented Jul 23, 2025

It uses the stringified policy, which is always expanded to full form (https://github.com/official-stockfish/Stockfish/pull/6173/files#diff-d3deced7cf7b187a18c7c5cac6a88895f2e10502ff710ac4d51e6cbd555d0d04R1331-R1335), the same as visible in speedtest.

@Disservin (Member)

At least on my end it was due to large pages. I don't have other hardware to test on.

That's weird, at least if I understand it correctly:
https://learn.microsoft.com/en-us/windows/win32/memory/creating-a-file-mapping-using-large-pages
the MapViewOfFile is missing FILE_MAP_LARGE_PAGES, so it shouldn't really work?

@Disservin (Member)

This branch with the Linux patch:
(Figure_1)

My old branch:
(Figure_1)

Max speedup varies a bit from run to run though.

However, my current implementation has the usual single-core performance issue:

sf_base =  1270384 +/-  23285 (95%)
sf_test =  1236415 +/-  24377 (95%)
diff    =   -33968 +/-   7287 (95%)
speedup = -2.67390% +/- 0.574% (95%)

Dunno how I can really force hugepages here; maybe I'll remount /dev/shm and try that.

@Sopel97 (Member, Author) commented Jul 23, 2025

At least on my end it was due to large pages. I don't have other hardware to test on.

That's weird, at least if I understand it correctly: https://learn.microsoft.com/en-us/windows/win32/memory/creating-a-file-mapping-using-large-pages - the MapViewOfFile is missing FILE_MAP_LARGE_PAGES, so it shouldn't really work?

Hmm, you're right actually, I didn't verify this correctly; looking at RAMMap now, it indeed does not show. However, even after making the change (now pushed) it still does not show in RAMMap, so I'm confused. I would probably have to check it programmatically somehow.
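
(One possible way to check this programmatically, as a sketch only and untested against this PR: QueryWorkingSetEx reports a LargePage attribute for a given virtual address.)

#include <windows.h>
#include <psapi.h>

// Returns true if the page containing 'p' is currently backed by a large page.
// Touch the page first so it is resident; link against psapi.
bool isBackedByLargePages(const void* p) {
    PSAPI_WORKING_SET_EX_INFORMATION info{};
    info.VirtualAddress = const_cast<void*>(p);
    if (!QueryWorkingSetEx(GetCurrentProcess(), &info, sizeof(info)))
        return false;  // query failed; treat as "not large pages"
    return info.VirtualAttributes.Valid && info.VirtualAttributes.LargePage;
}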

@Disservin (Member)

https://github.com/Disservin/Stockfish/commits/shared_memory_sopel/

Here is my branch with the Linux code. I removed the exe path discriminator for now, but this is the branch where I get the slowdown; no MADV_HUGEPAGE flag for madvise/mmap worked for me.

@vondele (Member) commented Jul 23, 2025

I need -lrt in the LDFLAGS for it to link.

@Disservin (Member)

Okay, so when I mount a tmpfs with huge=always,size=2G and move the file creation there, the regression is gone - and actually there is a speedup, probably due to the even larger page?

sf_base =  1276171 +/-   9657 (95%)
sf_test =  1306195 +/-   8113 (95%)
diff    =    30023 +/-  11961 (95%)
speedup = 2.35263% +/- 0.937% (95%)

Again, really hacky, and this requires root.

@Sopel97 (Member, Author) commented Jul 23, 2025

I'll try to rewrite this as a UCI option when I find some time. The speed loss would be negligible for those who benefit from this, but we can't have a regression like this.

@vondele (Member) commented Jul 23, 2025

There might still be a way around this without UCI options? Was there no memfd_create which allows for bypassing /dev/shm?

Anyway, first local testing of this looks good. Still want to do some more testing.

@Disservin (Member)

I'm still not so sure about a UCI option - it really shifts the burden onto the user/framework. We don't know which machines might have support, so it would require some additional checks to disable/enable the option again, and the same goes for other environments.

So if shm fails it should use normal allocations, and regarding the hugetlb, idk.
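
(A hedged sketch of that fallback idea, not the PR's code: try a shared mapping first and fall back to a private anonymous mapping if anything fails, e.g. a too-small /dev/shm in Docker. The object name is made up and error handling is minimal.)

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>

void* mapNetworkBuffer(std::size_t bytes) {
    // Preferred path: a named shared-memory object, visible to other processes.
    int fd = shm_open("/sf_net_example", O_CREAT | O_RDWR, 0600);
    if (fd >= 0 && ftruncate(fd, (off_t) bytes) == 0)
    {
        void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        if (p != MAP_FAILED)
            return p;
    }
    else if (fd >= 0)
        close(fd);

    // Fallback: private anonymous memory - no deduplication, but the engine
    // still starts. A real implementation would also need to handle a tmpfs
    // that is nominally large enough but fails on first write (SIGBUS).
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}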

@Disservin (Member)

There might still be a way around this without UCI options? Was there no memfd_create which allows for bypassing /dev/shm?

Anyway, first local testing of this looks good. Still want to do some more testing.

When I told you about that, I didn't realize that we'd have to share the file descriptor between the processes, and there isn't a really good way to do this either, other than writing it to a file I guess.

@Sopel97 (Member, Author) commented Jul 23, 2025

There might still be a way around this without UCI options? Was there no memfd_create which allows for bypassing /dev/shm?
Anyway, first local testing of this looks good. Still want to do some more testing.

When I told you about that, I didn't realize that we'd have to share the file descriptor between the processes, and there isn't a really good way to do this either, other than writing it to a file I guess.

Two shared memories.

@vondele (Member) commented Jul 23, 2025

One shared file/memory in tmpfs to share the descriptor - I guess that's also what Sopel is saying?

@Disservin (Member)

Mh, I have to think about how to do this properly even. FDs aren't a global thing; they are bound to the process from which they were created, so other processes would have to read /proc/pid/fd/fd. But what if the initial process that created the file doesn't exist anymore?

Sometimes I had a positive effect with MAP_SHARED | MAP_ANONYMOUS | MAP_POPULATE; in later benchmarks I didn't anymore, idk.

@Sopel97 (Member, Author) commented Jul 23, 2025

As I understand it, these descriptors can actually be shared with other processes.

"The process that called memfd_create() could transfer the resulting file descriptor to the second process via a UNIX domain socket (see unix(7) and cmsg(3)). The second process then maps the file using mmap(2)."

So they could just be shared via shm shared memory.
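
(For reference, a minimal sketch of the fd-passing route the man page describes: sending a descriptor over an already-connected UNIX domain socket with SCM_RIGHTS. Socket setup is omitted and the function name is made up.)

#include <sys/socket.h>
#include <cstring>

// Send 'fdToSend' (e.g. from memfd_create()) over a connected AF_UNIX socket.
// One dummy payload byte is required so the control message is delivered.
bool sendFd(int socketFd, int fdToSend) {
    char  dummy = 0;
    iovec iov{&dummy, 1};

    alignas(cmsghdr) char ctrl[CMSG_SPACE(sizeof(int))] = {};
    msghdr msg{};
    msg.msg_iov        = &iov;
    msg.msg_iovlen     = 1;
    msg.msg_control    = ctrl;
    msg.msg_controllen = sizeof(ctrl);

    cmsghdr* cmsg    = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SCM_RIGHTS;  // "pass file descriptors"
    cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
    std::memcpy(CMSG_DATA(cmsg), &fdToSend, sizeof(int));

    return sendmsg(socketFd, &msg, 0) == 1;  // 1 payload byte sent
}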

@Disservin (Member)

That only works with sockets, although there is pidfd_getfd to properly duplicate this. It still doesn't help me in the case where the process behind the pid is no longer alive - that would require sharing all alive pids in a file and then taking one from the list, I guess.

See the second answer:

https://stackoverflow.com/questions/42319310/is-a-file-descriptor-local-to-its-process-or-global-on-unix

@Disservin (Member)

@Sopel97 In your implementation, on Windows I get the following output:

Stockfish dev-20250723-78b403ee by the Stockfish developers (see AUTHORS file)
SystemWideSharedConstant total size: 146062212
SystemWideSharedConstant using normal pages...
initializing: Local\2063140025228914963$7883485429838667673$10995037620754966421
SystemWideSharedConstant total size: 146062212
SystemWideSharedConstant using normal pages...
initializing: Local\6149776985357895272$7883485429838667673$10995037620754966421

The network gets initialized twice with the same size but a different hash? I don't quite understand this.

@Sopel97 (Member, Author) commented Jul 29, 2025

EvalFile has different content at the first initialization, I believe.

Edit:

C:\dev\stockfish-master\src>stockfish.exe
Stockfish dev-20250729-nogit by the Stockfish developers (see AUTHORS file)
EvalFile hash: 17186747165409177709
EvalFile hash: 11334573790148862207
SystemWideSharedConstant using large pages...
initializing: Local\2063140025228914963$6761756704518560236$4423640227798757523
EvalFile hash: 1441746021925920697
EvalFile hash: 5339658797736544225
SystemWideSharedConstant using large pages...
initializing: Local\6149776985357895272$6761756704518560236$4423640227798757523

I think it might just generally be the default-initialized version that's there before actually loading a network, but I can't be bothered to dig through the code.

@Disservin (Member)

I'll try to figure out the semaphore naming later.

@vondele (Member) commented Aug 10, 2025

BTW, on Linux this gives a good overview of how the memory is being handled:
pmap -XX -p 1378392

$ pmap  -XX -p 1378392
1378392:   ./stockfish.sopel
         Address Perm   Offset Device    Inode   Size KernelPageSize MMUPageSize    Rss   Pss Shared_Clean Shared_Dirty Private_Clean Private_Dirty Referenced Anonymous LazyFree AnonHugePages ShmemPmdMapped FilePmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked THPeligible                 VmFlags Mapping
    55675024b000 r--p 00000000 103:01  9309607     24              4           4     24     0           24            0             0             0         24         0        0             0              0             0              0               0    0       0      0           0          rd mr mw me sd /home/vondele/chess/vondele/matetrack/stockfish.sopel
    556750251000 r-xp 00006000 103:01  9309607    284              4           4    284     9          284            0             0             0        284         0        0             0              0             0              0               0    0       0      0           0       rd ex mr mw me sd /home/vondele/chess/vondele/matetrack/stockfish.sopel
    556750298000 r--p 0004d000 103:01  9309607  76624              4           4  76624  2554        76624            0             0             0      76624         0        0             0              0             0              0               0    0       0      0           0          rd mr mw me sd /home/vondele/chess/vondele/matetrack/stockfish.sopel
    556754d6d000 r--p 04b21000 103:01  9309607      4              4           4      4     4            0            0             0             4          4         4        0             0              0             0              0               0    0       0      0           0       rd mr mw me ac sd /home/vondele/chess/vondele/matetrack/stockfish.sopel
    556754d6e000 rw-p 04b22000 103:01  9309607      4              4           4      4     4            0            0             0             4          4         4        0             0              0             0              0               0    0       0      0           0    rd wr mr mw me ac sd /home/vondele/chess/vondele/matetrack/stockfish.sopel
    556754d6f000 rw-p 00000000  00:00        0   1112              4           4   1108  1108            0            0             0          1108       1108      1108        0             0              0             0              0               0    0       0      0           0    rd wr mr mw me ac sd 
    556755797000 rw-p 00000000  00:00        0    264              4           4    108   108            0            0             0           108        108       108        0             0              0             0              0               0    0       0      0           0    rd wr mr mw me ac sd [heap]
    7f545abf1000 rw-s 00000000  00:1a     2276 142640              4           4 138992  4778            0       138992             0             0     138992         0        0             0              0             0              0               0    0       0      0           0 rd wr sh mr mw me ms sd /dev/shm/cb3e3f4f978b31ef
    7f5464000000 rw-p 00000000  00:00        0  15860              4           4  15860 15860            0            0             0         15860      15860     15860        0             0              0             0              0               0    0       0      0           0    rd wr mr mw me nr sd 
    7f5464f7d000 ---p 00000000  00:00        0  49676              4           4      0     0            0            0             0             0          0         0        0             0              0             0              0               0    0       0      0           0          mr mw me nr sd 
    7f546a887000 rw-p 00000000  00:00        0   1508              4           4      8     8            0            0             0             8          8         8        0             0              0             0              0               0    0       0      0           0    rd wr mr mw me ac sd 
    7f546aa00000 rw-p 00000000  00:00        0  16384              4           4  16384 16384            0            0             0         16384      16384     16384        0         16384              0             0              0               0    0       0      0           1 rd wr mr mw me ac sd hg 
    7f546ba00000 rw-p 00000000  00:00        0    544              4           4      0     0            0            0             0             0          0         0        0             0              0             0              0               0    0       0      0           0    rd wr mr mw me ac sd 
    7f546ba88000 ---p 00000000  00:00        0      4              4           4      0     0            0            0             0             0          0         0        0             0              0             0              0               0    0       0      0           0             mr mw me sd 
    7f546ba89000 rw-p 00000000  00:00        0   8220              4           4    236   236            0            0             0           236        236       236        0             0              0             0              0               0    0       0      0           0    rd wr mr mw me ac sd 
    7f546c290000 r--p 00000000 103:01 10489065    160              4           4    160     0          160            0             0             0        160         0        0             0              0             0              0               0    0       0      0           0          rd mr mw me sd /usr/lib/x86_64-linux-gnu/libc.so.6
    7f546c2b8000 r-xp 00028000 103:01 10489065   1620              4           4   1120     5         1120            0             0             0       1120         0        0             0              0             0              0               0    0       0      0           0       rd ex mr mw me sd /usr/lib/x86_64-linux-gnu/libc.so.6
    7f546c44d000 r--p 001bd000 103:01 10489065    352              4           4    192     0          192            0             0             0        192         0        0             0              0             0              0               0    0       0      0           0          rd mr mw me sd /usr/lib/x86_64-linux-gnu/libc.so.6
    7f546c4a5000 ---p 00215000 103:01 10489065      4              4           4      0     0            0            0             0             0          0         0        0             0              0             0              0               0    0       0      0           0             mr mw me sd /usr/lib/x86_64-linux-gnu/libc.so.6
    7f546c4a6000 r--p 00215000 103:01 10489065     16              4           4     16    16            0            0             0            16         16        16        0             0              0             0              0               0    0       0      0           0       rd mr mw me ac sd /usr/lib/x86_64-linux-gnu/libc.so.6
    7f546c4aa000 rw-p 00219000 103:01 10489065      8              4           4      8     8            0            0             0             8          8         8        0             0              0             0              0               0    0       0      0           0    rd wr mr mw me ac sd /usr/lib/x86_64-linux-gnu/libc.so.6
    7f546c4ac000 rw-p 00000000  00:00        0     52              4           4     20    20            0            0             0            20         20        20        0             0              0             0              0               0    0       0      0           0    rd wr mr mw me ac sd 
    7f546c4b9000 r--p 00000000 103:01 10532070     12              4           4     12     0           12            0             0             0         12         0        0             0              0             0              0               0    0       0      0           0          rd mr mw me sd /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
    7f546c4bc000 r-xp 00003000 103:01 10532070    108              4           4     64     1           64            0             0             0         64         0        0             0              0             0              0               0    0       0      0           0       rd ex mr mw me sd /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
    7f546c4d7000 r--p 0001e000 103:01 10532070     16              4           4     16     0           16            0             0             0         16         0        0             0              0             0              0               0    0       0      0           0          rd mr mw me sd /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
    7f546c4db000 r--p 00021000 103:01 10532070      4              4           4      4     4            0            0             0             4          4         4        0             0              0             0              0               0    0       0      0           0       rd mr mw me ac sd /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
    7f546c4dc000 rw-p 00022000 103:01 10532070      4              4           4      4     4            0            0             0             4          4         4        0             0              0             0              0               0    0       0      0           0    rd wr mr mw me ac sd /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
    7f546c4dd000 r--p 00000000 103:01 10489190     56              4           4     56     0           56            0             0             0         56         0        0             0              0             0              0               0    0       0      0           0          rd mr mw me sd /usr/lib/x86_64-linux-gnu/libm.so.6
    7f546c4eb000 r-xp 0000e000 103:01 10489190    496              4           4    320     3          320            0             0             0        320         0        0             0              0             0              0               0    0       0      0           0       rd ex mr mw me sd /usr/lib/x86_64-linux-gnu/libm.so.6
    7f546c567000 r--p 0008a000 103:01 10489190    364              4           4    128     3          128            0             0             0        128         0        0             0              0             0              0               0    0       0      0           0          rd mr mw me sd /usr/lib/x86_64-linux-gnu/libm.so.6
    7f546c5c2000 r--p 000e4000 103:01 10489190      4              4           4      4     4            0            0             0             4          4         4        0             0              0             0              0               0    0       0      0           0       rd mr mw me ac sd /usr/lib/x86_64-linux-gnu/libm.so.6
    7f546c5c3000 rw-p 000e5000 103:01 10489190      4              4           4      4     4            0            0             0             4          4         4        0             0              0             0              0               0    0       0      0           0    rd wr mr mw me ac sd /usr/lib/x86_64-linux-gnu/libm.so.6
    7f546c5c4000 r--p 00000000 103:01 10532071    624              4           4    624    19          624            0             0             0        624         0        0             0              0             0              0               0    0       0      0           0          rd mr mw me sd /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.32
    7f546c660000 r-xp 0009c000 103:01 10532071   1220              4           4    712    22          712            0             0             0        712         0        0             0              0             0              0               0    0       0      0           0       rd ex mr mw me sd /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.32
    7f546c791000 r--p 001cd000 103:01 10532071    564              4           4    124     5          124            0             0             0        124         0        0             0              0             0              0               0    0       0      0           0          rd mr mw me sd /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.32
    7f546c81e000 ---p 0025a000 103:01 10532071      4              4           4      0     0            0            0             0             0          0         0        0             0              0             0              0               0    0       0      0           0             mr mw me sd /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.32
    7f546c81f000 r--p 0025a000 103:01 10532071     44              4           4     44    44            0            0             0            44         44        44        0             0              0             0              0               0    0       0      0           0       rd mr mw me ac sd /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.32
    7f546c82a000 rw-p 00265000 103:01 10532071     12              4           4     12    12            0            0             0            12         12        12        0             0              0             0              0               0    0       0      0           0    rd wr mr mw me ac sd /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.32
    7f546c82d000 rw-p 00000000  00:00        0     16              4           4     12    12            0            0             0            12         12        12        0             0              0             0              0               0    0       0      0           0    rd wr mr mw me ac sd 
    7f546c855000 rw-p 00000000  00:00        0      8              4           4      8     8            0            0             0             8          8         8        0             0              0             0              0               0    0       0      0           0    rd wr mr mw me ac sd 
    7f546c857000 r--p 00000000 103:01 10485864      8              4           4      8     0            8            0             0             0          8         0        0             0              0             0              0               0    0       0      0           0          rd mr mw me sd /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
    7f546c859000 r-xp 00002000 103:01 10485864    168              4           4    168     0          168            0             0             0        168         0        0             0              0             0              0               0    0       0      0           0       rd ex mr mw me sd /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
    7f546c883000 r--p 0002c000 103:01 10485864     44              4           4     44     0           44            0             0             0         44         0        0             0              0             0              0               0    0       0      0           0          rd mr mw me sd /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
    7f546c88e000 rw-s 00000000  00:1a     2275      4              4           4      4     0            0            4             0             0          4         0        0             0              0             0              0               0    0       0      0           0 rd wr sh mr mw me ms sd /dev/shm/sem.cb3e3f4f978b31ef_mutex
    7f546c88f000 r--p 00037000 103:01 10485864      8              4           4      8     8            0            0             0             8          8         8        0             0              0             0              0               0    0       0      0           0       rd mr mw me ac sd /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
    7f546c891000 rw-p 00039000 103:01 10485864      8              4           4      8     8            0            0             0             8          8         8        0             0              0             0              0               0    0       0      0           0    rd wr mr mw me ac sd /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
    7fff47ec2000 rw-p 00000000  00:00        0    132              4           4     92    92            0            0             0            92         92        92        0             0              0             0              0               0    0       0      0           0    rd wr mr mw me gd ac [stack]
    7fff47f44000 r--p 00000000  00:00        0     16              4           4      0     0            0            0             0             0          0         0        0             0              0             0              0               0    0       0      0           0    rd mr pf io de dd sd [vvar]
    7fff47f48000 r-xp 00000000  00:00        0      8              4           4      4     0            4            0             0             0          4         0        0             0              0             0              0               0    0       0      0           0    rd ex mr mw me de sd [vdso]
ffffffffff600000 --xp 00000000  00:00        0      4              4           4      0     0            0            0             0             0          0         0        0             0              0             0              0               0    0       0      0           0                      ex [vsyscall]
                                               ====== ============== =========== ====== ===== ============ ============ ============= ============= ========== ========= ======== ============= ============== ============= ============== =============== ==== ======= ====== =========== 
                                               319324            200         200 253636 41355        80684       138996             0         33956     253636     33956        0         16384              0             0              0               0    0       0      0           1 KB 

@vondele (Member) commented Aug 10, 2025

It looks like the large speedups reported for this are largely an artifact of how they have been measured. Using the above script, if each of the processes executes a bench 16 1 13 default depth, the speedup seems large (nearly 2x):
(graph: bench)
If, however, each of the processes gets a distinct large set of fens, f"{command} bench 16 1 100000 x{i % 32:02d} nodes", that same machine reports 11%:
(graph: distinct)
What's probably happening in the former case is that all processes synchronize on the same positions and benefit more from sharing the net weights in the L3 cache. I think this issue will be relevant as long as L3 < net size (150MB). This is also consistent with the fishtest measurements, where the Elo gains (24 Elo max) are more in line with such speedups (https://github.com/official-stockfish/Stockfish/wiki/Useful-data#elo-from-speedups).

So we would need to update how fishtest measures nps (currently similar to bench), otherwise we reduce TC by 2x instead of 1.15x or less.

@Sopel97 (Member, Author) commented Aug 10, 2025

That makes sense - it essentially allows higher reuse in L3, but is still limited by it. It might be problematic to calibrate this to be fully consistent across machines.

Is there actually a measurable slowdown on 1 instance?

@vondele (Member) commented Aug 10, 2025

You mean running a single process, single threaded? Yes, there is a small slowdown.

Result of 200 runs
==================
base (./stockfish.sopel        ) =    1132158  +/- 2364
test (./stockfish.master       ) =    1139999  +/- 2316
diff                             =      +7841  +/- 2318

speedup        = +0.0069
P(speedup > 0) =  1.0000

@AndyGrant (Contributor)

https://discord.com/channels/435943710472011776/735707599353151579/1396504227592798241

I've said as much before on the measurements being inflatable. A cheap, partial solution is to just randomize the order of the benched positions, and make sure there is no functional diff with the TT age. But you still get some extra sharing you would not normally get. I see something like a >100% gain in a bench, a 30-60% gain in a randomized bench, and a 20-40% gain in gameplay.

It's important to note that for gameplay, you have to measure the speedup in terms of (master vs master) vs (patch vs patch), not (master vs patch). This patch speeds up all other engines that are playing. I posted this data once before:

Verbatim Torch, compared to normal Torch, gains +11.51% speed against SF. But SF gains +7.10%.
Verbatim Torch self-play, compared to normal Torch self-play, gains +39.75%.
Verbatim Torch, compared to normal Torch, gains +12.51% speed against Ethereal. But Ethereal gains +2.05%.

We see SF speeds up as well, because when Torch reduces its memory footprint, SF benefits.
We see Ethereal speed up as well, but much less. Likely since Ethereal's memory footprint is very small compared to SF/Torch.

normal-Torch is 3.6% faster when playing itself, than against Stockfish. SF has a large footprint that hurts us.
normal-Torch is 36% faster when playing against Ethereal, than against itself

@vondele (Member) commented Aug 11, 2025

I've said as much before on the measurements being inflatable.

right, so here, and in several more instances, you are sharing the graph of inflated numbers, knowing they are a measurement artifact?

https://discord.com/channels/435943710472011776/813919248455827515/1396393406086778982

Edit: and this one is plain wrong https://discord.com/channels/435943710472011776/882956631514689597/1392309734715166832

@AndyGrant (Contributor)

I've said as much before on the measurements being inflatable.

right, so here, and in several more instances, you are sharing the graph of inflated numbers, knowing they are a measurement artifact?

https://discord.com/channels/435943710472011776/813919248455827515/1396393406086778982

Edit: and this one is plain wrong https://discord.com/channels/435943710472011776/882956631514689597/1392309734715166832

I provided the above information about speedups to help you guys have a productive conversation about what to do for Fishtest. Not every time I post that graphic do I have the time and energy to do a deep dive on the context and the nuance. However, when going into details in conversations with Disservin and Sopel, I do just that. Both of them are aware of the fickle nature of the measurements. Disservin from the link I posted -- Sopel from our shared experience when testing the original NUMA network-duplication patches on CCC.

I don't have any interest in continuing to play out you not liking me. Take that somewhere else instead of detracting from the productive work being done here.

@AndyGrant (Contributor) commented Aug 11, 2025

I see a +60% NPS improvement in gameplay, taking total nodes divided by total time for (verbatim vs verbatim) compared to (self vs self) on my 7950x. pgns.zip.

Not even that far off of the biased measurements. Your base NPS gain beats Torch's 40% NPS gain, probably in part due to having an L1=3072 network instead of L1=2560. You would probably get even more % if not for the mini network for imbalances. You would gain a bit more too if I was not overclocking this memory.

./cutechess-ob -variant standard -concurrency 31 -games 256 \
-engine cmd=./sf-master name=sf-master proto=uci tc=15.0+0.15 \
-engine cmd=./sf-master name=sf-master proto=uci tc=15.0+0.15 \
-openings file=UHO_Lichess_4852_v1.epd format=epd order=random \
-draw movenumber=32 movecount=6 score=8 \
-resign movecount=5 score=600 -pgnout games-master.pgn

Master average NPS: 387,482
Verbatim average NPS: 621,975 (+60.51%)

@vondele (Member) commented Aug 11, 2025

I have just pointed out that the graphs you have posted were based on flawed measurements and the statement of 2x throughput increase on fishtest was wrong. Even on hardware where this change has most effect, the measured result is below that.

I'm not exactly sure what the format of your pgns is, but I get more like 40% ...

$ grep -o "{[0-9a-z\/\.\-\ ]*}" games-master.pgn | sed "s/{//" | sed "s/}//g" | awk '{t=t+$3;s=s+$4}END{print 1000*s/t}'
570478
$ grep -o "{[0-9a-z\/\.\-\ ]*}" games-verbatim.pgn | sed "s/{//" | sed "s/}//g" | awk '{t=t+$3;s=s+$4}END{print 1000*s/t}'
784949

--> 1.37

I might be wrong though, maybe you can post what you used to analyze it.

For my own tests on similar hardware I get 42%, and on somewhat beefier hardware 37%:


for bin in sopel master
do

echo " ===== $bin ===== "

start=$(date +%s)
rm -f ${bin}.pgn

./fastchess -concurrency 32 -rounds 320 -games 2 -repeat\
            -srand 42 -openings file=UHO_Lichess_4852_v1.epd format=epd order=random\
            -engine name=${bin}1 cmd=./stockfish.${bin}\
            -engine name=${bin}2 cmd=./stockfish.${bin}\
            -each proto=uci option.Threads=1 option.Hash=16 tc=10+0.1\
            -pgnout file=${bin}.pgn nodes=true nps=true >& out.${bin}

nodes=$(grep -o 'n=[0-9]*' ${bin}.pgn  | cut -d= -f2- | awk '{s=s+$1}END{print s}')
end=$(date +%s)

echo "$((end - start)) seconds for $nodes nodes"
echo $start $end $nodes | awk '{print "nps: ", $3/($2 - $1)}'

done
===== sopel =====
709 seconds for 5309140368 nodes
nps:  7.48821e+06
 ===== master =====
711 seconds for 3740186907 nodes
nps:  5.26046e+06

--> 1.42
 ===== sopel =====
353 seconds for 19509839936 nodes
nps:  5.52687e+07
 ===== master =====
348 seconds for 13978842506 nodes
nps:  4.01691e+07

--> 1.37590

@AndyGrant (Contributor)

I have just pointed out that the graphs you have posted were based on flawed measurements and the statement of 2x throughput increase on fishtest was wrong. Even on hardware where this change has most effect, the measured result is below that.

DAILY REMINDER THAT SF COULD UP TO DOUBLE THEIR FISHTEST THROUGHPUT

Emphasis on "up to", if you really care to have such a semantic argument about nothing. Not to mention that the data provided is for a different chess engine... Although I imagine you could find an example without a qualifier, but such is a result of a campaign to get you guys to make this change. Which worked. And it is hard to see how one can be unhappy with a 60% gain.


The grep command you provided does not correctly parse all data in the PGNs. It only parses positions with a score of 0.00. The rest follows.

@vondele (Member) commented Aug 11, 2025

DAILY REMINDER THAT SF COULD UP TO DOUBLE THEIR FISHTEST THROUGHPUT

Emphasis on "up to"

Ah, that clarifies it. I'm obviously still interested in the 10-40% we might get from this.

The grep command you provided does not correctly parse all data in the PGNs.

Thanks, now I can reproduce the numbers you provided.

@robertnurnberg (Contributor)

So what's currently the best script to accurately measure the realistic speedup of this patch in gameplay conditions on 1, 2, 4, 8, etc. threads? Could someone post it here?

@vondele (Member) commented Aug 12, 2025

Probably the fastchess match in #6173 (comment).

Edit: adjusted for the thread count of the machine used for testing, as well as the number of threads used for the games.

@robertnurnberg (Contributor)

Thanks for the quick reply. Will try to produce some numbers tomorrow. Will only do 1-thread games, but with increasing concurrency. Does the grep command in that script need to be changed as well?

@vondele (Member) commented Aug 12, 2025

No, the grep + awk just extracts from the pgn the total number of nodes searched; that ought to remain the same.

@robertnurnberg (Contributor)

So I played with this a bit. I am trying to improve on the above script (on a mobile!), and hence often interrupted the fastchess tournaments with ctrl-c.

Sadly, I have now reached a stage where the patch is unresponsive to UCI commands on startup. So each new tournament fails.

I also tried the compiler command from the CLI. This now enters an interactive shell rather than printing the info and quitting. The quit command no longer works either.

> ./stockfish.sopel compiler
Stockfish dev-20250809-8274fc5c by the Stockfish developers (see AUTHORS file)
quit
^C

Any idea how I can get out of this without rebooting the machine? Of course, this is something to fix before merge as well, I think.

@vondele (Member) commented Aug 13, 2025

Yeah, you probably have a lock file that can't get cleaned up because the process that owns it died (and so you have a deadlock). The workaround right now is to remove the files in /dev/shm that you own.

This is a problem with POSIX semaphores. It probably requires a change to use pthread_mutexattr_setrobust and no longer use POSIX semaphores.

@robertnurnberg (Contributor)

Yeah, thanks. That worked for me. I deleted two files, one of them a lock file I think.

@Sopel97 (Member, Author) commented Aug 13, 2025

@vondele (Member) commented Aug 13, 2025

do we follow this for crapos? https://chromium.googlesource.com/native_client/src/native_client/+/d6d53e83fffcc64b6e0d52acc31c63e0b5a60186/src/shared/platform/osx/nacl_semaphore.c#46

Not fully, but hardcoding SEM_NAME_LEN would fix the issue we see in CI.

Having said that, to avoid the issue @robertnurnberg (and I at an earlier stage) experienced, we would have to drop POSIX semaphores...

@Sopel97 (Member, Author) commented Aug 13, 2025

Right, I missed it, sorry. So file locking?

@Viren6 (Contributor) commented Aug 14, 2025

This issue was explored and brought to light by @AndyGrant [verbatim.pdf](https://github.com/user-attachments/files/21386224/verbatim.pdf).

Can I get credit somewhere for the initial identification of this problem (official-stockfish/fishtest#2077) and the subsequent demonstration of the solution through the first impl of net sharing in a chess engine (official-monty/Monty#62), as well as for discovering the behaviour of Elo gain in multi-process SPRT conditions etc.? AGE's work built on top of this.

@AndyGrant (Contributor)

This issue was explored and brought to light by @AndyGrant [verbatim.pdf](https://github.com/user-attachments/files/21386224/verbatim.pdf).

Can I get credit somewhere for the initial identification of this problem (official-stockfish/fishtest#2077) and the subsequent demonstration of the solution through the first impl of net sharing in a chess engine (official-monty/Monty#62), as well as for discovering the behaviour of Elo gain in multi-process SPRT conditions etc.? AGE's work built on top of this.

My exploration is indeed predicated on Viren's observations.

@vondele (Member) commented Aug 14, 2025

Right, I missed it, sorry. So file locking?

Well, in /dev/shm... This is the example ChatGPT cooked up, though I didn't test it... The key thing is the EOWNERDEAD capability.

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <pthread.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <time.h>

#define SHM_NAME "/portable_mutex_demo"
#define MAGIC_INIT 0xdeadbeef

struct shared_area {
    pthread_mutex_t mutex;
    int initialized;
};

/* Helper: fatal error */
static void fatal(const char *msg) {
    perror(msg);
    exit(EXIT_FAILURE);
}

/* Helper: portable lock with owner-death handling or timeout */
static int portable_mutex_lock(pthread_mutex_t *mtx, int timeout_sec) {
#ifdef PTHREAD_MUTEX_ROBUST
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    ts.tv_sec += timeout_sec;

    int rc = pthread_mutex_timedlock(mtx, &ts);
    if (rc == 0) {
        return 0; // success
    } else if (rc == EOWNERDEAD) {
        // Owner died — repair shared state here if needed
        fprintf(stderr, "[WARN] Owner died, marking mutex consistent\n");
        pthread_mutex_consistent(mtx);
        return 0;
    } else {
        return rc; // ETIMEDOUT or other error
    }
#else
    // No robust support — emulate with timedlock only
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    ts.tv_sec += timeout_sec;
    int rc = pthread_mutex_timedlock(mtx, &ts);
    return rc;
#endif
}

int main(void) {
    int fd;
    int creator = 0;

    fd = shm_open(SHM_NAME, O_RDWR | O_CREAT | O_EXCL, 0660);
    if (fd >= 0) {
        creator = 1;
        if (ftruncate(fd, sizeof(struct shared_area)) == -1)
            fatal("ftruncate");
    } else if (errno == EEXIST) {
        fd = shm_open(SHM_NAME, O_RDWR, 0);
        if (fd == -1) fatal("shm_open");
    } else {
        fatal("shm_open");
    }

    struct shared_area *area = mmap(NULL, sizeof(*area),
                                    PROT_READ | PROT_WRITE,
                                    MAP_SHARED, fd, 0);
    if (area == MAP_FAILED) fatal("mmap");
    close(fd);

    if (creator) {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);

#ifdef PTHREAD_MUTEX_ROBUST
        if (pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST) != 0)
            fprintf(stderr, "[WARN] robust mutex not supported here\n");
#endif

        pthread_mutex_init(&area->mutex, &attr);
        pthread_mutexattr_destroy(&attr);

        area->initialized = MAGIC_INIT;
        fprintf(stderr, "[INFO] Creator initialized mutex\n");
    } else {
        while (area->initialized != MAGIC_INIT) {
            usleep(1000);
        }
    }

    // Try to lock
    int rc = portable_mutex_lock(&area->mutex, 5); // 5s timeout
    if (rc == 0) {
        fprintf(stderr, "[INFO] Got lock, critical section...\n");
        sleep(2);
        pthread_mutex_unlock(&area->mutex);
    } else if (rc == ETIMEDOUT) {
        fprintf(stderr, "[ERROR] Timeout acquiring lock\n");
    } else {
        fprintf(stderr, "[ERROR] Lock failed: %s\n", strerror(rc));
    }

    if (creator) shm_unlink(SHM_NAME);
    return 0;
}

@robertnurnberg (Contributor) commented Aug 14, 2025

I have some numbers now. I edited the above script to run multiple concurrency values and compute the speedup automatically.

#!/bin/bash

for i in $(seq 0 6)
do
  concurrency=$(( 1 << i ))
  rounds=$(( 10 << i ))

  echo " ===== concurrency: $concurrency ====="

  oldnps=
  for bin in sopel master
  do

  echo " ===== $bin ====="

  start=$(date +%s)
  pgn=${bin}${concurrency}.pgn
  rm -f ${pgn}

  ./fastchess -concurrency ${concurrency} -rounds ${rounds} -games 2 -repeat\
              -srand 42 -openings file=UHO_Lichess_4852_v1.epd format=epd order=random\
              -engine name=${bin}1 cmd=./stockfish.${bin}\
              -engine name=${bin}2 cmd=./stockfish.${bin}\
              -each proto=uci option.Threads=1 option.Hash=16 tc=10+0.1\
              -pgnout file=${pgn} nodes=true nps=true >& out.${bin}${concurrency}

  nodes=$(grep -o 'n=[0-9]*' ${pgn} | cut -d= -f2- | awk '{s=s+$1}END{printf "%.0f\n", s}')

  end=$(date +%s)

  echo "$((end - start)) seconds for $nodes nodes"
  nps=$(echo "$nodes/($end - $start)" | bc -l)
  printf "nps:  %.5e\n" $nps
  if [[ $oldnps ]]; then
    speedup=$(echo "$oldnps/$nps" | bc -l)
    printf "concurrency %d speedup: %.5f\n" $concurrency $speedup
  else
    oldnps=$nps
  fi

  done
done

The numbers on my machine are as follows.

nps:  1.17123e+06
nps:  1.29927e+06
concurrency 1 speedup: 0.90145
nps:  2.31524e+06
nps:  2.50110e+06
concurrency 2 speedup: 0.92569
nps:  4.54807e+06
nps:  4.62977e+06
concurrency 4 speedup: 0.98235
nps:  8.81369e+06
nps:  9.10993e+06
concurrency 8 speedup: 0.96748
nps:  1.61587e+07
nps:  1.66870e+07
concurrency 16 speedup: 0.96834
nps:  2.44734e+07
nps:  2.03335e+07
concurrency 32 speedup: 1.20360
nps:  2.31073e+07
nps:  1.77280e+07
concurrency 64 speedup: 1.30344

So below 32 concurrency there is sadly no speedup.
Note also that both patch and master have lower nps when going from 32 to 64 concurrency.

The machine is an AMD Ryzen Threadripper PRO 3995WX (64 cores), with

Caches (sum of all):
  L1d:                    2 MiB (64 instances)
  L1i:                    2 MiB (64 instances)
  L2:                     32 MiB (64 instances)
  L3:                     256 MiB (16 instances)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-127

Edit01: A second run on the same machine gives this:

nps:  1.27082e+06
nps:  1.29156e+06
concurrency 1 speedup: 0.98394
nps:  2.31350e+06
nps:  2.55148e+06
concurrency 2 speedup: 0.90673
nps:  4.93585e+06
nps:  4.56308e+06
concurrency 4 speedup: 1.08169
nps:  8.94310e+06
nps:  8.96132e+06
concurrency 8 speedup: 0.99797
nps:  1.51881e+07
nps:  1.71890e+07
concurrency 16 speedup: 0.88359
nps:  1.94823e+07
nps:  2.30791e+07
concurrency 32 speedup: 0.84416
nps:  1.98179e+07
nps:  1.76121e+07
concurrency 64 speedup: 1.12524

Edit02: Repeat with fastchess option use-affinity 0-127:

nps:  1.20266e+06
nps:  1.33562e+06
concurrency 1 speedup: 0.90045
nps:  2.23819e+06
nps:  2.25008e+06
concurrency 2 speedup: 0.99471
nps:  4.16543e+06
nps:  4.10663e+06
concurrency 4 speedup: 1.01432
nps:  7.33581e+06
nps:  7.36909e+06
concurrency 8 speedup: 0.99548
nps:  1.35778e+07
nps:  1.20530e+07
concurrency 16 speedup: 1.12651
nps:  1.86878e+07
nps:  1.59461e+07
concurrency 32 speedup: 1.17193
nps:  2.06490e+07
nps:  1.74495e+07
concurrency 64 speedup: 1.18335

Edit03: Repeat with use-affinity 0-63 and 100<<i rounds, so the matches last about 2h.

nps:  1.19935e+06
nps:  1.25454e+06
concurrency 1 speedup: 0.95600
nps:  2.24302e+06
nps:  2.32657e+06
concurrency 2 speedup: 0.96409
nps:  4.03640e+06
nps:  4.08200e+06
concurrency 4 speedup: 0.98883
nps:  7.64909e+06
nps:  7.37593e+06
concurrency 8 speedup: 1.03703
nps:  1.11660e+07
nps:  1.25210e+07
concurrency 16 speedup: 0.89178
nps:  1.42894e+07
nps:  1.56303e+07
concurrency 32 speedup: 0.91421
nps:  1.69207e+07
nps:  1.66658e+07
concurrency 64 speedup: 1.01530

@vondele (Member) commented Aug 14, 2025

Thanks for the numbers and the script (I fixed a small pasto in the script).

@vondele (Member) commented Aug 14, 2025

So on a 16c/32t CPU:

 ===== concurrency: 1 =====
 ===== sopel =====
702 seconds for 811291657 nodes
nps:  1.15569e+06
 ===== master =====
699 seconds for 802189922 nodes
nps:  1.14763e+06
concurrency 1 speedup: 1.00702
 ===== concurrency: 2 =====
 ===== sopel =====
738 seconds for 1637764864 nodes
nps:  2.21919e+06
 ===== master =====
710 seconds for 1656871353 nodes
nps:  2.33362e+06
concurrency 2 speedup: 0.95097
 ===== concurrency: 4 =====
 ===== sopel =====
716 seconds for 2991088304 nodes
nps:  4.17750e+06
 ===== master =====
694 seconds for 2861611591 nodes
nps:  4.12336e+06
concurrency 4 speedup: 1.01313
 ===== concurrency: 8 =====
 ===== sopel =====
717 seconds for 5163964657 nodes
nps:  7.20218e+06
 ===== master =====
706 seconds for 4916893110 nodes
nps:  6.96444e+06
concurrency 8 speedup: 1.03414
 ===== concurrency: 16 =====
 ===== sopel =====
720 seconds for 6943171134 nodes
nps:  9.64329e+06
 ===== master =====
711 seconds for 5897928439 nodes
nps:  8.29526e+06
concurrency 16 speedup: 1.16251
 ===== concurrency: 32 =====
 ===== sopel =====
710 seconds for 5422687573 nodes
nps:  7.63759e+06
 ===== master =====
721 seconds for 3871704407 nodes
nps:  5.36991e+06
concurrency 32 speedup: 1.42229
(graphs)

@vondele (Member) commented Aug 14, 2025

Since the Monty merge request was pointed out, the numbers there are:

 ===== concurrency: 16 =====
 ===== mmap =====
760 seconds for 11109540857 nodes
nps:  1.46178e+07
 ===== nommap =====
738 seconds for 8639320938 nodes
nps:  1.17064e+07
concurrency 16 speedup: 1.24870

see also https://discord.com/channels/435943710472011776/813919248455827515/1405555036880113664

@robertnurnberg (Contributor) commented Aug 16, 2025

I have updated #6173 (comment) with some more numbers.

I would be very curious to know if others with AMD Ryzen chips see similarly underwhelming numbers.

Edit: See also https://discord.com/channels/435943710472011776/1405613835020406815 for a discussion of these results.
