
Conversation

@Sh3llcod3

@Sh3llcod3 Sh3llcod3 commented Oct 10, 2025

Overview

This is a complete rewrite of the AsyncWebSocket. This overhaul replaces the implementation with a fully asynchronous, task-based architecture that is idiomatic, highly performant, and hardened for production use. It now fails predictably and cleans up its resources correctly under all circumstances.

This pull request also addresses several lifecycle and error-handling bugs in the AsyncWebSocket implementation.

The key focus of this rewrite is to achieve the best possible performance while keeping the code clean and breaking changes to a minimum. That said, there are some breaking changes, so let's get those out of the way.

Breaking Changes

  1. send() Error Handling is now asynchronous:

    • Old: A network error during a send() operation would raise an exception directly from the await ws.send(...) call.
    • New: A network error during a send is now caught by the background writer task. The exception is then placed on the receive queue and will be raised by a concurrent or subsequent call to await ws.recv(). Application-level error handling must be adapted to this new model.
  2. send() is now non-blocking and queued:

    • Old: await ws.send(...) would block until the data was passed to the OS socket buffer.
    • New: await ws.send(...) returns almost instantly after placing the message in an internal queue. To guarantee that all queued messages have been picked up for sending, callers must now use the new await ws.flush() method (see the usage sketch after this list).
  3. recv_fragment() Removed: The recv_fragment method is no longer part of the class. Message reassembly is now an internal implementation detail of the read loop.
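
To make the new model concrete, here is a minimal usage sketch. The URL is a placeholder, error handling is deliberately simplified, and it assumes only the send/flush/recv behaviour described above:

```python
import asyncio

from curl_cffi.requests import AsyncSession

async def main():
    async with AsyncSession() as session:
        # Placeholder URL; any WebSocket endpoint works.
        ws = await session.ws_connect("wss://echo.example.invalid")
        try:
            # send() only enqueues the frame and returns almost immediately.
            await ws.send(b"hello")
            # flush() waits until queued messages have been picked up for sending.
            await ws.flush()
            # Errors from the background writer task surface here, not from send().
            reply = await ws.recv()
            print(reply)
        except Exception as exc:
            # Network errors that used to come out of send() now arrive via recv().
            print(f"WebSocket error: {exc!r}")
        finally:
            await ws.close()

asyncio.run(main())
```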

Performance

The new implementation completely does away with the use of run_in_executor. In the original issue I raised, #645, I found that the executor overhead was massive and accounted for most of the time, leading to very low throughput figures. From reading the original code, it became clear that the curl socket is non-blocking and simply returns EAGAIN when it is not ready to be read from or written to.

This implementation handles those EAGAINs and uses behaviour similar to aselect(...) to wait on FD availability. Instead of running the libcurl calls in a thread pool to avoid blocking, the code opts for a cooperative-yielding approach: every N I/O operations, or after a certain amount of time has elapsed, it yields control to the event loop so that other tasks get a fair chance to run.

This is obviously a double-edged sword. Yielding too much time to the event loop lowers throughput, while hogging the event loop starves other tasks running on it. I've set values which I believe strike a good balance between throughput and yielding control fairly.
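
For illustration, the yielding pattern amounts to something like the following sketch; the function name and thresholds are illustrative, not the values actually used in the PR:

```python
import asyncio
import time

# Illustrative thresholds; the real values in the PR are tuned separately.
YIELD_EVERY_N_OPS = 64
YIELD_AFTER_SECONDS = 0.005

async def io_loop(do_one_io_op):
    """Run non-blocking I/O ops, periodically yielding to the event loop."""
    ops_since_yield = 0
    last_yield = time.monotonic()
    while True:
        did_work = do_one_io_op()  # returns False on EAGAIN / nothing to do
        ops_since_yield += 1
        now = time.monotonic()
        if (
            not did_work
            or ops_since_yield >= YIELD_EVERY_N_OPS
            or now - last_yield >= YIELD_AFTER_SECONDS
        ):
            # Give other tasks on the loop a fair chance to run.
            await asyncio.sleep(0)
            ops_since_yield = 0
            last_yield = time.monotonic()
```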

The current implementation uses background I/O tasks for reading and writing, which decouples the I/O operations from the public API. That means when you call send(...) or recv(...), you are simply performing a lightweight operation that adds the WS frame to an asyncio queue. When the FD next becomes available for writing, the writer task plucks the frame from the queue and writes it to the curl socket, and vice versa for reads. This aims to be a best-of-both-worlds approach, suitable for both continuous and sparse message patterns, by engaging CPU time only when it is needed and waiting efficiently when idle.

This decoupling also eliminates the risk of crashes when calling send/recv concurrently. The public methods only operate on asyncio queues, which are designed for concurrent use, and as long as there is only one instance of each I/O loop task, there is only ever a single reader and a single writer interacting with the curl socket, sequentially.
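
In outline, the queued-send side of that architecture looks roughly like this simplified sketch (the real loops also handle EAGAIN, batching, and shutdown):

```python
import asyncio

class QueuedSender:
    """Simplified model of the queued-send architecture described above."""

    def __init__(self, curl_ws_send):
        # Must be constructed inside a running event loop.
        self._curl_ws_send = curl_ws_send      # the underlying libcurl send call
        self._out_queue: asyncio.Queue = asyncio.Queue()
        self._writer_task = asyncio.create_task(self._write_loop())

    async def send(self, payload: bytes) -> None:
        # Lightweight: just enqueue the frame and return.
        await self._out_queue.put(payload)

    async def flush(self) -> None:
        # Wait until every queued frame has been picked up by the writer.
        await self._out_queue.join()

    async def _write_loop(self) -> None:
        while True:
            payload = await self._out_queue.get()
            try:
                self._curl_ws_send(payload)    # only the writer touches the socket
            finally:
                self._out_queue.task_done()
```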

On the write side, we use some optimization techniques, including adaptive batching and (optional) send coalescing, to reduce overhead and the number of system calls made. These techniques do work, but in this case the benefits are limited, as there seems to be a hard cap on send throughput (around 40 MiB/s on my machine). During development I took tens (if not hundreds) of cProfile performance captures, and no matter what I do, send() throughput is hard limited by curl's ws_send() call. It likely cannot be improved further.
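
As a rough illustration of the coalescing idea (not the exact batching logic in the PR), adjacent queued payloads are merged into a single frame up to curl's 64 KiB limit:

```python
MAX_FRAME_SIZE = 64 * 1024  # curl caps WebSocket frames at 64 KiB

def coalesce(pending: list[bytes]) -> tuple[bytes, list[bytes]]:
    """Merge leading payloads into one outgoing frame, up to the frame cap."""
    chunks: list[bytes] = []
    total = 0
    i = 0
    while i < len(pending) and total + len(pending[i]) <= MAX_FRAME_SIZE:
        chunks.append(pending[i])
        total += len(pending[i])
        i += 1
    if not chunks and pending:
        # A single payload larger than the cap goes out on its own;
        # splitting oversized payloads is not shown in this sketch.
        chunks, i = [pending[0]], 1
    # Return the merged frame plus whatever is left for the next pass.
    return b"".join(chunks), pending[i:]

# 32 of these 2 KiB messages merge into one 64 KiB frame; 8 remain for the next pass.
frame, rest = coalesce([b"\x00" * 2048] * 40)
```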

In case you are wondering: Does the send coalescing actually work? Yes!

Here is a cProfile with the flag turned on, sending 10 GiB of data with a 65k chunk size:

4385099 function calls (4379080 primitive calls) in 255.745 seconds

   Ordered by: cumulative time
   List reduced from 2431 to 40 due to restriction <40>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    238/1    0.005    0.000  255.745  255.745 {built-in method builtins.exec}
        1    0.000    0.000  255.745  255.745 ./client.py:1(<module>)
        1    0.000    0.000  254.919  254.919 /home/.cache/pypoetry/virtualenvs/ws-bench-_cej2YY4-py3.13/lib/python3.13/site-packages/uvloop/__init__.py:51(run)
        1    0.000    0.000  254.919  254.919 /home/.pyenv/versions/3.13.7/lib/python3.13/asyncio/runners.py:160(run)
        1    0.000    0.000  254.918  254.918 /home/.pyenv/versions/3.13.7/lib/python3.13/asyncio/runners.py:61(__exit__)
        1    0.000    0.000  254.918  254.918 /home/.pyenv/versions/3.13.7/lib/python3.13/asyncio/runners.py:64(close)
      2/1    0.000    0.000  254.917  254.917 /home/.pyenv/versions/3.13.7/lib/python3.13/threading.py:1000(_bootstrap)
      2/1    0.000    0.000  254.917  254.917 /home/.pyenv/versions/3.13.7/lib/python3.13/threading.py:1027(_bootstrap_inner)
      2/1    0.000    0.000  254.917  254.917 /home/.pyenv/versions/3.13.7/lib/python3.13/threading.py:983(run)
        1    0.214    0.214  254.905  254.905 /home/.pyenv/versions/3.13.7/lib/python3.13/asyncio/runners.py:86(run)
    10189    0.722    0.000  252.518    0.025 /home/.cache/pypoetry/virtualenvs/ws-bench-_cej2YY4-py3.13/lib/python3.13/site-packages/curl_cffi/requests/websockets.py:1019(_write_loop)
    10825    0.156    0.000  247.827    0.023 /home/.cache/pypoetry/virtualenvs/ws-bench-_cej2YY4-py3.13/lib/python3.13/site-packages/curl_cffi/requests/websockets.py:1086(_send_payload)
    10188    0.092    0.000  247.613    0.024 /home/.cache/pypoetry/virtualenvs/ws-bench-_cej2YY4-py3.13/lib/python3.13/site-packages/curl_cffi/curl.py:522(ws_send)
    10187  247.406    0.024  247.406    0.024 {built-in method curl_cffi._wrapper.curl_ws_send}
      637    3.077    0.005    3.077    0.005 {method 'join' of 'bytes' objects}
      628    0.001    0.000    1.953    0.003 /home/.cache/pypoetry/virtualenvs/ws-bench-_cej2YY4-py3.13/lib/python3.13/site-packages/uvloop/__init__.py:54(wrapper)
      629    0.001    0.000    1.951    0.003 ./client.py:78(main)
      629    0.293    0.000    1.951    0.003 ./client.py:30(run_benchmark_curl_cffi)

Here is another cProfile with the same transfer but without the feature:

6694317 function calls (6688299 primitive calls) in 262.812 seconds

   Ordered by: cumulative time
   List reduced from 2418 to 40 due to restriction <40>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    238/1    0.005    0.000  262.812  262.812 {built-in method builtins.exec}
        1    0.000    0.000  262.812  262.812 ./client.py:1(<module>)
        1    0.000    0.000  261.990  261.990 /home/.cache/pypoetry/virtualenvs/ws-bench-_cej2YY4-py3.13/lib/python3.13/site-packages/uvloop/__init__.py:51(run)
        1    0.000    0.000  261.990  261.990 /home/.pyenv/versions/3.13.7/lib/python3.13/asyncio/runners.py:160(run)
        1    0.000    0.000  261.989  261.989 /home/.pyenv/versions/3.13.7/lib/python3.13/asyncio/runners.py:61(__exit__)
        1    0.000    0.000  261.988  261.988 /home/.pyenv/versions/3.13.7/lib/python3.13/asyncio/runners.py:64(close)
      2/1    0.000    0.000  261.988  261.988 /home/.pyenv/versions/3.13.7/lib/python3.13/threading.py:1000(_bootstrap)
      2/1    0.000    0.000  261.988  261.988 /home/.pyenv/versions/3.13.7/lib/python3.13/threading.py:1027(_bootstrap_inner)
      2/1    0.000    0.000  261.988  261.988 /home/.pyenv/versions/3.13.7/lib/python3.13/threading.py:983(run)
        1    1.920    1.920  261.976  261.976 /home/.pyenv/versions/3.13.7/lib/python3.13/asyncio/runners.py:86(run)
   162871    1.041    0.000  257.854    0.002 /home/.cache/pypoetry/virtualenvs/ws-bench-_cej2YY4-py3.13/lib/python3.13/site-packages/curl_cffi/requests/websockets.py:1019(_write_loop)
   325740    1.637    0.000  256.072    0.001 /home/.cache/pypoetry/virtualenvs/ws-bench-_cej2YY4-py3.13/lib/python3.13/site-packages/curl_cffi/requests/websockets.py:1086(_send_payload)
   162870    0.935    0.000  253.749    0.002 /home/.cache/pypoetry/virtualenvs/ws-bench-_cej2YY4-py3.13/lib/python3.13/site-packages/curl_cffi/curl.py:522(ws_send)
   162869  251.539    0.002  251.539    0.002 {built-in method curl_cffi._wrapper.curl_ws_send}
      628    0.001    0.000    1.924    0.003 /home/.cache/pypoetry/virtualenvs/ws-bench-_cej2YY4-py3.13/lib/python3.13/site-packages/uvloop/__init__.py:54(wrapper)
      629    0.001    0.000    1.923    0.003 ./client.py:78(main)
      629    0.293    0.000    1.923    0.003 ./client.py:30(run_benchmark_curl_cffi)

Throughput is hard capped on my machine, so the transfer is not that much faster overall, but notice the difference.

That's 10188 calls to ws_send() with the feature versus 162870 calls without. Also, more than 95% of the time is spent in the blocking C function, with little overhead from the Python side.

Note: these numbers were taken before the max frame size was set to 64k; you can still get this benefit if you are aggregating frames smaller than that value.

With streaming protocols that do not care about frame boundaries, this works nicely for smaller chunks. Here is the speed of the same transfer with a 2048-byte chunk size, without and with coalesce_frames=True:

(screenshots: throughput without vs. with coalesce_frames=True)

On the curl FFI side, I've made one key optimization which significantly speeds up receive throughput. Rather than creating a new ffi.buffer(...) object for each fragment received, the code now pre-allocates that buffer once and reuses it. This avoids repeated memory allocations in a tight loop. It is safe to do because the buffer is only ever read up to a bounded length, so it does not need to be cleared or reallocated each time, and a bytes copy is made when the function returns, so there are no referencing issues.

I did try returning a memoryview(ffi.buffer(...)), which was quite a bit faster, but it ran into referencing issues: the view points at the pre-allocated buffer, so any later write to the buffer changes what every outstanding memoryview sees. Although copying bytes does not sound great, in practice every alternative short of returning a zero-copy view into the buffer turned out slower. I tried returning bytearray objects and extending fragments into a pre-allocated bytearray, but both were slower. In the end, keeping a list of bytes and calling b"".join(chunks) was the fastest way to assemble the fragments into one complete message.
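
Schematically, the receive-side pattern looks like this; the class name, buffer size, and the curl_ws_recv callable are hypothetical stand-ins for the real FFI plumbing:

```python
from cffi import FFI

ffi = FFI()

RECV_CHUNK = 64 * 1024  # bounded read size, so the buffer never needs clearing

class FragmentReceiver:
    """Reuse one pre-allocated buffer across fragments instead of allocating per read."""

    def __init__(self):
        self._c_buf = ffi.new("char[]", RECV_CHUNK)   # allocated once
        self._py_buf = ffi.buffer(self._c_buf)        # wrapped once, reused

    def read_fragment(self, curl_ws_recv) -> bytes:
        # curl_ws_recv (hypothetical callable) fills self._c_buf and reports
        # how many bytes it wrote.
        n_read = curl_ws_recv(self._c_buf, RECV_CHUNK)
        # Slicing the ffi.buffer copies exactly the bytes received; the copy
        # decouples the result from the reused buffer, avoiding the
        # memoryview aliasing problem described above.
        return self._py_buf[:n_read]

def assemble(fragments: list[bytes]) -> bytes:
    # Joining a list of bytes was the fastest reassembly strategy measured.
    return b"".join(fragments)
```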

Benchmark Code

Now let's get to the interesting part. The numbers tell their own story.

In order to arrive at these numbers, I am running two different sets of benchmarks.

  1. The first benchmark (client.py and server_ssl.py) simply sends and/or receives in a loop (depending on what code is commented out) and does nothing other than time how long it took (a schematic of its loop follows this list).

  2. The second benchmark (benchmark.py) is slightly more realistic. It reads a file from disk into memory and loads its hash. Depending on whether it is run in server or client mode, it will receive or send that file; the other end calculates a hash of the received data, compares it against the loaded hash of the file on disk, and tells you whether it matches. This will be slower due to the higher overhead.
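
Schematically, the first benchmark's send loop is along these lines. This is a simplified stand-in for the actual client.py, using a plain ws:// URL and illustrative sizes:

```python
import asyncio
import time

from curl_cffi.requests import AsyncSession

CHUNK = b"\x00" * 65536       # 64 KiB payload per send
TOTAL_BYTES = 1 * 1024**3     # 1 GiB total (illustrative)
URL = "ws://127.0.0.1:8765"   # local sink/echo server; the real benchmarks also cover TLS

async def main():
    async with AsyncSession() as session:
        ws = await session.ws_connect(URL)
        sent = 0
        start = time.perf_counter()
        while sent < TOTAL_BYTES:
            await ws.send(CHUNK)
            sent += len(CHUNK)
        await ws.flush()      # ensure everything was handed off for sending
        elapsed = time.perf_counter() - start
        await ws.close()
    print(f"Sent {sent / 1024**3:.2f} GiB in {elapsed:.2f} s "
          f"({sent * 8 / elapsed / 1e9:.2f} Gbps)")

asyncio.run(main())
```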

Neither benchmark is good and the benchmark code is bad. They are simply a rough proof of concept.

I am using btop to monitor throughput. I run each test more than once, noting that the first run is sometimes measurably slower, possibly due to CPU caches; either way, a 5-10% variation on these figures should be factored in. I am using an HPE DL-60 Gen 9 rack server. This is an ancient server which clocks up to 1.6 GHz, so your results may vary. One nice aspect is that it has 80 GB of RAM, so I can run the benchmarks with 20 GB files loaded in memory without any issues.

In each test, I will compare the existing code vs new code from this PR.

Files

  1. client.py
  2. server_ssl.py
  3. benchmark.py
  4. pyproject.toml

Slow Case

The slow case is where the chunk size in all the benchmarks has been set to a small value, i.e. 2048 bytes. This increases the number of libcurl calls needed to transfer the same amount of data and slows down throughput.

Download

client.py:

(screenshots: existing implementation vs. this PR)

benchmark.py:

(screenshots: existing implementation vs. this PR)

Upload

client.py:

(screenshots: existing implementation vs. this PR)

benchmark.py:

(screenshots: existing implementation vs. this PR)

Average Case

The average case only considers benchmark.py results with a larger chunk size of 65536 bytes.

Download

(screenshots: existing implementation vs. this PR)

Upload

(screenshots: existing implementation vs. this PR)

Best Case

These numbers are unrealistic and unlikely to be achievable in a real application. This only considers the client.py benchmark, which effectively just loops and does absolutely nothing with the received/sent data. Any application processing logic will add overhead and slow down throughput. These figures are included as an optimistic "if there were nothing else, how fast could we go?" scenario. The chunk size is 65536 bytes.

Download

(screenshots: existing implementation vs. this PR)

My friend also ran this benchmark on his Windows machine, which has a modern CPU about 10 years newer than my poor server:

--- curl-cffi Benchmark Complete ---
Sent 200.00 GB of data in 91.74 seconds.
Average throughput: 17.44 Gbps.

Upload

(screenshots: existing implementation vs. this PR)

Concurrent

Here is a test from the same benchmark, sending and receiving concurrently (without/with frame coalescing):

(screenshots: without vs. with frame coalescing)

The code provides a number of tunable parameters to tweak concurrent performance, such as yield_interval, fair_scheduling and yield_mask. All of these can be set by the end user to suit their needs.
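
As a purely hypothetical example of what tuning could look like: the parameter names come from this PR, but where they are accepted and their exact types/values are assumptions made for illustration only.

```python
from curl_cffi.requests import AsyncSession

async def tuned_demo():
    session = AsyncSession()
    # Hypothetical: whether ws_connect is the place these are passed, and what
    # their types/values should be, is assumed here rather than taken from the PR.
    ws = await session.ws_connect(
        "wss://example.invalid/feed",
        yield_interval=0.005,   # assumed: seconds between forced yields to the loop
        fair_scheduling=True,   # assumed: favour loop fairness over peak throughput
        yield_mask=0x3F,        # assumed: yield roughly every 64 I/O operations
    )
    await ws.close()
    await session.close()
```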

Closing Note

Should you trust my benchmarks and statements? Absolutely not.

It's always best to run your own benchmarks (and double-check the code yourself), since I have no idea what your use case is. I am benchmarking on localhost to eliminate network conditions as a factor, but real internet conditions may impact the numbers significantly, in which case the relevant changes should be made.

Also, if you think the current send(...) API does not provide adequate guarantees of transmission, I can change it so that:

  • When send(...) is called, it inserts the content to be sent along with an asyncio.Future object into the send queue
  • The send(...) method then either returns a reference to the Future object or awaits it directly.
  • The write loop picks up this queue item and prepares it for transmission
  • After ws_send(...) is called, the Future result is either set to the number of bytes sent or an Exception object

That would guarantee that the item has been sent out, but it would increase memory pressure and slow down sending, so that trade-off should be considered. If it's needed, it can be done as part of this PR or in a follow-up PR. I also need to think through how that would work for coalesced sends.
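
For concreteness, a minimal sketch of that Future-based variant might look like this (illustrative only, not part of this PR; starting the writer task is omitted):

```python
import asyncio

class ConfirmedSender:
    """Sketch of the proposed Future-per-send variant."""

    def __init__(self, curl_ws_send):
        self._curl_ws_send = curl_ws_send
        self._queue: asyncio.Queue = asyncio.Queue()

    async def send(self, payload: bytes) -> int:
        fut = asyncio.get_running_loop().create_future()
        await self._queue.put((payload, fut))
        # Awaiting the future means "this frame was actually handed to curl".
        return await fut

    async def _write_loop(self) -> None:
        while True:
            payload, fut = await self._queue.get()
            try:
                sent = self._curl_ws_send(payload)   # number of bytes sent
            except Exception as exc:
                fut.set_exception(exc)
            else:
                fut.set_result(sent)
```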

There is definitely room for improvement with this code and the architecture overall. I acknowledge that this is just one person's attempt at making a useful library better (although various LLMs were used to review the code and find improvements).

Feel free to suggest improvements. I will do my best to integrate them and ensure the code is in the best state possible.

@lexiforest
Owner

Cool, the improvements are significant. Thanks!

@lexiforest lexiforest linked an issue Oct 11, 2025 that may be closed by this pull request
@Sh3llcod3
Author


Almost there, just need to make it resilient against edge case race conditions, then it should be good to review

@Sh3llcod3
Author

Sh3llcod3 commented Oct 15, 2025

@lexiforest Unless I find anything else, the code should be good to review now 😎

I've updated the description.

@lexiforest
Owner

Thanks, this is a huge PR, it could take some time to review. :)

@Sh3llcod3
Author

Sh3llcod3 commented Oct 15, 2025

Sounds good, I shall await :)

(Also whoops I set the frame size to 1MB, curl max frame size is 64k according to their docs 😅)

@lexiforest
Owner

lexiforest commented Oct 16, 2025

Do you mind adding some of your benchmark code here? And also mention it in the readme, if you think it helps.

@Sh3llcod3
Author

Added first benchmark, I will add the second one soon.

@Sh3llcod3
Author

Sh3llcod3 commented Oct 22, 2025

@lexiforest Wow that took a while, but we got there in the end.

Both benchmarks are there. I've also added instructions on running them into the benchmark README. They should work cross-platform straight away.

I've gone through and checked all the code and ran the usual set of linters and type checkers.

It should now be ready for review 😊

Everyone asks what CI is doing, but no one asks *how* is CI?
@Sh3llcod3
Author

Sh3llcod3 commented Oct 23, 2025

I ran the final benchmark code without TLS just for fun:
(screenshot: benchmark results without TLS)

With TLS it's similar to what is in the description.

Thanks to my friend's benchmarks on his Windows machine, I've made the code even more resilient.

@lexiforest I will be travelling till the end of the month, so feel free to make any changes you see fit.



Development

Successfully merging this pull request may close these issues.

AsyncWebSocket is ~4.5x slower than AIOHTTP
