
Conversation

@Sh3llcod3

@Sh3llcod3 Sh3llcod3 commented Oct 10, 2025

Overview

This is a complete rewrite of the AsyncWebSocket. This overhaul replaces the implementation with a fully asynchronous, task-based architecture that is idiomatic, highly performant, and hardened for production use. It now fails predictably and cleans up its resources correctly under all circumstances.

This pull request also addresses several lifecycle and error-handling bugs in the AsyncWebSocket implementation.

The key focus of this rewrite is to achieve the best possible performance while keeping the code clean and breaking changes to a minimum. That said, there are some breaking changes, so let's get those out of the way.

Breaking Changes

  1. send() Error Handling is now asynchronous:

    • Old: A network error during a send() operation would raise an exception directly from the await ws.send(...) call.
    • New: A network error during a send is now caught by the background writer task. The exception is then placed on the receive queue and will be raised by a concurrent or subsequent call to await ws.recv(). Application-level error handling must be adapted to this new model.
  2. send() is now non-blocking and queued:

    • Old: await ws.send(...) would block until the data was passed to the OS socket buffer.
    • New: await ws.send(...) returns almost instantly after placing the message in an internal queue. To guarantee that all queued messages have been picked up for sending, callers must now use the new await ws.flush() method (see the usage sketch after this list).
  3. recv_fragment() Removed: The recv_fragment method is no longer part of the class. Message reassembly is now an internal implementation detail of the read loop.
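
To make the new model concrete, here is a minimal usage sketch. The URL is a placeholder, error handling is deliberately simplified, and it assumes only the send/flush/recv behaviour described above:

```python
import asyncio

from curl_cffi.requests import AsyncSession

async def main():
    async with AsyncSession() as session:
        # Placeholder URL; any WebSocket endpoint works.
        ws = await session.ws_connect("wss://echo.example.invalid")
        try:
            # send() only enqueues the frame and returns almost immediately.
            await ws.send(b"hello")
            # flush() waits until queued messages have been picked up for sending.
            await ws.flush()
            # Errors from the background writer task surface here, not from send().
            reply = await ws.recv()
            print(reply)
        except Exception as exc:
            # Network errors that used to come out of send() now arrive via recv().
            print(f"WebSocket error: {exc!r}")
        finally:
            await ws.close()

asyncio.run(main())
```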

Performance

The new implementation completely does away with the use of run_in_executor. In the original issue I raised, #645, I found that the executor overhead was massive and accounted for most of the time, leading to very low throughput figures. From reading the original code, it became clear that the curl socket is non-blocking and simply returns EAGAIN when it is not ready to be read from or written to.

This implementation handles those EAGAINs and uses behaviour similar to aselect(...) to wait on FD availability. Instead of running the libcurl calls in a thread pool to avoid blocking, the code opts for a cooperative-yielding approach: every N I/O operations, or after a certain amount of time has elapsed, it yields control to the event loop so that other tasks get a fair chance to run.

This is obviously a double-edged sword. Yielding too much time to the event loop lowers throughput, while hogging the event loop starves other tasks running on it. I've set values which I believe strike a good balance between throughput and yielding control fairly.
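
For illustration, the yielding pattern amounts to something like the following sketch; the function name and thresholds are illustrative, not the values actually used in the PR:

```python
import asyncio
import time

# Illustrative thresholds; the real values in the PR are tuned separately.
YIELD_EVERY_N_OPS = 64
YIELD_AFTER_SECONDS = 0.005

async def io_loop(do_one_io_op):
    """Run non-blocking I/O ops, periodically yielding to the event loop."""
    ops_since_yield = 0
    last_yield = time.monotonic()
    while True:
        did_work = do_one_io_op()  # returns False on EAGAIN / nothing to do
        ops_since_yield += 1
        now = time.monotonic()
        if (
            not did_work
            or ops_since_yield >= YIELD_EVERY_N_OPS
            or now - last_yield >= YIELD_AFTER_SECONDS
        ):
            # Give other tasks on the loop a fair chance to run.
            await asyncio.sleep(0)
            ops_since_yield = 0
            last_yield = time.monotonic()
```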

The current implementation uses background I/O tasks for reading and writing, which decouples the I/O operations from the public API. That means when you call send(...) or recv(...), you are simply performing a lightweight operation that adds the WS frame to an asyncio queue. When the FD next becomes available for writing, the writer task plucks the frame from the queue and writes it to the curl socket, and vice versa for reads. This aims to be a best-of-both-worlds approach, suitable for both continuous and sparse message patterns, by engaging CPU time only when it is needed and waiting efficiently when idle.

This decoupling also eliminates the risk of crashes when calling send/recv concurrently. The public methods only operate on asyncio queues, which are designed for concurrent use, and as long as there is only one instance of each I/O loop task, there is only ever a single reader and a single writer interacting with the curl socket, sequentially.
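
In outline, the queued-send side of that architecture looks roughly like this simplified sketch (the real loops also handle EAGAIN, batching, and shutdown):

```python
import asyncio

class QueuedSender:
    """Simplified model of the queued-send architecture described above."""

    def __init__(self, curl_ws_send):
        # Must be constructed inside a running event loop.
        self._curl_ws_send = curl_ws_send      # the underlying libcurl send call
        self._out_queue: asyncio.Queue = asyncio.Queue()
        self._writer_task = asyncio.create_task(self._write_loop())

    async def send(self, payload: bytes) -> None:
        # Lightweight: just enqueue the frame and return.
        await self._out_queue.put(payload)

    async def flush(self) -> None:
        # Wait until every queued frame has been picked up by the writer.
        await self._out_queue.join()

    async def _write_loop(self) -> None:
        while True:
            payload = await self._out_queue.get()
            try:
                self._curl_ws_send(payload)    # only the writer touches the socket
            finally:
                self._out_queue.task_done()
```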

On the write side, we use some optimization techniques, including adaptive batching and (optional) send coalescing, to reduce overhead and the number of system calls made. These techniques do work, but in this case the benefits are limited, as there seems to be a hard cap on send throughput (around 40 MiB/s on my machine). During development I took tens (if not hundreds) of cProfile performance captures, and no matter what I do, send() throughput is hard limited by curl's ws_send() call. It likely cannot be improved further.
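
As a rough illustration of the coalescing idea (not the exact batching logic in the PR), adjacent queued payloads are merged into a single frame up to curl's 64 KiB limit:

```python
MAX_FRAME_SIZE = 64 * 1024  # curl caps WebSocket frames at 64 KiB

def coalesce(pending: list[bytes]) -> tuple[bytes, list[bytes]]:
    """Merge leading payloads into one outgoing frame, up to the frame cap."""
    chunks: list[bytes] = []
    total = 0
    i = 0
    while i < len(pending) and total + len(pending[i]) <= MAX_FRAME_SIZE:
        chunks.append(pending[i])
        total += len(pending[i])
        i += 1
    if not chunks and pending:
        # A single payload larger than the cap goes out on its own;
        # splitting oversized payloads is not shown in this sketch.
        chunks, i = [pending[0]], 1
    # Return the merged frame plus whatever is left for the next pass.
    return b"".join(chunks), pending[i:]

# 32 of these 2 KiB messages merge into one 64 KiB frame; 8 remain for the next pass.
frame, rest = coalesce([b"\x00" * 2048] * 40)
```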

In case you are wondering: Does the send coalescing actually work? Yes!

Here is a cProfile with the flag turned on, sending 10 GiB of data with a 65k chunk size:

4385099 function calls (4379080 primitive calls) in 255.745 seconds

   Ordered by: cumulative time
   List reduced from 2431 to 40 due to restriction <40>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    238/1    0.005    0.000  255.745  255.745 {built-in method builtins.exec}
        1    0.000    0.000  255.745  255.745 ./client.py:1(<module>)
        1    0.000    0.000  254.919  254.919 /home/.cache/pypoetry/virtualenvs/ws-bench-_cej2YY4-py3.13/lib/python3.13/site-packages/uvloop/__init__.py:51(run)
        1    0.000    0.000  254.919  254.919 /home/.pyenv/versions/3.13.7/lib/python3.13/asyncio/runners.py:160(run)
        1    0.000    0.000  254.918  254.918 /home/.pyenv/versions/3.13.7/lib/python3.13/asyncio/runners.py:61(__exit__)
        1    0.000    0.000  254.918  254.918 /home/.pyenv/versions/3.13.7/lib/python3.13/asyncio/runners.py:64(close)
      2/1    0.000    0.000  254.917  254.917 /home/.pyenv/versions/3.13.7/lib/python3.13/threading.py:1000(_bootstrap)
      2/1    0.000    0.000  254.917  254.917 /home/.pyenv/versions/3.13.7/lib/python3.13/threading.py:1027(_bootstrap_inner)
      2/1    0.000    0.000  254.917  254.917 /home/.pyenv/versions/3.13.7/lib/python3.13/threading.py:983(run)
        1    0.214    0.214  254.905  254.905 /home/.pyenv/versions/3.13.7/lib/python3.13/asyncio/runners.py:86(run)
    10189    0.722    0.000  252.518    0.025 /home/.cache/pypoetry/virtualenvs/ws-bench-_cej2YY4-py3.13/lib/python3.13/site-packages/curl_cffi/requests/websockets.py:1019(_write_loop)
    10825    0.156    0.000  247.827    0.023 /home/.cache/pypoetry/virtualenvs/ws-bench-_cej2YY4-py3.13/lib/python3.13/site-packages/curl_cffi/requests/websockets.py:1086(_send_payload)
    10188    0.092    0.000  247.613    0.024 /home/.cache/pypoetry/virtualenvs/ws-bench-_cej2YY4-py3.13/lib/python3.13/site-packages/curl_cffi/curl.py:522(ws_send)
    10187  247.406    0.024  247.406    0.024 {built-in method curl_cffi._wrapper.curl_ws_send}
      637    3.077    0.005    3.077    0.005 {method 'join' of 'bytes' objects}
      628    0.001    0.000    1.953    0.003 /home/.cache/pypoetry/virtualenvs/ws-bench-_cej2YY4-py3.13/lib/python3.13/site-packages/uvloop/__init__.py:54(wrapper)
      629    0.001    0.000    1.951    0.003 ./client.py:78(main)
      629    0.293    0.000    1.951    0.003 ./client.py:30(run_benchmark_curl_cffi)

Here is another cProfile with the same transfer but without the feature:

6694317 function calls (6688299 primitive calls) in 262.812 seconds

   Ordered by: cumulative time
   List reduced from 2418 to 40 due to restriction <40>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    238/1    0.005    0.000  262.812  262.812 {built-in method builtins.exec}
        1    0.000    0.000  262.812  262.812 ./client.py:1(<module>)
        1    0.000    0.000  261.990  261.990 /home/.cache/pypoetry/virtualenvs/ws-bench-_cej2YY4-py3.13/lib/python3.13/site-packages/uvloop/__init__.py:51(run)
        1    0.000    0.000  261.990  261.990 /home/.pyenv/versions/3.13.7/lib/python3.13/asyncio/runners.py:160(run)
        1    0.000    0.000  261.989  261.989 /home/.pyenv/versions/3.13.7/lib/python3.13/asyncio/runners.py:61(__exit__)
        1    0.000    0.000  261.988  261.988 /home/.pyenv/versions/3.13.7/lib/python3.13/asyncio/runners.py:64(close)
      2/1    0.000    0.000  261.988  261.988 /home/.pyenv/versions/3.13.7/lib/python3.13/threading.py:1000(_bootstrap)
      2/1    0.000    0.000  261.988  261.988 /home/.pyenv/versions/3.13.7/lib/python3.13/threading.py:1027(_bootstrap_inner)
      2/1    0.000    0.000  261.988  261.988 /home/.pyenv/versions/3.13.7/lib/python3.13/threading.py:983(run)
        1    1.920    1.920  261.976  261.976 /home/.pyenv/versions/3.13.7/lib/python3.13/asyncio/runners.py:86(run)
   162871    1.041    0.000  257.854    0.002 /home/.cache/pypoetry/virtualenvs/ws-bench-_cej2YY4-py3.13/lib/python3.13/site-packages/curl_cffi/requests/websockets.py:1019(_write_loop)
   325740    1.637    0.000  256.072    0.001 /home/.cache/pypoetry/virtualenvs/ws-bench-_cej2YY4-py3.13/lib/python3.13/site-packages/curl_cffi/requests/websockets.py:1086(_send_payload)
   162870    0.935    0.000  253.749    0.002 /home/.cache/pypoetry/virtualenvs/ws-bench-_cej2YY4-py3.13/lib/python3.13/site-packages/curl_cffi/curl.py:522(ws_send)
   162869  251.539    0.002  251.539    0.002 {built-in method curl_cffi._wrapper.curl_ws_send}
      628    0.001    0.000    1.924    0.003 /home/.cache/pypoetry/virtualenvs/ws-bench-_cej2YY4-py3.13/lib/python3.13/site-packages/uvloop/__init__.py:54(wrapper)
      629    0.001    0.000    1.923    0.003 ./client.py:78(main)
      629    0.293    0.000    1.923    0.003 ./client.py:30(run_benchmark_curl_cffi)

Throughput is hard capped on my machine, so the transfer is not that much faster overall, but notice the difference.

That's 10188 calls to ws_send() with the feature versus 162870 calls without. Also, more than 95% of the time is spent in the blocking C function, with little overhead from the Python side.

Note: these numbers were taken before the max frame size was set to 64k; you can still get this benefit if you are aggregating frames smaller than that value.

With streaming protocols that do not care about frame boundaries, this works nicely for smaller chunks. Here is the speed of the same transfer with a 2048-byte chunk size, without and with coalesce_frames=True:

(screenshots: throughput without vs. with coalesce_frames=True)

On the curl FFI side, I've made one key optimization which significantly speeds up receive throughput. Rather than creating a new ffi.buffer(...) object for each fragment received, the code now pre-allocates that buffer once and reuses it. This avoids repeated memory allocations in a tight loop. It is safe to do because the buffer is only ever read up to a bounded length, so it does not need to be cleared or reallocated each time, and a bytes copy is made when the function returns, so there are no referencing issues.

I did try returning a memoryview(ffi.buffer(...)), which was quite a bit faster, but it ran into referencing issues: the view points at the pre-allocated buffer, so any later write to the buffer changes what every outstanding memoryview sees. Although copying bytes does not sound great, in practice every alternative short of returning a zero-copy view into the buffer turned out slower. I tried returning bytearray objects and extending fragments into a pre-allocated bytearray, but both were slower. In the end, keeping a list of bytes and calling b"".join(chunks) was the fastest way to assemble the fragments into one complete message.
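
Schematically, the receive-side pattern looks like this; the class name, buffer size, and the curl_ws_recv callable are hypothetical stand-ins for the real FFI plumbing:

```python
from cffi import FFI

ffi = FFI()

RECV_CHUNK = 64 * 1024  # bounded read size, so the buffer never needs clearing

class FragmentReceiver:
    """Reuse one pre-allocated buffer across fragments instead of allocating per read."""

    def __init__(self):
        self._c_buf = ffi.new("char[]", RECV_CHUNK)   # allocated once
        self._py_buf = ffi.buffer(self._c_buf)        # wrapped once, reused

    def read_fragment(self, curl_ws_recv) -> bytes:
        # curl_ws_recv (hypothetical callable) fills self._c_buf and reports
        # how many bytes it wrote.
        n_read = curl_ws_recv(self._c_buf, RECV_CHUNK)
        # Slicing the ffi.buffer copies exactly the bytes received; the copy
        # decouples the result from the reused buffer, avoiding the
        # memoryview aliasing problem described above.
        return self._py_buf[:n_read]

def assemble(fragments: list[bytes]) -> bytes:
    # Joining a list of bytes was the fastest reassembly strategy measured.
    return b"".join(fragments)
```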

Benchmark Code

Now let's get to the interesting part. The numbers tell their own story.

In order to arrive at these numbers, I am running two different sets of benchmarks.

  1. The first benchmark (client.py and server_ssl.py) simply sends and/or receives in a loop (depending on what code is commented out) and does nothing other than time how long it took (a schematic of its loop follows this list).

  2. The second benchmark (benchmark.py) is slightly more realistic. It reads a file from disk into memory and loads its hash. Depending on whether it is run in server or client mode, it will receive or send that file; the other end calculates a hash of the received data, compares it against the loaded hash of the file on disk, and tells you whether it matches. This will be slower due to the higher overhead.
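
Schematically, the first benchmark's send loop is along these lines. This is a simplified stand-in for the actual client.py, using a plain ws:// URL and illustrative sizes:

```python
import asyncio
import time

from curl_cffi.requests import AsyncSession

CHUNK = b"\x00" * 65536       # 64 KiB payload per send
TOTAL_BYTES = 1 * 1024**3     # 1 GiB total (illustrative)
URL = "ws://127.0.0.1:8765"   # local sink/echo server; the real benchmarks also cover TLS

async def main():
    async with AsyncSession() as session:
        ws = await session.ws_connect(URL)
        sent = 0
        start = time.perf_counter()
        while sent < TOTAL_BYTES:
            await ws.send(CHUNK)
            sent += len(CHUNK)
        await ws.flush()      # ensure everything was handed off for sending
        elapsed = time.perf_counter() - start
        await ws.close()
    print(f"Sent {sent / 1024**3:.2f} GiB in {elapsed:.2f} s "
          f"({sent * 8 / elapsed / 1e9:.2f} Gbps)")

asyncio.run(main())
```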

Neither benchmark is good and the benchmark code is bad. They are simply a rough proof of concept.

I am using btop to monitor throughput. I run each test more than once, noting that the first run is sometimes measurably slower, possibly due to CPU caches; either way, a 5-10% variation on these figures should be factored in. I am using an HPE DL-60 Gen 9 rack server. This is an ancient server which clocks up to 1.6 GHz, so your results may vary. One nice aspect is that it has 80 GB of RAM, so I can run the benchmarks with 20 GB files loaded in memory without any issues.

In each test, I will compare the existing code vs new code from this PR.

Files

  1. client.py
  2. server_ssl.py
  3. benchmark.py
  4. pyproject.toml

Slow Case

The slow case is where the chunk size in all the benchmarks has been set to a small value, i.e. 2048 bytes. This increases the number of libcurl calls needed to transfer the same amount of data and slows down throughput.

Download

client.py:

(screenshots: existing implementation vs. this PR)

benchmark.py:

(screenshots: existing implementation vs. this PR)

Upload

client.py:

(screenshots: existing implementation vs. this PR)

benchmark.py:

(screenshots: existing implementation vs. this PR)

Average Case

The average case only considers benchmark.py results with a larger chunk size of 65536 bytes.

Download

(screenshots: existing implementation vs. this PR)

Upload

(screenshots: existing implementation vs. this PR)

Best Case

These numbers are unrealistic and unlikely to be achievable in a real application. This only considers the client.py benchmark, which effectively just loops and does absolutely nothing with the received/sent data. Any application processing logic will add overhead and slow down throughput. These figures are included as an optimistic "if there were nothing else, how fast could we go?" scenario. The chunk size is 65536 bytes.

Download

(screenshots: existing implementation vs. this PR)

My friend also ran this benchmark on his Windows machine, which has a modern CPU about 10 years newer than my poor server:

--- curl-cffi Benchmark Complete ---
Sent 200.00 GB of data in 91.74 seconds.
Average throughput: 17.44 Gbps.

Upload

(screenshots: existing implementation vs. this PR)

Concurrent

Here is a test from the same benchmark, sending and receiving concurrently (without/with frame coalescing):

(screenshots: without vs. with frame coalescing)

The code provides a number of tunable parameters to tweak concurrent performance, such as yield_interval, fair_scheduling and yield_mask. All of these can be set by the end user to suit their needs.
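
As a purely hypothetical example of what tuning could look like: the parameter names come from this PR, but where they are accepted and their exact types/values are assumptions made for illustration only.

```python
from curl_cffi.requests import AsyncSession

async def tuned_demo():
    session = AsyncSession()
    # Hypothetical: whether ws_connect is the place these are passed, and what
    # their types/values should be, is assumed here rather than taken from the PR.
    ws = await session.ws_connect(
        "wss://example.invalid/feed",
        yield_interval=0.005,   # assumed: seconds between forced yields to the loop
        fair_scheduling=True,   # assumed: favour loop fairness over peak throughput
        yield_mask=0x3F,        # assumed: yield roughly every 64 I/O operations
    )
    await ws.close()
    await session.close()
```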

Closing Note

Should you trust my benchmarks and statements? Absolutely not.

It's always best to run your own benchmarks (and double-check the code yourself), since I have no idea what your use case is. I am benchmarking on localhost to eliminate network conditions as a factor, but real internet conditions may impact the numbers significantly, in which case the relevant changes should be made.

Also, if you think the current send(...) API does not provide adequate guarantees of transmission, I can change it so that:

  • When send(...) is called, it inserts the content to be sent along with an asyncio.Future object into the send queue
  • The send(...) method then either returns a reference to the Future object or awaits it directly.
  • The write loop picks up this queue item and prepares it for transmission
  • After ws_send(...) is called, the Future result is either set to the number of bytes sent or an Exception object

That would guarantee that the item has been sent out, but it would increase memory pressure and slow down sending, so that trade-off should be considered. If it's needed, it can be done as part of this PR or in a follow-up PR. I also need to think through how that would work for coalesced sends.
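
For concreteness, a minimal sketch of that Future-based variant might look like this (illustrative only, not part of this PR; starting the writer task is omitted):

```python
import asyncio

class ConfirmedSender:
    """Sketch of the proposed Future-per-send variant."""

    def __init__(self, curl_ws_send):
        self._curl_ws_send = curl_ws_send
        self._queue: asyncio.Queue = asyncio.Queue()

    async def send(self, payload: bytes) -> int:
        fut = asyncio.get_running_loop().create_future()
        await self._queue.put((payload, fut))
        # Awaiting the future means "this frame was actually handed to curl".
        return await fut

    async def _write_loop(self) -> None:
        while True:
            payload, fut = await self._queue.get()
            try:
                sent = self._curl_ws_send(payload)   # number of bytes sent
            except Exception as exc:
                fut.set_exception(exc)
            else:
                fut.set_result(sent)
```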

There is definitely room for improvement with this code and the architecture overall. I acknowledge that this is just one person's attempt at making a useful library better (although various LLMs were used to review the code and find improvements).

Feel free to suggest improvements. I will do my best to integrate them and ensure the code is in the best state possible.

@lexiforest
Owner

Cool, the improvements are significant. Thanks!

@lexiforest lexiforest linked an issue Oct 11, 2025 that may be closed by this pull request
@Sh3llcod3
Author


Almost there, just need to make it resilient against edge case race conditions, then it should be good to review

@Sh3llcod3
Author

Sh3llcod3 commented Oct 15, 2025

@lexiforest Unless I find anything else, the code should be good to review now 😎

I've updated the description.

@lexiforest
Owner

Thanks, this is a huge PR, it could take some time to review. :)

@Sh3llcod3
Author

Sh3llcod3 commented Oct 15, 2025

Sounds good, I shall await :)

(Also whoops I set the frame size to 1MB, curl max frame size is 64k according to their docs 😅)

@lexiforest
Owner

lexiforest commented Oct 16, 2025

Do you mind adding some of your benchmark code here? And also mention it in the readme, if you think it helps.

@Sh3llcod3
Author

Added first benchmark, I will add the second one soon.

@Sh3llcod3
Author

Sh3llcod3 commented Oct 22, 2025

@lexiforest Wow that took a while, but we got there in the end.

Both benchmarks are there. I've also added instructions on running them into the benchmark README. They should work cross-platform straight away.

I've gone through and checked all the code and ran the usual set of linters and type checkers.

It should now be ready for review 😊

Everyone asks what CI is doing, but no one asks *how* is CI?
@Sh3llcod3
Author

Sh3llcod3 commented Oct 23, 2025

I ran the final benchmark code without TLS just for fun:
(screenshot: benchmark results without TLS)

With TLS it's similar to what is in the description.

Thanks to my friend's benchmarks on his Windows machine, I've made the code even more resilient.

@lexiforest I will be travelling till the end of the month, so feel free to make any changes you see fit.



Development

Successfully merging this pull request may close these issues.

AsyncWebSocket is ~4.5x slower than AIOHTTP
