
feat: Optimize neqo-transport #2828


Open · larseggert wants to merge 2 commits into main

Conversation

larseggert (Collaborator)

Similar to #2827.

codecov bot commented Aug 6, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.93%. Comparing base (6942acc) to head (147e311).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2828      +/-   ##
==========================================
- Coverage   94.93%   94.93%   -0.01%     
==========================================
  Files         115      115              
  Lines       34425    34426       +1     
  Branches    34425    34426       +1     
==========================================
- Hits        32682    32681       -1     
  Misses       1736     1736              
- Partials        7        9       +2     
Components Coverage Δ
neqo-common 97.73% <ø> (ø)
neqo-crypto 89.91% <ø> (ø)
neqo-http3 93.72% <ø> (ø)
neqo-qpack 95.45% <ø> (ø)
neqo-transport 95.94% <100.00%> (-0.02%) ⬇️
neqo-udp 89.85% <ø> (ø)

github-actions bot commented Aug 6, 2025

Failed Interop Tests

QUIC Interop Runner, client vs. server, differences relative to 853e4be.

neqo-latest as client

neqo-latest as server

All results

Succeeded Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest as server

Unsupported Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest as server

github-actions bot commented Aug 6, 2025

Client/server transfer results

Performance differences relative to 6942acc.

Transfer of 33554432 bytes over loopback, min. 100 runs. All unit-less numbers are in milliseconds.

Client vs. server (params) Mean ± σ Min Max MiB/s ± σ Δ main (ms) Δ main (%)
google vs. google 450.6 ± 3.9 444.5 465.0 71.0 ± 8.2
google vs. neqo (cubic, paced) 271.5 ± 4.6 262.7 280.2 117.9 ± 7.0 0.9 0.3%
msquic vs. msquic 130.0 ± 18.6 110.6 219.2 246.1 ± 1.7
msquic vs. neqo (cubic, paced) 164.4 ± 23.4 126.5 215.9 194.7 ± 1.4 💔 13.6 9.0%
neqo vs. google (cubic, paced) 754.5 ± 4.7 747.5 767.0 42.4 ± 6.8 💚 -2.4 -0.3%
neqo vs. msquic (cubic, paced) 173.4 ± 21.1 147.4 213.2 184.6 ± 1.5 💔 18.6 12.0%
neqo vs. neqo (cubic) 87.4 ± 3.8 82.2 94.9 366.3 ± 8.4 💚 -3.1 -3.4%
neqo vs. neqo (cubic, paced) 88.3 ± 4.2 80.6 97.8 362.4 ± 7.6 💚 -4.3 -4.7%
neqo vs. neqo (reno) 87.8 ± 4.3 78.9 99.4 364.4 ± 7.4 💚 -3.4 -3.8%
neqo vs. neqo (reno, paced) 87.6 ± 4.1 80.4 103.8 365.2 ± 7.8 💚 -5.3 -5.7%
neqo vs. quiche (cubic, paced) 191.2 ± 4.1 185.8 204.3 167.3 ± 7.8 -0.3 -0.2%
neqo vs. s2n (cubic, paced) 223.4 ± 4.2 214.4 231.3 143.2 ± 7.6 💔 2.4 1.1%
quiche vs. neqo (cubic, paced) 157.6 ± 5.3 147.8 186.7 203.1 ± 6.0 💔 11.8 8.1%
quiche vs. quiche 144.1 ± 5.2 136.9 169.1 222.1 ± 6.2
s2n vs. neqo (cubic, paced) 170.4 ± 4.3 162.5 179.5 187.8 ± 7.4 💚 -2.4 -1.4%
s2n vs. s2n 247.3 ± 24.8 234.0 350.1 129.4 ± 1.3

Download data for profiler.firefox.com or download performance comparison data.

github-actions bot commented Aug 6, 2025

Benchmark results

Performance differences relative to 6942acc.

1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client: 💚 Performance has improved.
       time:   [195.00 ms 195.32 ms 195.69 ms]
       thrpt:  [511.02 MiB/s 511.99 MiB/s 512.83 MiB/s]
change:
       time:   [−5.2922% −5.0819% −4.8476%] (p = 0.00 < 0.05)
       thrpt:  [+5.0946% +5.3540% +5.5879%]

Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe

1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client: Change within noise threshold.
       time:   [302.03 ms 303.61 ms 305.16 ms]
       thrpt:  [32.770 Kelem/s 32.937 Kelem/s 33.109 Kelem/s]
change:
       time:   [−1.8754% −1.1915% −0.4966%] (p = 0.00 < 0.05)
       thrpt:  [+0.4991% +1.2059% +1.9112%]

1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client: No change in performance detected.
       time:   [28.149 ms 28.310 ms 28.489 ms]
       thrpt:  [35.102   B/s 35.323   B/s 35.525   B/s]
change:
       time:   [−0.1431% +0.5977% +1.4399%] (p = 0.14 > 0.05)
       thrpt:  [−1.4195% −0.5942% +0.1433%]

Found 8 outliers among 100 measurements (8.00%)
8 (8.00%) high severe

1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client: 💚 Performance has improved.
       time:   [192.20 ms 193.88 ms 196.32 ms]
       thrpt:  [509.36 MiB/s 515.78 MiB/s 520.28 MiB/s]
change:
       time:   [−5.9614% −5.1597% −3.9955%] (p = 0.00 < 0.05)
       thrpt:  [+4.1617% +5.4404% +6.3393%]

Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high severe

decode 4096 bytes, mask ff: No change in performance detected.
       time:   [11.603 µs 11.629 µs 11.663 µs]
       change: [−0.9220% −0.3198% +0.2356%] (p = 0.31 > 0.05)

Found 16 outliers among 100 measurements (16.00%)
2 (2.00%) low severe
4 (4.00%) low mild
3 (3.00%) high mild
7 (7.00%) high severe

decode 1048576 bytes, mask ff: No change in performance detected.
       time:   [3.0040 ms 3.0128 ms 3.0223 ms]
       change: [−1.0120% −0.4049% +0.1488%] (p = 0.18 > 0.05)

Found 8 outliers among 100 measurements (8.00%)
1 (1.00%) low mild
7 (7.00%) high severe

decode 4096 bytes, mask 7f: No change in performance detected.
       time:   [19.363 µs 19.417 µs 19.482 µs]
       change: [−0.3080% +0.0641% +0.4382%] (p = 0.75 > 0.05)

Found 19 outliers among 100 measurements (19.00%)
2 (2.00%) low severe
3 (3.00%) low mild
2 (2.00%) high mild
12 (12.00%) high severe

decode 1048576 bytes, mask 7f: No change in performance detected.
       time:   [5.0946 ms 5.1095 ms 5.1259 ms]
       change: [−0.1741% +0.2253% +0.6016%] (p = 0.26 > 0.05)

Found 20 outliers among 100 measurements (20.00%)
1 (1.00%) low mild
19 (19.00%) high severe

decode 4096 bytes, mask 3f: No change in performance detected.
       time:   [5.5439 µs 5.5918 µs 5.6508 µs]
       change: [−0.2087% +0.9037% +2.2182%] (p = 0.17 > 0.05)

Found 11 outliers among 100 measurements (11.00%)
1 (1.00%) high mild
10 (10.00%) high severe

decode 1048576 bytes, mask 3f: Change within noise threshold.
       time:   [1.7791 ms 1.7918 ms 1.8060 ms]
       change: [+0.8316% +1.6140% +2.3437%] (p = 0.00 < 0.05)

Found 22 outliers among 100 measurements (22.00%)
22 (22.00%) high severe

coalesce_acked_from_zero 1+1 entries: No change in performance detected.
       time:   [89.013 ns 89.354 ns 89.693 ns]
       change: [−0.2473% +0.2300% +0.7670%] (p = 0.36 > 0.05)

Found 10 outliers among 100 measurements (10.00%)
8 (8.00%) high mild
2 (2.00%) high severe

coalesce_acked_from_zero 3+1 entries: No change in performance detected.
       time:   [106.29 ns 106.64 ns 107.01 ns]
       change: [−0.8938% −0.2761% +0.2952%] (p = 0.38 > 0.05)

Found 12 outliers among 100 measurements (12.00%)
1 (1.00%) high mild
11 (11.00%) high severe

coalesce_acked_from_zero 10+1 entries: No change in performance detected.
       time:   [105.60 ns 105.96 ns 106.42 ns]
       change: [−1.0931% +0.5319% +3.0132%] (p = 0.74 > 0.05)

Found 11 outliers among 100 measurements (11.00%)
2 (2.00%) low mild
1 (1.00%) high mild
8 (8.00%) high severe

coalesce_acked_from_zero 1000+1 entries: No change in performance detected.
       time:   [89.666 ns 89.779 ns 89.910 ns]
       change: [−0.2372% +0.5914% +1.4468%] (p = 0.20 > 0.05)

Found 8 outliers among 100 measurements (8.00%)
5 (5.00%) high mild
3 (3.00%) high severe

RxStreamOrderer::inbound_frame(): Change within noise threshold.
       time:   [107.88 ms 107.95 ms 108.03 ms]
       change: [−0.6885% −0.5791% −0.4780%] (p = 0.00 < 0.05)

Found 28 outliers among 100 measurements (28.00%)
11 (11.00%) low severe
4 (4.00%) high mild
13 (13.00%) high severe

sent::Packets::take_ranges: 💚 Performance has improved.
       time:   [5.0169 µs 5.0845 µs 5.1430 µs]
       change: [−44.394% −38.209% −28.473%] (p = 0.00 < 0.05)

Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high severe

transfer/pacing-false/varying-seeds: Change within noise threshold.
       time:   [36.501 ms 36.570 ms 36.640 ms]
       change: [−2.7021% −2.3694% −2.0743%] (p = 0.00 < 0.05)

Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild

transfer/pacing-true/varying-seeds: Change within noise threshold.
       time:   [37.649 ms 37.778 ms 37.916 ms]
       change: [−1.9376% −1.4611% −0.9861%] (p = 0.00 < 0.05)

Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high severe

transfer/pacing-false/same-seed: Change within noise threshold.
       time:   [36.652 ms 36.706 ms 36.764 ms]
       change: [−1.9786% −1.7004% −1.4319%] (p = 0.00 < 0.05)

Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high severe

transfer/pacing-true/same-seed: Change within noise threshold.
       time:   [38.146 ms 38.228 ms 38.326 ms]
       change: [−2.0171% −1.7383% −1.4342%] (p = 0.00 < 0.05)

Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) low mild
1 (1.00%) high severe

Download data for profiler.firefox.com or download performance comparison data.

@@ -3211,7 +3214,7 @@ impl Connection {
             .streams_mut()
             .inbound_frame(space, offset, data)?;
         if self.crypto.streams().data_ready(space) {
-            let mut buf = Vec::new();
+            let mut buf = Vec::with_capacity(16384); // Typical handshake message size
larseggert (Collaborator, Author)

This is obviously way too big, but I wonder if 1500 makes sense.
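
One alternative to guessing a constant would be to size the reservation from the data that is actually ready. A minimal sketch of that idea; `CryptoStream`, `ready_len()` and `read_to_end()` are hypothetical stand-ins, not neqo's actual types or methods:

```rust
// Sketch only: hypothetical types, not neqo API.
struct CryptoStream {
    ready: Vec<u8>,
}

impl CryptoStream {
    /// How many bytes of contiguous crypto data are ready to read.
    fn ready_len(&self) -> usize {
        self.ready.len()
    }

    /// Drain all ready data into `buf`.
    fn read_to_end(&mut self, buf: &mut Vec<u8>) {
        buf.append(&mut self.ready);
    }
}

fn main() {
    let mut s = CryptoStream { ready: vec![0u8; 1200] };
    // Reserve exactly what is ready instead of a guessed 1500 or 16384.
    let mut buf = Vec::with_capacity(s.ready_len());
    s.read_to_end(&mut buf);
    assert_eq!(buf.len(), 1200);
    assert!(buf.capacity() >= 1200);
}
```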

@@ -630,7 +630,7 @@ impl Loss {
             return (Vec::new(), Vec::new());
         };
         let loss_delay = primary_path.borrow().rtt().loss_delay();
-        let mut lost = Vec::new();
+        let mut lost = Vec::with_capacity(8); // Typically few packets are lost at once
larseggert (Collaborator, Author)

This might make sense, but I wish we had some data on loss events rather than a guess.
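
One way to get that data: count how many packets each loss-detection event actually returns before picking a capacity. A minimal sketch of such instrumentation; the `LossStats` type and its wiring are hypothetical, not neqo code:

```rust
/// Sketch only: a histogram of packets lost per detection event,
/// bucketed around the capacities guessed in this PR (8 and 16).
#[derive(Default, Debug)]
struct LossStats {
    buckets: [u64; 4], // 0, 1..=8, 9..=16, >16
}

impl LossStats {
    fn record(&mut self, lost: usize) {
        let i = match lost {
            0 => 0,
            1..=8 => 1,
            9..=16 => 2,
            _ => 3,
        };
        self.buckets[i] += 1;
    }
}

fn main() {
    let mut stats = LossStats::default();
    // In neqo this would be fed the final `lost.len()` from the hunk above.
    for burst in [0, 0, 2, 1, 0, 11, 3] {
        stats.record(burst);
    }
    // With enough samples, pick a capacity covering e.g. the 99th percentile.
    println!("{stats:?}");
}
```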

@@ -873,7 +873,7 @@ impl Loss {
         let loss_delay = primary_path.borrow().rtt().loss_delay();
         let confirmed = self.confirmed();
 
-        let mut lost_packets = Vec::new();
+        let mut lost_packets = Vec::with_capacity(16); // Pre-allocate for typical PTO scenarios
larseggert (Collaborator, Author)

Ditto.

larseggert marked this pull request as ready for review August 7, 2025 09:35
martinthomson (Member) left a comment

Oops, I forgot to hit Submit. Are you suggesting that this is a performance win, so we should take it, even if we have no idea why? Jus' Vibe'n?

@@ -630,7 +630,7 @@ impl Loss {
             return (Vec::new(), Vec::new());
         };
         let loss_delay = primary_path.borrow().rtt().loss_delay();
-        let mut lost = Vec::new();
+        let mut lost = Vec::with_capacity(8); // Typically few packets are lost at once
martinthomson (Member)

Did Claude come up with this number on its own? Because I have questions. It's a reasonable thing to do, but a number as small as 8 is probably low enough that the first allocation will be that large anyway. And if there are NO lost packets, you made things slower.
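
For reference, a standalone demo of how an untouched `Vec` grows as it is pushed into. The exact schedule is an implementation detail of the current std allocator, so the printed capacities are illustrative, not guaranteed:

```rust
fn main() {
    // Element size roughly in the range of a per-packet record.
    let mut v: Vec<[u8; 64]> = Vec::new();
    let mut cap = v.capacity();
    println!("start: capacity {cap}");
    for i in 1..=20 {
        v.push([0; 64]);
        if v.capacity() != cap {
            cap = v.capacity();
            println!("after push {i:2}: capacity {cap}");
        }
    }
    // Typically prints 4, 8, 16, 32: with_capacity(8) saves at most one
    // reallocation over the default growth path.
}
```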

larseggert (Collaborator, Author)

Yeah, this is @claude's number.

@@ -3211,7 +3214,7 @@ impl Connection {
             .streams_mut()
             .inbound_frame(space, offset, data)?;
         if self.crypto.streams().data_ready(space) {
-            let mut buf = Vec::new();
+            let mut buf = Vec::with_capacity(16384); // Typical handshake message size
martinthomson (Member)

This number I know to be wrong. I know that we have certificates (which as a client we never send...) that can be large-ish, but not that large. In practice, the size we might pre-allocate here would be around 2k, not 16k.
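
If the pre-allocation is kept at all, a named constant that encodes this reasoning would at least make the guess auditable. A sketch, with the 2 KiB figure taken from the comment above; the constant name is invented:

```rust
/// Hypothetical constant: handshake flights buffered here are dominated
/// by certificate chains, which are typically around 2 KiB, not 16 KiB.
const TYPICAL_CRYPTO_MSG_SIZE: usize = 2048;

fn crypto_buf() -> Vec<u8> {
    // Covers the common case; the Vec still grows for larger messages.
    Vec::with_capacity(TYPICAL_CRYPTO_MSG_SIZE)
}

fn main() {
    let buf = crypto_buf();
    assert!(buf.capacity() >= TYPICAL_CRYPTO_MSG_SIZE);
}
```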

@@ -873,7 +873,7 @@ impl Loss {
         let loss_delay = primary_path.borrow().rtt().loss_delay();
         let confirmed = self.confirmed();
 
-        let mut lost_packets = Vec::new();
+        let mut lost_packets = Vec::with_capacity(16); // Pre-allocate for typical PTO scenarios
martinthomson (Member)

Again, no evidence for the number.

Comment on lines +2572 to +2575
+        let mut send_buffer = Vec::with_capacity(min(
+            DatagramBatch::MAX,
+            path.borrow().plpmtu() * max_datagrams.get(),
+        ));
martinthomson (Member)

This looks pretty clever, until you realize that this is just copied from a few lines down. Also, nope.

martinthomson (Member)

Previous attempt with no sign of impact: #2782

larseggert (Collaborator, Author) commented Aug 7, 2025

> Are you suggesting that this is a performance win, so we should take it, even if we have no idea why?

There seem to be some modest improvements here from the preallocations, but the preallocation amounts should be informed by some data.

> Jus' Vibe'n?

Kinda. I'm asking @claude to do various things to see what it can do, and for the (few) ones where the results look like they might be the beginning of something, I open a draft PR, mostly so that the benches run. (I tried to ask @claude to run benches locally to verify improvements - and even to make new benches for each specific change it proposes - but it kinda fails hard when confronted with that ask.)
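
On "informed by some data" and the cost when nothing is lost: a crude wall-clock sketch of the empty-result trade-off. This is a rough local check, not one of the repo's criterion benches:

```rust
use std::time::{Duration, Instant};

fn run(prealloc: bool, iters: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        let v: Vec<u64> = if prealloc {
            Vec::with_capacity(16) // allocates even when nothing is lost
        } else {
            Vec::new() // allocation-free until the first push
        };
        std::hint::black_box(v); // keep the allocator call from being elided
    }
    start.elapsed()
}

fn main() {
    let iters = 1_000_000;
    println!("Vec::new():             {:?}", run(false, iters));
    println!("Vec::with_capacity(16): {:?}", run(true, iters));
}
```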
