feat: Optimize neqo-transport
#2828
Conversation
Similar to mozilla#2827.
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main    #2828      +/-  ##
==========================================
- Coverage   94.93%   94.93%   -0.01%
==========================================
  Files         115      115
  Lines       34425    34426       +1
  Branches    34425    34426       +1
==========================================
- Hits        32682    32681       -1
  Misses       1736     1736
- Partials        7        9       +2
Failed Interop Tests: QUIC Interop Runner, client vs. server, differences relative to 853e4be (neqo-latest as client, neqo-latest as server).
Succeeded Interop Tests: QUIC Interop Runner, client vs. server (neqo-latest as client, neqo-latest as server).
Unsupported Interop Tests: QUIC Interop Runner, client vs. server (neqo-latest as client, neqo-latest as server).
Client/server transfer results: Performance differences relative to 6942acc. Transfer of 33554432 bytes over loopback, min. 100 runs. All unit-less numbers are in milliseconds.
Benchmark results
Performance differences relative to 6942acc.

1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client: 💚 Performance has improved.
    time:   [195.00 ms 195.32 ms 195.69 ms]
    thrpt:  [511.02 MiB/s 511.99 MiB/s 512.83 MiB/s]
    change: time [−5.2922% −5.0819% −4.8476%] (p = 0.00 < 0.05), thrpt [+5.0946% +5.3540% +5.5879%]

1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client: Change within noise threshold.
    time:   [302.03 ms 303.61 ms 305.16 ms]
    thrpt:  [32.770 Kelem/s 32.937 Kelem/s 33.109 Kelem/s]
    change: time [−1.8754% −1.1915% −0.4966%] (p = 0.00 < 0.05), thrpt [+0.4991% +1.2059% +1.9112%]

1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client: No change in performance detected.
    time:   [28.149 ms 28.310 ms 28.489 ms]
    thrpt:  [35.102 B/s 35.323 B/s 35.525 B/s]
    change: time [−0.1431% +0.5977% +1.4399%] (p = 0.14 > 0.05), thrpt [−1.4195% −0.5942% +0.1433%]

1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client: 💚 Performance has improved.
    time:   [192.20 ms 193.88 ms 196.32 ms]
    thrpt:  [509.36 MiB/s 515.78 MiB/s 520.28 MiB/s]
    change: time [−5.9614% −5.1597% −3.9955%] (p = 0.00 < 0.05), thrpt [+4.1617% +5.4404% +6.3393%]

decode 4096 bytes, mask ff: No change in performance detected.
    time:   [11.603 µs 11.629 µs 11.663 µs]
    change: [−0.9220% −0.3198% +0.2356%] (p = 0.31 > 0.05)

decode 1048576 bytes, mask ff: No change in performance detected.
    time:   [3.0040 ms 3.0128 ms 3.0223 ms]
    change: [−1.0120% −0.4049% +0.1488%] (p = 0.18 > 0.05)

decode 4096 bytes, mask 7f: No change in performance detected.
    time:   [19.363 µs 19.417 µs 19.482 µs]
    change: [−0.3080% +0.0641% +0.4382%] (p = 0.75 > 0.05)

decode 1048576 bytes, mask 7f: No change in performance detected.
    time:   [5.0946 ms 5.1095 ms 5.1259 ms]
    change: [−0.1741% +0.2253% +0.6016%] (p = 0.26 > 0.05)

decode 4096 bytes, mask 3f: No change in performance detected.
    time:   [5.5439 µs 5.5918 µs 5.6508 µs]
    change: [−0.2087% +0.9037% +2.2182%] (p = 0.17 > 0.05)

decode 1048576 bytes, mask 3f: Change within noise threshold.
    time:   [1.7791 ms 1.7918 ms 1.8060 ms]
    change: [+0.8316% +1.6140% +2.3437%] (p = 0.00 < 0.05)

coalesce_acked_from_zero 1+1 entries: No change in performance detected.
    time:   [89.013 ns 89.354 ns 89.693 ns]
    change: [−0.2473% +0.2300% +0.7670%] (p = 0.36 > 0.05)

coalesce_acked_from_zero 3+1 entries: No change in performance detected.
    time:   [106.29 ns 106.64 ns 107.01 ns]
    change: [−0.8938% −0.2761% +0.2952%] (p = 0.38 > 0.05)

coalesce_acked_from_zero 10+1 entries: No change in performance detected.
    time:   [105.60 ns 105.96 ns 106.42 ns]
    change: [−1.0931% +0.5319% +3.0132%] (p = 0.74 > 0.05)

coalesce_acked_from_zero 1000+1 entries: No change in performance detected.
    time:   [89.666 ns 89.779 ns 89.910 ns]
    change: [−0.2372% +0.5914% +1.4468%] (p = 0.20 > 0.05)

RxStreamOrderer::inbound_frame(): Change within noise threshold.
    time:   [107.88 ms 107.95 ms 108.03 ms]
    change: [−0.6885% −0.5791% −0.4780%] (p = 0.00 < 0.05)

sent::Packets::take_ranges: 💚 Performance has improved.
    time:   [5.0169 µs 5.0845 µs 5.1430 µs]
    change: [−44.394% −38.209% −28.473%] (p = 0.00 < 0.05)

transfer/pacing-false/varying-seeds: Change within noise threshold.
    time:   [36.501 ms 36.570 ms 36.640 ms]
    change: [−2.7021% −2.3694% −2.0743%] (p = 0.00 < 0.05)

transfer/pacing-true/varying-seeds: Change within noise threshold.
    time:   [37.649 ms 37.778 ms 37.916 ms]
    change: [−1.9376% −1.4611% −0.9861%] (p = 0.00 < 0.05)

transfer/pacing-false/same-seed: Change within noise threshold.
    time:   [36.652 ms 36.706 ms 36.764 ms]
    change: [−1.9786% −1.7004% −1.4319%] (p = 0.00 < 0.05)

transfer/pacing-true/same-seed: Change within noise threshold.
    time:   [38.146 ms 38.228 ms 38.326 ms]
    change: [−2.0171% −1.7383% −1.4342%] (p = 0.00 < 0.05)
@@ -3211,7 +3214,7 @@ impl Connection {
             .streams_mut()
             .inbound_frame(space, offset, data)?;
         if self.crypto.streams().data_ready(space) {
-            let mut buf = Vec::new();
+            let mut buf = Vec::with_capacity(16384); // Typical handshake message size
This is obviously way too big, but I wonder if 1500 makes sense.
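To make the tradeoff concrete: a capacity hint only pays off when the final length stays at or below it, since Vec grows geometrically once the hint is exceeded. A minimal, self-contained sketch assuming a hypothetical 1500-byte hint (none of these names are neqo's API):

fn read_ready_crypto(chunks: &[&[u8]]) -> Vec<u8> {
    // Hypothetical hint: roughly one full-sized UDP payload of CRYPTO data.
    const CRYPTO_BUF_HINT: usize = 1500;
    let mut buf = Vec::with_capacity(CRYPTO_BUF_HINT);
    for chunk in chunks {
        buf.extend_from_slice(chunk); // reallocates only if the hint is exceeded
    }
    buf
}

fn main() {
    let buf = read_ready_crypto(&[&[0u8; 1200], &[0u8; 200]]);
    assert_eq!(buf.len(), 1400);
    assert!(buf.capacity() >= 1500); // no reallocation was needed here
}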
@@ -630,7 +630,7 @@ impl Loss {
             return (Vec::new(), Vec::new());
         };
         let loss_delay = primary_path.borrow().rtt().loss_delay();
-        let mut lost = Vec::new();
+        let mut lost = Vec::with_capacity(8); // Typically few packets are lost at once
This might make sense, but I wish we had some data on loss events rather than a guess.
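Gathering that data needn't be heavyweight. One option is to bucket the size of each loss event and dump the histogram; LossStats below is a hypothetical instrumentation sketch, not existing neqo code:

#[derive(Default, Debug)]
struct LossStats {
    // Buckets for 0, 1, 2-3, 4-7, 8-15, and 16+ packets lost per event.
    buckets: [u64; 6],
}

impl LossStats {
    fn record(&mut self, lost: usize) {
        let idx = match lost {
            0 => 0,
            1 => 1,
            2..=3 => 2,
            4..=7 => 3,
            8..=15 => 4,
            _ => 5,
        };
        self.buckets[idx] += 1;
    }
}

fn main() {
    let mut stats = LossStats::default();
    for lost in [0, 0, 1, 2, 0, 5] {
        stats.record(lost); // call once per loss-detection pass
    }
    println!("{stats:?}"); // pick a capacity from the observed distribution
}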
@@ -873,7 +873,7 @@ impl Loss {
         let loss_delay = primary_path.borrow().rtt().loss_delay();
         let confirmed = self.confirmed();

-        let mut lost_packets = Vec::new();
+        let mut lost_packets = Vec::with_capacity(16); // Pre-allocate for typical PTO scenarios
Ditto.
Oops, I forgot to hit Submit. Are you suggesting that this is a performance win, so we should take it, even if we have no idea why? Jus' Vibe'n?
@@ -630,7 +630,7 @@ impl Loss {
             return (Vec::new(), Vec::new());
         };
         let loss_delay = primary_path.borrow().rtt().loss_delay();
-        let mut lost = Vec::new();
+        let mut lost = Vec::with_capacity(8); // Typically few packets are lost at once
Did Claude come up with this number on its own? Because I have questions. It's a reasonable thing to do, but a number as small as 8 is probably low enough that the first allocation will be that large anyway. And if there are NO lost packets, you made things slower.
Yeah, this is @claude's number.
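The point about first allocations is easy to verify with a toy program. Vec::new is guaranteed not to allocate until the first push; the exact growth steps are a std implementation detail, so treat the printed capacities as illustrative rather than contractual:

fn main() {
    let empty: Vec<u64> = Vec::new();
    assert_eq!(empty.capacity(), 0); // the no-loss path never allocates

    let mut v: Vec<u64> = Vec::new();
    for i in 0..9 {
        v.push(i);
        println!("len {} -> capacity {}", v.len(), v.capacity());
    }
    // With the current std growth policy the capacity reaches 8 within a few
    // pushes, so with_capacity(8) saves at most a reallocation or two, while
    // costing an up-front allocation whenever `lost` ends up empty.
}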
@@ -3211,7 +3214,7 @@ impl Connection {
             .streams_mut()
             .inbound_frame(space, offset, data)?;
         if self.crypto.streams().data_ready(space) {
-            let mut buf = Vec::new();
+            let mut buf = Vec::with_capacity(16384); // Typical handshake message size
This number I know to be wrong. I know that we have certificates (which as a client we never send...) that can be large-ish, but not that large. In practice, the size we might pre-allocate here would be around 2k, not 16k.
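A hedged sketch of what the right-sized hint would look like next to the 16 KiB guess (the constant name is invented for illustration):

fn main() {
    // Hypothetical sizing based on the ~2k estimate above.
    const TYPICAL_CRYPTO_READ: usize = 2048;
    let right_sized: Vec<u8> = Vec::with_capacity(TYPICAL_CRYPTO_READ);
    let oversized: Vec<u8> = Vec::with_capacity(16384);
    // Both cover a typical handshake read, but the larger hint commits
    // eight times the memory up front on every call.
    assert!(right_sized.capacity() >= 2048);
    assert!(oversized.capacity() >= 16384);
}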
@@ -873,7 +873,7 @@ impl Loss {
         let loss_delay = primary_path.borrow().rtt().loss_delay();
         let confirmed = self.confirmed();

-        let mut lost_packets = Vec::new();
+        let mut lost_packets = Vec::with_capacity(16); // Pre-allocate for typical PTO scenarios
Again, no evidence for the number.
let mut send_buffer = Vec::with_capacity(min(
    DatagramBatch::MAX,
    path.borrow().plpmtu() * max_datagrams.get(),
));
This looks pretty clever, until you realize that this is just copied from a few lines down. Also, nope.
Previous attempt with no sign of impact: #2782
There seem to be some modest improvements here from the preallocations, but the preallocation amounts should be informed by some data.
Kinda. I'm asking @claude to do various things to see what it can do, and for the (few) ones where the results look like they might be the beginning of something, I open a draft PR, mostly so that benches run. (I tried to ask @claude to run benches locally to verify improvements - and even make new benches for each specific change it proposes - but it kinda fails hard when confronted with that ask.)