
@sanity sanity commented Jan 3, 2026

Problem

GET requests from technic.locut.us (137ms RTT to nova) were taking 21-29 seconds instead of the expected ~7 seconds. Telemetry showed:

  • peak_cwnd reaching 43KB (good: slow start working)
  • final_cwnd stuck at 10-21KB (bad: never recovering)
  • Only 1 slowdown triggered per transfer (subsequent ones skipped)
  • Throughput stuck at 88-117 Kbps instead of the expected ~800 Kbps

Root Cause

In handle_congestion_state(), the return value of start_slowdown() was being ignored in both CongestionAvoidance and WaitingForSlowdown states:

// BUG: Both states ignored the return value
if now_nanos >= next_slowdown {
    self.start_slowdown(now_nanos, base_delay);  // Returns false if skipped!
}
return true;  // Always blocked congestion avoidance

When start_slowdown() skips because cwnd is too small to meaningfully reduce, it returns false. But both states unconditionally returned true, preventing update_cwnd() from running.

Fixes in This PR

Fix 1: Use start_slowdown() return value (LEDBAT)

Now both the CongestionAvoidance and WaitingForSlowdown states properly handle the return value (a sketch follows this list):

  • If start_slowdown() returns true → slowdown started, stay in state machine
  • If it returns false → slowdown skipped, transition to CongestionAvoidance and let cwnd grow
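A minimal sketch of the corrected handling, assuming hypothetical field and enum names such as `self.state` and `CongestionState::CongestionAvoidance` (the actual diff may differ):

```rust
if now_nanos >= next_slowdown {
    if self.start_slowdown(now_nanos, base_delay) {
        // Slowdown actually started: keep blocking congestion avoidance.
        return true;
    }
    // Slowdown skipped (cwnd too small to meaningfully reduce):
    // fall back to CongestionAvoidance so update_cwnd() can grow cwnd again.
    self.state = CongestionState::CongestionAvoidance; // assumed field/enum names
    return false;
}
```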

Fix 2: Increase ssthresh for high-BDP paths (LEDBAT)

Increased DEFAULT_SSTHRESH from 100KB to 1MB. With 135ms RTT (e.g., US to EU), the window/RTT ceiling works out as follows (arithmetic sketched after this list):

  • 100KB ssthresh limits throughput to ~6 Mbit/s
  • 1MB ssthresh allows ~60 Mbit/s
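These ceilings follow from throughput ≈ window / RTT. A quick back-of-the-envelope check (plain arithmetic with the values above, not code from the repository):

```rust
// Rough throughput ceiling when cwnd is capped at ssthresh: throughput ≈ window / RTT.
fn ceiling_mbps(window_bytes: f64, rtt_secs: f64) -> f64 {
    window_bytes * 8.0 / rtt_secs / 1_000_000.0
}

fn main() {
    let rtt = 0.135; // 135 ms
    println!("100KB: {:.1} Mbit/s", ceiling_mbps(100.0 * 1024.0, rtt));  // ~6.1
    println!("1MB:   {:.1} Mbit/s", ceiling_mbps(1024.0 * 1024.0, rtt)); // ~62.1
}
```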

Fix 3: Increase ping tolerance (Transport)

Increased MAX_UNANSWERED_PINGS from 2 to 5, extending tolerance from 10s to 25s.
This fixes flaky test failures where connections were prematurely closed due to
pong responses being delayed under CI load.
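The tolerance window is simply the ping interval multiplied by the allowed number of unanswered pings. A sketch of that arithmetic, assuming the 5-second ping interval mentioned in the commit message below (constant names other than MAX_UNANSWERED_PINGS are illustrative):

```rust
const PING_INTERVAL_SECS: u64 = 5;   // illustrative name; interval per the commit message
const MAX_UNANSWERED_PINGS: u64 = 5; // was 2

// A connection is only considered dead after this long without a pong.
const TOLERANCE_SECS: u64 = PING_INTERVAL_SECS * MAX_UNANSWERED_PINGS; // 25s (was 10s)
```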

Fix 4: Integration test timeouts

Fixed test_put_then_immediate_subscribe_succeeds_locally_regression_2326 timeouts
to match CI-stable patterns.

Testing

Regression Test Added

Added test_skipped_slowdown_allows_congestion_avoidance_to_run() (sketched below), which:

  1. Sets up cwnd below the skip threshold, with a scheduled slowdown that is already due
  2. Simulates multiple RTTs of ACKs
  3. Verifies cwnd grows (proving congestion avoidance ran)
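A rough shape of that test; the helper names below are placeholders, not the real harness API:

```rust
#[test]
fn test_skipped_slowdown_allows_congestion_avoidance_to_run() {
    // 1. cwnd below the slowdown skip threshold, with a slowdown already due.
    let mut ctrl = controller_with_small_cwnd_and_due_slowdown(); // placeholder helper

    let before = ctrl.cwnd();

    // 2. Simulate several RTTs worth of ACKs.
    for _ in 0..4 {
        simulate_one_rtt_of_acks(&mut ctrl); // placeholder helper
    }

    // 3. cwnd must have grown, proving congestion avoidance actually ran.
    assert!(ctrl.cwnd() > before);
}
```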

Code Review

Three parallel subagent reviews completed:

  • Code-first review: Verified code matches description
  • Testing review: Identified gaps, all addressed
  • Skeptical review: Found same bug in WaitingForSlowdown state → fixed

All Tests Pass

  • 125 LEDBAT unit tests pass
  • test_small_network_get_failure passes 8/8 runs (was failing ~20%)
  • Integration tests pass

[AI-assisted - Claude]

When a scheduled LEDBAT++ slowdown was skipped (because cwnd was too
small to meaningfully reduce), handle_congestion_state() incorrectly
returned `true`, causing the caller to skip congestion avoidance.
This prevented cwnd from ever growing, causing transfers to remain
stuck at minimum throughput.

## Problem

Telemetry from technic.locut.us (137ms RTT to nova) showed:
- GET requests taking 21-29 seconds instead of expected ~7 seconds
- peak_cwnd reaching 43KB but final_cwnd stuck at 10-21KB
- Only 1 slowdown triggered per transfer (subsequent ones skipped)
- Throughput stuck at 88-117 Kbps instead of ~800 Kbps

The root cause: after the first slowdown dropped cwnd to ~10.75KB,
subsequent slowdowns were correctly skipped (cwnd below threshold),
but the skip path returned `true` anyway, preventing congestion
avoidance from running to grow cwnd back.

## Fix

Use the return value of `start_slowdown()`: only return `true` from
`handle_congestion_state()` if the slowdown actually started. When
the slowdown is skipped, return `false` to let congestion avoidance
run and grow cwnd.

## Testing

- Added regression test that verifies cwnd grows when slowdown is
  skipped due to small cwnd
- All 125 LEDBAT tests pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

github-actions bot commented Jan 3, 2026

⚠️ Performance Benchmark Regressions Detected

The following benchmarks show performance regression compared to the baseline:

See workflow summary for details

Note: This is informational only and does not block the PR. Please review if the regression is expected or needs investigation.


The `test_put_then_immediate_subscribe_succeeds_locally_regression_2326`
test was flaky on CI due to timeouts that were too aggressive compared
to similar passing tests.

Changes:
- timeout_secs: 60 → 300 (matching other two-node tests)
- startup_wait_secs: 10 → 15 (matching similar gateway-peer tests)
- Post-startup connection wait: 3s → 5s (with explanatory comment)
- Subscribe timeout: 10s → 30s (matching test_multiple_clients_subscription)

These values align with other CI-stable tests like
`test_update_broadcast_propagation_issue_2301` which have the same
node topology and similar operations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

With 135ms RTT (e.g., US to EU), 100KB ssthresh limits effective
throughput to ~6 Mbit/s. Increasing to 1MB allows ~60 Mbit/s,
which is more appropriate for modern high-bandwidth connections.

Also updates test_harness_timeout_resets_state_correctly to use an explicit
ssthresh, since that test is not exercising ssthresh behavior.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

sanity and others added 3 commits January 3, 2026 19:47
…on closure

Increased MAX_UNANSWERED_PINGS from 2 to 5, extending the tolerance
window from 10 seconds to 25 seconds. This prevents premature connection
closure when pong responses are delayed due to:

- CI environment load spikes
- Network congestion during data transfer
- CPU scheduling delays

Root cause: The bidirectional liveness check would close connections if
more than 2 pings went unanswered. With 5-second ping intervals, any
delay over 10 seconds (e.g., during heavy CI load) would trigger
connection closure and subsequent "decryption failed" errors.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Found by code review: the same bug fixed for CongestionAvoidance state
also existed in WaitingForSlowdown state. When start_slowdown() skips
because cwnd is too small, WaitingForSlowdown was returning true and
blocking congestion avoidance indefinitely.

Now when slowdown is skipped in WaitingForSlowdown, we transition to
CongestionAvoidance and return false, allowing cwnd growth.

Also fixes stale comment (100 KB -> 1 MB).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

@sanity sanity merged commit 12f0fb8 into main Jan 4, 2026
9 checks passed
@sanity sanity deleted the fix-ledbat-skipped-slowdown branch January 4, 2026 02:38