fix(ledbat): allow congestion avoidance to run when slowdown is skipped #2574
Conversation
When a scheduled LEDBAT++ slowdown was skipped (because cwnd was too small to meaningfully reduce), handle_congestion_state() incorrectly returned `true`, causing the caller to skip congestion avoidance. This prevented cwnd from ever growing, causing transfers to remain stuck at minimum throughput.

## Problem

Telemetry from technic.locut.us (137ms RTT to nova) showed:

- GET requests taking 21-29 seconds instead of the expected ~7 seconds
- peak_cwnd reaching 43KB but final_cwnd stuck at 10-21KB
- Only 1 slowdown triggered per transfer (subsequent ones skipped)
- Throughput stuck at 88-117 Kbps instead of ~800 Kbps

The root cause: after the first slowdown dropped cwnd to ~10.75KB, subsequent slowdowns were correctly skipped (cwnd below threshold), but the skip path returned `true` anyway, preventing congestion avoidance from running to grow cwnd back.

## Fix

Use the return value of `start_slowdown()`: only return `true` from `handle_congestion_state()` if the slowdown actually started. When the slowdown is skipped, return `false` to let congestion avoidance run and grow cwnd.

## Testing

- Added regression test that verifies cwnd grows when slowdown is skipped due to small cwnd
- All 125 LEDBAT tests pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
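To make the failure mode concrete, here is a toy demonstration of the control flow described above; only update_cwnd() and the ~10.75KB figure come from the commit, and everything else is illustrative:

```rust
// Toy demonstration: while the skip path reports the slowdown as "handled",
// update_cwnd() never runs, so the window stays pinned.
fn update_cwnd(cwnd: &mut u32) {
    *cwnd += 1200; // stand-in for LEDBAT's additive increase
}

fn main() {
    let mut cwnd: u32 = 10_750; // window after the first slowdown, per telemetry
    let handled_by_slowdown = true; // what the buggy skip path reported
    if !handled_by_slowdown {
        update_cwnd(&mut cwnd); // never reached: cwnd stays at ~10.75KB
    }
    assert_eq!(cwnd, 10_750);
}
```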
The `test_put_then_immediate_subscribe_succeeds_locally_regression_2326` test was flaky on CI due to timeouts that were too aggressive compared to similar passing tests.

Changes:

- timeout_secs: 60 → 300 (matching other two-node tests)
- startup_wait_secs: 10 → 15 (matching similar gateway-peer tests)
- Post-startup connection wait: 3s → 5s (with explanatory comment)
- Subscribe timeout: 10s → 30s (matching test_multiple_clients_subscription)

These values align with other CI-stable tests like `test_update_broadcast_propagation_issue_2301`, which have the same node topology and similar operations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
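For illustration, the relaxed values as they might be recorded in the test's setup; the constant names here are hypothetical, not the actual harness API:

```rust
// Hypothetical constants, not the actual test-harness API; they only record
// the relaxed values described in the commit message above.
const TIMEOUT_SECS: u64 = 300; // was 60; matches other two-node tests
const STARTUP_WAIT_SECS: u64 = 15; // was 10; matches similar gateway-peer tests
const POST_STARTUP_CONNECTION_WAIT_SECS: u64 = 5; // was 3; let connections settle under CI load
const SUBSCRIBE_TIMEOUT_SECS: u64 = 30; // was 10; matches test_multiple_clients_subscription
```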
With 135ms RTT (e.g., US to EU), a 100KB ssthresh limits effective throughput to ~6 Mbit/s. Increasing it to 1MB allows ~60 Mbit/s, which is more appropriate for modern high-bandwidth connections.

Also fixes test_harness_timeout_resets_state_correctly to use an explicit ssthresh, since it is not testing ssthresh behavior.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
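A quick sanity check of those numbers, using the ceiling throughput ≈ ssthresh / RTT (window-limited transfer):

```rust
// Throughput ceiling when cwnd is capped at ssthresh: bytes * 8 / RTT.
fn ceiling_mbit_per_s(ssthresh_bytes: f64, rtt_secs: f64) -> f64 {
    ssthresh_bytes * 8.0 / rtt_secs / 1_000_000.0
}

fn main() {
    let rtt = 0.135; // 135ms, e.g., US to EU
    println!("{:.1} Mbit/s", ceiling_mbit_per_s(100_000.0, rtt)); // ~5.9 at 100KB
    println!("{:.1} Mbit/s", ceiling_mbit_per_s(1_000_000.0, rtt)); // ~59.3 at 1MB
}
```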
…on closure

Increased MAX_UNANSWERED_PINGS from 2 to 5, extending the tolerance window from 10 seconds to 25 seconds. This prevents premature connection closure when pong responses are delayed due to:

- CI environment load spikes
- Network congestion during data transfer
- CPU scheduling delays

Root cause: the bidirectional liveness check would close connections if more than 2 pings went unanswered. With 5-second ping intervals, any delay over 10 seconds (e.g., during heavy CI load) would trigger connection closure and subsequent "decryption failed" errors.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
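A minimal sketch of the liveness rule described above; the struct and methods are assumptions, while MAX_UNANSWERED_PINGS, the 5-second ping cadence, and the 25-second window come from the commit:

```rust
// Sketch only: the Liveness type is an assumption; the constants reflect the commit.
const PING_INTERVAL_SECS: u64 = 5;
const MAX_UNANSWERED_PINGS: u32 = 5; // was 2 (10s window)
const TOLERANCE_WINDOW_SECS: u64 = PING_INTERVAL_SECS * MAX_UNANSWERED_PINGS as u64; // 25s

struct Liveness {
    unanswered_pings: u32,
}

impl Liveness {
    fn on_ping_sent(&mut self) {
        self.unanswered_pings += 1;
    }
    fn on_pong_received(&mut self) {
        self.unanswered_pings = 0;
    }
    /// Close only after more than MAX_UNANSWERED_PINGS pings go unanswered.
    fn should_close(&self) -> bool {
        self.unanswered_pings > MAX_UNANSWERED_PINGS
    }
}
```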
Found by code review: the same bug fixed for the CongestionAvoidance state also existed in the WaitingForSlowdown state. When start_slowdown() skips because cwnd is too small, WaitingForSlowdown was returning true and blocking congestion avoidance indefinitely.

Now when the slowdown is skipped in WaitingForSlowdown, we transition to CongestionAvoidance and return false, allowing cwnd growth.

Also fixes a stale comment (100 KB -> 1 MB).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Problem
GET requests from technic.locut.us (137ms RTT to nova) were taking 21-29 seconds instead of the expected ~7 seconds. Telemetry showed:
- peak_cwnd reaching 43KB (good - slow start working)
- final_cwnd stuck at 10-21KB (bad - never recovering)

Root Cause
In handle_congestion_state(), the return value of start_slowdown() was being ignored in both the CongestionAvoidance and WaitingForSlowdown states. When start_slowdown() skips because cwnd is too small to meaningfully reduce, it returns false. But both states unconditionally returned true, preventing update_cwnd() from running.

Fixes in This PR
Fix 1: Use start_slowdown() return value (LEDBAT)
Now both the CongestionAvoidance and WaitingForSlowdown states properly handle the return value (see the sketch after this list):

- start_slowdown() returns true → slowdown started, stay in the slowdown state machine
- start_slowdown() returns false → slowdown skipped, transition to CongestionAvoidance and let cwnd grow
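A minimal sketch of the corrected state handling, assuming a match-based state machine; the struct shape, scheduling flag, threshold, and halving step are illustrative, while handle_congestion_state(), start_slowdown(), update_cwnd(), CongestionAvoidance, and WaitingForSlowdown are named in this PR:

```rust
// Sketch only: struct/enum shapes and numeric values are assumptions.
#[derive(Clone, Copy)]
enum State {
    CongestionAvoidance,
    WaitingForSlowdown,
}

struct Controller {
    state: State,
    cwnd: usize,
    slowdown_scheduled: bool,
}

impl Controller {
    /// Skips (returns false) when cwnd is too small to meaningfully reduce.
    fn start_slowdown(&mut self) -> bool {
        const MIN_CWND_FOR_SLOWDOWN: usize = 16 * 1024; // illustrative threshold
        self.slowdown_scheduled = false;
        if self.cwnd < MIN_CWND_FOR_SLOWDOWN {
            return false; // skipped
        }
        self.cwnd /= 2; // illustrative reduction
        true
    }

    /// Returns true only while a slowdown actually started; false lets the
    /// caller run update_cwnd() (congestion avoidance).
    fn handle_congestion_state(&mut self) -> bool {
        match self.state {
            // Bug: this arm previously returned true even when the slowdown
            // was skipped; now the skip propagates as false.
            State::CongestionAvoidance if self.slowdown_scheduled => self.start_slowdown(),
            State::CongestionAvoidance => false,
            State::WaitingForSlowdown => {
                if self.start_slowdown() {
                    true // slowdown started; stay in the slowdown path
                } else {
                    // Skipped: fall back so cwnd can grow again.
                    self.state = State::CongestionAvoidance;
                    false
                }
            }
        }
    }
}
```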
Fix 2: Increase ssthresh for high-BDP paths (LEDBAT)

Increased DEFAULT_SSTHRESH from 100KB to 1MB. With 135ms RTT (e.g., US to EU):

- a 100KB ssthresh limits effective throughput to ~6 Mbit/s
- 1MB allows ~60 Mbit/s, better suited to modern high-bandwidth connections

Fix 3: Increase ping tolerance (Transport)
Increased MAX_UNANSWERED_PINGS from 2 to 5, extending tolerance from 10s to 25s. This fixes flaky test failures where connections were prematurely closed due to pong responses being delayed under CI load.
Fix 4: Integration test timeouts
Fixed test_put_then_immediate_subscribe_succeeds_locally_regression_2326 timeouts to match CI-stable patterns.
Testing
Regression Test Added
Added test_skipped_slowdown_allows_congestion_avoidance_to_run(), which verifies that cwnd grows when a scheduled slowdown is skipped due to small cwnd (a sketch follows below).
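A sketch of what such a test might look like against the illustrative Controller from the Fix 1 sketch above; the real test drives the actual LEDBAT controller, so only the test name and intent come from this PR:

```rust
// Sketch only: reuses the illustrative Controller/State from the Fix 1 sketch.
#[test]
fn test_skipped_slowdown_allows_congestion_avoidance_to_run() {
    let mut ctl = Controller {
        state: State::WaitingForSlowdown,
        cwnd: 10 * 1024, // below the skip threshold, as in the telemetry
        slowdown_scheduled: false,
    };

    // The slowdown is skipped, so congestion avoidance must be allowed to run...
    assert!(!ctl.handle_congestion_state());
    // ...and the state machine must fall back to CongestionAvoidance, where
    // update_cwnd() can grow cwnd again.
    assert!(matches!(ctl.state, State::CongestionAvoidance));
}
```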
Code Review

Three parallel subagent reviews completed.
All Tests Pass
test_small_network_get_failure passes 8/8 runs (was failing ~20%)

[AI-assisted - Claude]