fix(ledbat): prevent ssthresh death spiral on high-BDP paths #2579

sanity · 2026-01-04T06:08:07Z

Problem

Despite multiple LEDBAT fixes in v0.1.78 (#2574, #2549, #2532, #2510, #2467, #2472), large contract transfers from nova to technic still take 18+ seconds instead of the expected 2-7 seconds. The logs show ssthresh stuck at 5KB when it should be 1MB.

Root cause: The RTO (retransmission timeout) handler at ledbat.rs:1403 calculates:

let new_ssthresh = (old_cwnd / 2).max(self.min_cwnd * 2);

With min_cwnd = 2_848 bytes (2 × MSS), the floor is only 5,696 bytes (~5.7KB).

Death spiral sequence:

1. Initial: ssthresh=1MB, cwnd=38KB
2. Timeout occurs → ssthresh = max(cwnd/2, 5.7KB) ≈ 19KB
3. cwnd resets to min_cwnd (2.8KB)
4. Next timeout → ssthresh = max(2.8KB/2, 5.7KB) = 5.7KB
5. Repeat: ssthresh stays at 5.7KB forever
6. Slow start exits after only 5KB of growth → extremely slow transfers

This matches the "new_ssthresh_kb=5" seen in production logs.
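The sequence above can be replayed with a few lines of Rust. This is a minimal sketch, not the code in ledbat.rs: `old_ssthresh_after_timeout` and `simulate_timeouts` are hypothetical helper names, and the sketch assumes cwnd is reset to min_cwnd after every timeout and never gets a chance to grow before the next one.

```rust
/// 2 x MSS, per the figures quoted in the description.
const MIN_CWND: usize = 2_848;

/// Pre-fix calculation: half the old cwnd, floored at 2 * min_cwnd.
fn old_ssthresh_after_timeout(old_cwnd: usize, min_cwnd: usize) -> usize {
    (old_cwnd / 2).max(min_cwnd * 2)
}

/// Replays `timeouts` consecutive RTOs and records the ssthresh chosen at
/// each one. Assumption: cwnd is still at min_cwnd when the next RTO fires.
fn simulate_timeouts(initial_cwnd: usize, timeouts: usize) -> Vec<usize> {
    let mut cwnd = initial_cwnd;
    let mut history = Vec::with_capacity(timeouts);
    for _ in 0..timeouts {
        history.push(old_ssthresh_after_timeout(cwnd, MIN_CWND));
        // The RTO handler drops cwnd back to the minimum.
        cwnd = MIN_CWND;
    }
    history
}
```

Starting from cwnd=38_000, `simulate_timeouts(38_000, 3)` yields 19_000, then 5_696, then 5_696: once cwnd/2 falls below the floor, the floor is all that is left.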

Why CI Didn't Catch This

Existing timeout tests verified that ssthresh = old_cwnd/2, but didn't test the floor behavior. The floor of 2*min_cwnd was fine for low-RTT paths but catastrophic for high-BDP paths (135ms+ RTT) where transfers need ssthresh in the MB range to achieve reasonable throughput.

The new test test_repeated_timeouts_maintain_usable_ssthresh_floor is general: it simulates a high-BDP path and verifies that after repeated timeouts, the controller can still achieve reasonable throughput (≥5 Mbit/s). This catches any future regression that collapses ssthresh to a tiny value.

Approach

Store the initial ssthresh value in a new initial_ssthresh field and use it as the floor in on_timeout():

let new_ssthresh = (old_cwnd / 2).max(self.initial_ssthresh);

This ensures that even after repeated timeouts, slow start can still ramp up to the configured ssthresh (typically 1MB), enabling reasonable throughput on high-BDP paths.
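For illustration, the change could look roughly like the sketch below. Only `initial_ssthresh`, `min_cwnd`, `cwnd`, `ssthresh` and `on_timeout()` come from the description; the `SketchController` type, its constructor and the field types are assumptions, and the real controller in ledbat.rs carries far more state.

```rust
/// Hypothetical, stripped-down stand-in for the LEDBAT controller.
struct SketchController {
    cwnd: usize,
    ssthresh: usize,
    min_cwnd: usize,
    /// Captured once at construction and never modified afterwards.
    initial_ssthresh: usize,
}

impl SketchController {
    fn new(initial_cwnd: usize, initial_ssthresh: usize, min_cwnd: usize) -> Self {
        Self {
            cwnd: initial_cwnd,
            ssthresh: initial_ssthresh,
            min_cwnd,
            initial_ssthresh,
        }
    }

    /// RTO handler as proposed in this PR: the floor is the configured
    /// initial ssthresh instead of 2 * min_cwnd.
    fn on_timeout(&mut self) {
        let old_cwnd = self.cwnd;
        self.ssthresh = (old_cwnd / 2).max(self.initial_ssthresh);
        self.cwnd = self.min_cwnd;
    }
}
```

With `initial_ssthresh` at 1_000_000, any number of consecutive `on_timeout()` calls leaves ssthresh at 1_000_000 while cwnd still resets to min_cwnd.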

Testing

- Added regression test test_repeated_timeouts_maintain_usable_ssthresh_floor
  - Verifies ssthresh stays ≥100KB after 10 consecutive timeouts
  - Tests that theoretical slow start rate is ≥5 Mbit/s on 135ms RTT path
  - The test fails before this fix (ssthresh=5KB) and passes after
- Updated existing timeout tests to expect max(cwnd/2, initial_ssthresh) behavior
- All 126 LEDBAT tests pass
- Clippy and fmt pass
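The 100KB and 5 Mbit/s thresholds are related by simple arithmetic. The sketch below assumes "theoretical slow start rate" means one ssthresh-sized window delivered per RTT and that KB means 1,000 bytes; the function name is hypothetical and the actual test may compute the rate differently.

```rust
/// Rough ceiling on the rate slow start can reach before it exits:
/// one ssthresh-sized window per round trip, expressed in Mbit/s.
fn slow_start_ceiling_mbit_per_s(ssthresh_bytes: usize, rtt_ms: u64) -> f64 {
    let bits_per_window = ssthresh_bytes as f64 * 8.0;
    let rtt_seconds = rtt_ms as f64 / 1_000.0;
    bits_per_window / rtt_seconds / 1_000_000.0
}
```

Under these assumptions, a 5,696-byte ssthresh on a 135ms path caps slow start at roughly 0.34 Mbit/s, a 100,000-byte ssthresh gives roughly 5.9 Mbit/s, and the 1MB default gives roughly 59 Mbit/s.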

Fixes

Closes #2578

[AI-assisted - Claude]

The RTO handler was using `max(cwnd/2, 2*min_cwnd)` as the ssthresh floor, which is only ~5KB. This caused a death spiral on high-latency paths where:

1. Timeout occurs before cwnd can grow
2. ssthresh = max(small_cwnd/2, 5KB) = 5KB
3. Slow start exits almost immediately
4. Repeat - ssthresh stays at 5KB forever

Now uses `initial_ssthresh` (typically 1MB) as the floor, ensuring slow start can still achieve reasonable throughput after recovery.

Adds regression test that verifies repeated timeouts don't collapse ssthresh below a usable floor for high-BDP paths.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

github-actions · 2026-01-04T06:22:39Z

⚠️ Performance Benchmark Regressions Detected

The following benchmarks show performance regression compared to the baseline:

See workflow summary for details

Note: This is informational only and does not block the PR. Please review if the regression is expected or needs investigation.

View full benchmark results

iduartgomez · 2026-01-04T08:35:09Z

LEDBAT++ Compliance Concern

Using initial_ssthresh (1MB) as floor means after timeout:
∙ cwnd = 2.8KB
∙ ssthresh = 1MB (the floor dominates)
∙ Slow start grows exponentially to 1MB before exiting

This is more aggressive than the spec intends. The spec’s small floor (5KB) ensures conservative recovery after severe congestion.

Potential Risk

On low-BDP paths (short RTT, limited bandwidth), a 1MB ssthresh floor could cause:
∙ Slow start overshoots available bandwidth
∙ Brief congestion bursts before delay-based exit triggers

iduartgomez

A more conservative fix might use max(cwnd/2, min(initial_ssthresh, estimated_BDP)), but that adds complexity.
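A minimal sketch of that capped floor, to make the trade-off concrete. How `estimated_bdp` would be measured is not specified in this thread, so it is passed in as a plain argument here; both function names and the example bandwidth figures are assumptions.

```rust
/// Bandwidth-delay product in bytes for a given rate and round-trip time.
fn bdp_bytes(bandwidth_bytes_per_s: u64, rtt_ms: u64) -> u64 {
    bandwidth_bytes_per_s * rtt_ms / 1_000
}

/// max(cwnd/2, min(initial_ssthresh, estimated_BDP)): the floor can never
/// exceed what the path is believed to hold in flight.
fn capped_ssthresh(old_cwnd: u64, initial_ssthresh: u64, estimated_bdp: u64) -> u64 {
    (old_cwnd / 2).max(initial_ssthresh.min(estimated_bdp))
}
```

On a hypothetical 10 Mbit/s, 20ms path (BDP of 25,000 bytes) the floor drops to 25,000 bytes instead of 1MB, which addresses the overshoot concern. On a hypothetical 100 Mbit/s, 135ms path (BDP of about 1.69MB) the floor stays at the 1MB `initial_ssthresh`.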

iduartgomez · 2026-01-04T13:58:57Z

Should be superseded by #2582 and #2580

sanity · 2026-01-04T15:15:08Z

Thanks for the review and the better solution, Nacho.

You're right that using initial_ssthresh (1MB) as a hard floor is overly aggressive and doesn't respect LEDBAT++ spec intent. The concerns about low-BDP paths are valid - a fixed 1MB floor could cause slow start to overshoot available bandwidth.

Your approach in #2580 and #2582 is superior:

feat(ledbat): add min_ssthresh config and comprehensive property tests #2580: Configurable min_ssthresh allows opt-in to higher floors while respecting the spec default when not configured
feat(ledbat): add adaptive min_ssthresh with BDP proxy and path change detection #2582: Adaptive floor based on observed BDP during slow start exit is smarter - it uses actual path characteristics rather than a hardcoded value, with path change detection to invalidate stale estimates
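As a rough illustration of the opt-in idea in #2580 (not its actual code, which is not shown in this thread), a configurable floor might be modelled as an optional setting that falls back to the 2*min_cwnd default. The `LedbatConfigSketch` type, its fields and both function names are hypothetical.

```rust
/// Hypothetical config shape; the real one in #2580 may differ.
struct LedbatConfigSketch {
    min_cwnd: usize,
    /// None keeps the small default floor; Some(n) opts in to a higher one.
    min_ssthresh: Option<usize>,
}

impl LedbatConfigSketch {
    /// Floor applied in the RTO handler.
    fn ssthresh_floor(&self) -> usize {
        self.min_ssthresh.unwrap_or(self.min_cwnd * 2)
    }
}

/// ssthresh after a timeout, using the configured floor.
fn ssthresh_after_timeout(old_cwnd: usize, config: &LedbatConfigSketch) -> usize {
    (old_cwnd / 2).max(config.ssthresh_floor())
}
```

Left unconfigured, this behaves exactly like the pre-fix handler (floor of 5,696 bytes); setting `min_ssthresh` to, say, 100_000 keeps ssthresh at or above 100KB across repeated timeouts.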

Closing this PR since the issue is now properly fixed by your merged PRs.

[AI-assisted - Claude]

sanity · 2026-01-04T15:15:16Z

Superseded by #2580 and #2582 which provide a more spec-compliant and adaptive solution.

sanity · 2026-01-04T15:17:34Z

Update: Per Nacho's guidance on Matrix:

feat(ledbat): add min_ssthresh config and comprehensive property tests #2580 adds configurable ledbat_min_ssthresh with default recommendations in docs/architecture/transport/configuration/bandwidth-configuration.md
feat(ledbat): add adaptive min_ssthresh with BDP proxy and path change detection #2582 adds an adaptive algorithm that dynamically calculates the floor based on observed path characteristics

He also noted there are throughput benchmarks (not run automatically in CI due to stability/time) that could be tuned/expanded to reproduce these high-BDP path scenarios. Worth considering for future regression prevention.

[AI-assisted - Claude]

iduartgomez requested changes Jan 4, 2026

View reviewed changes

sanity closed this Jan 4, 2026

iduartgomez deleted the fix/2578-ledbat-ssthresh-floor branch January 4, 2026 23:41
