@sanity
Collaborator

@sanity sanity commented Jan 4, 2026

Problem

Despite multiple LEDBAT fixes in v0.1.78 (#2574, #2549, #2532, #2510, #2467, #2472), large contract transfers from nova to technic still take 18+ seconds instead of the expected 2-7 seconds. The logs show ssthresh stuck at 5KB when it should be 1MB.

Root cause: The RTO (retransmission timeout) handler at ledbat.rs:1403 calculates:

let new_ssthresh = (old_cwnd / 2).max(self.min_cwnd * 2);

With min_cwnd = 2_848 bytes (2 × MSS), the floor is only 5,696 bytes (~5.7KB).

Death spiral sequence:

  1. Initial: ssthresh=1MB, cwnd=38KB
  2. Timeout occurs → ssthresh = max(cwnd/2, 5.7KB) ≈ 19KB
  3. cwnd resets to min_cwnd (2.8KB)
  4. Next timeout → ssthresh = max(2.8KB/2, 5.7KB) = 5.7KB
  5. Repeat — ssthresh stays at 5.7KB forever
  6. Slow start exits as soon as cwnd reaches ~5.7KB → extremely slow transfers

This matches the "new_ssthresh_kb=5" seen in production logs.
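The sequence above can be replayed with a minimal sketch. This is not the code in ledbat.rs: the `State` struct, the helper names, and the 38,000-byte starting window are illustrative assumptions, and the MSS of 1,424 bytes is simply half of the quoted min_cwnd.

```rust
/// Assumed MSS, derived from the quoted min_cwnd = 2_848 bytes (2 × MSS).
const MSS: usize = 1_424;
const MIN_CWND: usize = 2 * MSS;

/// Hypothetical stand-in for the controller state a timeout touches.
pub struct State {
    pub cwnd: usize,
    pub ssthresh: usize,
}

/// Old timeout rule: ssthresh = max(cwnd / 2, 2 × min_cwnd),
/// then cwnd resets to min_cwnd.
pub fn on_timeout_old(s: &mut State) {
    s.ssthresh = (s.cwnd / 2).max(MIN_CWND * 2);
    s.cwnd = MIN_CWND;
}

/// Apply `n` timeouts back to back (cwnd never grows in between)
/// and return the ssthresh observed after each one.
pub fn simulate_old(n: usize) -> Vec<usize> {
    let mut s = State { cwnd: 38_000, ssthresh: 1_000_000 };
    (0..n)
        .map(|_| {
            on_timeout_old(&mut s);
            s.ssthresh
        })
        .collect()
}
```

Under these assumptions, `simulate_old(3)` yields 19,000 bytes after the first timeout and 5,696 bytes after every later one, which is the collapse described in steps 2 to 5.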

Why CI Didn't Catch This

Existing timeout tests verified that ssthresh = old_cwnd/2, but didn't test the floor behavior. The floor of 2*min_cwnd was fine for low-RTT paths but catastrophic for high-BDP paths (135ms+ RTT) where transfers need ssthresh in the MB range to achieve reasonable throughput.

The new test test_repeated_timeouts_maintain_usable_ssthresh_floor is general: it simulates a high-BDP path and verifies that after repeated timeouts, the controller can still achieve reasonable throughput (≥5 Mbit/s). This catches any future regression that collapses ssthresh to a tiny value.

Approach

Store the initial ssthresh value in a new initial_ssthresh field and use it as the floor in on_timeout():

let new_ssthresh = (old_cwnd / 2).max(self.initial_ssthresh);

This ensures that even after repeated timeouts, slow start can still ramp up to the configured ssthresh (typically 1MB), enabling reasonable throughput on high-BDP paths.
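As a rough sketch of the proposed change, assuming a simplified controller: the `Controller` type and its constructor are hypothetical, and only the `initial_ssthresh` field and the `on_timeout()` expression come from the description above.

```rust
/// Hypothetical, simplified controller; the real one has many more fields.
pub struct Controller {
    pub cwnd: usize,
    pub ssthresh: usize,
    pub min_cwnd: usize,
    /// Configured ssthresh, remembered so it can serve as the timeout floor.
    pub initial_ssthresh: usize,
}

impl Controller {
    pub fn new(initial_ssthresh: usize, min_cwnd: usize, cwnd: usize) -> Self {
        Controller {
            cwnd,
            ssthresh: initial_ssthresh,
            min_cwnd,
            initial_ssthresh,
        }
    }

    /// Proposed rule: halve cwnd into ssthresh, but never go below the
    /// initially configured ssthresh; then reset cwnd to min_cwnd.
    pub fn on_timeout(&mut self) {
        let old_cwnd = self.cwnd;
        self.ssthresh = (old_cwnd / 2).max(self.initial_ssthresh);
        self.cwnd = self.min_cwnd;
    }
}
```

With `Controller::new(1_000_000, 2_848, 38_000)`, any number of consecutive `on_timeout()` calls leaves ssthresh at 1,000,000 bytes while cwnd drops to 2,848 bytes.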

Testing

  • Added regression test test_repeated_timeouts_maintain_usable_ssthresh_floor
    • Verifies ssthresh stays ≥100KB after 10 consecutive timeouts
    • Checks that the theoretical slow-start rate is ≥5 Mbit/s on a 135ms RTT path
    • The test fails before this fix (ssthresh=5KB) and passes after
  • Updated existing timeout tests to expect max(cwnd/2, initial_ssthresh) behavior
  • All 126 LEDBAT tests pass
  • Clippy and fmt pass
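The two thresholds in the regression test are linked by simple arithmetic: at best, slow start delivers one ssthresh-sized window per RTT. A minimal sketch of that calculation follows; the function name is hypothetical, and whether the actual test computes the rate exactly this way is an assumption.

```rust
/// Idealised ceiling on the rate reached at the end of slow start:
/// one ssthresh-sized window per round trip, expressed in Mbit/s.
pub fn slow_start_ceiling_mbps(ssthresh_bytes: u64, rtt_ms: u64) -> f64 {
    let bits = ssthresh_bytes as f64 * 8.0;
    let rtt_s = rtt_ms as f64 / 1000.0;
    bits / rtt_s / 1_000_000.0
}
```

On a 135ms path, a 100,000-byte ssthresh gives roughly 5.9 Mbit/s, just above the 5 Mbit/s bar, while the collapsed 5,696-byte value gives roughly 0.34 Mbit/s.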

Fixes

Closes #2578

[AI-assisted - Claude]

The RTO handler was using `max(cwnd/2, 2*min_cwnd)` as the ssthresh
floor, which is only ~5KB. This caused a death spiral on high-latency
paths where:
1. Timeout occurs before cwnd can grow
2. ssthresh = max(small_cwnd/2, 5KB) = 5KB
3. Slow start exits almost immediately
4. Repeat - ssthresh stays at 5KB forever

Now uses `initial_ssthresh` (typically 1MB) as the floor, ensuring
slow start can still achieve reasonable throughput after recovery.

Adds regression test that verifies repeated timeouts don't collapse
ssthresh below a usable floor for high-BDP paths.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@github-actions
Contributor

github-actions bot commented Jan 4, 2026

⚠️ Performance Benchmark Regressions Detected

The following benchmarks show performance regression compared to the baseline:

See workflow summary for details

Note: This is informational only and does not block the PR. Please review if the regression is expected or needs investigation.

View full benchmark results

@iduartgomez
Collaborator

iduartgomez commented Jan 4, 2026

LEDBAT++ Compliance Concern

Using initial_ssthresh (1MB) as floor means after timeout:
∙ cwnd = 2.8KB
∙ ssthresh = 1MB (the floor dominates)
∙ Slow start grows exponentially to 1MB before exiting

This is more aggressive than the spec intends. The spec’s small floor (5KB) ensures conservative recovery after severe congestion.

Potential Risk

On low-BDP paths (short RTT, limited bandwidth), a 1MB ssthresh floor could cause:
∙ Slow start overshoots available bandwidth
∙ Brief congestion bursts before delay-based exit triggers

Collaborator

@iduartgomez iduartgomez left a comment

A more conservative fix might use max(cwnd/2, min(initial_ssthresh, estimated_BDP)), but that adds complexity.
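The expression suggested in this comment can be written as a small pure function. This is only a sketch of the formula as stated, with a hypothetical function name; how `estimated_bdp` would be obtained is left open here, just as it is in the comment.

```rust
/// ssthresh after a timeout: max(cwnd / 2, min(initial_ssthresh, estimated_bdp)).
/// All values are in bytes; `estimated_bdp` is assumed to be supplied by the caller.
pub fn ssthresh_after_timeout(
    old_cwnd: usize,
    initial_ssthresh: usize,
    estimated_bdp: usize,
) -> usize {
    // The floor never exceeds what the path is believed to sustain,
    // and never exceeds the configured initial ssthresh.
    let floor = initial_ssthresh.min(estimated_bdp);
    (old_cwnd / 2).max(floor)
}
```

On a path with an estimated BDP of 25,000 bytes the floor becomes 25,000 rather than 1,000,000; on a path whose BDP exceeds the configured value, the floor is capped at `initial_ssthresh`.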

@iduartgomez
Collaborator

iduartgomez commented Jan 4, 2026

Should be superseded by #2582 and #2580

@sanity
Collaborator Author

sanity commented Jan 4, 2026

Thanks for the review and the better solution, Nacho.

You're right that using initial_ssthresh (1MB) as a hard floor is overly aggressive and doesn't respect the intent of the LEDBAT++ spec. The concerns about low-BDP paths are valid: a fixed 1MB floor could cause slow start to overshoot the available bandwidth.

Your approach in #2580 and #2582 is superior.

Closing this PR since the issue is now properly fixed by your merged PRs.

[AI-assisted - Claude]

@sanity
Collaborator Author

sanity commented Jan 4, 2026

Superseded by #2580 and #2582 which provide a more spec-compliant and adaptive solution.

@sanity sanity closed this Jan 4, 2026
@sanity
Collaborator Author

sanity commented Jan 4, 2026

Update: Per Nacho's guidance on Matrix:

  1. #2580 (feat(ledbat): add min_ssthresh config and comprehensive property tests) adds a configurable ledbat_min_ssthresh, with default recommendations in docs/architecture/transport/configuration/bandwidth-configuration.md

  2. #2582 (feat(ledbat): add adaptive min_ssthresh with BDP proxy and path change detection) adds an adaptive algorithm that dynamically calculates the floor based on observed path characteristics

He also noted there are throughput benchmarks (not run automatically in CI due to stability/time) that could be tuned/expanded to reproduce these high-BDP path scenarios. Worth considering for future regression prevention.

[AI-assisted - Claude]

@iduartgomez iduartgomez deleted the fix/2578-ledbat-ssthresh-floor branch January 4, 2026 23:41
Development

Successfully merging this pull request may close these issues.

LEDBAT death spiral persists after v0.1.78 - transfers still taking 18s+

3 participants