Skip to content

Conversation

cody-littley
Copy link
Contributor

Why are these changes needed?

For both the relays and API servers, enforce a maximum connection age. The goal of this change is to ensure that clients aren't stuck talking to a small subset of relays/API servers after the number of those servers has been increased.

I will not merge this PR until it has been tested on preprod.

@cody-littley cody-littley requested review from dmanc and litt3 July 28, 2025 16:42
@cody-littley cody-littley self-assigned this Jul 28, 2025
Copy link

github-actions bot commented Jul 28, 2025

The latest Buf updates on your PR. Results from workflow Buf Proto / buf (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed✅ passed✅ passed✅ passedJul 29, 2025, 6:32 PM

Copy link

claude bot commented Jul 29, 2025

Claude finished @litt3's task —— View job


Nitpicker PR Review

Used the nitpicker subagent to review this PR. Here is the verbatim output:

Todo List:

  • Launch nitpicker subagent to review PR changes
  • Update comment with verbatim output from nitpicker subagent

Now I have a complete picture of the changes made. Let me provide a comprehensive code review:

Code Review: Max Connection Age Feature

Overview

This PR implements max connection age configuration for both relay servers and API servers to prevent clients from being stuck with a small subset of servers after scaling up. The implementation uses gRPC's built-in keepalive functionality.

Detailed Findings

1. Code Correctness and Potential Bugs

✅ Generally correct implementation - The code properly integrates gRPC keepalive parameters.

⚠️ Potential Issues:

  1. Missing validation in config parsing (disperser/cmd/apiserver/lib/config.go:103-105, relay/cmd/lib/config.go:104-106):

    • No validation that MaxConnectionAge > MaxConnectionAgeGrace
    • No validation for negative durations
    • Consider adding validation like:
    if config.MaxConnectionAge > 0 && config.MaxConnectionAgeGrace >= config.MaxConnectionAge {
        return Config{}, fmt.Errorf("MaxConnectionAgeGrace must be less than MaxConnectionAge")
    }
  2. Inconsistent zero value handling: The flags documentation states "If zero, then the server will not close connections based on age" but there's no explicit validation that this behavior is correct with gRPC keepalive implementation.

2. Security Considerations

✅ No immediate security issues identified

⚠️ Considerations:

  • Connection churning could potentially be used in DoS attacks if configured with very short durations
  • The current defaults (5 minutes max age, 30 second grace) seem reasonable for preventing abuse

3. Performance Implications

✅ Good performance design

⚠️ Considerations:

  1. Connection overhead: Frequent connection cycling will increase CPU/memory usage due to TLS handshakes and connection setup
  2. Default values analysis:
    • API Server: 5min max age, 30s grace, 1min idle - reasonable for API usage patterns
    • Relay Server: Same values - may need tuning based on chunk transfer patterns
  3. No metrics: Consider adding metrics to track connection churn rates

4. Configuration Management

✅ Consistent configuration pattern across both services

⚠️ Issues:

  1. Environment variable naming inconsistency (disperser/cmd/apiserver/flags/flags.go:54, relay/cmd/flags/flags.go:337):

    • Uses _SECONDS suffix in env var names but the values are parsed as time.Duration
    • This could confuse operators who might expect to specify raw seconds vs duration strings
    • Consider changing to MAX_CONNECTION_AGE or documenting the expected format
  2. Flag placement: In disperser flags, the new flags are inserted mid-array rather than at the end, which could affect flag ordering in help output

5. Code Style and Maintainability

✅ Good consistency between relay and disperser implementations

⚠️ Minor issues:

  1. Import organization (disperser/apiserver/server_v2.go): The diff shows import reordering that seems unrelated to the feature
  2. Comment typo (disperser/cmd/apiserver/flags/flags.go:75): There's a "// sigh" comment that seems unprofessional
  3. Documentation inconsistency: Some comments refer to MaxConnectionAgeSeconds but the field is MaxConnectionAge

6. Error Handling

✅ No new error paths introduced - uses existing gRPC error handling

7. Testing Considerations

❌ Critical Gap - No Tests

The PR lacks:

  1. Unit tests for configuration parsing and validation
  2. Integration tests to verify keepalive behavior
  3. Tests for edge cases (zero values, very small durations)
  4. Tests to verify the intended behavior of preventing client stickiness

Recommended test additions:

func TestMaxConnectionAgeConfig(t *testing.T) {
    // Test zero values
    // Test negative values  
    // Test grace period >= max age
    // Test successful parsing
}

func TestKeepaliveConfiguration(t *testing.T) {
    // Integration test to verify gRPC server accepts keepalive params
}

8. Documentation Needs

⚠️ Missing documentation:

  1. No update to configuration documentation explaining the new flags
  2. No explanation of the intended behavior and use case
  3. Operational considerations (how to tune these values) not documented

9. Architecture and Consistency

✅ Excellent consistency - Both relay and disperser implementations follow identical patterns:

  • Same flag names and defaults
  • Same configuration structure
  • Same gRPC keepalive integration approach

10. Specific Code Issues

disperser/server_config.go:20-21: Comment refers to "MaxConnectionAgeSeconds" but field is named "MaxConnectionAge"

relay/config.go:83: Same comment inconsistency

Overall Assessment

Strengths:

  • Clean, consistent implementation across both services
  • Proper use of gRPC keepalive functionality
  • Reasonable default values
  • Follows existing code patterns well

Critical Issues to Address:

  1. Add configuration validation
  2. Add comprehensive tests
  3. Fix environment variable naming/documentation
  4. Add operational documentation

Recommendation: The implementation is functionally sound but needs testing and validation improvements before merge. The core functionality appears correct for the stated goal of preventing client stickiness to scaled-down server pools.

Risk Level: Medium - mostly due to lack of tests and validation, not fundamental implementation issues.


litt3
litt3 previously approved these changes Jul 29, 2025
Copy link
Contributor

@litt3 litt3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

dmanc
dmanc previously approved these changes Jul 29, 2025
@cody-littley cody-littley dismissed stale reviews from dmanc and litt3 via 145a958 July 29, 2025 18:31
@cody-littley cody-littley added this pull request to the merge queue Jul 30, 2025
Merged via the queue into master with commit 0b01ca5 Jul 30, 2025
45 of 46 checks passed
@cody-littley cody-littley deleted the max-connection-age branch July 30, 2025 13:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants