Skip to content

Conversation

pschork
Copy link
Contributor

@pschork pschork commented Jan 17, 2025

Current reachability check only confirms that the port is open.

This new logic performs additional validation using grpc reflection to ensure we do not mislead operators when there are configuration issues preventing us from connecting to the node.

If port is open, we now verify the expected service is available based on port specification using grpc reflection. This will catch situations where port is open, but service backend is configured wrong or unreachable.

We now include a status message for dispersal and retrieval checks to help operators debug any reachability check failures.

all services available

{
  "operator_id": "39638352822b2eecf33f2ea0e6893a576edbcc202b4925c3ab2875677945bf58",
  "dispersal_socket": "98.84.160.141:32007",
  "retrieval_socket": "98.84.160.141:32008",
  "dispersal_online": true,
  "retrieval_online": true,
  "dispersal_status": "node.Dispersal is available",
  "retrieval_status": "node.Retrieval is available"
}

ports closed or unreachable

{
  "operator_id": "39638352822b2eecf33f2ea0e6893a576edbcc202b4925c3ab2875677945bf58",
  "dispersal_socket": "98.84.160.141:32002",
  "retrieval_socket": "98.84.160.141:32001",
  "dispersal_online": false,
  "retrieval_online": false,
  "dispersal_status": "port closed or unreachable for 98.84.160.141:32002",
  "retrieval_status": "port closed or unreachable for 98.84.160.141:32001"
}

port open but grpc fails to connect

previously this would return dispersal_online: true

{
  "operator_id": "39638352822b2eecf33f2ea0e6893a576edbcc202b4925c3ab2875677945bf58",
  "dispersal_socket": "98.84.160.141:32001",
  "retrieval_socket": "98.84.160.141:32008",
  "dispersal_online": false,
  "retrieval_online": true,
  "dispersal_status": "rpc error: code = DeadlineExceeded desc = received context error while waiting for new LB policy update: context deadline exceeded",
  "retrieval_status": "available"
}

ports open but serving wrong grpc services

previously this would return dispersal_online: true and retrieval_online: true

{
  "operator_id": "39638352822b2eecf33f2ea0e6893a576edbcc202b4925c3ab2875677945bf58",
  "dispersal_socket": "98.84.160.141:32007",
  "retrieval_socket": "98.84.160.141:32008",
  "dispersal_online": false,
  "retrieval_online": false,
  "dispersal_status": "grpc available but node.Dispersal service not found at 98.84.160.141:32007",
  "retrieval_status": "grpc available but node.Retrieval service not found at 98.84.160.141:32008"
}

Checks

  • I've made sure the tests are passing. Note that there might be a few flaky tests, in that case, please comment that they are not relevant.
  • I've checked the new test coverage and the coverage percentage didn't drop.
  • Testing Strategy
    • Unit tests
    • Integration tests
    • This PR is not tested :(

@pschork pschork force-pushed the pschork/reachability_v2 branch 3 times, most recently from d520923 to c266b48 Compare January 17, 2025 22:22
@pschork pschork requested a review from ian-shim January 17, 2025 22:23
@pschork pschork marked this pull request as ready for review January 17, 2025 22:24
Copy link
Contributor

@ian-shim ian-shim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, can you help with integrating v2 dispersal port into reachability check?

Reachability check first checks if port is open

If port is open, verify the expected service is available based on port specification using grpc reflection
This will catch situations where port is open, but service backend is configured wrong or unreachable.
@pschork pschork force-pushed the pschork/reachability_v2 branch from c266b48 to 0daae8e Compare January 22, 2025 01:55
@pschork pschork merged commit 10f0219 into master Jan 22, 2025
9 checks passed
@pschork pschork deleted the pschork/reachability_v2 branch March 6, 2025 07:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants