
Conversation

@fracasula (Collaborator) commented Aug 5, 2025

Description

This PR adds functionality around snapshotting and makes the operator smarter during scaling operations, so as to minimize the amount of data that needs to be moved around.

EXAMPLE OF A SCALING OPERATION

  1. The developer can call /hashRangeMovements to plan a scaling operation, whether a scale down or a scale up.
    • The /hashRangeMovements endpoint shows which hash ranges will have to be moved, indicating which node has to upload and which has to download each hash range snapshot.
  2. Once the developer validates the scaling plan, they can call /hashRangeMovements again with upload=true.
    • This asks the old nodes to upload the snapshots of the hash ranges that have to be moved.
  3. The developer merges a devops PR that updates the StatefulSet variables carrying the new cluster size (mounted via ConfigMap into a file so that the change doesn't trigger a rolling restart of the old pods).
    • It is imperative that the ConfigMap is mounted as a file and doesn't define any new environment variables, otherwise it will trigger a restart of the old nodes.
  4. The merged devops PR triggers an update of the ConfigMap, which means that the old pods will pick up the new cluster size if they were to crash and restart.
    • In the scenario where an old node crashes before Scale + ScaleComplete have happened, the old node might propagate the new cluster size to the client upon restart. The client might then call a new node that hasn't loaded a snapshot yet, leading to some duplicates. If we don't want such duplicates we might have to use badger to store the state of the scaling operations.
  5. The merged devops PR might also trigger the creation of new pods (e.g. a scale up). Once the new pods are up and running, the operator sends a LoadSnapshots to the new pods.
  6. The developer can now call /autoScale (a sketch of the full flow follows this list).
    • /autoScale has the new nodes download the snapshots first, then triggers a Scale + ScaleComplete to finalize the scaling operation.
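A minimal sketch of the developer-facing flow above, using plain HTTP calls against the operator API. The endpoint paths (/hashRangeMovements, /autoScale) and the upload=true parameter come from this description; the operator address, port and response format are assumptions made purely for illustration.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

const operatorURL = "http://keydb-operator:8080" // hypothetical address

func call(path string) string {
	resp, err := http.Get(operatorURL + path)
	if err != nil {
		log.Fatalf("calling %s: %v", path, err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	return string(body)
}

func main() {
	// 1. Preview which hash ranges would move (no data is uploaded yet).
	fmt.Println(call("/hashRangeMovements"))

	// 2. Once the plan looks good, ask the old nodes to upload the snapshots
	//    of the hash ranges that will move.
	fmt.Println(call("/hashRangeMovements?upload=true"))

	// 3./4./5. Merge the devops PR (ConfigMap update, new pods) out of band.

	// 6. Finalize: new nodes download the snapshots, then Scale + ScaleComplete.
	fmt.Println(call("/autoScale"))
}
```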

CHANGELOG

  • fix "snapshot already in progress" error after ctx cancellation
  • use one stream per hash range (badger optimization)
  • read file timestamps (i.e. since) from S3 to populate them during startup so that they are not lost after a node restart
  • add support to CreateSnapshots to only create snapshots of a selected set of hash ranges (instead of creating snapshots of all the hash ranges managed by that node)
  • add "full sync" to CreateSnapshots so that a new snapshot that contains everything is created (old snapshots e.g. from incremental updates need to be removed after a "full sync")
  • add option to LoadSnapshots to only download a selected set of hash ranges (instead of downloading all the snapshots managed by that node)
  • transform CreateSnapshots and LoadSnapshots to single node calls (add nodeID parameter)
  • split client into simple client + operator
  • change client/operator Scale and ScaleComplete to just send a single gRPC request to a single node
  • make it so that we have maximum control over the cluster state via the operator API
  • add a client/operator AutoScale method that can be used to orchestrate a scaling operation. AutoScale automatically makes use of CreateSnapshots, LoadSnapshots, Scale and ScaleComplete calls to bring the cluster to the desired state, and only moves the required snapshots (see the sketch after this list)
  • add a /hashRangeMovements endpoint to the operator to preview hash range movements and better plan scaling operations
  • make the operator revert scaling operations if one of the nodes returns an error during Scale
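A rough sketch of how AutoScale could sequence the calls listed above, including the revert on a failed Scale. The method names (CreateSnapshots, LoadSnapshots, Scale, ScaleComplete) come from this changelog; the signatures, the movement type and the error handling are simplified assumptions, not the actual implementation.

```go
package operator

import (
	"context"
	"fmt"
)

// movement says which node has to upload and which has to download a hash range.
type movement struct {
	HashRange uint32
	From, To  string // node IDs
}

type nodeClient interface {
	CreateSnapshots(ctx context.Context, nodeID string, hashRanges []uint32) error
	LoadSnapshots(ctx context.Context, nodeID string, hashRanges []uint32) error
	Scale(ctx context.Context, nodeID string, clusterSize int) error
	ScaleComplete(ctx context.Context, nodeID string) error
}

// AutoScale moves only the hash ranges that change owner, then finalizes the
// scaling operation. If Scale fails on any node, the nodes that already scaled
// are reverted to the old cluster size.
func AutoScale(ctx context.Context, c nodeClient, moves []movement, nodes []string, oldSize, newSize int) error {
	// Group the hash ranges each source node has to upload and each target node
	// has to download.
	uploads, downloads := map[string][]uint32{}, map[string][]uint32{}
	for _, m := range moves {
		uploads[m.From] = append(uploads[m.From], m.HashRange)
		downloads[m.To] = append(downloads[m.To], m.HashRange)
	}
	for nodeID, hrs := range uploads {
		if err := c.CreateSnapshots(ctx, nodeID, hrs); err != nil {
			return fmt.Errorf("create snapshots on %s: %w", nodeID, err)
		}
	}
	for nodeID, hrs := range downloads {
		if err := c.LoadSnapshots(ctx, nodeID, hrs); err != nil {
			return fmt.Errorf("load snapshots on %s: %w", nodeID, err)
		}
	}
	for i, nodeID := range nodes {
		if err := c.Scale(ctx, nodeID, newSize); err != nil {
			for _, done := range nodes[:i] { // revert nodes that already scaled
				_ = c.Scale(ctx, done, oldSize)
			}
			return fmt.Errorf("scale on %s: %w", nodeID, err)
		}
	}
	for _, nodeID := range nodes {
		if err := c.ScaleComplete(ctx, nodeID); err != nil {
			return fmt.Errorf("scale complete on %s: %w", nodeID, err)
		}
	}
	return nil
}
```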

OUTSTANDING TODOs

  • devops changes so that keydb reads from the ConfigMap file

TO BE DISCUSSED (WIP)

  • restore daily automatic snapshots: one snapshot a day in "full sync" mode should be created so that when a scaling operation happens we can do an incremental snapshot on top of it
    • we should try to make the daily snapshot light, e.g. snapshot one hash range with a single routine, wait some interval, then proceed to the next (see the sketch below)
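A minimal sketch of the "light" daily snapshot idea: a full-sync snapshot for one hash range at a time, with a pause between ranges. The snapshot callback and the pause interval are placeholders; this is hypothetical and only illustrates the throttling pattern being discussed.

```go
package snapshots

import (
	"context"
	"time"
)

// DailyFullSync walks the node's hash ranges sequentially, creating a full-sync
// snapshot for one range at a time and waiting between ranges to keep the load
// on the node low.
func DailyFullSync(
	ctx context.Context,
	hashRanges []uint32,
	createFullSnapshot func(ctx context.Context, hashRange uint32) error,
	pause time.Duration,
) error {
	for _, hr := range hashRanges {
		if err := createFullSnapshot(ctx, hr); err != nil {
			return err
		}
		select {
		case <-time.After(pause): // throttle: wait before the next hash range
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}
```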

Linear Ticket

< Linear_Link >

Security

  • The code changed/added as part of this pull request won't create any security issues with how the software is being used.

@fracasula fracasula changed the title fix: snapshot already in progress chore: snapshotting improvements Aug 5, 2025
@fracasula fracasula changed the title chore: snapshotting improvements chore: snapshots improvements Aug 5, 2025
codecov bot commented Aug 5, 2025

Codecov Report

❌ Patch coverage is 64.34109% with 322 lines in your changes missing coverage. Please review.
✅ Project coverage is 53.12%. Comparing base (47a9f62) to head (b4d0e6f).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
cmd/operator/server.go 57.61% 89 Missing and 39 partials ⚠️
internal/operator/operator.go 61.11% 89 Missing and 16 partials ⚠️
node/node.go 82.97% 14 Missing and 10 partials ⚠️
internal/hash/hash.go 68.85% 9 Missing and 10 partials ⚠️
cmd/operator/main.go 0.00% 16 Missing ⚠️
internal/cache/badger/badger.go 77.94% 11 Missing and 4 partials ⚠️
proto/keydb.pb.go 0.00% 15 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #18      +/-   ##
==========================================
+ Coverage   47.70%   53.12%   +5.41%     
==========================================
  Files          12       14       +2     
  Lines        2660     3183     +523     
==========================================
+ Hits         1269     1691     +422     
- Misses       1306     1346      +40     
- Partials       85      146      +61     

☔ View full report in Codecov by Sentry.

@fracasula fracasula changed the title chore: snapshots improvements chore: snapshots and scaling operations Aug 6, 2025
@fracasula fracasula force-pushed the snapshotting branch 2 times, most recently from 09ef119 to 78919c2 on August 8, 2025 12:52
@fracasula fracasula marked this pull request as ready for review August 8, 2025 16:42
// When the cluster_size changes the operator can use this field to tell all nodes which addresses are to be broadcast
// to clients
repeated string nodesAddresses = 2;
@mihir20 (Contributor) commented Aug 13, 2025
Changing the numbering in protos is a backwards-incompatible change. If we merge this, we definitely need to update rudder-server.

@fracasula (Collaborator, Author) replied:
Yeah we just have to make sure keydb is disabled so that old clients won't try anything.

@fracasula fracasula merged commit e37dab9 into main Aug 18, 2025
17 checks passed
@fracasula fracasula deleted the snapshotting branch August 18, 2025 08:25