
Conversation

@fracasula (Collaborator) commented Aug 5, 2025

Description

This PR adds functionality around snapshotting and makes the operator smarter during scaling operations, so as to minimize the amount of data that needs to be moved around.

EXAMPLE OF A SCALING OPERATION

  1. The developer can call /hashRangeMovements to plan a scaling operation, whether a scale down or a scale up.
    • The /hashRangeMovements endpoint shows which hash ranges will have to be moved, indicating which node has to upload and which has to download each hash range snapshot.
  2. Once the developer validates the scaling plan, they can call /hashRangeMovements again with upload=true.
    • This asks the old nodes to upload the snapshots of the hash ranges that have to be moved.
  3. The developer merges a devops PR that updates the StatefulSet variables carrying the new cluster size (mounted via ConfigMap into a file so that the change doesn't trigger a rolling restart of the old pods).
    • It is imperative that the ConfigMap is mounted as a file and doesn't define any new environment variables, otherwise it will trigger a restart of the old nodes.
  4. The merged devops PR triggers an update of the ConfigMap, which means that the old pods will pick up the new cluster size if they were to crash and restart.
    • In the scenario where an old node crashes before Scale + ScaleComplete have happened, the old node might propagate the new cluster size to the client upon restart. The client might then call a new node that hasn't loaded a snapshot yet, leading to some duplicates. If we don't want such duplicates we might have to use badger to store the state of the scaling operations.
  5. The merged devops PR might also trigger the creation of new pods (e.g. a scale up). Once the new pods are up and running, the operator sends a LoadSnapshots to the new pods.
  6. The developer can now call /autoScale (a sketch of the full flow follows this list).
    • /autoScale has the new nodes download the snapshots first, then triggers a Scale + ScaleComplete to finalize the scaling operation.
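A minimal sketch of the developer-facing flow above, using plain HTTP calls against the operator API. The endpoint paths (/hashRangeMovements, /autoScale) and the upload=true parameter come from this description; the operator address, port and response format are assumptions made purely for illustration.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

const operatorURL = "http://keydb-operator:8080" // hypothetical address

func call(path string) string {
	resp, err := http.Get(operatorURL + path)
	if err != nil {
		log.Fatalf("calling %s: %v", path, err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	return string(body)
}

func main() {
	// 1. Preview which hash ranges would move (no data is uploaded yet).
	fmt.Println(call("/hashRangeMovements"))

	// 2. Once the plan looks good, ask the old nodes to upload the snapshots
	//    of the hash ranges that will move.
	fmt.Println(call("/hashRangeMovements?upload=true"))

	// 3./4./5. Merge the devops PR (ConfigMap update, new pods) out of band.

	// 6. Finalize: new nodes download the snapshots, then Scale + ScaleComplete.
	fmt.Println(call("/autoScale"))
}
```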

CHANGELOG

  • fix "snapshot already in progress" error after ctx cancellation
  • use one stream per hash range (badger optimization)
  • read file timestamps (i.e. since) from S3 to populate them during startup so that they are not lost after a node restart
  • add support to CreateSnapshots to only create snapshots of a selected set of hash ranges (instead of creating snapshots of all the hash ranges managed by that node)
  • add "full sync" to CreateSnapshots so that a new snapshot that contains everything is created (old snapshots e.g. from incremental updates need to be removed after a "full sync")
  • add option to LoadSnapshots to only download a selected set of hash ranges (instead of downloading all the snapshots managed by that node)
  • transform CreateSnapshots and LoadSnapshots to single node calls (add nodeID parameter)
  • split client into simple client + operator
  • change client/operator Scale and ScaleComplete to just send a single gRPC request to a single node
  • make it so that we have maximum control over the cluster state via the operator API
  • add a client/operator AutoScale method that can be used to orchestrate a scaling operation. AutoScale automatically makes use of CreateSnapshots, LoadSnapshots, Scale and ScaleComplete calls to bring the cluster to the desired state, and only moves the required snapshots (see the sketch after this list)
  • add a /hashRangeMovements endpoint to the operator to preview hash range movements and better plan scaling operations
  • make the operator revert scaling operations if one of the nodes returns an error during Scale
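A rough sketch of how AutoScale could sequence the calls listed above, including the revert on a failed Scale. The method names (CreateSnapshots, LoadSnapshots, Scale, ScaleComplete) come from this changelog; the signatures, the movement type and the error handling are simplified assumptions, not the actual implementation.

```go
package operator

import (
	"context"
	"fmt"
)

// movement says which node has to upload and which has to download a hash range.
type movement struct {
	HashRange uint32
	From, To  string // node IDs
}

type nodeClient interface {
	CreateSnapshots(ctx context.Context, nodeID string, hashRanges []uint32) error
	LoadSnapshots(ctx context.Context, nodeID string, hashRanges []uint32) error
	Scale(ctx context.Context, nodeID string, clusterSize int) error
	ScaleComplete(ctx context.Context, nodeID string) error
}

// AutoScale moves only the hash ranges that change owner, then finalizes the
// scaling operation. If Scale fails on any node, the nodes that already scaled
// are reverted to the old cluster size.
func AutoScale(ctx context.Context, c nodeClient, moves []movement, nodes []string, oldSize, newSize int) error {
	// Group the hash ranges each source node has to upload and each target node
	// has to download.
	uploads, downloads := map[string][]uint32{}, map[string][]uint32{}
	for _, m := range moves {
		uploads[m.From] = append(uploads[m.From], m.HashRange)
		downloads[m.To] = append(downloads[m.To], m.HashRange)
	}
	for nodeID, hrs := range uploads {
		if err := c.CreateSnapshots(ctx, nodeID, hrs); err != nil {
			return fmt.Errorf("create snapshots on %s: %w", nodeID, err)
		}
	}
	for nodeID, hrs := range downloads {
		if err := c.LoadSnapshots(ctx, nodeID, hrs); err != nil {
			return fmt.Errorf("load snapshots on %s: %w", nodeID, err)
		}
	}
	for i, nodeID := range nodes {
		if err := c.Scale(ctx, nodeID, newSize); err != nil {
			for _, done := range nodes[:i] { // revert nodes that already scaled
				_ = c.Scale(ctx, done, oldSize)
			}
			return fmt.Errorf("scale on %s: %w", nodeID, err)
		}
	}
	for _, nodeID := range nodes {
		if err := c.ScaleComplete(ctx, nodeID); err != nil {
			return fmt.Errorf("scale complete on %s: %w", nodeID, err)
		}
	}
	return nil
}
```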

OUTSTANDING TODOs

  • devops changes so that keydb reads from the ConfigMap file

TO BE DISCUSSED (WIP)

  • restore daily automatic snapshots: one snapshot a day in "full sync" mode should be created so that when a scaling operation happens we can do an incremental snapshot on top of it
    • we should try to make the daily snapshot light, e.g. snapshot one hash range with a single routine, wait some interval, then proceed to the next (see the sketch below)
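A minimal sketch of the "light" daily snapshot idea: a full-sync snapshot for one hash range at a time, with a pause between ranges. The snapshot callback and the pause interval are placeholders; this is hypothetical and only illustrates the throttling pattern being discussed.

```go
package snapshots

import (
	"context"
	"time"
)

// DailyFullSync walks the node's hash ranges sequentially, creating a full-sync
// snapshot for one range at a time and waiting between ranges to keep the load
// on the node low.
func DailyFullSync(
	ctx context.Context,
	hashRanges []uint32,
	createFullSnapshot func(ctx context.Context, hashRange uint32) error,
	pause time.Duration,
) error {
	for _, hr := range hashRanges {
		if err := createFullSnapshot(ctx, hr); err != nil {
			return err
		}
		select {
		case <-time.After(pause): // throttle: wait before the next hash range
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}
```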

Linear Ticket

< Linear_Link >

Security

  • The code changed/added as part of this pull request won't create any security issues with how the software is being used.

@fracasula fracasula changed the title fix: snapshot already in progress chore: snapshotting improvements Aug 5, 2025
@fracasula fracasula changed the title chore: snapshotting improvements chore: snapshots improvements Aug 5, 2025
codecov bot commented Aug 5, 2025

Codecov Report

❌ Patch coverage is 64.34109% with 322 lines in your changes missing coverage. Please review.
✅ Project coverage is 53.12%. Comparing base (47a9f62) to head (b4d0e6f).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
cmd/operator/server.go 57.61% 89 Missing and 39 partials ⚠️
internal/operator/operator.go 61.11% 89 Missing and 16 partials ⚠️
node/node.go 82.97% 14 Missing and 10 partials ⚠️
internal/hash/hash.go 68.85% 9 Missing and 10 partials ⚠️
cmd/operator/main.go 0.00% 16 Missing ⚠️
internal/cache/badger/badger.go 77.94% 11 Missing and 4 partials ⚠️
proto/keydb.pb.go 0.00% 15 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #18      +/-   ##
==========================================
+ Coverage   47.70%   53.12%   +5.41%     
==========================================
  Files          12       14       +2     
  Lines        2660     3183     +523     
==========================================
+ Hits         1269     1691     +422     
- Misses       1306     1346      +40     
- Partials       85      146      +61     

☔ View full report in Codecov by Sentry.

@fracasula fracasula changed the title chore: snapshots improvements chore: snapshots and scaling operations Aug 6, 2025
@fracasula fracasula force-pushed the snapshotting branch 2 times, most recently from 09ef119 to 78919c2 on August 8, 2025 12:52
@fracasula fracasula marked this pull request as ready for review August 8, 2025 16:42
// When the cluster_size changes the operator can use this field to tell all nodes which addresses are to be broadcast
// to clients
repeated string nodesAddresses = 2;
@mihir20 (Contributor) commented Aug 13, 2025
Changing the numbering in protos is a backwards-incompatible change. If we merge this, we definitely need to update rudder-server.

@fracasula (Collaborator, Author) replied:
Yeah we just have to make sure keydb is disabled so that old clients won't try anything.

@fracasula fracasula merged commit e37dab9 into main Aug 18, 2025
17 checks passed
@fracasula fracasula deleted the snapshotting branch August 18, 2025 08:25