chore: snapshots and scaling operations #18
Codecov Report

| Metric | main | #18 | +/- |
|---|---|---|---|
| Coverage | 47.70% | 53.12% | +5.41% |
| Files | 12 | 14 | +2 |
| Lines | 2660 | 3183 | +523 |
| Hits | 1269 | 1691 | +422 |
| Misses | 1306 | 1346 | +40 |
| Partials | 85 | 146 | +61 |
// When the cluster_size changes the operator can use this field to tell all nodes which addresses are to be broadcast
// to clients
repeated string nodesAddresses = 2;
Changing the field numbering in protos is a backwards-incompatible change. If we merge this, we will definitely need to update rudder-server.
Yeah, we just have to make sure keydb is disabled so that old clients won't try anything.
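For context on why renumbering is incompatible: in the protobuf wire format every field is prefixed by a key computed as `(field_number << 3) | wire_type`, so the same value written under a different field number produces different bytes on the wire. The sketch below hand-encodes a single string element to show this. It is only an illustration: the field numbers 1 and 2 and the address value are picked for the example, not taken from the actual before/after state of the proto in this PR.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


LEN = 2  # wire type for length-delimited fields (strings, bytes, messages)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Encode one string element: key varint, length varint, UTF-8 payload."""
    payload = text.encode("utf-8")
    key = (field_number << 3) | LEN
    return encode_varint(key) + encode_varint(len(payload)) + payload


def read_field_number(message: bytes) -> int:
    """Return the field number stored in the first key of an encoded message.

    Assumes the key fits in a single byte (field numbers 1 to 15).
    """
    return message[0] >> 3


# Hypothetical address value, used only for illustration.
old_layout = encode_string_field(1, "node-0:50051")
new_layout = encode_string_field(2, "node-0:50051")
```

The two encodings differ only in their first byte (`0x0a` for field 1, `0x12` for field 2). A client compiled against the old numbering would treat the renumbered field as unknown, or decode it into whatever field previously held that number, which is why old clients need to be kept away until they are updated.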
Description
This PR adds snapshotting functionality and makes the operator smarter during scaling operations, minimizing the amount of data that needs to be moved around.
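To make "only move what needs to move" concrete, here is a minimal sketch of how a movement plan could be computed. It assumes, purely for illustration, that a hash range is owned by node `hash_range % cluster_size`; the real assignment scheme is not shown in this description, and the function names `plan_movements` and `group_by_node` are hypothetical.

```python
def plan_movements(total_hash_ranges: int, old_size: int, new_size: int):
    """List (hash_range, source_node, destination_node) for ranges that change owner.

    ASSUMPTION: ownership is `hash_range % cluster_size`. This is an
    illustrative placement rule, not necessarily the one the operator uses.
    """
    if old_size <= 0 or new_size <= 0:
        raise ValueError("cluster sizes must be positive")
    movements = []
    for hash_range in range(total_hash_ranges):
        source = hash_range % old_size
        destination = hash_range % new_size
        if source != destination:
            # Only ranges whose owner changes need a snapshot upload/download.
            movements.append((hash_range, source, destination))
    return movements


def group_by_node(movements):
    """Split a plan into per-node upload and download lists."""
    uploads, downloads = {}, {}
    for hash_range, source, destination in movements:
        uploads.setdefault(source, []).append(hash_range)
        downloads.setdefault(destination, []).append(hash_range)
    return uploads, downloads


# Scale up from 2 to 3 nodes with 8 hash ranges (numbers chosen for the example).
plan = plan_movements(8, 2, 3)
uploads, downloads = group_by_node(plan)
```

Under this assumed placement rule, 4 of the 8 hash ranges keep their owner, so only the other 4 need a snapshot created on the source node and loaded on the destination node.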
EXAMPLE OF A SCALING OPERATION
- Call `/hashRangeMovements` to better plan for a scale operation, whether a scale down or a scale up.
- The `/hashRangeMovements` endpoint will depict which hash ranges will have to be moved, showing which node will have to upload and which will have to download the hash range snapshots.
- Call `/hashRangeMovements` again with `upload=true`.
- Update the `StatefulSet` variables that have the information on the new cluster size (mounted via `ConfigMap` into a file so that it doesn't trigger a rolling restart of the old pods).
  - Make sure the `ConfigMap` mounts on a file and doesn't define any new environment variable, otherwise it will trigger a restart of the old nodes.
  - Because the new cluster size lives in the `ConfigMap`, the old pods will have the new cluster size upon restart if they were to crash.
  - If a crash before `Scale` + `ScaleComplete` were to happen, the old node might propagate the new cluster size to the client upon restart. This means that the client might call a new node that has not loaded a snapshot yet, leading to some duplicates. If we don't want such duplicates, we might have to use badger to store the state of the scaling operations.
- Scale up (`ScaleUp`). Once the new pods are up and running, the operator sends a `LoadSnapshots` to the new pods.
- Call `/autoScale`.
- `/autoScale` will have the new nodes download the snapshots first, then it will trigger a `Scale` + `ScaleComplete` to finalize the scaling operation.
CHANGELOG
- … (`since`) from S3 to populate them during startup so that they are not lost after a node restart
- Allow `CreateSnapshots` to only create snapshots of a selected set of hash ranges (instead of creating snapshots of all the hash ranges managed by that node)
- Extend `CreateSnapshots` so that a new snapshot that contains everything is created (old snapshots, e.g. from incremental updates, need to be removed after a "full sync")
- Allow `LoadSnapshots` to only download a selected set of hash ranges (instead of downloading all the snapshots managed by that node)
- Change `CreateSnapshots` and `LoadSnapshots` to single node calls (add `nodeID` parameter)
- Change `Scale` and `ScaleComplete` to just send a single gRPC request to a single node
- Add an `AutoScale` method that can be used to orchestrate a scaling operation. `AutoScale` will automatically make use of `CreateSnapshots`, `LoadSnapshots`, `Scale` and `ScaleComplete` calls to bring the cluster to the desired state (and only move the required snapshots)
- Add a `/hashRangeMovements` endpoint in the operator to preview hash range movements to better plan scale operations
- … `Scale`
OUTSTANDING TODOs
- … `ConfigMap` file
TO BE DISCUSSED (WIP)
Linear Ticket
< Linear_Link >
Security