feat: litt unlock command #1823
Conversation
@@ -136,6 +136,10 @@ type Config struct {
	// Directories do not need to be on the same filesystem.
	LittDBStoragePaths []string

	// If true, then purge LittDB locks on startup. Potentially useful to get rid of zombie lock files,
I didn't quite understand the scenario in which zombie lock files appear. Is it due to some sort of ungraceful termination with containers?
Expanded the documentation here:
// If true, then purge LittDB locks on startup. Potentially useful to get rid of zombie lock files,
// but also dangerous (multiple LittDB processes operating on the same files can lead to data corruption).
//
// When LittDB starts up, it attempts to create lock files. When a validator is forcefully shut down, lock files
// may be left behind. At startup time, if LittDB observes existing lock files, it first checks to see
// if the process that created the lock files is still running. The lock files contain the creator's PID, and so
// LittDB checks to see if there is any process with that PID still running.
//
// Although it should be rare, it's possible that another process may be started with the same PID as the
// PID used to create the lock files. When this happens, LittDB will be prevented from starting up out of
// fear of another process trying to access the same files, even though the original process that created the
// lock files is no longer running. If that happens, this flag is a safe way to force LittDB to start up
// without being blocked by those lock files. BE VERY CERTAIN THAT THE OTHER PROCESS IS ACTUALLY DEAD!
// If two instances of LittDB are running on the same files, it WILL lead to data corruption.
//
// An alternate way to clear the LittDB lock files is via the LittDB CLI with the "litt unlock" command.
// Run "litt unlock --help" for more information.
LittUnsafePurgeLocks bool
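
To make the startup behavior described in that comment concrete, here is a minimal sketch of a PID-based stale-lock check. This is not LittDB's actual implementation; the lock file format (a decimal PID as text) and all names are assumptions for illustration only.

```go
// Package littlock sketches a PID-based stale-lock check similar to the one
// described in the Config documentation above. Illustrative only.
package littlock

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"syscall"
)

// pidIsRunning reports whether a process with the given PID appears to exist.
// On Unix, sending signal 0 probes for existence without actually signalling.
func pidIsRunning(pid int) bool {
	proc, err := os.FindProcess(pid) // always succeeds on Unix
	if err != nil {
		return false
	}
	return proc.Signal(syscall.Signal(0)) == nil
}

// checkLock decides whether it is safe to proceed past an existing lock file.
// purgeLocks corresponds to a flag like LittUnsafePurgeLocks: when set, the
// lock is removed even if a process with the recorded PID appears to be alive
// (the PID-reuse case described above).
func checkLock(lockPath string, purgeLocks bool) error {
	data, err := os.ReadFile(lockPath)
	if os.IsNotExist(err) {
		return nil // no lock, safe to proceed
	}
	if err != nil {
		return err
	}
	pid, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err == nil && pidIsRunning(pid) && !purgeLocks {
		return fmt.Errorf("lock %s is held by a live process with PID %d", lockPath, pid)
	}
	// The owner is gone, the file is unreadable, or the operator explicitly
	// asked to purge locks. Remove the lock and continue startup.
	return os.Remove(lockPath)
}
```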
@@ -267,6 +277,15 @@ func buildCLIParser(logger logging.Logger) *cli.App {
			},
			Action: nil, // syncCommand, // TODO this will be added in a follow up PR
		},
		{
			Name:  "unlock",
			Usage: "Manually delete LittDB lock files. Dangerous if used improperly, use with caution.",
Should we add some sort of confirmation flow where the user has to say "yes", with a --force flag for skipping that flow?
Added. You must either type "I know what I am doing", or include a --force flag.
This change seems like it is attempting to serve two purposes:
- deleting zombie locks that prevent the node from starting
- allowing operators to back up littdb
In the case of 1, is it possible for the operator to just manually delete the zombie lock file?
This change is not needed for operators to back up LittDB. I'm currently merging a series of PRs that add that functionality; this change is tangential to it.
Yes, it's completely possible to do this manually without this tool. The reason I automate the process is that the lock file is buried in the LittDB file structure, and there may be multiple lock files that need to be deleted. If the user configures LittDB to store its data in multiple directories, there may be a lock file in each of them.

A secondary use case for this utility is when an operator can't easily modify the validator's file system. The other day when Daniel and I encountered a zombie lock, it was a pain to find and delete the lock file. We ended up just deleting the entire volume, since it was only preprod data and we don't care about data loss there. If we had had a flag (like the one added by this PR), we could have just redeployed the validator with that flag enabled.
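
To illustrate why finding the locks by hand is tedious, here is a rough sketch of what an unlock utility could do under the hood: walk every configured storage path and remove any lock files it finds. The lock file name ("litt.lock") and all identifiers here are placeholder assumptions, not LittDB's actual on-disk layout or code.

```go
package littcli

import (
	"io/fs"
	"os"
	"path/filepath"
)

// purgeLocks removes every file named lockFileName found under the given
// storage roots and returns the paths that were deleted.
func purgeLocks(storagePaths []string, lockFileName string) ([]string, error) {
	var removed []string
	for _, root := range storagePaths {
		err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
			if err != nil {
				return err
			}
			if !d.IsDir() && d.Name() == lockFileName {
				if err := os.Remove(path); err != nil {
					return err
				}
				removed = append(removed, path)
			}
			return nil
		})
		if err != nil {
			return removed, err
		}
	}
	return removed, nil
}
```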
Signed-off-by: Cody Littley <[email protected]>
Why are these changes needed?
Adds the ability to override LittDB lock files in the event of an emergency.
https://linear.app/eigenlabs/issue/EGDA-1795/flag-to-ignore-littdb-lock-files