@djeebus
Contributor

@djeebus djeebus commented Oct 16, 2025

This allows us to test and standardize how we start, monitor, cancel, and clean up our goroutines. This PR only adds it to orchestrator, but we should add this to api, client-proxy, and the rest as well.


Note

Adds a shared supervisor to manage background tasks and cleanup, and migrates orchestrator to use it instead of custom errgroup/signal/closer logic.

  • Shared Library:
    • packages/shared/pkg/supervisor: New task supervisor with Task, Options, and Run to start background tasks, handle OS signals, cancel via context, and run reverse-order cleanups; includes tests.
  • Orchestrator:
    • packages/orchestrator/main.go: Replace errgroup/signal/closer scaffolding with a []supervisor.Task for telemetry, loggers (with cleanupLogger), feature flags, limiter, Redis/pubsub, sandbox observer, proxy, NBD device pool, network pool, hyperloop server, cmux, HTTP, and gRPC servers, plus drain steps; invoke supervisor.Run for lifecycle and error handling.
    • Minor cleanup of unused imports and logs.
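
For readers who have not opened the diff, here is a minimal sketch of the shape the note describes. It is not the code in packages/shared/pkg/supervisor. `Name`, `Cleanup` and `Run` appear in the excerpts in this conversation; the `Background` field name, the simplified `Run` signature (no `Options`, logger, signal handling or `ForceStop`) and everything in `demo` are assumptions made for illustration only.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Task is a hypothetical stand-in for supervisor.Task.
// The Background field name is an assumption; Name and Cleanup appear in the PR.
type Task struct {
	Name       string
	Background func(ctx context.Context) error
	Cleanup    func(ctx context.Context) error
}

type result struct {
	name string
	err  error
}

// Run starts every Background function, waits until one of them returns or ctx
// is cancelled, stops the rest, then runs the cleanups in reverse order.
func Run(ctx context.Context, tasks []Task) error {
	runCtx, cancel := context.WithCancel(ctx)
	defer cancel()

	done := make(chan result, len(tasks)) // buffered so senders never block
	started := 0
	for _, t := range tasks {
		if t.Background == nil {
			continue
		}
		started++
		go func(t Task) {
			done <- result{name: t.Name, err: t.Background(runCtx)}
		}(t)
	}

	var errs []error
	received := 0
	select {
	case r := <-done:
		received++
		if r.err != nil {
			errs = append(errs, fmt.Errorf("task %q: %w", r.name, r.err))
		}
	case <-ctx.Done():
	}

	cancel() // ask the remaining background tasks to stop, then wait for them
	for ; received < started; received++ {
		if r := <-done; r.err != nil && !errors.Is(r.err, context.Canceled) {
			errs = append(errs, fmt.Errorf("task %q: %w", r.name, r.err))
		}
	}

	// Cleanups get their own context so they are not cancelled with the tasks.
	cleanupCtx := context.Background()
	for i := len(tasks) - 1; i >= 0; i-- {
		if tasks[i].Cleanup == nil {
			continue
		}
		if err := tasks[i].Cleanup(cleanupCtx); err != nil {
			errs = append(errs, fmt.Errorf("cleanup %q: %w", tasks[i].Name, err))
		}
	}
	return errors.Join(errs...)
}

// demo registers three made-up tasks and reports the order their cleanups ran in.
func demo() ([]string, error) {
	var order []string
	record := func(name string) func(context.Context) error {
		return func(context.Context) error { order = append(order, name); return nil }
	}
	tasks := []Task{
		{Name: "logger", Cleanup: record("logger")},
		{Name: "worker", Background: func(ctx context.Context) error { <-ctx.Done(); return ctx.Err() }},
		{Name: "server", Background: func(context.Context) error { return errors.New("listener closed") }, Cleanup: record("server")},
	}
	err := Run(context.Background(), tasks)
	return order, err
}

func main() {
	order, err := demo()
	fmt.Println(order, err)
}
```

Registering the logger first means its cleanup runs last in this sketch, which is one way a supervisor could keep logging while the other tasks shut down.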

Written by Cursor Bugbot for commit f09069b.

cursor[bot]

This comment was marked as outdated.

tasks = append(tasks, supervisor.Task{
	Name:    "template manager drain",
	Cleanup: tmpl.Wait,
})

Bug: Reverse Cleanup Order Causes Shutdown Delays

The supervisor package runs cleanup tasks in reverse order of addition, which inadvertently reverses the intended shutdown sequence. The template manager drain now executes before the orchestrator service is marked as draining and before that status has had time to propagate. Marking the service as draining is also deferred until the cleanup phase, so requests may be routed to the instance for longer than intended.

Contributor Author

This is intentional: the orchestrator and the template manager are not actually related to each other.

Contributor

This needs to happen AFTER the orchestrator has been marked as draining.
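
The ordering at stake here can be sketched in a few lines. Assuming cleanups run in reverse order of registration, as the Bugbot comment states, whichever task is appended last is cleaned up first. So for the orchestrator drain to be the first shutdown action, it would have to be the last task appended. `reverseOrder` is a hypothetical helper for illustration, not part of the supervisor package.

```go
package main

import "fmt"

// reverseOrder returns the task names in the order a reverse-order cleanup
// would visit them. Hypothetical helper, not taken from the PR.
func reverseOrder(registered []string) []string {
	out := make([]string, 0, len(registered))
	for i := len(registered) - 1; i >= 0; i-- {
		out = append(out, registered[i])
	}
	return out
}

func main() {
	// Appending the orchestrator drain first makes it run last at shutdown.
	fmt.Println(reverseOrder([]string{"orchestrator drain", "template manager drain"}))

	// Appending it last makes it the first cleanup to run.
	fmt.Println(reverseOrder([]string{"template manager drain", "orchestrator drain"}))
}
```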

@djeebus djeebus requested a review from dobrac October 21, 2025 21:57
	cancelCloseCtx()
}
tasks = append(tasks, supervisor.Task{
	Name: "orchestrator drain",
Contributor

This needs to be the first action when shutting down either the orchestrator or the template manager.

err = supervisor.Run(ctx, supervisor.Options{
	ForceStop: config.ForceStop,
	Tasks:     tasks,
	Logger:    globalLogger,
Contributor

This is a bit strange: the supervisor closes the logger, but it also uses that same logger.

ctx, cancel := context.WithCancel(ctx)
defer cancel()

startTask := func(ctx context.Context, task Task) {
Contributor

Can this function be extracted out of the enclosing function body?


type Options struct {
	ForceStop bool
	Tasks     []Task
Contributor

Shouldn't this be a []*Task to avoid copying each Task and its functions? (Maybe it's fine as is; I'm not sure.)
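
On the copy question, a small sketch may help. In Go, copying a struct copies its func fields as references to the same closure, so calls made through a copy still reach the same captured state. Only writes to the copy's own fields are lost. The `Task` below is a hypothetical, simplified type (its `Cleanup` takes no context), not the one in the PR.

```go
package main

import "fmt"

// Task is a simplified, hypothetical stand-in with a context-free Cleanup.
type Task struct {
	Name    string
	Cleanup func() error
}

// copyDemo calls Cleanup through a range copy and through the slice element,
// and reports how many calls the shared closure saw plus the element's name.
func copyDemo() (calls int, name string) {
	tasks := []Task{{
		Name:    "a",
		Cleanup: func() error { calls++; return nil },
	}}
	for _, t := range tasks { // t is a copy of the struct
		_ = t.Cleanup()    // still invokes the same closure
		t.Name = "renamed" // changes only the copy
	}
	_ = tasks[0].Cleanup()
	return calls, tasks[0].Name
}

func main() {
	calls, name := copyDemo()
	fmt.Println(calls, name)
}
```

Under these assumptions a []Task behaves the same as a []*Task unless the supervisor needs to mutate the caller's tasks in place.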

taskLogger := baseLogger.Named(task.Name)
if task.Cleanup != nil {
	taskLogger.Info("running cleanup")
	if err := task.Cleanup(ctx); err != nil {
Contributor

The context passed to cleanup should be separate from the context used by the background tasks.
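
One possible way to do this, sketched under assumptions: derive the cleanup context with `context.WithoutCancel` (Go 1.21+) so it keeps the parent's values but survives the parent's cancellation, and give it its own timeout. `newCleanupContext` and the five-second timeout are hypothetical choices for illustration, not what the PR does.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// newCleanupContext returns a context that is not cancelled when parent is,
// but that expires on its own after timeout. Hypothetical helper.
func newCleanupContext(parent context.Context, timeout time.Duration) (context.Context, context.CancelFunc) {
	return context.WithTimeout(context.WithoutCancel(parent), timeout)
}

// cleanupSeesLiveContext cancels the task context, then checks that the
// derived cleanup context is still usable.
func cleanupSeesLiveContext() bool {
	taskCtx, cancelTasks := context.WithCancel(context.Background())
	cancelTasks() // background tasks have been told to stop

	cleanupCtx, cancel := newCleanupContext(taskCtx, 5*time.Second)
	defer cancel()

	return taskCtx.Err() != nil && cleanupCtx.Err() == nil
}

func main() {
	fmt.Println(cleanupSeesLiveContext())
}
```

With the task context passed straight to `task.Cleanup`, any cleanup that respects cancellation could return immediately with `context.Canceled` instead of finishing its work.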

3 participants