Basic stuck job detection #1097
Conversation
Force-pushed from 74f2aeb to 03bc984
@bgentry Thoughts on this basic shape? I figure I'd follow up with another change that actually does something about the stuck jobs.
Force-pushed from 03bc984 to c85f4df
// In case the executor ever becomes unstuck, inform the
// producer. However, if we got all the way here there's a good
// chance this will never happen (the worker is really stuck and
// will never return).
defer e.ProducerCallbacks.Unstuck()
Doesn't this just run immediately after the above warning log? It's deferring within the inner `go func()` closure, which merely exits after both these defers are added to the stack, and there's nothing to block on the job actually becoming unstuck. Or am I missing something?
Yeah, there was another `<-ctx.Done()` that was missing here. I've added that in, and also improved the test case so that it'll fail in the event that the wait is missing.
I also put in some more logging in the test case so you can run it and verify manually (in case you want/need to) that it's working as expected, e.g.:
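To make the fixed shape concrete, here's a minimal, self-contained sketch of the pattern under discussion. It is not the PR's actual code: `watchForStuckJob`, the `done` channel, and the `unstuck` callback are illustrative stand-ins for the executor's internals.

```go
package main

import (
	"context"
	"log/slog"
	"time"
)

// watchForStuckJob is an illustrative stand-in for the executor's watcher:
// warn once the timeout elapses, then keep waiting so the unstuck callback
// fires only if the worker ever returns.
func watchForStuckJob(ctx context.Context, done <-chan struct{}, timeout time.Duration, unstuck func()) {
	timer := time.NewTimer(timeout)
	defer timer.Stop()

	select {
	case <-done: // worker finished in time; nothing to report
		return
	case <-timer.C:
		slog.Warn("Job appears to be stuck", "timeout", timeout)
	}

	// In case the executor ever becomes unstuck, inform the producer.
	defer unstuck()

	// The wait that was missing in the earlier revision: without it, the
	// deferred unstuck callback would run immediately after the warning.
	select {
	case <-done:
	case <-ctx.Done():
	}
}

func main() {
	done := make(chan struct{})
	go watchForStuckJob(context.Background(), done, 5*time.Millisecond, func() {
		slog.Info("Job became unstuck")
	})

	time.Sleep(20 * time.Millisecond) // worker runs past the stuck timeout
	close(done)                       // worker finally returns
	time.Sleep(10 * time.Millisecond) // give the watcher a moment to log
}
```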
$ go test ./internal/jobexecutor -run TestJobExecutor_Execute/StuckDetectionActivates -test.v
=== RUN TestJobExecutor_Execute
=== PAUSE TestJobExecutor_Execute
=== CONT TestJobExecutor_Execute
=== RUN TestJobExecutor_Execute/StuckDetectionActivates
=== PAUSE TestJobExecutor_Execute/StuckDetectionActivates
=== CONT TestJobExecutor_Execute/StuckDetectionActivates
riverdbtest.go:216: Dropped 1 expired postgres schema(s) in 14.537458ms
riverdbtest.go:293: TestSchemaOpts.disableReuse is set; schema not checked in for reuse
job_executor_test.go:715: Generated postgres schema "jobexecutor_2025_12_08t09_03_56_schema_01" with migrations [1 2 3 4 5 6] on line "main" in 63.787208ms [1 generated] [0 reused]
job_executor_test.go:715: TestTx using postgres schema: jobexecutor_2025_12_08t09_03_56_schema_01
job_executor_test.go:724: Job executor reported stuck
logger.go:256: time=2025-12-08T09:03:56.218-05:00 level=WARN msg="jobexecutor.JobExecutor: Job appears to be stuck" job_id=1 kind=jobexecutor_test timeout=5ms
job_executor_test.go:739: Job executor still stuck after wait (this is expected)
logger.go:256: time=2025-12-08T09:03:56.229-05:00 level=INFO msg="jobexecutor.JobExecutor: Job became unstuck" duration=17.011ms job_id=1 kind=jobexecutor_test
job_executor_test.go:728: Job executor reported unstuck (after being stuck)
--- PASS: TestJobExecutor_Execute (0.00s)
--- PASS: TestJobExecutor_Execute/StuckDetectionActivates (0.12s)
PASS
ok github.com/riverqueue/river/internal/jobexecutor 0.299s
internal/jobexecutor/job_executor.go (Outdated)
ctx, cancel := context.WithCancel(ctx)
defer cancel()
AFAICT ctx is the main job context, which could be cancelled under a variety of circumstances (aggressive client shutdown, manual cancellation attempt via UI, etc.). That would lead the `<-ctx.Done()` case below to exit even if the job is actually stuck.
Am I misunderstanding this?
Yeah, on second thought, it makes sense to have a context.WithoutCancel(...) on that thing. Added, and put in a new test case that checks it's doing the right thing.
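As a standalone illustration of what `context.WithoutCancel` (Go 1.21+) changes here, a small sketch with illustrative variable names (not the PR's actual code):

```go
package main

import (
	"context"
	"fmt"
)

func main() {
	// jobCtx stands in for the main job context, which may be cancelled by an
	// aggressive client shutdown or a manual cancellation attempt.
	jobCtx, cancelJob := context.WithCancel(context.Background())

	// The watcher derives its context from a copy that ignores the parent's
	// cancellation, so it keeps waiting even after the job context is cancelled.
	watchCtx, cancelWatch := context.WithCancel(context.WithoutCancel(jobCtx))
	defer cancelWatch()

	cancelJob()

	fmt.Println("jobCtx cancelled:", jobCtx.Err() != nil)     // true
	fmt.Println("watchCtx cancelled:", watchCtx.Err() != nil) // false
}
```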
Force-pushed from c85f4df to 9a9702c
Here, try to make some inroads on a feature we've been talking about for a while: detection of stuck jobs.

Unfortunately in Go it's quite easy to accidentally park a job by using a `select` on a channel that won't return and forgetting a separate branch for `<-ctx.Done()` so that it won't respect job timeouts either.

Here, add in some basic detection for that case. Eventually we'd like to give users some options for what to do in case jobs become stuck, but here we do only the simplest things for now: log when we detect a stuck job and count the number of stuck jobs in a producer's stats loop.

In the future we may want to have some additional intelligence like having producers move stuck jobs to a separate bucket up to a certain limit before crashing (the next best option because it's not possible to manually kill goroutines).
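For illustration, a hedged sketch of the foot-gun described above; the worker signature is simplified and not River's actual worker interface.

```go
package main

import (
	"context"
	"fmt"
)

// stuckWorker blocks on a channel receive with no <-ctx.Done() branch, so if
// nothing ever sends on results, the goroutine is parked forever and ignores
// both job timeouts and cancellation.
func stuckWorker(ctx context.Context, results <-chan string) error {
	select {
	case <-results: // blocks forever if no one ever sends
		return nil
		// Missing: case <-ctx.Done(): return ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan struct{})

	go func() {
		defer close(done)
		_ = stuckWorker(ctx, make(chan string)) // never returns
	}()

	cancel() // cancellation has no effect; the worker stays parked

	select {
	case <-done:
		fmt.Println("worker returned")
	default:
		fmt.Println("worker is still parked despite cancellation")
	}
}
```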
Force-pushed from 9a9702c to f03d348
@bgentry K, I think we should be up and running now. Mind taking another look?