feat: add failed_permanent metric for worker monitoring (#2107)
* feat: add last failure timestamp metric for worker monitoring
Add a Prometheus Gauge metric to track the timestamp of the last failure
for each worker. This complements the existing failed job counter by
providing visibility into when failures last occurred for monitoring and
alerting purposes.
Changes:
- Added workerLastFailureGauge metric in metrics.ts
- Updated all 9 workers to set the gauge on failure:
  - crawler, feed, webhook, assetPreProcessing
  - inference, adminMaintenance, ruleEngine
  - video, search
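
A minimal sketch of the gauge approach described in this first commit, for illustration only. The `LabeledGauge` class is a hypothetical in-memory stand-in for a labeled Prometheus gauge, and `recordLastFailure` and the per-worker label are assumptions; only the name `workerLastFailureGauge` comes from the commit message.

```typescript
// Hypothetical stand-in for a labeled Prometheus gauge. The real code
// presumably uses a metrics library; this sketch avoids that dependency.
class LabeledGauge {
  private readonly values = new Map<string, number>();

  set(workerName: string, value: number): void {
    this.values.set(workerName, value);
  }

  get(workerName: string): number | undefined {
    return this.values.get(workerName);
  }
}

// Name taken from the commit message; labeling by worker name is an assumption.
const workerLastFailureGauge = new LabeledGauge();

// Hypothetical helper that a worker's failure handler could call.
function recordLastFailure(workerName: string, nowMs: number = Date.now()): void {
  // Prometheus convention is to expose timestamps as Unix seconds.
  workerLastFailureGauge.set(workerName, nowMs / 1000);
}

// Example: the crawler worker fails at a fixed point in time.
recordLastFailure("crawler", 1700000000000);
```

An alert built on such a gauge would compare the stored timestamp against the current time to tell how recently a worker last failed.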
* refactor: track both all failures and permanent failures with counter
Remove the gauge metric and use the existing counter to track both:
- All failures (including retry attempts): status="failed"
- Permanent failures (retries exhausted): status="failed_permanent"
This provides better visibility into retry behavior and permanent vs
temporary failures without adding a separate metric.
Changes:
- Removed workerLastFailureGauge from metrics.ts
- Updated all 9 workers to track failed_permanent when numRetriesLeft == 0
- Maintained existing failed counter for all failure attempts
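
The counter-based tracking above could look roughly like the following sketch. `LabeledCounter`, `workerStatsCounter`, and `recordFailure` are hypothetical names, and the in-memory class is a stand-in for the project's actual Prometheus counter; only the two `status` values and the `numRetriesLeft == 0` condition come from the commit message.

```typescript
// The two status label values named in the commit message.
type FailureStatus = "failed" | "failed_permanent";

// Hypothetical in-memory stand-in for a Prometheus counter labeled by
// worker name and status.
class LabeledCounter {
  private readonly counts = new Map<string, number>();

  inc(workerName: string, status: FailureStatus): void {
    const key = `${workerName}:${status}`;
    this.counts.set(key, (this.counts.get(key) ?? 0) + 1);
  }

  get(workerName: string, status: FailureStatus): number {
    return this.counts.get(`${workerName}:${status}`) ?? 0;
  }
}

// Hypothetical name; the commit only says "the existing counter".
const workerStatsCounter = new LabeledCounter();

// Hypothetical helper for a worker's failure handler.
function recordFailure(workerName: string, numRetriesLeft: number): void {
  // Every failed attempt is counted, including ones that will be retried.
  workerStatsCounter.inc(workerName, "failed");
  // Once retries are exhausted, the failure is also counted as permanent.
  if (numRetriesLeft === 0) {
    workerStatsCounter.inc(workerName, "failed_permanent");
  }
}

// Example: one job fails on every attempt, with two retries allowed.
for (const retriesLeft of [2, 1, 0]) {
  recordFailure("feed", retriesLeft);
}
```

In this example the `feed` worker ends with three `failed` increments and one `failed_permanent` increment, so the gap between the two series reflects attempts that were still eligible for retry.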
* style: format worker files with prettier
---------
Co-authored-by: Claude <[email protected]>