Feature: Kubernetes Auto-remediation #5136

@abhibhaw

🔖 Feature description

Imagine an error event shows up in your monitoring stack. You know the fix and start implementing it, but by the time you are done, the incident has already impacted your business. The delay in resolving such incidents can lead to prolonged service disruptions, revenue loss, and customer dissatisfaction. It also ties up valuable engineering resources that could be better used for other tasks.

To address such issues, we need an automated remediation system that can detect and resolve known error events without human intervention.
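
One way to picture such a system is as a list of rules, each pairing a condition on an incoming event with an action. The sketch below is a hypothetical illustration in Python, not a proposed design: `Event`, `Rule`, `remediate`, and the rule names are all made up for this example, and the actions only return a description of what they would do instead of calling the Kubernetes API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Event:
    """Hypothetical shape of a monitoring event."""
    app: str
    reason: str  # e.g. "OOMKilled" or "Degraded"


@dataclass
class Rule:
    """Pairs a condition on an event with a remediation action."""
    name: str
    matches: Callable[[Event], bool]
    action: Callable[[Event], str]


def remediate(event: Event, rules: List[Rule]) -> List[str]:
    """Run the action of every rule whose condition matches the event."""
    return [rule.action(event) for rule in rules if rule.matches(event)]


# Illustrative rules; real actions would patch resources or call a webhook.
RULES = [
    Rule(
        name="notify-on-degraded",
        matches=lambda e: e.reason == "Degraded",
        action=lambda e: f"webhook: {e.app} is degraded",
    ),
    Rule(
        name="bump-memory-on-oom",
        matches=lambda e: e.reason == "OOMKilled",
        action=lambda e: f"patch: raise memory limit for {e.app}",
    ),
]

if __name__ == "__main__":
    print(remediate(Event(app="checkout", reason="OOMKilled"), RULES))
```

Events that match no rule produce an empty list, so unknown errors would still fall through to the usual alerting path.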

🎤 Pitch / Usecases

  1. Automatically increase the max replicas value by 1 when the current replica count equals max replicas
  2. Automatically send a webhook notification when the application becomes degraded
  3. Automatically increase the memory of a pod if a production pod gets OOMKilled
  4. Send an email when the application has been in a pending / degraded state for more than 120 seconds
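
The trigger conditions in use cases 1, 3, and 4 can be sketched as small pure functions. This is only an illustration under stated assumptions: the function names are hypothetical, the 1.5x memory growth factor is an assumed value that the request does not specify, and the state names are taken from the wording above.

```python
def next_max_replicas(current: int, max_replicas: int) -> int:
    """Use case 1: raise the ceiling by 1 once the workload has reached it."""
    # ">=" also covers the unexpected case of current exceeding the maximum.
    return max_replicas + 1 if current >= max_replicas else max_replicas


def bumped_memory_mib(limit_mib: int, factor: float = 1.5) -> int:
    """Use case 3: new memory limit after an OOMKilled event.

    The growth factor is an assumption; the request does not state one.
    """
    return int(limit_mib * factor)


def should_email(state: str, seconds_in_state: float,
                 threshold: float = 120.0) -> bool:
    """Use case 4: email only after more than `threshold` seconds
    spent in a pending or degraded state."""
    return state in {"Pending", "Degraded"} and seconds_in_state > threshold


if __name__ == "__main__":
    print(next_max_replicas(current=5, max_replicas=5))
    print(bumped_memory_mib(512))
    print(should_email("Pending", 121))
```

A real system would likely also want an upper bound so that repeated triggers cannot grow replicas or memory indefinitely; the request does not specify one.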

🔄️ Alternative

Configure alerts and take action manually.

👀 Have you spent some time to check if this issue has been raised before?

  • I checked and didn't find a similar issue

🏢 Have you read the Code of Conduct?

AB#9821

Metadata

Labels

enhancement: New feature or request
