-
Notifications
You must be signed in to change notification settings - Fork 237
Update state when batch dropped #1789
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
dc57a33 to
e790e33
Compare
e790e33 to
248c80e
Compare
248c80e to
73da7a4
Compare
| // without executing other success-related callbacks. This is used when a batch | ||
| // fails after exhausting all retry attempts to prevent reprocessing the same | ||
| // batch after restart. | ||
| func (b *logEventBatch) updateState() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we iterating backwards?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just following the same pattern for the done() function
amazon-cloudwatch-agent/plugins/outputs/cloudwatchlogs/internal/pusher/batch.go
Lines 179 to 186 in 73da7a4
| func (b *logEventBatch) done() { | |
| b.updateState() | |
| for i := len(b.doneCallbacks) - 1; i >= 0; i-- { | |
| done := b.doneCallbacks[i] | |
| done() | |
| } | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where are the integ tests? or did you mean you were relying on the existing integ tests
| // fails after exhausting all retry attempts to prevent reprocessing the same | ||
| // batch after restart. | ||
| func (b *logEventBatch) updateState() { | ||
| for i := len(b.stateCallbacks) - 1; i >= 0; i-- { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit – why not use range?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Existing integ tests and replied to the same question above
|
This PR was marked stale due to lack of activity. |
Description of the issue
The Amazon CloudWatch Agent currently only updates the state file when it successfully pushes a batch of logs to the CloudWatch Logs service. When a "poison pill" batch fails after exhausting all retry attempts, the state file is not updated. This causes the agent to reprocess the same problematic batch after restart, potentially creating an infinite loop of failed attempts.
Description of changes
This PR implements poison pill handling for the CloudWatch Logs output plugin to improve agent resilience:
Key Changes:
logEventBatchto separate state-updating callbacks from success callbacksupdateState()method that only executes state file updates without other success metricssender.Send()to callupdateState()when retry attempts are exhaustedHow it works:
This approach leverages the existing retry mechanism without modification and ensures the agent moves past problematic log entries after restart.
License
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Tests
Requirements
Before commiting your code, please do the following steps.
make fmtandmake fmt-shmake lintIntegration Tests
To run integration tests against this PR, add the
ready for testinglabel.