@duhminick duhminick commented Jul 22, 2025

Description of the issue

The Amazon CloudWatch Agent currently only updates the state file when it successfully pushes a batch of logs to the CloudWatch Logs service. When a "poison pill" batch fails after exhausting all retry attempts, the state file is not updated. This causes the agent to reprocess the same problematic batch after restart, potentially creating an infinite loop of failed attempts.

Description of changes

This PR implements poison pill handling for the CloudWatch Logs output plugin to improve agent resilience:

Key Changes:

  • Modified logEventBatch to separate state-updating callbacks from success callbacks
  • Added an updateState() method that runs only the state file updates, without the other success-related callbacks
  • Updated sender.Send() to call updateState() when retry attempts are exhausted

How it works:

  1. When a batch fails after exhausting all retry attempts, it's identified as a poison pill
  2. The agent updates the state file to mark the batch's range as processed (preventing reprocessing)

This approach leverages the existing retry mechanism without modification and ensures the agent moves past problematic log entries after restart.
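
The flow described above can be sketched roughly as follows. This is a simplified, standalone illustration rather than the plugin's actual code: the `send` function, its `maxRetries` and `put` parameters, and `runDemo` are hypothetical stand-ins, and only `logEventBatch`, `updateState()`, `done()`, `stateCallbacks`, and `doneCallbacks` are names taken from this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// logEventBatch is a simplified stand-in for the plugin's batch type.
type logEventBatch struct {
	stateCallbacks []func() // persist the processed range to the state file
	doneCallbacks  []func() // success-only bookkeeping
}

// updateState runs only the state callbacks, last registered first.
func (b *logEventBatch) updateState() {
	for i := len(b.stateCallbacks) - 1; i >= 0; i-- {
		b.stateCallbacks[i]()
	}
}

// done runs the state callbacks and then the success callbacks.
func (b *logEventBatch) done() {
	b.updateState()
	for i := len(b.doneCallbacks) - 1; i >= 0; i-- {
		b.doneCallbacks[i]()
	}
}

// send is a hypothetical retry loop; put stands in for the service call.
func send(b *logEventBatch, maxRetries int, put func() error) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = put(); err == nil {
			b.done()
			return nil
		}
	}
	// Retries exhausted: treat the batch as a poison pill and advance
	// the state only, so it is not reprocessed after a restart.
	b.updateState()
	return fmt.Errorf("dropping batch after %d retries: %w", maxRetries, err)
}

// runDemo sends one always-failing batch and reports which callbacks ran.
func runDemo() (stateUpdated, successRecorded bool, err error) {
	b := &logEventBatch{}
	b.stateCallbacks = append(b.stateCallbacks, func() { stateUpdated = true })
	b.doneCallbacks = append(b.doneCallbacks, func() { successRecorded = true })
	err = send(b, 2, func() error { return errors.New("rejected") })
	return
}

func main() {
	stateUpdated, successRecorded, err := runDemo()
	fmt.Println(stateUpdated, successRecorded, err)
}
```

In this sketch the failing batch still triggers the state callback, while the success callback never runs, which is the separation the key changes describe.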

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

  1. Unit tests
  2. Integration tests

Requirements

Before committing your code, please do the following steps.

  1. Run make fmt and make fmt-sh
  2. Run make lint

Integration Tests

To run integration tests against this PR, add the ready for testing label.

@duhminick duhminick force-pushed the dominic-poison-pill branch from dc57a33 to e790e33 Compare July 22, 2025 22:17
@duhminick duhminick added the ready for testing Indicates this PR is ready for integration tests to run label Jul 22, 2025
@duhminick duhminick force-pushed the dominic-poison-pill branch from e790e33 to 248c80e Compare July 23, 2025 02:38
@duhminick duhminick force-pushed the dominic-poison-pill branch from 248c80e to 73da7a4 Compare July 23, 2025 02:40
@duhminick duhminick marked this pull request as ready for review July 23, 2025 02:40
@duhminick duhminick requested a review from a team as a code owner July 23, 2025 02:40
// without executing other success-related callbacks. This is used when a batch
// fails after exhausting all retry attempts to prevent reprocessing the same
// batch after restart.
func (b *logEventBatch) updateState() {
Contributor

why are we iterating backwards?

Contributor Author

Just following the same pattern for the done() function

func (b *logEventBatch) done() {
	b.updateState()
	for i := len(b.doneCallbacks) - 1; i >= 0; i-- {
		done := b.doneCallbacks[i]
		done()
	}
}
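
For readers following the ordering question in this thread, here is a small standalone sketch of the difference between the index loop and a `range` loop. The helper names (`runReverse`, `runForward`, `record`) are hypothetical and not part of the agent's code; whether callback order actually matters in the plugin is not stated in this conversation.

```go
package main

import "fmt"

// runReverse invokes callbacks last-registered-first, like the index loop in done().
func runReverse(callbacks []func()) {
	for i := len(callbacks) - 1; i >= 0; i-- {
		callbacks[i]()
	}
}

// runForward invokes callbacks in registration order using range.
func runForward(callbacks []func()) {
	for _, cb := range callbacks {
		cb()
	}
}

// record registers three callbacks that append their index to a shared
// slice, runs them with the given strategy, and returns the call order.
func record(run func([]func())) []int {
	var order []int
	var callbacks []func()
	for i := 0; i < 3; i++ {
		i := i // per-iteration copy (required before Go 1.22)
		callbacks = append(callbacks, func() { order = append(order, i) })
	}
	run(callbacks)
	return order
}

func main() {
	fmt.Println(record(runReverse)) // [2 1 0]
	fmt.Println(record(runForward)) // [0 1 2]
}
```

A `range` loop would therefore change the invocation order relative to the existing `done()` pattern, so the two forms are only interchangeable if the callbacks are order-independent.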

Contributor

@the-mann the-mann left a comment

where are the integ tests? or did you mean you were relying on the existing integ tests

// fails after exhausting all retry attempts to prevent reprocessing the same
// batch after restart.
func (b *logEventBatch) updateState() {
for i := len(b.stateCallbacks) - 1; i >= 0; i-- {
Contributor

nit – why not use range?

Contributor Author

Existing integ tests and replied to the same question above

github-actions bot commented Aug 1, 2025

This PR was marked stale due to lack of activity.

@github-actions github-actions bot added the Stale label Aug 1, 2025
@github-actions github-actions bot removed the Stale label Aug 13, 2025
@duhminick duhminick merged commit 2da9c43 into main Aug 13, 2025
182 of 184 checks passed
@duhminick duhminick deleted the dominic-poison-pill branch August 13, 2025 07:02