-
Notifications
You must be signed in to change notification settings - Fork 5.7k
feat(inputs.nats_consumer): Ack messages when accepted on the output #17792
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
feat(inputs.nats_consumer): Ack messages when accepted on the output #17792
Conversation
# Conflicts: # plugins/inputs/nats_consumer/nats_consumer.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Hipska for the PR, however, there are major issues!
- The
n.undeliveredlist of in-flight metrics is never written to! - Usually I would assume a separate
onDeliveryfunction thatACKs the message and removes the message from thesemchannel instead of mixing this into thereceiverfunction. TheonDeliveryfunction should then just run in its own go-routine. - The
n.undeliveredlist of in-flight metrics requires locking... - Unit-tests are missing!
| JsStream string `toml:"jetstream_stream"` | ||
| PendingMessageLimit int `toml:"pending_message_limit"` | ||
| PendingBytesLimit int `toml:"pending_bytes_limit"` | ||
| PendingBytesLimit int `toml:"pending_bytes_limit" deprecated:"1.37.0;1.40.0;unused"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm actually this setting is used in line 160, isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, result of wrong merge. Will fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm still not fixed I think so unresolving my comment...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do you still see it used in current code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well you are removing this feature from the code and then deprecate the function without giving a reason on why you remove this! Furthermore, even if there is a good reason for removal it does not belong to this PR!
|
I based the telegraf/plugins/inputs/amqp_consumer/amqp_consumer.go Lines 412 to 415 in 851bea2
telegraf/plugins/inputs/amqp_consumer/amqp_consumer.go Lines 471 to 494 in 851bea2
I added locking as requested, but I see the above mentioned code also doesn't have locking.. |
1d31e7e to
935a5eb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Hipska! The code reads much better with your update. Please drop removing the pending bytes limit feature as it does NOT belong to this PR. Furthermore, there are some corner-cases to consider, see my comments in the code...
| JsStream string `toml:"jetstream_stream"` | ||
| PendingMessageLimit int `toml:"pending_message_limit"` | ||
| PendingBytesLimit int `toml:"pending_bytes_limit"` | ||
| PendingBytesLimit int `toml:"pending_bytes_limit" deprecated:"1.37.0;1.40.0;unused"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm still not fixed I think so unresolving my comment...
| func (n *NatsConsumer) Start(acc telegraf.Accumulator) error { | ||
| n.sem = make(semaphore, n.MaxUndeliveredMessages) | ||
| n.acc = acc.WithTracking(n.MaxUndeliveredMessages) | ||
| n.undelivered = make(map[telegraf.TrackingID]*nats.Msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about
| n.undelivered = make(map[telegraf.TrackingID]*nats.Msg) | |
| n.undelivered = make(map[telegraf.TrackingID]*nats.Msg, PendingMessageLimit) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not very sure, I would use MaxUndeliveredMessages instead if needed.
And I'm starting to think they are actually meaning the same..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well I wonder if those two parameters don't mean the same thing. But yeah if the mean different things I agree to use MaxUndeliveredMessages as this is what is meant here.
| func (n *NatsConsumer) waitForDelivery(parentCtx context.Context) { | ||
| for { | ||
| select { | ||
| case <-parentCtx.Done(): | ||
| return | ||
| case track := <-n.acc.Delivered(): | ||
| <-n.sem | ||
| msg := n.removeDelivered(track.ID()) | ||
|
|
||
| if msg != nil { | ||
| if track.Delivered() { | ||
| err := msg.Ack() | ||
| if err != nil { | ||
| n.Log.Errorf("Failed to Ack message on subject %s: %v", msg.Subject, err) | ||
| } | ||
| } else { | ||
| err := msg.Nak() | ||
| if err != nil { | ||
| n.Log.Errorf("Failed to Nak message on subject %s: %v", msg.Subject, err) | ||
| } | ||
| } | ||
| } | ||
| } | ||
| } | ||
| } | ||
|
|
||
| func (n *NatsConsumer) removeDelivered(id telegraf.TrackingID) *nats.Msg { | ||
| n.Lock() | ||
| defer n.Unlock() | ||
|
|
||
| msg, ok := n.undelivered[id] | ||
| if !ok { | ||
| return nil | ||
| } | ||
| delete(n.undelivered, id) | ||
| return msg | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about
| func (n *NatsConsumer) waitForDelivery(parentCtx context.Context) { | |
| for { | |
| select { | |
| case <-parentCtx.Done(): | |
| return | |
| case track := <-n.acc.Delivered(): | |
| <-n.sem | |
| msg := n.removeDelivered(track.ID()) | |
| if msg != nil { | |
| if track.Delivered() { | |
| err := msg.Ack() | |
| if err != nil { | |
| n.Log.Errorf("Failed to Ack message on subject %s: %v", msg.Subject, err) | |
| } | |
| } else { | |
| err := msg.Nak() | |
| if err != nil { | |
| n.Log.Errorf("Failed to Nak message on subject %s: %v", msg.Subject, err) | |
| } | |
| } | |
| } | |
| } | |
| } | |
| } | |
| func (n *NatsConsumer) removeDelivered(id telegraf.TrackingID) *nats.Msg { | |
| n.Lock() | |
| defer n.Unlock() | |
| msg, ok := n.undelivered[id] | |
| if !ok { | |
| return nil | |
| } | |
| delete(n.undelivered, id) | |
| return msg | |
| } | |
| func (n *NatsConsumer) handleDelivery(ctx context.Context) { | |
| for { | |
| select { | |
| case <-ctx.Done(): | |
| // Plugin is stopping | |
| return | |
| case track := <-n.acc.Delivered(): | |
| // Get the delivered message and remove it from the internal tracking | |
| // mechanism | |
| n.Lock() | |
| msg, found := n.undelivered[id] | |
| delete(n.undelivered, id) | |
| defer n.Unlock() | |
| // Make space for a new message to be received despite any error case | |
| <-n.sem | |
| if !found { | |
| n.Log.Errorf("received delivery event for unknown message %v", id) | |
| continue | |
| } | |
| // Acknowledge the message | |
| if track.Delivered() { | |
| if err := msg.Ack(); err != nil { | |
| n.Log.Errorf("Failed to Ack message on subject %s: %v", msg.Subject, err) | |
| } | |
| } else { | |
| if err := msg.Nak(); err != nil { | |
| n.Log.Errorf("Failed to Nak message on subject %s: %v", msg.Subject, err) | |
| } | |
| } | |
| } | |
| } | |
| } |
I'm not sure we should NAK the message here if this means the message is re-queued to be honest. The background is that if only a single metric is rejected (for whatever reason) the whole NATS message is NAK'ed. If the underlying reason for rejecting the metric is permanent, this may trigger an infinite loop for the message as it will always be NAK'ed and thus may render the input dysfunctional if enough of those messages are consumed (i.e. more than the allowed max pending).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't that the whole purpose of tracking metrics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it is. The purpose of tracking metrics is to allow an input to know if a message was delivered or is "in flight". If you receive a delivery event it means that the metric was processed by the endpoint (it might be a processor as well, just for clarity). The Delivered function tells you if all messages were delivered successfully.
Now there are different ways to handle "unsuccessful" messages. In PR #15796 we implemented NAK'ing messages for AMQP if delivery fails ensuring that the messages are not re-enqueued in the topic.
Coming back to this PR: If delivery fails, i.e. the metric could not be handled by the output plugin (e.g. serialization fails, format is invalid etc or the endpoint got the metric but something went wrong, and you NAK the message, you must make sure that this message is not sent again, otherwise you end up in an infinite loop of getting the message -> delivery fails -> NAK message - back to square one. Now if you not only have one of those metrics but they do appear sporadically, you will end up with a dysfunctional input only being busy with NAK'ed messages...
So please either make sure the NAK'ed messages are not re-enqueued in the topic or do not NAK those messages!
|
Download PR build artifacts for linux_amd64.tar.gz, darwin_arm64.tar.gz, and windows_amd64.zip. 📦 Click here to get additional PR build artifactsArtifact URLs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Hipska! There are still some issues here:
- Why do you remove the
PendingBytesLimitoption? You did not provide any reason for doing so and this does not belong to this PR. Please keep the option! - You are leaking messages which are not acknowledged if they are non-Jetstream.
NAK'ing messages might re-queue the failed message ending up in an infinite loop blocking the plugin.
Please address those issues!
| func (n *NatsConsumer) Start(acc telegraf.Accumulator) error { | ||
| n.sem = make(semaphore, n.MaxUndeliveredMessages) | ||
| n.acc = acc.WithTracking(n.MaxUndeliveredMessages) | ||
| n.undelivered = make(map[telegraf.TrackingID]*nats.Msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| n.undelivered = make(map[telegraf.TrackingID]*nats.Msg) | |
| n.undelivered = make(map[telegraf.TrackingID]*nats.Msg, n.MaxUndeliveredMessages) |
| // set the subscription pending limits | ||
| err = sub.SetPendingLimits(n.PendingMessageLimit, n.PendingBytesLimit) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you remove this feature?
| for _, s := range n.jsSubs { | ||
| if msg.Sub == s { | ||
| n.handleJetstreamMessage(msg) | ||
| break L | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need this? It wasn't required before... This reads like yet another thing folded into this PR!
| err := msg.Nak() | ||
| if err != nil { | ||
| n.Log.Errorf("Failed to Nak JetStream message on subject %s: %v", msg.Subject, err) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It won't make my concerns go away if you do not address them. :-) How can we make sure that messages with permanent errors are not re-enqueued/re-send by NATS?
| for _, m := range metrics { | ||
| m.AddTag("subject", msg.Subject) | ||
| } | ||
| return n.acc.AddTrackingMetricGroup(metrics), nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are not adding the tracking ID to the n.undelivered lookup so you will never be able to ACK this message!
Summary
NATS Jetstream has support for message acknowledging, so make use of that.
Checklist
Related issues
resolves #17791