@lampajr
Member

@lampajr lampajr commented Jun 25, 2025

I am afraid that, because we send each dataset event in a separate TX callback, the events are sent completely asynchronously. When there is a huge number of them, we risk hitting a race condition that could cause a wrong credit calculation in the AMQP client:

2025-06-25 10:59:45,580 645a1466d4ef /usr/lib/jvm/java-17-openjdk-17.0.15.0.6-2.el9.x86_64/bin/java[7] ERROR [io.sma.rea.mes.amqp] (vert.x-eventloop-thread-1) SRMSG16225: Failure reported for channel `dataset-event-out`, closing client: java.lang.IllegalArgumentException: Invalid request number, must be greater than 0
	at io.smallrye.mutiny.helpers.Subscriptions.getInvalidRequestException(Subscriptions.java:24)
	at io.smallrye.mutiny.operators.multi.MultiOperatorProcessor.request(MultiOperatorProcessor.java:117)
	at io.smallrye.mutiny.operators.multi.MultiOnRequestInvoke$MultiOnRequestInvokeOperator.request(MultiOnRequestInvoke.java:34)
	at io.smallrye.reactive.messaging.amqp.AmqpCreditBasedSender.setCreditsAndRequest(AmqpCreditBasedSender.java:163)
	at io.smallrye.reactive.messaging.amqp.AmqpCreditBasedSender.lambda$onNoMoreCredit$7(AmqpCreditBasedSender.java:218)
	at io.vertx.mutiny.core.Context.lambda$runOnContext$1(Context.java:136)
	at io.vertx.core.impl.ContextInternal.dispatch(ContextInternal.java:270)
	at io.vertx.core.impl.ContextInternal.dispatch(ContextInternal.java:252)
	at io.vertx.core.impl.ContextInternal.lambda$runOnContext$0(ContextInternal.java:50)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:566)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:840)

This is just a guess as I don't know how to prove it 🫤

Changes proposed

Instead of adding one TX callback for each dataset to change, I propose adding a single TX callback that sends all those events synchronously.

This way the messages are sent in order and sequentially, so there shouldn't be any race condition anymore.
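
A minimal sketch of the structural difference, assuming a stand-in transaction object. `FakeTx`, `registerPerDataset`, and `registerSingle` are hypothetical names invented for illustration and are not Horreum or Quarkus APIs. The fake transaction runs its callbacks sequentially, so this does not reproduce the suspected race; it only shows one callback being registered instead of n.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in types; none of these names come from Horreum.
final class SingleCallbackSketch {

    /** Minimal fake transaction: collects callbacks and runs them on commit. */
    static final class FakeTx {
        private final List<Runnable> afterCommit = new ArrayList<>();

        void onAfterCommit(Runnable callback) {
            afterCommit.add(callback);
        }

        int callbackCount() {
            return afterCommit.size();
        }

        void commit() {
            // Callbacks run only once the transaction has completed.
            for (Runnable callback : afterCommit) {
                callback.run();
            }
        }
    }

    /** Previous approach: one callback per dataset. */
    static void registerPerDataset(FakeTx tx, List<Integer> datasetIds, Consumer<Integer> sender) {
        for (Integer id : datasetIds) {
            tx.onAfterCommit(() -> sender.accept(id));
        }
    }

    /** Proposed approach: a single callback that sends every event in order. */
    static void registerSingle(FakeTx tx, List<Integer> datasetIds, Consumer<Integer> sender) {
        List<Integer> snapshot = List.copyOf(datasetIds); // freeze the ids before commit
        tx.onAfterCommit(() -> snapshot.forEach(sender));
    }

    public static void main(String[] args) {
        FakeTx tx = new FakeTx();
        List<Integer> sent = new ArrayList<>();
        registerSingle(tx, List.of(1, 2, 3), sent::add);
        tx.commit();
        System.out.println(tx.callbackCount() + " callback, sent " + sent);
    }
}
```

With `registerPerDataset` the same three ids would register three callbacks, and their relative ordering would depend on how the real callback mechanism schedules them.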

Check List (Check all the applicable boxes)

  • My code follows the code style of this project.
  • My change requires changes to the documentation.
  • I have updated the documentation accordingly.
  • All new and existing tests passed.

@lampajr lampajr requested review from johnaohara, stalep and willr3 June 25, 2025 12:39
@lampajr lampajr marked this pull request as ready for review June 25, 2025 12:39
@lampajr lampajr self-assigned this Jun 25, 2025
@lampajr
Member Author

lampajr commented Jun 25, 2025

Is there anything I am missing that makes the previous implementation necessary?

@johnaohara
Member

I think it makes sense to move emitting all events into one TX callback; I don't know how the callback mechanism scales.

I am not sure it will resolve the error you are seeing, though. That appears to be related to unavailable credits, and the number of messages pushed through the AMQ client will be the same.

@lampajr
Member Author

lampajr commented Jun 25, 2025

Thanks @johnaohara

> I am not sure it will resolve the error you are seeing, though. That appears to be related to unavailable credits, and the number of messages pushed through the AMQ client will be the same.

I agree, but the error seems related to a wrong calculation (`Invalid request number, must be greater than 0`) that shouldn't happen even if I send more messages than there are credits; in that case it should just wait (if I understood correctly).

That's why I think the wrong calculation is caused by a race condition, but I cannot confirm it.
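
To make the distinction concrete: the Reactive Streams specification requires `request(n)` to be called with `n > 0`, so a computed value of 0 or less is reported as an error rather than treated as "wait". The sketch below is an assumption-laden illustration only. `available` and `requestIfPositive` are hypothetical helpers, the credits-minus-in-flight formula is invented, and none of this is taken from `AmqpCreditBasedSender`.

```java
// Hypothetical illustration; this is NOT the SmallRye implementation.
final class CreditRequestSketch {

    /** Mirrors the Reactive Streams rule that request(n) needs n > 0. */
    static long request(long n) {
        if (n <= 0) {
            throw new IllegalArgumentException("Invalid request number, must be greater than 0");
        }
        return n;
    }

    /** Assumed credit computation: credits granted minus messages already in flight. */
    static long available(long credits, long inFlight) {
        return credits - inFlight;
    }

    /** Defensive variant: only request when something is actually available. */
    static long requestIfPositive(long credits, long inFlight) {
        long n = available(credits, inFlight);
        return n > 0 ? request(n) : 0; // 0 here means "wait for more credits"
    }

    public static void main(String[] args) {
        System.out.println(requestIfPositive(10, 4)); // prints 6
        System.out.println(requestIfPositive(5, 5));  // prints 0
        try {
            // A stale or racy value of 0 fails instead of waiting.
            request(available(5, 5));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If two code paths read and update the credit count concurrently, one of them could observe a value that is already consumed and end up on the failing path, which would match the stack trace, but that remains a hypothesis.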

@johnaohara
Member

> I agree, but the error seems related to a wrong calculation (`Invalid request number, must be greater than 0`) that shouldn't happen even if I send more messages than there are credits; in that case it should just wait (if I understood correctly).
>
> That's why I think the wrong calculation is caused by a race condition, but I cannot confirm it.

This may change the timing and therefore not trigger the race condition, but if there is a race, it looks like it is in SmallRye Reactive Messaging and would still be present.

I agree that this should apply back pressure on the client to slow down emitting messages. Maybe the SmallRye Reactive Messaging team should take a look?

@stalep
Member

stalep commented Jun 25, 2025

With all events emitted in the same TX, would there be a potential issue with timeouts?

+1 on looping in someone with some messaging background here :)

@johnaohara
Member

> With all events emitted in the same TX, would there be a potential issue with timeouts?

This change does not alter the TX semantics; it registers one callback instead of n callbacks to run when the TX is committed (or rolled back). The callback runs after the TX has succeeded or failed.

@lampajr
Member Author

lampajr commented Jun 25, 2025

> This change does not alter the TX semantics; it registers one callback instead of n callbacks to run when the TX is committed (or rolled back). The callback runs after the TX has succeeded or failed.

Yeah, exactly.

> +1 on looping in someone with some messaging background here :)

Anyone you know of? Otherwise I can open an issue on Quarkus directly, wdyt?

@stalep
Member

stalep commented Jun 25, 2025

Perhaps @gemmellr or @ozangunalp?

@lampajr
Member Author

lampajr commented Jul 7, 2025

Meanwhile, @stalep @johnaohara wdyt about this fix? Is it good to go as it is?

Collaborator

@willr3 willr3 left a comment

I think the change looks good but I am not familiar with this part of Horreum nor do I believe we have adequate testing to shore up my confidence. In which case I say let's merge it and see what happens :)

Member

@johnaohara johnaohara left a comment

lgtm

@lampajr lampajr merged commit e51f1e3 into Hyperfoil:master Jul 14, 2025
4 checks passed
@lampajr lampajr deleted the change_send_dataset_label_recal branch July 14, 2025 16:47