Description
Describe the bug
We have been seeing unexplained intermittent timeouts on various REST services, and eventually managed to reproduce them under load and isolate the cause.
I've added a reproducer which shows that, under high throughput, a race condition can be triggered: a call to getEntityStream in a JAX-RS container response filter, for a response with no body (a 204), will eventually cause all current TCP connections to the server to become unresponsive. Clients need to disconnect and reconnect in order to send further requests. All in-progress or new requests on the already-open TCP connections get no HTTP response from the server; clients only receive a TCP ACK for the request.
Expected behavior
The getEntityStream call should always return an empty stream and should never cause the server to stop responding.
Actual behavior
In my reproducer, the server stops processing requests after a minute or two.
How to Reproduce?
git clone https://github.com/bcluap/quarkus-examples.git
cd quarkus-examples/resteasy-reactive
mvn clean install
java -jar ./target/quarkus-app/quarkus-run.jar
Then run a load test like this:
wrk --timeout=10s -d600 -t1 -c1 'http://localhost:8000/test'
The server will log "HERE" over and over and eventually stop. The load test client experiences timeouts for all future requests. Only fresh TCP connections get any response from the server.
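As an alternative to wrk, the symptom described above (requests on an existing keep-alive connection eventually timing out) can also be observed with a small JDK-only client. This is a minimal sketch, not part of the reproducer: the class name HangProbe and the method countUntilTimeout are hypothetical, and the URL and 10 second timeout are simply taken from the wrk command above. It sends sequential GETs through one HttpClient, which reuses the connection, and reports how many completed before the first timeout.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

public class HangProbe {

    /**
     * Sends up to maxRequests sequential GETs to uri and returns the number
     * of requests that completed before the first request timed out.
     * Returns maxRequests if no request timed out.
     * (Hypothetical helper for illustration; not taken from the reproducer.)
     */
    static int countUntilTimeout(HttpClient client, URI uri, int maxRequests, Duration timeout)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(uri)
                .timeout(timeout) // per-request timeout, like wrk's --timeout
                .GET()
                .build();
        for (int i = 0; i < maxRequests; i++) {
            try {
                // The body is discarded; only whether a response arrives matters here.
                client.send(request, HttpResponse.BodyHandlers.discarding());
            } catch (HttpTimeoutException e) {
                return i; // no response arrived within the timeout
            }
        }
        return maxRequests;
    }

    public static void main(String[] args) throws Exception {
        // HTTP/1.1 so that requests are sent over a reused keep-alive connection.
        HttpClient client = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_1_1)
                .build();
        URI uri = URI.create("http://localhost:8000/test"); // endpoint from the wrk command
        int completed = countUntilTimeout(client, uri, 1_000_000, Duration.ofSeconds(10));
        System.out.println("Requests completed before first timeout (or limit): " + completed);
    }
}
```

A single sequential client generates far less load than wrk, so it may take longer to hit the lock-up, or may not trigger it at all; it is mainly useful for confirming that an already-open connection has stopped receiving responses.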
Output of uname -a or ver
Linux paul-xps 5.19.0-45-generic #46~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Jun 7 15:06:04 UTC 20 x86_64 x86_64 x86_64 GNU/Linux
Output of java -version
openjdk version "20.0.1" 2023-04-18 OpenJDK Runtime Environment Temurin-20.0.1+9 (build 20.0.1+9) OpenJDK 64-Bit Server VM Temurin-20.0.1+9 (build 20.0.1+9, mixed mode, sharing)
GraalVM version (if different from Java)
NA
Quarkus version or git rev
3.1.3.Final
Build tool (ie. output of mvnw --version or gradlew --version)
mvn 3.9.3
Additional information
Reproducible on my laptop and on AWS ECS. The lock-up normally occurs within a minute of the load test kicking off. Commenting out the responseContext.getEntityStream(); call in the filter prevents the issue.
Note that this only happens when the response has no body. A thread dump taken during the lock-up shows that the server is not doing anything and is not blocked on any lock. It's as though the event loop has lost all knowledge of the TCP connections.
If the JAX-RS method returns void or returns a null String, the same behaviour is seen. It does not happen if data is returned in the body.