Resteasy Reactive: Server becomes unresponsive due to race condition on ContainerResponseContext.getEntityStream() #34632

@bcluap

Describe the bug

We have seen unexplained, intermittent timeouts on various REST services. We eventually managed to reproduce the problem under load and isolate the cause.

I've added a reproducer which shows that, under high throughput, a race condition can be triggered by calling getEntityStream in a JAX-RS container response filter for a response with no body (a 204). Eventually all current TCP connections to the server become unresponsive, and clients have to disconnect and reconnect before they can send requests again. In-progress and new requests on the already open TCP connections get no HTTP response from the server; clients only receive a TCP ACK for the request.

Expected behavior

The getEntityStream call should always return an empty stream for a response with no body, and it should never cause the server to stop responding.

Actual behavior

In my reproducer, the server stops processing requests after a minute or two.

How to Reproduce?

git clone https://github.com/bcluap/quarkus-examples.git
cd quarkus-examples/resteasy-reactive
mvn clean install
java -jar ./target/quarkus-app/quarkus-run.jar

Then run a load test like this:
wrk --timeout=10s -d600 -t1 -c1 'http://localhost:8000/test'

The server logs "HERE" over and over and eventually stops. From that point on, every request from the load test client times out. Only fresh TCP connections get any response from the server.
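For anyone without wrk at hand, the stall can be probed with a plain Java client. This is a minimal sketch, not part of the reproducer: the class name StallProbe and the method requestsUntilStall are hypothetical, and it assumes the JDK 11+ java.net.http.HttpClient, which normally reuses a keep-alive connection for sequential requests to the same host. It sends GET requests one after another and reports how many completed before the first request hit its timeout.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

// Hypothetical helper, not part of the reproducer repository.
public class StallProbe {

    /**
     * Sends up to maxRequests sequential GET requests to uri and returns the
     * number that completed before the first per-request timeout.
     * Returns maxRequests if no request timed out.
     */
    static int requestsUntilStall(HttpClient client, URI uri, int maxRequests, Duration timeout)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(uri)
                .timeout(timeout)
                .GET()
                .build();
        for (int i = 0; i < maxRequests; i++) {
            try {
                // The response body is irrelevant here (a 204 has none), so discard it.
                client.send(request, HttpResponse.BodyHandlers.discarding());
            } catch (HttpTimeoutException e) {
                // No response arrived within the timeout: treat this as the stall.
                return i;
            }
        }
        return maxRequests;
    }

    public static void main(String[] args) throws Exception {
        // The default URL is the one used in the wrk command above.
        URI uri = URI.create(args.length > 0 ? args[0] : "http://localhost:8000/test");
        HttpClient client = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_1_1)
                .build();
        int completed = requestsUntilStall(client, uri, 1_000_000, Duration.ofSeconds(10));
        System.out.println("Requests completed before first timeout: " + completed);
    }
}
```

A single sequential client generates far less load than wrk, so it may take longer to hit the race, or it may not hit it at all. The 10 second timeout mirrors the --timeout=10s used in the wrk command.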

Output of uname -a or ver

Linux paul-xps 5.19.0-45-generic #46~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Jun 7 15:06:04 UTC 20 x86_64 x86_64 x86_64 GNU/Linux

Output of java -version

openjdk version "20.0.1" 2023-04-18
OpenJDK Runtime Environment Temurin-20.0.1+9 (build 20.0.1+9)
OpenJDK 64-Bit Server VM Temurin-20.0.1+9 (build 20.0.1+9, mixed mode, sharing)

GraalVM version (if different from Java)

NA

Quarkus version or git rev

3.1.3.Final

Build tool (ie. output of mvnw --version or gradlew --version)

mvn 3.9.3

Additional information

I can reproduce this on my laptop and on AWS ECS. The lock-up normally occurs within a minute of the load test starting. Commenting out the responseContext.getEntityStream(); call in the filter prevents the issue.

Note that this only happens when the response has no body. A thread dump taken during the lock-up shows that the server is not doing anything and is not blocked on any lock. It's as though the event loop has lost all knowledge of the TCP connections.

If the JAX-RS method returns void or returns a null String, the same behaviour is seen. It does not happen if data is returned in the body.
