Conversation

johnaohara
Member

There is a race condition accessing waitingForDrain in VertxBlockingOutput.awaitWriteable(): request.response().drainHandler() can be called from a separate thread, but access to the private boolean waitingForDrain is not guarded in request.response().drainHandler().

I have used the same locking pattern used for request.response().exceptionHandler() and request.response().endHandler().

However, I have concerns that this could result in a deadlock: if we are holding the intrinsic lock on request.connection() in VertxBlockingOutput.awaitWriteable(), a call to request.response().drainHandler() from a separate thread would block waiting for the lock.

Would it be better for waitingForDrain to be defined as volatile?
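
For reference, a minimal sketch of the shape being described (simplified and hedged: the class, field, and method names follow this description rather than the exact Quarkus source):

```java
import java.io.IOException;
import java.io.InterruptedIOException;

import io.vertx.core.http.HttpServerRequest;

// Simplified sketch of the shape described above, not the exact Quarkus source.
class BlockingOutputSketch {
    private final HttpServerRequest request;
    // Written by the worker thread under the connection monitor, but also
    // written by the drain handler on the event loop without any guard.
    private boolean waitingForDrain;

    BlockingOutputSketch(HttpServerRequest request) {
        this.request = request;
    }

    void awaitWriteable() throws IOException {
        assert Thread.holdsLock(request.connection());
        while (request.response().writeQueueFull()) {
            waitingForDrain = true;                      // guarded: caller holds the connection monitor
            request.response().drainHandler(v -> {
                waitingForDrain = false;                 // unguarded: runs on the event loop
                synchronized (request.connection()) {
                    request.connection().notifyAll();    // wake the worker blocked in wait()
                }
            });
            try {
                request.connection().wait();             // releases the monitor while waiting
            } catch (InterruptedException e) {
                throw new InterruptedIOException(e.getMessage());
            } finally {
                waitingForDrain = false;
            }
        }
    }
}
```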

@johnaohara
Member Author

I think this race is causing the worker pool threads to lock in #5443

Collaborator

@stuartwdouglas stuartwdouglas left a comment


This is already called under lock.

@johnaohara
Member Author

johnaohara commented Nov 14, 2019

That is what I thought when I first looked, but I am assuming that the thread that calls request.response().drainHandler() will be a different thread from the one that called VertxBlockingOutput.write(). We register the DrainHandler in awaitWriteable(), but don't invoke it in the same thread. The thread that invokes the drain handler will not be holding the lock.

@stuartwdouglas
Collaborator

Both threads will be holding the same lock though; there is an assert Thread.holdsLock(request.connection()); in awaitWritable().

@stuartwdouglas
Collaborator

Actually I think isWriteable might be able to change in the background. I probably need to re-check it after the drain handler is registered.

@johnaohara
Member Author

johnaohara commented Nov 14, 2019

The locking semantics here are those of synchronized: both threads can't hold the same lock and make progress at the same time; one will block at the lock boundary waiting for the other to proceed.

@johnaohara
Member Author

Although there is an assert to verify that the lock is held by the thread instantiating the new DrainHandler, there is no assert or locking within the drain handler's handle() method, which has to be called from a separate thread.
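
For illustration, applying the locking pattern mentioned earlier (the one already used for exceptionHandler()/endHandler()) to the drain handler would look roughly like this; a sketch, not the exact PR diff:

```java
// Sketch: guard the handler body with the same monitor the worker waits on,
// mirroring the exceptionHandler()/endHandler() pattern; not the exact PR diff.
request.response().drainHandler(event -> {
    synchronized (request.connection()) {
        if (waitingForDrain) {
            request.connection().notifyAll();   // legal here because the monitor is held
        }
        waitingForDrain = false;                // now written under the same lock the worker holds
    }
});
```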

@johnaohara
Member Author

Thinking about this, it isn't an atomicity issue, it is a visibility issue, and as I mentioned we are at risk of a deadlock with synchronized locking. I think the correct fix here is to change the field to private volatile boolean waitingForDrain and remove the synchronization in the event handlers. If a thread that is writing has locked request.connection(), then none of the event handlers could proceed.

@johnaohara
Member Author

johnaohara commented Nov 14, 2019

Also, synchronization does not make provisions for timeouts, leaving applications deadlocked if there is an error. I think a locking implementation with timeouts and retries makes sense here: if for some reason the object monitor does not receive the notification, we could time out and re-check the write queue, as we have access to it through the request API via request.response().writeQueueFull().
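
A sketch of that timeout-and-recheck idea (illustrative only, with an arbitrary 1-second timeout; not what the PR ended up doing):

```java
// Illustrative only: wake up periodically and re-test the write queue instead of
// relying solely on the drain notification. The 1-second timeout is arbitrary.
synchronized (request.connection()) {
    while (request.response().writeQueueFull()) {
        waitingForDrain = true;
        try {
            request.connection().wait(1000);     // returns after 1s even if notifyAll() was missed
        } catch (InterruptedException e) {
            throw new InterruptedIOException(e.getMessage());
        } finally {
            waitingForDrain = false;
        }
        // the loop condition re-checks request.response().writeQueueFull() before waiting again
    }
}
```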

@johnaohara
Member Author

johnaohara commented Nov 14, 2019

After further investigation, it looks like what is happening is:

  1. When the client closes a connection without having fully read the response, an io.vertx.core.VertxException: Connection was closed is thrown from the event loop
  2. The exceptionHandler and the endHandler are then both called from the event loop, so context has switched from the worker pool to the event loop
  3. Most worker pool threads catch the exception and throw an IOException from awaitWriteable()

I am debugging further to map out the life cycle of the connection after a VertxException is thrown in the event loop.

I think that protected Throwable throwable also needs guarding, as the calls to exceptionHandler come from the event loop and are not guarded by synchronized.

@johnaohara
Member Author

@stuartwdouglas we were not registering a closeHandler for the response and notifying the worker thread when the connection is closed from the client side while there is data remaining to be written. I have updated my PR accordingly.
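
Roughly, the kind of registration being described (a sketch; the throwable field and the exact wiring are illustrative, the actual PR may differ):

```java
// Sketch of the closeHandler registration described above: when the client closes the
// connection, record the failure and wake the worker blocked in awaitWriteable().
// The throwable field is illustrative; the actual PR may differ.
request.response().closeHandler(v -> {
    synchronized (request.connection()) {
        throwable = new IOException("Connection has been closed");
        if (waitingForDrain) {
            request.connection().notifyAll();
        }
    }
});
```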

@johnaohara johnaohara changed the title Fix race condition on VertxBlockingOutput.waitingForDrain Register response.closeHandler() in VertxBlockingOutput and test for clients closing connections Nov 14, 2019
@patriot1burke
Contributor

@johnaohara I'm not seeing your deadlock scenario. You can only have deadlock if there are two separate things being locked. There are no locks being used, only synchronized wait/notify. There is only synchronization on request.connection(), so it's impossible to have deadlock.

I think the possible error is that there is no synchronized block in the drain handler. Maybe this isn't true, but I thought notify had to be called within a synchronized block. Apologies for missing this.

@johnaohara
Member Author

@johnaohara I'm not seeing your deadlock scenario. You can only have deadlock if there are two separate things being locked. There are no locks being used, only synchronized wait/notify. There is only synchronization on request.connection(), so it's impossible to have deadlock.

When I wrote this I had misunderstood what was happening.

What I was concerned about was a livelock scenario, where a worker thread obtains a lock on request.connection() [1], and while that lock is held [2] another thread (the event loop) tries to acquire a lock on the same object [3]. My concern was that in this case the calls to exceptionHandler() and endHandler() come from the event loop, which is a separate thread, and the event loop wouldn't be able to progress.

I think the possible error is that there is no synchronized block in the drain handler. Maybe this isn't true, but I thought notify had to be called within a synchronized block. Apologies for missing this.

What was missing was registering a closeHandler() that is called from the event loop when the client closes the connection and also interrupts the worker thread waiting to write data.

1 -
2 -
3 -

@stuartwdouglas
Collaborator

I think what we actually need is #5491

@johnaohara
Member Author

#5491 still finishes with the worker pool threads locked.

@johnaohara
Member Author

@stuartwdouglas I have rebased on top of #5491 and added a check for a closed connection; this doesn't hang when connections are closed in the middle of writing.

@johnaohara
Member Author

Neither be218e4 nor #5491 resets the buffer. Can we guarantee that buffers won't be pooled, and that we won't be leaving buffers in an invalid state?

@stuartwdouglas
Collaborator

buffer.clear() does not reset the buffer; it is buffer.release() that does this.

It looks like if you call write with the response closed, the buffer will not be freed; once it is passed into Netty, though, it should be safe. I am adding a try/catch to the write methods to handle this.
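
A sketch of that shape: once the buffer has been handed to the response, Netty owns it; if the hand-off fails, clean it up locally instead of leaking it. The method name and the clean-up details are illustrative (the following comments in this thread discuss whether release() or clear() is the right clean-up), and request is the HttpServerRequest field from the earlier sketch:

```java
import java.io.IOException;

import io.netty.buffer.ByteBuf;
import io.vertx.core.buffer.Buffer;

// Illustrative shape of a guarded write: once write() succeeds the buffer belongs to
// Netty; if the hand-off fails (e.g. the response is already closed), clean it up here.
void writeBlocking(ByteBuf data) throws IOException {
    try {
        request.response().write(Buffer.buffer(data));
    } catch (Exception e) {
        if (data != null && data.refCnt() > 0) {
            data.release();          // only release while we still hold a reference
        }
        throw new IOException("Failed to write response data", e);
    }
}
```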

@johnaohara
Member Author

buffer.release() decrements the reference count and deallocates the buffer when the ref count reaches 0, or, in the case of a pooled buffer, returns it to the pool. buffer.clear() resets the reader and writer indexes, which in effect is an empty buffer; it has the same effect as reading all the data in the buffer. I don't think we want to release the buffer at this point in the code path.

I had another branch that does not try to write the buffer to the response if the connection has been closed.

@stuartwdouglas
Collaborator

I have updated quarkus-http with these changes: quarkusio/quarkus-http@4a8fde5

Do you want to update this PR or do you want me to?

@johnaohara
Member Author

@stuartwdouglas I have updated the PR to throw an IOException if the client has closed the connection, and to release the buffer if an exception occurs. The stack trace is a lot clearer now as well.

@stuartwdouglas
Collaborator

Can you squash the commits?

@johnaohara
Member Author

I can, but data.release() is not correct here: the ref count is already 0, and decrementing it causes Netty to throw an io.netty.util.IllegalReferenceCountException: refCnt: 0, decrement: 1 exception. I think we need to just call data.clear().
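
The distinction is easy to see with a standalone Netty buffer (illustrative, unrelated to the PR code itself):

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;

// Standalone illustration of the release()/clear() distinction discussed above.
public class ByteBufLifecycle {
    public static void main(String[] args) {
        ByteBuf buf = Unpooled.buffer(16);
        buf.writeInt(42);

        buf.clear();                        // resets readerIndex/writerIndex; refCnt stays at 1
        System.out.println(buf.refCnt());   // prints 1

        buf.release();                      // refCnt 1 -> 0, buffer is deallocated
        // buf.release();                   // would throw IllegalReferenceCountException: refCnt: 0, decrement: 1
    }
}
```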

@stuartwdouglas stuartwdouglas added the triage/waiting-for-ci Ready to merge when CI successfully finishes label Nov 15, 2019
@johnaohara johnaohara merged commit b85aa59 into quarkusio:master Nov 15, 2019
@johnaohara johnaohara added this to the 1.1.0 milestone Nov 15, 2019
@gsmet gsmet removed the backport? label Nov 15, 2019
@gsmet gsmet modified the milestones: 1.1.0, 1.0.0.Final Nov 15, 2019