-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Fix CursorResourceManager.close
#1440
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| boolean doClose = withLock(lock, () -> { | ||
| State localState = state; | ||
| if (localState == State.OPERATION_IN_PROGRESS) { | ||
| if (localState.inProgress()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with the original code:
If AsyncCommandBatchCursor.close is called concurrently more than once while AsyncCommandBatchCursor.next is running (not necessarily the next method itself, but the asynchronous logic of this method), then the following state transitions were possible:
- One of the
closecalls observesState.OPERATION_IN_PROGRESS, and changes it toState.CLOSE_PENDING. So far so good. - The other
closeobservesState.CLOSE_PENDINGand changes it toState.CLOSED, which breaks one of the invariant that are supposed to be maintained byCursorResourceManager: while an operation is in progress,state.inProgress()must betrue, i.e.,statemust be eitherOPERATION_IN_PROGRESSorCLOSE_PENDING(impliesOPERATION_IN_PROGRESS).
| // Calling `close` twice is the key to reproducing. | ||
| // It does not matter whether we call `close` twice from the same thread or not. | ||
| ForkJoinPool.commonPool().execute(() -> cursor.close()); | ||
| ForkJoinPool.commonPool().execute(() -> cursor.close()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue this PR fixes is 100% a bug. And it manifests itself with an error that is very similar to the user-reported error (following is the error I reproduced locally):
java.lang.AssertionError: null
at com.mongodb.assertions.Assertions.assertTrue(Assertions.java:158)
// note that with this reproducer, sometimes the assertion on line 224 succeeds, and the one on line 225 fails
at com.mongodb.internal.operation.CursorResourceManager.setServerCursor(CursorResourceManager.java:224)
at com.mongodb.internal.operation.AsyncCommandBatchCursor.lambda$getMoreLoop$2(AsyncCommandBatchCursor.java:186)
at com.mongodb.internal.async.ErrorHandlingResultCallback.onResult(ErrorHandlingResultCallback.java:47)
at com.mongodb.internal.connection.DefaultServer$DefaultServerProtocolExecutor.lambda$executeAsync$0(DefaultServer.java:249)
at com.mongodb.internal.async.ErrorHandlingResultCallback.onResult(ErrorHandlingResultCallback.java:47)
at com.mongodb.internal.connection.CommandProtocolImpl.lambda$executeAsync$0(CommandProtocolImpl.java:88)
at com.mongodb.internal.connection.DefaultConnectionPool$PooledConnection.lambda$sendAndReceiveAsync$1(DefaultConnectionPool.java:774)
at com.mongodb.internal.connection.UsageTrackingInternalConnection.lambda$sendAndReceiveAsync$1(UsageTrackingInternalConnection.java:150)
at com.mongodb.internal.async.ErrorHandlingResultCallback.onResult(ErrorHandlingResultCallback.java:47)
at com.mongodb.internal.async.AsyncSupplier.lambda$finish$0(AsyncSupplier.java:65)
at com.mongodb.internal.async.SingleResultCallback.complete(SingleResultCallback.java:67)
at com.mongodb.internal.async.AsyncSupplier.lambda$onErrorIf$5(AsyncSupplier.java:122)
at com.mongodb.internal.async.AsyncSupplier.lambda$finish$0(AsyncSupplier.java:65)
at com.mongodb.internal.async.AsyncSupplier.lambda$finish$0(AsyncSupplier.java:65)
at com.mongodb.internal.connection.InternalStreamConnection.lambda$sendCommandMessageAsync$7(InternalStreamConnection.java:613)
at com.mongodb.internal.connection.InternalStreamConnection$MessageHeaderCallback$MessageCallback.onResult(InternalStreamConnection.java:894)
However, none of this guarantees that the bug the PR fixes is the bug the user is facing, because theoretically, I can imagine other bugs leading to the same error. However, I was unable to find them in our code, so I am saying that the fix proposed in this PR is my current best solution to JAVA-5516.
rozza
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| try { | ||
| assertDoesNotThrow(() -> { | ||
| cursorResourceManager.close(); | ||
| cursorResourceManager.close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A good example of close being executed concurrently with an operation being in progress, without any parallelism involved. Examples like this emphasize how concurrency and parallelism are different.
| }; | ||
| cursorResourceManager.tryStartOperation(); | ||
| try { | ||
| assertDoesNotThrow(() -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without the fix in CursorResourceManager, this assertion fails.
|
@rozza, I replaced the bug reproducer with a simpler one that is also deterministic. Please take a look. |
rozza
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice - LGTM!
JAVA-5516
JAVA-5516
|
Thanks for the fix. |
mongodb#1440 to 4.11.x JAVA-5516
|
If it can help anyone, I think this bug would cause conection not to be freed, which could lead to the saturation of the pool. We had this problem using Spring 3.3.4 (WebFlux) fixed in Spring Boot 3.4.0 when limiting find result with Flux.take(n). |
This PR is an alternative to #1439.
The commit b7be8c8 introduces a draft reproducer of the bug reported in JAVA-5516. Then the following commit fixes it. The reproducer is not guaranteed to reproduce the error, but usually does so in just a few attempts on my machine. With the fix I was unable to reproduce the error.
The commit 5105484 introduces a reproducer that we can actually check into the codebase.
JAVA-5516