Fix shared provider unload crash #5553
Merged
Fixed version of original change: #5523 - Fixed a build error with ORT_MINIMAL_BUILD and a build test failure in Python
Short Description: On Linux, calling dlclose() does not always unload a library (and destroy its global variables) immediately. This causes a problem with shared providers, since they hold a KernelRegistry object in a static and need to call into the core onnxruntime code to destroy it.
The fix adds a Shutdown() method to the shared providers and calls this on OrtEnv destruction in the core onnxruntime code. This way these globals are destroyed in a deterministic way before library unloading gets involved.
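To make the idea concrete, here is a minimal sketch of the pattern: a provider that owns a static registry, and an environment object that shuts the provider down in its destructor. Every name in it (FakeKernelRegistry, FakeSharedProvider, FakeEnv, EventLog, RunEnvLifetime) is a hypothetical stand-in for illustration, not the actual onnxruntime types or signatures.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins only; these are not the real onnxruntime classes.
std::vector<std::string>& EventLog() {
  static std::vector<std::string> log;
  return log;
}

struct FakeKernelRegistry {
  ~FakeKernelRegistry() { EventLog().push_back("registry destroyed"); }
};

// Models a shared provider that keeps its registry in a static.
class FakeSharedProvider {
 public:
  void EnsureRegistry() {
    if (!registry_) registry_ = std::make_unique<FakeKernelRegistry>();
  }
  // Explicit, idempotent teardown: releases the static at a known point.
  void Shutdown() { registry_.reset(); }
  bool HasRegistry() const { return registry_ != nullptr; }

 private:
  static std::unique_ptr<FakeKernelRegistry> registry_;
};
std::unique_ptr<FakeKernelRegistry> FakeSharedProvider::registry_;

// Models the environment object: its destructor shuts the provider down
// before any library unloading would take place.
class FakeEnv {
 public:
  explicit FakeEnv(FakeSharedProvider& provider) : provider_(provider) {}
  ~FakeEnv() {
    provider_.Shutdown();
    assert(!provider_.HasRegistry());  // nothing is left for static destruction
    EventLog().push_back("env destroyed");
  }

 private:
  FakeSharedProvider& provider_;
};

// Runs one environment lifetime and returns the recorded event order.
std::vector<std::string> RunEnvLifetime() {
  EventLog().clear();
  FakeSharedProvider provider;
  {
    FakeEnv env(provider);
    provider.EnsureRegistry();
  }  // env goes out of scope here; Shutdown() runs inside its destructor
  return EventLog();
}
```

In this sketch, calling RunEnvLifetime() records the registry being destroyed before the environment finishes tearing down, so the static is already empty by the time global destructors would run.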
Long Description: Here's what would happen before this change on Linux:
1. Customer code dynamically loads onnxruntime.so
2. Customer code then adds TensorRT provider to session
3. Onnxruntime dynamically loads onnxruntime_provider_tensorrt.so due to (2)
4. Customer code destroys all onnxruntime resources and unloads the onnxruntime.so library
5. onnxruntime global variables are destroyed, one of which unloads onnxruntime_provider_tensorrt.so
6. onnxruntime fully unloads
7. onnxruntime_provider_tensorrt.so global variables are destroyed, and try to call into onnxruntime.so code (to release a KernelRegistry object)
8. Crash!
On Windows, unloading onnxruntime_provider_tensorrt.so causes it to unload immediately, but on Linux it doesn't unload until after onnxruntime.so unloads.
Even if dlclose() unloaded and shut down the onnxruntime_provider_tensorrt.so library immediately, the shutdown order would still be undefined and could result in bad behavior. My change makes the shutdown order deterministic, so it works regardless of the order in which the OS destroys globals or when the library actually unloads.
Secondly, unloading libraries during library unload is bad app behavior on Windows (FreeLibrary should never be called during DllMain). So this is just the right thing to do in general.
My first fix was just changing the code so that we unload libraries on OrtEnv destruction. This isn't enough to fix the issue, because on Linux dlclose() doesn't guarantee when it unloads a library; it just marks it for unload at some later point when it feels like it. So the explicit Shutdown() method is needed in the provider for that.
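The ordering argument can be checked with a toy model. The sketch below does not load any real libraries: it tracks with plain flags whether the "core" is still loaded when the provider releases its registry, under an immediate or a deferred unload policy, with or without an explicit shutdown step. The names UnloadPolicy, TeardownResult and SimulateTeardown are made up for this illustration.

```cpp
#include <string>
#include <vector>

// Toy model only: the "libraries" here are flags, not real shared objects.
// Immediate models the Windows behavior described above, Deferred the Linux one.
enum class UnloadPolicy { Immediate, Deferred };

struct TeardownResult {
  std::vector<std::string> events;
  bool called_into_unloaded_core = false;
};

TeardownResult SimulateTeardown(UnloadPolicy policy, bool explicit_shutdown) {
  TeardownResult result;
  bool core_loaded = true;
  bool provider_holds_registry = true;

  // Releasing the registry needs core code, so the core must still be loaded.
  auto release_registry = [&](const std::string& when) {
    if (!provider_holds_registry) return;  // already released earlier
    provider_holds_registry = false;
    result.events.push_back("registry released " + when);
    if (!core_loaded) result.called_into_unloaded_core = true;
  };

  // With the fix: environment destruction shuts the provider down explicitly.
  if (explicit_shutdown) release_registry("in Shutdown()");

  // Core globals are destroyed; one of them requests the provider unload.
  result.events.push_back("core requests provider unload");
  if (policy == UnloadPolicy::Immediate) release_registry("by provider globals");

  core_loaded = false;
  result.events.push_back("core unloaded");

  // Deferred policy: the provider's globals are only destroyed now.
  if (policy == UnloadPolicy::Deferred) release_registry("by provider globals");
  return result;
}
```

In this model, only the combination of a deferred unload and no explicit shutdown ends with the registry being released after the core is gone. Adding the explicit shutdown step removes that case under both policies, which is the sense in which the shutdown order becomes deterministic.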
#5384 - Original customer bug