Description
For the last few days, I've observed the aarch64 CI job (which we run on an x86_64 box, using QEMU for emulation) failing with errors like the following during test collection:
___________ ERROR collecting tests/python_package_test/test_basic.py ___________
ImportError while importing test module '/LightGBM/tests/python_package_test/test_basic.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/root/miniforge/envs/test-env/lib/python3.12/importlib/__init__.py:90: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
tests/python_package_test/test_basic.py:12: in <module>
from sklearn.datasets import dump_svmlight_file, load_svmlight_file, make_blobs
/root/miniforge/envs/test-env/lib/python3.12/site-packages/sklearn/__init__.py:97: in <module>
from .utils._show_versions import show_versions
/root/miniforge/envs/test-env/lib/python3.12/site-packages/sklearn/utils/_show_versions.py:15: in <module>
from ._openmp_helpers import _openmp_parallelism_enabled
E ImportError: /root/miniforge/envs/test-env/lib/python3.12/site-packages/sklearn/utils/../../../../libgomp.so.1: cannot allocate memory in static TLS block
Reproducible example
This is happening across several different PRs, with changesets that are very unlikely to be causing this, suggesting it's some other change in the environment. For example:
- PR: [ci] [R-package] run 'R CMD check' as a foreground task #6508
- build: https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=16492&view=logs&j=c2f9361f-3c13-57db-3206-cee89820d5e3&t=6c24f9ed-c6ce-5b46-2a4e-317f1b2c686c
Environment info
N/A
Additional Comments
"TLS" in this error refers to "thread-local storage".
There is lots of prior discussion on similar issues:
- https://bugzilla.redhat.com/show_bug.cgi?id=1722181
- "It seems that scikit-learn has not been built correctly." Nvidia Jetson scikit-learn/scikit-learn#28362
- cannot allocate memory in static TLS block dmlc/xgboost#8488
- cannot allocate memory in static TLS block dmlc/xgboost#7110 (comment)
- ImportError: dlopen: cannot load any more object with static TLS pytorch/pytorch#2575
- aarch64: libgomp.so.1: cannot allocate memory in static TLS block opencv/opencv#14884
- Require libgomp on Linux conda-forge/scikit-learn-feedstock#220
- Start building linux-aarch64 nightlies dask-contrib/dask-sql#1144
All of those are about using libgomp on aarch64.
From https://bugzilla.redhat.com/show_bug.cgi?id=1722181:
The GNU TLS2 model which I'm afraid aarch64 uses unfortunately eats from the same TLS preallocated pool as libraries that require static TLS like libgomp, where it is performance critical to have it as static TLS.
On opencv/opencv#14884, there's some discussion about this specifically being caused by bundled libgomp in multiple Python packages, and there are suggestions that importing those libraries earlier (and therefore loading their libgomp earlier) can resolve this.
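As a rough sketch of that "load libgomp earlier" idea (not something I've verified fixes this CI job), the snippet below searches a directory tree for files that look like libgomp and tries to dlopen the first one with ctypes before anything else is imported. The helper names `find_libgomp_candidates` and `preload_first_libgomp` are hypothetical, and the `libgomp*.so*` file-name pattern is an assumption about how the library is named in the environment.

```python
import ctypes
import sys
from pathlib import Path


def find_libgomp_candidates(root):
    """Return sorted paths under `root` whose file name looks like libgomp.

    The 'libgomp*.so*' pattern is an assumption; it is meant to match both
    a plain 'libgomp.so.1' and hashed names bundled inside wheels.
    """
    return sorted(p for p in Path(root).rglob("libgomp*.so*") if p.is_file())


def preload_first_libgomp(root=sys.prefix):
    """Try to load the first usable libgomp under `root`.

    Returns the path that was loaded, or None if nothing could be loaded.
    """
    for candidate in find_libgomp_candidates(root):
        try:
            # Loading the library early may let it claim static TLS space
            # before other shared objects use it up.
            ctypes.CDLL(str(candidate), mode=ctypes.RTLD_GLOBAL)
        except OSError:
            continue  # not loadable on this platform, try the next one
        return candidate
    return None
```

In a test suite, this could be called at the very top of `conftest.py`, before the first `import sklearn`. If several packages bundle their own libgomp, which copy gets loaded first still depends on the sort order here, so treat this as a starting point for experimentation only.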
These also have some helpful information: