Open
Labels: gpu (CUDA), question
Description
I'm trying to use LightGBM on a multi-GPU NVIDIA V100 system. With device_type=cuda I get a segmentation fault; with device_type=gpu the same code works fine. I'm using a build from the latest checkout of master.
(gdb) where
#0 0x00000ede46d7cfc9 in LightGBM::CUDARegressionObjectiveInterface<LightGBM::RegressionL2loss>::Init(LightGBM::Metadata const&, int) () from /usr/local/lib/lib_lightgbm.so
#1 0x00000ede46466302 in LightGBM::Booster::CreateObjectiveAndMetrics (this=0xedd2d9c2800) at /mnt/slowstore/pub/LightGBM/src/c_api.cpp:213
#2 0x00000ede4643b33f in LightGBM::Booster::Booster (this=0xedd2d9c2800, train_data=0xedd2d1e4680, parameters=0xedd30d4a900 "boosting=gbdt objective=regression gpu_use_dp=false tree_learner=data max_bin=255 num_leaves=256 min_data_in_leaf=100 learning_rate=0.01 num_iterations=5000 feature_fraction=0.8 bagging_fraction=0.8 b"...) at /mnt/slowstore/pub/LightGBM/src/c_api.cpp:183
#3 LGBM_BoosterCreate (train_data=0xedd2d1e4680, parameters=0xedd30d4a900 "boosting=gbdt objective=regression gpu_use_dp=false tree_learner=data max_bin=255 num_leaves=256 min_data_in_leaf=100 learning_rate=0.01 num_iterations=5000 feature_fraction=0.8 bagging_fraction=0.8 b"..., out=0x7ffc34df6c28) at /mnt/slowstore/pub/LightGBM/src/c_api.cpp:1944
#4 0x00000ede996a988d in svr::kernel::kernel_gbm<double>::init (this=this@entry=0xedd3263fcd0, X_t=..., Y=...) at /usr/include/c++/14/bits/basic_string.h:227
#5 0x00000ede99825dd2 in _ZN3svr9datamodel9OnlineSVR4tuneEv._omp_fn.0(void) () at /mnt/faststore/repo/tempus-core/SVRRoot/OnlineSVR/src/onlinesvr_tune_fast.cpp:145
(gdb) list
69 in ./nptl/pthread_mutex_trylock.c
(gdb) up
#1 0x00000ede46466302 in LightGBM::Booster::CreateObjectiveAndMetrics (this=0xedd2d9c2800) at /mnt/slowstore/pub/LightGBM/src/c_api.cpp:213
213 objective_fun_->Init(train_data_->metadata(), train_data_->num_data());
(gdb) list -10
198 boosting_->MergeFrom(other->boosting_.get());
199 }
200
201 ~Booster() {
202 }
203
204 void CreateObjectiveAndMetrics() {
205 // create objective function
206 objective_fun_.reset(ObjectiveFunction::CreateObjectiveFunction(config_.objective,
207 config_));
The parameter string is built as follows:
s << "boosting=gbdt objective=regression gpu_use_dp=false tree_learner=data max_bin=" LGBM_MAXBIN " num_leaves=256 min_data_in_leaf=100 learning_rate=" << PROPS.get_k_learn_rate() << " num_iterations=" << PROPS.get_k_epochs() <<
" feature_fraction=0.8 bagging_fraction=0.8 bagging_freq=5 metric=l2 save_binary=true use_missing=false force_col_wise=true num_threads=" << C_n_cpu << " device_type=cuda num_gpu=" << common::gpu_handler_1::get().get_gpu_devices_count();
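To narrow down which setting triggers the crash, it may help to keep the parameters as key/value pairs and override one at a time. The following is only a sketch under assumptions: `ParamMap`, `to_text`, `baseline_params`, `with_override`, `single_gpu_serial` and `join_params` are hypothetical helpers, not part of LightGBM, and the function arguments stand in for `PROPS.get_k_learn_rate()`, `PROPS.get_k_epochs()`, `C_n_cpu` and `get_gpu_devices_count()`. The `tree_learner=serial` plus `num_gpu=1` variant is a bisecting step, not a confirmed fix. `std::map` orders keys alphabetically, which I am assuming is harmless because the string is a list of independent key=value pairs.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helpers for bisecting the crash; none of this is LightGBM API.
using ParamMap = std::map<std::string, std::string>;

// Format a number the same way the original operator<< chain would.
template <typename T>
std::string to_text(const T& value) {
    std::ostringstream s;
    s << value;
    return s.str();
}

// Baseline copied from the parameter string in this report.
ParamMap baseline_params(double learn_rate, int epochs, int n_cpu, int n_gpu) {
    return ParamMap{
        {"boosting", "gbdt"},
        {"objective", "regression"},
        {"gpu_use_dp", "false"},
        {"tree_learner", "data"},
        {"max_bin", "255"},  // value of LGBM_MAXBIN as shown in the backtrace
        {"num_leaves", "256"},
        {"min_data_in_leaf", "100"},
        {"learning_rate", to_text(learn_rate)},
        {"num_iterations", to_text(epochs)},
        {"feature_fraction", "0.8"},
        {"bagging_fraction", "0.8"},
        {"bagging_freq", "5"},
        {"metric", "l2"},
        {"save_binary", "true"},
        {"use_missing", "false"},
        {"force_col_wise", "true"},
        {"num_threads", to_text(n_cpu)},
        {"device_type", "cuda"},
        {"num_gpu", to_text(n_gpu)},
    };
}

// Return a copy of params with one key replaced (or added).
ParamMap with_override(ParamMap params, const std::string& key,
                       const std::string& value) {
    assert(!key.empty());
    params[key] = value;
    return params;
}

// Diagnostic variant (an assumption, not a known fix): keep device_type=cuda
// but fall back to a single GPU and the serial tree learner.
ParamMap single_gpu_serial(const ParamMap& params) {
    return with_override(with_override(params, "tree_learner", "serial"),
                         "num_gpu", "1");
}

// Join the pairs into the space-separated key=value form used above.
std::string join_params(const ParamMap& params) {
    std::ostringstream s;
    bool first = true;
    for (const auto& p : params) {
        if (!first) s << ' ';
        s << p.first << '=' << p.second;
        first = false;
    }
    return s.str();
}
```

The result of `join_params(single_gpu_serial(baseline_params(...)))` can be passed via `.c_str()` as the `parameters` argument of `LGBM_BoosterCreate`. If that variant no longer crashes, re-enabling `num_gpu` and `tree_learner=data` one at a time should show which of them is involved.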
Reproducible example
Environment info
LightGBM version or commit hash:
Command(s) you used to install LightGBM
[20250705-05:27:46] zarko@tempus:/mnt/faststore/repo/tempus-core/build$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 24.10
Release: 24.10
Codename: oracular
nvidia-smi
Sat Jul 5 05:28:10 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03 Driver Version: 575.51.03 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla V100-FHHL-16GB On | 00000000:03:00.0 Off | 0 |
| N/A 36C P0 24W / 100W | 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 Tesla V100-FHHL-16GB On | 00000000:04:00.0 Off | 0 |
| N/A 35C P0 22W / 100W | 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 Tesla V100-FHHL-16GB On | 00000000:05:00.0 Off | 0 |
| N/A 34C P0 23W / 100W | 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 Tesla V100-FHHL-16GB On | 00000000:82:00.0 Off | 0 |
| N/A 34C P0 25W / 100W | 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Additional Comments