[CUDA] Crash when using device_type=cuda #6961

@asenzz

Description

I'm trying to use LightGBM on a multi-GPU NVIDIA V100 system with CUDA. With device_type=cuda I get a segmentation fault; with device_type=gpu it works fine. I'm using the latest checkout of the LightGBM master branch.

(gdb) where
#0  0x00000ede46d7cfc9 in LightGBM::CUDARegressionObjectiveInterface<LightGBM::RegressionL2loss>::Init(LightGBM::Metadata const&, int) () from /usr/local/lib/lib_lightgbm.so
#1  0x00000ede46466302 in LightGBM::Booster::CreateObjectiveAndMetrics (this=0xedd2d9c2800) at /mnt/slowstore/pub/LightGBM/src/c_api.cpp:213
#2  0x00000ede4643b33f in LightGBM::Booster::Booster (this=0xedd2d9c2800, train_data=0xedd2d1e4680, parameters=0xedd30d4a900 "boosting=gbdt objective=regression gpu_use_dp=false tree_learner=data max_bin=255 num_leaves=256 min_data_in_leaf=100 learning_rate=0.01 num_iterations=5000 feature_fraction=0.8 bagging_fraction=0.8 b"...) at /mnt/slowstore/pub/LightGBM/src/c_api.cpp:183
#3  LGBM_BoosterCreate (train_data=0xedd2d1e4680, parameters=0xedd30d4a900 "boosting=gbdt objective=regression gpu_use_dp=false tree_learner=data max_bin=255 num_leaves=256 min_data_in_leaf=100 learning_rate=0.01 num_iterations=5000 feature_fraction=0.8 bagging_fraction=0.8 b"..., out=0x7ffc34df6c28) at /mnt/slowstore/pub/LightGBM/src/c_api.cpp:1944
#4  0x00000ede996a988d in svr::kernel::kernel_gbm<double>::init (this=this@entry=0xedd3263fcd0, X_t=..., Y=...) at /usr/include/c++/14/bits/basic_string.h:227
#5  0x00000ede99825dd2 in _ZN3svr9datamodel9OnlineSVR4tuneEv._omp_fn.0(void) () at /mnt/faststore/repo/tempus-core/SVRRoot/OnlineSVR/src/onlinesvr_tune_fast.cpp:145


(gdb) list
69      in ./nptl/pthread_mutex_trylock.c
(gdb) up
#1  0x00000ede46466302 in LightGBM::Booster::CreateObjectiveAndMetrics (this=0xedd2d9c2800) at /mnt/slowstore/pub/LightGBM/src/c_api.cpp:213
213           objective_fun_->Init(train_data_->metadata(), train_data_->num_data());
(gdb) list -10
198         boosting_->MergeFrom(other->boosting_.get());
199       }
200
201       ~Booster() {
202       }
203
204       void CreateObjectiveAndMetrics() {
205         // create objective function
206         objective_fun_.reset(ObjectiveFunction::CreateObjectiveFunction(config_.objective,
207                                                                         config_));


The parameters string is built as follows:

s << "boosting=gbdt objective=regression gpu_use_dp=false tree_learner=data max_bin=" LGBM_MAXBIN " num_leaves=256 min_data_in_leaf=100 learning_rate=" << PROPS.get_k_learn_rate() << " num_iterations=" << PROPS.get_k_epochs() <<
        " feature_fraction=0.8 bagging_fraction=0.8 bagging_freq=5 metric=l2 save_binary=true use_missing=false force_col_wise=true num_threads=" << C_n_cpu << " device_type=cuda num_gpu=" << common::gpu_handler_1::get().get_gpu_devices_count();

Reproducible example

Environment info

LightGBM version or commit hash:

Command(s) you used to install LightGBM

[20250705-05:27:46] zarko@tempus:/mnt/faststore/repo/tempus-core/build$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 24.10
Release:        24.10
Codename:       oracular


nvidia-smi 
Sat Jul  5 05:28:10 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03              Driver Version: 575.51.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-FHHL-16GB           On  |   00000000:03:00.0 Off |                    0 |
| N/A   36C    P0             24W /  100W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla V100-FHHL-16GB           On  |   00000000:04:00.0 Off |                    0 |
| N/A   35C    P0             22W /  100W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  Tesla V100-FHHL-16GB           On  |   00000000:05:00.0 Off |                    0 |
| N/A   34C    P0             23W /  100W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  Tesla V100-FHHL-16GB           On  |   00000000:82:00.0 Off |                    0 |
| N/A   34C    P0             25W /  100W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Additional Comments

Labels: gpu (CUDA), question