[Question]: Multi-card LoRA fine-tuning fails with "Retry to connect to 10.52.83.156:41328" #10889

Description

@aolens-cc

Please describe your question

System: Linux Ubuntu 20.04.6 LTS
PaddlePaddle Version: 3.1.0
PaddleNLP Version: 3.0.0b4.post20250704
Accelerator: Kunlunxin (昆仑芯) P800 × 8
Config file:
root@tjdm-sys-rpm103azbxwk:/home/PaddleNLP-develop/llm# cat config/qwen/lora_argument_32B.json
{
    "model_name_or_path": "Qwen/Qwen3-32B",
    "dataset_name_or_path": "./data",
    "output_dir": "./checkpoints/lora_ckpts_32B",
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "per_device_eval_batch_size": 4,
    "eval_accumulation_steps": 16,
    "num_train_epochs": 3,
    "learning_rate": 3e-04,
    "warmup_steps": 30,
    "logging_steps": 1,
    "evaluation_strategy": "epoch",
    "save_strategy": "epoch",
    "src_length": 1024,
    "max_length": 2048,
    "bf16": true,
    "fp16_opt_level": "O2",
    "do_train": true,
    "do_eval": true,
    "disable_tqdm": true,
    "load_best_model_at_end": true,
    "eval_with_do_generation": false,
    "metric_for_best_model": "accuracy",
    "recompute": true,
    "save_total_limit": 1,
    "tensor_parallel_degree": 1,
    "pipeline_parallel_degree": 1,
    "lora": true,
    "unified_checkpoint": true,
    "zero_padding": false,
    "use_flash_attention": true,
    "pissa": false,
    "device": "xpu"
}
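
For reference, a minimal sketch (standard library only; the path matches the file shown above) to confirm the config parses cleanly and to echo the parallelism settings:

import json

# Sketch: confirm the config file is valid JSON before launching, and
# print the keys that control parallelism (both 1 => pure data parallel).
with open("./config/qwen/lora_argument_32B.json") as f:
    cfg = json.load(f)
print(cfg["device"], cfg["tensor_parallel_degree"], cfg["pipeline_parallel_degree"])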
Launch commands:
python -u -m paddle.distributed.launch run_finetune.py ./config/qwen/lora_argument_32B.json
python -u -m paddle.distributed.launch --master 127.0.0.1:12345 --devices "0,1,2,3" run_finetune.py ./config/qwen/lora_argument_32B.json
python -u -m paddle.distributed.launch --master 127.0.0.1:12345 --nnodes 1 --nproc_per_node 4 run_finetune.py ./config/qwen/lora_argument_32B.json
All three of the above were tried.
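
Every variant depends on the TCP store that paddle.distributed.launch brings up. A minimal sketch (an assumption, not confirmed: on a single node the launcher derives the node address from the hostname, so a stale /etc/hosts entry can point it at an unreachable IP) to check what the hostname resolves to and whether that address is bindable on this machine:

import socket

# Sketch: print what the hostname resolves to and test whether that
# address belongs to a local interface (bind fails if it does not).
host = socket.gethostname()
ip = socket.gethostbyname(host)
print(f"{host} resolves to {ip}")
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s.bind((ip, 0))
    print("bind OK: address is local")
except OSError as err:
    print("bind failed: address is NOT local:", err)
finally:
    s.close()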
Error:
XCCL /usr/local/lib/python3.10/dist-packages/paddle/base/../libs/libbkcl.so loaded
[09:34:00][tjdm-sys-rpm103azbxwk.tjdm.baidu][11342:11342][WARN][BKCL][globals.cpp:259] xccl version: 7b9c31e [rdma] [fix compile issue] build data: May 13 2025 06:07:35
/usr/local/lib/python3.10/dist-packages/_distutils_hack/__init__.py:30: UserWarning: Setuptools is replacing distutils. Support for replacing an already imported distutils is deprecated. In the future, this condition will fail. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
warnings.warn(
/usr/local/lib/python3.10/dist-packages/jieba/_compat.py:18: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
import pkg_resources
[2025-07-28 09:34:01,994] [ INFO] distributed_strategy.py:333 - distributed strategy initialized
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_xpus', current_value='0', default_value='')
FLAGS(name='FLAGS_enable_pir_in_executor', current_value=True, default_value=False)

I0728 09:34:01.995637 11342 tcp_utils.cc:185] The server starts to listen on IP_ANY:41328; setting synclog to 2048
I0728 09:34:05.063338 11342 tcp_utils.cc:111] Retry to connect to 10.52.83.156:41328 while the server is not yet listening.
I0728 09:34:11.143321 11342 tcp_utils.cc:111] Retry to connect to 10.52.83.156:41328 while the server is not yet listening.
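
The server line says it is listening on IP_ANY:41328, yet the client keeps retrying 10.52.83.156:41328. A quick probe of that exact endpoint (a sketch; run from a second shell while the trainer is still retrying) shows whether anything answers there:

import socket

# Sketch: probe the endpoint from the log. connect_ex returns 0 when
# the connection succeeds and an errno value otherwise.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(3)
print("connect_ex ->", s.connect_ex(("10.52.83.156", 41328)))
s.close()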
