[BUG]: Running example/language/llama7B on an NPU card fails at the torch.npu.current_device() call #6359

@upwindfly

Is there an existing issue for this bug?

  • I have searched the existing issues

The bug has not been fixed in the latest main branch

  • I have checked the latest main branch

Do you feel comfortable sharing a concise (minimal) script that reproduces the error? :)

Yes, I will share a minimal reproducible script.

🐛 Describe the bug

Full error:
  File "/usr/local/lib/python3.10/dist-packages/colossalai/booster/plugin/__init__.py", line 2, in <module>
    from .hybrid_parallel_plugin import HybridParallelPlugin
  File "/usr/local/lib/python3.10/dist-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 43, in <module>
    from colossalai.zero.low_level import LowLevelZeroOptimizer
  File "/usr/local/lib/python3.10/dist-packages/colossalai/zero/__init__.py", line 1, in <module>
    from .gemini import GeminiAdamOptimizer, GeminiDDP, GeminiOptimizer, get_static_torch_model
  File "/usr/local/lib/python3.10/dist-packages/colossalai/zero/gemini/__init__.py", line 4, in <module>
    from .gemini_optimizer import GeminiAdamOptimizer, GeminiOptimizer
  File "/usr/local/lib/python3.10/dist-packages/colossalai/zero/gemini/gemini_optimizer.py", line 70, in <module>
    class GeminiOptimizer(OptimizerWrapper):
  File "/usr/local/lib/python3.10/dist-packages/colossalai/zero/gemini/gemini_optimizer.py", line 578, in GeminiOptimizer
    device: torch.device = get_accelerator().get_current_device(),
  File "/usr/local/lib/python3.10/dist-packages/colossalai/accelerator/npu_accelerator.py", line 41, in get_current_device
    return torch.device(f"npu:{torch.npu.current_device()}")
  File "/usr/local/lib/python3.10/dist-packages/torch_npu/npu/utils.py", line 62, in current_device
    torch_npu.npu._lazy_init()
  File "/usr/local/lib/python3.10/dist-packages/torch_npu/npu/__init__.py", line 215, in _lazy_init
    torch_npu._C._npu_init()
RuntimeError: Initialize:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:247 NPU function error: at_npu::native::AclSetCompileopt(aclCompileOpt::ACL_PRECISION_MODE, precision_mode), error code is 500001
[ERROR] 2025-07-10-06:34:49 (PID:272758, Device:0, RankID:0) ERR00100 PTA call acl api failed
[Error]: The internal ACL of the system is incorrect.
Rectify the fault based on the error information in the ascend log.
E90000: [PID: 272758] 2025-07-10-06:34:49.078.665 Compile operator failed, cause:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

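However, running the following directly in a terminal returns the result normally:
import torch
import torch_npu  # registers the torch.npu backend
print(torch.npu.current_device())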
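
One possible reading of the traceback above, which has not been confirmed: the frame at gemini_optimizer.py line 578 is a default argument value, and Python evaluates default values once, when the def statement runs. That would mean get_accelerator().get_current_device(), and with it NPU initialization, runs while colossalai is still being imported. The last two lines of the error match the message Python's multiprocessing raises when a new process is started during import under the spawn or forkserver start methods, which could explain why the same call succeeds in an interactive terminal. The sketch below only illustrates the definition-time evaluation; get_current_device here is a hypothetical stand-in and no NPU or torch is involved.

```python
# Hypothetical stand-in for get_accelerator().get_current_device().
# It records each call so the timing of evaluation is visible.
calls = []


def get_current_device():
    calls.append("called")
    return "npu:0"


# Eager: the default value is evaluated once, while this `def` statement
# executes (that is, at import time of the enclosing module).
def eager(device=get_current_device()):
    return device


# The stand-in has already run, although eager() was never invoked.
calls_after_eager_def = len(calls)


# Lazy: a None sentinel defers the lookup until the function is called.
def lazy(device=None):
    if device is None:
        device = get_current_device()
    return device
```

If this reading is right, a workaround to try would be importing colossalai only inside an `if __name__ == "__main__":` block of the launch script, so that nothing initializes the NPU while a child process is re-importing the main module.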

Environment

Docker image: docker pull hpcaitech/pytorch-npu:2.4.0
