Issue running 'PhoenixOS-Remoting' separately, with stable-diffusion (pytorch version) as AI app #17

@zaglc

1 Problem description

I hit CUDA error 209 (cudaErrorNoKernelImageForDevice) when running the stable-diffusion app (inference.py with batch=2, iter=2): the program cannot find a kernel image, possibly because external .so files are not supported. I get the same error when running the original cricket, even though your paper, Characterizing Network Requirements for GPU API Remoting in AI Applications, reports running the PyTorch version of SD.
[Screenshot: client error, PhoenixOS built without optimizations]

Here are my compile arguments. VERSION=NO_OPTIMIZATION and the version that includes async, cache, and handler all produce the same error.

LOG=INFO VERSION=NO_OPTIMIZATION make

2 Environment setup

My machine runs Ubuntu 22.04 with one NVIDIA A4500 (sm=80), driver version 535.183.06, and CUDA 11.8. I tried several approaches, and none of them succeeded:

  • Compiling directly on the physical machine: compiling cuda-gdb 11.1 fails (a config.h is not found).
  • Using the cuda-gdb 11.8 source RPM: many path issues due to version differences (e.g. build/gnu -> build-gnu), and I could not track down every wrong path.
  • Using the Dockerfile (CUDA 11.1) provided by this repo: same issue as reported for the original cricket, an NVML driver/library version mismatch; I can only download NVML >= 565.

Then I pulled the NVIDIA Docker image nvidia/cuda:11.1.1-cudnn8-devel-rockylinux8 and built the environment on top of it, following your Dockerfile. This leads to the same problem described in section 1.

I ran the PyTorch SD app under miniconda. Here is my environment:

python             3.8.0
accelerate         0.20.1
certifi            2024.12.14
charset-normalizer 3.4.1
diffusers          0.9.0
filelock           3.16.1
fsspec             2024.12.0
huggingface-hub    0.24.6
idna               3.10
importlib_metadata 8.5.0
numpy              1.24.4
packaging          24.2
pillow             10.4.0
pip                24.2
psutil             6.1.1
PyYAML             6.0.2
regex              2024.11.6
requests           2.32.3
safetensors        0.5.0
sentencepiece      0.2.0
setuptools         75.1.0
tokenizers         0.13.3
torch              1.10.1+cu111
torchaudio         0.10.1+cu111
torchvision        0.11.2+cu111
tqdm               4.67.1
transformers       4.30.0
typing_extensions  4.12.2
urllib3            2.2.3
wheel              0.44.0
zipp               3.20.2
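
One way to rule out a plain architecture mismatch, independent of the remoting layer, is to compare the architectures the installed torch build was compiled for with the GPU's compute capability. The sketch below is only an illustration: has_kernel_image and check_local_gpu are hypothetical helper names, and the matching rule is a simplified reading of CUDA binary compatibility (an sm_XY cubin runs on devices with the same major version and minor >= Y; compute_XY PTX can be JIT-compiled on any device with capability >= X.Y). It does not inspect kernels shipped in other external .so files, so passing this check does not exclude the remoting layer as the cause of error 209.

```python
def has_kernel_image(arch_list, capability):
    """Return True if any entry in arch_list can run on a GPU with the given
    (major, minor) compute capability, under the simplified rule above."""
    major, minor = capability
    for arch in arch_list:
        kind, _, num = arch.partition("_")
        if not num.isdigit():
            # Skip entries this sketch does not understand (e.g. suffixed archs).
            continue
        a_major, a_minor = divmod(int(num), 10)
        # Native cubin: same major version, compiled minor not newer than device.
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True
        # PTX: can be JIT-compiled for any device at least as new.
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True
    return False


def check_local_gpu(device_index=0):
    """Query the local torch build and GPU; requires torch and a visible GPU."""
    import torch  # imported lazily so the helper above works without torch

    archs = torch.cuda.get_arch_list()
    cap = torch.cuda.get_device_capability(device_index)
    return archs, cap, has_kernel_image(archs, cap)
```

Calling check_local_gpu() once natively (without the remoting client) shows whether the torch wheel itself covers the device; if it does and error 209 still appears only under remoting, the problem more likely lies in how kernel images are registered through the proxy.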

3 Other modifications

Because of the dependency on the main PhoenixOS, I disabled the following code in cpu/proxy/svc.cpp, which is not guarded by POS_ENABLE:
[Screenshot: svc]
I also manually disabled compilation of tests and bin/tests in the main Makefile.

I sincerely hope you can point out any steps I missed, tell me what additional traceback information I should provide, or share a working configuration, Dockerfile, or Docker image.

Thanks a lot
