Description
1 Problem description
I hit CUDA error 209 when running the Stable Diffusion app (args: batch=2, iter=2, inference.py): the program cannot find a kernel image, possibly because external .so files are not supported. I see the same error when running the original cricket, even though your paper "Characterizing Network Requirements for GPU API Remoting in AI Applications" reports support for running the PyTorch version of SD.
Here are my compile arguments. Building with VERSION=NO_OPTIMIZATION, or with a version that includes async, cache, and handler, results in the same error.
LOG=INFO VERSION=NO_OPTIMIZATION make
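For reference, error 209 corresponds to `cudaErrorNoKernelImageForDevice`, which is raised when no loaded kernel image matches the device architecture. A minimal diagnostic sketch, assuming the problem may lie in an architecture mismatch rather than in the remoting layer: it checks whether a list of compiled architectures covers a given compute capability. The helper name `has_kernel_image` is hypothetical; `torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability()` are existing PyTorch calls, used here only if PyTorch and a GPU are available.

```python
from typing import Iterable, Tuple


def has_kernel_image(arch_list: Iterable[str], capability: Tuple[int, int]) -> bool:
    """Return True if any entry in arch_list can run on a device of this capability.

    Rules applied (CUDA binary/PTX compatibility):
      - sm_XY cubin runs on devices with the same major X and minor >= Y.
      - compute_XY PTX can be JIT-compiled for any capability >= X.Y.
    """
    dev_major, dev_minor = capability
    for arch in arch_list:
        kind, _, digits = arch.partition("_")
        # Skip entries this sketch does not understand (e.g. suffixed variants).
        if not digits.isdigit() or len(digits) < 2:
            continue
        major, minor = int(digits[:-1]), int(digits[-1])
        if kind == "sm" and major == dev_major and minor <= dev_minor:
            return True
        if kind == "compute" and (major, minor) <= (dev_major, dev_minor):
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        archs = torch.cuda.get_arch_list()          # architectures in this torch build
        cap = torch.cuda.get_device_capability(0)   # (major, minor) of GPU 0
        print("compiled archs:", archs)
        print("device capability:", cap)
        print("kernel image available:", has_kernel_image(archs, cap))
```

If this reports a kernel image as available when run natively but error 209 still appears under remoting, the mismatch is more likely in how the fatbinary is forwarded than in the PyTorch build itself.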
2 Environment setup
My machine runs Ubuntu 22.04 with one NVIDIA A4500 (sm=80), driver version 535.183.06, and CUDA version 11.8. I tried several approaches, but none of them succeeded:
- Compiling directly on the physical machine: compiling cuda-gdb 11.1 fails (a config.h is not found).
- Using the cuda-gdb 11.8 source RPM: many path issues due to version discrepancies (e.g. build/gnu -> build-gnu), and I could not track down all of the wrong paths.
- Using the Dockerfile (CUDA 11.1) provided by this repo: I hit the same issue reported for the original cricket, an NVML driver/library version mismatch, and I can only download NVML>=565.
Then I pulled the NVIDIA Docker image nvidia/cuda:11.1.1-cudnn8-devel-rockylinux8 and built the environment on top of it, following your Dockerfile. This led to the same problem described in section 1.
I ran the PyTorch SD app under Miniconda. Here is my environment:
python 3.8.0
accelerate 0.20.1
certifi 2024.12.14
charset-normalizer 3.4.1
diffusers 0.9.0
filelock 3.16.1
fsspec 2024.12.0
huggingface-hub 0.24.6
idna 3.10
importlib_metadata 8.5.0
numpy 1.24.4
packaging 24.2
pillow 10.4.0
pip 24.2
psutil 6.1.1
PyYAML 6.0.2
regex 2024.11.6
requests 2.32.3
safetensors 0.5.0
sentencepiece 0.2.0
setuptools 75.1.0
tokenizers 0.13.3
torch 1.10.1+cu111
torchaudio 0.10.1+cu111
torchvision 0.11.2+cu111
tqdm 4.67.1
transformers 4.30.0
typing_extensions 4.12.2
urllib3 2.2.3
wheel 0.44.0
zipp 3.20.2
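The environment above mixes CUDA 11.8 on the host with `+cu111` PyTorch wheels. A small sketch, assuming the `+cuXYZ` suffix in the package list encodes the CUDA version the wheel was built against (as the names above suggest), that extracts this version so it can be compared with the toolkit inside the container. The function name `cuda_version_from_wheel_tag` is hypothetical.

```python
import re
from typing import Optional, Tuple


def cuda_version_from_wheel_tag(version: str) -> Optional[Tuple[int, int]]:
    """Parse a version string such as '1.10.1+cu111' into (major, minor).

    Returns None when the string carries no '+cuXYZ' suffix.
    Assumption: the last digit is the minor version, the rest is the major.
    """
    match = re.search(r"\+cu(\d{2,3})$", version)
    if match is None:
        return None
    digits = match.group(1)
    return int(digits[:-1]), int(digits[-1])


if __name__ == "__main__":
    for pkg_version in ("1.10.1+cu111", "0.11.2+cu111"):
        print(pkg_version, "->", cuda_version_from_wheel_tag(pkg_version))
```

At runtime, `torch.version.cuda` reports the same information directly from the installed build.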
3 Other modifications
Because of the dependency on the main PhoenixOS, I disabled the following code in cpu/proxy/svc.cpp, which is not included under POS_ENABLE,
and I manually disabled compilation of tests and bin/tests in the main Makefile.
I sincerely hope that you can point out any steps I have missed, tell me what additional traceback information I can provide, or share a working configuration, Dockerfile, or Docker image.
Thanks a lot