🐛 Describe the bug
I am not able to train gpt2-large with ILQL with `max_length=1024` on 4×A40 GPUs and ~900 GB of RAM because of a CUDA OOM error.
Accelerate env
- `Accelerate` version: 0.15.0
- Platform: Linux-5.13.0-40-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- Numpy version: 1.23.5
- PyTorch version (GPU?): 1.12.1+cu113 (True)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: DEEPSPEED
- mixed_precision: fp16
- use_cpu: False
- dynamo_backend: NO
- num_processes: 4
- machine_rank: 0
- num_machines: 1
- gpu_ids: None
- main_process_ip: None
- main_process_port: None
- rdzv_backend: static
- same_network: True
- main_training_function: main
- deepspeed_config: {'gradient_accumulation_steps': 1, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'cpu', 'zero3_init_flag': True, 'zero_stage': 2}
- fsdp_config: {}
- megatron_lm_config: {}
- downcast_bf16: no
- tpu_name: None
- tpu_zone: None
- command_file: None
- commands: None
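For context on why the OOM is surprising, here is a rough per-GPU memory estimate for the model weights and gradients under the ZeRO stage-2 config above (a sketch: it assumes gpt2-large's published size of ~774M parameters, fp16 training, and ignores activations, the extra ILQL heads, and framework buffers, which are likely the real source of the OOM at `max_length=1024`):

```python
# Rough per-GPU memory estimate for gpt2-large under ZeRO-2 with CPU
# optimizer offload. Assumes ~774M parameters (published gpt2-large size);
# activations and ILQL-specific buffers are NOT counted here.
params = 774_000_000
fp16_bytes = 2
num_gpus = 4

# ZeRO-2 replicates fp16 weights on every GPU but shards the gradients;
# optimizer states are offloaded to CPU per the config above.
weights_gib = params * fp16_bytes / 2**30
sharded_grads_gib = params * fp16_bytes / num_gpus / 2**30

print(f"fp16 weights per GPU:   {weights_gib:.2f} GiB")   # ~1.44 GiB
print(f"sharded grads per GPU:  {sharded_grads_gib:.2f} GiB")  # ~0.36 GiB
```

Since weights plus sharded gradients come to well under 2 GiB per GPU, the remaining usage almost certainly comes from activations at sequence length 1024 and any extra model copies ILQL keeps.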
How to reproduce
You can use my fork, which contains a small change: https://github.com/AlekseyKorshuk/trlx/tree/ilql-dalio

```shell
accelerate launch examples/dalio/ilql_dalio.py
```
Which trlX version are you using?
trlx==0.3.0