Description
Is there an existing issue for this?
- I have searched the existing issues and checked the recent builds/commits
What would your feature do?
Many modern processors have native bfloat16 support, such as AMD Zen 4, Apple M2, Intel Cooper Lake, and Intel Sapphire Rapids.
By using bfloat16 autocast I roughly doubled CPU generation performance (a standalone microbenchmark sketch follows the numbers below).
- Ryzen 9 7950X (32 GB): speedup from 0.625 it/s to 1.3 it/s
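For reference, here is a minimal, self-contained sketch of the kind of comparison behind those numbers, using only stock PyTorch; the matrix size and iteration count are arbitrary illustration choices, not values from the actual webui run:

```python
import time
import torch

# Illustrative microbenchmark: time a matmul with and without CPU
# bfloat16 autocast. Sizes/iterations are arbitrary, for demonstration.
a = torch.randn(2048, 2048)
b = torch.randn(2048, 2048)

def bench(use_bf16, iters=20):
    start = time.perf_counter()
    for _ in range(iters):
        if use_bf16:
            with torch.autocast(device_type='cpu', dtype=torch.bfloat16):
                _ = a @ b
        else:
            _ = a @ b
    return time.perf_counter() - start

print(f"fp32: {bench(False):.3f}s")
print(f"bf16: {bench(True):.3f}s")
```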
Proposed workflow
- Change in ./modules/devices.py
Add return torch.autocast(enabled=True, dtype=torch.bfloat16, device_type='cpu', cache_enabled=True) to the autocast function.
def autocast(disable=False):
    from modules import shared

    if disable:
        return contextlib.nullcontext()

    if dtype == torch.float32 or shared.cmd_opts.precision == "full":
        return torch.autocast(enabled=True, dtype=torch.bfloat16, device_type='cpu', cache_enabled=True)  # was: return contextlib.nullcontext()

    return torch.autocast("cuda")
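As written, the proof of concept enables CPU bfloat16 autocast whenever the float32/full-precision branch is hit. A minimal sketch of a more defensive variant is below; it assumes the module-level `device` variable in modules/devices.py holds the active torch.device, and only switches to bfloat16 when that device is actually the CPU:

```python
def autocast(disable=False):
    from modules import shared

    if disable:
        return contextlib.nullcontext()

    if dtype == torch.float32 or shared.cmd_opts.precision == "full":
        # Assumption: `device` is the module-level torch.device in devices.py.
        # Only use CPU bfloat16 autocast when actually running on the CPU;
        # otherwise keep the original full-precision behavior.
        if device.type == 'cpu':
            return torch.autocast(enabled=True, dtype=torch.bfloat16, device_type='cpu', cache_enabled=True)
        return contextlib.nullcontext()

    return torch.autocast("cuda")
```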
- Change in ./modules/sd_samplers_common.py
Add if x_sample.dtype == torch.bfloat16: x_sample = x_sample.to(torch.float16) in single_sample_to_image, because NumPy doesn't support bfloat16 yet (a short demonstration follows the function below).
def single_sample_to_image(sample, approximation=None):
    if approximation is None:
        approximation = approximation_indexes.get(opts.show_progress_type, 0)

    if approximation == 2:
        x_sample = sd_vae_approx.cheap_approximation(sample)
    elif approximation == 1:
        x_sample = sd_vae_approx.model()(sample.to(devices.device, devices.dtype).unsqueeze(0))[0].detach()
    else:
        x_sample = processing.decode_first_stage(shared.sd_model, sample.unsqueeze(0))[0]

    x_sample = torch.clamp((x_sample + 1.0) / 2.0, min=0.0, max=1.0)

    if x_sample.dtype == torch.bfloat16:
        x_sample = x_sample.to(torch.float16)  # NumPy cannot represent bfloat16

    x_sample = 255. * np.moveaxis(x_sample.cpu().numpy(), 0, 2)
    x_sample = x_sample.astype(np.uint8)
    return Image.fromarray(x_sample)
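To see why the cast is needed, the short snippet below demonstrates the failure mode; the exact error message may vary across PyTorch versions, but bfloat16 tensors cannot be converted to NumPy arrays directly:

```python
import torch

t = torch.tensor([0.5], dtype=torch.bfloat16)

try:
    t.numpy()  # fails: NumPy has no bfloat16 dtype
except TypeError as e:
    print(f"bfloat16 -> numpy failed: {e}")

print(t.to(torch.float16).numpy())  # works after casting to float16
```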
Additional information
Other system information:
COMMANDLINE_ARGS="--precision autocast --use-cpu all --no-half --opt-channelslast --skip-torch-cuda-test --enable-insecure-extension-access"
python: 3.10.6 • torch: 2.1.0.dev20230506+cpu • xformers: N/A • gradio: 3.28.1 • commit: 5ab7f213 • checkpoint: b4391b7978
OS: Ubuntu 22.04
P.S. Since I'm still a beginner programmer, these changes are meant only as a proof of concept.
I was able to verify the main functionality in practice; the only generation error was with the Stable Diffusion 2.1 model, and everything else worked at roughly a 2x speedup.