
[Feature Request]: Add support for autocast bfloat16 for generate on the latest CPUs #10516

@LynxPDA

Description

Is there an existing issue for this?

  • I have searched the existing issues and checked the recent builds/commits

What would your feature do?

Many modern processors have native bfloat16 support, such as AMD Zen 4, Apple M2, Intel Cooper Lake, and Intel Sapphire Rapids.

By using bfloat16 autocast I doubled generation performance.

  • Ryzen 9 7950X (32 GB): speedup from 0.625 it/s to 1.3 it/s
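
To reproduce the effect outside webui, here is a minimal benchmark sketch; the tensor shapes and step count are arbitrary illustrations, and results will vary by CPU:

import contextlib
import time

import torch

def bench(ctx, steps=20):
    # Time a matmul-heavy loop under the given context manager.
    x = torch.randn(64, 512, 512)
    w = torch.randn(512, 512)
    start = time.perf_counter()
    with ctx:
        for _ in range(steps):
            x = torch.tanh(x @ w)
    return time.perf_counter() - start

fp32 = bench(contextlib.nullcontext())
bf16 = bench(torch.autocast(device_type='cpu', dtype=torch.bfloat16, cache_enabled=True))
print(f"fp32: {fp32:.2f}s, bf16 autocast: {bf16:.2f}s")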

Proposed workflow

  1. Change in ./modules/devices.py

Add return torch.autocast(enabled=True, dtype=torch.bfloat16, device_type='cpu', cache_enabled=True) to the autocast function:

import contextlib

import torch


def autocast(disable=False):
    from modules import shared

    if disable:
        # Fixed: the original snippet created the null context but never returned it
        return contextlib.nullcontext()

    # `dtype` is the module-level default dtype defined elsewhere in devices.py
    if dtype == torch.float32 or shared.cmd_opts.precision == "full":
        # On CPU, return a bfloat16 autocast context instead of the previous no-op
        return torch.autocast(enabled=True, dtype=torch.bfloat16, device_type='cpu', cache_enabled=True)  # was: contextlib.nullcontext()

    return torch.autocast("cuda")
  2. Change in ./modules/sd_samplers_common.py

Add if x_sample.dtype == torch.bfloat16: x_sample = x_sample.to(torch.float16) in single_sample_to_image, because NumPy does not support bfloat16 yet:
def single_sample_to_image(sample, approximation=None):
    if approximation is None:
        approximation = approximation_indexes.get(opts.show_progress_type, 0)

    if approximation == 2:
        x_sample = sd_vae_approx.cheap_approximation(sample)
    elif approximation == 1:
        x_sample = sd_vae_approx.model()(sample.to(devices.device, devices.dtype).unsqueeze(0))[0].detach()
    else:
        x_sample = processing.decode_first_stage(shared.sd_model, sample.unsqueeze(0))[0]

    x_sample = torch.clamp((x_sample + 1.0) / 2.0, min=0.0, max=1.0)
    if x_sample.dtype == torch.bfloat16:
        # NumPy has no bfloat16 dtype, so downcast before calling .numpy() below
        x_sample = x_sample.to(torch.float16)
    x_sample = 255. * np.moveaxis(x_sample.cpu().numpy(), 0, 2)
    x_sample = x_sample.astype(np.uint8)
    return Image.fromarray(x_sample)
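
The float16 hop above is needed because .numpy() raises on bfloat16 tensors; a quick check of that behavior (observed on the torch build listed below, worth re-verifying on newer releases):

import torch

t = torch.zeros(2, dtype=torch.bfloat16)
try:
    t.numpy()
except TypeError as e:
    print("bfloat16 -> numpy:", e)  # NumPy has no bfloat16 dtype

print(t.to(torch.float16).numpy().dtype)  # float16 converts cleanly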

Additional information

Other system information:

COMMANDLINE_ARGS="--precision autocast --use-cpu all --no-half --opt-channelslast --skip-torch-cuda-test --enable-insecure-extension-access"

python: 3.10.6  •  torch: 2.1.0.dev20230506+cpu  •  xformers: N/A  •  gradio: 3.28.1  •  commit: 5ab7f213  •  checkpoint: b4391b7978

OS: Ubuntu 22.04

P.S. Since I'm still a beginner programmer, these changes are meant only as a proof of concept.

I was able to test the main functionality in practice; the only generation error I got was with the Stable Diffusion 2.1 model, while everything else worked with the roughly 2x speedup.
