
[Feature Request]: Add support for autocast bfloat16 for generate on the latest CPUs #10516

@LynxPDA

Description

Is there an existing issue for this?

  • I have searched the existing issues and checked the recent builds/commits

What would your feature do?

Many modern processors have native bfloat16 support, such as AMD Zen 4, Apple M2, Intel Cooper Lake, and Intel Sapphire Rapids.

By using bfloat16 autocast I doubled generation performance.

  • Ryzen 9 7950X (32 GB): speedup from 0.625 it/s to 1.3 it/s
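
To reproduce the effect outside webui, here is a minimal benchmark sketch; the tensor shapes and step count are arbitrary illustrations, and results will vary by CPU:

import contextlib
import time

import torch

def bench(ctx, steps=20):
    # Time a matmul-heavy loop under the given context manager.
    x = torch.randn(64, 512, 512)
    w = torch.randn(512, 512)
    start = time.perf_counter()
    with ctx:
        for _ in range(steps):
            x = torch.tanh(x @ w)
    return time.perf_counter() - start

fp32 = bench(contextlib.nullcontext())
bf16 = bench(torch.autocast(device_type='cpu', dtype=torch.bfloat16, cache_enabled=True))
print(f"fp32: {fp32:.2f}s, bf16 autocast: {bf16:.2f}s")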

Proposed workflow

  1. Change in ./modules/devices.py

Add return torch.autocast(enabled=True, dtype=torch.bfloat16, device_type='cpu', cache_enabled=True) to the autocast function:

import contextlib

import torch


def autocast(disable=False):
    from modules import shared

    if disable:
        # Fixed: the original snippet created the null context but never returned it
        return contextlib.nullcontext()

    # `dtype` is the module-level default dtype defined elsewhere in devices.py
    if dtype == torch.float32 or shared.cmd_opts.precision == "full":
        # On CPU, return a bfloat16 autocast context instead of the previous no-op
        return torch.autocast(enabled=True, dtype=torch.bfloat16, device_type='cpu', cache_enabled=True)  # was: contextlib.nullcontext()

    return torch.autocast("cuda")
  2. Change in ./modules/sd_samplers_common.py

Add if x_sample.dtype == torch.bfloat16: x_sample = x_sample.to(torch.float16) in single_sample_to_image, because NumPy does not support bfloat16 yet:
def single_sample_to_image(sample, approximation=None):
    if approximation is None:
        approximation = approximation_indexes.get(opts.show_progress_type, 0)

    if approximation == 2:
        x_sample = sd_vae_approx.cheap_approximation(sample)
    elif approximation == 1:
        x_sample = sd_vae_approx.model()(sample.to(devices.device, devices.dtype).unsqueeze(0))[0].detach()
    else:
        x_sample = processing.decode_first_stage(shared.sd_model, sample.unsqueeze(0))[0]

    x_sample = torch.clamp((x_sample + 1.0) / 2.0, min=0.0, max=1.0)
    if x_sample.dtype == torch.bfloat16:
        # NumPy has no bfloat16 dtype, so downcast before calling .numpy() below
        x_sample = x_sample.to(torch.float16)
    x_sample = 255. * np.moveaxis(x_sample.cpu().numpy(), 0, 2)
    x_sample = x_sample.astype(np.uint8)
    return Image.fromarray(x_sample)
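
The float16 hop above is needed because .numpy() raises on bfloat16 tensors; a quick check of that behavior (observed on the torch build listed below, worth re-verifying on newer releases):

import torch

t = torch.zeros(2, dtype=torch.bfloat16)
try:
    t.numpy()
except TypeError as e:
    print("bfloat16 -> numpy:", e)  # NumPy has no bfloat16 dtype

print(t.to(torch.float16).numpy().dtype)  # float16 converts cleanly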

Additional information

Other system information:

COMMANDLINE_ARGS="--precision autocast --use-cpu all --no-half --opt-channelslast --skip-torch-cuda-test --enable-insecure-extension-access"

python: 3.10.6  •  torch: 2.1.0.dev20230506+cpu  •  xformers: N/A  •  gradio: 3.28.1  •  commit: 5ab7f213  •  checkpoint: b4391b7978

OS: Ubuntu 22.04

P.S. Since I'm still a beginner programmer, these changes are meant only as a proof of concept.

I was able to test the main functionality in practice; the only generation error I got was with the Stable Diffusion 2.1 model, while everything else worked with the roughly 2x speedup.
