Congratulations on the arXiv submission!
I tried to reproduce the results of this paper on top of Hugging Face Diffusers, based on the reference implementation provided in the preprint.
I ended up implementing it like so:
- Changes to txt2img
- Changes to diffusers
- Some explanation in a tweet.
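For context, the core of my change is in the UNet's cross-attention: as I read the algorithm, each noun phrase is encoded separately with the CLIP text encoder, and those encodings feed the value path while the attention weights still come from the full-prompt keys. Here is a minimal sketch of that reading; it is not the actual diff, and the function signature and variable names are mine:

```python
import torch

def structured_cross_attention(q, k, values):
    """Minimal sketch of my reading of the algorithm (placeholder names, not the
    actual Diffusers code): attention weights are computed once from the
    full-prompt keys, and the value path is run once per text encoding
    (full prompt plus each noun phrase), with the outputs averaged."""
    # q: (tokens_img, d) image queries; k: (77, d) keys from the full-prompt encoding;
    # values: list of (77, d) value matrices, one per text encoding.
    w = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)  # shared attention weights
    return torch.stack([w @ v for v in values]).mean(dim=0)
```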
In my independent implementation, structured diffusion changes the images only slightly, and in the 10 samples × 4 prompts I tried, it never made the generations more relevant to the prompt.
Structured (left) / regular (right), "two blue sheep and a red goat":
I attach the rest of my results:
A red bird and a green apple.zip
A white goat standing next to two black goats.zip
two blue sheep and a red goat.zip
Two ripe spotted bananas are sitting inside a green bowl on a gray counter.zip
Basically, I'm wondering whether:
- this is exactly the kind of difference I should expect to see (in line with the claimed 5–8% advantage)
- or there's a mistake in my reproduction and better results are possible
Could you possibly read my attention.py and see if it looks like a reasonable interpretation of your algorithm? I changed it substantially to make it do more work in parallel (see the sketch below); I think it should be equivalent, but did I miss something important?
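To be concrete about the parallelisation I mean: since the attention weights are computed only from the full-prompt keys, attention is linear in the values, so averaging the per-phrase outputs should equal a single pass over the averaged value matrices. A self-contained check of that equivalence (generic single-head attention with made-up shapes, not the code from attention.py):

```python
import torch

def attn(q, k, v):
    # Plain single-head cross-attention: weights from q/k, applied to v.
    w = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)
    return w @ v

torch.manual_seed(0)
d, n_img, n_txt, n_phrases = 64, 16, 77, 3
q = torch.randn(n_img, d)                               # image-patch queries
k = torch.randn(n_txt, d)                               # keys from the full-prompt encoding
vs = [torch.randn(n_txt, d) for _ in range(n_phrases)]  # one value matrix per noun-phrase encoding

looped = torch.stack([attn(q, k, v) for v in vs]).mean(dim=0)  # per-phrase passes, averaged
fused = attn(q, k, torch.stack(vs).mean(dim=0))                # single pass on averaged values

print(torch.allclose(looped, fused, atol=1e-5))  # True: the two formulations agree
```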
Thanks in advance for any attention you can give this!