Congratulations on the arXiv submission!
I tried to reproduce the results of this paper on top of Hugging Face Diffusers, based on the reference implementation provided in the preprint.
I ended up implementing it like so:
- Changes to txt2img
- Changes to diffusers
- Some explanation in a tweet.
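For context, the core of my change is in the UNet's cross-attention: as I read the algorithm, each noun phrase is encoded separately with the CLIP text encoder, and those encodings feed the value path while the attention weights still come from the full-prompt keys. Here is a minimal sketch of that reading; it is not the actual diff, and the function signature and variable names are mine:

```python
import torch

def structured_cross_attention(q, k, values):
    """Minimal sketch of my reading of the algorithm (placeholder names, not the
    actual Diffusers code): attention weights are computed once from the
    full-prompt keys, and the value path is run once per text encoding
    (full prompt plus each noun phrase), with the outputs averaged."""
    # q: (tokens_img, d) image queries; k: (77, d) keys from the full-prompt encoding;
    # values: list of (77, d) value matrices, one per text encoding.
    w = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)  # shared attention weights
    return torch.stack([w @ v for v in values]).mean(dim=0)
```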
In my independent implementation, structured diffusion changes the images only slightly, and in the 10 samples × 4 prompts I tried, it never made the generations more relevant to the prompt.
Structured (left) / regular (right), "two blue sheep and a red goat":
I attach the rest of my results:
A red bird and a green apple.zip
A white goat standing next to two black goats.zip
two blue sheep and a red goat.zip
Two ripe spotted bananas are sitting inside a green bowl on a gray counter.zip
Basically, I'm wondering whether:
- this is exactly the kind of difference I should expect to see (in line with the claimed 5–8% advantage)
- or there's a mistake in my reproduction and better results are possible
Could you possibly read my attention.py and see if it looks like a reasonable interpretation of your algorithm? I changed it substantially to make it do more work in parallel (see the sketch below); I think it should be equivalent, but did I miss something important?
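To be concrete about the parallelisation I mean: since the attention weights are computed only from the full-prompt keys, attention is linear in the values, so averaging the per-phrase outputs should equal a single pass over the averaged value matrices. A self-contained check of that equivalence (generic single-head attention with made-up shapes, not the code from attention.py):

```python
import torch

def attn(q, k, v):
    # Plain single-head cross-attention: weights from q/k, applied to v.
    w = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)
    return w @ v

torch.manual_seed(0)
d, n_img, n_txt, n_phrases = 64, 16, 77, 3
q = torch.randn(n_img, d)                               # image-patch queries
k = torch.randn(n_txt, d)                               # keys from the full-prompt encoding
vs = [torch.randn(n_txt, d) for _ in range(n_phrases)]  # one value matrix per noun-phrase encoding

looped = torch.stack([attn(q, k, v) for v in vs]).mean(dim=0)  # per-phrase passes, averaged
fused = attn(q, k, torch.stack(vs).mean(dim=0))                # single pass on averaged values

print(torch.allclose(looped, fused, atol=1e-5))  # True: the two formulations agree
```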
Thanks in advance for any attention you can give this!