Audio–Video inference returns only "!!!!!!!" #176

@rahulnuvowork-cpu

Description

When trying to run inference with the released VideoLLaMA2.1-7B-AV checkpoint using both audio and video input, the model returns meaningless outputs consisting only of "!" characters, instead of a natural language response.

Reproduction Steps
Minimal script used (based on repo instructions):

///////////////////////////////////////// Inference script /////////////////////////////////////////////////////////////////////////////////////////////////////////////////
import sys
import os
sys.path.append('./')

from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init
from huggingface_hub import snapshot_download

def inference():
    disable_torch_init()

    # --------- Load model weights ----------
    # Download the checkpoint from the Hugging Face Hub if it is not cached locally.
    model_path = "./checkpoints/VideoLLaMA2.1-7B-AV"
    if not os.path.exists(model_path):
        print("Downloading model weights …")
        snapshot_download("DAMO-NLP-SG/VideoLLaMA2.1-7B-AV", local_dir=model_path)

    # --------- Initialize model ---------
    model, processor, tokenizer = model_init(model_path)

    # --------- Input video (with audio track) ---------
    video_path = "test_audio_latest3.mp4"
    preprocess = processor["video"]

    # Audio + Video: va=True asks the video processor to also load the audio track
    av_tensor = preprocess(video_path, va=True)

    # --------- Question ---------
    question = "Describe what is happening in the video and also analyze the sound (voices, background, or emotions)."

    # --------- Inference ---------
    output = mm_infer(
        av_tensor,
        question,
        model=model,
        tokenizer=tokenizer,
        modal="video",   # video modal with va=True for AV fusion
        do_sample=False,
    )

    print("\n=== Model Answer ===")
    print(output)


if __name__ == "__main__":
    inference()
///////////////////////////////////////////////////////////////////////// Inference Code Ends //////////////////////////////////////////////////////////////////////////

Expected Behavior
The model should return a descriptive natural language answer that integrates both the video and audio information.

Actual Behavior
The output is only:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
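
One input-side cause worth ruling out is a test file that has no usable audio stream. The sketch below is not part of the repo; it assumes `ffprobe` (from FFmpeg) is installed and on `PATH`, and the helper names are hypothetical. It lists the audio streams ffprobe reports for a file:

```python
import json
import subprocess


def audio_streams_from_ffprobe_json(ffprobe_json: str) -> list:
    """Return the audio stream entries found in ffprobe's JSON output."""
    info = json.loads(ffprobe_json)
    return [s for s in info.get("streams", []) if s.get("codec_type") == "audio"]


def probe_audio_streams(video_path: str) -> list:
    """Run ffprobe (assumed to be on PATH) and return the file's audio streams."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-show_streams", "-of", "json", video_path],
        capture_output=True,
        text=True,
        check=True,  # raise if ffprobe cannot read the file
    )
    return audio_streams_from_ffprobe_json(result.stdout)


# Example usage (requires ffprobe and the test file):
#   for stream in probe_audio_streams("test_audio_latest3.mp4"):
#       print(stream.get("codec_name"), stream.get("sample_rate"), stream.get("channels"))
```

An empty list would mean the file has no audio track at all. A non-empty list only shows that a track exists, not that the model's audio preprocessing handles its codec or sample rate.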

Questions for the authors:

Do the released VideoLLaMA2.1-7B-AV checkpoints fully support audio–video inference?

Is there any special preprocessing needed for the audio track (e.g., specific codecs or formats)?

Should the modal="video" argument be replaced with "av" for correct fusion?

Could the "!!!!" outputs be a sign that audio alignment weights were not included in the release?
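
On the last question, a rough way to check whether the download contains audio-related weights is to inspect the checkpoint's parameter names. This sketch assumes the checkpoint is stored as sharded safetensors with a `model.safetensors.index.json` file, and that audio parameters have "audio" in their names. Neither assumption has been verified against this release, and `find_weight_keys` is a hypothetical helper:

```python
import json
from pathlib import Path


def find_weight_keys(checkpoint_dir: str, needle: str = "audio") -> list:
    """List parameter names in a sharded safetensors index that contain `needle`."""
    index_path = Path(checkpoint_dir) / "model.safetensors.index.json"
    if not index_path.exists():
        raise FileNotFoundError(f"No sharded index found at {index_path}")
    # The index maps each parameter name to the shard file that stores it.
    weight_map = json.loads(index_path.read_text())["weight_map"]
    return sorted(name for name in weight_map if needle in name.lower())


# Example usage (path taken from the script above):
#   keys = find_weight_keys("./checkpoints/VideoLLaMA2.1-7B-AV")
#   print(len(keys), keys[:5])
```

An empty result would not prove that the weights are missing, since the naming assumption may be wrong, but a non-empty one would at least show that audio parameters were shipped.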
