Description
When running inference with the released VideoLLaMA2.1-7B-AV checkpoint on combined audio and video input, the model returns meaningless output consisting only of "!" characters instead of a natural-language response.
Reproduction Steps
Minimal script used (based on repo instructions):
```python
import sys
import os

sys.path.append('./')

from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init
from huggingface_hub import snapshot_download


def inference():
    disable_torch_init()

    # --------- Load model weights ----------
    model_path = "./checkpoints/VideoLLaMA2.1-7B-AV"
    if not os.path.exists(model_path):
        print("Downloading model weights …")
        snapshot_download("DAMO-NLP-SG/VideoLLaMA2.1-7B-AV", local_dir=model_path)

    # --------- Initialize model ---------
    model, processor, tokenizer = model_init(model_path)

    # --------- Input video (with audio track) ---------
    video_path = "test_audio_latest3.mp4"
    preprocess = processor["video"]
    # Audio + Video
    av_tensor = preprocess(video_path, va=True)

    # --------- Question ---------
    question = "Describe what is happening in the video and also analyze the sound (voices, background, or emotions)."

    # --------- Inference ---------
    output = mm_infer(
        av_tensor,
        question,
        model=model,
        tokenizer=tokenizer,
        modal="video",  # video modal with va=True for AV fusion
        do_sample=False,
    )

    print("\n=== Model Answer ===")
    print(output)


if __name__ == "__main__":
    inference()
```
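To rule out a broken or missing audio track on the input side, the clip can be probed before inference. A minimal sketch, assuming ffprobe is installed and on PATH; `test_audio_latest3.mp4` is the same file as above:

```python
import subprocess

# Sanity check (assumes ffprobe is on PATH): confirm the test clip actually
# contains an audio stream, and print its codec, sample rate, and channels.
result = subprocess.run(
    ["ffprobe", "-v", "error", "-select_streams", "a",
     "-show_entries", "stream=codec_name,sample_rate,channels",
     "-of", "default=noprint_wrappers=1", "test_audio_latest3.mp4"],
    capture_output=True, text=True,
)
print(result.stdout.strip() or "no audio stream found")
```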
Expected Behavior
The model should return a descriptive natural language answer that integrates both the video and audio information.
Actual Behavior
The output is only:

```
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
```
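For context: long runs of "!" typically mean generation keeps emitting a single low token id (in several BPE vocabularies, id 0 decodes to "!"), which usually points to NaN or garbage logits, e.g. from fp16 overflow or partially loaded weights. Below is a quick probe of that hypothesis; it is not part of the repo, just a minimal sketch assuming `model` and `tokenizer` from `model_init` behave like standard Hugging Face objects:

```python
import torch

# Hypothetical probe (not from the repo): run one text-only forward pass
# and check whether the logits are already NaN before blaming the AV fusion.
ids = tokenizer("Hello", return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    logits = model(input_ids=ids).logits
top = logits[0, -1].argmax().item()
print("NaN in logits:", torch.isnan(logits).any().item())
print("greedy next token:", top, repr(tokenizer.decode([top])))
```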
Questions for the authors:

1. Do the released VideoLLaMA2.1-7B-AV checkpoints fully support audio-video inference?
2. Is any special preprocessing needed for the audio track (e.g., specific codecs or formats)?
3. Should the modal="video" argument be replaced with "av" for correct fusion?
4. Could the "!!!!" output be a sign that the audio alignment weights were not included in the release? (A way to check this locally is sketched below.)
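Regarding question 4, the downloaded snapshot can be inspected directly to see whether any audio-side tensors shipped at all. A minimal sketch, assuming the release uses a standard sharded-safetensors index file (adjust the filename if the checkpoint ships `pytorch_model.bin.index.json` instead):

```python
import json
import os

# Hypothetical check: list tensor names in the snapshot's weight index that
# look audio-related. Assumes a standard HF sharded-safetensors layout.
model_path = "./checkpoints/VideoLLaMA2.1-7B-AV"
with open(os.path.join(model_path, "model.safetensors.index.json")) as f:
    weight_map = json.load(f)["weight_map"]

audio_keys = sorted(k for k in weight_map if "audio" in k.lower())
print(f"{len(audio_keys)} audio-related tensors in the index")
print("\n".join(audio_keys[:10]))
```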