Description
When running inference with the released VideoLLaMA2.1-7B-AV checkpoint on combined audio and video input, the model returns meaningless output consisting only of "!" characters instead of a natural-language response.
Reproduction Steps
Minimal script used (based on repo instructions):
```python
import sys
import os

sys.path.append('./')

from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init
from huggingface_hub import snapshot_download


def inference():
    disable_torch_init()

    # --------- Load model weights ----------
    model_path = "./checkpoints/VideoLLaMA2.1-7B-AV"
    if not os.path.exists(model_path):
        print("Downloading model weights …")
        snapshot_download("DAMO-NLP-SG/VideoLLaMA2.1-7B-AV", local_dir=model_path)

    # --------- Initialize model ---------
    model, processor, tokenizer = model_init(model_path)

    # --------- Input video (with audio track) ---------
    video_path = "test_audio_latest3.mp4"
    preprocess = processor["video"]
    # Audio + Video
    av_tensor = preprocess(video_path, va=True)

    # --------- Question ---------
    question = "Describe what is happening in the video and also analyze the sound (voices, background, or emotions)."

    # --------- Inference ---------
    output = mm_infer(
        av_tensor,
        question,
        model=model,
        tokenizer=tokenizer,
        modal="video",  # video modal with va=True for AV fusion
        do_sample=False,
    )

    print("\n=== Model Answer ===")
    print(output)


if __name__ == "__main__":
    inference()
```
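To rule out a broken or missing audio track on the input side, the clip can be probed before inference. A minimal sketch, assuming ffprobe is installed and on PATH; `test_audio_latest3.mp4` is the same file as above:

```python
import subprocess

# Sanity check (assumes ffprobe is on PATH): confirm the test clip actually
# contains an audio stream, and print its codec, sample rate, and channels.
result = subprocess.run(
    ["ffprobe", "-v", "error", "-select_streams", "a",
     "-show_entries", "stream=codec_name,sample_rate,channels",
     "-of", "default=noprint_wrappers=1", "test_audio_latest3.mp4"],
    capture_output=True, text=True,
)
print(result.stdout.strip() or "no audio stream found")
```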
Expected Behavior
The model should return a descriptive natural language answer that integrates both the video and audio information.
Actual Behavior
The output is only:

```
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
```
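For context: long runs of "!" typically mean generation keeps emitting a single low token id (in several BPE vocabularies, id 0 decodes to "!"), which usually points to NaN or garbage logits, e.g. from fp16 overflow or partially loaded weights. Below is a quick probe of that hypothesis; it is not part of the repo, just a minimal sketch assuming `model` and `tokenizer` from `model_init` behave like standard Hugging Face objects:

```python
import torch

# Hypothetical probe (not from the repo): run one text-only forward pass
# and check whether the logits are already NaN before blaming the AV fusion.
ids = tokenizer("Hello", return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    logits = model(input_ids=ids).logits
top = logits[0, -1].argmax().item()
print("NaN in logits:", torch.isnan(logits).any().item())
print("greedy next token:", top, repr(tokenizer.decode([top])))
```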
Questions for the authors:

1. Do the released VideoLLaMA2.1-7B-AV checkpoints fully support audio-video inference?
2. Is any special preprocessing needed for the audio track (e.g., specific codecs or formats)?
3. Should the modal="video" argument be replaced with "av" for correct fusion?
4. Could the "!!!!" output be a sign that the audio alignment weights were not included in the release? (A way to check this locally is sketched below.)
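Regarding question 4, the downloaded snapshot can be inspected directly to see whether any audio-side tensors shipped at all. A minimal sketch, assuming the release uses a standard sharded-safetensors index file (adjust the filename if the checkpoint ships `pytorch_model.bin.index.json` instead):

```python
import json
import os

# Hypothetical check: list tensor names in the snapshot's weight index that
# look audio-related. Assumes a standard HF sharded-safetensors layout.
model_path = "./checkpoints/VideoLLaMA2.1-7B-AV"
with open(os.path.join(model_path, "model.safetensors.index.json")) as f:
    weight_map = json.load(f)["weight_map"]

audio_keys = sorted(k for k in weight_map if "audio" in k.lower())
print(f"{len(audio_keys)} audio-related tensors in the index")
print("\n".join(audio_keys[:10]))
```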