Jingli Lin1,2*, Runsen Xu1,3*†, Shaohao Zhu1,4, Sihan Yang1, Peizhou Cao1,5, Yunlong Ran1,4, Miao Hu6, Chenming Zhu1,7, Yiman Xie1,4, Yilin Long1,8, Wenbo Hu1,9, Dahua Lin1,3, Tai Wang1✉, Jiangmiao Pang1✉
1. Shanghai AI Laboratory 2. Shanghai Jiaotong University 3. The Chinese University of Hong Kong 4. Zhejiang University 5. Beihang University 6. Xi'an Jiaotong University 7. University of Hong Kong 8. Fudan University 9. University of California, Los Angeles
*Equal Contribution †Project Lead
🔥[2025-12]: Our MMSI-Video-Bench has been integrated into VLMEvalKit.
🔥[2025-12]: We released our paper, benchmark, and evaluation codes.
MMSI-Video-Bench is a fully human-annotated benchmark for evaluating video-based spatial intelligence in MLLMs. It implements a four-level framework (Perception, Planning, Prediction, and Cross-Video Reasoning) through 1,106 questions grounded in 1,278 video clips sourced from 25 public datasets as well as in-house recordings.
- **High Quality.** All data are manually annotated by 11 domain experts in 3D vision, following a rigorous review and acceptance process to ensure annotation accuracy and reliability.
- **Challenging.** We evaluate 25 strong open-source and proprietary MLLMs, revealing a striking human–AI gap: even the best reasoning model trails human performance by nearly 60%.
- **Diverse Video Sources.** MMSI-Video-Bench includes videos from 25 public datasets and 1 in-house collection, spanning tabletop recordings, indoor and multi-floor environments, outdoor scenes, sports, and movie footage.
- **Comprehensive and Holistic Task Coverage.** The benchmark covers spatial layout reasoning, motion understanding, decision-making, and cross-video reasoning, providing a more holistic evaluation of video-based spatial intelligence.
The questions in MMSI-Video-Bench span 5 major categories:
1. Spatial Construction: spatial attributes of instances and scenes, as well as spatial relationships among instances, scenes, and cameras (six subtypes in total).
2. Motion Understanding: camera motion, instance motion, and interactive motion between instances.
3. Planning: decision-making based on spatiotemporal video information.
4. Prediction: the ability to predict, anticipate, or imagine future states based on the observed video.
5. Cross-Video Reasoning: memory update across temporally separated video segments and multi-view integration across videos captured from different viewpoints.
All of our data is available on Hugging Face and includes the following components:
🎥 Video Data (videos.zip): Contains the video clip file (.mp4) corresponding to each sample. This file is generally not required for most models.
🎥 Frame Data (frames.zip): Contains the frames (.jpg) extracted from each sample's video at the base sampling rate. This rate ensures no key information loss during sampling. Each frame file is named using the format {timestamp}_frame_{base_interval}_{image_id} (e.g., 00:06.00_frame_1.50_4), where the timestamp, also shown on the top-left corner of the frame, indicates its capture time in the original recording.
🖼️ Reference Image Data (ref_images.zip): Contains the auxiliary images referenced in the questions for each sample.
📝 Text Annotation (mmsivideo.json): This file contains the annotation information for MMSI-Video-Bench. All time references in the questions correspond to the capture time in the original recording and align with the timestamp flag on each frame. Key fields include:
{
    "ref_images": [Paths to auxiliary images referenced in the question, ...],
    "video_list": [
        {
            "path": Video clip file path,
            "start": Timestamp (in seconds) of the first frame of the video clip in the original recording,
            "end": Timestamp (in seconds) of the last frame of the video clip in the original recording,
            "base_fps": Base sampling rate
        },
        ...
    ],
    "frames_list": [[Paths to frames sampled at the base sampling rate, ...], ...],
    "system_prompt": "...",
    "task_prompt": Task-specific prompt,
    "user_prompt": Question text, with <video> as a placeholder for video and <image> for auxiliary images,
    "format_prompt": Output format requirements,
    "ground_truth": Correct answer
}
Unless otherwise specified, the model input generally consists of:
system_prompt + task_prompt + user_prompt + format_prompt.
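For reference, here is a minimal sketch of how a single sample could be loaded and its text prompt assembled from these fields. The `data/` root path, the assumption that `mmsivideo.json` is a list of sample dicts, the newline separator between prompt parts, and the helper names are illustrative assumptions, not the loader used in our codebase.

```python
import json
import re

# Illustrative path; adjust to your local layout (see Data Preparation below).
ANNOTATION_PATH = "data/mmsivideo.json"

def assemble_prompt(sample: dict) -> str:
    """Join the prompt fields in the order described above.
    The newline separator is an assumption for readability."""
    return "\n".join(
        sample[key]
        for key in ("system_prompt", "task_prompt", "user_prompt", "format_prompt")
    )

def parse_frame_name(path: str):
    """Split a frame name of the form {timestamp}_frame_{base_interval}_{image_id},
    e.g. 00:06.00_frame_1.50_4 -> ("00:06.00", 1.5, 4)."""
    stem = path.rsplit("/", 1)[-1].removesuffix(".jpg")
    m = re.match(r"(?P<ts>.+)_frame_(?P<interval>[\d.]+)_(?P<idx>\d+)$", stem)
    return m["ts"], float(m["interval"]), int(m["idx"])

with open(ANNOTATION_PATH) as f:
    samples = json.load(f)  # assumed to be a list of sample dicts

sample = samples[0]
text_input = assemble_prompt(sample)
# Base-rate frames of the first video clip in this sample.
for frame_path in sample["frames_list"][0]:
    timestamp, base_interval, image_id = parse_frame_name(frame_path)
```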
As outlined in our paper, we support two evaluation settings: Sufficient-Coverage ensures lossless sampling by preserving all essential information, while Uniform-50 uniformly samples 50 frames and cannot guarantee full information retention. Although most models only support the Uniform-50 setting due to API constraints or GPU memory limits, we still strongly recommend evaluating under the Sufficient-Coverage setting whenever possible.
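To make the Uniform-50 setting concrete, the snippet below is a minimal sketch of evenly spaced frame selection over the base-rate frame list; the exact selection logic in our evaluation code (including how frames from multiple clips are combined) may differ.

```python
def uniform_sample(frame_paths: list[str], num_frames: int = 50) -> list[str]:
    """Select `num_frames` evenly spaced frames from a base-rate frame list.
    Clips with fewer frames than requested are kept in full."""
    if len(frame_paths) <= num_frames:
        return list(frame_paths)
    step = (len(frame_paths) - 1) / (num_frames - 1)
    return [frame_paths[round(i * step)] for i in range(num_frames)]

# Hypothetical usage: concatenate the base-rate frames of all clips in a sample,
# then subsample to 50 frames for models with tight input limits.
# all_frames = [p for clip in sample["frames_list"] for p in clip]
# frames_for_model = uniform_sample(all_frames, 50)
```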
- Data Preparation: After downloading the data, unzip the files and organize the directory as follows:

  data/
  ├── mmsivideo.json
  ├── frames/
  ├── ref_images/
  └── videos/

- Install Required Packages: If you plan to evaluate open-source models, please set up their corresponding environments.

  pip install -r requirements.txt

- Run Inference & Evaluation:
  - For proprietary models: update the `base_url` and `api_key` in `utils/openai_api.py` with your own credentials.
  - For open-source models: modify the `load_model` function in `inference.py` to point to the corresponding model path.

  Run the following commands to perform inference for a specific model under a given setting and evaluate the results on a specific benchmark (Main, Robot Bench, Indoor Scene Perception Bench, or Grounding Bench):

  python inference.py --model_name {model_name} --setting Uniform-50/Sufficient-Coverage   # inference
  python evaluation.py --eval_dir {path/to/results} --bench main/robot_bench/ground_bench/indoor_perception_bench/easy2hard_bench   # evaluation
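For proprietary models, the sketch below illustrates the kind of OpenAI-compatible client that these credentials are plugged into; it is not the actual contents of utils/openai_api.py, and the endpoint, key, and model name are placeholders.

```python
from openai import OpenAI

# Placeholders: substitute the base_url and api_key you configured in utils/openai_api.py.
client = OpenAI(base_url="https://api.openai.com/v1", api_key="sk-...")

def query(prompt: str, frame_urls: list[str], model: str = "gpt-4o") -> str:
    """Send the assembled text prompt plus sampled frames to a chat-completions endpoint."""
    content = [{"type": "text", "text": prompt}]
    content += [{"type": "image_url", "image_url": {"url": u}} for u in frame_urls]
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content
```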
Our MMSI-Video-Bench has been integrated into VLMEvalKit. As VLMEvalKit currently does not support mixed video and image inputs, we have adapted a version that accepts only image inputs. Follow the QuickStart guide of VLMEvalKit and place the required data under the following structure:
LMUDATA/
├── MMSIVideo_SC.tsv
├── MMSIVideo_U50.tsv
├── images/
│ └── MMSIVideo/
│       ├── frames/
│       └── ref_images/
When evaluating proprietary models, set img_detail = low; when evaluating QwenVL-series models, set max_pixels = 360*420. Run the evaluation with the following command:
python run.py --model Qwen2.5-VL-32B-Instruct --data MMSIVideo_SC/MMSIVideo_U50

Note: The results reported in our paper were generated through the Native Codebase Evaluation. While actual evaluation results may differ slightly from the paper, variations remain within an acceptable range. Potential contributing factors include:
(1) Variations in option ordering, runtime environments, or random seeds, as well as differences in inference configurations between VLMEvalKit and the Native Codebase Evaluation, may lead to fluctuations of 1–2% in the overall score.
(2) For certain models—such as the QwenVL series and the LLaVA-Video series—the input configuration in VLMEvalKit (image-only) differs from that in the Native Codebase (image + video), which may result in a variance of 2–3% in the overall score.
(3) The overall scores of thinking models (e.g., GPT-4o, Gemini 3 Pro, Gemini 2.5 Flash) may also fluctuate within a range of 2–3% across runs.
📦 Uniform-50 Setting
| Model | Avg.(%) | Type |
|---|---|---|
| Human | 96.40 | Baseline |
| 🥇Gemini 3 Pro | 37.97 | Proprietary |
| 🥈O3 | 36.98 | Proprietary |
| 🥉GPT-5 | 36.80 | Proprietary |
| Gemini 2.5 Flash | 35.44 | Proprietary |
| Gemini 2.5 Flash (Thinking) | 35.17 | Proprietary |
| Seed-1.6-vision | 34.87 | Proprietary |
| Claude-haiku-4.5 | 34.27 | Proprietary |
| O4-mini | 34.18 | Proprietary |
| QwenVL2.5-72B | 32.73 | Open-Source |
| InternVL3-78B | 32.55 | Open-Source |
| Doubao-1.5-thinking | 31.65 | Proprietary |
| GPT-4o | 31.56 | Proprietary |
| InternVL2.5-78B | 31.37 | Open-Source |
| InternVL2.5-38B | 31.01 | Open-Source |
| QwenVL3-30B (Thinking) | 30.83 | Open-Source |
| LLaVA-Video-72B | 30.38 | Open-Source |
| InternVL3-8B | 30.38 | Open-Source |
| QwenVL2.5-7B | 29.66 | Open-Source |
| InternVL2.5-8B | 29.11 | Open-Source |
| InternVL3-38B | 28.84 | Open-Source |
| QwenVL3-30B | 28.75 | Open-Source |
| QwenVL2.5-32B | 28.57 | Open-Source |
| LLaVA-Video-7B | 28.48 | Open-Source |
| QwenVL3-8B | 27.58 | Open-Source |
| InternVideo2.5-8B | 27.40 | Open-Source |
| Random Guessing | 24.10 | Baseline |
📦 Sufficient-Coverage Setting
| Model | Avg.(%) | Type |
|---|---|---|
| Human | 96.40 | Baseline |
| 🥇O3 | 37.34 | Proprietary |
| 🥈Gemini 2.5 Flash (Thinking) | 36.71 | Proprietary |
| 🥉Gemini 2.5 Flash | 36.62 | Proprietary |
| O4-mini | 35.08 | Proprietary |
| QwenVL2.5-32B | 32.37 | Open-Source |
| QwenVL2.5-72B | 31.83 | Open-Source |
| InternVL3-8B | 29.57 | Open-Source |
| QwenVL3-30B | 29.11 | Open-Source |
| QwenVL3-8B | 29.09 | Open-Source |
| QwenVL2.5-7B | 28.84 | Open-Source |
| InternVL2.5-8B | 28.66 | Open-Source |
| GPT-4o | 28.12 | Proprietary |
| QwenVL3-30B (Thinking) | 28.03 | Open-Source |
| InternVideo2.5-8B | 26.85 | Open-Source |
| Random Guessing | 24.10 | Baseline |
🤖 Robot Sub-Bench
| Model | Avg.(%) | Type |
|---|---|---|
| 🥇Gemini 3 Pro | 40.20 | Proprietary |
| 🥈Gemini 2.5 Flash (Thinking) | 39.71 | Proprietary |
| 🥉Seed-1.6-vision | 39.34 | Proprietary |
| O3 | 39.22 | Proprietary |
| QwenVL2.5-72B | 37.75 | Open-Source |
| InternVL3-8B | 37.75 | Open-Source |
| GPT-5 | 37.75 | Proprietary |
| InternVL2.5-38B | 36.27 | Open-Source |
| Doubao-1.5-thinking | 36.07 | Proprietary |
| Gemini 2.5 Flash | 35.78 | Proprietary |
| O4-mini | 35.29 | Proprietary |
| QwenVL2.5-7B | 34.80 | Open-Source |
| InternVL2.5-78B | 34.80 | Open-Source |
| Claude-haiku-4.5 | 34.80 | Proprietary |
| InternVL3-78B | 34.31 | Open-Source |
| LLaVA-Video-72B | 34.31 | Open-Source |
| QwenVL3-30B | 32.84 | Open-Source |
| QwenVL2.5-32B | 32.84 | Open-Source |
| QwenVL3-8B | 32.12 | Open-Source |
| InternVideo2.5-8B | 29.90 | Open-Source |
| GPT-4o | 29.90 | Proprietary |
| InternVL2.5-8B | 28.43 | Open-Source |
| InternVL3-38B | 27.94 | Open-Source |
| QwenVL3-30B (Thinking) | 27.94 | Open-Source |
| LLaVA-Video-7B | 24.51 | Open-Source |
🏠 Indoor Scene Perception Sub-Bench
| Model | Avg.(%) | Type |
|---|---|---|
| 🥇GPT-5 | 41.68 | Proprietary |
| 🥈O3 | 40.73 | Proprietary |
| 🥉Gemini 2.5 Flash | 39.39 | Proprietary |
| Gemini 3 Pro | 39.39 | Proprietary |
| Gemini 2.5 Flash (Thinking) | 37.86 | Proprietary |
| O4-mini | 37.48 | Proprietary |
| Seed-1.6-vision | 34.20 | Proprietary |
| Claude-haiku-4.5 | 33.46 | Proprietary |
| Doubao-1.5-thinking | 33.04 | Proprietary |
| InternVL3-78B | 32.50 | Open-Source |
| QwenVL3-30B (Thinking) | 32.31 | Open-Source |
| GPT-4o | 31.74 | Proprietary |
| QwenVL2.5-72B | 30.78 | Open-Source |
| InternVL2.5-78B | 30.40 | Open-Source |
| QwenVL3-30B | 30.02 | Open-Source |
| QwenVL2.5-32B | 29.64 | Open-Source |
| InternVL2.5-8B | 29.45 | Open-Source |
| InternVL3-38B | 29.06 | Open-Source |
| QwenVL3-8B | 28.68 | Open-Source |
| InternVL2.5-38B | 28.30 | Open-Source |
| LLaVA-Video-72B | 28.11 | Open-Source |
| InternVL3-8B | 27.72 | Open-Source |
| LLaVA-Video-7B | 27.53 | Open-Source |
| QwenVL2.5-7B | 27.15 | Open-Source |
| InternVideo2.5-8B | 26.77 | Open-Source |
📍 Grounding Sub-Bench
| Model | Avg.(%) | Type |
|---|---|---|
| 🥇Gemini 2.5 Flash | 38.81 | Proprietary |
| 🥈Gemini 2.5 Flash (Thinking) | 38.21 | Proprietary |
| 🥉O3 | 37.61 | Proprietary |
| Doubao-1.5-thinking | 37.05 | Proprietary |
| InternVL3-78B | 35.52 | Open-Source |
| GPT-5 | 35.22 | Proprietary |
| Gemini 3 Pro | 35.22 | Proprietary |
| O4-mini | 34.33 | Proprietary |
| QwenVL2.5-72B | 34.33 | Open-Source |
| Seed-1.6-vision | 33.04 | Proprietary |
| Claude-haiku-4.5 | 32.84 | Proprietary |
| InternVL2.5-38B | 31.94 | Open-Source |
| InternVL3-8B | 31.94 | Open-Source |
| GPT-4o | 31.94 | Proprietary |
| QwenVL3-30B (Thinking) | 31.64 | Open-Source |
| QwenVL2.5-32B | 31.04 | Open-Source |
| LLaVA-Video-72B | 31.04 | Open-Source |
| InternVL3-38B | 30.45 | Open-Source |
| InternVL2.5-8B | 30.15 | Open-Source |
| InternVL2.5-78B | 29.85 | Open-Source |
| QwenVL3-30B | 29.25 | Open-Source |
| QwenVL2.5-7B | 28.66 | Open-Source |
| QwenVL3-8B | 28.66 | Open-Source |
| InternVideo2.5-8B | 27.76 | Open-Source |
| LLaVA-Video-7B | 27.16 | Open-Source |
Note: For the three sub-benchmarks, we report the higher score of each model across the two settings for ease of presentation.
@misc{lin2025mmsivideobenchholisticbenchmarkvideobased,
title={MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence},
author={Jingli Lin and Runsen Xu and Shaohao Zhu and Sihan Yang and Peizhou Cao and Yunlong Ran and Miao Hu and Chenming Zhu and Yiman Xie and Yilin Long and Wenbo Hu and Dahua Lin and Tai Wang and Jiangmiao Pang},
year={2025},
eprint={2512.10863},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.10863},
}

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
MMSI-Video-Bench utilizes data from the following open-source datasets: Roomtour3d, ScanNet, ScanNet++, 3RScan, ARKitScenes, RealEstate10k, DL3DV, Waymo, NuScenes, OVIS, TrackingNet, LaSOT, UAV123, Ego4D, EPIC-KITCHENS, EgoExoLearn, MultiSports, Charades, LEMMA, TF2023, CVMHT, AVA, DROID, RH20T, DTU. We sincerely thank the respective teams for their valuable contributions to the research community.
- Jingli Lin: [email protected]
- Runsen Xu: [email protected]



