MiniVeo3-Reasoner: Thinking with Videos from Open-Source Priors

🎯 Overview

Advanced video models have recently demonstrated remarkable zero-shot visual reasoning capabilities, solving tasks such as maze navigation, symmetry completion, and analogy completion through a chain-of-frames (CoF) reasoning process.

This project shows that such CoF capability can be acquired by fine-tuning open-source video models like Wan2.2.

In the maze domain, the fine-tuned models—dubbed MiniVeo3-Reasoner—exhibit surprisingly strong visual reasoning performance, achieving near-perfect accuracy on in-distribution tests and robust out-of-distribution generalization.

Under controlled comparisons, MiniVeo3-Reasoner significantly outperforms baseline approaches that reason in other modalities such as text or images.

We further envision that this visual reasoning capability can be enhanced through reinforcement learning of video models.

🔥 News

🚩 2025.10: We are thrilled to release MiniVeo3-Reasoner, with mazes as a testbed for visual reasoning!

🤗 Models

Models	Download Links	Description
MiniVeo3-Reasoner-Maze-5B	🤗 HuggingFace	Fine-tuned LoRA for Maze tasks (3x3 to 6x6 sizes) from the base model Wan2.2-TI2V-5B

✨ Examples

Problem Setup	Examples
Maze 3x3	maze3_1.mp4	maze3_2.mp4
Maze 4x4	maze4_1.mp4	maze4_2.mp4
Maze 5x5	maze5_1.mp4	maze5_2.mp4
Maze 6x6	maze6_1.mp4	maze6_2.mp4

OOD Generalization

OOD Solution Lengths:

Problem Setup	Examples
Maze 6x6 (solution len > 12)	maze6ood_1.mp4	maze6ood_2.mp4

OOD Maze Sizes:

Problem Setup	Examples
Maze 7x7	maze7_1.mp4	maze7_2.mp4
Maze 8x8	maze8_1.mp4	maze8_2.mp4

📊 Performance

Following Visual Planning: Let's Think Only with Images, we report two metrics:

Exact Match (EM) measures whether the model generates the complete and correct trajectory, i.e. one that matches the shortest valid path.
Progress Rate (PR) measures the ratio of the number of consecutively correct steps (valid forward moves) from the start to the total number of steps in the optimal path.
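
As a rough illustration of the two definitions above, the sketch below computes EM and PR on paths given as lists of grid cells. It assumes each maze has a single optimal path (so a "correct step" is one that matches the ground truth) and that the trajectory has already been converted to `(row, col)` cells; the function names are hypothetical and this is not the repository's evaluator.

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (row, col) grid coordinate; representation is an assumption


def exact_match(pred: List[Cell], gold: List[Cell]) -> bool:
    """True only if the predicted path equals the optimal path cell for cell."""
    return pred == gold


def progress_rate(pred: List[Cell], gold: List[Cell]) -> float:
    """Fraction of optimal-path steps matched consecutively from the start."""
    total_steps = len(gold) - 1  # a path of n cells contains n - 1 moves
    if total_steps <= 0:
        return 1.0 if pred[:1] == gold[:1] else 0.0
    if pred[:1] != gold[:1]:
        return 0.0  # the prediction does not even begin at the start cell
    correct = 0
    for p, g in zip(pred[1:], gold[1:]):
        if p != g:
            break  # stop counting at the first wrong move
        correct += 1
    return correct / total_steps


gold = [(0, 0), (0, 1), (1, 1), (1, 2)]
pred = [(0, 0), (0, 1), (1, 1), (2, 1)]  # last move leaves the optimal path
print(exact_match(pred, gold))    # False
print(progress_rate(pred, gold))  # 2 of 3 moves are correct
```

Under these definitions, EM is 1 exactly when PR reaches 100% and the prediction contains no extra moves, which is why PR is never lower than EM in the tables below.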

MiniVeo3-Reasoner-Maze-5B	EM (%)	PR (%)
Maze 3x3	100	100
Maze 4x4	100	100
Maze 5x5	100	100
Maze 6x6	98.4	98.7
Maze 6x6 (OOD solution length)	53.6	59.7
Maze 7x7 (OOD size)	86.8	90.1
Maze 8x8 (OOD size)	60.4	67.8

Comparisons

Using the same amount of training data, we compare against the performance metrics reported in Visual Planning, which we include here for reference.

Model	Thinking Modality	Maze EM (%)	Maze PR (%)
Gemini 2.0 Flash - Direct	Text	8.3	31.4
Gemini 2.0 Flash - CoT	Text	6.9	29.8
Gemini 2.0 Pro (think)	Text	21.5	35.5
Qwen 2.5-VL-Instruct-3B - Direct	Text	0.5	13.6
Qwen 2.5-VL-Instruct-3B - CoT	Text	0.8	8.2
Qwen 2.5-VL-Instruct-3B - SFT	Text	33.3	52.7
LVM-3B - VPFT	Image	59.0	64.0
LVM-3B - VPRL	Image	74.5	77.6
MiniVeo3-Reasoner-Maze-5B	Video	99.6	99.7

🚀 Get Started

Environment Setup

conda create -n miniveo3_reasoner python==3.12
conda activate miniveo3_reasoner
pip install -r requirements.txt

We use DiffSynth-Studio for diffusion model training and inference. You also need to install it:

git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
git checkout ed256ef8be195d5deae2846a7e9f025670d99db3
pip install -e .

Data Preparation

Our data generator produces a series of mazes with a configurable size, path length, and number of samples. For each maze it outputs a .mp4 video file and a .png image (the first frame of the video).

We use a customized version of maze-dataset. You can install it as follows:

pip install -e data/maze/maze-dataset

After installation, use the script below to generate mazes with custom configurations:

python data/maze/maze_generator.py

To reproduce the same data distribution used in our experiments, simply run:

bash scripts/generate_maze_dataset.sh

The training and test sets will be written to dataset/maze_train and dataset/maze_test, respectively.
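
To make the idea of a maze with a configurable size and a known solution length concrete, here is a minimal, self-contained sketch. It does not use the maze-dataset API: it carves a maze with randomized depth-first search and recovers the solution with breadth-first search. The function names, the cell representation, and the choice of start and goal corners are all assumptions made for illustration.

```python
import random
from collections import deque


def generate_maze(n, seed=0):
    """Carve an n x n maze with randomized depth-first search.

    Returns a dict mapping each (row, col) cell to the set of cells it is
    connected to. The result is a spanning tree, so every pair of cells is
    joined by exactly one path.
    """
    rng = random.Random(seed)
    links = {(r, c): set() for r in range(n) for c in range(n)}
    stack, seen = [(0, 0)], {(0, 0)}
    while stack:
        r, c = stack[-1]
        options = [
            (r + dr, c + dc)
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if (r + dr, c + dc) in links and (r + dr, c + dc) not in seen
        ]
        if not options:
            stack.pop()  # dead end: backtrack
            continue
        nxt = rng.choice(options)
        links[(r, c)].add(nxt)  # remove the wall between the two cells
        links[nxt].add((r, c))
        seen.add(nxt)
        stack.append(nxt)
    return links


def shortest_path(links, start, goal):
    """Breadth-first search; returns the list of cells from start to goal."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            break
        for nb in links[cell]:
            if nb not in parent:
                parent[nb] = cell
                queue.append(nb)
    path, cell = [], goal
    while cell is not None:  # walk back from goal to start
        path.append(cell)
        cell = parent[cell]
    return path[::-1]


maze = generate_maze(6, seed=42)
path = shortest_path(maze, (0, 0), (5, 5))
print(len(path) - 1)  # solution length in moves
```

A generator built this way could keep or reject a maze based on `len(path) - 1`, which is one plausible way to separate in-distribution samples from longer out-of-distribution ones; how the actual script applies its path-length setting may differ.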

Inference

Download our LoRA weights:

pip install "huggingface_hub[cli]"
huggingface-cli download thuml/MiniVeo3-Reasoner-Maze-5B --local-dir models/thuml/MiniVeo3-Reasoner-Maze-5B

To run inference on a single file or directory, use:

python inference/maze/inference_maze.py [-r] filename_or_directory

💡 The first run may take additional time to automatically download the base model files.

To perform inference on all test samples, simply run:

bash scripts/inference_maze_testset.sh

Success Evaluation

Our evaluator compares the predicted trajectory with the ground truth, computing the distance between the two paths.

We implement our own versions of Exact Match (EM) and Progress Rate (PR) metrics for video-based evaluation.
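
Video-based evaluation needs some way to turn a trajectory seen in the frames into a sequence of maze cells before path metrics can be applied. The sketch below shows one simple way to do that step, assuming the per-frame pixel positions of the moving marker are already known and that the maze fills the whole frame with equally sized cells. Both assumptions, and the function name, are ours; the repository's evaluator may work differently.

```python
def points_to_cells(points, frame_size, grid_size):
    """Map per-frame (x, y) pixel positions to a deduplicated cell sequence.

    points:     iterable of (x, y) pixel coordinates, one per frame
    frame_size: (width, height) of the video frame in pixels
    grid_size:  number of cells per side of the square maze
    """
    width, height = frame_size
    cell_w, cell_h = width / grid_size, height / grid_size
    cells = []
    for x, y in points:
        # Clamp so that points on the right or bottom edge stay inside the grid.
        col = min(int(x // cell_w), grid_size - 1)
        row = min(int(y // cell_h), grid_size - 1)
        # Many consecutive frames fall in the same cell; keep each visit once.
        if not cells or cells[-1] != (row, col):
            cells.append((row, col))
    return cells


track = [(10, 10), (20, 12), (60, 15), (60, 70)]
print(points_to_cells(track, frame_size=(150, 150), grid_size=3))
# [(0, 0), (0, 1), (1, 1)]
```

The resulting cell sequence can then be compared with the ground-truth path cell by cell.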

If your generated results are stored in dataset/maze_test and named properly, you can evaluate all test samples by running:

bash scripts/evaluate_maze.sh

Training Models

We train Wan2.2-TI2V-5B with LoRA, following the instructions provided in DiffSynth-Studio. You can easily fine-tune your own models using the same framework.

For convenience, if you follow our setup, you can copy the training dataset dataset/maze_train directly into DiffSynth-Studio/data/example_video_dataset.
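
If the copied dataset does not already include a metadata file, a small script like the one below can create one. This is a sketch under stated assumptions: it assumes the training framework reads a `metadata.csv` with `video` and `prompt` columns, which should be checked against the DiffSynth-Studio documentation for the pinned commit. The prompt string is a placeholder, not the prompt used in our experiments.

```python
import csv
from pathlib import Path


def write_metadata(dataset_dir, prompt, out_name="metadata.csv"):
    """Write one CSV row per .mp4 file in dataset_dir; returns the row count.

    The column names ("video", "prompt") are an assumption about the
    expected training format.
    """
    dataset_dir = Path(dataset_dir)
    videos = sorted(p.name for p in dataset_dir.glob("*.mp4"))
    with open(dataset_dir / out_name, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["video", "prompt"])
        for name in videos:
            writer.writerow([name, prompt])
    return len(videos)


if __name__ == "__main__":
    target = Path("DiffSynth-Studio/data/example_video_dataset")
    if target.is_dir():
        # Placeholder prompt: replace with the prompt you actually train with.
        count = write_metadata(target, "PLACEHOLDER PROMPT")
        print(f"Wrote metadata for {count} videos")
```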

🤝 Contributors

Jialong Wu*, Tianhao Huang*, Changjing He*, Mingsheng Long. (*: Equal Contribution)

We welcome contributions! Feel free to open GitHub issues for bug reports or feature requests.

💡 Acknowledgements

Veo 3: This project is inspired by the impressive zero-shot performance of Veo 3!
Wan: Powerful open-source video diffusion models used as base models.
DiffSynth-Studio: Video diffusion model training.
maze-dataset: Data generation for maze reasoning tasks.
Visual Planning: Baseline benchmark for performance comparison.
Nano Banana: Help in generating the project logo.

📜 Citation

There is currently no technical report available.

If you find MiniVeo3-Reasoner useful, we would appreciate it if you could cite our work:

@misc{miniveo3reasoner,
    title = {MiniVeo3-Reasoner: Thinking with Videos from Open-Source Priors},
    author = {Jialong Wu and Tianhao Huang and Changjing He and Mingsheng Long},
    year = {2025},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/thuml/MiniVeo3-Reasoner}},
}
