[NeurIPS 2025 Spotlight] StreamForest: Efficient Online Video Understanding with Persistent Event Memory
Xiangyu Zeng, Kefan Qiu, Qingyu Zhang, Xinhao Li, Jing Wang, Jiaxin Li, Ziang Yan, Kun Tian, Meng Tian, Xinhai Zhao, Yi Wang, and Limin Wang
        🤗 Model & Data    |   🖥️ Demo    |    📑 Paper    |    🌐 Blog
- 2025/09/29: 🎉🎉🎉Our models, data and code have been released.
- 2025/09/19: 🎉🎉🎉Our paper has been accepted as a Spotlight at NeurIPS 2025.
StreamForest is a novel architecture designed for real-time streaming video understanding with Multimodal Large Language Models (MLLMs). Unlike prior approaches that struggle with memory constraints or coarse spatiotemporal reasoning, StreamForest introduces two complementary innovations:
- Persistent Event Memory Forest (PEMF): A hierarchical, event-level memory system that adaptively organizes and compresses long-term video history using temporal distance, content similarity, and merge frequency. This ensures efficient storage without losing critical contextual information (a toy sketch of this merging process is given after this list).
- Fine-grained Spatiotemporal Window (FSTW): A short-term perception module that captures detailed local spatiotemporal features around the current moment, enabling precise real-time reasoning.
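To make the PEMF idea concrete, here is a minimal, illustrative sketch of event-level memory compression. It is not the released implementation: the class, function names, and scoring weights are hypothetical, and the actual module operates on visual tokens inside the MLLM rather than on pooled per-event features.

```python
# Toy sketch of event-level memory merging (hypothetical names, not the released PEMF code).
# Adjacent events are merged when they are temporally close and visually similar, and
# frequently merged nodes are penalised so that no single event dominates the budget.
from dataclasses import dataclass
import torch
import torch.nn.functional as F

@dataclass
class EventNode:
    feat: torch.Tensor   # pooled visual feature of the event
    t_start: float       # event start time in seconds
    t_end: float         # event end time in seconds
    merges: int = 0      # how many merges produced this node

def merge_score(a: EventNode, b: EventNode, alpha=0.1, beta=1.0, gamma=0.05) -> float:
    """Higher score = better merge candidate: temporally close, similar content, rarely merged."""
    dt = max(b.t_start - a.t_end, 0.0)
    sim = F.cosine_similarity(a.feat, b.feat, dim=0).item()
    return beta * sim - alpha * dt - gamma * (a.merges + b.merges)

def compress(events: list[EventNode], budget: int) -> list[EventNode]:
    """Greedily merge adjacent events until the memory fits within `budget` nodes."""
    events = list(events)
    while len(events) > budget:
        # pick the adjacent pair with the highest merge score
        i = max(range(len(events) - 1), key=lambda k: merge_score(events[k], events[k + 1]))
        a, b = events[i], events[i + 1]
        merged = EventNode(
            feat=(a.feat + b.feat) / 2,
            t_start=a.t_start,
            t_end=b.t_end,
            merges=a.merges + b.merges + 1,
        )
        events[i:i + 2] = [merged]
    return events
```

Running `compress` over a stream of per-frame features keeps a bounded number of event nodes: recent, distinct moments survive as fine-grained leaves, while older, similar frames collapse into coarser events.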
For effective deployment and comprehensive evaluation, we contribute:
- OnlineIT, an instruction-tuning dataset tailored for streaming scenarios, improving both present-moment perception and future prediction.
- ODV-Bench, a benchmark for real-time autonomous driving video understanding.
Extensive experiments demonstrate that StreamForest consistently outperforms previous state-of-the-art streaming video MLLMs and achieves performance comparable to leading offline models. Even under extreme visual token compression, the model preserves nearly all of its accuracy, highlighting its robustness, efficiency, and scalability for real-world streaming video applications.
- Please execute the following commands to clone the StreamForest source code to your local environment:
```bash
git clone https://github.com/MCG-NJU/StreamForest.git
cd StreamForest
```
- Create a new environment and install the required dependencies:
```bash
conda create --name StreamForest python=3.10  # adjust the Python version to your setup
conda activate StreamForest
pip install -r requirements.txt
```
- Search for all instances of `/your_local_path_to` within the codebase and replace them with the absolute path of the StreamForest root directory on your local machine.
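If you prefer to script this step, the snippet below is a minimal sketch: the placeholder string and your local root path are the only inputs, and the file-extension filter is an assumption you can widen as needed.

```python
# Minimal sketch: replace the repo's placeholder path with your local StreamForest root.
import pathlib

PLACEHOLDER = "/your_local_path_to"        # placeholder string used in the codebase
LOCAL_ROOT = "/abs/path/to/StreamForest"   # set this to your actual absolute path

for path in pathlib.Path(".").rglob("*"):
    # only touch text-like config/script files; widen this filter if needed
    if path.is_file() and path.suffix in {".py", ".sh", ".yaml", ".yml", ".json"}:
        text = path.read_text(encoding="utf-8", errors="ignore")
        if PLACEHOLDER in text:
            path.write_text(text.replace(PLACEHOLDER, LOCAL_ROOT), encoding="utf-8")
            print(f"updated {path}")
```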
We employ lmms-eval for model evaluation.
- Preparation of the test set JSON: Download the dataset from StreamForest-Annodata and place it under the `./anno` directory.
- Preparation of the test set videos: Check the `utils.py` file under each test task in `./lmms_eval/tasks`. Some tasks include a `DATA_LIST` entry, which means you need to manually specify the directory where your local test videos are stored (see the illustrative example below).
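For illustration only, a `DATA_LIST` entry might look like the sketch below; the actual keys and structure differ from task to task, so edit only the paths that are actually present in your checkout.

```python
# Hypothetical illustration of a DATA_LIST in lmms_eval/tasks/<task>/utils.py.
# The real keys depend on the task; only the local video directory needs to change.
DATA_LIST = {
    "test_videos": "/abs/path/to/your/test_videos",  # directory holding this task's videos
}
```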
- You can run the following command to evaluate StreamForest on eight benchmark datasets, including our proposed ODV-Bench:
```bash
bash scripts/eval/run_eval.sh
```
- Execute the script `lmms_eval/scripts/eval_internvl2-8B.sh` to initiate the benchmark evaluation.
- To add a custom model, please refer to the implementation of `lmms_eval/models/streamforest.py`.
- Alternatively, you may reuse an existing model integration from lmms_eval and adapt it to your needs. Ensure that `lmms_eval/models/__init__.py` is updated accordingly to register the new model (a hypothetical registration sketch is given after the note below).
Note: Since the video data in ODVBench is associated with query timestamps, it is essential to employ llava/video_utils.py for proper video loading. As a reference, you may consult the implementation of the load_video function in lmms_eval/models/streamforest.py. Incorporate this function into your custom model as needed to ensure full compatibility with the lmms_eval evaluation framework.
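As a rough sketch of that registration step, the snippet below shows what an entry in `lmms_eval/models/__init__.py` might look like. The mapping name `AVAILABLE_MODELS`, the custom model name, and the class name are assumptions based on typical lmms-eval versions; mirror the existing `streamforest` entry in your checkout rather than copying this verbatim.

```python
# Hypothetical sketch (not the exact lmms-eval API for every version): most releases
# register models through a name -> class-name mapping in lmms_eval/models/__init__.py.
AVAILABLE_MODELS = {
    # ... existing entries kept unchanged ...
    "streamforest": "StreamForest",
    "my_streaming_model": "MyStreamingModel",  # placeholder name for your custom model
}

# In your model file (e.g. lmms_eval/models/my_streaming_model.py), reuse the load_video
# helper from lmms_eval/models/streamforest.py so that ODV-Bench videos are decoded
# relative to the query timestamp, as noted above.
```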
- OnlineIT-general: Download the dataset from StreamForest-Annodata and place it under the `./anno` directory.
- Video & Image Data: Download annotations from VideoChat-Flash-Training-Data and store them in the `./annotations` directory.
- Inspect all YAML files in `./anno/data_list`. For each entry, download the corresponding image or video data specified by the `data_root` field, and replace the URLs with the paths to your local data storage (a small path-checking sketch follows this list).
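After editing those files, a quick check like the sketch below can confirm that every `data_root` now points to an existing local directory. It only assumes that the YAML files carry a `data_root` field in their entries; adapt the traversal to the actual layout of your files.

```python
# Minimal sketch: verify that every data_root in ./anno/data_list points to a local directory.
import glob
import os
import yaml  # pip install pyyaml

for yaml_path in glob.glob("./anno/data_list/*.yaml"):
    with open(yaml_path) as f:
        cfg = yaml.safe_load(f)
    # the files may hold a single mapping or a list of dataset entries; handle both
    entries = cfg if isinstance(cfg, list) else [cfg]
    for entry in entries:
        root = entry.get("data_root") if isinstance(entry, dict) else None
        if root and not os.path.isdir(root):
            print(f"[missing] {yaml_path}: {root}")
```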
Our training pipeline consists of five stages. The first three stages follow the setup of VideoChat-Flash, while the fourth and fifth stages involve online video fine-tuning.
- To execute the first three stages of offline video pretraining, run:
```bash
bash scripts/train/stage1-init_connector/s1_siglip_tome64_mlp.sh
bash scripts/train/stage2-visual_pretraining/s2_siglip_tome64_mlp.sh
bash scripts/train/stage3-video_sft/s3_siglip_tome16_mlp.sh
```
- To perform the fourth and fifth stages of online video fine-tuning, run:
```bash
bash scripts/train/stage4-online_ft/s4_siglip_online_dynamic_tree_memory.sh
bash scripts/train/stage5-drive_ft/s5_siglip_online_tree_memory_drive.sh
```
The table below summarizes the configuration of each stage, including the corresponding scripts, datasets, and released checkpoints.
| Stage | ViT | Connector | LLM | Script | Data | Ckpt | 
|---|---|---|---|---|---|---|
| 1 | ❄️ | 🔥 | ❄️ | click me | click me | - | 
| 2 | 🔥 | 🔥 | 🔥 | click me | click me | - | 
| 3 | 🔥 | 🔥 | 🔥 | click me | click me | StreamForest-Pretrain-Qwen2-7B | 
| 4 | ❄️ | 🔥 | 🔥 | click me | click me | StreamForest-Qwen2-7B | 
| 5 | ❄️ | 🔥 | 🔥 | click me | click me | StreamForest-Drive-Qwen2-7B | 
Based on the table above, you can quickly start the fourth and fifth stages of online fine-tuning from the provided checkpoints.
```bibtex
@misc{zeng2025streamforest,
      title={StreamForest: Efficient Online Video Understanding with Persistent Event Memory},
      author={Xiangyu Zeng and Kefan Qiu and Qingyu Zhang and Xinhao Li and Jing Wang and Jiaxin Li and Ziang Yan and Kun Tian and Meng Tian and Xinhai Zhao and Yi Wang and Limin Wang},
      year={2025},
      eprint={2509.24871},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.24871},
}
```

Thanks to the following open-source projects:
- VideoChat-Flash: Efficient architecture, data and training methods for offline video understanding.
- LLaVA-NeXT: Original code framework and image and video training data.
- ToMe: Efficient token merging and compression method.
- VideoChat-Online: Inspiring memory mechanism design and online training data.
- StreamChat: Design of a training-free tree-structured hierarchical memory mechanism.
- MovieChat: Pioneer of long video memory mechanisms.
- VideoLLM-Online: Pioneer of online video understanding methods.


