[NeurIPS 2025 Spotlight] StreamForest: Efficient Online Video Understanding with Persistent Event Memory
Xiangyu Zeng, Kefan Qiu, Qingyu Zhang, Xinhao Li, Jing Wang, Jiaxin Li, Ziang Yan, Kun Tian, Meng Tian, Xinhai Zhao, Yi Wang, and Limin Wang
        🤗 Model & Data    |   🖥️ Demo    |    📑 Paper    |    🌐 Blog
- 2025/09/29: 🎉🎉🎉Our models, data and code have been released.
- 2025/09/19: 🎉🎉🎉Our paper has been accepted as a Spotlight at NeurIPS 2025.
StreamForest is a novel architecture designed for real-time streaming video understanding with Multimodal Large Language Models (MLLMs). Unlike prior approaches that struggle with memory constraints or coarse spatiotemporal reasoning, StreamForest introduces two complementary innovations:
- Persistent Event Memory Forest (PEMF): A hierarchical, event-level memory system that adaptively organizes and compresses long-term video history using temporal distance, content similarity, and merge frequency. This ensures efficient storage without losing critical contextual information (a toy sketch of this merging process is given after this list).
- Fine-grained Spatiotemporal Window (FSTW): A short-term perception module that captures detailed local spatiotemporal features around the current moment, enabling precise real-time reasoning.
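To make the PEMF idea concrete, here is a minimal, illustrative sketch of event-level memory compression. It is not the released implementation: the class, function names, and scoring weights are hypothetical, and the actual module operates on visual tokens inside the MLLM rather than on pooled per-event features.

```python
# Toy sketch of event-level memory merging (hypothetical names, not the released PEMF code).
# Adjacent events are merged when they are temporally close and visually similar, and
# frequently merged nodes are penalised so that no single event dominates the budget.
from dataclasses import dataclass
import torch
import torch.nn.functional as F

@dataclass
class EventNode:
    feat: torch.Tensor   # pooled visual feature of the event
    t_start: float       # event start time in seconds
    t_end: float         # event end time in seconds
    merges: int = 0      # how many merges produced this node

def merge_score(a: EventNode, b: EventNode, alpha=0.1, beta=1.0, gamma=0.05) -> float:
    """Higher score = better merge candidate: temporally close, similar content, rarely merged."""
    dt = max(b.t_start - a.t_end, 0.0)
    sim = F.cosine_similarity(a.feat, b.feat, dim=0).item()
    return beta * sim - alpha * dt - gamma * (a.merges + b.merges)

def compress(events: list[EventNode], budget: int) -> list[EventNode]:
    """Greedily merge adjacent events until the memory fits within `budget` nodes."""
    events = list(events)
    while len(events) > budget:
        # pick the adjacent pair with the highest merge score
        i = max(range(len(events) - 1), key=lambda k: merge_score(events[k], events[k + 1]))
        a, b = events[i], events[i + 1]
        merged = EventNode(
            feat=(a.feat + b.feat) / 2,
            t_start=a.t_start,
            t_end=b.t_end,
            merges=a.merges + b.merges + 1,
        )
        events[i:i + 2] = [merged]
    return events
```

Running `compress` over a stream of per-frame features keeps a bounded number of event nodes: recent, distinct moments survive as fine-grained leaves, while older, similar frames collapse into coarser events.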
For effective deployment and comprehensive evaluation, we contribute:
- OnlineIT, an instruction-tuning dataset tailored for streaming scenarios, improving both present-moment perception and future prediction.
- ODV-Bench, a benchmark for real-time autonomous driving video understanding.
Extensive experiments demonstrate that StreamForest consistently outperforms previous state-of-the-art streaming video MLLMs and achieves performance comparable to leading offline models. Even under extreme visual token compression, the model preserves nearly all of its accuracy, highlighting its robustness, efficiency, and scalability for real-world streaming video applications.
- Please execute the following commands to clone the StreamForest source code to your local environment:
```bash
git clone https://github.com/MCG-NJU/StreamForest.git
cd StreamForest
```
- Create a new environment and install the required dependencies:
```bash
conda create --name StreamForest python=3.10  # adjust the Python version to your setup
conda activate StreamForest
pip install -r requirements.txt
```
- Search for all instances of `/your_local_path_to` within the codebase and replace them with the absolute path of the StreamForest root directory on your local machine.
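If you prefer to script this step, the snippet below is a minimal sketch: the placeholder string and your local root path are the only inputs, and the file-extension filter is an assumption you can widen as needed.

```python
# Minimal sketch: replace the repo's placeholder path with your local StreamForest root.
import pathlib

PLACEHOLDER = "/your_local_path_to"        # placeholder string used in the codebase
LOCAL_ROOT = "/abs/path/to/StreamForest"   # set this to your actual absolute path

for path in pathlib.Path(".").rglob("*"):
    # only touch text-like config/script files; widen this filter if needed
    if path.is_file() and path.suffix in {".py", ".sh", ".yaml", ".yml", ".json"}:
        text = path.read_text(encoding="utf-8", errors="ignore")
        if PLACEHOLDER in text:
            path.write_text(text.replace(PLACEHOLDER, LOCAL_ROOT), encoding="utf-8")
            print(f"updated {path}")
```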
We employ lmms-eval for model evaluation.
- Preparation of the test set JSON: Download the dataset from StreamForest-Annodata and place it under the `./anno` directory.
- Preparation of the test set videos: Check the `utils.py` file under each test task in `./lmms_eval/tasks`. Some tasks include a `DATA_LIST` entry, which means you need to manually specify the directory where your local test videos are stored (see the illustrative example below).
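For illustration only, a `DATA_LIST` entry might look like the sketch below; the actual keys and structure differ from task to task, so edit only the paths that are actually present in your checkout.

```python
# Hypothetical illustration of a DATA_LIST in lmms_eval/tasks/<task>/utils.py.
# The real keys depend on the task; only the local video directory needs to change.
DATA_LIST = {
    "test_videos": "/abs/path/to/your/test_videos",  # directory holding this task's videos
}
```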
- You can run the following command to evaluate StreamForest on eight benchmark datasets, including our proposed ODV-Bench:
```bash
bash scripts/eval/run_eval.sh
```
- Execute the script `lmms_eval/scripts/eval_internvl2-8B.sh` to initiate the benchmark evaluation.
- To add a custom model, please refer to the implementation of `lmms_eval/models/streamforest.py`.
- Alternatively, you may reuse an existing model integration from lmms_eval and adapt it to your needs. Ensure that `lmms_eval/models/__init__.py` is updated accordingly to register the new model (a hypothetical registration sketch is given after the note below).
Note: Since the video data in ODVBench is associated with query timestamps, it is essential to employ llava/video_utils.py for proper video loading. As a reference, you may consult the implementation of the load_video function in lmms_eval/models/streamforest.py. Incorporate this function into your custom model as needed to ensure full compatibility with the lmms_eval evaluation framework.
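As a rough sketch of that registration step, the snippet below shows what an entry in `lmms_eval/models/__init__.py` might look like. The mapping name `AVAILABLE_MODELS`, the custom model name, and the class name are assumptions based on typical lmms-eval versions; mirror the existing `streamforest` entry in your checkout rather than copying this verbatim.

```python
# Hypothetical sketch (not the exact lmms-eval API for every version): most releases
# register models through a name -> class-name mapping in lmms_eval/models/__init__.py.
AVAILABLE_MODELS = {
    # ... existing entries kept unchanged ...
    "streamforest": "StreamForest",
    "my_streaming_model": "MyStreamingModel",  # placeholder name for your custom model
}

# In your model file (e.g. lmms_eval/models/my_streaming_model.py), reuse the load_video
# helper from lmms_eval/models/streamforest.py so that ODV-Bench videos are decoded
# relative to the query timestamp, as noted above.
```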
- OnlineIT-general: Download the dataset from StreamForest-Annodata and place it under the `./anno` directory.
- Video & Image Data: Download annotations from VideoChat-Flash-Training-Data and store them in the `./annotations` directory.
- Inspect all YAML files in `./anno/data_list`. For each entry, download the corresponding image or video data specified by the `data_root` field, and replace the URLs with the paths to your local data storage (a small path-checking sketch follows this list).
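After editing those files, a quick check like the sketch below can confirm that every `data_root` now points to an existing local directory. It only assumes that the YAML files carry a `data_root` field in their entries; adapt the traversal to the actual layout of your files.

```python
# Minimal sketch: verify that every data_root in ./anno/data_list points to a local directory.
import glob
import os
import yaml  # pip install pyyaml

for yaml_path in glob.glob("./anno/data_list/*.yaml"):
    with open(yaml_path) as f:
        cfg = yaml.safe_load(f)
    # the files may hold a single mapping or a list of dataset entries; handle both
    entries = cfg if isinstance(cfg, list) else [cfg]
    for entry in entries:
        root = entry.get("data_root") if isinstance(entry, dict) else None
        if root and not os.path.isdir(root):
            print(f"[missing] {yaml_path}: {root}")
```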
Our training pipeline consists of five stages. The first three stages follow the setup of VideoChat-Flash, while the fourth and fifth stages involve online video fine-tuning.
- To execute the first three stages of offline video pretraining, run:
```bash
bash scripts/train/stage1-init_connector/s1_siglip_tome64_mlp.sh
bash scripts/train/stage2-visual_pretraining/s2_siglip_tome64_mlp.sh
bash scripts/train/stage3-video_sft/s3_siglip_tome16_mlp.sh
```
- To perform the fourth and fifth stages of online video fine-tuning, run:
```bash
bash scripts/train/stage4-online_ft/s4_siglip_online_dynamic_tree_memory.sh
bash scripts/train/stage5-drive_ft/s5_siglip_online_tree_memory_drive.sh
```
The table below summarizes the configuration of each stage, including the corresponding scripts, datasets, and released checkpoints.
| Stage | ViT | Connector | LLM | Script | Data | Ckpt | 
|---|---|---|---|---|---|---|
| 1 | ❄️ | 🔥 | ❄️ | click me | click me | - | 
| 2 | 🔥 | 🔥 | 🔥 | click me | click me | - | 
| 3 | 🔥 | 🔥 | 🔥 | click me | click me | StreamForest-Pretrain-Qwen2-7B | 
| 4 | ❄️ | 🔥 | 🔥 | click me | click me | StreamForest-Qwen2-7B | 
| 5 | ❄️ | 🔥 | 🔥 | click me | click me | StreamForest-Drive-Qwen2-7B | 
Based on the table above, you can quickly start the fourth and fifth stages of online fine-tuning from the provided checkpoints.
```bibtex
@misc{zeng2025streamforest,
      title={StreamForest: Efficient Online Video Understanding with Persistent Event Memory},
      author={Xiangyu Zeng and Kefan Qiu and Qingyu Zhang and Xinhao Li and Jing Wang and Jiaxin Li and Ziang Yan and Kun Tian and Meng Tian and Xinhai Zhao and Yi Wang and Limin Wang},
      year={2025},
      eprint={2509.24871},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.24871},
}
```

Thanks to the following open-source projects:
- VideoChat-Flash: Efficient architecture, data and training methods for offline video understanding.
- LLaVA-NeXT: Original code framework and image and video training data.
- ToMe: Efficient token merging and compression method.
- VideoChat-Online: Inspiring memory mechanism design and online training data.
- StreamChat: Design of a training-free tree-structured hierarchical memory mechanism.
- MovieChat: Pioneer of long video memory mechanisms.
- VideoLLM-Online: Pioneer of online video understanding methods.


