¹S-Lab, Nanyang Technological University  ²A*STAR, Singapore  ³Simon Fraser University  ⁴Shanghai AI Lab
Ego-R1 is a comprehensive research framework that combines reinforcement-learning-based tool-use reasoning with egocentric video analysis.
This repository provides:
- Chain-of-Tool-Thought Generation (cott_gen): Multi-modal AI agents for analyzing egocentric video data with tool-calling capabilities (RAG, Video-LLM, VLM)
- Ego-R1-Agent: Reinforcement learning framework for training multi-turn tool-use interleaved LLMs
- Ego-R1 Dataset: 25K Chain-of-Tool-Thought examples and 4.4K QA pairs
- Multi-modal Tool-Augmented Reasoning: Combines RAG search, Video-LLM, and Vision-Language Models for long video understanding; agents learn to use multiple tools to decompose and answer complex egocentric video questions
- Reinforcement Learning: GRPO training for interleaved thinking, reasoning, and acting behavior
- Comprehensive Dataset: Releases the CoTT data-generation code and pre-processed data for both SFT and RL training
- [2025.6.8] Official launch of the Ego-R1 codebase.
- Repository Structure
- Installation
- Quick Start
- Usage Examples
- Dataset
- Acknowledgments
- License
- Contributing
- Authors & Contact
- Citation
Ego-R1/
├── cott_gen/             # Chain-of-Tool-Thought generation for egocentric video QA
│   ├── main.py           # Main agent runner with multi-turn reasoning
│   ├── tools.py          # Tool implementations (RAG, Video-LLM, VLM)
│   ├── utils.py          # Utility functions and data processing
│   ├── prompts.py        # System and reasoning prompts
│   ├── postprocess.py    # Data postprocessing and analysis
│   └── environment.yml   # Conda environment for autogen
├── LLaMA-Factory/        # LLM fine-tuning framework (submodule)
├── Ego-R1-Agent/         # RL framework for reasoning + search LLMs
│   ├── train_grpo.sh     # GRPO training script
│   ├── train_ppo.sh      # PPO training script
│   ├── eval/             # Inference and evaluation scripts
│   └── verl/             # veRL framework components
├── data/                 # Ego-R1 dataset (should be downloaded from HF)
│   ├── Ego-CoTT-25K/     # 25K Chain-of-Tool-Thought for SFT
│   ├── Ego-QA-4.4K/      # 4.4K QA pairs for RL training
│   └── Ego-CoTT-raw/     # Raw data in multiple formats
├── scripts/              # Training and generation scripts
│   ├── train/            # SFT training scripts
│   └── gen/              # Data generation scripts
└── api/                  # API components for RAG and visual tools
    ├── rag/              # RAG-related API components
    └── visual_tools/     # Multi-modal visual tool APIs
huggingface-cli download Ego-R1/Ego-R1-Data --local-dir data --repo-type dataset
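Alternatively, the same download can be scripted with the `huggingface_hub` Python API (equivalent to the CLI command above):

```python
from huggingface_hub import snapshot_download

# Equivalent of:
#   huggingface-cli download Ego-R1/Ego-R1-Data --local-dir data --repo-type dataset
snapshot_download(
    repo_id="Ego-R1/Ego-R1-Data",
    repo_type="dataset",
    local_dir="data",
)
```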
i. Set Environment
cd api/rag
pip install -e .
Make sure to install FFmpeg beforehand, as it is required for the visual tools to function properly.
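A quick way to confirm FFmpeg is available before launching the tools (a standalone check, not part of the repo):

```python
import shutil
import subprocess

# The visual tools shell out to FFmpeg; fail fast if it is missing.
if shutil.which("ffmpeg") is None:
    raise RuntimeError(
        "FFmpeg not found on PATH; install it first (e.g. via conda-forge or apt)."
    )

# Print the installed version banner as a sanity check.
out = subprocess.run(["ffmpeg", "-version"], capture_output=True, text=True)
print(out.stdout.splitlines()[0])
```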
ii. Prepare the Data for the Egoschema and Videomme Benchmarks
huggingface-cli download Ego-R1/h-rag_database --local-dir data --repo-type dataset
Unzip the Videomme and Egoschema videos.
iii. Setup API

- Set GPT Key:
  export AZURE_OPENAI_ENDPOINT=ENDPOINT
  export AZURE_OPENAI_API_KEY=KEY

- Start RAG
  - For Egolife/Ego-R1:
    - Set the video directory in rag/configs/egolife.yaml:
      base:
        data_dir: data/egolife # set to h-rag_database/egolife
    - Run:
      python api_for_egolife.py
  - For Egoschema:
    - Run:
      python api_for_egoschema.py --min_log_dir=h-rag_database/egoschema --port 6001 # default
  - For Videomme:
    - Run:
      python api_for_videomme.py --min_log_dir=h-rag_database/videomme/videomme_10min --sec_log_dir=h-rag_database/videomme/videomme_30s --port 7001 # default
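Once a RAG server is up, you can smoke-test it over HTTP. Note this is a sketch under assumptions: the `/search` route and the payload fields below are guesses; only the ports are given above, so consult the corresponding `api_for_*.py` for the actual interface.

```python
import requests

# Hypothetical smoke test against the Egoschema RAG server (port 6001 above);
# the /search route and JSON fields are assumptions, not the documented API.
resp = requests.post(
    "http://localhost:6001/search",
    json={"level": "hour", "keywords": ["cooking", "kitchen"]},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```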
iv. Start Visual API

- Set Config
  - Set the video directory in visual_tools/configs.yaml for the EgoLife, Egoschema, and Videomme videos separately:
    data_dir: "/path/to/egolife"
    data_dir: "/path/to/videomme"
    data_dir: "/path/to/egoschema"
  - Set any number of Gemini API keys:
    gemini_api_keys: ["your-gemini-api-key-1", "your-gemini-api-key-2"]

- Run API
  - For any visual API, run:
    python api.py
  - For the LLaVA-based VideoLLM, run the LLaVA API first:
    python xxxx_videollm_llava/llava_video.py
# One-line installation
cd cott_gen
conda env create -f environment.yml
conda activate autogen
# Or install step by step:
# conda create -n autogen python=3.10
# conda activate autogen
# pip install -U autogenstudio==0.6.1
# pip install future google-genai
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
conda create -n egor1 python=3.9
conda activate egor1
# Install PyTorch (optional - vllm can handle this)
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
# verl
pip install -e .
# flash attention 2
pip3 install flash-attn --no-build-isolation
pip install wandb google-genai
You can also follow Search-R1 to build the environment.
bash Ego-R1-Agent/utils/serve.sh
conda activate egor1
# with a summary model
bash Ego-R1-Agent/eval/infer_bench_summ.sh
# or run the basic inference script instead:
# python infer.py --arg1 xxx --arg2 xxx
# Prepare data
mkdir -p LLaMA-Factory/data
cp data/Ego-CoTT-25K/train-cott.json LLaMA-Factory/data/
# Train model
conda activate llamafactory
cd LLaMA-Factory
llamafactory-cli train examples/train_full/qwen.yaml
# Prepare data
mkdir -p Ego-R1-Agent/data
cp data/Ego-CoTT-raw/*.parquet Ego-R1-Agent/data/
# Start RL training
conda activate egor1
cd Ego-R1-Agent
bash train_grpo.sh # For GRPO training
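If you want to sanity-check the RL data before launching, the parquet files can be inspected with pandas (a quick look, not part of the training pipeline):

```python
import glob

import pandas as pd

# Inspect the schema and a sample row of each RL training parquet file.
for path in glob.glob("Ego-R1-Agent/data/*.parquet"):
    df = pd.read_parquet(path)
    print(path, df.shape)
    print(df.columns.tolist())
    print(df.head(1))
```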
# Generate reasoning traces with multi-modal tools
conda activate autogen
bash scripts/gen/run_data_gen.sh
The Ego-R1 agent uses a structured chain-of-tool-thought approach:
- Think: Analyze the question and plan the reasoning approach
- RAG Search: Retrieve relevant context from video databases across different time granularities
- Video-LLM: Analyze specific video segments for detailed understanding
- VLM: Extract visual details from specific frames when needed
- Answer: Provide reasoned response based on collected evidence
RAG search example:
{
"name": "rag",
"arguments": {
"level": "day", # or "week", "hour"
"keywords": ["cooking", "kitchen"],
"start_time": "DAY1_11210217",
"query_time": "DAY1_11220217"
}
}
Video-LLM example:
{
"name": "video_llm",
"arguments": {
"question": "What cooking action is being performed?",
"range": "DAY1_11210217-DAY1_11220217"
}
}
VLM example:
{
"name": "vlm",
"arguments": {
"question": "What objects are visible on the table?",
"timestamp": "DAY1_11210217"
}
}
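Putting the pieces together, the generation loop is essentially: parse the tool call emitted by the model, dispatch it to the matching API, and feed the observation back into the conversation. A minimal sketch of that dispatch (the `run_*` helpers are hypothetical stand-ins for the real implementations in cott_gen/tools.py):

```python
import json

# Hypothetical stand-ins for the real tool implementations in cott_gen/tools.py.
def run_rag(args):
    return f"[rag:{args['level']}] results for {args['keywords']}"

def run_video_llm(args):
    return f"[video_llm] answer for '{args['question']}' over {args['range']}"

def run_vlm(args):
    return f"[vlm] answer for '{args['question']}' at {args['timestamp']}"

TOOLS = {"rag": run_rag, "video_llm": run_video_llm, "vlm": run_vlm}

def step(tool_call: str) -> str:
    """Dispatch one JSON tool call emitted by the model; return the observation."""
    call = json.loads(tool_call)
    return TOOLS[call["name"]](call["arguments"])

# Example: dispatch the VLM call shown above.
print(step('{"name": "vlm", "arguments": '
           '{"question": "What objects are visible on the table?", '
           '"timestamp": "DAY1_11210217"}}'))
```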
Ego-CoTT-25K
- Size: 25,000 examples (415MB)
- Format: Multi-turn conversations with tool calls
- Purpose: Supervised fine-tuning
- Tools: RAG, Video-LLM, VLM integration
Ego-QA-4.4K
- Size: 4,400 QA pairs
- Sources: 1.5K Gemini-generated + 2.9K manual annotations
- Agents: 6 different identities (A1-A6)
- Purpose: Rule-based reinforcement learning training or generating CoTT from scratch
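To get a feel for the CoTT format before training, you can inspect the SFT file directly. This assumes the dataset was downloaded to data/ as above; the snippet also assumes train-cott.json is a top-level JSON array, as is typical for LLaMA-Factory data files.

```python
import json

# Load the Ego-CoTT-25K SFT file and preview one record.
with open("data/Ego-CoTT-25K/train-cott.json") as f:
    cott = json.load(f)

print(len(cott), "examples")
print(json.dumps(cott[0], indent=2)[:800])  # preview the first record
```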
This project builds upon several excellent open-source frameworks:
- autogen: Foundation for multi-agent conversations and tool calling
- veRL: Reinforcement learning framework for LLM training
- LLaMA-Factory: Comprehensive LLM fine-tuning platform
- Search-R1: RL framework for reasoning + search capabilities
- DeepSeek-R1: Inspiration for reasoning model architecture
This project is licensed under the Apache License 2.0. See the LICENSE files in individual components for details.
Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests to help improve this research framework.
If you have any queries, feel free to contact: Shulin Tian ([email protected]) & Ruiqi Wang ([email protected])
@misc{tian2025egor1chainoftoolthoughtultralongegocentric,
title={Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning},
author={Shulin Tian and Ruiqi Wang and Hongming Guo and Penghao Wu and Yuhao Dong and Xiuying Wang and Jingkang Yang and Hao Zhang and Hongyuan Zhu and Ziwei Liu},
year={2025},
eprint={2506.13654},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.13654},
}