
Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning

Project Page | Paper | Demo | Discord

Shulin Tian*1,2, Ruiqi Wang*1,3, Hongming Guo4, Penghao Wu1, Yuhao Dong1, Xiuying Wang1, Jingkang Yang1, Hao Zhang3, Hongyuan Zhu2, Ziwei Liu1
1S-Lab, Nanyang Technological University  2A*STAR, Singapore  3Simon Fraser University  4Shanghai AI Lab

Ego-R1 is a research framework that combines reinforcement-learning-based tool-use reasoning with ultra-long egocentric video analysis.

🔍 Project Overview

This repository provides:

  • Chain-of-Tool-Thought Generation (cott_gen): Multi-modal AI agents for analyzing egocentric video data with tool-calling capabilities (RAG, Video-LLM, VLM)
  • Ego-R1-Agent: Reinforcement learning framework for training LLMs that interleave multi-turn reasoning with tool use
  • Ego-R1 Dataset: 25K Chain-of-Tool-Thought examples and 4.4K QA pairs

🌟 Key Features

  • Multi-modal Tool-Augmented Reasoning: Combines RAG search, Video-LLM, and Vision-Language Models for long video understanding. Agents learn to use multiple tools to decompose and answer complex egocentric video questions
  • Reinforcement Learning: GRPO training for interleaved reasoning-and-acting behavior
  • Comprehensive Dataset: We release the code for CoTT data generation along with pre-processed data for both SFT and RL training

📰 News

  • [2025.6.8] Officially launched the Ego-R1 codebase.

📁 Repository Structure

Ego-R1/
├── cott_gen/               # Chain-of-Tool-Thought generation for egocentric video QA
│   ├── main.py             # Main agent runner with multi-turn reasoning
│   ├── tools.py            # Tool implementations (RAG, Video-LLM, VLM)
│   ├── utils.py            # Utility functions and data processing
│   ├── prompts.py          # System and reasoning prompts
│   ├── postprocess.py      # Data postprocessing and analysis
│   └── environment.yml     # Conda environment for autogen
├── LLaMA-Factory/          # LLM fine-tuning framework (submodule)
├── Ego-R1-Agent/           # RL framework for reasoning + search LLMs
│   ├── train_grpo.sh       # GRPO training script
│   ├── train_ppo.sh        # PPO training script
│   ├── eval/               # Inference and evaluation scripts
│   └── verl/               # veRL framework components
├── data/                   # Ego-R1 dataset (should be downloaded from HF)
│   ├── Ego-CoTT-25K/       # 25K Chain-of-Tool-Thought for SFT
│   ├── Ego-QA-4.4K/        # 4.4K QA pairs for RL training
│   └── Ego-CoTT-raw/       # Raw data in multiple formats
├── scripts/                # Training and generation scripts
│   ├── train/              # SFT training scripts
│   └── gen/                # Data generation scripts
└── api/                    # API components for RAG and visual tools
    ├── rag/                # RAG-related API components
    └── visual_tools/       # Multi-modal visual tool APIs

🔧 Installation

Download Ego-R1-Data

huggingface-cli download Ego-R1/Ego-R1-Data --local-dir data --repo-type dataset
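
The same download can also be done from Python with huggingface_hub, which the CLI above wraps; a minimal sketch, using the repo id and target directory from the command above:

from huggingface_hub import snapshot_download

# Download the Ego-R1 dataset repository into ./data, mirroring the CLI command above.
snapshot_download(repo_id="Ego-R1/Ego-R1-Data", repo_type="dataset", local_dir="data")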

Environment Setup

0. Toolbox API Environment

i. Set Environment

cd api/rag
pip install -e .

Make sure to install FFmpeg beforehand, as it is required for the visual tools to function properly.

ii. Prepare the Data for the Egoschema and Videomme Benchmarks

huggingface-cli download Ego-R1/h-rag_database --local-dir data --repo-type dataset

Unzip the Videomme and Egoschema videos.

iii. Setup API

  • Set GPT Key

    export AZURE_OPENAI_ENDPOINT=ENDPOINT
    export AZURE_OPENAI_API_KEY=KEY
  • Start RAG

    • For Egolife/Ego-R1:

      • Set video directory in rag/configs/egolife.yaml:
        base:
          data_dir: data/egolife # set to h-rag_database/egolife
      • Run:
        python api_for_egolife.py
    • For Egoschema:

      • Run:
        python api_for_egoschema.py --min_log_dir=h-rag_database/egoschema --port 6001 # default
    • For Videomme:

      • Run:
        python api_for_videomme.py --min_log_dir=h-rag_database/videomme/videomme_10min --sec_log_dir=h-rag_database/videomme/videomme_30s --port 7001 # default

iv. Start Visual API

  • Set Config

    • Set video directory in visual_tools/configs.yaml for EgoLife, Egoschema, and Videomme videos separately:
      data_dir: "/path/to/egolife"
      data_dir: "/path/to/videomme"
      data_dir: "/path/to/egoschema"
    • Set any number of Gemini API keys:
      gemini_api_keys: ["your-gemini-api-key-1", "your-gemini-api-key-2"]
  • Run API

    • For any visual API, run:
      python api.py
    • For LLaVA-based VideoLLM, run the LLaVA API first:
      python xxxx_videollm_llava/llava_video.py
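
As a minimal sketch of how several Gemini keys from visual_tools/configs.yaml could be rotated to spread requests across quotas, the snippet below cycles through clients built with the google-genai package; the round-robin strategy and the model name are illustrative assumptions, not necessarily what api.py does.

import itertools

import yaml
from google import genai  # installed via `pip install google-genai`

with open("visual_tools/configs.yaml") as f:
    cfg = yaml.safe_load(f)

# One client per key, cycled round-robin; hypothetical strategy, api.py may differ.
clients = itertools.cycle([genai.Client(api_key=k) for k in cfg["gemini_api_keys"]])

def ask_gemini(prompt: str) -> str:
    client = next(clients)
    response = client.models.generate_content(
        model="gemini-2.0-flash",  # hypothetical model choice
        contents=prompt,
    )
    return response.text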

1. CoTT-Data-Generation Environment

# One-line installation
cd cott_gen
conda env create -f environment.yml
conda activate autogen


# Or install step by step:
# conda create -n autogen python=3.10
# conda activate autogen
# pip install -U autogenstudio==0.6.1
# pip install future google-genai

2. SFT (LLaMA-Factory) Environment

cd LLaMA-Factory
pip install -e ".[torch,metrics]"

3. RL (Ego-R1-Agent) Environment

conda create -n egor1 python=3.9
conda activate egor1

# Install PyTorch (optional - vllm can handle this)
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121

# verl
pip install -e .

# flash attention 2
pip3 install flash-attn --no-build-isolation
pip install wandb google-genai

You can follow Search-R1 to build the environment as well.

🚀 Quick Start

Inference

1. Test the model

bash Ego-R1-Agent/utils/serve.sh

2. Inference on the benchmark

conda activate egor1

# with a summary model
bash Ego-R1-Agent/eval/infer_bench_summ.sh

# or use the basic inference script
# python infer.py --arg1 xxx --arg2 xxx

Training

1. Supervised Fine-Tuning (SFT)

# Prepare data
mkdir -p LLaMA-Factory/data 
cp data/Ego-CoTT-25K/train-cott.json LLaMA-Factory/data/

# Train model
conda activate llamafactory
cd LLaMA-Factory
llamafactory-cli train examples/train_full/qwen.yaml

2. Reinforcement Learning Training

# Prepare data
mkdir -p Ego-R1-Agent/data
cp data/Ego-CoTT-raw/*.parquet Ego-R1-Agent/data/

# Start RL training
conda activate egor1
cd Ego-R1-Agent
bash train_grpo.sh  # For GRPO training

3. Chain-of-Tool-Thought Generation

# Generate reasoning traces with multi-modal tools
conda activate autogen
bash scripts/gen/run_data_gen.sh

🔬 Usage Examples

Multi-Modal Reasoning Process

The Ego-R1 agent uses a structured chain-of-tool-thought approach:

  1. Think: Analyze the question and plan the reasoning approach
  2. RAG Search: Retrieve relevant context from video databases across different time granularities
  3. Video-LLM: Analyze specific video segments for detailed understanding
  4. VLM: Extract visual details from specific frames when needed
  5. Answer: Provide reasoned response based on collected evidence
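
For intuition, here is a minimal sketch of this loop in plain Python; the tool signatures mirror the JSON examples in the next subsection, but the function bodies, message format, and turn budget are placeholders, not the actual implementation in cott_gen/main.py (which drives the tools through autogen).

# Conceptual sketch of the think -> tool -> answer loop; all names are placeholders.
from typing import Callable

# Placeholder tools; the repository serves RAG, Video-LLM, and VLM through APIs in api/.
def rag(level: str, keywords: list, start_time: str, query_time: str) -> str: ...
def video_llm(question: str, range: str) -> str: ...
def vlm(question: str, timestamp: str) -> str: ...

TOOLS: dict = {"rag": rag, "video_llm": video_llm, "vlm": vlm}

def answer_question(question: str, llm_step: Callable) -> str:
    """Run the chain-of-tool-thought loop until the model emits a final answer."""
    history = [{"role": "user", "content": question}]
    for _ in range(10):                          # cap on tool-use turns (arbitrary)
        step = llm_step(history)                 # model thinks and picks an action
        history.append({"role": "assistant", "content": step})
        if step["type"] == "answer":             # evidence is sufficient
            return step["content"]
        observation = TOOLS[step["name"]](**step["arguments"])
        history.append({"role": "tool", "name": step["name"], "content": observation})
    return "No answer within the turn budget."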

Tool Usage Examples

RAG Search

{
    "name": "rag",
    "arguments": {
        "level": "day",  # or "week", "hour"
        "keywords": ["cooking", "kitchen"],
        "start_time": "DAY1_11210217",
        "query_time": "DAY1_11220217"
    }
}

Video Analysis

{
    "name": "video_llm", 
    "arguments": {
        "question": "What cooking action is being performed?",
        "range": "DAY1_11210217-DAY1_11220217"
    }
}

Image Analysis

{
    "name": "vlm",
    "arguments": {
        "question": "What objects are visible on the table?",
        "timestamp": "DAY1_11210217"
    }
}
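
Continuing the placeholder sketch above, a parsed tool call such as the RAG example is dispatched by name, with its arguments field unpacked into the handler:

# Route the RAG example above through the placeholder TOOLS registry.
call = {
    "name": "rag",
    "arguments": {
        "level": "day",
        "keywords": ["cooking", "kitchen"],
        "start_time": "DAY1_11210217",
        "query_time": "DAY1_11220217",
    },
}
observation = TOOLS[call["name"]](**call["arguments"])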

📊 Dataset

Ego-CoTT-25K

  • Size: 25,000 examples (415MB)
  • Format: Multi-turn conversations with tool calls
  • Purpose: Supervised fine-tuning
  • Tools: RAG, Video-LLM, VLM integration

Ego-QA-4.4K

  • Size: 4,400 QA pairs
  • Sources: 1.5K Gemini-generated + 2.9K manual annotations
  • Agents: 6 different identities (A1-A6)
  • Purpose: Rule-based reinforcement learning training or generating CoTT from scratch
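
A quick way to inspect the downloaded splits locally; only the file locations come from the repository structure above, and the record schema is printed rather than assumed (the sketch assumes train-cott.json holds a list of records):

import glob
import json

import pandas as pd

# 25K chain-of-tool-thought conversations used for SFT
with open("data/Ego-CoTT-25K/train-cott.json") as f:
    cott = json.load(f)
print(len(cott), "SFT examples; first record keys:", list(cott[0].keys()))

# Parquet shards used for RL training with Ego-R1-Agent
for path in glob.glob("data/Ego-CoTT-raw/*.parquet"):
    df = pd.read_parquet(path)
    print(path, df.shape, list(df.columns))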

🙏 Acknowledgments

This project builds upon several excellent open-source frameworks:

  • autogen: Foundation for multi-agent conversations and tool calling
  • veRL: Reinforcement learning framework for LLM training
  • LLaMA-Factory: Comprehensive LLM fine-tuning platform
  • Search-R1: RL framework for reasoning + search capabilities
  • DeepSeek-R1: Inspiration for reasoning model architecture

📄 License

This project is licensed under the Apache License 2.0. See the LICENSE files in individual components for details.

🀝 Contributing

Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests to help improve this research framework.

👨‍💻 Authors & Contact

If you have any queries, feel free to contact: Shulin Tian ([email protected]) & Ruiqi Wang ([email protected])

📚 Citation

@misc{tian2025egor1chainoftoolthoughtultralongegocentric,
      title={Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning}, 
      author={Shulin Tian and Ruiqi Wang and Hongming Guo and Penghao Wu and Yuhao Dong and Xiuying Wang and Jingkang Yang and Hao Zhang and Hongyuan Zhu and Ziwei Liu},
      year={2025},
      eprint={2506.13654},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.13654}, 
}
