
VLN-CE on AI2THOR: Advancing Vision-and-Language Navigation

Welcome to the implementation of Vision-and-Language Navigation in Continuous Environments (VLN-CE) models in the AI2THOR simulator. This project aims to push the boundaries of embodied AI by integrating recent advances in the field, leveraging state-of-the-art object detection, and developing robust navigation strategies.


🚀 Introduction

Vision-and-Language Navigation (VLN) tasks challenge agents to interpret natural language instructions and navigate complex, photorealistic environments. This repository focuses on:

  • Implementing VLN-CE models in the AI2THOR simulation environment, with trajectory data collected either manually (interactive keyboard navigation through scenes) or through automatic generation; a minimal controller setup is sketched after this list.
  • Integrating real-time object detection and scene understanding using YOLOv5s.
  • Exploring and optimizing advanced navigation strategies.
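
Everything in this project runs on top of a stepped AI2THOR controller. As a minimal sketch (the scene name and grid settings below are illustrative, not the project's configuration), launching a scene and issuing discrete VLN-style actions looks roughly like this:

    import ai2thor.controller

    # Launch an AI2THOR scene (illustrative settings).
    controller = ai2thor.controller.Controller(
        scene="FloorPlan1",   # a kitchen scene shipped with AI2THOR
        gridSize=0.25,        # translation step in metres
        width=640,
        height=480,
    )

    # Step the agent with discrete VLN-style actions and read the observation.
    event = controller.step(action="MoveAhead")
    event = controller.step(action="RotateRight", degrees=90)

    rgb = event.frame                                # HxWx3 uint8 RGB frame
    agent_pos = event.metadata["agent"]["position"]  # {'x': ..., 'y': ..., 'z': ...}
    print(agent_pos, rgb.shape)

    controller.stop()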

✨ Features

  • Real-time Object Detection: Utilizes pretrained models (e.g., YOLOv5s) for fast and accurate object recognition (see the detection sketch after this list).
  • Custom Navigation Strategies: Implements and compares Greedy Lookahead, Lin-Kernighan, Dynamic Programming, and A* algorithms, all informed by detected objects.
  • Action Sequence Optimization: Seeks to minimize the number of actions required to reach the destination.
  • Modular Dataset Handling: Supports custom dataset generation and fine-tuning.
  • Extensible Framework: Designed for easy integration of new models, strategies, and research ideas.
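
For the detection feature, the following is a minimal stand-in for the repo's custom_detect.py: it loads YOLOv5s through the public Ultralytics torch.hub entry point and runs it on a single simulator frame. The detect_objects helper is hypothetical, named here only for illustration:

    import torch

    # Load pretrained YOLOv5s from the official Ultralytics hub.
    model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

    def detect_objects(rgb_frame):
        """Run YOLOv5s on one AI2THOR RGB frame (HxWx3 numpy array)."""
        results = model(rgb_frame)
        # One row per detection: xmin, ymin, xmax, ymax, confidence, class, name
        return results.pandas().xyxy[0]

    # Example: feed the latest simulator frame to the detector.
    # detections = detect_objects(event.frame)
    # print(detections[["name", "confidence"]])

The returned dataframe (box coordinates, confidence, class name) is the kind of signal the navigation strategies below consume.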

Figure (sample object detection): real-time object detection and labeling in AI2THOR scenes using YOLOv5s.

🛠️ Technologies & Libraries

  • AI2THOR: Interactive 3D environment for embodied AI research.
  • YOLOv5s: Real-time object detection.
  • PyTorch: Deep learning framework.
  • OpenCV: Image processing and computer vision.
  • Pynput: Keyboard control for manual navigation.
  • Tkinter: GUI for object selection.
  • NumPy, Pandas, Matplotlib, Seaborn: Data handling and visualization.

🧭 Navigation Strategies

  • Greedy Lookahead: Selects the next action based on immediate reward and detected objects.
  • Lin-Kernighan Heuristic: Applies advanced local search for path optimization.
  • Dynamic Programming: Finds optimal action sequences by breaking down navigation into subproblems.
  • A*: Uses heuristic-guided search to reach the goal efficiently (a sketch follows this list).
  • Customizable: Easily add or modify navigation algorithms.
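
As an illustration, here is a small A* sketch over the (x, z) grid of reachable positions (such a set can be built from AI2THOR's GetReachablePositions action). The function and its signature are illustrative, not the repository's actual implementation:

    import heapq
    import math

    def astar(start, goal, reachable, step=0.25):
        """A* over a set of reachable (x, z) grid positions.

        `step` should match the controller's gridSize.
        """
        def h(p):                                # Euclidean heuristic
            return math.dist(p, goal)

        open_set = [(h(start), start)]
        came_from, g = {}, {start: 0.0}
        while open_set:
            _, cur = heapq.heappop(open_set)
            if cur == goal:
                path = [cur]
                while cur in came_from:          # walk parents back to start
                    cur = came_from[cur]
                    path.append(cur)
                return path[::-1]
            for dx, dz in ((step, 0), (-step, 0), (0, step), (0, -step)):
                nxt = (round(cur[0] + dx, 2), round(cur[1] + dz, 2))
                if nxt not in reachable:
                    continue
                tentative = g[cur] + step
                if tentative < g.get(nxt, float("inf")):
                    came_from[nxt], g[nxt] = cur, tentative
                    heapq.heappush(open_set, (tentative + h(nxt), nxt))
        return None                              # goal unreachable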

📦 Dataset

  • AI2THOR Scenes: Rich, interactive environments for training and evaluation.
  • Custom Dataset Generation: Scripts for collecting RGB, depth, segmentation, and metadata (a collection sketch follows this list).
  • YOLOv5 Labeling: Automated label generation for object detection fine-tuning.
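
A minimal single-sample collection sketch follows; the file names are illustrative, and the extra render passes have to be enabled when the controller is created:

    import json
    import numpy as np
    from ai2thor.controller import Controller

    # Enable the extra render passes needed for dataset collection.
    controller = Controller(
        scene="FloorPlan1",
        renderDepthImage=True,
        renderInstanceSegmentation=True,
    )

    event = controller.step(action="MoveAhead")

    # One sample: RGB, depth, instance segmentation, and agent metadata.
    np.save("rgb_0000.npy", event.frame)                        # HxWx3 uint8
    np.save("depth_0000.npy", event.depth_frame)                # HxW float32, metres
    np.save("seg_0000.npy", event.instance_segmentation_frame)  # HxWx3 uint8
    with open("meta_0000.json", "w") as f:
        json.dump(event.metadata["agent"], f)

    controller.stop()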

🚀 Getting Started

  1. Clone the Repository
    git clone https://github.com/yourusername/VLN-CE_pro.git
    cd VLN-CE_pro
  2. Set Up the Environment
    python3 -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
  3. Download Pretrained Weights
    • Place YOLOv5s weights in the appropriate directory (see yolov5/).
  4. Run Manual Navigation & Data Collection (a simplified key-handling sketch follows these steps)
    python yolov5/src/scripts/manual_navigator.py
  5. Fine-tune YOLOv5s on Custom Data
    python yolov5/src/fine-tune_yolov5s.py
  6. Run Detection or Navigation Experiments
    • See scripts in yolov5/src/ for detection and navigation.
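
For orientation, here is a simplified sketch of the kind of key-handling loop a manual navigator can build with Pynput. The key bindings are assumptions for illustration and may differ from those in manual_navigator.py:

    from ai2thor.controller import Controller
    from pynput import keyboard

    controller = Controller(scene="FloorPlan1", gridSize=0.25)

    # Assumed key bindings; the repo's manual_navigator.py may differ.
    KEYMAP = {
        "w": dict(action="MoveAhead"),
        "s": dict(action="MoveBack"),
        "a": dict(action="RotateLeft", degrees=90),
        "d": dict(action="RotateRight", degrees=90),
    }

    def on_press(key):
        try:
            action = KEYMAP.get(key.char)
        except AttributeError:
            # Special keys (Shift, Esc, ...) have no .char;
            # returning False stops the listener.
            return key != keyboard.Key.esc
        if action:
            controller.step(**action)

    with keyboard.Listener(on_press=on_press) as listener:
        listener.join()          # runs until Esc is pressed
    controller.stop()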

🗂️ Project Structure

VLN-CE_pro/
├── yolov5/
│   ├── src/
│   │   ├── scripts/
│   │   │   └── manual_navigator.py
│   │   ├── custom_detect.py
│   │   └── fine-tune_yolov5s.py
│   ├── runs/
│   ├── models/
│   └── ...
├── notes/
├── requirements.txt
└── README.md

🔮 Future Work & Novelties

  • Transformer-based VLN Models: Integrate recent advances like VLN-BERT, EnvDrop, and HAMT.
  • Vision-Language Pretraining: Leverage large-scale pretrained models (e.g., CLIP, BLIP) for improved grounding.
  • Curriculum Learning: Gradually increase task difficulty for more robust agents.
  • Uncertainty Estimation: Incorporate Bayesian methods for safer navigation.
  • Multi-modal Fusion: Combine audio, depth, and semantic maps for richer perception.
  • Reinforcement Learning Enhancements: Explore curiosity-driven exploration and hierarchical RL.
  • Sim2Real Transfer: Bridge the gap between simulation and real-world deployment.
  • Instruction Error Handling: Detect and localize instruction errors, as proposed in "Mind the Error! Detection and Localization of Instruction Errors in Vision-and-Language Navigation".

🙏 Acknowledgements

  • AI2THOR team for the simulation environment.
  • Ultralytics YOLOv5 for object detection.
  • Open-source contributors and the embodied AI research community.

Contact: For questions or collaborations, please open an issue or reach out via Telegram.
