Welcome to the implementation of Vision-and-Language Navigation in Continuous Environments (VLN-CE) models in the AI2THOR simulator. This project aims to push the boundaries of embodied AI by integrating recent advances in the field, leveraging state-of-the-art object detection, and developing robust navigation strategies.
Vision-and-Language Navigation (VLN) tasks challenge agents to interpret natural language instructions and navigate complex, photorealistic environments. This repository focuses on:
- Implementing VLN-CE models in the AI2THOR simulation environment, with data collected either manually (interactive keyboard navigation through the scenes) or generated automatically; a minimal manual-control sketch follows this list.
- Integrating real-time object detection and scene understanding using YOLOv5s.
- Exploring and optimizing advanced navigation strategies.
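To give a feel for the manual collection loop, here is a minimal sketch that maps a few keys to AI2THOR actions with pynput. The key bindings and printed output are illustrative assumptions, not the repository's actual manual_navigator.py:

```python
# Minimal keyboard-driven navigation sketch (illustrative; not the repo's manual_navigator.py).
from ai2thor.controller import Controller
from pynput import keyboard

controller = Controller(scene="FloorPlan1")  # any AI2THOR scene name works here

# Hypothetical key bindings; adjust to taste.
KEY_TO_ACTION = {
    "w": "MoveAhead",
    "s": "MoveBack",
    "a": "RotateLeft",
    "d": "RotateRight",
}

def on_press(key):
    try:
        action = KEY_TO_ACTION.get(key.char)
    except AttributeError:
        # Special keys (Esc, arrows, ...) have no .char; stop listening on Esc.
        return key != keyboard.Key.esc
    if action:
        event = controller.step(action=action)
        # event.frame holds the RGB observation (numpy array) you can log for a dataset.
        print(action, "success:", event.metadata["lastActionSuccess"])
    return True

with keyboard.Listener(on_press=on_press) as listener:
    listener.join()
```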
- Real-time Object Detection: Utilizes pretrained models (e.g., YOLOv5s) for fast and accurate object recognition.
- Custom Navigation Strategies: Implements and compares Greedy Lookahead, Lin-Kernighan, Dynamic Programming, and A* algorithms, all informed by detected objects.
- Action Sequence Optimization: Seeks to minimize the number of actions required to reach the destination.
- Modular Dataset Handling: Supports custom dataset generation and fine-tuning.
- Extensible Framework: Designed for easy integration of new models, strategies, and research ideas.
- AI2THOR: Interactive 3D environment for embodied AI research.
- YOLOv5s: Real-time object detection (see the detection sketch after this list).
- PyTorch: Deep learning framework.
- OpenCV: Image processing and computer vision.
- Pynput: Keyboard control for manual navigation.
- Tkinter: GUI for object selection.
- NumPy, Pandas, Matplotlib, Seaborn: Data handling and visualization.
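For the detection side of this stack, the sketch below loads stock YOLOv5s weights through torch.hub and runs them on a single AI2THOR frame; the project's custom_detect.py may load weights and post-process results differently:

```python
# Run pretrained YOLOv5s on a single AI2THOR frame (illustrative sketch).
import torch
from ai2thor.controller import Controller

# Stock YOLOv5s via torch.hub; the project may instead load local or fine-tuned weights.
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

controller = Controller(scene="FloorPlan1")
event = controller.step(action="RotateRight")

# event.frame is an RGB numpy array; YOLOv5 accepts it directly.
results = model(event.frame)
detections = results.pandas().xyxy[0]  # columns: xmin, ymin, xmax, ymax, confidence, class, name
print(detections[["name", "confidence"]])
```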
- Greedy Lookahead: Selects the next action based on immediate reward and detected objects.
- Lin-Kernighan Heuristic: Applies advanced local search for path optimization.
- Dynamic Programming: Finds optimal action sequences by breaking down navigation into subproblems.
- A*: Uses heuristic search to reach the goal efficiently (see the sketch after this list).
- Customizable: Easily add or modify navigation algorithms.
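As a concrete example of one strategy, here is a minimal A* search over the discrete reachable positions AI2THOR exposes via the GetReachablePositions action, using a Manhattan-distance heuristic. The unit step cost and the placeholder start/goal are simplifying assumptions, not the project's exact implementation:

```python
# Minimal A* over AI2THOR's reachable-position grid (illustrative sketch).
import heapq
from ai2thor.controller import Controller

GRID = 0.25  # default AI2THOR grid size

def astar(start, goal, reachable):
    """A* on (x, z) grid cells; start/goal are (x, z) tuples snapped to the grid."""
    def h(p):  # Manhattan-distance heuristic, measured in grid steps
        return (abs(p[0] - goal[0]) + abs(p[1] - goal[1])) / GRID

    open_heap = [(h(start), 0.0, start, [start])]
    visited = set()
    while open_heap:
        _, g, pos, path = heapq.heappop(open_heap)
        if pos == goal:
            return path
        if pos in visited:
            continue
        visited.add(pos)
        for dx, dz in [(GRID, 0), (-GRID, 0), (0, GRID), (0, -GRID)]:
            nxt = (round(pos[0] + dx, 2), round(pos[1] + dz, 2))
            if nxt in reachable and nxt not in visited:
                heapq.heappush(open_heap, (g + 1 + h(nxt), g + 1, nxt, path + [nxt]))
    return None  # no path found

controller = Controller(scene="FloorPlan1")
positions = controller.step(action="GetReachablePositions").metadata["actionReturn"]
reachable = {(round(p["x"], 2), round(p["z"], 2)) for p in positions}

start, goal = next(iter(reachable)), max(reachable)  # placeholder endpoints for the demo
print(astar(start, goal, reachable))
```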
- AI2THOR Scenes: Rich, interactive environments for training and evaluation.
- Custom Dataset Generation: Scripts for collecting RGB, depth, segmentation, and metadata (a minimal sketch follows this list).
- YOLOv5 Labeling: Automated label generation for object detection fine-tuning.
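A rough sketch of per-step collection, assuming the Controller is created with depth and instance-segmentation rendering enabled; the file names and step counter are illustrative, not the repository's exact scripts:

```python
# Save RGB, depth, instance segmentation, and metadata for one step (illustrative sketch).
import json
import numpy as np
from PIL import Image
from ai2thor.controller import Controller

controller = Controller(
    scene="FloorPlan1",
    renderDepthImage=True,             # enables event.depth_frame
    renderInstanceSegmentation=True,   # enables event.instance_segmentation_frame
)

event = controller.step(action="MoveAhead")
step_id = 0  # hypothetical step counter

Image.fromarray(event.frame).save(f"rgb_{step_id:05d}.png")
np.save(f"depth_{step_id:05d}.npy", event.depth_frame)
Image.fromarray(event.instance_segmentation_frame).save(f"seg_{step_id:05d}.png")
with open(f"meta_{step_id:05d}.json", "w") as f:
    json.dump(event.metadata["objects"], f)  # object poses, types, visibility, ...
```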
- Clone the Repository
git clone https://github.com/yourusername/VLN-CE_pro.git
cd VLN-CE_pro
- Set Up the Environment
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
- Download Pretrained Weights
- Place YOLOv5s weights in the appropriate directory (see yolov5/).
- Run Manual Navigation & Data Collection
python yolov5/src/scripts/manual_navigator.py
- Fine-tune YOLOv5s on Custom Data
python yolov5/src/fine-tune_yolov5s.py
- Run Detection or Navigation Experiments
- See scripts in yolov5/src/ for detection and navigation; a minimal sketch of loading fine-tuned weights follows below.
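After fine-tuning, the custom checkpoint can be loaded the same way as the stock model. The runs/train/exp/weights/best.pt path below is YOLOv5's usual output location and is only an assumption about where your run stores its weights:

```python
# Load fine-tuned YOLOv5 weights for the detection/navigation experiments (illustrative).
import torch

# 'runs/train/exp/weights/best.pt' is YOLOv5's typical output path; adjust to your run.
model = torch.hub.load("ultralytics/yolov5", "custom", path="runs/train/exp/weights/best.pt")
results = model("some_ai2thor_frame.png")  # accepts file paths, numpy arrays, or PIL images
results.print()
```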
VLN-CE_pro/
├── yolov5/
│   ├── src/
│   │   ├── scripts/
│   │   │   └── manual_navigator.py
│   │   ├── custom_detect.py
│   │   └── fine-tune_yolov5s.py
│   ├── runs/
│   ├── models/
│   └── ...
├── notes/
├── requirements.txt
└── README.md
- Transformer-based VLN Models: Integrate recent advances like VLN-BERT, EnvDrop, and HAMT.
- Vision-Language Pretraining: Leverage large-scale pretrained models (e.g., CLIP, BLIP) for improved grounding.
- Curriculum Learning: Gradually increase task difficulty for more robust agents.
- Uncertainty Estimation: Incorporate Bayesian methods for safer navigation.
- Multi-modal Fusion: Combine audio, depth, and semantic maps for richer perception.
- Reinforcement Learning Enhancements: Explore curiosity-driven exploration and hierarchical RL.
- Sim2Real Transfer: Bridge the gap between simulation and real-world deployment.
- Instruction Error Detection: Build on ideas from "Mind the Error!: Detection and Localization of Instruction Errors in Vision-and-Language Navigation".
- AI2THOR team for the simulation environment.
- Ultralytics YOLOv5 for object detection.
- Open-source contributors and the embodied AI research community.
Contact: For questions or collaborations, please open an issue or reach out via Telegram.
