VAST is implemented in PyTorch. We use Python 3.9 and CUDA 11.7; other versions may also be compatible. The other required packages are listed in preinstall.sh.
conda create -n vast python=3.9
conda activate vast
sh preinstall.sh
Create a directory named pretrained_weights under the main working directory.
1. Download the EVA-CLIP weight:
wget -P pretrained_weights/clip/ https://huggingface.co/QuanSun/EVA-CLIP/resolve/main/EVA01_CLIP_g_14_psz14_s11B.pt
2. Download the BEATs weight from https://github.com/microsoft/unilm/tree/master/beats
3. Download the BERT weight:
from transformers import BertModel, BertTokenizer
# Download bert-base-uncased (model and tokenizer)
bert = BertModel.from_pretrained('bert-base-uncased')
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Save both into the same local directory under pretrained_weights/
bert.save_pretrained('pretrained_weights/bert/bert-base-uncased')
bert_tokenizer.save_pretrained('pretrained_weights/bert/bert-base-uncased')
The resulting pretrained_weights directory should look as follows:
    ├── pretrained_weights
    │   ├── beats
    │   │   └── BEATs_iter3_plus_AS2M.pt
    │   ├── bert
    │   │   └── bert-base-uncased
    │   ├── clip
    │   │   └── EVA01_CLIP_g_14_psz14_s11B.pt
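Before moving on, it can help to confirm that the weights landed where the tree above expects them. The following is a minimal sketch, not part of the VAST code base: the helper name `missing_weights` is hypothetical, and the expected entries are copied from the tree above.

```python
from pathlib import Path

# Expected entries, copied from the directory tree above.
EXPECTED = [
    "beats/BEATs_iter3_plus_AS2M.pt",
    "bert/bert-base-uncased",
    "clip/EVA01_CLIP_g_14_psz14_s11B.pt",
]


def missing_weights(root="pretrained_weights"):
    """Return the expected entries that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_weights()
    if missing:
        print("Missing under pretrained_weights/:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected pretrained weights were found.")
```

Run it from the main working directory; it only checks that the paths exist, not that the files are complete or uncorrupted.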
Create a directory named output under the main working directory.
1. Download the VAST model (optional, for fine-tuning):
[Google Drive Link] [Baidu Cloud Link]
2. Download the vision captioner (optional, for labeling images/videos):
[Google Drive Link] [Baidu Cloud Link]
3. Download the audio captioner (optional, for labeling audio):
[Google Drive Link] [Baidu Cloud Link]
The resulting output directory should look as follows:
    ├── output
    │   ├── vast
    │   │   ├── pretrain_vast
    │   │   ├── vision_captioner
    │   │   └── audio_captioner
[Google Drive Link] [Baidu Cloud Link]
Raw videos can be downloaded from YouTube.
Create a directory named datasets under the main working directory.
[Google Drive Link] [Baidu Cloud Link]
The resulting datasets directory should look as follows:
    ├── datasets
    │   ├── annotations
    │   │   ├── msrvtt
    │   │   ├── ...
    │   │   └── msvd
    │   ├── srcdata
    │   │   ├── msrvtt
    │   │   ├── ...
    │   │   └── msvd
The srcdata (images, videos, and audio) must be collected by yourself.
- finetune retrieval tasks
sh scripts/vast/finetune_ret.sh
- finetune captioning tasks
sh scripts/vast/finetune_cap.sh
- finetune QA tasks
sh scripts/vast/finetune_qa.sh
sh scripts/pretrain_vast.sh
For example, suppose the command for fine-tuning the retrieval model is as follows:
python3 -m torch.distributed.launch \
--nnodes 1 \
--node_rank 0 \
--nproc_per_node 8 \
--master_port 9834 \
./run.py \
--learning_rate 2e-5 \
--checkpointing true \
--first_eval true \
--save_best true \
--config ./config/vast/finetune_cfg/retrieval-msrvtt.json \
--pretrain_dir $output_dir \
--output_dir $output_dir/downstream/retrieval-msrvtt \
To test the model, append the following two lines to the command:
--mode 'testing' \
--checkpoint /PATH/TO/SAVED_CHECKPOINT.pt
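If the command is launched from a Python wrapper rather than a shell script, the same change amounts to appending four arguments. This is only an illustrative sketch: the helper `to_test_cmd` is hypothetical, and the value used for `output_dir` is an assumption standing in for `$output_dir`, which the example above leaves to the script.

```python
import shlex

# Assumption: stands in for $output_dir, which is set elsewhere in the script.
output_dir = "output/vast/pretrain_vast"

# The fine-tuning command from the example above, as an argument list.
finetune_cmd = [
    "python3", "-m", "torch.distributed.launch",
    "--nnodes", "1",
    "--node_rank", "0",
    "--nproc_per_node", "8",
    "--master_port", "9834",
    "./run.py",
    "--learning_rate", "2e-5",
    "--checkpointing", "true",
    "--first_eval", "true",
    "--save_best", "true",
    "--config", "./config/vast/finetune_cfg/retrieval-msrvtt.json",
    "--pretrain_dir", output_dir,
    "--output_dir", output_dir + "/downstream/retrieval-msrvtt",
]


def to_test_cmd(cmd, checkpoint):
    """Return a copy of cmd with the two testing arguments appended."""
    return [*cmd, "--mode", "testing", "--checkpoint", checkpoint]


if __name__ == "__main__":
    # Print the command for inspection instead of executing it.
    print(shlex.join(to_test_cmd(finetune_cmd, "/PATH/TO/SAVED_CHECKPOINT.pt")))
```

The sketch prints the command rather than running it, so the result can be checked before it is passed to something like `subprocess.run`.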
You need to prepare:
1) a folder containing all the videos/images or audio files;
2) a meta.json file composed of [{'video_id':'09WssDay9FE_1'},{'video_id':'09WssDay9FE_2'},...];
and then write the config file.
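The meta.json described above can be generated from the media folder with a short script. The sketch below rests on two assumptions that the text does not confirm: that each `video_id` is the file name without its extension, and that the listed extensions cover your data. The function name `build_meta` is hypothetical.

```python
import json
from pathlib import Path

# Assumed set of media extensions; adjust to match your data.
MEDIA_EXTS = {".mp4", ".avi", ".mkv", ".webm", ".jpg", ".jpeg", ".png", ".wav", ".mp3"}


def build_meta(media_dir, meta_path="meta.json"):
    """Write [{"video_id": <file name without extension>}, ...] to meta_path."""
    entries = [
        {"video_id": p.stem}  # assumption: the ID is the file stem
        for p in sorted(Path(media_dir).iterdir())
        if p.is_file() and p.suffix.lower() in MEDIA_EXTS
    ]
    with open(meta_path, "w") as f:
        json.dump(entries, f)
    return entries
```

For example, `build_meta("/PATH/TO/MEDIA_FOLDER", "meta.json")` writes one entry per media file, in sorted order. Note that `json.dump` emits double-quoted strings, which is what a JSON parser expects.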
sh scripts/vast/vision_captioner.sh
sh scripts/vast/audio_captioner.sh
--train_vision_sample_num
--test_vision_sample_num
--train_audio_sample_num
--test_audio_sample_num
--train_task
--test_task
--learning_rate
--train_batch_size
--test_batch_size
--train_epoch
--train_steps
--checkpointing
--frozen_vision
--valid_freq
--beam_size
If you find this code useful for your research, please consider citing:
@article{chen2024vast,
  title={Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset},
  author={Chen, Sihan and Li, Handong and Wang, Qunbo and Zhao, Zijia and Sun, Mingzhen and Zhu, Xinxin and Liu, Jing},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}
This project is released under the MIT license.
For the full list of third-party licenses used in this project, please see the THIRD_PARTY_LICENSES.md file.

