Welcome! 👋 This is the official repository for our ACL 2025 main paper:
**Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates** by Jaewoo Ahn\*, Heeseung Yun\*, Dayoon Ko, and Gunhee Kim.
We recommend using Anaconda. The following command creates a new conda environment `MAC` with all of the dependencies:

```bash
conda env create -f environment.yml
```

To activate the environment:

```bash
conda activate MAC
```

Additionally, install the English language model for spaCy:

```bash
python -m spacy download en_core_web_sm
```
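To verify that the environment is ready, you can run a quick sanity check. This is an illustrative snippet, not part of the repository:

```python
# Hypothetical sanity check: confirms spaCy and its English model load correctly.
import spacy

nlp = spacy.load("en_core_web_sm")  # the model installed above
doc = nlp("A dog chases a ball in the park.")
print([(token.text, token.pos_) for token in doc])  # e.g. [('A', 'DET'), ('dog', 'NOUN'), ...]
```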
Please refer to `DATASET.md` for setup instructions for COCO, MSR-VTT, and AudioCaps.
To use the LanguageBind model, please run the following commands:

```bash
git submodule update --init --recursive
cd dataset_processing/LanguageBind

# Create a separate environment for LanguageBind
conda create -n LanguageBind python=3.10.10
conda activate LanguageBind

# Install required dependencies
pip install torch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1
pip install -r requirements.txt

# Return to the main environment
conda activate MAC
```
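After installation, a minimal check that the pinned PyTorch build is active inside the `LanguageBind` environment (illustrative only; run it before switching back to `MAC`):

```python
# Hypothetical check: run inside the LanguageBind environment.
import torch
import torchvision

print(torch.__version__)          # expected: 1.13.1 (possibly with a CUDA suffix, e.g. 1.13.1+cu117)
print(torchvision.__version__)    # expected: 0.14.1
print(torch.cuda.is_available())  # True if a GPU is visible
```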
```bash
# Deceptive-General Prompt (Zero-Shot)
sh scripts/zero_shot.sh coco clip

# + Self-Train + Large-N Distilled + Diversity-Promoted (Ours)
sh scripts/self_train_with_large_N_ours.sh coco clip
```
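For intuition, the sketch below shows one way to test whether a rewritten caption "deceives" CLIP, i.e., whether CLIP scores the adversarial caption at least as high as the original for the same image. This is an illustrative reimplementation using Hugging Face's `transformers` CLIP, not the repository's evaluation code; the captions, image path, and success criterion are placeholders:

```python
# Illustrative only: checks whether CLIP prefers an adversarial caption over the original.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image path
captions = [
    "A dog chases a ball in the park.",  # original caption
    "A ball chases a dog in the park.",  # adversarial rewrite (placeholder)
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image[0]  # image-text similarity scores

# The attack "succeeds" here if CLIP scores the rewrite at least as high as the original.
print(f"original: {logits[0]:.3f}, adversarial: {logits[1]:.3f}, deceived: {bool(logits[1] >= logits[0])}")
```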
A LoRA checkpoint (ASR_total: 42.1%) fine-tuned from LLaMA-3.1-8B to deceive CLIP on the COCO dataset.
Replace the model loading part with the following code (`torch`, `attn_implementation`, and `device` are already defined in the surrounding script):

```python
# Load the LoRA adapter directly from the Hugging Face Hub instead of a local checkpoint.
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    'ahnpersie/llama3.1-8b-lora-coco-deceptive-clip',  # changed from local "model_checkpoint_dir" to HuggingFace repo
    torch_dtype=torch.bfloat16,
    attn_implementation=attn_implementation,
    device_map=device,
)
```
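A minimal usage sketch follows, assuming the tokenizer is loaded from the base model (`meta-llama/Llama-3.1-8B-Instruct` is our assumption here; use whichever base model the script already loads, and note the prompt is a placeholder rather than the repository's actual prompt format):

```python
# Illustrative generation with the loaded LoRA model; prompt format is a placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # assumed base model
inputs = tokenizer("Rewrite this caption: a dog chases a ball.", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```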
```bash
sh scripts/generate_evaluate_iter1_example.sh
```
Note: All experiments were conducted on a single NVIDIA RTX A6000 GPU (48GB VRAM).
If you have any questions, feel free to reach out to Jaewoo Ahn ([email protected]) or Heeseung Yun ([email protected]).