AR + Diffusion Architecture: Similar to BLIP3o, BLIP3o-NEXT first generates intermediate features with the autoregressive model, then conditions the diffusion model on these features to generate images.
Discrete Image Token Supervision: We add discrete SigLIP-2 image token prediction as extra training supervision, jointly optimizing a cross-entropy loss and the diffusion objective. By having the AR model lay down a discrete "blueprint" and feeding its hidden representations into the diffusion model, we combine structural accuracy with high visual fidelity in the generated images.
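The joint objective described above can be sketched in plain Python. This is a minimal illustration, not the training code in this repo: it assumes the diffusion objective is a mean-squared error between a predicted and a target vector, and that the two terms are combined as a weighted sum. The names `joint_loss` and `diffusion_weight` are hypothetical.

```python
import math


def cross_entropy(logits, target_ids):
    """Mean cross-entropy over a sequence of discrete image tokens.

    logits: one row of unnormalized scores per token position.
    target_ids: index of the ground-truth token at each position.
    """
    total = 0.0
    for row, target in zip(logits, target_ids):
        # Numerically stable log-sum-exp.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
    return total / len(target_ids)


def mse(pred, target):
    """Mean-squared error, used here as a stand-in diffusion objective (assumption)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(logits, target_ids, pred, target, diffusion_weight=1.0):
    """Weighted sum of both terms; the weighting scheme is an assumption."""
    return cross_entropy(logits, target_ids) + diffusion_weight * mse(pred, target)
```

In a real training loop both terms would be computed on tensors and backpropagated together, so the AR model receives gradients from the token prediction and from the diffusion branch.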
RL with Verified Reward: Introducing discrete image tokens makes the model directly compatible with existing language-model RL frameworks. Using Group Relative Policy Optimization (GRPO), we train BLIP3o-NEXT to improve prompt alignment and text rendering in image generation.
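The core idea of GRPO is that each sample is scored relative to the other samples generated for the same prompt, so no separate value model is needed. The sketch below shows only that group-relative normalization step, under stated assumptions: rewards are plain floats from some verifier, the population standard deviation is used, and `eps` guards against division by zero. The function name is hypothetical and this is not the GRPO code shipped in this repo.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Using the population std and this eps value are assumptions.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Example: four images sampled for one prompt, scored 1.0 if the
# (hypothetical) verifier accepts the image and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Samples that beat the group average get a positive advantage and are reinforced; samples below it get a negative one. When every sample in a group receives the same reward, all advantages are zero and that group contributes no learning signal.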
Fully Open-Source:
- Pretraining Data: 27 Million Detailed Captions, 5 Million Short Captions
- Instruction Tuning Data: BLIP3o-60k, ShareGPT-4o-Image
- Model Weights (3B): Pretrain, Instruction Tuning, GRPO-Geneval, GRPO-Text
- Training Code: Pretrain, Instruction Tuning, GRPO
🔥 Feel free to reach out if you have any questions. Discord: https://discord.gg/SsVYdV84bw or WeChat
Install the package for pretraining and instruction tuning:
conda create -n blip3o-next python=3.11 -y
conda activate blip3o-next
pip install --upgrade pip setuptools
pip install -r requirements.txt
pip install -e .
Set your Slurm configuration and environment in the launch script, then submit the job:
sbatch scripts/run.sh
For inference, set the model path in inference.py, then run:
python inference.py
For GRPO, we recommend creating a separate environment, because the blip3o-next environment has torch version conflicts. You also need to install the dependencies from setup.py. Follow the steps below:
cd trl
conda create -n grpo python=3.11 -y
conda activate grpo
pip install -r requirements.txt
cd ..
pip install -e .