Releases: unslothai/unsloth
gpt-oss Reinforcement Learning + Auto Kernel Notebook
We’re introducing gpt-oss RL support with the fastest RL inference and the lowest VRAM use of any implementation. Blog: https://docs.unsloth.ai/new/gpt-oss-reinforcement-learning
- Unsloth now offers the fastest inference (~3x faster), the lowest VRAM use (50% less) and the most context (8x longer) for gpt-oss RL of any implementation - with no accuracy loss.
- Since RL on gpt-oss isn't yet vLLM-compatible, we rewrote the Transformers inference code to enable faster inference.
- gpt-oss-20b GSPO free Colab notebook
- This notebook automatically creates faster matrix-multiplication kernels and uses a new Unsloth reward function. We also show how to counteract reward hacking, which is one of RL's biggest challenges.
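To make the anti-reward-hacking idea concrete, here is a hypothetical reward function in the notebook's spirit: it verifies a candidate matmul kernel against torch.matmul on fresh random inputs before rewarding speed, so the policy cannot score by returning cached or hardcoded results. The function name, weights, and callable-based interface are illustrative assumptions, not the notebook's actual code.
import time
import torch

def matmul_kernel_reward(candidate_fn, size: int = 256) -> float:
    """Reward a candidate matmul kernel: zero unless it is numerically
    correct (the anti-reward-hacking gate), then reward measured speedup."""
    a = torch.randn(size, size)
    b = torch.randn(size, size)
    try:
        out = candidate_fn(a, b)
    except Exception:
        return 0.0  # crashing kernels earn nothing
    # Correctness gate on fresh inputs: hardcoded or cached outputs cannot pass.
    if not isinstance(out, torch.Tensor) or not torch.allclose(out, a @ b, atol = 1e-4):
        return 0.0
    # Time candidate vs. reference and reward relative speedup, capped so
    # the policy cannot exploit timing noise for unbounded reward.
    t0 = time.perf_counter(); candidate_fn(a, b); t1 = time.perf_counter()
    t2 = time.perf_counter(); torch.matmul(a, b); t3 = time.perf_counter()
    return min((t3 - t2) / max(t1 - t0, 1e-9), 3.0)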

- We previously released Vision RL with GSPO support
⚠️ Reminder NOT to use Flash Attention 3 for gpt-oss as it'll make your training loss wrong.
- DeepSeek-V3.1-Terminus is here and you can run it locally via our GGUF. Read how our 3-bit GGUF beats Claude-4-Opus (thinking) on Aider Polyglot here
- Magistral 1.2 is here and you can run it locally here or fine-tune it for free using our Kaggle notebook
- Fine-tuning the new Qwen3 models, including Qwen3-VL, Qwen3-Omni and Qwen3-Next, should work in Unsloth if you install the latest transformers. The models are large, however, so ensure you have enough VRAM.
- BERT is now fixed! Feel free to use our BERT fine-tuning notebook
- ⭐ We’re hosting a Developer event with Mistral AI & NVIDIA at Y Combinator’s Office in San Francisco on Oct 21. Come say hello!
- We’re also joining Pytorch and AMD for a 2 day Virtual AI Agents Challenge with prizes. Join Hackathon
Don't forget to also join our Reddit: r/unsloth 🥰
What's Changed
- Bug fixes by @danielhanchen in #3329
- Fix QAT + LoRA fast path, add tests by @andrewor14 in #3307
- Use gemma3n embedder patch + adjust FORCE_FLOAT32 match logic by @mmathew23 in #3332
- Synthetic Data updates by @mmathew23 in #3333
- Fix loading issues for BERT by @Etherll in #3339
- Bug fixes by @danielhanchen in #3335
- peft_config before model_config by @mmathew23 in #3342
- specify different tokenizer_path/name by @mmathew23 in #3343
- correct python support statement by @laz-001 in #3374
- GPT OSS RL by @danielhanchen in #3362
New Contributors
Full Changelog: September-2025-v2...September-2025-v3
Vision Reinforcement Learning + Memory Efficient RL
We're excited to support Vision models for RL and even more memory efficient + faster RL!
Unsloth now supports vision/multimodal RL with Gemma 3, Qwen2.5-VL and other vision models. Thanks to Unsloth's unique weight sharing and custom kernels, VLM RL is 1.5–2× faster, uses 90% less VRAM, and enables 10× longer context lengths than FA2 setups, with no accuracy loss. Qwen2.5-VL GSPO notebook
Gemma 3 (4B) Vision GSPO notebook
Full details in our blogpost: https://docs.unsloth.ai/new/vision-reinforcement-learning-vlm-rl
- This update also introduces Qwen's GSPO algorithm.
- Our new vision RL support is also now even faster & more memory efficient! Our new kernels & algorithms enable faster RL for text and vision LLMs with 50% less VRAM & 10× more context.
- Introducing a new RL feature called 'Standby'. Previously, RL required splitting GPU memory between training and inference. With Unsloth Standby, you no longer have to, and Standby uniquely limits speed degradation compared to other implementations - sometimes it even makes training faster (see the sketch below)! Read our Blog
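A rough sketch of how these two features are switched on. The UNSLOTH_VLLM_STANDBY environment variable follows the Standby blog post, and sequence-level importance sampling is how recent TRL exposes GSPO in GRPOConfig; treat both names, and the model upload used, as assumptions to verify against current docs.
import os
# Enable Unsloth Standby so training and inference share GPU memory;
# this must be set before importing unsloth.
os.environ["UNSLOTH_VLLM_STANDBY"] = "1"

from unsloth import FastVisionModel
from trl import GRPOConfig

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2.5-VL-7B-Instruct",  # assumed upload name
    load_in_4bit = True,
    fast_inference = True,  # vLLM-backed generation for RL rollouts
)
# GSPO = GRPO with importance ratios computed per sequence, not per token.
training_args = GRPOConfig(
    importance_sampling_level = "sequence",
    per_device_train_batch_size = 4,
    num_generations = 4,
)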

- We released Aider Polyglot benchmarks for our DeepSeek-V3.1 Dynamic GGUFs and Unsloth quants perform consistently better than others. Blog

Don't forget to also join our Reddit: r/unsloth 🥰
What's Changed
- GPT OSS Bug fixes by @danielhanchen in #3231
- tests for mxfp4 and quantized models merge fix unsloth zoo pr 254 by @rolandtannous in #3223
- Update mistral.py, showed flag to not call cut cross entropy by @pluesclues in #3233
- Remove old version constraint in dependency list by @timkpaine in #3237
- chore: Fix Typos by @DefiWimar7 in #3246
- Fix incorrect function call in test_qwen3_grpo.py by @stevenxdavis in #3212
- [Intel] make intel device support ROPE by @leizhenyuan in #3164
- Support saving locally in model.save_pretrained_torchao by @jerryzh168 in #3263
- fixed save_pretrained_torchao and associated tests by @rolandtannous in #3264
- patch sftrainer to disable _is_vlm by @mmathew23 in #3265
- Bug fixes by @danielhanchen in #3266
- Filter vllm executor log by @Datta0 in #3268
- llama vision inference fix by @mmathew23 in #3270
- Add TorchAO quantization tests with FP16 models and serialization workarounds by @rolandtannous in #3269
- GptAttention turn training off during inference by @mmathew23 in #3289
- Add support for QAT full fine-tuning by @andrewor14 in #3238
- simplify unsloth_base_fast_generate by @mmathew23 in #3291
- Bug fixes by @danielhanchen in #3295
- [ROCm] add hip device path by @billishyahao in #3301
- Bug fixes by @danielhanchen in #3322
- Add support for modules_to_save in FastModel.get_peft_model by @l1ghtsource in #3317
- Fast Inference with vLLM for VLMs by @Datta0 in #2975
- TRL Updated version of VLM GRPO update along with GSPO by @pluesclues in #3132
New Contributors
- @timkpaine made their first contribution in #3237
- @stevenxdavis made their first contribution in #3212
- @l1ghtsource made their first contribution in #3317
Full Changelog: August-2025-v2...September-2025-v2
Unsloth Flex Attention + Long context gpt-oss Training
We’re excited to introduce Unsloth Flex Attention support for OpenAI gpt-oss training that enables >8× longer context lengths, >50% less VRAM usage and >1.5× faster training compared to all implementations including those using Flash Attention 3 (FA3). Unsloth Flex Attention makes it possible to train with a 60K context length on just 80GB of VRAM for BF16 LoRA. Also:
- You can now export/save your QLoRA fine-tuned gpt-oss model to llama.cpp, vLLM, or HF.
- We fixed gpt-oss training losses going to infinity on float16 GPUs (like T4 Colab)
- We fixed gpt-oss implementation issues, most notably ensuring that swiglu_limit = 7.0 is properly applied during MXFP4 inference in transformers
- Unsloth Flex Attention scales with context: longer sequences yield bigger savings in both VRAM and training time
Full details in our blogpost: https://docs.unsloth.ai/basics/long-context-gpt-oss-training
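Unsloth's kernels are custom, but the primitive underneath is public in PyTorch 2.5+. A minimal sketch of sliding-window causal attention via torch.nn.attention.flex_attention; the window size and tensor shapes are illustrative, not gpt-oss's actual configuration:
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D, WINDOW = 1, 8, 4096, 64, 128

def sliding_window_causal(b, h, q_idx, kv_idx):
    # Each query attends only to past tokens within a fixed window,
    # as in gpt-oss's sliding-window layers.
    return (q_idx >= kv_idx) & (q_idx - kv_idx <= WINDOW)

# The block mask lets the kernel skip fully-masked blocks entirely,
# which is where the long-context VRAM and speed savings come from.
block_mask = create_block_mask(sliding_window_causal, B = None, H = None,
                               Q_LEN = S, KV_LEN = S)
q, k, v = (torch.randn(B, H, S, D, device = "cuda", dtype = torch.bfloat16)
           for _ in range(3))
out = flex_attention(q, k, v, block_mask = block_mask)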
What's Changed
- Add Qwen3 Instruct / Thinking chat templates by @Etherll in #3110
- Add Qwen3 4B to mapper.py by @Etherll in #3120
- Nightly by @danielhanchen in #3148
- Fix GPT OSS by @danielhanchen in #3154
- Nightly by @danielhanchen in #3169
- Update Blackwell install instructions for latest vLLM release by @qingy1337 in #3175
- Fix potential generator exhaustion bug in model loading file detection by @rolandtannous in #3167
- Fix vision model GGUF quantization_method error type by @rolandtannous in #3173
- Replace back ticks with single quotes by @rnowling in #3157
- Fix original_push_to_hub fallback by @Thiraput01 in #3115
- Add support for QAT + LoRA by @andrewor14 in #2976
- Bug fixes by @danielhanchen in #3180
- Torch 2.8 by @danielhanchen in #3186
- Fix extras transformers typo in pyproject.toml by @parth2510 in #3187
- Bug fixes by @danielhanchen in #3195
- allow torch.float32 dtype in FastLanguageModel by @mmathew23 in #3204
- fix is casual for qwen3 by @leizhenyuan in #3213
- Support model.save_pretrained_torchao by @jerryzh168 in #3111
- Fix gemma-3n by @mmathew23 in #3219
- Handle transformers move to dtype from torch_dtype by @mmathew23 in #3225
- chore: Fix Typos by @DefiWimar7 in #3224
New Contributors
- @rnowling made their first contribution in #3157
- @Thiraput01 made their first contribution in #3115
- @andrewor14 made their first contribution in #2976
- @parth2510 made their first contribution in #3187
- @jerryzh168 made their first contribution in #3111
- @DefiWimar7 made their first contribution in #3224
Full Changelog: August-2025...August-2025-v2
gpt-oss Fine-tuning

gpt-oss is here! ✨
Finetune gpt-oss for free with our Unsloth Colab notebook!
- We’ve managed to make gpt-oss train on just 14GB of VRAM, making it possible to train on free Colab thanks to our linear conversions (see the sketch after this list). For more details, read our Guide/Blogpost
- Fine-tuning gpt-oss is 1.5x faster and uses 50% less VRAM with Unsloth. gpt-oss-120b model fits on 65GB of VRAM.
- Model uploads: 20b GGUF • 120b GGUF • All uploads
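For reference, loading gpt-oss for QLoRA fine-tuning follows the same pattern as our other notebooks. A minimal sketch; the upload name and hyperparameters mirror the Colab notebook but should be treated as assumptions:
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gpt-oss-20b",  # assumed upload name
    max_seq_length = 1024,
    load_in_4bit = True,  # QLoRA keeps this near 14GB of VRAM
)
model = FastLanguageModel.get_peft_model(
    model,
    r = 8,
    lora_alpha = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)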
🦥 Unsloth updates
- We’ve made algorithmic updates to Unsloth so every model now trains faster and with less VRAM, no matter which model you use.
- Unsloth now works on RTX 50 and Blackwell GPUs. Read our guide.
- Official Unsloth Docker image coming very soon!
- You can now run Unsloth models directly via Docker:
docker model pull hf.co/unsloth/gpt-oss-20b-GGUF
🌠 Qwen3-Coder + Qwen3-2507
Qwen released their July 2025 updates, called 'Qwen3-2507', and launched their SOTA coding models!
- Qwen3-Coder (with Unsloth fixes): Guide • Coder uploads
- Qwen3-2507: Guide • 2507 uploads
- Fine-tune Qwen3-4B-2507 with our Colab notebook
🔮 New models + Support:
Run these new models:
- Kimi-K2: Guide • GGUF
- GLM: 4.5-Air • 4.5 • 4-32B-0414
- Orpheus-3B • Hunyuan-A13B
Unsloth also now supports running + training for:
- We collabed with the Liquid & TII teams to support training for Falcon-H1-7B and LFM2-1.2B! Notebooks here
- Devstral-2507 • Magistral-2507 • SmolLM3-3B
Don't forget to also join our Reddit: r/unsloth 🥰
What's Changed
- Fix argument mismatch in GRPO _get_per_token_logps lambda function by @rolandtannous in #2929
- patch falcon h1 inference by @mmathew23 in #2932
- Fix falcon H1 dropout issue by @Datta0 in #2938
- fix: change lora_dropout from int to float for type consistency by @muzzlol in #2949
- GRPO fix dataloader_num_workers value error in GRPOTrainer by @rolandtannous in #2944
- GRPO Fix - Support vllm pre-dequantized quantization states in fast_dequantize kernel by @rolandtannous in #2943
- Bug fixes by @danielhanchen in #2982
- Update unsloth-cli.py by @qgallouedec in #2985
- use fastmodel falcon h1 by @mmathew23 in #2987
- Add Qwen2.5-VL-32B-Instruct mapping to fix quantized model merge error by @rolandtannous in #2986
- Revert "Add Qwen2.5-VL-32B-Instruct mapping to fix quantized model merge error" by @danielhanchen in #2988
- Revert "Revert "Add Qwen2.5-VL-32B-Instruct mapping to fix quantized … by @danielhanchen in #2990
- Bug fixes by @danielhanchen in #2998
- Update README.md by @qgallouedec in #2991
- Bug fixes by @danielhanchen in #3017
- [bugs] fix for casual mask by @leizhenyuan in #3011
- [intel] add for intel path for llama.py by @leizhenyuan in #3012
- Fix Gemma 2 by @danielhanchen in #3024
- falcon h1 force float32 when dtype is torch.float16 by @mmathew23 in #3026
- Fix torch compile issues by @danielhanchen in #3028
- Fix Llama and Gemma inference by @Erland366 in #3034
- Fixup multi GPU workload. by @Datta0 in #3049
- Bug Fixes and Enhancements for Model Loading by @Etherll in #3052
- Add gemma-3n chat template to chat_templates.py by @Etherll in #3051
- Fix: Added specific check for Gemma so models like BERT properly init… by @Sekinal in #3055
- fixup rope sync for everything by @Datta0 in #3061
- get_per_token_logps_and_entropies: return tuple instead of dict by @mmathew23 in #3080
- Docs: Add WSL Installation Guide for Blackwell / RTX 5090 GPU by @dongbin-lunark in #3079
- GPT-OSS support by @mmathew23 in #3099
- Nightly by @danielhanchen in #3102
- gpt-oss manually call temporary patch by @mmathew23 in #3104
New Contributors
- @muzzlol made their first contribution in #2949
- @Sekinal made their first contribution in #3055
- @dongbin-lunark made their first contribution in #3079
Full Changelog: July-2025...August-2025
Less VRAM + bug fixes
More VRAM reductions, faster training & bug fixes
Please update Unsloth!
pip install --upgrade --force-reinstall --no-deps --no-cache-dir unsloth unsloth_zoo
- Gemma 3N Vision now works and is fixed! Please re-download all model checkpoints (Unsloth will do this automatically). Try the Kaggle Notebook! There is also a challenge with a prize pool of $100,000!
- Gemma 3 text and vision are now fixed for T4 and are much faster. Losses of 6 to 7 are now fixed - they should be 1 to 2.
- 10 to 25% less VRAM consumption for all models, plus faster compiling and fewer errors. Unsloth is now more stable!
- Downloads stuck at 90% to 95% fixed!
- Qwen 2.5, Qwen 2, GLM all fixed as well.
- GRPO now works with latest main TRL
- Main TRL, PEFT, Transformers all work
- Forced upgrading transformers is now fixed.
- Falcon H1 finetuning should work great! Notebooks incoming
- Devstral 1.1 and MedGemma 27B, 4B support with vision
- Many many many more bug fixes - this release of Unsloth should be much more stable and error tolerant!
Please update Unsloth!
pip install --upgrade --force-reinstall --no-deps --no-cache-dir unsloth unsloth_zoo
What's Changed
- Gemma 3N by @danielhanchen in #2809
- Add instructions for installing unsloth on RTX 5090 by @jeromeku in #2812
- Add falcon h1 by @dhiaEddineRhaiem in #2650
- Granite4 support by @mmathew23 in #2799
- import undefined transformers_version for falcon model by @mmathew23 in #2822
- Fix LoftQ with FastBaseModel by @mehmetoguzderin in #2826
- Create stale.yml by @danielhanchen in #2832
- Create stale.yml by @danielhanchen in #2836
- Added conda/mamba section to blackwell installation readme by @rolandtannous in #2817
- Gemma 3N bug fixes by @danielhanchen in #2842
- Fix loftq None config for FastBaseModel by @mmathew23 in #2848
- Convert torch.bfloat16, torch.float16, etc. to vLLM valid dtypes by @rishabh135 in #2811
- [Feature] enable unsloth on amd gpu by @billishyahao in #2520
- Fix Gemma 3N by @danielhanchen in #2854
- fix quantized model parameter count method by @rolandtannous in #2855
- Update CSM for faster inference (no compile) by @mmathew23 in #2865
- Fix UnslothTrainingArguments not patching trl.Config properly by @Erland366 in #2873
- Fix unnecessary warning for transformers >= 4.53.0 by @mmathew23 in #2867
- Update README.md by @danielhanchen in #2885
- Many bug fixes by @danielhanchen in #2908
- silenty skip falcon h1 import if transformers_version < 4.53.0 by @mmathew23 in #2912
- Dynamically adjust get_per_token_logps [trl main upgrade] by @Datta0 in #2911
- [Intel] add intel gpu with vllm support by @leizhenyuan in #2903
- [bugs] fix for casual mask by @leizhenyuan in #2868
- Explicitly check if xformers exists for attention by @Datta0 in #2889
- Falcon H1: if mlp doesn't exist in layer module check for feed_forward by @mmathew23 in #2913
- Move inputs to right devices. by @Datta0 in #2919
- Many bug fixes by @danielhanchen in #2927
New Contributors
- @dhiaEddineRhaiem made their first contribution in #2650
- @mehmetoguzderin made their first contribution in #2826
- @rishabh135 made their first contribution in #2811
- @billishyahao made their first contribution in #2520
Full Changelog: June-2025...July-2025
Gemma 3n + Text-to-speech (TTS)
✨ Gemma 3n now available
- Google's new Gemma 3n multimodal models support text, image, video & audio. Guide
- Gemma 3n finetuning notebook + audio, vision, text inference Colab notebook
- Gemma 3n collection in dynamic GGUF, safetensor 4-bit etc formats: Gemma-3n
🎵 Text-to-Speech (TTS) Fine-tuning
- Train TTS/STT models like Sesame-CSM, Orpheus-TTS and OpenAI's Whisper locally! Guide
- Clone voices, learn new emotions, tones & styles with 1.5x faster training and 50% less VRAM (see the sketch below). Notebooks
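A minimal sketch of what loading a TTS model for fine-tuning can look like. Orpheus is Llama-architecture, so the standard text path applies; the upload name and arguments are assumptions based on our notebooks, so check the linked Notebooks for the exact recipe:
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/orpheus-3b-0.1-ft",  # assumed upload name
    max_seq_length = 2048,
    load_in_4bit = True,  # QLoRA to keep VRAM low for local training
)
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)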
Tip
Update Unsloth via pip install --upgrade --force-reinstall unsloth unsloth_zoo
🧠 DeepSeek-R1-0528 Support with Dynamic 1-bit GGUFs
- Fine-tune DeepSeek-R1-0528-Qwen3 with GRPO! Our new reward function increases multilingual response rates by 40%+ (a toy version is sketched after this list). Notebook
- Dynamic 1-bit GGUFs shrink the full 715GB model to just 175GB (-80% size)
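For intuition, here is a toy version of a reward that favors responding in the target language. This is a hypothetical illustration, not the notebook's actual function; it uses a crude Unicode-range heuristic for Chinese and follows the reward-function signature used in our GRPO examples:
def language_consistency_reward(completions, **kwargs) -> list[float]:
    """Toy reward: 1.0 if the response is mostly in the target script."""
    def frac_chinese(text: str) -> float:
        if not text:
            return 0.0
        return sum("\u4e00" <= ch <= "\u9fff" for ch in text) / len(text)
    responses = [completion[0]["content"] for completion in completions]
    return [1.0 if frac_chinese(r) > 0.5 else 0.0 for r in responses]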
📈 Dynamic 2.0 GGUFs
- New quantization method that achieves SOTA performance. More info
- Sets new benchmarks for 5-shot MMLU and KL Divergence and selectively quantizes layers for optimal accuracy
⚡ Advanced Qwen3 GRPO notebook
- Proximity scoring for better reward functions (sketched after this list). Advanced GRPO notebook
- New pre-finetuning/priming step to skip GRPO format learning
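Proximity scoring gives partial credit the closer a numeric answer is to the truth, instead of the all-or-nothing match in our basic notebooks, so the policy gets a learning signal before it is ever exactly right. A minimal sketch with illustrative thresholds:
def proximity_reward_func(completions, answer, **kwargs) -> list[float]:
    """Full credit for exact numeric matches, smoothly decaying partial
    credit for near misses."""
    def extract(text: str) -> str:
        # Pull whatever sits inside <answer>...</answer>.
        return text.split("<answer>")[-1].split("</answer>")[0].strip()
    responses = [completion[0]["content"] for completion in completions]
    scores = []
    for r, a in zip(responses, answer):
        try:
            guess, truth = float(extract(r)), float(a)
        except (ValueError, TypeError):
            scores.append(0.0)
            continue
        rel_err = abs(guess - truth) / max(abs(truth), 1.0)
        scores.append(2.0 if rel_err == 0 else max(0.0, 1.0 - rel_err))
    return scores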
🎯 Magistral Conversational Reasoning
- Fine-tune Magistral-24B for advanced conversational reasoning. Notebook
👁️ Gemma3 Vision Support
- Fine-tune Gemma3 vision models for multimodal tasks Notebook
Documentation & Guides
- Reinforcement Learning Guide: Complete guide on RL for LLMs covering GRPO, RLHF, DPO. Guide
- LoRA Hyperparameters Guide: Master optimal learning rates, epochs, LoRA rank & alpha settings. Guide
What's Changed
- Nightly by @danielhanchen in #2448
- Added k_norm & q_norm to merged Qwen3 layers by @cblomert in #2452
- MoE Kernel by @jeromeku in #2465
- Blackwell Support by @johnnynunez in #2458
- Added missing code of conduct by @rolandtannous in #2416
- Fix readme example by @yuanzhedong in #2492
- the pixtral vision notebook fails during inference by @mmathew23 in #2466
- [1/N] Enable intel GPU for unsloth by @leizhenyuan in #2350
- [2/N] Enable intel GPU for unsloth by @leizhenyuan in #2388
- vLLM Windows CUDA support [tested] by @fenglui in #2158
- Add Sesame CSM by @mmathew23 in #2527
- Add Qwen-3 chat template and Ollama template support by @kiankyars in #2537
- Fix typos by @omahs in #2540
- Add use_rslora reference to LoraConfig inititalisation by @jkumz in #2539
- TTS by @danielhanchen in #2545
- Quick fix on the CompileConfig error by @Erland366 in #2554
- Fix trust remote code by @Etherll in #2357
- fix issue with qwen3 template double quote escapes by @davedgd in #2563
- Display the model name in RoPE scaling unsupported error by @emmanuel-ferdman in #2564
- Fix Whisper, ModernBERT by @danielhanchen in #2565
- fix: improved error handling when llama.cpp build fails #2358 by @Hansehart in #2603
- Remove
dataset_text_field
fromSFTConfig
by @qgallouedec in #2609 - Upgrade trl fix by @Datta0 in #2544
- Check the
skip_prepare_dataset
before accessing dataset fields. #2496 by @Premik in #2633 - Llama4 MoE Grouped GEMM by @jeromeku in #2639
- Latest TRL, GRPO + Bug fixes by @danielhanchen in #2645
- Fix SFTtraining for new trl by @mmathew23 in #2647
- Bug fixes by @danielhanchen in #2651
- Fix quant model param fetch regex by @Datta0 in #2662
- Fix batched generation for prompts of different lengths by @RunFMe in #2216
- reroute merge logic language models + comprehensive tests + eval kits by @rolandtannous in #2673
- unsloth checkpointing fix for latest transformers==4.52.x by @mmathew23 in #2674
- patch sft_trainer to favor max_seq_length over max_length in config by @mmathew23 in #2669
- Update prepare 4d causal attention call by @mmathew23 in #2678
- Ignore None Values when building vllm subprocess_command by @Salpingopharyngeus in #2680
- add support for torch270 with Intel GPU by @leizhenyuan in #2709
- Making protobuf version more flexible by @user799595 in #2637
- tests for additional merge fix unsloth zoo pr 163 by @rolandtannous in #2719
- Reward modeling update (There seems to be another patch) by @pluesclues in #2710
- Fix Typos in Documentation and Comments by @leopardracer in #2721
- Fix renaming on other model than Llama by @Erland366 in #2762
- Enable vLLM to share memory space by @Datta0 in #2712
- Fix TRL 1.8.2 by @marcandrelarochelle in #2774
- Fix AttributeError in GRPO trainer for models without llm attribute by @rolandtannous in #2780
- Additional tests for unsloth-zoo PR#174 by @rolandtannous in #2779
- Update pyproject.toml by @amrothemich in #2778
- Fix for grpo_compute_loss_slow by @simpissa in #2702
- Fix GRPO by @danielhanchen in #2787
- Docs: Fix typo and improve MoE docstrings by @kilavvy in #2784
- [5/N] Enable intel GPU for unsloth by @leizhenyuan in #2768
- Sequence Classification Bug Fixes by @pluesclues in #2793
- intel 5/N fix patch by @mmathew23 in #2792
- [3/N] Enable intel GPU for unsloth by @leizhenyuan in #2620
- [4/N] Enable intel GPU for unsloth by @mmathew23 in #2801
- [intel] use DeviceProperties instead of torch.xxx.deviceproperties by @leizhenyuan in #2803
- Fix grpo sleep regex and indentation by @Datta0 in #2804
- Bug fixes by @danielhanchen in #2805
- Bug fixes by @danielhanchen in #2807
New Contributors
- @cblomert made their first contribution in #2452
- @johnnynunez made their first contribution in #2458
- @rolandtannous made their first contribution in #2416
- @yuanzhedong made their first contribution in #2492
- @mmathew23 made their first contribution in #2466
- @leizhenyuan made their first contribution in #2350
- @fenglui made their first contribution in #2158
- @kiankyars made their first contribution in #2537
- @omahs made their first contribution in #2540
- @jkumz made their first contribution in #2539
- @davedgd made their first contribution in #2563
- @emmanuel-ferdman made their first contribution in https://github.com/unslothai/u...
Qwen3
Qwen 3 support + bug fixes
Please update Unsloth via pip install --upgrade --force-reinstall unsloth unsloth_zoo
Qwen3 notebook: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(14B)-Reasoning-Conversational.ipynb
GRPO with Qwen3 notebook: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)-GRPO.ipynb
There are also many bug fixes in this release!
The 30B MoE is also fine-tunable in Unsloth!
from unsloth import FastModel
import torch
model, tokenizer = FastModel.from_pretrained(
model_name = "unsloth/Qwen3-30B-A3B",
max_seq_length = 2048, # Choose any for long context!
load_in_4bit = True, # 4 bit quantization to reduce memory
load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory
full_finetuning = False, # [NEW!] We have full finetuning now!
# token = "hf_...", # use one if using gated models
)
What's Changed
- GGUF saving by @danielhanchen in #2017
- Gemma 3 readme by @danielhanchen in #2019
- Update README.md by @danielhanchen in #2028
- bug fix #2008 - load_in_4bit = True + fast_inference = True by @void-mckenzie in #2039
- unsloth_fast_generate model is not defined fix by @KareemMusleh in #2051
- Ensure trust_remote_code propagates down to unsloth_compile_transformers by @CuppaXanax in #2075
- Show peft_error by @IsaacBreen in #2080
- Add generation prompt error message change by @KareemMusleh in #2046
- Many bug fixes by @danielhanchen in #2087
- fix: config.torch_dtype in LlamaModel_fast_forward_inference by @lurf21 in #2091
- Updating new FFT 8bit support by @shimmyshimmer in #2110
- Bug fixes by @danielhanchen in #2113
- Small fix by @danielhanchen in #2114
- fix(utils): add missing importlib import to fix NameError by @naliazheli in #2134
- Add QLoRA Train and Merge16bit Test by @jeromeku in #2130
- Fix Transformers 4.45 by @danielhanchen in #2151
- Bug Fixes by @danielhanchen in #2197
- Issues templates by @jeromeku in #2242
- Fix feature_request ISSUE_TEMPLATE by @jeromeku in #2250
- Registry refactor by @jeromeku in #2255
- Update README.md by @Kimizhao in #2267
- Update README.md by @jackswl in #2119
- Update bug_report.md by @shimmyshimmer in #2323
- feat: Support custom auto_model for wider model compatibility (Whisper, Bert, etc) & attn_implementation support by @Etherll in #2263
- fix: improved error handling when llama.cpp build fails by @Hansehart in #2358
- Revert "fix: improved error handling when llama.cpp build fails" by @shimmyshimmer in #2375
- Fix saving 4bit for VLM by @Erland366 in #2381
- [WIP] Initial support for Qwen3. Will update when the model is released by @Datta0 in #2211
- Fixup qwen3 by @Datta0 in #2423
- Fixup qwen3 qk norm by @Datta0 in #2427
- Qwen3 inference fixes by @Datta0 in #2436
- Update mapper.py to add Qwen3 base by @Etherll in #2439
- Qwen 3, Bug Fixes by @danielhanchen in #2445
New Contributors
- @void-mckenzie made their first contribution in #2039
- @CuppaXanax made their first contribution in #2075
- @IsaacBreen made their first contribution in #2080
- @lurf21 made their first contribution in #2091
- @naliazheli made their first contribution in #2134
- @jeromeku made their first contribution in #2130
- @Kimizhao made their first contribution in #2267
- @jackswl made their first contribution in #2119
- @Etherll made their first contribution in #2263
- @Hansehart made their first contribution in #2358
Full Changelog: 2025-03...May-2025
Gemma 3 + FFT Support
March Release 🦥
Get the latest stable Unsloth via:
pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo
The March release should be stable - you can force the version via:
pip install "unsloth==2025.3.18" "unsloth_zoo==2025.3.16"
New Features
- Read all details here: https://unsloth.ai/blog/gemma3
- Gemma 3 1B, 4B, 12B and 27B finetuning all work now! Colab Notebook We fixed some issues which caused Gemma 3 training loss to be very high. This includes some tokenization issues, so fine-tuning Gemma 3 will now work correctly if you use Unsloth.
- We also encountered many infinite gradients during Gemma 3 (1B to 27B) finetuning. We found float16 mixed precision (Tesla T4, RTX 2080 series) to not function well, and we defaulted to float32 precision. Float16 also failed on A100, so this is a hardware-agnostic issue. Bfloat16 is fine though! Unsloth auto-selects the best data type - you do not have to do anything! Colab Notebook to finetune Gemma 3
- Preliminary support for full finetuning and 8-bit finetuning - set full_finetuning = True or load_in_8bit = True. Both will be optimized further in the future! A reminder: you will need more powerful GPUs!
model, tokenizer = FastModel.from_pretrained(
model_name = "unsloth/gemma-3-4B-it",
max_seq_length = 2048, # Choose any for long context!
load_in_4bit = True, # 4 bit quantization to reduce memory
load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory
full_finetuning = False, # [NEW!] We have full finetuning now!
# token = "hf_...", # use one if using gated models
)
- New Unsloth Auto Model support - nearly all models are now supported! We now support vision and text models out of the box, without the need for custom implementations (and all are optimized!)
- Mixtral (yes, finally!), Gemma 3, Granite 3.2, Cohere, OLMo, Reka, and generally any vision or language model! There might be the occasional model that doesn't work!
model, tokenizer = FastModel.from_pretrained(
model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1",
)
- Windows support via pip install unsloth should function now! Utilizes https://pypi.org/project/triton-windows/ which provides a pip installable path for Triton. Use:
pip install unsloth
- Train on completions / responses only for vision models supported! Use it like below:
data_collator = UnslothVisionDataCollator(
model,
tokenizer,
train_on_responses_only = True, # train only on assistant responses
instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)
SFTTrainer(..., data_collator = data_collator)
- Conversions to llama.cpp GGUFs for 16bit and 8bit now DO NOT need compiling! This solves many many issues, and this means no need to install GCC, Microsoft Visual Studio etc!
model.save_pretrained_merged("gemma-3-finetune", tokenizer)
model.save_pretrained_gguf(
"gemma-3-finetune",
quantization_type = "Q8_0", # For now only Q8_0, BF16, F16 supported
)
- Vision models now auto-resize images, which stops OOMs and also allows truncating sequence lengths!
- Many optimizations in Unsloth allow a further 10% less VRAM usage and a >10% speedup for 4bit (on top of our original 2x faster, 70% less memory usage). 8bit and full finetuning also benefit!
- GRPO in Unsloth now allows non-Unsloth-uploaded models to be in 4bit as well - this reduces VRAM usage a lot! (e.g. your own finetune of Llama)
- New training logs and info - training parameter counts, total batch size
- Vision models now also work for normal text training! This means non-vision notebooks can work with vision models!
- Complete gradient accumulation bug fix coverage for all models!
- GRPO notebook for Gemma 3 coming soon with Hugging Face's reasoning course!
- DoRA, Dropout, and other PEFT methods should just work!
Bug fixes
- Faster and less error-prone streamlined finetuning experience! Apologies for the recent issues with constant releases and breaking changes - the March release should be stable! i.e.
pip install "unsloth==2025.3.14" "unsloth_zoo==2025.3.12"
- Pixtral and Llava finetuning are now fixed! In fact nearly all vision models are supported out of the box! Please update transformers for Pixtral:
pip install --no-deps git+https://github.com/huggingface/transformers.git
- Fixed all Colabs not working - cloud instances like Runpod should just work now!
- Fixed many many bugs - will reply to each issue with updates!
Other items
- GRPO Bug fixes by @danielhanchen in #1623
- Fixes Triton url in README.md by @DiogoNeves in #1607
- Update README.md by @shimmyshimmer in #1654
- Update README.md by @shimmyshimmer in #1688
- Fix bugs by @danielhanchen in #1701
- Fix bugs by @danielhanchen in #1706
- Memory efficient GRPO, DPO etc by @danielhanchen in #1716
- Add GRPO metrics by @danielhanchen in #1718
- llama-quantize on WINDOWS WSL error fix - edit save.py (gguf saving breaks) by @everythingisc00l in #1649
- Update rl_replacements.py by @SethHWeidman in #1754
- Update README.md by @danielhanchen in #1768
- fix an import error by @NinoRisteski in #1767
- Gemma Mask convert to float by @Erland366 in #1762
- [Windows Support] Add latest
xformers
wheels to pyproject.toml by @versipellis in #1753 - Memory Efficient GRPO by @danielhanchen in #1773
- Bug Fixes by @danielhanchen in #1774
- Export Model to ollama.com by @gjyotin305 in #1648
- Fix: GRPO with Mistral and importing by @oKatanaaa in #1831
- Fix key error in GRPOTrainer by @le-big-mac in #1818
- fixed syntax warnings by @KareemMusleh in #1522
- Direct windows support for unsloth by @adityaghai07 in #1841
- Fix Layernorm when num_cols not a power of 2 by @MekkCyber in #1867
- Added Python version warning to Windows Install Section by @areebuzair in #1872
- Update README.md by @shimmyshimmer in #1885
- Bug fixes by @danielhanchen in #1891
- Many bug fixes by @danielhanchen in #1900
- Logits fixes by @danielhanchen in #1916
- Bug fixes by @danielhanchen in #1920
- Bug fixes by @danielhanchen in #1951
- move use_modelscope to _utils by @KareemMusleh in #1938
- Don't use revision when loading model_config and is_peft=True by @wiwu2390 in #1949
- More syntax warnings by @KareemMusleh in #1944
- Gemma 3 by @danielhanchen in #1986
- Gemma 3 bug fixes by @danielhanchen in #2005
- Triton windows update by @Captain-T2004 in #1976
- Update RMS LayerNorm implementation, and list compr. change in chat templates by @NinoRisteski in #1974
- Gemma 3, bug fixes by @danielhanchen in #2014
New Contributors
- @DiogoNeves made their first contribution in #1607
- @everythingisc00l made their first contribution in #1649
- @SethHWeidman made their first contribution in #1754
- @versipellis made their first contribution in #1753
- @gjyotin305 made their first contribution in #1648
- @le-big-mac made their first contribution in #1818
- @MekkCyber made their first contribution in #1867
- @areebuzair made their first contribution in #1872
- @wiwu2390 made their first contribution in #1949
- @Captain-T2004 made their first contribution in #1976
Full Changelog: 2025-02...2025-03
Long Context GRPO
90% less memory usage GRPO
Update Unsloth via pip install --upgrade --no-cache-dir unsloth unsloth_zoo
More details in blog post: https://unsloth.ai/blog/grpo
Llama 3.1 8B GRPO Colab: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb
Metric | Unsloth | TRL + FA2 |
---|---|---|
Training Memory Cost (GB) | 42GB | 414GB |
GRPO Memory Cost (GB) | 9.8GB | 78.3GB |
Inference Cost (GB) | 0GB | 16GB |
Inference KV Cache for 20K context (GB) | 2.5GB | 2.5GB |
Total Memory Usage | 54.3GB (90% less) | 510.8GB |
You automatically get 90% less memory usage! Also all reward logs for individual reward functions will show up.
Script to run GRPO:
!pip install unsloth vllm
from unsloth import FastLanguageModel
import torch
max_seq_length = 1024 # Can increase for longer reasoning traces
lora_rank = 32 # Larger rank = smarter, but slower
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "meta-llama/meta-Llama-3.1-8B-Instruct",
max_seq_length = max_seq_length,
load_in_4bit = True, # False for LoRA 16bit
fast_inference = True, # Enable vLLM fast inference
max_lora_rank = lora_rank,
gpu_memory_utilization = 0.6, # Reduce if out of memory
)
model = FastLanguageModel.get_peft_model(
model,
r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
target_modules = [
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
], # Remove QKVO if out of memory
lora_alpha = lora_rank,
use_gradient_checkpointing = "unsloth", # Enable long context finetuning
random_state = 3407,
)
import re
from datasets import load_dataset, Dataset
# Counters used to periodically print a sample completion
COUNTER = 0
PRINT_EVERY = 20
# Load and prep dataset
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""
XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""
def extract_xml_answer(text: str) -> str:
answer = text.split("<answer>")[-1]
answer = answer.split("</answer>")[0]
return answer.strip()
def extract_hash_answer(text: str) -> str | None:
if "####" not in text:
return None
return text.split("####")[1].strip()
# uncomment middle messages for 1-shot prompting
def get_gsm8k_questions(split = "train") -> Dataset:
data = load_dataset('openai/gsm8k', 'main')[split] # type: ignore
data = data.map(lambda x: { # type: ignore
'prompt': [
{'role': 'system', 'content': SYSTEM_PROMPT},
{'role': 'user', 'content': x['question']}
],
'answer': extract_hash_answer(x['answer'])
}) # type: ignore
return data # type: ignore
dataset = get_gsm8k_questions()
# Reward functions
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
responses = [completion[0]['content'] for completion in completions]
q = prompts[0][-1]['content']
extracted_responses = [extract_xml_answer(r) for r in responses]
global COUNTER
if COUNTER % PRINT_EVERY == 0:
print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
COUNTER += 1
return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]
def int_reward_func(completions, **kwargs) -> list[float]:
responses = [completion[0]['content'] for completion in completions]
extracted_responses = [extract_xml_answer(r) for r in responses]
return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]
def strict_format_reward_func(completions, **kwargs) -> list[float]:
"""Reward function that checks if the completion has a specific format."""
pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
responses = [completion[0]["content"] for completion in completions]
matches = [re.match(pattern, r) for r in responses]
return [0.5 if match else 0.0 for match in matches]
def soft_format_reward_func(completions, **kwargs) -> list[float]:
"""Reward function that checks if the completion has a specific format."""
pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
responses = [completion[0]["content"] for completion in completions]
matches = [re.match(pattern, r) for r in responses]
return [0.5 if match else 0.0 for match in matches]
def count_xml(text) -> float:
count = 0.0
if text.count("<reasoning>\n") == 1:
count += 0.125
if text.count("\n</reasoning>\n") == 1:
count += 0.125
if text.count("\n<answer>\n") == 1:
count += 0.125
count -= len(text.split("\n</answer>\n")[-1])*0.001
if text.count("\n</answer>") == 1:
count += 0.125
count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
return count
def xmlcount_reward_func(completions, **kwargs) -> list[float]:
contents = [completion[0]["content"] for completion in completions]
return [count_xml(c) for c in contents]
max_prompt_length = 256
from trl import GRPOConfig, GRPOTrainer
# Optional extra params for vLLM
from unsloth import vLLMSamplingParams
vllm_sampling_params = vLLMSamplingParams(
min_p = 0.01,
seed = 3407,
)
training_args = GRPOConfig(
learning_rate = 5e-6,
warmup_ratio = 0.1,
lr_scheduler_type = "cosine",
optim = "adamw_8bit",
per_device_train_batch_size = 1,
gradient_accumulation_steps = 1, # Increase to 4 for smoother training
num_generations = 6, # Decrease if out of memory
max_prompt_length = max_prompt_length,
max_completion_length = max_seq_length - max_prompt_length,
# num_train_epochs = 1, # Set to 1 for a full training run
max_steps = 250,
report_to = "none", # Can use Weights & Biases
vllm_sampling_params = vllm_sampling_params, # Optional
temperature = 1.0,
)
trainer = GRPOTrainer(
model = model,
processing_class = tokenizer,
reward_funcs = [
xmlcount_reward_func,
soft_format_reward_func,
strict_format_reward_func,
int_reward_func,
correctness_reward_func,
],
args = training_args,
train_dataset = dataset,
)
trainer.train()
What's Changed
- GRPO Bug fixes by @danielhanchen in #1623
- Fixes Triton url in README.md by @DiogoNeves in #1607
- Update README.md by @shimmyshimmer in #1654
- Update README.md by @shimmyshimmer in #1688
- Fix bugs by @danielhanchen in #1701
- Fix bugs by @danielhanchen in #1706
- Memory efficient GRPO, DPO etc by @danielhanchen in #1716
- Add GRPO metrics by @danielhanchen in #1718
- llama-quantize on WINDOWS WSL error fix - edit save.py (gguf saving breaks) by @everythingisc00l in #1649
- Update rl_replacements.py by @SethHWeidman in #1754
- Update README.md by @danielhanchen in #1768
- fix an import error by @NinoRisteski in #1767
- Gemma Mask convert to float by @Erland366 in #1762
- [Windows Support] Add latest
xformers
wheels to pyproject.toml by @versipellis in #1753 - Memory Efficient GRPO by @danielhanchen in #1773
New Contributors
- @DiogoNeves made their first contribution in #1607
- @everythingisc00l made their first contribution in #1649
- @SethHWeidman made their first contribution in #1754
- @versipellis made their first contribution in #1753
Full Changelog: 2025-02...2025-02-v2
GRPO, vLLM
GRPO is in Unsloth!
- Experience the "aha moment" from DeepSeek R1's paper now with Unsloth!
- LoRA (16bit) / QLoRA (4bit) actually work for GRPO now!
- Unsloth can do GRPO for Phi-4 (14B) and Llama-3.1 (8B) on a free 15GB Colab GPU!
- Unsloth now has native fast inference (20x more throughput) via vLLM! Use it via model.fast_generate after setting FastLanguageModel.from_pretrained(..., fast_inference = True) and installing vLLM via pip install vllm (see the sketch after this list)
- Llama 3.3 70B QLoRA GRPO should fit in 1x 48GB (best 1x 80GB)
- Update unsloth via
pip install --upgrade --no-cache-dir --force-reinstall unsloth_zoo unsloth vllm
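A quick sketch of fast inference once fast_inference is enabled (as referenced above); model.fast_generate returns vLLM RequestOutput objects, and the prompt and sampling values here are placeholders:
from unsloth import FastLanguageModel
from vllm import SamplingParams

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/meta-Llama-3.1-8B-Instruct",
    max_seq_length = 512,
    load_in_4bit = True,
    fast_inference = True,  # route generation through vLLM
)
sampling_params = SamplingParams(temperature = 0.8, top_p = 0.95, max_tokens = 128)
outputs = model.fast_generate(
    ["Why is the sky blue?"],
    sampling_params = sampling_params,
)
print(outputs[0].outputs[0].text)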
GRPO Notebooks
Model | Type | Colab Link |
---|---|---|
Phi 4 (14B) | GRPO | Open in Colab |
Llama 3.1 (8B) | GRPO | Open in Colab |
Qwen 2.5 (3B) | GRPO | Open in Colab |
Minimal GRPO example (courtesy of Will Brown)
!pip install unsloth vllm
!pip install git+https://github.com/huggingface/trl.git
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)
from unsloth import is_bfloat16_supported
import torch
max_seq_length = 512
lora_rank = 32
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "meta-llama/meta-Llama-3.1-8B-Instruct",
max_seq_length = max_seq_length,
load_in_4bit = True,
fast_inference = True,
max_lora_rank = lora_rank,
gpu_memory_utilization = 0.6,
)
model = FastLanguageModel.get_peft_model(
model,
r = lora_rank,
lora_alpha = lora_rank,
)
import re
from datasets import load_dataset, Dataset
# Load and prep dataset
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""
XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""
def extract_xml_answer(text: str) -> str:
answer = text.split("<answer>")[-1]
answer = answer.split("</answer>")[0]
return answer.strip()
def extract_hash_answer(text: str) -> str | None:
if "####" not in text:
return None
return text.split("####")[1].strip()
# uncomment middle messages for 1-shot prompting
def get_gsm8k_questions(split = "train") -> Dataset:
data = load_dataset('openai/gsm8k', 'main')[split] # type: ignore
data = data.map(lambda x: { # type: ignore
'prompt': [
{'role': 'system', 'content': SYSTEM_PROMPT},
{'role': 'user', 'content': x['question']}
],
'answer': extract_hash_answer(x['answer'])
}) # type: ignore
return data # type: ignore
dataset = get_gsm8k_questions()
# Reward functions
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
responses = [completion[0]['content'] for completion in completions]
q = prompts[0][-1]['content']
extracted_responses = [extract_xml_answer(r) for r in responses]
print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]
def int_reward_func(completions, **kwargs) -> list[float]:
responses = [completion[0]['content'] for completion in completions]
extracted_responses = [extract_xml_answer(r) for r in responses]
return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]
def strict_format_reward_func(completions, **kwargs) -> list[float]:
"""Reward function that checks if the completion has a specific format."""
pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
responses = [completion[0]["content"] for completion in completions]
matches = [re.match(pattern, r) for r in responses]
return [0.5 if match else 0.0 for match in matches]
def soft_format_reward_func(completions, **kwargs) -> list[float]:
"""Reward function that checks if the completion has a specific format."""
pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
responses = [completion[0]["content"] for completion in completions]
matches = [re.match(pattern, r) for r in responses]
return [0.5 if match else 0.0 for match in matches]
def count_xml(text) -> float:
count = 0.0
if text.count("<reasoning>\n") == 1:
count += 0.125
if text.count("\n</reasoning>\n") == 1:
count += 0.125
if text.count("\n<answer>\n") == 1:
count += 0.125
count -= len(text.split("\n</answer>\n")[-1])*0.001
if text.count("\n</answer>") == 1:
count += 0.125
count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
return count
def xmlcount_reward_func(completions, **kwargs) -> list[float]:
contents = [completion[0]["content"] for completion in completions]
return [count_xml(c) for c in contents]
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
use_vllm = True, # use vLLM for fast inference!
learning_rate = 5e-6,
adam_beta1 = 0.9,
adam_beta2 = 0.99,
weight_decay = 0.1,
warmup_ratio = 0.1,
lr_scheduler_type = "cosine",
optim = "paged_adamw_8bit",
logging_steps = 1,
bf16 = is_bfloat16_supported(),
fp16 = not is_bfloat16_supported(),
per_device_train_batch_size = 1,
gradient_accumulation_steps = 1,
num_generations = 6,
max_prompt_length = 256,
max_completion_length = 200,
# num_train_epochs = 1,
max_steps = 250,
save_steps = 250,
max_grad_norm = 0.1,
report_to = "none",
output_dir = "outputs",
)
trainer = GRPOTrainer(
model = model,
processing_class = tokenizer,
reward_funcs = [
xmlcount_reward_func,
soft_format_reward_func,
strict_format_reward_func,
int_reward_func,
correctness_reward_func,
],
args = training_args,
train_dataset = dataset,
)
trainer.train()
Bug Fixes
- Gemma 2 should be fixed now
- Mistral base mapping should be fixed
- Some syntax warning issue fixes
- And many many more bug fixes!
What's Changed
- Add use_exact_model_name option to prevent automatic model name modification by @niryuu in #1339
- Improve debugging experience by @Erland366 in #1512
- changing model to base_model if peft model is already used by @mosama1994 in #1509
- All attention refactor fix by @KareemMusleh in #1491
- Update granite to work with latest post_patch methods by @Datta0 in #1502
- Minor fixes for granite models by @CoffeeVampir3 in #1503
- support modelscope models and datasets by @tastelikefeet in #1481
- Update README.md by @shimmyshimmer in #1529
- Update bug_report.md by @danielhanchen in #1538
- Update README.md by @shimmyshimmer in #1542
- Torch.Cuda Is Available Condition and Warning by @aminwhat in #1545
- Add dropout to granite to match HF's implementation by @Datta0 in #1557
- fix: flash_attn_detection_error by @Zzhiter in #1556
- Fix Mistral, Qwen by @danielhanchen in #1565
- Update README.md by @shimmyshimmer in #1569
- Update README.md by @shimmyshimmer in #1580
- Update README.md by @shimmyshimmer in #1595
- Mistral 24B, Qwen 2.5 VL support by @danielhanchen in #1598
- GRPO, vLLM, Bug Fixes, Reinforcement Learning by @danielhanchen in #1620
New Contributors
- @niryuu made their first contribution in #1339
- @mosama1994 made their first contribution in #1509
- @KareemMusleh made their first contribution in #1491
- @tastelikefeet made their first contribution in #1481
- @aminwhat made their first contribution in #1545
- @Zzhiter made their first contribution in #1556
Full Changelog: 2025-01...2025-02