A curated collection of papers on Label-Free Reinforcement Learning with Verifiable Rewards (RLVR) for Large Language Models (LLMs).
By Qingyang Zhang, Haitao Wu, and Yi Ding. If we have missed any papers, please let us know!
Preference Optimization for Reasoning with Pseudo Feedback, ArXiv, 2024-11, ICLR'25 spotlight
Self-Consistency Preference Optimization, ArXiv, 2024-11
Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization, ArXiv, 2025-04-08
TTRL: Test-Time Reinforcement Learning, ArXiv, 2025-04-22
Absolute Zero: Reinforced Self-play Reasoning with Zero Data, ArXiv, 2025-05-06
The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning, ArXiv, 2025-05-21
SSR-Zero: Simple Self-Rewarding Reinforcement Learning for Machine Translation, ArXiv, 2025-05-22
SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data, ArXiv, 2025-05-25
Learning to Reason without External Rewards, ArXiv, 2025-05-26
Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers, ArXiv, 2025-05-26
Spurious Rewards: Rethinking Training Signals in RLVR, Blog, 2025-05-27
Can Large Reasoning Models Self-Train?, ArXiv, 2025-05-27
Maximizing Confidence Alone Improves Reasoning, ArXiv, 2025-05-28
Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO, ArXiv, 2025-05-29
ZeroGUI: Automating Online GUI Learning at Zero Human Cost, ArXiv, 2025-05-29
Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning, ArXiv, 2025-06-02
Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models, ArXiv, 2025-06-05
Self-Adapting Language Models, ArXiv, 2025-06-12
No Free Lunch: Rethinking Internal Feedback for LLM Reasoning, ArXiv, 2025-06-20
Self-Rewarding Correction for Mathematical Reasoning, ArXiv, 2025-02-26
Reinforcement Learning for Reasoning in Large Language Models with One Training Example, ArXiv, 2025-04-29
Evolving LLMs’ Self-Refinement Capability via Iterative Preference Optimization, ArXiv, 2025-05-17
Sherlock: Self-Correcting Reasoning in Vision-Language Models, ArXiv, 2025-05-28
SLOT: Sample-specific Language Model Optimization at Test-time, ArXiv, 2025-05-18
One-shot Entropy Minimization, ArXiv, 2025-05-26
Reinforcing General Reasoning without Verifiers, ArXiv, 2025-05-27
Incorrect Baseline Evaluations Call into Question Recent LLM-RL Claims, Blog, 2025-05-29
A critical review of RLVR evaluation setups.
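Many of the entries above (e.g., TTRL and Self-Consistency Preference Optimization) derive the training signal from the model's own samples rather than from ground-truth labels. The snippet below is a minimal illustrative sketch of one such signal, a majority-vote pseudo-reward; the function name and the toy answers are our own examples, not taken from any specific paper.

```python
from collections import Counter

def majority_vote_reward(sampled_answers, candidate_answer):
    """Pseudo-reward: 1.0 if the candidate matches the majority answer
    among the model's own samples, else 0.0. No ground-truth label is used."""
    majority_answer, _ = Counter(sampled_answers).most_common(1)[0]
    return 1.0 if candidate_answer == majority_answer else 0.0

# Example: 8 sampled answers to the same prompt (hypothetical values)
samples = ["42", "42", "41", "42", "7", "42", "42", "41"]
rewards = [majority_vote_reward(samples, a) for a in samples]
print(rewards)  # 1.0 for answers equal to the majority ("42"), 0.0 otherwise
```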