A curated collection of papers on Label-Free Reinforcement Learning with Verifiable Rewards (RLVR) for Large Language Models (LLMs).
By Qingyang Zhang, Haitao Wu, and Yi Ding. If we have missed any papers, please let us know!
Preference Optimization for Reasoning with Pseudo Feedback, ArXiv, 2024-11, ICLR'25 spotlight
Self-Consistency Preference Optimization, ArXiv, 2024-11
Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization, ArXiv, 2025-04-08
TTRL: Test-Time Reinforcement Learning, ArXiv, 2025-04-22
Absolute Zero: Reinforced Self-play Reasoning with Zero Data, ArXiv, 2025-05-06
The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning, ArXiv, 2025-05-21
SSR-Zero: Simple Self-Rewarding Reinforcement Learning for Machine Translation, ArXiv, 2025-05-22
SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data, ArXiv, 2025-05-25
Learning to Reason without External Rewards, ArXiv, 2025-05-26
Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers, ArXiv, 2025-05-26
Spurious Rewards: Rethinking Training Signals in RLVR, Blog, 2025-05-27
Can Large Reasoning Models Self-Train?, ArXiv, 2025-05-27
Maximizing Confidence Alone Improves Reasoning, ArXiv, 2025-05-28
Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO, ArXiv, 2025-05-29
ZeroGUI: Automating Online GUI Learning at Zero Human Cost, ArXiv, 2025-05-29
Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning, ArXiv, 2025-06-02
Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models, ArXiv, 2025-06-05
Self-Adapting Language Models, ArXiv, 2025-06-12
No Free Lunch: Rethinking Internal Feedback for LLM Reasoning, ArXiv, 2025-06-20
Self-Rewarding Correction for Mathematical Reasoning, ArXiv, 2025-02-26
Reinforcement Learning for Reasoning in Large Language Models with One Training Example, ArXiv, 2025-04-29
Evolving LLMs’ Self-Refinement Capability via Iterative Preference Optimization, ArXiv, 2025-05-17
Sherlock: Self-Correcting Reasoning in Vision-Language Models, ArXiv, 2025-05-28
SLOT: Sample-specific Language Model Optimization at Test-time, ArXiv, 2025-05-18
One-shot Entropy Minimization, ArXiv, 2025-05-26
Reinforcing General Reasoning without Verifiers, ArXiv, 2025-05-27
Incorrect Baseline Evaluations Call into Question Recent LLM-RL Claims, Blog, 2025-05-29
A critical review of RLVR evaluation setups.
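Many of the entries above (e.g., TTRL and Self-Consistency Preference Optimization) derive the training signal from the model's own samples rather than from ground-truth labels. The snippet below is a minimal illustrative sketch of one such signal, a majority-vote pseudo-reward; the function name and the toy answers are our own examples, not taken from any specific paper.

```python
from collections import Counter

def majority_vote_reward(sampled_answers, candidate_answer):
    """Pseudo-reward: 1.0 if the candidate matches the majority answer
    among the model's own samples, else 0.0. No ground-truth label is used."""
    majority_answer, _ = Counter(sampled_answers).most_common(1)[0]
    return 1.0 if candidate_answer == majority_answer else 0.0

# Example: 8 sampled answers to the same prompt (hypothetical values)
samples = ["42", "42", "41", "42", "7", "42", "42", "41"]
rewards = [majority_vote_reward(samples, a) for a in samples]
print(rewards)  # 1.0 for answers equal to the majority ("42"), 0.0 otherwise
```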