
Conversation

@sanjeed5 sanjeed5 (Contributor) commented Oct 8, 2025

Issue Link / Problem Description

Adds documentation and examples for aligning LLM-as-Judge evaluators with human expert judgments. This addresses a common challenge where LLM judges may not align well with human evaluations, leading to unreliable automated assessments.

Changes Made

  • Add comprehensive how-to guide: docs/howtos/applications/align-llm-as-judge.md
    • Step-by-step instructions for measuring and improving judge alignment
  • Add complete evaluation example: examples/ragas_examples/judge_alignment/
    • evals.py: Baseline and improved judge metrics with alignment measurement
    • __init__.py: Module initialization with main entry points
  • Documentation covers:
    • Why judge alignment matters
    • Dataset structure and loading
    • Defining judge and alignment metrics (a minimal sketch of the alignment calculation follows this list)
    • Running baseline evaluations
    • Iterating on judge prompts to improve alignment
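
The core idea of the guide, measuring how often the LLM judge agrees with the human label, can be summarized in a few lines. The sketch below is illustrative only and does not reproduce the code in `evals.py`; the field names (`human_label`, `judge_verdict`) and the helper `alignment_rate` are hypothetical.

```python
# Illustrative sketch of judge-human alignment; not the actual evals.py code.
# Field names (human_label, judge_verdict) are hypothetical.
from typing import Dict, List


def alignment_rate(rows: List[Dict[str, str]]) -> float:
    """Fraction of examples where the judge's verdict matches the human label."""
    if not rows:
        return 0.0
    matches = sum(1 for r in rows if r["judge_verdict"] == r["human_label"])
    return matches / len(rows)


if __name__ == "__main__":
    # Toy data: two agreements out of three examples -> alignment of ~0.67.
    dataset = [
        {"human_label": "pass", "judge_verdict": "pass"},
        {"human_label": "fail", "judge_verdict": "pass"},
        {"human_label": "fail", "judge_verdict": "fail"},
    ]
    print(f"alignment: {alignment_rate(dataset):.2f}")
```

A higher rate means the judge is a more trustworthy proxy for human review; the guide iterates on the judge prompt and re-measures until this number stops improving.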

Testing

How to Test

  • Manual testing steps:
    1. Run baseline evaluation: uv run python -m ragas_examples.judge_alignment
    2. Verify alignment metrics are calculated correctly (a hedged smoke-test snippet follows this list)
    3. Build docs locally: make serve-docs and navigate to the new guide
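
For a quick automated version of steps 1–2, something like the snippet below could work. It assumes only that the example is runnable as a module, as in the manual command above; the exact output format is not checked.

```python
# Hypothetical smoke test: run the example module and confirm it exits cleanly.
# Assumes `python -m ragas_examples.judge_alignment` is runnable in the current
# environment (dependencies installed, any required API keys set).
import subprocess
import sys

result = subprocess.run(
    [sys.executable, "-m", "ragas_examples.judge_alignment"],
    capture_output=True,
    text=True,
)
assert result.returncode == 0, result.stderr
print(result.stdout)  # inspect the reported alignment metrics manually
```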

@sanjeed5 sanjeed5 marked this pull request as ready for review October 9, 2025 18:34
@dosubot dosubot bot added the size:XL label (This PR changes 500-999 lines, ignoring generated files) Oct 9, 2025
@sanjeed5 sanjeed5 requested a review from shahules786 October 9, 2025 18:36
- Modified .gitignore to include only the .cursor/plans directory, removing unnecessary entries.
- Streamlined the git-pr.md file by replacing the lengthy PR description template with concise instructions for creating a PR using the gh CLI, enhancing usability for contributors.
- Introduced a new module in `examples/ragas_examples/judge_alignment` that includes functions for loading datasets, running evaluations, and calculating alignment metrics.
- Added baseline and improved accuracy metrics for judge evaluations.
- Enhanced documentation with a module docstring outlining the available functions and metrics.
- Add error analysis section with false positive/negative breakdown (illustrated in the sketch after this commit list)
- Include improved v2 prompt with abbreviation guide and evaluation approach
- Add ground truth quality guidance and tips on prompt optimization
- Update evals.py with complete v2 prompt and main_v2() function
- Fix model name from gpt-5-mini to gpt-4o-mini
- Add collapsible output examples for baseline and v2 results
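
The error-analysis commit above splits judge–human disagreements into false positives (the judge passes a response the human failed) and false negatives (the judge fails a response the human passed). A minimal illustration, with hypothetical field names rather than the ones used in `evals.py`:

```python
# Illustrative false positive / false negative breakdown for a pass/fail judge.
# Field names are hypothetical; the real evals.py may structure rows differently.
from typing import Dict, List


def error_breakdown(rows: List[Dict[str, str]]) -> Dict[str, int]:
    counts = {"false_positive": 0, "false_negative": 0, "agreement": 0}
    for r in rows:
        judge, human = r["judge_verdict"], r["human_label"]
        if judge == human:
            counts["agreement"] += 1
        elif judge == "pass" and human == "fail":
            counts["false_positive"] += 1  # judge too lenient
        else:
            counts["false_negative"] += 1  # judge too strict
    return counts


print(error_breakdown([
    {"human_label": "fail", "judge_verdict": "pass"},  # false positive
    {"human_label": "pass", "judge_verdict": "fail"},  # false negative
    {"human_label": "pass", "judge_verdict": "pass"},  # agreement
]))
```

False positives usually point to failure criteria the judge prompt is missing (for example, the abbreviation cases the v2 prompt adds guidance for), while false negatives suggest the prompt is overly strict.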
@sanjeed5 sanjeed5 force-pushed the docs/align-llm-judge branch from 6015f1e to cc6bd52 on October 9, 2025 18:47
@shahules786 shahules786 merged commit f31c365 into main Oct 14, 2025
8 checks passed
@shahules786 shahules786 deleted the docs/align-llm-judge branch October 14, 2025 15:45