
Conversation

@sanjeed5 sanjeed5 (Contributor) commented Oct 8, 2025

Issue Link / Problem Description

Adds documentation and examples for aligning LLM-as-Judge evaluators with human expert judgments. This addresses a common challenge where LLM judges may not align well with human evaluations, leading to unreliable automated assessments.

Changes Made

  • Add comprehensive how-to guide: docs/howtos/applications/align-llm-as-judge.md
    • Step-by-step instructions for measuring and improving judge alignment
  • Add complete evaluation example: examples/ragas_examples/judge_alignment/
    • evals.py: Baseline and improved judge metrics with alignment measurement
    • __init__.py: Module initialization with main entry points
  • Documentation covers:
    • Why judge alignment matters
    • Dataset structure and loading
    • Defining judge and alignment metrics (a minimal sketch of the alignment calculation follows this list)
    • Running baseline evaluations
    • Iterating on judge prompts to improve alignment
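
The core idea of the guide, measuring how often the LLM judge agrees with the human label, can be summarized in a few lines. The sketch below is illustrative only and does not reproduce the code in `evals.py`; the field names (`human_label`, `judge_verdict`) and the helper `alignment_rate` are hypothetical.

```python
# Illustrative sketch of judge-human alignment; not the actual evals.py code.
# Field names (human_label, judge_verdict) are hypothetical.
from typing import Dict, List


def alignment_rate(rows: List[Dict[str, str]]) -> float:
    """Fraction of examples where the judge's verdict matches the human label."""
    if not rows:
        return 0.0
    matches = sum(1 for r in rows if r["judge_verdict"] == r["human_label"])
    return matches / len(rows)


if __name__ == "__main__":
    # Toy data: two agreements out of three examples -> alignment of ~0.67.
    dataset = [
        {"human_label": "pass", "judge_verdict": "pass"},
        {"human_label": "fail", "judge_verdict": "pass"},
        {"human_label": "fail", "judge_verdict": "fail"},
    ]
    print(f"alignment: {alignment_rate(dataset):.2f}")
```

A higher rate means the judge is a more trustworthy proxy for human review; the guide iterates on the judge prompt and re-measures until this number stops improving.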

Testing

How to Test

  • Manual testing steps:
    1. Run baseline evaluation: uv run python -m ragas_examples.judge_alignment
    2. Verify alignment metrics are calculated correctly (a hedged smoke-test snippet follows this list)
    3. Build docs locally: make serve-docs and navigate to the new guide
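
For a quick automated version of steps 1–2, something like the snippet below could work. It assumes only that the example is runnable as a module, as in the manual command above; the exact output format is not checked.

```python
# Hypothetical smoke test: run the example module and confirm it exits cleanly.
# Assumes `python -m ragas_examples.judge_alignment` is runnable in the current
# environment (dependencies installed, any required API keys set).
import subprocess
import sys

result = subprocess.run(
    [sys.executable, "-m", "ragas_examples.judge_alignment"],
    capture_output=True,
    text=True,
)
assert result.returncode == 0, result.stderr
print(result.stdout)  # inspect the reported alignment metrics manually
```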

@sanjeed5 sanjeed5 marked this pull request as ready for review October 9, 2025 18:34
@dosubot dosubot bot added the size:XL label (This PR changes 500-999 lines, ignoring generated files) Oct 9, 2025
@sanjeed5 sanjeed5 requested a review from shahules786 October 9, 2025 18:36
- Modified .gitignore to include only the .cursor/plans directory, removing unnecessary entries.
- Streamlined the git-pr.md file by replacing the lengthy PR description template with concise instructions for creating a PR using the gh CLI, enhancing usability for contributors.
- Introduced a new module in `examples/ragas_examples/judge_alignment` that includes functions for loading datasets, running evaluations, and calculating alignment metrics.
- Added baseline and improved accuracy metrics for judge evaluations.
- Enhanced documentation with a module docstring outlining the available functions and metrics.
- Add error analysis section with false positive/negative breakdown (illustrated in the sketch after this commit list)
- Include improved v2 prompt with abbreviation guide and evaluation approach
- Add ground truth quality guidance and tips on prompt optimization
- Update evals.py with complete v2 prompt and main_v2() function
- Fix model name from gpt-5-mini to gpt-4o-mini
- Add collapsible output examples for baseline and v2 results
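
The error-analysis commit above splits judge–human disagreements into false positives (the judge passes a response the human failed) and false negatives (the judge fails a response the human passed). A minimal illustration, with hypothetical field names rather than the ones used in `evals.py`:

```python
# Illustrative false positive / false negative breakdown for a pass/fail judge.
# Field names are hypothetical; the real evals.py may structure rows differently.
from typing import Dict, List


def error_breakdown(rows: List[Dict[str, str]]) -> Dict[str, int]:
    counts = {"false_positive": 0, "false_negative": 0, "agreement": 0}
    for r in rows:
        judge, human = r["judge_verdict"], r["human_label"]
        if judge == human:
            counts["agreement"] += 1
        elif judge == "pass" and human == "fail":
            counts["false_positive"] += 1  # judge too lenient
        else:
            counts["false_negative"] += 1  # judge too strict
    return counts


print(error_breakdown([
    {"human_label": "fail", "judge_verdict": "pass"},  # false positive
    {"human_label": "pass", "judge_verdict": "fail"},  # false negative
    {"human_label": "pass", "judge_verdict": "pass"},  # agreement
]))
```

False positives usually point to failure criteria the judge prompt is missing (for example, the abbreviation cases the v2 prompt adds guidance for), while false negatives suggest the prompt is overly strict.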
@sanjeed5 sanjeed5 force-pushed the docs/align-llm-judge branch from 6015f1e to cc6bd52 on October 9, 2025 18:47
@shahules786 shahules786 merged commit f31c365 into main Oct 14, 2025
8 checks passed
@shahules786 shahules786 deleted the docs/align-llm-judge branch October 14, 2025 15:45