Conversation

AlanPonnachan
Contributor

Issue Link / Problem Description

  • Fixes #2279 (feat: Addition of Risk-Control Metrics for Trustworthy RAG Evaluation)
  • Problem: The ragas library currently excels at evaluating the quality of generated answers but lacks metrics to assess a RAG system's trustworthiness and risk-control mechanisms. Specifically, it cannot measure a system's ability to recognize uncertainty and proactively abstain from answering when the retrieved context is insufficient or irrelevant. This is a critical capability for deploying reliable RAG systems in production and safety-critical domains.

Changes Made

  • Added _risk_control.py: Introduced a new file src/ragas/metrics/_risk_control.py which contains the implementation for a new suite of four interconnected metrics:
    • Risk: Measures the probability of a "risky" answer (lower is better).
    • Carefulness: Measures the ability to correctly discard unanswerable questions.
    • Alignment: Measures the overall accuracy of the keep/discard decision.
    • Coverage: Measures the proportion of questions the system attempts to answer.
  • Added risk_control_suite factory function: This function efficiently initializes all four metrics, sharing a single calculation pass over the dataset to improve performance.
  • Updated metrics/__init__.py: Exposed the new metrics (Risk, Carefulness, Alignment, Coverage) and the risk_control_suite factory function so they are accessible to users (see the usage sketch after this list).
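For reference, a minimal usage sketch of the suite as described above. The imports follow the exposed API; the sample rows and column names ("abstained", "answerable") are illustrative assumptions, not the actual schema required by _risk_control.py:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import risk_control_suite

# Hypothetical rows: each records the question, the system's answer (empty if it
# abstained), and whether the question was answerable from the retrieved context.
dataset = Dataset.from_list([
    {"question": "Q1", "answer": "A1", "abstained": False, "answerable": True},
    {"question": "Q2", "answer": "", "abstained": True, "answerable": False},
])

metrics = risk_control_suite(dataset)  # one shared calculation pass for all four metrics
result = evaluate(dataset, metrics=metrics)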

Testing

How to Test

  • Automated tests added: a new test file, tests/unit/test_risk_control.py, contains comprehensive unit tests that verify:
    • Correct calculation of all four metrics on a sample dataset.
    • Correct handling of edge cases (e.g., no "kept" answers, no "unanswerable" questions).
    • Proper error handling for missing required columns.
  • To run the tests:
    pytest tests/unit/test_risk_control.py

@dosubot (bot) added the size:L label (This PR changes 100-499 lines, ignoring generated files) on Sep 15, 2025
Contributor

@anistark left a comment

Thanks for the PR @AlanPonnachan

A few questions that come up:

  • What happens if the underlying dataset changes after metric
    creation?
  • How does this scale with very large datasets?
  • Each _single_turn_ascore call recalculates for the
    entire dataset - is this intentional?

Please also update the docs and look into the failing CI. You can check locally by running make run-ci.

While you're at it, please also rebase onto main.

"""
Factory function to create the suite of four risk-control metrics.
"""
calculator = _RiskControlCalculator(dataset)

Add validation for required columns.


def risk_control_suite(dataset: Dataset) -> list[Metric]:
"""
Factory function to create the suite of four risk-control metrics.

Add more details on usage for docs.

@AlanPonnachan
Contributor Author

Hi @anistark ,

Thank you so much for the insightful feedback on the initial proposal. It helped me understand the core ragas evaluation architecture much more deeply.

Based on your feedback, I've completely refactored the implementation to address every concern. Before I update the PR, I wanted to propose the new design here to make sure we're aligned.

Understanding the Challenge: A Corpus-Level Metric

The core challenge, as I now understand it, is that metrics like Risk and Carefulness are fundamentally corpus-level (or dataset-level). The score is an aggregate (Total UK / Total Kept) and cannot be calculated from a single row in isolation. This contrasts with row-level metrics like faithfulness.

The evaluate() function, however, is designed to call _single_turn_ascore() on a per-row basis. The design challenge is to bridge this gap in an efficient and stateless way that respects the ragas architecture.
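To make this concrete, here is a toy illustration (not the PR's implementation; the flags and the exact formulas are assumptions consistent with the metric descriptions above, reading "UK" as unanswerable-but-kept):

# Each row records whether the system kept (answered) the question and whether
# the question was actually answerable from the retrieved context.
rows = [
    {"kept": True, "answerable": True},    # answered an answerable question
    {"kept": True, "answerable": False},   # risky: answered an unanswerable one
    {"kept": False, "answerable": False},  # correctly discarded
    {"kept": False, "answerable": True},   # over-cautious discard
]

kept = [r for r in rows if r["kept"]]
unanswerable = [r for r in rows if not r["answerable"]]

risk = sum(not r["answerable"] for r in kept) / len(kept)                   # UK / Kept = 0.5
carefulness = sum(not r["kept"] for r in unanswerable) / len(unanswerable)  # 0.5
alignment = sum(r["kept"] == r["answerable"] for r in rows) / len(rows)     # 0.5
coverage = len(kept) / len(rows)                                            # 0.5

# None of these numbers can be derived from any single row on its own.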

Proposed Solution: Lazy, Cached Corpus Calculation

The proposed solution is to make the metric objects themselves stateless and perform a lazy, one-time calculation for the entire dataset on the first row processed. The results are then cached for all subsequent rows in that evaluation run.

This is achieved by leveraging the internal __ragas_dataset__ attribute that the evaluate function attaches to each sample.

Here’s a snapshot of the new design pattern in _risk_control.py:

from dataclasses import dataclass
from weakref import WeakKeyDictionary

from ragas.dataset_schema import SingleTurnSample
# ... other imports (Dataset, SingleTurnMetric, Callbacks, etc.)

# A module-level cache to store results per dataset object
_calculator_cache: WeakKeyDictionary[Dataset, dict[str, float]] = WeakKeyDictionary()

def _calculate_scores_for_dataset(dataset: Dataset) -> dict[str, float]:
    """
    Performs the one-time, full-dataset calculation and caches the result.
    """
    if dataset in _calculator_cache:
        return _calculator_cache[dataset]
    
    # ... full scan and calculation logic ...
    
    _calculator_cache[dataset] = scores
    return scores


@dataclass
class Risk(SingleTurnMetric):
    name: str = "risk"
    # ...

    async def _single_turn_ascore(self, sample: SingleTurnSample, callbacks: Callbacks) -> float:
        """
        For each row, get a reference to the parent dataset and compute scores.
        The calculation only runs on the first call; subsequent calls are instant cache hits.
        """
        dataset = getattr(sample, "__ragas_dataset__")
        scores = _calculate_scores_for_dataset(dataset)
        return scores["risk"]

# Users now import the singleton instances directly, no factory needed.
risk = Risk()
carefulness = Carefulness()

@anistark
Contributor

@AlanPonnachan This sounds like a good plan.

A few thoughts:

  1. We won't be able to use metrics outside evaluate(), as there'll be no __ragas_dataset__. So we also won't be able to test individual metrics in isolation.
  2. The cache won't update if the dataset changes:
dataset = Dataset.from_list([...])
results1 = evaluate(dataset, [risk])        # Calculates and caches

dataset = dataset.add_item(new_data)  # Dataset changed
results2 = evaluate(dataset, [risk])        # Uses old cached results
  3. If ragas changes internals, e.g. renames __ragas_dataset__ to __dataset__, this will break. Granted, that's something that would likely be updated alongside, but it's better not to hardcode it.

How about a hybrid approach that combines a factory for control with singleton instances for standard use?

@AlanPonnachan
Contributor Author

@anistark

I've designed a new hybrid solution that I believe addresses every point.

The new design isolates the core calculation logic into a pure, standalone function that is completely independent of the ragas framework. The metric classes then act as thin, thread-safe wrappers around this testable core.

This is managed through a factory function, risk_control_suite(), which creates a shared cache for a single evaluate() run, ensuring efficiency and safety.

# FILE: src/ragas/metrics/_risk_control.py

import asyncio
from dataclasses import dataclass, field
# ... other imports (Dataset, Metric, SingleTurnMetric, SingleTurnSample)

def _calculate_scores_for_dataset(dataset: Dataset) -> dict[str, float]:
    """The PURE, TESTABLE core logic. No framework dependencies."""
    # ... (full calculation logic) ...

@dataclass
class _RiskCache:
    scores: dict[str, float] | None = None
    lock: asyncio.Lock = field(default_factory=asyncio.Lock)

def risk_control_suite() -> list[Metric]:
    """Factory to create the suite with a shared, run-specific cache."""
    cache = _RiskCache()
    return [Risk(cache=cache), Carefulness(cache=cache), ...]

@dataclass(kw_only=True)
class Risk(SingleTurnMetric):
    cache: _RiskCache = field(default_factory=_RiskCache)

    async def _single_turn_ascore(self, sample: SingleTurnSample, ...) -> float:
        async with self.cache.lock: # Prevent race conditions
            if self.cache.scores is None:
                dataset = getattr(sample, "__ragas_dataset__")
                self.cache.scores = _calculate_scores_for_dataset(dataset)
        return self.cache.scores["risk"]
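As a usage sketch (assuming the interfaces above; dataset stands for any Dataset prepared with the required columns):

from ragas import evaluate

# The factory wires all four metrics to one fresh, shared cache, so the
# corpus-level scan runs once per evaluate() call and later rows hit the cache.
metrics = risk_control_suite()
result = evaluate(dataset, metrics=metrics)

# A second run builds a new suite, and therefore a new cache: no stale state.
result2 = evaluate(dataset, metrics=risk_control_suite())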

How This Addresses Your Feedback:

  1. Testability: The core logic now lives in _calculate_scores_for_dataset, which is a pure function. Our unit tests can call it directly, making the metrics fully testable in isolation (see the test sketch after this list).

  2. Cache Staleness: The cache is no longer global. The risk_control_suite() factory creates a fresh cache for each evaluate() run, so state cannot leak between runs. This guarantees correctness.

  3. Fragile __ragas_dataset__: The dependency is now architecturally contained. The core, testable logic is pure, and the fragile getattr call is isolated in a thin adapter layer (_single_turn_ascore), minimizing the "blast radius" of any future internal changes in ragas. This seems to be a necessary trade-off for any corpus-level metric that needs to integrate seamlessly into the existing row-level evaluate() loop.

  4. Hybrid Approach: The risk_control_suite() factory provides the controlled, efficient path, while each metric can still be instantiated on its own with its own default cache. I've also added an asyncio.Lock so that concurrent _single_turn_ascore calls are safe.
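Regarding point 1, a unit test along these lines can exercise the core directly; the column names and the expected value are assumptions that follow the Total UK / Total Kept definition from earlier:

from datasets import Dataset

def test_risk_on_toy_dataset():
    # Two kept answers, one to an unanswerable question -> assumed risk of 1/2.
    toy = Dataset.from_list([
        {"question": "Q1", "kept": True, "answerable": True},
        {"question": "Q2", "kept": True, "answerable": False},
    ])
    scores = _calculate_scores_for_dataset(toy)
    assert scores["risk"] == 0.5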

This design feels like a great balance of robustness and simplicity. If this direction looks good to you, I will update the PR with this final implementation.
