feat(metrics): Add Risk-Control Metric Suite #2283
Conversation
Thanks for the PR @AlanPonnachan

A few questions that come up:

- What happens if the underlying dataset changes after metric creation?
- How does this scale with very large datasets?
- Each `_single_turn_ascore` call recalculates for the entire dataset - is this intentional?

Please also update the docs and check out the failing CI. You can check locally by running `make run-ci`.

While you're at it, please also rebase with main.
""" | ||
Factory function to create the suite of four risk-control metrics. | ||
""" | ||
calculator = _RiskControlCalculator(dataset) |
Add validation for required columns.
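For example (my sketch only; the required column names aren't visible in this hunk, so `REQUIRED_COLUMNS` below is a hypothetical placeholder), the factory could fail fast before building the calculator:

```python
from datasets import Dataset

# Hypothetical set of required columns; replace with the suite's real requirements.
REQUIRED_COLUMNS = {"user_input", "response"}


def _validate_columns(dataset: Dataset) -> None:
    missing = REQUIRED_COLUMNS - set(dataset.column_names)
    if missing:
        raise ValueError(f"Dataset is missing required columns: {sorted(missing)}")
```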
```python
def risk_control_suite(dataset: Dataset) -> list[Metric]:
    """
    Factory function to create the suite of four risk-control metrics.
    """
```
Add more details on usage for docs.
Hi @anistark, thank you so much for the insightful feedback on the initial proposal. It helped me understand the core challenge. Based on your feedback, I've completely refactored the implementation to address every concern. Before I update the PR, I wanted to propose the new design here to make sure we're aligned.

**Understanding the Challenge: A Corpus-Level Metric**

The core challenge, as I now understand it, is that metrics like `Risk` are corpus-level: they need the entire dataset to be computed, while ragas scores samples one row at a time.

**The Proposed Solution: Lazy, Cached Corpus Calculation**

The proposed solution is to make the metric objects themselves stateless and perform a lazy, one-time calculation for the entire dataset on the first row processed. The results are then cached for all subsequent rows in that evaluation run. This is achieved by leveraging the internal `__ragas_dataset__` reference on each sample, as shown below.

Here's a snapshot of the new design pattern in `_risk_control.py`:

```python
from weakref import WeakKeyDictionary
from ragas.dataset_schema import SingleTurnSample
# ... other imports

# A module-level cache to store results per dataset object
_calculator_cache: WeakKeyDictionary[Dataset, dict[str, float]] = WeakKeyDictionary()


def _calculate_scores_for_dataset(dataset: Dataset) -> dict[str, float]:
    """
    Performs the one-time, full-dataset calculation and caches the result.
    """
    if dataset in _calculator_cache:
        return _calculator_cache[dataset]
    # ... full scan and calculation logic ...
    _calculator_cache[dataset] = scores
    return scores


@dataclass
class Risk(SingleTurnMetric):
    name: str = "risk"
    # ...

    async def _single_turn_ascore(
        self, sample: SingleTurnSample, callbacks: Callbacks
    ) -> float:
        """
        For each row, get a reference to the parent dataset and compute scores.
        The calculation only runs on the first call; subsequent calls are
        instant cache hits.
        """
        dataset = getattr(sample, "__ragas_dataset__")
        scores = _calculate_scores_for_dataset(dataset)
        return scores["risk"]


# Users now import the singleton instances directly, no factory needed.
risk = Risk()
carefulness = Carefulness()
```
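For illustration, here's how the singleton design would look from a user's perspective. This is only a sketch of the intended usage: the import path assumes the instances are re-exported from `ragas.metrics`, and the dataset rows are hypothetical placeholders rather than the suite's actual required columns.

```python
from datasets import Dataset

from ragas import evaluate
from ragas.metrics import risk, carefulness  # singleton instances from this proposal

# Hypothetical rows; real column requirements come from the risk-control calculation.
dataset = Dataset.from_list(
    [
        {"user_input": "What is the capital of France?", "response": "Paris."},
        {"user_input": "Who won the 2087 World Cup?", "response": "I don't know."},
    ]
)

# Both metrics share the one-time, cached corpus calculation.
results = evaluate(dataset, metrics=[risk, carefulness])
print(results)
```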
@AlanPonnachan This sounds like a good plan. A few thoughts:

```python
dataset = Dataset.from_list([...])
results1 = evaluate(dataset, [risk])   # Calculates and caches
dataset = dataset.add_row(new_data)    # Dataset changed
results2 = evaluate(dataset, [risk])   # Uses old cached results
```

How about a hybrid approach that combines a factory for control with singleton instances for standard use?
I've designed a new hybrid solution that I believe addresses every point. The new design isolates the core calculation logic into a pure, standalone function that is completely independent of the metric classes. This is managed through a factory function, `risk_control_suite()`, which creates the metric instances with a shared, run-specific cache.

```python
# FILE: src/ragas/metrics/_risk_control.py
import asyncio
from dataclasses import dataclass, field


def _calculate_scores_for_dataset(dataset: Dataset) -> dict[str, float]:
    """The PURE, TESTABLE core logic. No framework dependencies."""
    # ... (full calculation logic) ...


@dataclass
class _RiskCache:
    scores: dict[str, float] | None = None
    lock: asyncio.Lock = field(default_factory=asyncio.Lock)


def risk_control_suite() -> list[Metric]:
    """Factory to create the suite with a shared, run-specific cache."""
    cache = _RiskCache()
    return [Risk(cache=cache), Carefulness(cache=cache), ...]


@dataclass(kw_only=True)
class Risk(SingleTurnMetric):
    cache: _RiskCache = field(default_factory=_RiskCache)

    async def _single_turn_ascore(
        self, sample: SingleTurnSample, callbacks: Callbacks
    ) -> float:
        async with self.cache.lock:  # Prevent race conditions
            if self.cache.scores is None:
                dataset = getattr(sample, "__ragas_dataset__")
                self.cache.scores = _calculate_scores_for_dataset(dataset)
            return self.cache.scores["risk"]
```

**How This Addresses Your Feedback:** the cache is run-specific and created fresh by `risk_control_suite()`, so results computed for one dataset are never reused for another; the core calculation lives in a pure function with no framework dependencies, so it can be unit-tested directly; and the asyncio lock ensures the full-dataset scan runs exactly once per evaluation, even when rows are scored concurrently.
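As a usage note (my illustrative sketch, not part of the diff; `dataset` and `modified_dataset` are placeholders, and the import path assumes the exports described in the PR description below):

```python
from ragas import evaluate
from ragas.metrics import risk_control_suite

# Each factory call builds metric instances that share one fresh cache.
# `dataset` / `modified_dataset` stand in for the datasets being evaluated.
metrics = risk_control_suite()
results1 = evaluate(dataset, metrics=metrics)

# After the dataset changes, build a new suite so the previous run's
# cached corpus scores are never reused.
metrics = risk_control_suite()
results2 = evaluate(modified_dataset, metrics=metrics)
```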
This design feels like a great balance of robustness and simplicity. If this direction looks good to you, I will update the PR with this final implementation.
Issue Link / Problem Description
The `ragas` library currently excels at evaluating the quality of generated answers but lacks metrics to assess a RAG system's trustworthiness and risk-control mechanisms. Specifically, it cannot measure a system's ability to recognize uncertainty and proactively abstain from answering when the retrieved context is insufficient or irrelevant. This is a critical capability for deploying reliable RAG systems in production and safety-critical domains.

Changes Made
- `_risk_control.py`: Introduced a new file `src/ragas/metrics/_risk_control.py` which contains the implementation for a new suite of four interconnected metrics (see the illustrative sketch after this list):
  - `Risk`: Measures the probability of a "risky" answer (lower is better).
  - `Carefulness`: Measures the ability to correctly discard unanswerable questions.
  - `Alignment`: Measures the overall accuracy of the keep/discard decision.
  - `Coverage`: Measures the proportion of questions the system attempts to answer.
- `risk_control_suite` factory function: This function efficiently initializes all four metrics, sharing a single calculation pass over the dataset to improve performance.
- `metrics/__init__.py`: Exposed the new metrics (`Risk`, `Carefulness`, `Alignment`, `Coverage`) and the `risk_control_suite` factory function to make them accessible to users.
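To make the four definitions above concrete, here is a rough, standalone sketch of how such scores could be computed from per-sample keep/discard outcomes. It reflects my reading of the descriptions only; the flag names (`answered`, `answerable`, `correct`) are hypothetical and this is not the formula set implemented in `_risk_control.py`.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    answered: bool    # did the system keep (answer) the question?
    answerable: bool  # could the question be answered from the retrieved context?
    correct: bool     # was the given answer correct? (only meaningful when answered)


def score_suite(outcomes: list[Outcome]) -> dict[str, float]:
    total = len(outcomes)
    kept = [o for o in outcomes if o.answered]
    unanswerable = [o for o in outcomes if not o.answerable]

    # Coverage: proportion of questions the system attempts to answer.
    coverage = len(kept) / total
    # Risk: share of kept answers that turn out wrong (lower is better).
    risk = sum(not o.correct for o in kept) / len(kept) if kept else 0.0
    # Carefulness: unanswerable questions that were correctly discarded.
    carefulness = (
        sum(not o.answered for o in unanswerable) / len(unanswerable)
        if unanswerable
        else 1.0
    )
    # Alignment: accuracy of the keep/discard decision itself.
    alignment = sum(o.answered == o.answerable for o in outcomes) / total

    return {
        "risk": risk,
        "carefulness": carefulness,
        "alignment": alignment,
        "coverage": coverage,
    }
```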
Testing

How to Test
`tests/unit/test_risk_control.py` has been added. It includes comprehensive unit tests that verify:
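As an illustration of the kind of check such a suite might contain (the helper name `_calculate_scores_for_dataset` is borrowed from the design sketches earlier in this thread, and the fixture rows are hypothetical, not the suite's real required columns), a test could look like:

```python
from datasets import Dataset

from ragas.metrics._risk_control import _calculate_scores_for_dataset


def test_scores_have_expected_keys_and_bounds():
    # Hypothetical fixture; the real tests use whatever columns the suite validates.
    dataset = Dataset.from_list(
        [
            {"user_input": "an answerable question", "response": "a grounded answer"},
            {"user_input": "an unanswerable question", "response": "I don't know."},
        ]
    )
    scores = _calculate_scores_for_dataset(dataset)
    assert set(scores) == {"risk", "carefulness", "alignment", "coverage"}
    assert all(0.0 <= value <= 1.0 for value in scores.values())
```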