Conversation

@jjmachan (Member) commented Sep 27, 2025

Summary

This PR adds persistence capabilities and better string representations for LLM-based metrics, making them easier to save, share, and debug.

Changes

1. Save/Load Functionality

  • Added save() and load() methods to SimpleLLMMetric and its subclasses (DiscreteMetric, NumericMetric, RankingMetric)
  • Supports JSON format with optional gzip compression
  • Handles all prompt types including Prompt and DynamicFewShotPrompt
  • Smart defaults: metric.save() saves to ./<metric_name>.json (see the sketch after this list)
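
A minimal sketch of how the path handling in save() could work (hypothetical code; _to_dict() and the exact attribute names are assumptions, not the PR's actual implementation):

import gzip
import json
import typing as t
from pathlib import Path

def save(self, path: t.Optional[str] = None) -> None:
    # Default: ./<metric_name>.json, per the summary above
    target = Path(path) if path else Path(f"./{self.name}.json")
    if target.is_dir():
        # A directory was given: append the default file name
        target = target / f"{self.name}.json"
    payload = json.dumps(self._to_dict(), indent=2)  # _to_dict() is a hypothetical serializer
    if target.suffix == ".gz":
        # Optional gzip compression, triggered by the .gz extension
        with gzip.open(target, "wt", encoding="utf-8") as f:
            f.write(payload)
    else:
        target.write_text(payload, encoding="utf-8")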

2. Improved __repr__ Methods

  • Clean, informative string representations for both LLM-based and decorator-based metrics
  • Removed implementation details (memory addresses, <locals>, internal attributes)
  • Smart prompt truncation (80 chars max)
  • Function signature display for decorator-based metrics

Before:

create_metric_decorator.<locals>.decorator_factory.<locals>.decorator.<locals>.CustomMetric(name='summary_accuracy', _func=<function summary_accuracy at 0x151ffdf80>, ...)

After:

# LLM-based metrics
DiscreteMetric(name='response_quality', allowed_values=['correct', 'incorrect'], prompt='Evaluate if the response...')

# Decorator-based metrics  
summary_accuracy(user_input, response) -> DiscreteMetric[['pass', 'fail']]
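
A rough sketch of the truncation behavior behind the new repr (assumed shape, not the exact implementation):

def __repr__(self) -> str:
    # Keep the repr readable: cap prompts at 80 characters (limit per the summary)
    prompt_text = str(self.prompt)
    if len(prompt_text) > 80:
        prompt_text = prompt_text[:77] + "..."
    return (
        f"{type(self).__name__}(name={self.name!r}, "
        f"allowed_values={self.allowed_values!r}, prompt={prompt_text!r})"
    )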

3. Response Model Handling

  • Added create_auto_response_model() factory to mark auto-generated models
  • Only warns about custom response models during save, not standard ones (a possible tagging approach is sketched below)
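
One way such a factory could tag auto-generated models so save() knows when to stay quiet (field syntax follows pydantic's create_model; the marker attribute name is an assumption):

from pydantic import create_model

def create_auto_response_model(name: str, **fields):
    # Build the model as usual, then mark it as auto-generated so save()
    # warns only when a custom response model would be lost
    model = create_model(name, **fields)
    model.__ragas_auto_generated__ = True  # hypothetical marker attribute
    return model

# e.g. create_auto_response_model("Response", value=(str, ...))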

Usage Examples

# Save metric with default path
metric.save()  # → ./response_quality.json

# Save with custom path
metric.save("custom.json")
metric.save("/path/to/metrics/")  # → /path/to/metrics/response_quality.json
metric.save("compressed.json.gz")  # Compressed

# Load metric
loaded_metric = DiscreteMetric.load("response_quality.json")

# For DynamicFewShotPrompt metrics
loaded_metric = DiscreteMetric.load("metric.json", embedding_model=embeddings)

Testing

  • Comprehensive test suite with 8 tests covering all save/load scenarios
  • Tests for default paths, directory handling, compression
  • Tests for all prompt types and metric subclasses (an illustrative example follows)
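
For illustration, a round-trip test in such a suite might look like this (import path and constructor arguments assumed):

from ragas.metrics import DiscreteMetric  # import path assumed

def test_save_load_roundtrip(tmp_path):
    metric = DiscreteMetric(
        name="response_quality",
        allowed_values=["correct", "incorrect"],
        prompt="Evaluate if the response is correct: {response}",
    )
    path = tmp_path / "response_quality.json"
    metric.save(str(path))

    loaded = DiscreteMetric.load(str(path))
    assert loaded.name == metric.name
    assert loaded.allowed_values == metric.allowed_values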

Dependencies

Note: This PR builds on #2316 (Fix metric inheritance patterns) and requires it to be merged first. The changes here depend on the cleaned-up metric inheritance structure from that PR.

Checklist

  • Tests added
  • Documentation in docstrings
  • Backwards compatible (new functionality only)
  • Follows TDD practices

@dosubot dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Sep 27, 2025
@jjmachan jjmachan changed the title Feat/save llm based metric feat: Add save/load functionality and improved repr for LLM-based metrics Sep 27, 2025
@anistark (Contributor) left a comment

Looks good overall. Two minor suggestions added.

Also, the merge conflict needs to be fixed.

prompt: t.Optional[t.Union[str, "Prompt"]] = None
_response_model: t.Type["BaseModel"] = field(init=False)

def __post_init__(self):
Contributor

Some metrics can be created without prompts but will fail during save.

Either we can make prompts required here, or allow None in serialization.

Member Author

Good catch - SimpleLLMMetric will always have a prompt, so I'll make it required.

@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:XXL This PR changes 1000+ lines, ignoring generated files. labels Sep 30, 2025

openhands-ai bot commented Sep 30, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • CI

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #2320 at branch `feat/save-llm-based-metric`

Feel free to include any additional details that might help me get this PR into a better state.


@jjmachan (Member Author) left a comment

I've addressed everything - could you take a look and merge it if everything looks good?

@anistark anistark merged commit 7458687 into main Oct 1, 2025
12 checks passed
@anistark anistark deleted the feat/save-llm-based-metric branch October 1, 2025 06:53
jjmachan pushed a commit that referenced this pull request Oct 1, 2025