[AI Evaluation] EquivalenceEvaluator is not producing an answer #6814

@mikeholczer

Description

When I use the EquivalenceEvaluator, my test comes back Inconclusive. I did some debugging and found that if I use the debugger to increase MaxOutputTokens from 1 to 2, it starts working as I'd expect. I'm not sure why it needs more than 1 output token, since the response should be a number from 1 to 5, but that's what I'm seeing. The diagnostic message is:

"Failed to parse numeric score for 'Equivalence' from the following text:"

https://github.com/dotnet/extensions/blob/f04bd506a04b585486be20fb7b08157162f2f527/src/Libraries/Microsoft.Extensions.AI.Evaluation.Quality/EquivalenceEvaluator.cs#L55C14-L55C33

Reproduction Steps

var messages = new List<ChatMessage>
{
    new(ChatRole.System, "You are a helpful assistant."),
    new(ChatRole.Assistant, "What's the 3rd planet from the sun?")
};

var response = new ChatResponse(new ChatMessage(ChatRole.Assistant, "The Earth is the 3rd planet."));

var chatConfig = new ChatConfiguration(chatClient.AsBuilder().Build());

var equivalenceEvaluatorContext = new EquivalenceEvaluatorContext("The 3rd planet from the sun is the Earth.");

var equivalenceEvaluator = new EquivalenceEvaluator();
var evaluationResult =
    await equivalenceEvaluator.EvaluateAsync(messages, response, chatConfig, additionalContext: [equivalenceEvaluatorContext]);

Debug.WriteLine(evaluationResult.Metrics.Single().Value.Diagnostics?.Single().Message);

Expected behavior

evaluationResult.Metrics.Single().Value.Interpretation.Rating to be Exceptional

Actual behavior

evaluationResult.Metrics.Single().Value.Diagnostics?.Single().Message is: "Failed to parse numeric score for 'Equivalence' from the following text:"

In the debugger, one can see that the model stopped generating output because the max token limit was hit.

Regression?

No response

Known Workarounds

Attach the debugger and change _clientOptions.MaxOutputTokens to 2 before it's used.
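The same change could probably be made without a debugger by wrapping the chat client so that any request with a lower MaxOutputTokens is raised to a minimum. The following is an untested sketch, not a confirmed fix. MinOutputTokensChatClient is a hypothetical helper of my own naming, not part of Microsoft.Extensions.AI. It assumes a Microsoft.Extensions.AI version in which DelegatingChatClient exposes GetResponseAsync and GetStreamingResponseAsync, and that the evaluator passes its options through the IChatClient given to ChatConfiguration.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;

// Hypothetical helper (not part of Microsoft.Extensions.AI): forwards every call
// to the inner client, raising MaxOutputTokens to at least minTokens first.
public sealed class MinOutputTokensChatClient(IChatClient inner, int minTokens)
    : DelegatingChatClient(inner)
{
    public override Task<ChatResponse> GetResponseAsync(
        IEnumerable<ChatMessage> messages,
        ChatOptions? options = null,
        CancellationToken cancellationToken = default)
        => base.GetResponseAsync(messages, Raise(options), cancellationToken);

    public override IAsyncEnumerable<ChatResponseUpdate> GetStreamingResponseAsync(
        IEnumerable<ChatMessage> messages,
        ChatOptions? options = null,
        CancellationToken cancellationToken = default)
        => base.GetStreamingResponseAsync(messages, Raise(options), cancellationToken);

    // Returns the options unchanged unless MaxOutputTokens is set below the minimum.
    // The options are cloned so the caller's instance is not mutated.
    private ChatOptions? Raise(ChatOptions? options)
    {
        if (options is { MaxOutputTokens: int max } && max < minTokens)
        {
            options = options.Clone();
            options.MaxOutputTokens = minTokens;
        }

        return options;
    }
}
```

With this in place, the chatConfig line in the reproduction steps would become new ChatConfiguration(new MinOutputTokensChatClient(chatClient, minTokens: 2)).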

Configuration

Using .NET 9, and getting the IChatClient from AnthropicClient.Messages.AsBuilder().Build() with the Anthropic.SDK NuGet package, using the AnthropicModels.Claude4Sonnet model id.

Other information

No response

Metadata

Labels

area-ai-eval: Microsoft.Extensions.AI.Evaluation and related
bug: This issue describes a behavior which is not expected - a bug.
