[AI Evaluation] EquivalenceEvaluator is not producing an answer

### Description

When I use the `EquivalenceEvaluator`, the result comes back `Inconclusive`. While debugging, I found that if I use the debugger to increase `MaxOutputTokens` from 1 to 2, it starts working as I'd expect. I'm not sure why it needs more than 1 output token, since the response should be a number from 1 to 5, but that's what I'm seeing. The diagnostic reported is:

"Failed to parse numeric score for 'Equivalence' from the following text:"


https://github.com/dotnet/extensions/blob/f04bd506a04b585486be20fb7b08157162f2f527/src/Libraries/Microsoft.Extensions.AI.Evaluation.Quality/EquivalenceEvaluator.cs#L55C14-L55C33

### Reproduction Steps

```csharp
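using System.Diagnostics;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;

// `chatClient` is the IChatClient described under Configuration below.
var messages = new List<ChatMessage>
{
    new(ChatRole.System, "You are a helpful assistant."),
    new(ChatRole.User, "What's the 3rd planet from the sun?")
};

// The response being evaluated.
var response = new ChatResponse(new ChatMessage(ChatRole.Assistant, "The Earth is the 3rd planet."));

var chatConfig = new ChatConfiguration(chatClient.AsBuilder().Build());

// Ground truth that the response is compared against.
var equivalenceEvaluatorContext = new EquivalenceEvaluatorContext("The 3rd planet from the sun is the Earth.");

var equivalenceEvaluator = new EquivalenceEvaluator();
var evaluationResult =
    await equivalenceEvaluator.EvaluateAsync(messages, response, chatConfig, additionalContext: [equivalenceEvaluatorContext]);

// Prints the diagnostic attached to the single Equivalence metric.
Debug.WriteLine(evaluationResult.Metrics.Single().Value.Diagnostics?.Single().Message);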
```


### Expected behavior

`evaluationResult.Metrics.Single().Value.Interpretation.Rating` should be `Exceptional`.

### Actual behavior

`evaluationResult.Metrics.Single().Value.Diagnostics?.Single().Message` is: `Failed to parse numeric score for 'Equivalence' from the following text:`. In the debugger, one can see that the model stopped generating output because the max token limit was hit.

### Regression?

_No response_

### Known Workarounds

Attach the debugger [here](https://github.com/dotnet/extensions/blob/f04bd506a04b585486be20fb7b08157162f2f527/src/Libraries/Microsoft.Extensions.AI.Evaluation.Quality/EquivalenceEvaluator.cs#L99) and change `_clientOptions.MaxOutputTokens` to `2` before it is used.
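
A debugger-free variant of the same workaround might be possible by wrapping the chat client so that a too-small token limit is raised before the request reaches the model. The following is an untested sketch, assuming the `ConfigureOptions` extension on `ChatClientBuilder` from Microsoft.Extensions.AI; the `EvaluationChatConfig` class, the `WithMinimumOutputTokens` method, and the `minimumOutputTokens` parameter are hypothetical names introduced only for illustration.

```csharp
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;

// Hypothetical helper, not part of the library.
public static class EvaluationChatConfig
{
    // Builds a ChatConfiguration whose client raises MaxOutputTokens to at least
    // `minimumOutputTokens` whenever a caller (here, the evaluator) asks for fewer.
    public static ChatConfiguration WithMinimumOutputTokens(
        IChatClient chatClient,
        int minimumOutputTokens = 2)
    {
        IChatClient wrapped = chatClient
            .AsBuilder()
            .ConfigureOptions(options =>
            {
                // Only touch requests that set an explicit limit below the minimum.
                if (options.MaxOutputTokens is int max && max < minimumOutputTokens)
                {
                    options.MaxOutputTokens = minimumOutputTokens;
                }
            })
            .Build();

        return new ChatConfiguration(wrapped);
    }
}
```

In the reproduction above, `chatConfig` would then be created with `EvaluationChatConfig.WithMinimumOutputTokens(chatClient)` instead of `new ChatConfiguration(chatClient.AsBuilder().Build())`. This only hides the symptom; it does not explain why the model needs more than one output token.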

### Configuration

Using .NET 9, with the `IChatClient` obtained from `AnthropicClient.Messages.AsBuilder().Build()` (Anthropic.SDK NuGet package) and the `AnthropicModels.Claude4Sonnet` model id.

### Other information

_No response_

[AI Evaluation] EquivalenceEvaluator is not producing an answer #6814