Conversation

@bofenghuang (Contributor) commented Aug 8, 2025

Hello 👋,

This PR enables the use of Gemini models as a judge to obtain weighted summed scores in G-Eval.

Currently, there is no generate_raw_response-like function to get logprobs for Gemini, so G-Eval falls back to the generate function, which only produces the final sampled score. This PR:

  • Adds generate_raw_response and a_generate_raw_response functions
  • Caps top_logprobs at 19, since Gemini only supports values in [0, 20)
  • Adds a transform_gemini_to_openai_like function to convert Gemini output into an OpenAI-like format, so we can reuse the existing post-processing code (calculate_weighted_summed_score); see the sketch after this list
  • Adds an initial unit test
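
For readers skimming the thread, here is a minimal, self-contained sketch of the conversion idea. The input shape (chosen_tokens / top_candidates), the function name, and the cap constant are placeholders chosen for illustration, not the actual signatures in this PR's diff; only the OpenAI-side nesting (choices[0].logprobs.content[*].top_logprobs) follows the standard chat-completions layout.

```python
# Illustrative only: how Gemini per-token logprob data could be reshaped into an
# OpenAI-chat-completions-like payload so existing post-processing can be reused.

GEMINI_TOP_LOGPROBS_MAX = 19  # Gemini only accepts logprobs values in [0, 20)


def to_openai_like(
    chosen_tokens: list[tuple[str, float]],
    top_candidates: list[list[tuple[str, float]]],
) -> dict:
    """Rebuild an OpenAI-shaped logprobs payload.

    chosen_tokens: (token, logprob) for each generated position.
    top_candidates: for each position, the top-k alternative (token, logprob) pairs.
    """
    content = []
    for (token, logprob), alternatives in zip(chosen_tokens, top_candidates):
        content.append(
            {
                "token": token,
                "logprob": logprob,
                "top_logprobs": [{"token": t, "logprob": lp} for t, lp in alternatives],
            }
        )
    # Mirrors the nesting of an OpenAI chat completion, so code that already
    # walks choices[0]["logprobs"]["content"] can be reused unchanged.
    return {"choices": [{"logprobs": {"content": content}}]}
```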

One issue is that Gemini splits the default rubric upper bound 10 into the tokens "1" and "0", so with the current version of calculate_weighted_summed_score we need to set the upper bound to less than 10 (illustrated in the sketch below).
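
To make the tokenization issue concrete: the weighted summed score is a probability-weighted sum over single-token integer candidates at the score position, so a score whose text spans two tokens can never appear among the candidates. A minimal sketch of that idea (not deepeval's actual calculate_weighted_summed_score):

```python
import math


def weighted_score(top_logprobs: dict[str, float], lo: int = 0, hi: int = 9) -> float:
    """Probability-weighted sum over integer score candidates at the score position."""
    weights: dict[int, float] = {}
    for token, logprob in top_logprobs.items():
        stripped = token.strip()
        if stripped.isdigit() and lo <= int(stripped) <= hi:
            # Keep the larger probability if the same digit appears twice (e.g. " 8" and "8").
            weights[int(stripped)] = max(weights.get(int(stripped), 0.0), math.exp(logprob))
    if not weights:
        raise ValueError("no integer score candidates found in top_logprobs")
    total = sum(weights.values())
    return sum(score * p / total for score, p in weights.items())


# Because Gemini emits "10" as two tokens ("1" then "0"), a single candidate token
# can never be "10", which is why the rubric upper bound has to stay <= 9 for now.
print(weighted_score({"8": math.log(0.6), "9": math.log(0.3), "7": math.log(0.1)}))
# -> 8.2 (0.6*8 + 0.3*9 + 0.1*7)
```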

@vercel (bot) commented Aug 8, 2025

@bofenghuang is attempting to deploy a commit to the Confident AI Team on Vercel.

A member of the Team first needs to authorize it.

@bofenghuang changed the title from "Add logprobs for gemini" to "Add Gemini judge support with weighted scoring in G-Eval" on Aug 8, 2025
@penguine-ip (Contributor)

Hey @bofenghuang, thanks for the PR! Can we add a quick test for G-Eval as well (e.g., supplying the Gemini model to G-Eval in code)? Thanks!

@bofenghuang force-pushed the feat/gemini branch 2 times, most recently from 0a901bc to b4f8a8d on August 9, 2025, 09:34
@bofenghuang (Contributor, Author)

Hey @penguine-ip,

Thanks for your review!

Just rebased and added a mocked unit test plus a live test, where the live one reads the Gemini API config from .deepeval/.deepeval.

@bofenghuang (Contributor, Author)

Hi @penguine-ip , is this ready to be merged? Thanks!

@penguine-ip (Contributor)

Hey @bofenghuang, yes, just one more thing. Would it be better to raise an AttributeError if the function does fail? Right now in G-Eval we catch that gracefully. Let me know your thoughts.

@bofenghuang (Contributor, Author)

Hey @penguine-ip,
IMO these fallbacks are very useful, e.g., falling back to a_generate if a_generate_raw_response doesn't exist, or to the greedy score if calculate_weighted_summed_score fails.
But I'd prefer to have a trace instead of silently falling back, since it took me a while as a new user to realize that Gemini didn't work with the weighted summed score. Maybe we could add this to verbose_logs?
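
For concreteness, a tiny sketch of the "fall back, but leave a trace" behaviour being discussed; the function and parameter names are placeholders, not deepeval's actual G-Eval internals:

```python
from typing import Callable


def score_with_trace(
    weighted_scorer: Callable[[], float],
    greedy_scorer: Callable[[], float],
    verbose_logs: list,
) -> float:
    """Prefer the weighted-summed score, but fall back to the greedy score and
    record the fallback instead of hiding it."""
    try:
        score = weighted_scorer()
        verbose_logs.append(f"Score: {score} (weighted summation=True)")
    except (AttributeError, ValueError) as exc:
        score = greedy_scorer()
        verbose_logs.append(f"Score: {score} (weighted summation=False, fell back because: {exc})")
    return score
```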

@penguine-ip (Contributor)

Hey @bofenghuang, yes, actually that would be great. We can add something like `"Score: X (weighted summation=True)"`. Does this sound feasible?

@bofenghuang (Contributor, Author)

Yes, that would be great! Maybe we could do this in another PR.

@bofenghuang (Contributor, Author)

Hi @penguine-ip, do you think this PR is ready to merge? Anything you'd like to add? Thanks!

@bofenghuang (Contributor, Author)

Hey @penguine-ip, does this added warning look good to you?

@penguine-ip (Contributor)

Hey @bofenghuang, is it possible to remove the warning? It will scare users off. The `Score: X (weighted=true/false)` message would be good enough.

@bofenghuang (Contributor, Author)

Hey @penguine-ip, I reverted it. Could you add it later? I'm not fully sure I get what you mean, and it's a bit out of scope for this PR 🙂

@bofenghuang (Contributor, Author)

Hello @penguine-ip, just rebased the PR due to a conflict. Could you add the message? Thanks in advance.
