Conversation

@bofenghuang (Contributor) commented Aug 8, 2025

Hello 👋,

This PR enables the use of Gemini models as a judge to obtain weighted summed scores in G-Eval.

Currently, there is no generate_raw_response-like function to get logprobs for Gemini, so G-Eval falls back to the generate function, which only produces the final sampled score. This PR:

  • Adds generate_raw_response and a_generate_raw_response functions
  • Caps top_logprobs at 19, since Gemini only supports values in [0, 20)
  • Adds a transform_gemini_to_openai_like function to convert Gemini output into an OpenAI-like format, so we can reuse the existing post-processing code (calculate_weighted_summed_score); see the sketch after this list
  • Adds an initial unit test
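
For readers skimming the thread, here is a minimal, self-contained sketch of the conversion idea. The input shape (chosen_tokens / top_candidates), the function name, and the cap constant are placeholders chosen for illustration, not the actual signatures in this PR's diff; only the OpenAI-side nesting (choices[0].logprobs.content[*].top_logprobs) follows the standard chat-completions layout.

```python
# Illustrative only: how Gemini per-token logprob data could be reshaped into an
# OpenAI-chat-completions-like payload so existing post-processing can be reused.

GEMINI_TOP_LOGPROBS_MAX = 19  # Gemini only accepts logprobs values in [0, 20)


def to_openai_like(
    chosen_tokens: list[tuple[str, float]],
    top_candidates: list[list[tuple[str, float]]],
) -> dict:
    """Rebuild an OpenAI-shaped logprobs payload.

    chosen_tokens: (token, logprob) for each generated position.
    top_candidates: for each position, the top-k alternative (token, logprob) pairs.
    """
    content = []
    for (token, logprob), alternatives in zip(chosen_tokens, top_candidates):
        content.append(
            {
                "token": token,
                "logprob": logprob,
                "top_logprobs": [{"token": t, "logprob": lp} for t, lp in alternatives],
            }
        )
    # Mirrors the nesting of an OpenAI chat completion, so code that already
    # walks choices[0]["logprobs"]["content"] can be reused unchanged.
    return {"choices": [{"logprobs": {"content": content}}]}
```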

One issue is that Gemini splits the default rubric upper bound 10 into the tokens "1" and "0", so with the current version of calculate_weighted_summed_score we need to set the upper bound to less than 10 (illustrated in the sketch below).
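
To make the tokenization issue concrete: the weighted summed score is a probability-weighted sum over single-token integer candidates at the score position, so a score whose text spans two tokens can never appear among the candidates. A minimal sketch of that idea (not deepeval's actual calculate_weighted_summed_score):

```python
import math


def weighted_score(top_logprobs: dict[str, float], lo: int = 0, hi: int = 9) -> float:
    """Probability-weighted sum over integer score candidates at the score position."""
    weights: dict[int, float] = {}
    for token, logprob in top_logprobs.items():
        stripped = token.strip()
        if stripped.isdigit() and lo <= int(stripped) <= hi:
            # Keep the larger probability if the same digit appears twice (e.g. " 8" and "8").
            weights[int(stripped)] = max(weights.get(int(stripped), 0.0), math.exp(logprob))
    if not weights:
        raise ValueError("no integer score candidates found in top_logprobs")
    total = sum(weights.values())
    return sum(score * p / total for score, p in weights.items())


# Because Gemini emits "10" as two tokens ("1" then "0"), a single candidate token
# can never be "10", which is why the rubric upper bound has to stay <= 9 for now.
print(weighted_score({"8": math.log(0.6), "9": math.log(0.3), "7": math.log(0.1)}))
# -> 8.2 (0.6*8 + 0.3*9 + 0.1*7)
```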

@vercel (bot) commented Aug 8, 2025

@bofenghuang is attempting to deploy a commit to the Confident AI Team on Vercel.

A member of the Team first needs to authorize it.

@bofenghuang changed the title from "Add logprobs for gemini" to "Add Gemini judge support with weighted scoring in G-Eval" on Aug 8, 2025
@penguine-ip (Contributor)

Hey @bofenghuang, thanks for the PR! Can we add a quick test for G-Eval as well (e.g., supplying the Gemini model to G-Eval in code)? Thanks!

@bofenghuang force-pushed the feat/gemini branch 2 times, most recently from 0a901bc to b4f8a8d on August 9, 2025, 09:34
@bofenghuang (Contributor, Author)

Hey @penguine-ip,

Thanks for your review!

Just rebased and added a mocked unit test plus a live test, where the live one reads the Gemini API config from .deepeval/.deepeval.

@bofenghuang (Contributor, Author)

Hi @penguine-ip , is this ready to be merged? Thanks!

@penguine-ip (Contributor)

Hey @bofenghuang, yes, just one more thing. Would it be better to raise an AttributeError if the function does fail? Right now in G-Eval we catch that gracefully. Let me know your thoughts.

@bofenghuang (Contributor, Author)

Hey @penguine-ip,
IMO these fallbacks are very useful, e.g., falling back to a_generate if a_generate_raw_response doesn't exist, or to the greedy score if calculate_weighted_summed_score fails.
But I'd prefer to have a trace instead of silently falling back, since it took me a while as a new user to realize that Gemini didn't work with the weighted summed score. Maybe we could add this to verbose_logs?
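
For concreteness, a tiny sketch of the "fall back, but leave a trace" behaviour being discussed; the function and parameter names are placeholders, not deepeval's actual G-Eval internals:

```python
from typing import Callable


def score_with_trace(
    weighted_scorer: Callable[[], float],
    greedy_scorer: Callable[[], float],
    verbose_logs: list,
) -> float:
    """Prefer the weighted-summed score, but fall back to the greedy score and
    record the fallback instead of hiding it."""
    try:
        score = weighted_scorer()
        verbose_logs.append(f"Score: {score} (weighted summation=True)")
    except (AttributeError, ValueError) as exc:
        score = greedy_scorer()
        verbose_logs.append(f"Score: {score} (weighted summation=False, fell back because: {exc})")
    return score
```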

@penguine-ip (Contributor)

Hey @bofenghuang, yes, actually that would be great. We can add something like `"Score: X (weighted summation=True)"`. Does this sound feasible?

@bofenghuang (Contributor, Author)

Yes, that would be great! Maybe we could do this in another PR.

@bofenghuang (Contributor, Author)

Hi @penguine-ip, do you think this PR is ready to merge? Anything you'd like to add? Thanks!

@bofenghuang (Contributor, Author)

Hey @penguine-ip, does this added warning look good to you?

@penguine-ip (Contributor)

Hey @bofenghuang, is it possible to remove the warning? It will scare users off. The `Score: X (weighted=true/false)` message would be good enough.

@bofenghuang (Contributor, Author)

Hey @penguine-ip, I reverted it. Could you add it later? I'm not fully sure I get what you mean, and it's a bit out of scope for this PR 🙂

@bofenghuang (Contributor, Author)

Hello @penguine-ip, just rebased the PR due to a conflict. Could you add the message? Thanks in advance.
