# [BFCL] Refine Evaluation Metric for Multi Turn Categories #733
Merged
Conversation
**Fanjia-Yan** approved these changes (Nov 2, 2024):
LGTM. Aside from the `flag_task_unachievable` change, this PR introduces several fixes:

- Remove the Long Context category's randomness and patch the ground truth accordingly.
- Fix a ground truth error where the missing function has an alternative path that achieves the same result.
**CharlieJCJ** approved these changes (Nov 5, 2024):
Suggested a variable name change and error-log clarity improvements. LGTM.
**VishnuSuresh27** pushed a commit to VishnuSuresh27/gorilla that referenced this pull request (Nov 11, 2024):
…il#733) This PR contains three changes. In ShishirPatil#725, we updated our evaluation metric for multi-turn to be the following:

- During evaluation, if `flag_task_unachievable` is called in a turn, that turn will be marked correct for irrelevance detection, even if other functions were also called in that turn.
- If the model calls `flag_task_unachievable` in a normal (non-irrelevant) turn, that turn will be marked incorrect.

This is an important change and can heavily affect the model's scores. However, we've identified that the model might be overly penalized under this setup. Many models tend to call this function frequently, even in base entries where it may not be appropriate. For instance, when a model encounters an execution error after an incorrect action, it might call `flag_task_unachievable`, assuming the task is unachievable. Without this flag, the model would sometimes continue to explore and arrive at the correct action sequence. So after careful consideration, `flag_task_unachievable` has been removed.

Due to the removal of `flag_task_unachievable`, we've adjusted the evaluation criteria accordingly. Instead of checking whether the model produces no output in a turn with missing function/parameter information, we now assess whether the model can perform correctly once the missing information is provided.

To improve evaluation accuracy, we've added a response checker for all multi-turn categories, which works alongside the state checker:

- **State Checker**: This checker evaluates the state of the instance after each turn, which we previously relied on exclusively. However, some functions (e.g., `get_zipcode_by_city` or `estimate_distance`) don't directly alter the state, making it unclear whether the model actually invoked them.
- **Response Checker**: The new checker compares the model's execution result against the ground truth execution result, ensuring that the model's result encompasses the ground truth (i.e., the ground truth must be a subset of the model's result).

With this addition, an entry will now only be marked correct if it passes both the state and response checkers. A few dataset entries have been modified to align with these changes.
**HuanzhiMao** added a commit that referenced this pull request (Nov 11, 2024):
This PR continues #737, a 2-week initiative to re-scrutinize the V3 dataset for issues, with several objectives:

- Eliminate ground truth mismatches against user questions.
- Polish ambiguous prompts with unclear user intent to eliminate biased judgement and saturation.

Follow-up PRs will be rolled out on a daily basis, by category. Note: #733 is a prerequisite for this PR to merge.

Co-authored-by: Huanzhi (Hans) Mao <[email protected]>
**HuanzhiMao** added a commit that referenced this pull request (Nov 19, 2024).
**HuanzhiMao** added a commit that referenced this pull request (Dec 5, 2024):
A new multi-turn metric is introduced in #733. This PR updates the blog to make sure its content is up-to-date.

Co-authored-by: Fanjia Yan <[email protected]>
This PR contains three changes.

## Adjustments for Irrelevance Detection in Multi Turn Categories

In #725, we updated our evaluation metric for multi-turn to be the following:

- During evaluation, if `flag_task_unachievable` is called in a turn, that turn will be marked correct for irrelevance detection, even if other functions were also called in that turn.
- If the model calls `flag_task_unachievable` in a normal (non-irrelevant) turn, that turn will be marked incorrect.

This is an important change and can heavily affect the model's scores. However, we've identified that the model might be overly penalized under this setup. Many models tend to call this function frequently, even in base entries where it may not be appropriate. For instance, when a model encounters an execution error after an incorrect action, it might call `flag_task_unachievable`, assuming the task is unachievable. Without this flag, the model would sometimes continue to explore and arrive at the correct action sequence. So after careful consideration, `flag_task_unachievable` has been removed.

Due to the removal of `flag_task_unachievable`, we've adjusted the evaluation criteria accordingly. Instead of checking whether the model produces no output in a turn with missing function/parameter information, we now assess whether the model can perform correctly once the missing information is provided.

## Response Checker Addition
To improve evaluation accuracy, we've added a response checker for all multi-turn categories, which works alongside the state checker:

- **State Checker**: This checker evaluates the state of the instance after each turn, which we previously relied on exclusively. However, some functions (e.g., `get_zipcode_by_city` or `estimate_distance`) don't directly alter the state, making it unclear whether the model actually invoked them.
- **Response Checker**: The new checker compares the model's execution result against the ground truth execution result, ensuring that the model's result encompasses the ground truth (i.e., the ground truth must be a subset of the model's result).

With this addition, an entry will now only be marked correct if it passes both the state and response checkers.
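The two-checker rule can be sketched as follows. This is a minimal illustration of the logic described above, not BFCL's actual implementation: the `TurnResult` container, function names, and the exact-equality state comparison are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class TurnResult:
    """Hypothetical container for one turn's outcome (illustrative only)."""
    model_state: dict[str, Any]          # instance attributes after the model's turn
    ground_truth_state: dict[str, Any]   # instance attributes after the ground-truth turn
    model_responses: list[str]           # execution results of the model's function calls
    ground_truth_responses: list[str]    # execution results of the ground-truth calls


def state_check(turn: TurnResult) -> bool:
    # State checker: the instance state after the turn must match the
    # ground-truth state exactly.
    return turn.model_state == turn.ground_truth_state


def response_check(turn: TurnResult) -> bool:
    # Response checker: every ground-truth execution result must appear among
    # the model's results, i.e. the ground truth is a subset of the model output.
    return set(turn.ground_truth_responses).issubset(set(turn.model_responses))


def entry_is_correct(turns: list[TurnResult]) -> bool:
    # An entry is marked correct only if every turn passes BOTH checkers.
    return all(state_check(t) and response_check(t) for t in turns)
```

Note how the subset rule tolerates extra model calls (e.g. exploratory `get_zipcode_by_city` queries that don't alter state), while still requiring that every ground-truth result actually be produced.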
## Dataset Adjustments
A few dataset entries have been modified to align with these changes.