
Conversation


@HuanzhiMao HuanzhiMao commented Nov 2, 2024

This PR contains three changes.

Adjustments for Irrelevance Detection in Multi Turn Categories

In #725, we updated our multi-turn evaluation metric as follows:

  • During evaluation, if `flag_task_unachievable` is called in a turn, that turn will be marked correct for irrelevance detection, even if other functions were also called in that turn.
  • If the model calls `flag_task_unachievable` in a normal (non-irrelevant) turn, that turn will be marked incorrect.

This is an important change that can heavily affect model scores. However, we've identified that models might be overly penalized under this setup. Many models tend to call this function frequently, even in base entries where it is not appropriate. For instance, when a model encounters an execution error after an incorrect action, it might call `flag_task_unachievable`, assuming the task is unachievable. Without this flag, the model would sometimes continue to explore and arrive at the correct action sequence. So after careful consideration, the `flag_task_unachievable` function has been removed.

Due to the removal of `flag_task_unachievable`, we’ve adjusted the evaluation criteria accordingly. Instead of checking whether the model produces no output in a turn with missing function/parameter information, we now assess whether the model can perform correctly once the missing information is provided.
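As a rough sketch of the revised criterion (the names below, such as `evaluate_entry`, are illustrative, not the actual BFCL harness code): rather than requiring silence on the under-specified turn, the entry includes a follow-up user turn that supplies the missing information, and correctness is judged on the model's behavior once that information is available.

```python
# Hypothetical sketch; `evaluate_entry` and the toy model below are
# illustrative names, not the actual BFCL implementation.

def evaluate_entry(model, turns, graded_turn_index, is_correct):
    """Run every user turn in order; only the turn at
    `graded_turn_index`, which follows the turn supplying the missing
    info, decides correctness."""
    outputs = [model(turn) for turn in turns]
    return is_correct(outputs[graded_turn_index])

# Toy model: it can only act once the user names the file.
stub = lambda turn: ("cat notes.txt" if "notes.txt" in turn
                     else "(asks for clarification)")

ok = evaluate_entry(
    stub,
    ["Open my file.", "It's called notes.txt."],  # 2nd turn supplies the info
    graded_turn_index=1,
    is_correct=lambda out: out == "cat notes.txt",
)
```

Under the old criterion the model would have been graded on producing no output for the first turn; here the first turn is not graded on silence at all.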

Response Checker Addition

To improve evaluation accuracy, we’ve added a response checker for all multi-turn categories, which works alongside the state checker:

  • State Checker: This checker evaluates the state of the instance after each turn; we previously relied on it exclusively. However, some functions (e.g., `get_zipcode_by_city` or `estimate_distance`) don’t directly alter the state, making it unclear whether the model actually invoked them.
  • Response Checker: The new checker compares the model’s execution result against the ground truth execution result, ensuring that the model’s result encompasses the ground truth (i.e., the ground truth must be a subset of the model’s result).

With this addition, an entry will now only be marked correct if it passes both the state and response checkers.
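A minimal sketch of how the two checkers might combine (the function names and data shapes here are assumptions for illustration, not the actual BFCL code):

```python
# Illustrative sketch only: `check_state`, `check_response`, and
# `turn_is_correct` are hypothetical names, not the real implementation.

def check_state(model_instance_attrs: dict, gt_instance_attrs: dict) -> bool:
    # State checker: the API instance state after the turn must match
    # the state produced by the ground-truth action sequence.
    return model_instance_attrs == gt_instance_attrs

def check_response(model_results: list, gt_results: list) -> bool:
    # Response checker: every ground-truth execution result must appear
    # in the model's execution results (ground truth must be a subset
    # of the model's results).
    return all(gt in model_results for gt in gt_results)

def turn_is_correct(model_state, gt_state, model_results, gt_results) -> bool:
    # An entry is marked correct only if it passes BOTH checkers.
    return (check_state(model_state, gt_state)
            and check_response(model_results, gt_results))
```

For example, a call like `get_zipcode_by_city` leaves the instance state untouched, so only the response checker can tell whether the model actually made the call.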

Dataset Adjustments

A few dataset entries have been modified to align with these changes.

@HuanzhiMao HuanzhiMao added BFCL-General General BFCL Issue BFCL-Dataset BFCL Dataset-Related Issue labels Nov 2, 2024

@Fanjia-Yan Fanjia-Yan left a comment


LGTM

Aside from the `flag_task_unachievable` change, this PR introduces several fixes:

  • Remove the Long Context category's randomness and patch the ground truth accordingly
  • Fix a ground truth error where the missing function has an alternative path that achieves the same result

@HuanzhiMao HuanzhiMao changed the title [BFCL] Refine Eval Metric for Multi Turn Irrelevance Scenarios [BFCL] Refine Evaluation Metric for Multi Turn Categories Nov 5, 2024
@HuanzhiMao HuanzhiMao marked this pull request as ready for review November 5, 2024 01:06
@CharlieJCJ CharlieJCJ self-requested a review November 5, 2024 01:18

@CharlieJCJ CharlieJCJ left a comment


Suggested a variable name change and error log clarity improvements.

LGTM

@HuanzhiMao HuanzhiMao merged commit c9c1ff1 into ShishirPatil:main Nov 8, 2024
VishnuSuresh27 pushed a commit to VishnuSuresh27/gorilla that referenced this pull request Nov 11, 2024
…il#733)

HuanzhiMao added a commit that referenced this pull request Nov 11, 2024
This PR continues #737, a 2-week initiative to re-scrutinize the V3 dataset for issues, with several objectives:

- Eliminate ground truth mismatches against user questions.
- Polish ambiguous prompts with unclear user intent, to eliminate biased judgement and saturation.

Follow-up PRs will be rolled out on a daily basis, by category.

Note: #733 is a pre-requisite for this PR to merge.

---------

Co-authored-by: Huanzhi (Hans) Mao <[email protected]>
HuanzhiMao added a commit that referenced this pull request Nov 19, 2024
This PR updates the leaderboard to reflect the score changes due to the following PR merges:

1. #719
2. #722
3. #723
4. #728 
5. #732
6. #725
7. #712
8. #733
9. #720 
10. #760 
11. #761 
12. #767
HuanzhiMao added a commit that referenced this pull request Dec 5, 2024
A new multi-turn metric was introduced in #733. This PR updates the blog so its content stays up to date.

---------

Co-authored-by: Fanjia Yan <[email protected]>