
Conversation


@HuanzhiMao HuanzhiMao commented Nov 2, 2024

This PR contains three changes.

Adjustments for Irrelevance Detection in Multi Turn Categories

In #725, we updated our multi-turn evaluation metric as follows:

  • During evaluation, if `flag_task_unachievable` is called in a turn, that turn will be marked correct for irrelevance detection, even if other functions were also called in that turn.
  • If the model calls `flag_task_unachievable` in a normal (non-irrelevant) turn, that turn will be marked incorrect.

This is an important change that can heavily affect model scores. However, we've identified that models might be overly penalized under this setup. Many models tend to call this function frequently, even in base entries where it is not appropriate. For instance, when a model encounters an execution error after an incorrect action, it might call `flag_task_unachievable`, assuming the task is unachievable. Without this flag, the model would sometimes continue to explore and arrive at the correct action sequence. So after careful consideration, the `flag_task_unachievable` function has been removed.

Due to the removal of `flag_task_unachievable`, we’ve adjusted the evaluation criteria accordingly. Instead of checking whether the model produces no output in a turn with missing function/parameter information, we now assess whether the model can perform correctly once the missing information is provided.
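As a rough sketch of the revised criterion (the names below, such as `evaluate_entry`, are illustrative, not the actual BFCL harness code): rather than requiring silence on the under-specified turn, the entry includes a follow-up user turn that supplies the missing information, and correctness is judged on the model's behavior once that information is available.

```python
# Hypothetical sketch; `evaluate_entry` and the toy model below are
# illustrative names, not the actual BFCL implementation.

def evaluate_entry(model, turns, graded_turn_index, is_correct):
    """Run every user turn in order; only the turn at
    `graded_turn_index`, which follows the turn supplying the missing
    info, decides correctness."""
    outputs = [model(turn) for turn in turns]
    return is_correct(outputs[graded_turn_index])

# Toy model: it can only act once the user names the file.
stub = lambda turn: ("cat notes.txt" if "notes.txt" in turn
                     else "(asks for clarification)")

ok = evaluate_entry(
    stub,
    ["Open my file.", "It's called notes.txt."],  # 2nd turn supplies the info
    graded_turn_index=1,
    is_correct=lambda out: out == "cat notes.txt",
)
```

Under the old criterion the model would have been graded on producing no output for the first turn; here the first turn is not graded on silence at all.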

Response Checker Addition

To improve evaluation accuracy, we’ve added a response checker for all multi-turn categories, which works alongside the state checker:

  • State Checker: This checker evaluates the state of the instance after each turn; we previously relied on it exclusively. However, some functions (e.g., `get_zipcode_by_city` or `estimate_distance`) don’t directly alter the state, making it unclear whether the model actually invoked them.
  • Response Checker: The new checker compares the model’s execution result against the ground truth execution result, ensuring that the model’s result encompasses the ground truth (i.e., the ground truth must be a subset of the model’s result).

With this addition, an entry will now only be marked correct if it passes both the state and response checkers.
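A minimal sketch of how the two checkers might combine (the function names and data shapes here are assumptions for illustration, not the actual BFCL code):

```python
# Illustrative sketch only: `check_state`, `check_response`, and
# `turn_is_correct` are hypothetical names, not the real implementation.

def check_state(model_instance_attrs: dict, gt_instance_attrs: dict) -> bool:
    # State checker: the API instance state after the turn must match
    # the state produced by the ground-truth action sequence.
    return model_instance_attrs == gt_instance_attrs

def check_response(model_results: list, gt_results: list) -> bool:
    # Response checker: every ground-truth execution result must appear
    # in the model's execution results (ground truth must be a subset
    # of the model's results).
    return all(gt in model_results for gt in gt_results)

def turn_is_correct(model_state, gt_state, model_results, gt_results) -> bool:
    # An entry is marked correct only if it passes BOTH checkers.
    return (check_state(model_state, gt_state)
            and check_response(model_results, gt_results))
```

For example, a call like `get_zipcode_by_city` leaves the instance state untouched, so only the response checker can tell whether the model actually made the call.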

Dataset Adjustments

A few dataset entries have been modified to align with these changes.

@HuanzhiMao HuanzhiMao added BFCL-General General BFCL Issue BFCL-Dataset BFCL Dataset-Related Issue labels Nov 2, 2024

@Fanjia-Yan Fanjia-Yan left a comment


LGTM

Aside from the `flag_task_unachievable` change, this PR introduces several fixes:

  • Remove the Long Context category's randomness and patch the ground truth accordingly
  • Fix a ground truth error where the missing function has an alternative path that achieves the same result

@HuanzhiMao HuanzhiMao changed the title [BFCL] Refine Eval Metric for Multi Turn Irrelevance Scenarios [BFCL] Refine Evaluation Metric for Multi Turn Categories Nov 5, 2024
@HuanzhiMao HuanzhiMao marked this pull request as ready for review November 5, 2024 01:06
@CharlieJCJ CharlieJCJ self-requested a review November 5, 2024 01:18

@CharlieJCJ CharlieJCJ left a comment


Suggested a variable name change and error log clarity improvements.

LGTM

@HuanzhiMao HuanzhiMao merged commit c9c1ff1 into ShishirPatil:main Nov 8, 2024
VishnuSuresh27 pushed a commit to VishnuSuresh27/gorilla that referenced this pull request Nov 11, 2024
…il#733)

HuanzhiMao added a commit that referenced this pull request Nov 11, 2024
This PR continues #737, a 2-week initiative to re-scrutinize the V3 dataset for issues, with several objectives:

- Eliminate ground truth mismatches against user questions.
- Polish ambiguous prompts with unclear user intent, to eliminate biased judgement and saturation.

Follow-up PRs will be rolled out on a daily basis, by category.

Note: #733 is a pre-requisite for this PR to merge.

---------

Co-authored-by: Huanzhi (Hans) Mao <[email protected]>
HuanzhiMao added a commit that referenced this pull request Nov 19, 2024
This PR updates the leaderboard to reflect the score changes due to the following PR merges:

1. #719
2. #722
3. #723
4. #728 
5. #732
6. #725
7. #712
8. #733
9. #720 
10. #760 
11. #761 
12. #767
HuanzhiMao added a commit that referenced this pull request Dec 5, 2024
A new multi-turn metric was introduced in #733. This PR updates the blog so its content stays up to date.

---------

Co-authored-by: Fanjia Yan <[email protected]>