@amitojsingh2022
Contributor

@amitojsingh2022 amitojsingh2022 commented Oct 4, 2025

multi_turn_base_11: uses find, not ls, to locate all the files. The prompt could specify not to use find, or to use the most basic function; a separate PR is incoming soon.

multi_turn_base_15: changed "last entries" to "last entry"

multi_turn_base_21: Addressed in #1160

multi_turn_base_23: Addressed in #1160. echo cannot create a new file (touch is needed).

multi_turn_base_24: added the words "and then" to remove confusion and make the ordering match the ground truth

multi_turn_base_33: removed the part about jotting them down to avoid any confusion and keep the ground truth correct

multi_turn_base_34: added the word "whole" to avoid confusion about whether to use only the last line, keeping the ground truth correct

multi_turn_base_39: changed the start to "Enter the project and populate…" so that the model knows cd is required, keeping everything consistent with the possible answers

multi_turn_base_93: Updated user query to make the instruction clearer.

multi_turn_base_104: Addressed in #1160

multi_turn_base_117: Removed make_transaction, and replaced it with withdraw_funds.

multi_turn_base_162: Updated the implementation of compute_exchange_rate to always return a float.

multi_turn_base_178: Updated the query to include the booking_id information.

multi_turn_base_185: added "business class journey" to remove any confusion that might have led a model to check the pricing for economy class

multi_turn_base_187: Updated the query to include the booking_id information.

multi_turn_base_194: Updated the travel_booking backend to remove token expiration decrement on invalid access token

HuanzhiMao added a commit that referenced this pull request Oct 13, 2025
Fix #1133.
We thank the community for raising these dataset issues. 

**live_simple_137-90-0** and **live_simple_138-91-0**: removed CEST and
used Europe/London time instead

**parallel_84**: Changed prompt to make the first calculate_displacement
call clearer and added an answer for the fourth function call in the
ground truth

**parallel_155**: Changed ground truth to not expect it to be explicitly
set.

**parallel_158**: Added 2 missing function calls in ground truth

**parallel_170**: Not Fixed. Rationale: The question is asking "What
would my $5000 investment be worth after different time periods?", which
is a common investment analysis scenario. The
calculate_compound_interest function calculates the total compound
interest for a given principal over a specified time period, not
sequential compounding. Our phrasing of the question therefore asks for
separate calculations, not dependent sequential calculations.
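
As a rough illustration of the two readings, here is a minimal sketch. The `compound_interest` helper, the 5% rate, and the 5/10/15-year periods are assumptions made for this example; they are not the benchmark's actual function signature or values.

```python
def compound_interest(principal: float, annual_rate: float, years: int,
                      compounds_per_year: int = 1) -> float:
    """Total interest earned on `principal` after `years` (hypothetical helper)."""
    growth = (1 + annual_rate / compounds_per_year) ** (compounds_per_year * years)
    return principal * growth - principal


principal = 5000.0
rate = 0.05            # assumed rate, for illustration only
periods = [5, 10, 15]  # assumed periods, for illustration only

# Reading used by the prompt: every period starts from the original principal.
independent = [compound_interest(principal, rate, t) for t in periods]

# Sequential reading: each period starts from the previous period's balance.
sequential = []
balance = principal
for t in periods:
    interest = compound_interest(balance, rate, t)
    sequential.append(interest)
    balance += interest
```

Only the first period agrees between the two lists; from the second period on, the sequential reading yields larger values because it compounds on an already grown balance.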

**live_multiple_44-17-0**: Removed the last part of the prompt, which
asked for a summary. This removes any confusion in the prompt and aligns
it with the ground truth.

**multi_turn_base_8**: The Twitter post now includes the proper diff
message.

**multi_turn_base_10**: Updated the ground truth for all relevant
entries changing “note.md” into “notes.md”.

**multi_turn_base_18**: Added an ‘@’ sign in front of Jerry in the
ground truth for correctness.

**multi_turn_base_34**: Fixed in PR #1206 

**multi_turn_base_35**: Fixed. Ground truth properly has the echo()
content match the diff() command’s output

**multi_turn_base_78**: Changed ground truth from ‘mentions’ to ‘tags’
to properly incorporate the hashtags for the respective entries.

**multi_turn_base_79**: Not Fixed. Rationale: Python accepts integers
wherever a float parameter is expected (annotations are not enforced at
runtime, and an int compares equal to and mixes with the matching
float), so the existing integer values in the ground truth are valid and
will be properly handled by the function.
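
A minimal sketch of why integer values are harmless here. `set_budget_limit` is a hypothetical tool invented for this example, not one of the benchmark's functions. Strictly speaking, Python does not convert the argument: the annotation is not enforced, the int stays an int, and it simply compares equal to the corresponding float and promotes to float in mixed arithmetic.

```python
def set_budget_limit(budget_limit: float) -> dict:
    """Hypothetical tool with a float-annotated parameter."""
    return {"budget_limit": budget_limit, "with_tax": budget_limit * 1.1}


from_int = set_budget_limit(2000)      # integer, as in the ground truth
from_float = set_budget_limit(2000.0)  # float, as a model might emit

# int and float compare equal by value, so the resulting states match.
same_state = from_int == from_float
# The stored value keeps its original type; nothing was converted.
stored_type = type(from_int["budget_limit"]).__name__
```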

**multi_turn_base_97**: This issue was also present in the missing
param, missing function, and context datasets, which have been updated
to include the period in the message quotes in the prompt.

**multi_turn_base_104**: Switched order of get watchlist to be correct.

**multi_turn_base_166**: This issue was also present in the missing
param, missing function, and context datasets, which have been updated
to include ‘concerns’ in the ground truth instead of ‘concern’.

**multi_turn_base_169, 170, 194, 197, 199**: Fixed in a previous PR -
the contact customer support function no longer returns the message
parameter, so exact message matching is not required and the ground
truth can remain as-is. Since the function output doesn't include the
message content anymore, the response checker won't fail on message
variations, making the specific phrasing in the ground truth irrelevant
for evaluation purposes while still providing meaningful parameter
content.

**multi_turn_base_177**: Fixed in a previous PR. The prompt was changed
so that the get_flight_cost() function is no longer needed, so no
further changes are needed in the dataset.

**multi_turn_base_178**: purchase_insurance() is executed in turn 0 now.

**multi_turn_base_180**: Updated the prompt to include the specific
description message so that the LLM’s answer matches the ground truth
for the entry.

**multi_turn_base_183**: Includes tags now to be correct.

**multi_turn_miss_param_0**: Added another file in the initial_config
contents so that it is no longer clear which file is being referred to.
There is now both a final_report.pdf and a report_final.pdf, so the
clarification becomes necessary and the parameter stays missing until
the clarification is read.

General Cases

**Echo for File Creation**: echo cannot create a file that doesn’t
already exist, so touch must still be called beforehand.

**Missing mv Function**: We will have a separate PR to fix this, as this
involves a lot of changes. Plan: We will modify the respective entries
to exclude cp() until mv is available, to prevent any alternate paths
for obtaining mv() functionality.

**Get_symbol_by_name**: Used a made-up company stock and added the
mapping for it in the get_symbol_by_name function, meaning that the LLM
should know to check this function to get the stock symbol.

**Chained CD Commands**: Our function doc string specifies that “You can
only change one folder at a time.” Thus, the LLM should know not to do
chained cd commands.
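
A minimal sketch of what the one-folder-at-a-time rule implies for a model's calls. `TinyFileSystem` and its return shape are hypothetical stand-ins written for this example; they are not the benchmark's file system implementation.

```python
class TinyFileSystem:
    """Hypothetical stand-in for a file system tool with a single-step cd."""

    def __init__(self, tree: dict):
        self.root = tree
        self.path = []  # folder names from the root to the current directory

    def _current(self) -> dict:
        node = self.root
        for name in self.path:
            node = node[name]
        return node

    def cd(self, folder: str) -> dict:
        # Reject chained paths such as "project/src".
        if "/" in folder:
            return {"error": "You can only change one folder at a time."}
        if folder == "..":
            if self.path:
                self.path.pop()
            return {"current_working_directory": "/".join(self.path) or "/"}
        if not isinstance(self._current().get(folder), dict):
            return {"error": f"No such directory: {folder}"}
        self.path.append(folder)
        return {"current_working_directory": "/".join(self.path)}


fs = TinyFileSystem({"project": {"src": {}}})
chained = fs.cd("project/src")  # rejected: two folders in one call
step_one = fs.cd("project")     # accepted
step_two = fs.cd("src")         # accepted
```

Under this sketch, a model has to issue one cd call per folder, which is the behavior the doc string asks for.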

**Different way to Test Missing**: We will have a separate PR to fix this.

Ambiguous Prompts

**multi_turn_base_15**: Fixed in PR #1206 

**multi_turn_base_21**: Fixed by changing "ensure" to "verify"

**multi_turn_miss_func_93**: The dataset has already been changed to
make the prompt more clear!

**multi_turn_base_41**: Given the context of the turns along with the
API, it is clear that the prompt is referring to the message, so I
believe we can keep it as is. It also adds some diversity to the way
that we phrase our prompts, which keeps things fresh!

**irrelevance_7**: Fixed. The entry now gives a draw circle function
instead of the relevant function.

Further Evaluation – Additional Potential Issues

**multi_turn_base_21**: This issue was also present in the missing
param, missing function, and context datasets, which have been updated
to include the period in the ground truth.

**multi_turn_miss_param_38**: Added multiple directories so that
calling find() won’t reveal which directory actually needs to be used.
Each of the new directories also has a document with the same name, to
ensure the missing param is actually missing.

**multi_turn_miss_param_95, multi_turn_miss_param_101**: Updated the
initial_config to have the test cases start with the user logged in.

**multi_turn_miss_param_124**: Changed the prompt to make clearer what
is being asked.

**multi_turn_miss_param_131**: Removed the step that adds this
supervisor, and instead hid the param of Alice as the advisor.

**multi_turn_miss_param_156**: Fixed. Hid the travel location parameter
instead.

**multi_turn_base_187, multi_turn_miss_param_187**: Changed prompt for
clarity

**multi_turn_miss_param_191**: Added a second credit card, so the model
cannot know which one to use until the missing parameter is provided.

**multi_turn_miss_param_193**: Changed the entry to have the correct
flight cost of 4700 dollars.

Additional Ambiguous Prompts

**multi_turn_miss_param_24**: Already fixed since there are multiple
files now to choose from so the model needs to wait for the user to
specify which two files they want the diff from.

---------

Co-authored-by: Huanzhi Mao <[email protected]>
@chughtapan
Contributor

chughtapan commented Oct 13, 2025

@amitojsingh2022

  1. multi_turn_base_178 and multi_turn_base_187 -- I don't understand your comment here: "booking_record is in the initial config and is provided, the model should have enough information to get the correct ground truth for this prompt." The initial config is NOT provided in the system instructions or the user prompt, and there is no tool that allows retrieving bookings.
  2. multi_turn_base_194 -- This deviates from the docs - https://gorilla.cs.berkeley.edu/blogs/13_bfcl_v3_multi_turn.html - "taking extra steps should not be penalized", "model result is considered correct if it contains the ground truth as a subset,
    even if it contains additional function calls or takes a different trajectory"
  3. multi_turn_117 - get_account_info can provide the account id so this should be valid
  4. multi_turn_base_104 -- same point as 194 above -- extra steps should not be penalized
  5. multi_turn_base_21 -- "ensure" both are identical is confusing - consider changing the word to something else e.g., "check if"
  6. multi_turn_base_93 -- "I would like to increase the amount of fuel in my car to completely full, but first I need to ascertain the present level to determine the appropriate amount to add" -- I think that instruction is unnecessarily complex and ambiguous, and it's hard to understand that you only want to check fuel. Consider rewriting it to say something like "I would like to check how much fuel can be added".
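
To make the subset rule quoted in point 2 concrete, here is a minimal sketch of the idea. The `contains_ground_truth` helper and the call strings are hypothetical, and the benchmark's real checker may work differently (for example by comparing backend state instead of call strings).

```python
from collections import Counter


def contains_ground_truth(model_calls: list[str], ground_truth: list[str]) -> bool:
    """True if every ground-truth call appears among the model's calls."""
    # Counter subtraction keeps only the calls that are still missing.
    missing = Counter(ground_truth) - Counter(model_calls)
    return not missing


# Hypothetical call strings, for illustration only.
ground_truth = ["get_booking(id='B1')", "cancel_booking(id='B1')"]
with_extra_step = ["list_airports()", "get_booking(id='B1')", "cancel_booking(id='B1')"]
missing_a_step = ["get_booking(id='B1')"]

extra_step_passes = contains_ground_truth(with_extra_step, ground_truth)
missing_step_passes = contains_ground_truth(missing_a_step, ground_truth)
```

Under this reading, an extra call such as a failed attempt or an additional lookup should not turn a correct trajectory into a failure.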

@HuanzhiMao HuanzhiMao added the BFCL-Dataset BFCL Dataset-Related Issue label Oct 18, 2025
@ShishirPatil ShishirPatil merged commit 659c716 into ShishirPatil:main Oct 20, 2025
@HuanzhiMao
Collaborator

Hi @chughtapan ,
Thank you for bringing this up. Some of the entries have been addressed back in #1160; the description for this PR is outdated, which results in some confusion. The rest of the entries have been fixed in #1224.

HuanzhiMao added a commit that referenced this pull request Oct 22, 2025
PR #1206 was merged before review was complete, which introduced some
incorrect dataset fixes. This PR reverts the problematic changes and
applies the correct updates.

Summary of changes:

multi_turn_base_21: Addressed in #1160. Reverting changes made in #1206.

multi_turn_base_23: Addressed in #1160. Reverting changes made in #1206.
echo can’t create a new file (touch is needed).

multi_turn_base_93: Updated user query to make the instruction clearer.

multi_turn_base_104: Addressed in #1160. Reverting changes made in
#1206.

multi_turn_base_117: Removed make_transaction, and replaced it with
withdraw_funds.

multi_turn_base_162: Updated the implementation of
`compute_exchange_rate` to always return a float.

multi_turn_base_178: Updated the query to include the booking_id
information.

multi_turn_base_187: Updated the query to include the booking_id
information.

multi_turn_base_194: Updated the travel_booking backend to remove token
expiration decrement on invalid access token.