[BFCL] - Additional Dataset Fixes, Builds off Issues 1133 PR #1206
Merged: ShishirPatil merged 8 commits into ShishirPatil:main from HuanzhiMao:as/Gorilla1133Extension on Oct 20, 2025
HuanzhiMao added a commit that referenced this pull request on Oct 13, 2025
Fix #1133. We thank the community for raising these dataset issues.

**live_simple_137-90-0** and **live_simple_138-91-0**: Removed CEST and used the Europe/London time instead.
**parallel_84**: Changed the prompt to make the first calculate_displacement call clearer, and added an answer for the fourth function call in the ground truth.
**parallel_155**: Changed the ground truth to not expect it to be explicitly set.
**parallel_158**: Added 2 missing function calls in the ground truth.
**parallel_170**: Not fixed. Rationale: The question asks "What would my $5000 investment be worth after different time periods?", which is a common investment analysis scenario. The calculate_compound_interest function calculates the total compound interest for a given principal over a specified time period, not sequential compounding, so the question as phrased asks for separate calculations, not dependent sequential calculations.
**live_multiple_44-17-0**: Removed the last part of the prompt, which asked for a summary. This removes any confusion in the prompt and aligns it with the ground truth.
**multi_turn_base_8**: The Twitter post now includes the proper diff message.
**multi_turn_base_10**: Updated the ground truth for all relevant entries, changing “note.md” to “notes.md”.
**multi_turn_base_18**: Added an ‘@’ sign in front of Jerry in the ground truth for correctness.
**multi_turn_base_34**: Fixed in PR #1206.
**multi_turn_base_35**: Fixed. The echo() content in the ground truth now matches the diff() command’s output.
**multi_turn_base_78**: Changed the ground truth from ‘mentions’ to ‘tags’ to properly incorporate the hashtags for the respective entries.
**multi_turn_base_79**: Not fixed. Rationale: Python accepts integers for float parameters and promotes them in float arithmetic, so the existing integer values in the ground truth are valid and will be handled properly by the function.
**multi_turn_base_97**: This issue also persisted in the missing param, missing function, and context datasets, which have been updated to include the period in the message quotes in the prompt.
**multi_turn_base_104**: Switched the order of get watchlist to be correct.
**multi_turn_base_166**: This issue also persisted in the missing param, missing function, and context datasets, which have been updated to include ‘concerns’ in the ground truth instead of ‘concern’.
**multi_turn_base_169, 170, 194, 197, 199**: Fixed in a previous PR. The contact customer support function no longer returns the message parameter, so exact message matching is not required and the ground truth can remain as-is. Because the function output no longer includes the message content, the response checker won't fail on message variations; the specific phrasing in the ground truth is irrelevant for evaluation purposes while still providing meaningful parameter content.
**multi_turn_base_177**: Fixed in a previous PR. The prompt was changed so that the get_flight_cost() function is no longer needed, so no further changes are needed in the dataset.
**multi_turn_base_178**: purchase_insurance() is now executed in turn 0.
**multi_turn_base_180**: Updated the prompt to include the specific description message so that the LLM’s answer matches the ground truth for the entry.
**multi_turn_base_183**: Now includes tags, to be correct.
**multi_turn_miss_param_0**: Added another file to the initial_config contents so that it is no longer clear which file is being referred to. There is now a final_report.pdf and a report_final.pdf, so the clarification becomes necessary and the parameter stays missing until the clarification is read.

General Cases
**Echo for File Creation**: Echo cannot create a file that doesn’t already exist, so touch must still be called beforehand.
**Missing mv Function**: We will have a separate PR to fix this, as it involves a lot of changes. Plan: We will modify the respective entries to exclude cp() until mv is available, to prevent any alternate paths for obtaining mv() functionality.
**Get_symbol_by_name**: Used a made-up company stock and added the mapping for it in the get_symbol_by_name function, so the LLM should know to check this function to get the stock symbol.
**Chained CD Commands**: Our function docstring specifies that “You can only change one folder at a time.” Thus, the LLM should know not to chain cd commands.
**Different Way to Test Missing**: We will have a separate PR to fix this.

Ambiguous Prompts
**multi_turn_base_15**: Fixed in PR #1206.
**multi_turn_base_21**: Fixed by changing "ensure" to "verify".
**multi_turn_miss_func_93**: The dataset has already been changed to make the prompt clearer.
**multi_turn_base_41**: Given the context of the turns along with the API, it is clear that the prompt refers to the message, so I believe we can keep it as-is. It also adds some diversity to the way we phrase our prompts.
**irrelevance_7**: Fixed. Now gives a draw circle function instead of the relevant function.

Further Evaluation – Additional Potential Issues
**multi_turn_base_21**: This issue also persisted in the missing param, missing function, and context datasets, which have been updated to include the period in the ground truth.
**multi_turn_miss_param_38**: Added multiple directories so that calling find() won’t reveal which directory actually needs to be used. Each of the new directories also has a document with the same name, to ensure the missing param is actually missing.
**multi_turn_miss_param_95, multi_turn_miss_param_101**: Updated the initial_config so that the test cases start with the user logged in.
**multi_turn_miss_param_124**: Changed the prompt to make clearer what is being asked.
**multi_turn_miss_param_131**: Removed the adding of this supervisor, and instead hid the param of Alice as the advisor.
**multi_turn_miss_param_156**: Fixed. Hid the travel location parameter instead.
**multi_turn_base_187, multi_turn_miss_param_187**: Changed the prompt for clarity.
**multi_turn_miss_param_191**: Added a second credit card, leaving the model unsure which one to use until the parameter is no longer missing.
**multi_turn_miss_param_193**: Changed the entry to have the correct flight cost of 4700 dollars.

Additional Ambiguous Prompts
**multi_turn_miss_param_24**: Already fixed: there are now multiple files to choose from, so the model needs to wait for the user to specify which two files they want the diff from.

---------
Co-authored-by: Huanzhi Mao <[email protected]>
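
As a rough illustration of the multi_turn_base_79 rationale, the sketch below passes an integer and a float to the same float-annotated parameter. The `transfer` function and its `amount` parameter are hypothetical stand-ins, not BFCL code; the point is only that Python does not enforce type hints at runtime and that an int is promoted in float arithmetic, so both calls should produce the same result.

```python
def transfer(amount: float) -> float:
    """Hypothetical tool function that declares a float parameter."""
    # Type hints are not enforced at runtime: an int is accepted as-is
    # and promoted to float when mixed with a float in arithmetic.
    return amount * 1.5


int_result = transfer(100)      # integer argument, as in the ground truth
float_result = transfer(100.0)  # equivalent float argument

# Both calls yield the same float value.
assert int_result == float_result
assert isinstance(int_result, float)
```

A checker that compares values with `==` would also treat `100` and `100.0` as equal; a checker that compares types strictly would not, so this only supports the rationale under the first assumption.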
ShishirPatil approved these changes on Oct 20, 2025
Hi @chughtapan,
HuanzhiMao added a commit that referenced this pull request on Oct 22, 2025
PR #1206 was merged before review was complete, which introduced some incorrect dataset fixes. This PR reverts the problematic changes and applies the correct updates.

Summary of changes:
multi_turn_base_21: Addressed in #1160. Reverting changes made in #1206.
multi_turn_base_23: Addressed in #1160. Reverting changes made in #1206. echo can’t create a new file (need touch).
multi_turn_base_93: Updated user query to make the instruction clearer.
multi_turn_base_104: Addressed in #1160. Reverting changes made in #1206.
multi_turn_base_117: Removed make_transaction, and replaced it with withdraw_funds.
multi_turn_base_162: Updated the implementation of `compute_exchange_rate` to always return a float.
multi_turn_base_178: Updated the query to include the booking_id information.
multi_turn_base_187: Updated the query to include the booking_id information.
multi_turn_base_194: Updated the travel_booking backend to remove token expiration decrement on invalid access token.
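
The multi_turn_base_194 change can be pictured with a minimal sketch. The `ToyBookingBackend` class, its `book` method, and the error strings are hypothetical and are not the actual travel_booking backend; the sketch only assumes that a token has a limited number of uses and shows a rejected call leaving that counter untouched.

```python
class ToyBookingBackend:
    """Hypothetical stand-in for a booking backend; not the BFCL implementation."""

    def __init__(self, access_token: str, token_expires_in: int) -> None:
        self.access_token = access_token
        self.token_expires_in = token_expires_in  # remaining authenticated calls

    def book(self, access_token: str) -> dict:
        # Reject a wrong token first, without touching the expiration counter.
        if access_token != self.access_token:
            return {"error": "Invalid access token"}
        if self.token_expires_in <= 0:
            return {"error": "Token expired"}
        # Only a call with a valid token consumes one unit of token lifetime.
        self.token_expires_in -= 1
        return {"status": "booked"}


backend = ToyBookingBackend(access_token="abc", token_expires_in=2)
rejected = backend.book("wrong-token")
remaining_after_reject = backend.token_expires_in  # still 2
accepted = backend.book("abc")
remaining_after_accept = backend.token_expires_in  # now 1
```

With this ordering, a model that first tries a wrong token is not penalized with an earlier expiry on its later valid calls.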
multi_turn_base_11: Uses find rather than ls to find all the files. The prompt could specify not to use find, or to use the most basic function. A separate PR is incoming soon.
multi_turn_base_15: Changed "last entries" to "last entry".
multi_turn_base_21: Addressed in #1160.
multi_turn_base_23: Addressed in #1160. echo doesn’t have the ability to create a new file (need touch).
multi_turn_base_24: Added the words "and then" to remove confusion and make the ordering match the ground truth.
multi_turn_base_33: Removed the part about jotting them down to avoid any confusion and keep the ground truth correct.
multi_turn_base_34: Added the word "whole" to avoid confusion about whether to use the last line, keeping the ground truth correct.
multi_turn_base_39: Changed the prompt to start with "Enter the project and populate…" so that the model knows cd is required, keeping everything consistent with the possible answers.
multi_turn_base_93: Updated user query to make the instruction clearer.
multi_turn_base_104: Addressed in #1160.
multi_turn_base_117: Removed make_transaction, and replaced it with withdraw_funds.
multi_turn_base_162: Updated the implementation of `compute_exchange_rate` to always return a float.
multi_turn_base_178: Updated the query to include the booking_id information.
multi_turn_base_185: Added business CLASS journey to remove any confusion that might have led a model to check the pricing for economy class.
multi_turn_base_187: Updated the query to include the booking_id information.
multi_turn_base_194: Updated the travel_booking backend to remove token expiration decrement on invalid access token.
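
The echo and touch constraint noted for multi_turn_base_23 can be sketched with a toy in-memory file system. The `ToyFileSystem` class, its return values, and the error text are hypothetical and are not the BFCL file system backend; the sketch assumes only what the notes above state, namely that echo cannot create a file and touch must be called first.

```python
class ToyFileSystem:
    """Hypothetical in-memory file system used only for illustration."""

    def __init__(self) -> None:
        self.files = {}  # maps file name -> content

    def touch(self, file_name: str) -> None:
        # Create an empty file if it does not exist yet.
        self.files.setdefault(file_name, "")

    def echo(self, content: str, file_name: str) -> dict:
        # Assumed behavior: echo never creates a file on its own.
        if file_name not in self.files:
            return {"error": f"echo: {file_name}: No such file"}
        self.files[file_name] = content
        return {"status": "written"}


fs = ToyFileSystem()
first_try = fs.echo("hello", "notes.md")   # fails: the file was never created
fs.touch("notes.md")
second_try = fs.echo("hello", "notes.md")  # succeeds once touch has run
```

Under this assumption, a ground truth that writes to a new file needs a touch call before the echo call, which is why the two steps appear together in the affected entries.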