|
2 | 2 |
|
3 | 3 | All notable changes to the Berkeley Function Calling Leaderboard will be documented in this file. |
4 | 4 |
|
5 | | -- [Jul 8, 2025] [#1098](https://github.com/ShishirPatil/gorilla/pull/1098): |
| 5 | +- [Jul 17, 2025] [#1019](https://github.com/ShishirPatil/gorilla/pull/1019): BFCL V4 release: |
| 6 | + |
| 7 | + 1. **New agentic domain** |
| 8 | + - Introduces the agentic domain with two categories: Web Search and Memory Management. |
| 9 | + - For more information, please see our accompanying [blog posts](https://gorilla.cs.berkeley.edu/blog.html). |
| 10 | + 2. **Revised overall-accuracy formula** |
| 11 | + |
| 12 | + - As single-turn tasks approach saturation, weighting now favors complex, multi-step agentic tasks. |
| 13 | + |
| 14 | + | Segment | Old % | New % | |
| 15 | + | ----------- | ----: | -----: | |
| 16 | + | Live | 33 | **10** | |
| 17 | + | Non-Live | 33 | **10** | |
| 18 | + | Irrelevance | 0 | **10** | |
| 19 | + | Multi-Turn | 33 | **30** | |
| 20 | + | Agentic | 0 | **40** | |
| 21 | + |
| 22 | + 3. **Leaderboard / model cleanup** |
| 23 | + - Retires several deprecated models from the leaderboard. |
| 24 | + - Removes unused model handlers to improve maintainability. |
| 25 | + 4. **Address #602** |
| 26 | + - `Non-Live Acc` and `Live Acc` score calculation now excludes the Irrelevance/Relevance category scores. |
| 27 | + 5. **Resolve #1094.** |
| 28 | + 6. **Codebase refactor** |
| 29 | + - Reorganizes the response-generation pipeline and related modules for easier maintenance. |
| 30 | + - Simplifies the response-generation pipeline logic for locally hosted models. |
| 31 | + - Introduces `enums.py`. |
| 32 | + 7. **Test category rename** |
| 33 | + The following categories have been renamed to avoid confusion. This applies to both dataset file names and leaderboard website columns. |
| 34 | + - `simple` --> `simple_python` |
| 35 | + - `java` --> `simple_java` |
| 36 | + - `javascript` --> `simple_javascript` |
| 37 | + 8. **Directory layout overhaul** |
| 38 | + Results and scores now use a _two-level_ hierarchy: |
| 39 | + |
| 40 | + ```text |
| 41 | + result/<model>/<general_category>/<category>.json |
| 42 | + score/<model>/<general_category>/<category>.json |
| 43 | + ``` |
| 44 | + |
| 45 | + `general_category` ∈ { **non_live**, **live**, **multi_turn**, **agentic**, **format_sensitivity** } |
| 46 | + |
| 47 | + • For _agentic-memory_ tasks, an extra level distinguishes the memory backend: |
| 48 | + |
| 49 | + ```text |
| 50 | + result/<model>/agentic/<memory_backend>/<category>.json |
| 51 | + ``` |
| 52 | + |
| 53 | + Migrate existing outputs to this structure before upgrading; otherwise, the evaluation pipeline will fail to locate files. |
| 54 | + 9. **New model support** |
| 55 | + Adds support for the following models: |
| 56 | + - `claude-opus-4-1-20250805` |
| 57 | + - `gpt-5-2025-08-07` |
| 58 | + - `gpt-5-mini-2025-08-07` |
| 59 | + - `gpt-5-nano-2025-08-07` |
| 60 | + - `Qwen/Qwen3-30B-A3B-Instruct-2507` |
| 61 | + - `Qwen/Qwen3-235B-A22B-Instruct-2507` |
| 62 | + - `Qwen/Qwen3-4B-Instruct-2507` |
| 63 | + |
| 64 | +- [Jul 8, 2025] [#1098](https://github.com/ShishirPatil/gorilla/pull/1098): |
6 | 65 | - Re-introduce latency statistics for locally hosted models |
7 | 66 | - Update cost calculation to cover the entire dataset batch, instead of the average cost per 1k function calls |
8 | 67 | - [Jul 6, 2025] [#1100](https://github.com/ShishirPatil/gorilla/pull/1100): Add the following new models to the leaderboard: |
@@ -225,11 +284,11 @@ All notable changes to the Berkeley Function Calling Leaderboard will be documen |
225 | 284 | - [Nov 18, 2024] [#768](https://github.com/ShishirPatil/gorilla/pull/768), [#770](https://github.com/ShishirPatil/gorilla/pull/770): Resolve issues in Gemini models (FC mode) related to handling scenarios with no tools available and cases where the model output is empty. |
226 | 285 | - [Nov 17, 2024] [#767](https://github.com/ShishirPatil/gorilla/pull/767): Fix price and latency calculation. A merge conflict resulted in a duplicate line, which counted the input and output tokens for each entry multiple times. |
227 | 286 | - [Nov 15, 2024] [#762](https://github.com/ShishirPatil/gorilla/pull/762): Supply `data_multi_turn.csv` for multi-turn evaluation results |
228 | | -- [Nov 14, 2024] [#760](https://github.com/ShishirPatil/gorilla/pull/760), [#761](https://github.com/ShishirPatil/gorilla/pull/761): Upstream `google-cloud-aiplatform` library fixed typecasting bugs in Function Calling. Update to version `1.72.0` and remove the workaround patch introduced in [#648](https://github.com/ShishirPatil/gorilla/pull/648). |
| 287 | +- [Nov 14, 2024] [#760](https://github.com/ShishirPatil/gorilla/pull/760), [#761](https://github.com/ShishirPatil/gorilla/pull/761): Upstream `google-cloud-aiplatform` library fixed typecasting bugs in Function Calling. Update to version `1.72.0` and remove the workaround patch introduced in [#648](https://github.com/ShishirPatil/gorilla/pull/648). |
229 | 288 | - [Nov 14, 2024] [#747](https://github.com/ShishirPatil/gorilla/pull/747): Minor Grammatical Corrections to `DEFAULT_SYSTEM_PROMPT` that is supplied to all prompting models. |
230 | 289 | - [Nov 13, 2024] [#737](https://github.com/ShishirPatil/gorilla/pull/737), [#739](https://github.com/ShishirPatil/gorilla/pull/739), [#740](https://github.com/ShishirPatil/gorilla/pull/740), [#763](https://github.com/ShishirPatil/gorilla/pull/763), [#772](https://github.com/ShishirPatil/gorilla/pull/772), [#789](https://github.com/ShishirPatil/gorilla/pull/789), [#804](https://github.com/ShishirPatil/gorilla/pull/804): Bug fix in the dataset and possible answers for the live and multi-turn categories. |
231 | 290 | - [Nov 11, 2024] [#746](https://github.com/ShishirPatil/gorilla/pull/746): Improve inference log readability; inference log is now included as part of the model result file. For details on how to interpret the inference log, please refer to the [LOG_GUIDE.md](https://github.com/ShishirPatil/gorilla/blob/main/berkeley-function-call-leaderboard/LOG_GUIDE.md). |
232 | | -- [Nov 9, 2024] [#749](https://github.com/ShishirPatil/gorilla/pull/749): Remove `Llama-3.2-3B-Instruct-FC` and `Llama-3.2-1B-Instruct-FC` from the leaderboard. According to the [official Llama documentation](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2#-tool-calling-(1b/3b)-), these models perform function calling using the prompt-style chat template rather than the specialized function-calling format. |
| 291 | +- [Nov 9, 2024] [#749](https://github.com/ShishirPatil/gorilla/pull/749): Remove `Llama-3.2-3B-Instruct-FC` and `Llama-3.2-1B-Instruct-FC` from the leaderboard. According to the [official Llama documentation](<https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2#-tool-calling-(1b/3b)->), these models perform function calling using the prompt-style chat template rather than the specialized function-calling format. |
233 | 292 | - [Nov 8, 2024] [#720](https://github.com/ShishirPatil/gorilla/pull/720): Add new model `BitAgent/GoGoAgent` to the leaderboard. |
234 | 293 | - [Oct 30, 2024] [#725](https://github.com/ShishirPatil/gorilla/pull/725), [#733](https://github.com/ShishirPatil/gorilla/pull/733): Update evaluation metric for multi-turn categories: |
235 | 294 | - Introduce a new response-based checker, which works alongside the existing state-based checker. |
@@ -277,7 +336,7 @@ All notable changes to the Berkeley Function Calling Leaderboard will be documen |
277 | 336 | - `microsoft/Phi-3-small-8k-instruct` |
278 | 337 | - `microsoft/Phi-3-mini-128k-instruct` |
279 | 338 | - `microsoft/Phi-3-mini-4k-instruct` |
280 | | -- [Sept 25, 2024] [#660](https://github.com/ShishirPatil/gorilla/pull/660): Bug fix in `parse_nested_value` function to handle nested dictionary values properly. |
| 339 | +- [Sept 25, 2024] [#660](https://github.com/ShishirPatil/gorilla/pull/660): Bug fix in `parse_nested_value` function to handle nested dictionary values properly. |
281 | 340 | - [Sept 24, 2024] [#657](https://github.com/ShishirPatil/gorilla/pull/657): Add the following new models to the leaderboard: |
282 | 341 | - `meta-llama/Llama-3.2-1B-Instruct` |
283 | 342 | - `meta-llama/Llama-3.2-1B-Instruct-FC` |
|
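For readers checking how the revised weighting in the BFCL V4 entry shifts a score, here is a minimal sketch. It assumes the overall accuracy is a plain weighted mean of the five segment accuracies using the "New %" column; the changelog does not spell out the exact formula, so the function name `overall_accuracy`, the dictionary keys, and the sample accuracies are illustrative assumptions, not the leaderboard's actual code.

```python
# Segment weights (percent), copied from the "New %" column of the V4 entry.
NEW_WEIGHTS = {
    "live": 10,
    "non_live": 10,
    "irrelevance": 10,
    "multi_turn": 30,
    "agentic": 40,
}


def overall_accuracy(segment_acc, weights=NEW_WEIGHTS):
    """Weighted mean of per-segment accuracies, each given in [0, 1].

    Assumption: the overall score is a simple weighted mean. The real
    evaluation code may aggregate differently.
    """
    missing = set(weights) - set(segment_acc)
    if missing:
        raise KeyError(f"missing segment accuracies: {sorted(missing)}")
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[name] * segment_acc[name] for name in weights)
    return weighted_sum / total_weight


# Made-up accuracies for a hypothetical model, for illustration only.
example = {
    "live": 0.8,
    "non_live": 0.9,
    "irrelevance": 0.7,
    "multi_turn": 0.5,
    "agentic": 0.25,
}
print(round(overall_accuracy(example), 4))
```

With these sample numbers the agentic and multi-turn segments contribute 70% of the total weight, so a model that is strong only on single-turn tasks ends up with a low overall score under this sketch.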
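When migrating existing outputs to the two-level directory layout described in the V4 entry, it may help to build the target paths programmatically. The sketch below follows the `result/<model>/<general_category>/<category>.json` pattern and the optional `<memory_backend>` level from the changelog. The helper name `output_path`, the model name `my-model`, and the backend and category names in the agentic example are hypothetical placeholders; the changelog does not list the actual memory backend names.

```python
from pathlib import PurePosixPath

# General categories as listed in the V4 changelog entry.
GENERAL_CATEGORIES = {
    "non_live",
    "live",
    "multi_turn",
    "agentic",
    "format_sensitivity",
}


def output_path(model, general_category, category, memory_backend=None, root="result"):
    """Build <root>/<model>/<general_category>[/<memory_backend>]/<category>.json.

    Hypothetical helper: `root` is "result" or "score"; `memory_backend`
    only applies to agentic-memory tasks.
    """
    if general_category not in GENERAL_CATEGORIES:
        raise ValueError(f"unknown general category: {general_category!r}")
    if memory_backend is not None and general_category != "agentic":
        raise ValueError("memory_backend is only valid for the agentic category")
    parts = [root, model, general_category]
    if memory_backend is not None:
        parts.append(memory_backend)
    parts.append(f"{category}.json")
    return PurePosixPath(*parts)


# `simple_python` is a category name from the changelog; the rest are placeholders.
print(output_path("my-model", "non_live", "simple_python"))
print(output_path("my-model", "agentic", "example_category", memory_backend="example_backend"))
print(output_path("my-model", "live", "example_category", root="score"))
```

`PurePosixPath` is used so the paths print with forward slashes on every platform, matching the layout shown in the changelog; swap in `pathlib.Path` when touching the real filesystem.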