Commit 58f57e9

BFCL V4 Release (#1019)
> ❗️ **Important**: This PR introduces breaking changes and is **NOT** backward-compatible.

# BFCL V4 💥

At ICML 2025, we’re delighted to introduce BFCL V4 Agentic, a new benchmark focused on tool-calling in real-world agentic settings, including:

- 🔍 Web search with multi-hop reasoning and error recovery
- 🧠 Evaluating tool-calling for memory
- ⚠️ Evaluating format sensitivity

## Change Log

1. **New agentic domain**
   - Introduces the agentic domain with two categories: Web Search and Memory Management.
   - For more information, please see our accompanying [blog posts](https://gorilla.cs.berkeley.edu/blog.html).
2. **Revised overall-accuracy formula**
   - As single-turn tasks approach saturation, weighting now favors complex, multi-step agentic tasks.

   | Segment     | Old % |  New % |
   | ----------- | ----: | -----: |
   | Live        |    33 | **10** |
   | Non-Live    |    33 | **10** |
   | Irrelevance |     0 | **10** |
   | Multi-Turn  |    33 | **30** |
   | Agentic     |     0 | **40** |

3. **Leaderboard / model cleanup**
   - Retires several deprecated models from the leaderboard.
   - Removes unused model handlers to improve maintainability.
4. **Address #602**
   - `Non-Live Acc` and `Live Acc` score calculations now exclude the Irrelevance/Relevance category scores.
5. **Resolve #1094.**
6. **Codebase refactor**
   - Reorganizes the response-generation pipeline and related modules for easier maintenance.
   - Simplifies the response-generation pipeline logic for locally hosted models.
   - Introduces `enums.py`.
7. **Test category rename**
   The following categories have been renamed to avoid confusion. This applies to both dataset file names and leaderboard website columns.
   - `simple` --> `simple_python`
   - `java` --> `simple_java`
   - `javascript` --> `simple_javascript`
8. **Directory layout overhaul**
   Results and scores now use a _two-level_ hierarchy:

   ```text
   result/<model>/<general_category>/<category>.json
   score/<model>/<general_category>/<category>.json
   ```

   `general_category` ∈ { **non_live**, **live**, **multi_turn**, **agentic**, **format_sensitivity** }

   For _agentic-memory_ tasks, an extra level distinguishes the memory backend:

   ```text
   result/<model>/agentic/<memory_backend>/<category>.json
   ```

   Migrate existing outputs to this structure before upgrading; otherwise the evaluation pipeline will fail to locate files.
9. **New model support**
   Adds support for the following models:
   - `claude-opus-4-1-20250805`
   - `gpt-5-2025-08-07`
   - `gpt-5-mini-2025-08-07`
   - `gpt-5-nano-2025-08-07`
   - `Qwen/Qwen3-30B-A3B-Instruct-2507`
   - `Qwen/Qwen3-235B-A22B-Instruct-2507`
   - `Qwen/Qwen3-4B-Instruct-2507`
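The revised overall-accuracy formula above is a straight weighted average over per-segment accuracies. A minimal sketch of how it could be computed, assuming the weights from the table; the function name and the example accuracy values are illustrative, not BFCL's actual implementation:

```python
# Sketch of the revised V4 overall-accuracy weighting described above.
# The weights come from the table in this PR; the per-segment accuracies
# in `example` below are made-up numbers for illustration.

V4_WEIGHTS = {
    "live": 0.10,
    "non_live": 0.10,
    "irrelevance": 0.10,
    "multi_turn": 0.30,
    "agentic": 0.40,
}

def overall_accuracy(segment_acc: dict) -> float:
    """Weighted average of per-segment accuracies using the V4 weights."""
    if set(segment_acc) != set(V4_WEIGHTS):
        raise ValueError(f"expected segments {sorted(V4_WEIGHTS)}")
    return sum(V4_WEIGHTS[name] * acc for name, acc in segment_acc.items())

example = {
    "live": 0.80,
    "non_live": 0.90,
    "irrelevance": 0.70,
    "multi_turn": 0.60,
    "agentic": 0.40,
}
print(round(overall_accuracy(example), 3))  # → 0.58
```

Note how the 40% agentic weight dominates: a model that saturates the single-turn segments but fails agentic tasks can no longer score well overall.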
1 parent cd9429c commit 58f57e9

File tree: 150 files changed (+6808, −4000 lines)


berkeley-function-call-leaderboard/CHANGELOG.md

Lines changed: 63 additions & 4 deletions
@@ -2,7 +2,66 @@
 All notable changes to the Berkeley Function Calling Leaderboard will be documented in this file.

-- [Jul 8, 2025] [#1098](https://github.com/ShishirPatil/gorilla/pull/1098):
+- [Jul 17, 2025] [#1019](https://github.com/ShishirPatil/gorilla/pull/1019): BFCL V4 release:
+  1. **New agentic domain**
+     - Introduces the agentic domain with two categories: Web Search and Memory Management.
+     - For more information, please see our accompanying [blog posts](https://gorilla.cs.berkeley.edu/blog.html).
+  2. **Revised overall-accuracy formula**
+     - As single-turn tasks approach saturation, weighting now favors complex, multi-step agentic tasks.
+
+     | Segment     | Old % |  New % |
+     | ----------- | ----: | -----: |
+     | Live        |    33 | **10** |
+     | Non-Live    |    33 | **10** |
+     | Irrelevance |     0 | **10** |
+     | Multi-Turn  |    33 | **30** |
+     | Agentic     |     0 | **40** |
+
+  3. **Leaderboard / model cleanup**
+     - Retires several deprecated models from the leaderboard.
+     - Removes unused model handlers to improve maintainability.
+  4. **Address #602**
+     - `Non-Live Acc` and `Live Acc` score calculations now exclude the Irrelevance/Relevance category scores.
+  5. **Resolve #1094.**
+  6. **Codebase refactor**
+     - Reorganizes the response-generation pipeline and related modules for easier maintenance.
+     - Simplifies the response-generation pipeline logic for locally hosted models.
+     - Introduces `enums.py`.
+  7. **Test category rename**
+     The following categories have been renamed to avoid confusion. This applies to both dataset file names and leaderboard website columns.
+     - `simple` --> `simple_python`
+     - `java` --> `simple_java`
+     - `javascript` --> `simple_javascript`
+  8. **Directory layout overhaul**
+     Results and scores now use a _two-level_ hierarchy:
+
+     ```text
+     result/<model>/<general_category>/<category>.json
+     score/<model>/<general_category>/<category>.json
+     ```
+
+     `general_category` ∈ { **non_live**, **live**, **multi_turn**, **agentic**, **format_sensitivity** }
+
+     For _agentic-memory_ tasks, an extra level distinguishes the memory backend:
+
+     ```text
+     result/<model>/agentic/<memory_backend>/<category>.json
+     ```
+
+     Migrate existing outputs to this structure before upgrading; otherwise the evaluation pipeline will fail to locate files.
+  9. **New model support**
+     Adds support for the following models:
+     - `claude-opus-4-1-20250805`
+     - `gpt-5-2025-08-07`
+     - `gpt-5-mini-2025-08-07`
+     - `gpt-5-nano-2025-08-07`
+     - `Qwen/Qwen3-30B-A3B-Instruct-2507`
+     - `Qwen/Qwen3-235B-A22B-Instruct-2507`
+     - `Qwen/Qwen3-4B-Instruct-2507`
+
+- [Jul 8, 2025] [#1098](https://github.com/ShishirPatil/gorilla/pull/1098):
   - Re-introduce latency statistics for locally hosted models
   - Update cost calculation to cover the entire dataset batch, instead of the average cost per 1k function calls
 - [Jul 6, 2025] [#1100](https://github.com/ShishirPatil/gorilla/pull/1100): Add the following new models to the leaderboard:
@@ -225,11 +284,11 @@ All notable changes to the Berkeley Function Calling Leaderboard will be documented
 - [Nov 18, 2024] [#768](https://github.com/ShishirPatil/gorilla/pull/768), [#770](https://github.com/ShishirPatil/gorilla/pull/770): Resolve issues in Gemini models (FC mode) related to handling scenarios with no tools available and cases where the model output is empty.
 - [Nov 17, 2024] [#767](https://github.com/ShishirPatil/gorilla/pull/767): Fix price and latency calculation. A merge conflict resulted in a duplicate line, causing the input and output tokens for each entry to be counted multiple times.
 - [Nov 15, 2024] [#762](https://github.com/ShishirPatil/gorilla/pull/762): Supply `data_multi_turn.csv` for multi-turn evaluation results
 - [Nov 14, 2024] [#760](https://github.com/ShishirPatil/gorilla/pull/760), [#761](https://github.com/ShishirPatil/gorilla/pull/761): Upstream `google-cloud-aiplatform` library fixed typecasting bugs in Function Calling. Updated to version `1.72.0` and removed the workaround patch introduced in [#648](https://github.com/ShishirPatil/gorilla/pull/648).
 - [Nov 14, 2024] [#747](https://github.com/ShishirPatil/gorilla/pull/747): Minor grammatical corrections to the `DEFAULT_SYSTEM_PROMPT` that is supplied to all prompting models.
 - [Nov 13, 2024] [#737](https://github.com/ShishirPatil/gorilla/pull/737), [#739](https://github.com/ShishirPatil/gorilla/pull/739), [#740](https://github.com/ShishirPatil/gorilla/pull/740), [#763](https://github.com/ShishirPatil/gorilla/pull/763), [#772](https://github.com/ShishirPatil/gorilla/pull/772), [#789](https://github.com/ShishirPatil/gorilla/pull/789), [#804](https://github.com/ShishirPatil/gorilla/pull/804): Bug fix in the dataset and possible answers for the live and multi-turn categories.
 - [Nov 11, 2024] [#746](https://github.com/ShishirPatil/gorilla/pull/746): Improve inference log readability; the inference log is now included as part of the model result file. For details on how to interpret the inference log, please refer to the [LOG_GUIDE.md](https://github.com/ShishirPatil/gorilla/blob/main/berkeley-function-call-leaderboard/LOG_GUIDE.md).
-- [Nov 9, 2024] [#749](https://github.com/ShishirPatil/gorilla/pull/749): Remove `Llama-3.2-3B-Instruct-FC` and `Llama-3.2-1B-Instruct-FC` from the leaderboard. According to the [official Llama documentation](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2#-tool-calling-(1b/3b)-), these models perform function calling using the prompt-style chat template rather than the specialized function-calling format.
+- [Nov 9, 2024] [#749](https://github.com/ShishirPatil/gorilla/pull/749): Remove `Llama-3.2-3B-Instruct-FC` and `Llama-3.2-1B-Instruct-FC` from the leaderboard. According to the [official Llama documentation](<https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2#-tool-calling-(1b/3b)->), these models perform function calling using the prompt-style chat template rather than the specialized function-calling format.
 - [Nov 8, 2024] [#720](https://github.com/ShishirPatil/gorilla/pull/720): Add new model `BitAgent/GoGoAgent` to the leaderboard.
 - [Oct 30, 2024] [#725](https://github.com/ShishirPatil/gorilla/pull/725), [#733](https://github.com/ShishirPatil/gorilla/pull/733): Update evaluation metric for multi-turn categories:
   - Introduce a new response-based checker, which works alongside the existing state-based checker.
@@ -277,7 +336,7 @@ All notable changes to the Berkeley Function Calling Leaderboard will be documented
   - `microsoft/Phi-3-small-8k-instruct`
   - `microsoft/Phi-3-mini-128k-instruct`
   - `microsoft/Phi-3-mini-4k-instruct`
 - [Sept 25, 2024] [#660](https://github.com/ShishirPatil/gorilla/pull/660): Bug fix in `parse_nested_value` function to handle nested dictionary values properly.
 - [Sept 24, 2024] [#657](https://github.com/ShishirPatil/gorilla/pull/657): Add the following new models to the leaderboard:
   - `meta-llama/Llama-3.2-1B-Instruct`
   - `meta-llama/Llama-3.2-1B-Instruct-FC`
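The directory-layout overhaul in this release implies a one-time migration of existing flat `result/<model>/<category>.json` files into the new `result/<model>/<general_category>/<category>.json` hierarchy. A hypothetical sketch of that migration; the prefix-to-general-category mapping below is an assumption for illustration, not the official category list, so check it against your own categories before running anything like this:

```python
# Hypothetical migration from the flat layout to the V4 two-level layout.
# PREFIX_TO_GENERAL is an assumed, incomplete mapping for illustration only.
from pathlib import Path

PREFIX_TO_GENERAL = {
    "live": "live",
    "multi_turn": "multi_turn",
    "irrelevance": "non_live",
    "simple": "simple_python and friends map here",  # see next line
}
PREFIX_TO_GENERAL["simple"] = "non_live"  # single-turn categories assumed non_live

def general_category_for(category: str) -> str:
    """Guess the general category from the category file name (assumption)."""
    for prefix, general in PREFIX_TO_GENERAL.items():
        if category.startswith(prefix):
            return general
    return "non_live"  # assumed fallback

def migrate(result_root: Path) -> list:
    """Move each <model>/<category>.json under its general-category folder."""
    moved = []
    for model_dir in result_root.iterdir():
        if not model_dir.is_dir():
            continue
        # Snapshot the file list before creating subdirectories.
        for f in list(model_dir.glob("*.json")):
            dest_dir = model_dir / general_category_for(f.stem)
            dest_dir.mkdir(exist_ok=True)
            dest = dest_dir / f.name
            f.rename(dest)
            moved.append(dest)
    return moved
```

The agentic-memory categories would need the extra `<memory_backend>` level on top of this, which the sketch does not handle.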

berkeley-function-call-leaderboard/CONTRIBUTING.md

Lines changed: 1 addition & 1 deletion
@@ -142,7 +142,7 @@ Regardless of mode or model type, you should implement the following methods to

 ## Join Our Community

-- Have questions or need help? Join the [Gorilla Discord](https://discord.gg/grXXvj9Whz) and visit the `#leaderboard` channel.
+- Have questions or need help? Join the [Discord](https://discord.gg/grXXvj9Whz) and visit the `#leaderboard` channel.
 - Feel free to reach out if you have any questions, concerns, or would like guidance while adding your new model. We’re happy to assist!

 ---

berkeley-function-call-leaderboard/README.md

Lines changed: 15 additions & 10 deletions
@@ -11,6 +11,7 @@
   - [Extra Dependencies for Self-Hosted Models](#extra-dependencies-for-self-hosted-models)
   - [Configuring Project Root Directory](#configuring-project-root-directory)
   - [Setting up Environment Variables](#setting-up-environment-variables)
+  - [Configuring SerpAPI for Web Search Category](#configuring-serpapi-for-web-search-category)
 - [Running Evaluations](#running-evaluations)
   - [Generating LLM Responses](#generating-llm-responses)
     - [Selecting Models and Test Categories](#selecting-models-and-test-categories)
@@ -78,7 +79,7 @@ pip install bfcl-eval  # Be careful not to confuse with the unrelated `bfcl` pro

 For locally hosted models, choose one of the following backends, ensuring you have the right GPU and OS setup:

-`sglang` is *much faster* than `vllm` but only supports newer GPUs with SM 80+ (Ampere etc).
+`sglang` is *much faster* than `vllm` in our specific multi-turn use case, but it only supports newer GPUs with SM 80+ (Ampere etc).
 If you are using an older GPU (T4/V100), you should use `vllm` instead as it supports a much wider range of GPUs.

 **Using `vllm`:**
@@ -135,6 +136,10 @@ If you are running any proprietary models, make sure the model API keys are incl

 The library looks for the `.env` file in the project root, i.e. `$BFCL_PROJECT_ROOT/.env`.

+#### Configuring SerpAPI for Web Search Category
+
+For the `web_search` test category, we use the [SerpAPI](https://serpapi.com/) service to perform web search. You need to sign up for an API key and add it to your `.env` file. You can also switch to other web search APIs by changing the `search_engine_query` function in `bfcl_eval/eval_checker/multi_turn_eval/func_source_code/web_search.py`.
+
 ---

 ## Running Evaluations
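The new README section above notes that SerpAPI can be swapped for another search backend by replacing `search_engine_query`. A hedged sketch of what a drop-in replacement might look like; the signature, the return shape, and the `fetch` parameter are assumptions for illustration, not the actual interface in `bfcl_eval`:

```python
# Hypothetical replacement for the web-search helper mentioned in the README.
# The real `search_engine_query` may use a different signature and return
# shape; this sketch only illustrates injecting the HTTP layer so another
# search API (or a stub, for testing) can be used without touching callers.
from typing import Callable, Optional

def search_engine_query(
    query: str,
    top_k: int = 5,
    fetch: Optional[Callable[[str], list]] = None,
) -> list:
    """Return up to `top_k` results, each a dict like {"title": ..., "link": ...}."""
    if fetch is None:
        raise ValueError("wire `fetch` to your chosen search API")
    return fetch(query)[:top_k]

# Stubbed usage, no network: a fake fetcher stands in for the real API call.
def fake_fetch(q: str) -> list:
    return [{"title": f"result {i} for {q}", "link": f"https://example.com/{i}"}
            for i in range(10)]

print(len(search_engine_query("bfcl v4", top_k=3, fetch=fake_fetch)))  # → 3
```

Injecting the fetcher keeps the evaluation code independent of any one search provider's SDK.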
@@ -212,13 +217,13 @@ bfcl generate --model MODEL_NAME --test-category TEST_CATEGORY --num-threads 1
 bfcl generate \
   --model MODEL_NAME \
   --test-category TEST_CATEGORY \
-  --backend {vllm|sglang} \
+  --backend {sglang|vllm} \
   --num-gpus 1 \
   --gpu-memory-utilization 0.9 \
   --local-model-path /path/to/local/model  # ← optional
 ```

-- Choose your backend using `--backend vllm` or `--backend sglang`. The default backend is `vllm`.
+- Choose your backend using `--backend sglang` or `--backend vllm`. The default backend is `sglang`.
 - Control GPU usage by adjusting `--num-gpus` (default `1`, relevant for multi-GPU tensor parallelism) and `--gpu-memory-utilization` (default `0.9`), which can help avoid out-of-memory errors.
 - `--local-model-path` (optional): Point this flag at a directory that already contains the model's files (`config.json`, tokenizer, weights, etc.). Use it only when you've pre‑downloaded the model and the weights live somewhere other than the default `$HF_HOME` cache.

@@ -230,11 +235,11 @@ If you have a server already running (e.g., vLLM in a SLURM cluster), you can by
 bfcl generate --model MODEL_NAME --test-category TEST_CATEGORY --skip-server-setup
 ```

-In addition, you should specify the endpoint and port used by the server. By default, the endpoint is `localhost` and the port is `1053`. These can be overridden by the `VLLM_ENDPOINT` and `VLLM_PORT` environment variables in the `.env` file:
+In addition, you should specify the endpoint and port used by the local server. By default, the endpoint is `localhost` and the port is `1053`. These can be overridden by the `LOCAL_SERVER_ENDPOINT` and `LOCAL_SERVER_PORT` environment variables in the `.env` file:

 ```bash
-VLLM_ENDPOINT=localhost
-VLLM_PORT=1053
+LOCAL_SERVER_ENDPOINT=localhost
+LOCAL_SERVER_PORT=1053
 ```

 #### (Alternate) Script Execution for Generation
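The endpoint/port override renamed in this hunk amounts to reading two environment variables with documented defaults. A minimal sketch of that resolution; the variable names and defaults come from the README, while the helper name and the `/v1` path suffix are assumptions (the suffix being typical for OpenAI-compatible servers):

```python
# Sketch: resolve the local server's base URL from the environment,
# falling back to the README's documented defaults. LOCAL_SERVER_ENDPOINT
# and LOCAL_SERVER_PORT are the README's variable names; the helper name
# and the `/v1` suffix are assumptions, not bfcl_eval's actual code.
import os

def local_server_base_url(env=None) -> str:
    env = os.environ if env is None else env
    endpoint = env.get("LOCAL_SERVER_ENDPOINT", "localhost")
    port = env.get("LOCAL_SERVER_PORT", "1053")
    return f"http://{endpoint}:{port}/v1"

print(local_server_base_url({}))  # → http://localhost:1053/v1
```

Passing `env` explicitly keeps the helper testable without mutating the real process environment.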
@@ -312,9 +317,9 @@ For detailed steps, please see the [Contributing Guide](./CONTRIBUTING.md).

 ## Additional Resources

-- [Gorilla Discord](https://discord.gg/grXXvj9Whz) (`#leaderboard` channel)
-- [Project Website](https://gorilla.cs.berkeley.edu/)
+- [Discord](https://discord.gg/grXXvj9Whz) (`#leaderboard` channel)
+- [Project Website](https://gorilla.cs.berkeley.edu/leaderboard.html#leaderboard)

 All the leaderboard statistics, and data used to train the models are released under Apache 2.0.
-Gorilla is an open source effort from UC Berkeley and we welcome contributors.
-Please email us your comments, criticisms, and questions. More information about the project can be found at [https://gorilla.cs.berkeley.edu/](https://gorilla.cs.berkeley.edu/)
+BFCL is an open source effort from UC Berkeley and we welcome contributors.
+For any comments, criticisms, or questions, please feel free to raise an issue or a PR. You can also reach us via [email](mailto:huanzhimao@berkeley.edu).
