This PR packages the Berkeley Function Calling Leaderboard (BFCL) for
distribution on PyPI. In addition to the existing editable-install
workflow, users can now get a pre-built wheel in a single step:
```
pip install bfcl-eval
```
For versioning, we will use a CalVer + serial approach, e.g., `2025.06.08`, `2025.06.08.1`, `2025.06.08.2`, `2025.06.09`, ...
## Usage options
| Scenario | Command |
|-----------------------------------|-----------------------------|
| Contribute code / Need customization | `pip install -e .` |
| Just run the evaluation | `pip install bfcl-eval` |
## Important 🛑
There is an **unrelated** package named `bfcl` on PyPI. Double-check to
make sure you’re installing `bfcl-eval`.
---
For wheel packaging purposes, the following project restructuring is
necessary:
- Rename folder `bfcl` → `bfcl_eval`
- Rename folder `utils` → `scripts`
- Move `data/`, `.env.example`, and `test_case_ids_to_generate.json`
into `bfcl_eval`
Resolves ShishirPatil#1027

---

```diff
 │   │   ├── parser/             # Parsing utilities for Java/JavaScript
 │   │   ├── base_handler.py     # Base handler blueprint
-├── data/                       # Datasets
+│   ├── data/                   # Datasets
+│   ├── scripts/                # Helper scripts
 ├── result/                     # Model responses
 ├── score/                      # Evaluation results
-├── utils/                      # Helper scripts
```
To add a new model, focus primarily on the `model_handler` directory. You do not need to modify the parsing utilities in `model_handler/parser` or any other directories.
## Where to Begin
- **Base Handler:** Start by reviewing `bfcl_eval/model_handler/base_handler.py`. All model handlers inherit from this base class. The `inference_single_turn` and `inference_multi_turn` methods defined there are helpful for understanding the model response generation pipeline. The docstrings of each abstract method in `base_handler.py` contain many useful details, so be sure to review them.
  - If your model is hosted locally, you should also look at `bfcl_eval/model_handler/local_inference/base_oss_handler.py`.
- **Reference Handlers:** Check out some of the existing model handlers (such as `openai.py`, `claude.py`, etc.); you can likely reuse some of the existing code if your new model outputs in a similar format.
  - If your model is OpenAI-compatible, the `openai.py` handler will be helpful (and you might be able to just use it as is).
  - If your model is locally hosted, the `llama_fc.py` handler or the `deepseek_coder.py` handler can be good starting points.
## Updating Model Config Mapping
1. **Add a new entry in `bfcl_eval/constants/model_config.py`**

   Populate every field in the `ModelConfig` dataclass:
4. **Update Supported Models**
   1. Add your model to the list of supported models in `SUPPORTED_MODELS.md`. Include the model name and type (FC or Prompt) in the table.
   2. Add a new entry in `bfcl_eval/constants/supported_models.py` as well.
---

In `berkeley-function-call-leaderboard/LOG_GUIDE.md` (2 additions, 2 deletions):
For multi-turn categories, we understand the provided ground truth may seem nonsensical without context. We have provided a utility script to simulate a conversation between the ground truth and the system:
```bash
cd berkeley-function-call-leaderboard/bfcl_eval/scripts
```

---
*Optional:* If using `sglang`, we recommend installing `flashinfer` for speedups. Find instructions [here](https://docs.flashinfer.ai/installation.html).
### Configuring Project Root Directory
**Important:** If you installed the package from PyPI (using `pip install bfcl-eval`), you **must** set the `BFCL_PROJECT_ROOT` environment variable to specify where the evaluation results and score files should be stored.
Otherwise, you'll need to navigate deep into the Python package's source code folder to access the evaluation results and configuration files.
For editable installations (using `pip install -e .`), setting `BFCL_PROJECT_ROOT` is *optional*; it defaults to the `berkeley-function-call-leaderboard` directory.
Set `BFCL_PROJECT_ROOT` as an environment variable in your shell environment:
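For example, a minimal sketch (the path below is a hypothetical placeholder; point it at any writable directory):

```bash
# Add this to your shell profile (e.g., ~/.bashrc or ~/.zshrc) or run it in the current session
export BFCL_PROJECT_ROOT=/path/to/my/bfcl_workspace   # hypothetical path
```

Once the project root is set: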
- The `result/` folder (containing model responses) will be created at `$BFCL_PROJECT_ROOT/result/`
- The `score/` folder (containing evaluation results) will be created at `$BFCL_PROJECT_ROOT/score/`
- The library will look for the `.env` configuration file at `$BFCL_PROJECT_ROOT/.env` (see [Setting up Environment Variables](#setting-up-environment-variables))
### Setting up Environment Variables
We store API keys and other configuration variables (separate from the `BFCL_PROJECT_ROOT` variable mentioned above) in a `.env` file. A sample `.env.example` file is distributed with the package.
If you are running any proprietary models, make sure the model API keys are included in your `.env` file. Models such as GPT, Claude, Mistral, Gemini, and Nova will require them.
The library looks for the `.env` file in the project root, i.e. `$BFCL_PROJECT_ROOT/.env`.
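As a hedged sketch, the relevant entries might look like the following (the key names here are assumptions for illustration; use the exact names listed in `.env.example`):

```bash
# $BFCL_PROJECT_ROOT/.env -- only the keys for providers you actually run are needed
OPENAI_API_KEY=sk-...        # assumed key name; confirm against .env.example
ANTHROPIC_API_KEY=...        # assumed key name; confirm against .env.example
```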
---
## Running Evaluations
#### Selecting Specific Test Cases with `--run-ids`
Sometimes you may only need to regenerate a handful of test entries, for instance when iterating on a new model or after fixing an inference bug. Passing the `--run-ids` flag lets you target **exact test IDs** rather than an entire category:
```bash
bfcl generate --model MODEL_NAME --run-ids # --test-category will be ignored
```
When this flag is set, the generation pipeline reads a JSON file named `test_case_ids_to_generate.json` located in the *project root* (the same place where `.env` lives). The file should map each test category to a list of IDs to run:
```json
{
  "simple": ["simple_101", "simple_202"],
  "multi_turn_base": ["multi_turn_base_14"]
}
```
> Note: When using `--run-ids`, the `--test-category` flag is ignored.
A sample file is provided at `bfcl_eval/test_case_ids_to_generate.json`; **copy it to your project root** so the CLI can pick it up regardless of your working directory:
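For example, a minimal sketch assuming a Unix shell and an editable checkout of the repository (for a PyPI install, copy the file from wherever the `bfcl_eval` package is installed):

```bash
# Copy the sample file into the project root so `bfcl generate --run-ids` can find it
cp berkeley-function-call-leaderboard/bfcl_eval/test_case_ids_to_generate.json "$BFCL_PROJECT_ROOT/"
```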
Once `--run-ids` is provided, only the IDs listed in the JSON will be evaluated.
#### Output and Logging
- By default, generated model responses are stored in a `result/` folder under the project root (which defaults to the package directory): `result/MODEL_NAME/BFCL_v3_TEST_CATEGORY_result.json`.
- You can customise the location by setting the `BFCL_PROJECT_ROOT` environment variable or passing the `--result-dir` option.
An inference log is included with the model responses to help analyze and debug the model's performance and to better understand its behavior. For more verbose logging, use the `--include-input-log` flag (see the usage sketch below). Refer to [LOG_GUIDE.md](./LOG_GUIDE.md) for details on how to interpret the inference logs.
- Use `--num-threads` to control the level of parallel inference. The default (`1`) means no parallelization.
  - The maximum number of threads allowed depends on your API's rate limits.
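Putting these flags together, a usage sketch (the model name and result directory are placeholders):

```bash
# Store responses under a custom directory, with verbose logs and 4 parallel threads
bfcl generate \
  --model MODEL_NAME \
  --test-category simple \
  --result-dir my_results \
  --include-input-log \
  --num-threads 4
```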
#### For Locally-hosted OSS Models
- Choose your backend using `--backend vllm` or `--backend sglang`. The default backend is `vllm`.
- Control GPU usage by adjusting `--num-gpus` (default `1`, relevant for multi-GPU tensor parallelism) and `--gpu-memory-utilization` (default `0.9`), which can help avoid out-of-memory errors.
- `--local-model-path` (optional): Point this flag at a directory that already contains the model's files (`config.json`, tokenizer, weights, etc.). Use it only when you've pre-downloaded the model and the weights live somewhere other than the default `$HF_HOME` cache.
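For example, a sketch of running a locally-hosted model with the flags above (the model name and local path are placeholders):

```bash
bfcl generate \
  --model MODEL_NAME \
  --test-category simple \
  --backend sglang \
  --num-gpus 2 \
  --gpu-memory-utilization 0.8 \
  --local-model-path /path/to/predownloaded/weights   # optional; only if weights live outside $HF_HOME
```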
##### For Pre-existing OpenAI-compatible Endpoints
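A minimal sketch of pointing BFCL at an already-running OpenAI-compatible server via `.env` (the `VLLM_ENDPOINT` key and `localhost` value are assumptions for illustration; check `.env.example` for the exact variable names):

```bash
# $BFCL_PROJECT_ROOT/.env -- reuse a server that is already running
VLLM_ENDPOINT=localhost   # assumed variable name; verify against .env.example
VLLM_PORT=1053            # example port
```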
For those who prefer using script execution instead of the CLI, you can run the following command:
```bash
# Make sure you are inside the `berkeley-function-call-leaderboard` directory
```
When specifying multiple models or test categories, separate them with **spaces**, not commas. All other flags mentioned earlier are compatible with the script execution method as well.
The `MODEL_NAME` and `TEST_CATEGORY` options are the same as those used in the [Generating LLM Responses](#generating-llm-responses) section. For details, refer to [SUPPORTED_MODELS.md](./SUPPORTED_MODELS.md) and [TEST_CATEGORIES.md](./TEST_CATEGORIES.md).
If in the previous step you stored the model responses in a custom directory, specify it using the `--result-dir` flag or set `BFCL_PROJECT_ROOT` so the evaluator can locate the files.
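For example, a sketch (assuming the CLI's `evaluate` subcommand; the model name and directory are placeholders):

```bash
# Evaluate responses that were generated into a custom result directory
bfcl evaluate \
  --model MODEL_NAME \
  --test-category simple \
  --result-dir my_results
```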
> Note: Unevaluated test categories will be marked as `N/A` in the evaluation result CSV files.
> For summary columns (e.g., `Overall Acc`, `Non_Live Overall Acc`, `Live Overall Acc`, and `Multi Turn Overall Acc`), the score reported will treat all unevaluated categories as 0 during calculation.
#### Output Structure
Evaluation scores are stored in a `score/` directory under the project root (defaults to the package directory), mirroring the structure of `result/`: `score/MODEL_NAME/BFCL_v3_TEST_CATEGORY_score.json`.
- To use a custom directory for the score file, set the `BFCL_PROJECT_ROOT` environment variable or specify `--score-dir`.
Additionally, four CSV files are generated in `./score/`:
Make sure you also set `WANDB_BFCL_PROJECT=ENTITY:PROJECT` in `.env`.
For those who prefer using script execution instead of the CLI, you can run the following command:
```bash
# Make sure you are inside the `berkeley-function-call-leaderboard/bfcl_eval/eval_checker` directory
```
When specifying multiple models or test categories, separate them with **spaces**, not commas. All other flags mentioned earlier are compatible with the script execution method as well.
We welcome contributions! To add a new model:
1. Review `bfcl_eval/model_handler/base_handler.py` and/or `bfcl_eval/model_handler/local_inference/base_oss_handler.py` (if your model is hosted locally).
2. Implement a new handler class for your model.
3. Update `bfcl_eval/constants/model_config.py`.
4. Submit a Pull Request.
For detailed steps, please see the [Contributing Guide](./CONTRIBUTING.md).