
Commit a9dd435

[BFCL] Standardize TEST_CATEGORY Among eval_runner.py and openfunctions_evaluation.py (#506)
There are inconsistencies between the `test_category` argument used by `eval_checker/eval_runner.py` and `openfunctions_evaluation.py`. This PR partially addresses #501 and #502.
Co-authored-by: Shishir Patil <[email protected]>
1 parent 57a9fe8 commit a9dd435

File tree

6 files changed: 168 additions, 154 deletions


berkeley-function-call-leaderboard/README.md

Lines changed: 91 additions & 89 deletions
@@ -9,7 +9,7 @@
## Introduction
We introduce the Berkeley Function Leaderboard (BFCL), the **first comprehensive and executable function call evaluation dedicated to assessing Large Language Models' (LLMs) ability to invoke functions**. Unlike previous function call evaluations, BFCL accounts for various forms of function calls, diverse function calling scenarios, and their executability. Additionally, we release Gorilla-Openfunctions-v2, the most advanced open-source model to date capable of handling multiple languages, parallel function calls, and multiple function calls simultaneously. A unique debugging feature of this model is its ability to output an "Error Message" when the provided function does not suit your task.

-Read more about the technical details and interesting insights in our blog post!
+Read more about the technical details and interesting insights in our [blog post](https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html)!

![image](./architecture_diagram.png)
### Install Dependencies
@@ -48,10 +48,9 @@ python apply_function_credential_config.py
```


-## Berkeley Function-Calling Leaderboard Statistics
+## Evaluating different models on the BFCL

-
-Make sure models API keys are included in your environment variables.
+Make sure the model API keys are included in your environment variables. Running proprietary models like GPTs, Claude, Mistral-X will require them.

```bash
export OPENAI_API_KEY=sk-XXXXXX
@@ -62,96 +61,22 @@ export COHERE_API_KEY=XXXXXX
export NVIDIA_API_KEY=nvapi-XXXXXX
```

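A quick way to confirm the required keys are actually set before kicking off a run (a minimal sketch, assuming the variable names from the export block above; trim the list to the providers you plan to run):

```python
import os
import sys

# Key names taken from the export block above; adjust for the providers you use.
REQUIRED_KEYS = ["OPENAI_API_KEY", "COHERE_API_KEY", "NVIDIA_API_KEY"]

missing = [key for key in REQUIRED_KEYS if not os.environ.get(key)]
if missing:
    sys.exit(f"Missing API keys: {', '.join(missing)}")
print("All required API keys are set.")
```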
-To generate leaderboard statistics, there are two steps:
-
-1. LLM Inference of the evaluation data from specific models
-
-```bash
-python openfunctions_evaluation.py --model MODEL_NAME --test-category TEST_CATEGORY
-```
-For TEST_CATEGORY, we have `executable_simple`, `executable_parallel_function`, `executable_multiple_function`, `executable_parallel_multiple_function`, `simple`, `relevance`, `parallel_function`, `multiple_function`, `parallel_multiple_function`, `java`, `javascript`, `rest`, `sql`, `chatable`.
-
-If you want to run all evaluations at the same time, you can use `all` as the test category.
-
-Running proprietary models like GPTs, Claude, Mistral-X will require an API-Key which can be supplied in `openfunctions_evaluation.py`.
-
-If decided to run OSS model, openfunction evaluation uses vllm and therefore requires GPU for hosting and inferencing. If you have questions or concerns about evaluating OSS models, please reach out to us in our [discord channel](https://discord.gg/grXXvj9Whz).
+If you decide to run an OSS model, the generation script uses vLLM and therefore requires a GPU for hosting and inference. If you have questions or concerns about evaluating OSS models, please reach out to us in our [discord channel](https://discord.gg/grXXvj9Whz).

+### Generating LLM Responses

-
-
-## Checking the Evaluation Results
-
-### Running the Checker
-
-Navigate to the `gorilla/berkeley-function-call-leaderboard/eval_checker` directory and run the `eval_runner.py` script with the desired parameters. The basic syntax is as follows:
+Use the following command to run LLM inference on the evaluation dataset with a specific model:

```bash
-python eval_runner.py --model MODEL_NAME --test-category {TEST_CATEGORY,all,ast,executable,python,non-python}
-```
-
-- `MODEL_NAME`: Optional. The name of the model you wish to evaluate. This parameter can accept multiple model names separated by spaces. Eg, `--model gorilla-openfunctions-v2 gpt-4-0125-preview`.
-  - If no model name is provided, the script will run the checker on all models exist in the `result` folder. This path can be changed by modifying the `INPUT_PATH` variable in the `eval_runner.py` script.
-- `TEST_CATEGORY`: Optional. The category of tests to run. You can specify multiple categories separated by spaces. Available options include:
-  - `all`: Run all test categories.
-  - `ast`: Abstract Syntax Tree tests.
-  - `executable`: Executable code evaluation tests.
-  - `python`: Tests specific to Python code.
-  - `non-python`: Tests for code in languages other than Python, such as Java and JavaScript.
-  - Individual test categories:
-    - `simple`: Simple function calls.
-    - `parallel_function`: Multiple function calls in parallel.
-    - `multiple_function`: Multiple function calls in sequence.
-    - `parallel_multiple_function`: Multiple function calls in parallel and in sequence.
-    - `executable_simple`: Executable function calls.
-    - `executable_parallel_function`: Executable multiple function calls in parallel.
-    - `executable_multiple_function`: Executable multiple function calls in sequence.
-    - `executable_parallel_multiple_function`: Executable multiple function calls in parallel and in sequence.
-    - `java`: Java function calls.
-    - `javascript`: JavaScript function calls.
-    - `rest`: REST API function calls.
-    - `relevance`: Function calls with irrelevant function documentation.
-  - If no test category is provided, the script will run all available test categories.
-> If you want to run the `all` or `executable` or `python` category, make sure to register your REST API keys in `function_credential_config.json`. This is because Gorilla Openfunctions Leaderboard wants to test model's generated output on real world API!
-
-> If you do not wish to provide API keys for REST API testing, set `test-category` to `ast` or any non-executable category.
-
-> By setting the `--api-sanity-check` flag, or `-c` for short, if the test categories include `executable`, the evaluation process will perform the REST API sanity check first to ensure that all the API endpoints involved during the execution evaluation process are working properly. If any of them are not behaving as expected, we will flag those in the console and continue execution.
-
-### Example Usage
-
-If you want to run all tests for the `gorilla-openfunctions-v2` model, you can use the following command:
-
-```bash
-python eval_runner.py --model gorilla-openfunctions-v2
-
-```
-
-If you want to evaluate all offline tests (do not require RapidAPI keys) for OpenAI GPT-3.5, you can use the following command:
-
-```bash
-python eval_runner.py --model gpt-3.5-turbo-0125 --test-category ast
-```
-
-If you want to run `rest` tests for all GPT models, you can use the following command:
-
-```bash
-python eval_runner.py --model gpt-3.5-turbo-0125 gpt-4-0613 gpt-4-1106-preview gpt-4-0125-preview --test-category rest
-```
-
-If you want to run `rest` and `javascript` tests for all GPT models and `gorilla-openfunctions-v2`, you can use the following command:
-
-```bash
-python eval_runner.py --model gorilla-openfunctions-v2 gpt-3.5-turbo-0125 gpt-4-0613 gpt-4-1106-preview gpt-4-0125-preview --test-category rest javascript
+python openfunctions_evaluation.py --model MODEL_NAME --test-category TEST_CATEGORY
```

-### Model-Specific Optimization
-
-Some companies have proposed some optimization strategies in their models' handler, which we (BFCL) think is unfair to other models, as those optimizations are not generalizable to all models. Therefore, we have disabled those optimizations during the evaluation process by default. You can enable those optimizations by setting the `USE_{COMPANY}_OPTIMIZATION` flag to `True` in the `model_handler/constants.py` file.
+For available options for `MODEL_NAME` and `TEST_CATEGORY`, please refer to the [Models Available](#models-available) and [Available Test Category](#available-test-category) sections below.

+If no `MODEL_NAME` is provided, the model `gorilla-openfunctions-v2` will be used by default. If no `TEST_CATEGORY` is provided, all test categories will be run by default.

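The defaults described above can be pictured with a small `argparse` sketch. This is illustrative only; the actual flag definitions live in `openfunctions_evaluation.py` and may differ in detail:

```python
import argparse

# Illustrative sketch of the generation CLI described above, with the defaults
# stated in the README (not a copy of openfunctions_evaluation.py itself).
parser = argparse.ArgumentParser(description="Generate LLM responses for BFCL")
parser.add_argument("--model", default="gorilla-openfunctions-v2",
                    help="Model to run inference with")
parser.add_argument("--test-category", default="all",
                    help="Test category or collection name to generate for")
args = parser.parse_args()
print(f"Generating responses for {args.model} on category '{args.test_category}'")
```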
-## Models Available
-Below is *a table of models we support* to run our leaderboard evaluation against. If supported function calling (FC), we will follow its function calling format provided by official documentation. Otherwise, we will construct a system message to prompt the model to generate function calls in the right format.
+### Models Available
+Below is *a table of models we support* to run our leaderboard evaluation against. If a model supports function calling (FC), we follow the function calling format in its official documentation. Otherwise, we use a consistent system message to prompt the model to generate function calls in the right format.
|Model | Type |
|---|---|
|gorilla-openfunctions-v2 | Function Calling|
@@ -200,8 +125,85 @@ For model names with {.}, it means that the model has multiple versions. For exa
For Mistral large and small models, we provide evaluation on both of their `Any` and `Auto` settings. More information about this can be found [here](https://docs.mistral.ai/guides/function-calling/).


-For inferencing `Gemini-1.0-pro`, you need to fill in `model_handler/gemini_handler.py` with your GCP project ID that has access to Vertex AI endpoint.
-For inferencing `Databrick-DBRX-instruct`, you need to create a Databrick Azure workspace and setup an endpoint for inference.
+For `Gemini-1.0-pro`, you need to fill in `model_handler/gemini_handler.py` with your GCP project ID that has access to the Vertex AI endpoint.
+For `Databrick-DBRX-instruct`, you need to create a Databricks Azure workspace and set up an endpoint for inference.
+
+
+### Available Test Category
+In the following two sections, the optional `--test-category` parameter can be used to specify the category of tests to run. You can specify multiple categories separated by spaces. Available options include:
+
+- `all`: Run all test categories.
+  - This is the default option if no test category is provided.
+- `ast`: Abstract Syntax Tree tests.
+- `executable`: Executable code evaluation tests.
+- `python`: Tests specific to Python code.
+- `non-python`: Tests for code in languages other than Python, such as Java and JavaScript.
+- `python-ast`: Python Abstract Syntax Tree tests.
+- Individual test categories:
+  - `simple`: Simple function calls.
+  - `parallel_function`: Multiple function calls in parallel.
+  - `multiple_function`: Multiple function calls in sequence.
+  - `parallel_multiple_function`: Multiple function calls in parallel and in sequence.
+  - `executable_simple`: Executable function calls.
+  - `executable_parallel_function`: Executable multiple function calls in parallel.
+  - `executable_multiple_function`: Executable multiple function calls in sequence.
+  - `executable_parallel_multiple_function`: Executable multiple function calls in parallel and in sequence.
+  - `java`: Java function calls.
+  - `javascript`: JavaScript function calls.
+  - `rest`: REST API function calls.
+  - `relevance`: Function calls with irrelevant function documentation.
+- If no test category is provided, the script will run all available test categories (same as `all`).
+
+> If you want to run the `all`, `executable`, or `python` category, make sure to register your REST API keys in `function_credential_config.json`. This is because the Gorilla Openfunctions Leaderboard tests the model's generated output against real-world APIs!
+
+> If you do not wish to provide API keys for REST API testing, set `test-category` to `ast` or any other non-executable category.
+
+> By setting the `--api-sanity-check` flag (`-c` for short), if the test categories include `executable`, the evaluation process will first run a REST API sanity check to ensure that all API endpoints involved in execution evaluation are working properly. Any endpoints not behaving as expected are flagged in the console, and execution continues.
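To make the collection names above concrete: this commit adds a shared `TEST_COLLECTION_MAPPING` constant in `eval_checker/eval_checker_constant.py` (shown further down), which `eval_runner.py` uses to expand a collection into its individual categories. A minimal sketch of that expansion, using a subset of the mapping:

```python
# Minimal sketch of how a collection name expands into individual test
# categories; the values mirror a subset of TEST_COLLECTION_MAPPING.
TEST_COLLECTION_MAPPING = {
    "ast": ["simple", "multiple_function", "parallel_function",
            "parallel_multiple_function", "java", "javascript", "relevance"],
    "non-python": ["java", "javascript"],
}


def expand_categories(requested: list[str]) -> list[str]:
    expanded = []
    for category in requested:
        # Collection names expand to their members; plain categories pass through.
        expanded.extend(TEST_COLLECTION_MAPPING.get(category, [category]))
    return expanded


print(expand_categories(["non-python", "rest"]))
# ['java', 'javascript', 'rest']
```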
+
+
+## Evaluating the LLM generations
+
+### Running the Checker
+
+Navigate to the `gorilla/berkeley-function-call-leaderboard/eval_checker` directory and run the `eval_runner.py` script with the desired parameters. The basic syntax is as follows:
+
+```bash
+python eval_runner.py --model MODEL_NAME --test-category {TEST_CATEGORY,all,ast,executable,python,non-python}
+```
+
+For available options for `MODEL_NAME` and `TEST_CATEGORY`, please refer to the [Models Available](#models-available) and [Available Test Category](#available-test-category) sections.
+
+If no `MODEL_NAME` is provided, all available model results will be evaluated by default. If no `TEST_CATEGORY` is provided, all test categories will be run by default.
+
+### Example Usage
+
+If you want to run all tests for the `gorilla-openfunctions-v2` model, you can use the following command:
+
+```bash
+python eval_runner.py --model gorilla-openfunctions-v2
+```
+
+If you want to evaluate all offline tests (those that do not require RapidAPI keys) for OpenAI GPT-3.5, you can use the following command:
+
+```bash
+python eval_runner.py --model gpt-3.5-turbo-0125 --test-category ast
+```
+
+If you want to run `rest` tests for a few Claude models, you can use the following command:
+
+```bash
+python eval_runner.py --model claude-3-5-sonnet-20240620 claude-3-opus-20240229 claude-3-sonnet-20240229 --test-category rest
+```
+
+If you want to run `rest` and `javascript` tests for a few models and `gorilla-openfunctions-v2`, you can use the following command:
+
+```bash
+python eval_runner.py --model gorilla-openfunctions-v2 claude-3-5-sonnet-20240620 gpt-4-0125-preview gemini-1.5-pro-preview-0514 --test-category rest javascript
+```
+
+### Model-Specific Optimization
+
+Some companies have proposed optimization strategies in their models' handlers that we (BFCL) consider unfair to other models, since those optimizations are not generalizable. Therefore, we disable them during the evaluation process by default. You can enable them by setting the `USE_{COMPANY}_OPTIMIZATION` flag to `True` in the `model_handler/constants.py` file.
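As a rough illustration of how such a switch might gate vendor-specific behavior (the `USE_{COMPANY}_OPTIMIZATION` naming comes from the paragraph above; the concrete constant and function below are hypothetical, not the repository's actual code):

```python
# Hypothetical sketch only: the names here are placeholders, not the actual
# constants defined in model_handler/constants.py.
USE_EXAMPLECO_OPTIMIZATION = False  # disabled by default so all models are compared equally


def build_system_message(functions: list[dict]) -> str:
    message = f"You have access to the following functions: {functions}"
    if USE_EXAMPLECO_OPTIMIZATION:
        # Vendor-recommended prompt tweak, applied only when explicitly enabled.
        message += "\nAlways respond with a single JSON function call."
    return message
```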


## Changelog
@@ -254,7 +256,7 @@ For inferencing `Databrick-DBRX-instruct`, you need to create a Databrick Azure

## Contributing

-To add a new model to the Function Calling Leaderboard, here are a few things you need to do:
+We welcome additions to the Function Calling Leaderboard! To add a new model, here are a few things you need to do:

1. Take a look at `model_handler/handler.py`. This is the base handler object from which all handlers inherit. Also, feel free to take a look at the existing model handlers; very likely you can reuse some of the existing code if the new model outputs in a similar format.
2. Create your handler and define the following functions

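The list of functions to implement continues in the full README and in `model_handler/handler.py` itself; it is not reproduced in this diff. Purely as an illustrative sketch of the overall shape of a new handler (the method names below are assumptions, not the actual required interface):

```python
# Illustrative sketch only: a new-model handler subclasses the base handler in
# model_handler/handler.py. The method names here are placeholders; consult
# handler.py and the existing handlers for the functions you actually need.
class MyModelHandler:  # in the repo this would inherit from the base handler class
    def __init__(self, model_name: str, temperature: float = 0.0):
        self.model_name = model_name
        self.temperature = temperature

    def inference(self, prompt: str, functions: list[dict]) -> str:
        """Call the model (API or local vLLM server) and return its raw output."""
        raise NotImplementedError

    def decode(self, raw_output: str) -> list[dict]:
        """Convert raw model output into the leaderboard's function-call format."""
        raise NotImplementedError
```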
berkeley-function-call-leaderboard/apply_function_credential_config.py

Lines changed: 1 addition & 1 deletion
@@ -62,7 +62,7 @@ def process_file(input_file_path, output_file_path):
    with open(output_file_path, "w") as f:
        for i, modified_line in enumerate(modified_data):
            f.write(modified_line)
-            if i < len(data) - 1:
+            if i < len(modified_data) - 1:
                f.write("\n")

    print(f"All placeholders have been replaced for {input_file_path} 🦍.")
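The one-line fix above indexes the list actually being iterated (`modified_data`) rather than the original `data`, so the trailing-newline check cannot drift out of sync with what is written. A minimal standalone sketch of the same write-without-trailing-newline pattern (not the repository's function):

```python
# Standalone sketch of the corrected pattern: the length check must use the
# same list that is being iterated and written.
def write_lines(path: str, lines: list[str]) -> None:
    with open(path, "w") as f:
        for i, line in enumerate(lines):
            f.write(line)
            if i < len(lines) - 1:  # newline between lines, none after the last
                f.write("\n")


# An equivalent, simpler formulation:
def write_lines_joined(path: str, lines: list[str]) -> None:
    with open(path, "w") as f:
        f.write("\n".join(lines))
```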

berkeley-function-call-leaderboard/eval_checker/eval_checker_constant.py

Lines changed: 56 additions & 0 deletions
@@ -16,3 +16,59 @@
    "executable_parallel_multiple_function": (1760, 1799),
    "multiple_function": (1800, 1999),
}
+
+TEST_COLLECTION_MAPPING = {
+    "ast": [
+        "simple",
+        "multiple_function",
+        "parallel_function",
+        "parallel_multiple_function",
+        "java",
+        "javascript",
+        "relevance",
+    ],
+    "executable": [
+        "executable_simple",
+        "executable_multiple_function",
+        "executable_parallel_function",
+        "executable_parallel_multiple_function",
+        "rest",
+    ],
+    "all": [
+        "simple",
+        "multiple_function",
+        "parallel_function",
+        "parallel_multiple_function",
+        "java",
+        "javascript",
+        "relevance",
+        "executable_simple",
+        "executable_multiple_function",
+        "executable_parallel_function",
+        "executable_parallel_multiple_function",
+        "rest",
+    ],
+    "non-python": [
+        "java",
+        "javascript",
+    ],
+    "python": [
+        "simple",
+        "multiple_function",
+        "parallel_function",
+        "parallel_multiple_function",
+        "relevance",
+        "executable_simple",
+        "executable_multiple_function",
+        "executable_parallel_function",
+        "executable_parallel_multiple_function",
+        "rest",
+    ],
+    "python-ast": [
+        "simple",
+        "multiple_function",
+        "parallel_function",
+        "parallel_multiple_function",
+        "relevance",
+    ],
+}
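One property worth noting in the mapping above: `all` is exactly the union of the `ast` and `executable` collections. A quick standalone check (run from within the `eval_checker` directory so the module is importable):

```python
# Quick consistency check on the mapping added above.
from eval_checker_constant import TEST_COLLECTION_MAPPING

combined = set(TEST_COLLECTION_MAPPING["ast"]) | set(TEST_COLLECTION_MAPPING["executable"])
assert combined == set(TEST_COLLECTION_MAPPING["all"])
print("'all' covers exactly the 'ast' and 'executable' collections.")
```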

berkeley-function-call-leaderboard/eval_checker/eval_runner.py

Lines changed: 3 additions & 52 deletions
@@ -5,6 +5,7 @@
from checker import ast_checker, exec_checker, executable_checker_rest
from custom_exception import BadAPIStatusError
from eval_runner_helper import *
+from eval_checker_constant import TEST_COLLECTION_MAPPING
from tqdm import tqdm
import argparse

@@ -430,56 +431,6 @@ def runner(model_names, test_categories, api_sanity_check):
    print(f"🏁 Evaluation completed. See {os.path.abspath(OUTPUT_PATH + 'data.csv')} for evaluation results.")


-ARG_PARSE_MAPPING = {
-    "ast": [
-        "simple",
-        "multiple_function",
-        "parallel_function",
-        "parallel_multiple_function",
-        "java",
-        "javascript",
-        "relevance",
-    ],
-    "executable": [
-        "executable_simple",
-        "executable_multiple_function",
-        "executable_parallel_function",
-        "executable_parallel_multiple_function",
-        "rest",
-    ],
-    "all": [
-        "simple",
-        "multiple_function",
-        "parallel_function",
-        "parallel_multiple_function",
-        "java",
-        "javascript",
-        "relevance",
-        "executable_simple",
-        "executable_multiple_function",
-        "executable_parallel_function",
-        "executable_parallel_multiple_function",
-        "rest",
-    ],
-    "non-python": [
-        "java",
-        "javascript",
-    ],
-    "python": [
-        "simple",
-        "multiple_function",
-        "parallel_function",
-        "parallel_multiple_function",
-        "relevance",
-        "executable_simple",
-        "executable_multiple_function",
-        "executable_parallel_function",
-        "executable_parallel_multiple_function",
-        "rest",
-    ],
-}
-
-
INPUT_PATH = "../result/"
PROMPT_PATH = "../data/"
POSSIBLE_ANSWER_PATH = "../data/possible_answer/"
@@ -518,8 +469,8 @@ def runner(model_names, test_categories, api_sanity_check):
    if args.test_category is not None:
        test_categories = []
        for test_category in args.test_category:
-            if test_category in ARG_PARSE_MAPPING:
-                test_categories.extend(ARG_PARSE_MAPPING[test_category])
+            if test_category in TEST_COLLECTION_MAPPING:
+                test_categories.extend(TEST_COLLECTION_MAPPING[test_category])
            else:
                test_categories.append(test_category)

berkeley-function-call-leaderboard/eval_checker/eval_runner_helper.py

Lines changed: 1 addition & 1 deletion
@@ -621,7 +621,7 @@ def api_status_sanity_check_rest():
    ground_truth_dummy = load_file(REST_API_GROUND_TRUTH_FILE_PATH)

    # Use the ground truth data to make sure the API is working correctly
-    command = f"cd .. ; python apply_function_credential_config.py --input-file ./eval_checker/{REST_API_GROUND_TRUTH_FILE_PATH};"
+    command = f"cd .. ; python apply_function_credential_config.py --input-path ./eval_checker/{REST_API_GROUND_TRUTH_FILE_PATH};"
    try:
        subprocess.run(command, shell=True, capture_output=True, text=True, check=True)
    except subprocess.CalledProcessError as e:
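The flag rename above keeps the shell command in sync with the option that `apply_function_credential_config.py` now expects (`--input-path`). For reference, an equivalent invocation that avoids `shell=True` and the embedded `cd` might look like the sketch below; this is an illustrative alternative, not what the repository does, and the file path shown is a placeholder:

```python
import subprocess

# Placeholder path; the real value comes from REST_API_GROUND_TRUTH_FILE_PATH.
rest_api_ground_truth_file = "rest-eval-response_v5.jsonl"

# Pass arguments as a list and set the working directory explicitly instead of
# building a "cd .. ; ..." shell string.
subprocess.run(
    [
        "python",
        "apply_function_credential_config.py",
        "--input-path",
        f"./eval_checker/{rest_api_ground_truth_file}",
    ],
    cwd="..",
    capture_output=True,
    text=True,
    check=True,
)
```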
