[BFCL] Standardize TEST_CATEGORY Among eval_runner.py and openfunctions_evaluation.py (#506)
There are inconsistencies between the `test_category` argument that's
used by `eval_checker/eval_runner.py` and `openfunctions_evaluation.py`.
This PR partially addresses #501 and #502.
---------
Co-authored-by: Shishir Patil <[email protected]>
berkeley-function-call-leaderboard/README.md (91 additions, 89 deletions)
@@ -9,7 +9,7 @@
## Introduction
We introduce the Berkeley Function Leaderboard (BFCL), the **first comprehensive and executable function call evaluation dedicated to assessing Large Language Models' (LLMs) ability to invoke functions**. Unlike previous function call evaluations, BFCL accounts for various forms of function calls, diverse function calling scenarios, and their executability. Additionally, we release Gorilla-Openfunctions-v2, the most advanced open-source model to date capable of handling multiple languages, parallel function calls, and multiple function calls simultaneously. A unique debugging feature of this model is its ability to output an "Error Message" when the provided function does not suit your task.
Read more about the technical details and interesting insights in our [blog post](https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html)!
Running proprietary models like the GPT, Claude, and Mistral series will require an API key, which can be supplied in `openfunctions_evaluation.py`.
If you decide to run an OSS model, the generation script uses vLLM and therefore requires a GPU for hosting and inference. If you have questions or concerns about evaluating OSS models, please reach out to us in our [discord channel](https://discord.gg/grXXvj9Whz).
### Generating LLM Responses
Use the following command for LLM inference of the evaluation dataset with specific models:
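A minimal sketch of that command, assuming the `--model` and `--test-category` flag names used elsewhere in this README (replace the uppercase placeholders with your choices):

```bash
# Generate LLM responses for the evaluation dataset.
# MODEL_NAME and TEST_CATEGORY are placeholders; see the sections below for the available values.
python openfunctions_evaluation.py --model MODEL_NAME --test-category TEST_CATEGORY
```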
For available options for `MODEL_NAME` and `TEST_CATEGORY`, please refer to the [Models Available](#models-available) and [Available Test Category](#available-test-category) section below.
If no `MODEL_NAME` is provided, the model `gorilla-openfunctions-v2` will be used by default. If no `TEST_CATEGORY` is provided, all test categories will be run by default.
### Models Available
Below is *a table of models we support* to run our leaderboard evaluation against. If a model supports function calling (FC), we follow the function calling format provided in its official documentation. Otherwise, we use a consistent system message to prompt the model to generate function calls in the right format.
|Model | Type |
|---|---|
|gorilla-openfunctions-v2 | Function Calling|
@@ -200,8 +125,85 @@ For model names with {.}, it means that the model has multiple versions. For exa
For Mistral large and small models, we provide evaluation on both of their `Any` and `Auto` settings. More information about this can be found [here](https://docs.mistral.ai/guides/function-calling/).
For `Gemini-1.0-pro`, you need to fill in `model_handler/gemini_handler.py` with your GCP project ID that has access to the Vertex AI endpoint.
For `Databrick-DBRX-instruct`, you need to create a Databricks Azure workspace and set up an endpoint for inference.
### Available Test Category
In the following two sections, the optional `--test-category` parameter can be used to specify the category of tests to run. You can specify multiple categories separated by spaces (see the example command after the list below). Available options include:
- `all`: Run all test categories.
  - This is the default option if no test category is provided.
- `ast`: Abstract Syntax Tree tests.
- `executable`: Executable code evaluation tests.
- `python`: Tests specific to Python code.
- `non-python`: Tests for code in languages other than Python, such as Java and JavaScript.
- `python-ast`: Python Abstract Syntax Tree tests.
- Individual test categories:
  - `simple`: Simple function calls.
  - `parallel_function`: Multiple function calls in parallel.
  - `multiple_function`: Multiple function calls in sequence.
  - `parallel_multiple_function`: Multiple function calls in parallel and in sequence.
  - `executable_simple`: Executable function calls.
  - `executable_parallel_function`: Executable multiple function calls in parallel.
  - `executable_multiple_function`: Executable multiple function calls in sequence.
  - `executable_parallel_multiple_function`: Executable multiple function calls in parallel and in sequence.
  - `java`: Java function calls.
  - `javascript`: JavaScript function calls.
  - `rest`: REST API function calls.
  - `relevance`: Function calls with irrelevant function documentation.
- If no test category is provided, the script will run all available test categories (same as `all`).
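As a hedged example of selecting several categories in one run (same flag-name assumptions as in the sketch above; the category names come from the list above):

```bash
# Illustrative only: generate responses for a subset of test categories.
# The same --test-category values are accepted by eval_checker/eval_runner.py.
python openfunctions_evaluation.py --model gorilla-openfunctions-v2 --test-category simple java rest
```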
> If you want to run the `all`, `executable`, or `python` category, make sure to register your REST API keys in `function_credential_config.json`. This is because the Gorilla Openfunctions Leaderboard tests each model's generated output against real-world APIs!
> If you do not wish to provide API keys for REST API testing, set `test-category` to `ast` or any non-executable category.
> If the `--api-sanity-check` flag (`-c` for short) is set and the test categories include `executable`, the evaluation process will first run a REST API sanity check to ensure that all API endpoints involved in the execution evaluation are working properly. Any endpoints that are not behaving as expected will be flagged in the console, and execution will continue.
## Evaluating the LLM Generations
### Running the Checker
Navigate to the `gorilla/berkeley-function-call-leaderboard/eval_checker` directory and run the `eval_runner.py` script with the desired parameters. The basic syntax is as follows:
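A sketch of that syntax, under the assumption that `eval_runner.py` uses the same `--model` and `--test-category` flag names described above:

```bash
cd gorilla/berkeley-function-call-leaderboard/eval_checker
# Check previously generated results for the chosen models and test categories.
python eval_runner.py --model MODEL_NAME --test-category TEST_CATEGORY
```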
For available options for `MODEL_NAME` and `TEST_CATEGORY`, please refer to the [Models Available](#models-available) and [Available Test Category](#available-test-category) section.
If no `MODEL_NAME` is provided, all available model results will be evaluated by default. If no `TEST_CATEGORY` is provided, all test categories will be run by default.
### Example Usage
If you want to run all tests for the `gorilla-openfunctions-v2` model, you can use the following command:
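A plausible form of that command, following the syntax sketched above (omitting `--test-category` falls back to running all categories, per the defaults described earlier):

```bash
# Evaluate every test category for gorilla-openfunctions-v2.
python eval_runner.py --model gorilla-openfunctions-v2
```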
Some companies have proposed optimization strategies in their models' handlers, which we (BFCL) consider unfair to other models, as those optimizations are not generalizable to all models. Therefore, we have disabled those optimizations during the evaluation process by default. You can enable them by setting the `USE_{COMPANY}_OPTIMIZATION` flag to `True` in the `model_handler/constants.py` file.
## Changelog
@@ -254,7 +256,7 @@ For inferencing `Databrick-DBRX-instruct`, you need to create a Databrick Azure
## Contributing
We welcome additions to the Function Calling Leaderboard! To add a new model, here are a few things you need to do:
1. Take a look at `model_handler/handler.py`. This is the base handler object from which all handlers inherit. Also, feel free to take a look at the existing model handlers; very likely you can re-use some of the existing code if the new model outputs in a similar format.
2. Create your handler and define the following functions: