[BFCL] Standardize TEST_CATEGORY Among eval_runner.py and openfunctions_evaluation.py #506

Merged: 18 commits, Jul 19, 2024
Changes from 11 commits
78 changes: 39 additions & 39 deletions berkeley-function-call-leaderboard/README.md
@@ -52,10 +52,41 @@ Then, use `eval_data_compilation.py` to compile all files by
```bash
python eval_data_compilation.py
```
## Berkeley Function-Calling Leaderboard Statistics

## Available Test Categories
In the following two sections, the optional `--test-category` parameter specifies which categories of tests to run. You can specify multiple categories separated by spaces (see the combined example after the notes below). Available options include:

- `all`: Run all test categories.
- `ast`: Abstract Syntax Tree tests.
- `executable`: Executable code evaluation tests.
- `python`: Tests specific to Python code.
- `non-python`: Tests for code in languages other than Python, such as Java and JavaScript.
- `python-ast`: Python Abstract Syntax Tree tests.
- Individual test categories:
- `simple`: Simple function calls.
- `parallel_function`: Multiple function calls in parallel.
- `multiple_function`: Multiple function calls in sequence.
- `parallel_multiple_function`: Multiple function calls in parallel and in sequence.
- `executable_simple`: Executable function calls.
- `executable_parallel_function`: Executable multiple function calls in parallel.
- `executable_multiple_function`: Executable multiple function calls in sequence.
- `executable_parallel_multiple_function`: Executable multiple function calls in parallel and in sequence.
- `java`: Java function calls.
- `javascript`: JavaScript function calls.
- `rest`: REST API function calls.
- `relevance`: Function calls with irrelevant function documentation.
- If no test category is provided, the script will run all available test categories (equivalent to `all`).

> If you want to run the `all`, `executable`, or `python` category, make sure to register your REST API keys in `function_credential_config.json`, since the Gorilla Openfunctions Leaderboard tests the model's generated output against real-world APIs!

> If you do not wish to provide API keys for REST API testing, set `test-category` to `ast` or any non-executable category.

> If the `--api-sanity-check` flag (`-c` for short) is set and the test categories include `executable`, the evaluation process first performs a REST API sanity check to ensure that all API endpoints involved in the execution evaluation are working properly. Any endpoints that are not behaving as expected are flagged in the console, and execution continues.
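
As a quick illustration of how a collection name and an individual category can be combined when running the checker (described in the "Checking the Evaluation Results" section below), here is a minimal sketch; `gorilla-openfunctions-v2` is just an example model name, and the `rest` category requires the REST API keys mentioned above:

```bash
# Run from the eval_checker directory. "ast" expands to all AST-based
# categories, while "rest" is an individual executable category.
python eval_runner.py --model gorilla-openfunctions-v2 --test-category ast rest
```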

Make sure the model API keys are included in your environment variables.

## Model Result Generation

Make sure the model API keys are included in your environment variables.

```bash
export OPENAI_API_KEY=sk-XXXXXX
@@ -73,17 +104,13 @@ To generate leaderboard statistics, there are two steps:
```bash
python openfunctions_evaluation.py --model MODEL_NAME --test-category TEST_CATEGORY
```
For TEST_CATEGORY, we have `executable_simple`, `executable_parallel_function`, `executable_multiple_function`, `executable_parallel_multiple_function`, `simple`, `relevance`, `parallel_function`, `multiple_function`, `parallel_multiple_function`, `java`, `javascript`, `rest`, `sql`, `chatable`.

If you want to run all evaluations at the same time, you can use `all` as the test category.
For available options for `MODEL_NAME` and `TEST_CATEGORY`, please refer to the [Models Available](#models-available) and [Available Test Categories](#available-test-categories) sections.

Running proprietary models such as the GPT, Claude, and Mistral series requires an API key, which can be supplied in `openfunctions_evaluation.py`.

If you decide to run an OSS model, the OpenFunctions evaluation uses vLLM and therefore requires a GPU for hosting and inference. If you have questions or concerns about evaluating OSS models, please reach out to us in our [Discord channel](https://discord.gg/grXXvj9Whz).




## Checking the Evaluation Results

### Running the Checker
@@ -94,41 +121,14 @@ Navigate to the `gorilla/berkeley-function-call-leaderboard/eval_checker` directory.
python eval_runner.py --model MODEL_NAME --test-category {TEST_CATEGORY,all,ast,executable,python,non-python}
```

- `MODEL_NAME`: Optional. The name of the model you wish to evaluate. This parameter can accept multiple model names separated by spaces, e.g. `--model gorilla-openfunctions-v2 gpt-4-0125-preview`.
- If no model name is provided, the script will run the checker on all models that exist in the `result` folder. This path can be changed by modifying the `INPUT_PATH` variable in the `eval_runner.py` script.
- `TEST_CATEGORY`: Optional. The category of tests to run. You can specify multiple categories separated by spaces. Available options include:
- `all`: Run all test categories.
- `ast`: Abstract Syntax Tree tests.
- `executable`: Executable code evaluation tests.
- `python`: Tests specific to Python code.
- `non-python`: Tests for code in languages other than Python, such as Java and JavaScript.
- Individual test categories:
- `simple`: Simple function calls.
- `parallel_function`: Multiple function calls in parallel.
- `multiple_function`: Multiple function calls in sequence.
- `parallel_multiple_function`: Multiple function calls in parallel and in sequence.
- `executable_simple`: Executable function calls.
- `executable_parallel_function`: Executable multiple function calls in parallel.
- `executable_multiple_function`: Executable multiple function calls in sequence.
- `executable_parallel_multiple_function`: Executable multiple function calls in parallel and in sequence.
- `java`: Java function calls.
- `javascript`: JavaScript function calls.
- `rest`: REST API function calls.
- `relevance`: Function calls with irrelevant function documentation.
- If no test category is provided, the script will run all available test categories.
> If you want to run the `all` or `executable` or `python` category, make sure to register your REST API keys in `function_credential_config.json`. This is because Gorilla Openfunctions Leaderboard wants to test model's generated output on real world API!

> If you do not wish to provide API keys for REST API testing, set `test-category` to `ast` or any non-executable category.

> By setting the `--api-sanity-check` flag, or `-c` for short, if the test categories include `executable`, the evaluation process will perform the REST API sanity check first to ensure that all the API endpoints involved during the execution evaluation process are working properly. If any of them are not behaving as expected, we will flag those in the console and continue execution.
For available options for `MODEL_NAME` and `TEST_CATEGORY`, please refer to the [Models Available](#models-available) and [Available Test Categories](#available-test-categories) sections.

### Example Usage

If you want to run all tests for the `gorilla-openfunctions-v2` model, you can use the following command:

```bash
python eval_runner.py --model gorilla-openfunctions-v2
```

If you want to evaluate all offline tests (those that do not require RapidAPI keys) for OpenAI GPT-3.5, you can use the following command:
@@ -137,16 +137,16 @@
python eval_runner.py --model gpt-3.5-turbo-0125 --test-category ast
```

If you want to run `rest` tests for all GPT models, you can use the following command:
If you want to run `rest` tests for a few Claude models, you can use the following command:

```bash
python eval_runner.py --model gpt-3.5-turbo-0125 gpt-4-0613 gpt-4-1106-preview gpt-4-0125-preview --test-category rest
python eval_runner.py --model claude-3-5-sonnet-20240620 claude-3-opus-20240229 claude-3-sonnet-20240229 --test-category rest
```

If you want to run `rest` and `javascript` tests for all GPT models and `gorilla-openfunctions-v2`, you can use the following command:
If you want to run `rest` and `javascript` tests for a few models and `gorilla-openfunctions-v2`, you can use the following command:

```bash
python eval_runner.py --model gorilla-openfunctions-v2 gpt-3.5-turbo-0125 gpt-4-0613 gpt-4-1106-preview gpt-4-0125-preview --test-category rest javascript
python eval_runner.py --model gorilla-openfunctions-v2 claude-3-5-sonnet-20240620 gpt-4-0125-preview gemini-1.5-pro-preview-0514 --test-category rest javascript
```

### Model-Specific Optimization
berkeley-function-call-leaderboard/eval_checker/eval_checker_constant.py
@@ -16,3 +16,59 @@
"executable_parallel_multiple_function": (1760, 1799),
"multiple_function": (1800, 1999),
}

TEST_COLLECTION_MAPPING = {
"ast": [
"simple",
"multiple_function",
"parallel_function",
"parallel_multiple_function",
"java",
"javascript",
"relevance",
],
"executable": [
"executable_simple",
"executable_multiple_function",
"executable_parallel_function",
"executable_parallel_multiple_function",
"rest",
],
"all": [
"simple",
"multiple_function",
"parallel_function",
"parallel_multiple_function",
"java",
"javascript",
"relevance",
"executable_simple",
"executable_multiple_function",
"executable_parallel_function",
"executable_parallel_multiple_function",
"rest",
],
"non-python": [
"java",
"javascript",
],
"python": [
"simple",
"multiple_function",
"parallel_function",
"parallel_multiple_function",
"relevance",
"executable_simple",
"executable_multiple_function",
"executable_parallel_function",
"executable_parallel_multiple_function",
"rest",
],
"python-ast": [
"simple",
"multiple_function",
"parallel_function",
"parallel_multiple_function",
"relevance",
],
}
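
To make the shared constant's intent concrete, here is a small sketch (not part of this PR; it assumes it is run from inside the `eval_checker` directory so the module can be imported directly) checking that the collections compose as expected. This is the same mapping that both `eval_runner.py` and `openfunctions_evaluation.py` now consult.

```python
# Quick sanity check of the shared mapping (illustrative sketch only).
from eval_checker_constant import TEST_COLLECTION_MAPPING

# "all" is exactly the AST collection plus the executable collection.
assert set(TEST_COLLECTION_MAPPING["all"]) == set(
    TEST_COLLECTION_MAPPING["ast"] + TEST_COLLECTION_MAPPING["executable"]
)

# Collection names map to lists of individual category names.
print(TEST_COLLECTION_MAPPING["non-python"])  # ['java', 'javascript']
```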
55 changes: 3 additions & 52 deletions berkeley-function-call-leaderboard/eval_checker/eval_runner.py
@@ -5,6 +5,7 @@
from checker import ast_checker, exec_checker, executable_checker_rest
from custom_exception import BadAPIStatusError
from eval_runner_helper import *
from eval_checker_constant import TEST_COLLECTION_MAPPING
from tqdm import tqdm
import argparse

@@ -430,56 +431,6 @@ def runner(model_names, test_categories, api_sanity_check):
print(f"🏁 Evaluation completed. See {os.path.abspath(OUTPUT_PATH + 'data.csv')} for evaluation results.")


ARG_PARSE_MAPPING = {
"ast": [
"simple",
"multiple_function",
"parallel_function",
"parallel_multiple_function",
"java",
"javascript",
"relevance",
],
"executable": [
"executable_simple",
"executable_multiple_function",
"executable_parallel_function",
"executable_parallel_multiple_function",
"rest",
],
"all": [
"simple",
"multiple_function",
"parallel_function",
"parallel_multiple_function",
"java",
"javascript",
"relevance",
"executable_simple",
"executable_multiple_function",
"executable_parallel_function",
"executable_parallel_multiple_function",
"rest",
],
"non-python": [
"java",
"javascript",
],
"python": [
"simple",
"multiple_function",
"parallel_function",
"parallel_multiple_function",
"relevance",
"executable_simple",
"executable_multiple_function",
"executable_parallel_function",
"executable_parallel_multiple_function",
"rest",
],
}


INPUT_PATH = "../result/"
PROMPT_PATH = "../data/"
POSSIBLE_ANSWER_PATH = "../data/possible_answer/"
@@ -518,8 +469,8 @@ def runner(model_names, test_categories, api_sanity_check):
if args.test_category is not None:
test_categories = []
for test_category in args.test_category:
if test_category in ARG_PARSE_MAPPING:
test_categories.extend(ARG_PARSE_MAPPING[test_category])
if test_category in TEST_COLLECTION_MAPPING:
test_categories.extend(TEST_COLLECTION_MAPPING[test_category])
else:
test_categories.append(test_category)

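For reference, here is a minimal sketch (a hypothetical standalone snippet, not code from this PR) of what the expansion above produces for a mixed `--test-category` value; it assumes the same working directory as `eval_runner.py` so the constant imports directly:

```python
# Mimics the --test-category expansion loop in eval_runner.py.
from eval_checker_constant import TEST_COLLECTION_MAPPING

requested = ["non-python", "rest"]  # e.g. --test-category non-python rest

test_categories = []
for test_category in requested:
    if test_category in TEST_COLLECTION_MAPPING:
        # Collection names expand into their member categories.
        test_categories.extend(TEST_COLLECTION_MAPPING[test_category])
    else:
        # Individual category names are kept as-is.
        test_categories.append(test_category)

print(test_categories)  # ['java', 'javascript', 'rest']
```
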
27 changes: 16 additions & 11 deletions berkeley-function-call-leaderboard/openfunctions_evaluation.py
@@ -3,7 +3,7 @@
from model_handler.handler_map import handler_map
from model_handler.model_style import ModelStyle
from model_handler.constant import USE_COHERE_OPTIMIZATION

from eval_checker.eval_checker_constant import TEST_COLLECTION_MAPPING

def get_args():
parser = argparse.ArgumentParser()
@@ -23,7 +23,7 @@ def get_args():
return args


test_categories = {
TEST_FILE_MAPPING = {
"executable_simple": "gorilla_openfunctions_v1_test_executable_simple.json",
"executable_parallel_function": "gorilla_openfunctions_v1_test_executable_parallel_function.json",
"executable_multiple_function": "gorilla_openfunctions_v1_test_executable_multiple_function.json",
@@ -45,14 +45,19 @@ def build_handler(model_name, temperature, top_p, max_tokens):
return handler


def load_file(test_category):
if test_category == "all":
test_cate, files_to_open = list(test_categories.keys()), list(
test_categories.values()
)
def load_file(test_categories):
test_to_run = []
files_to_open = []

if test_categories in TEST_COLLECTION_MAPPING:
test_to_run = TEST_COLLECTION_MAPPING[test_categories]
for test_name in test_to_run:
files_to_open.append(TEST_FILE_MAPPING[test_name])
else:
test_cate, files_to_open = [test_category], [test_categories[test_category]]
return test_cate, files_to_open
test_to_run.append(test_categories)
files_to_open.append(TEST_FILE_MAPPING[test_categories])

return test_to_run, files_to_open


if __name__ == "__main__":
@@ -69,8 +74,8 @@ def load_file(test_category):
for res in result[0]:
handler.write(res, "result.json")
else:
test_cate, files_to_open = load_file(args.test_category)
for test_category, file_to_open in zip(test_cate, files_to_open):
test_to_run, files_to_open = load_file(args.test_category)
for test_category, file_to_open in zip(test_to_run, files_to_open):
print("Generating: " + file_to_open)
test_cases = []
with open("./data/" + file_to_open) as f:
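
Finally, a hedged sketch of how the reworked `load_file` behaves (not code from this PR; it assumes it is run from the `berkeley-function-call-leaderboard` directory and that the repository's dependencies, such as the model handlers imported at module level, are installed):

```python
# load_file now accepts either a collection name or an individual category.
from openfunctions_evaluation import load_file

# A collection name expands into its member categories and their test files.
tests, files = load_file("executable")
print(tests[0], files[0])
# executable_simple gorilla_openfunctions_v1_test_executable_simple.json

# An individual category comes back as a singleton pair.
print(load_file("executable_simple"))
# (['executable_simple'], ['gorilla_openfunctions_v1_test_executable_simple.json'])
```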