
Commit a9dd435

[BFCL] Standardize TEST_CATEGORY Among eval_runner.py and openfunctions_evaluation.py (#506)
There are inconsistencies between the `test_category` argument used by `eval_checker/eval_runner.py` and `openfunctions_evaluation.py`. This PR partially addresses #501 and #502.
Co-authored-by: Shishir Patil <[email protected]>
1 parent 57a9fe8 commit a9dd435

File tree

6 files changed: 168 additions, 154 deletions


berkeley-function-call-leaderboard/README.md

Lines changed: 91 additions & 89 deletions
@@ -9,7 +9,7 @@
## Introduction
We introduce the Berkeley Function Leaderboard (BFCL), the **first comprehensive and executable function call evaluation dedicated to assessing Large Language Models' (LLMs) ability to invoke functions**. Unlike previous function call evaluations, BFCL accounts for various forms of function calls, diverse function calling scenarios, and their executability. Additionally, we release Gorilla-Openfunctions-v2, the most advanced open-source model to date capable of handling multiple languages, parallel function calls, and multiple function calls simultaneously. A unique debugging feature of this model is its ability to output an "Error Message" when the provided function does not suit your task.

-Read more about the technical details and interesting insights in our blog post!
+Read more about the technical details and interesting insights in our [blog post](https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html)!

![image](./architecture_diagram.png)
### Install Dependencies
@@ -48,10 +48,9 @@ python apply_function_credential_config.py
```


-## Berkeley Function-Calling Leaderboard Statistics
+## Evaluating different models on the BFCL

-
-Make sure models API keys are included in your environment variables.
+Make sure the model API keys are included in your environment variables. Running proprietary models like GPTs, Claude, Mistral-X will require them.

```bash
export OPENAI_API_KEY=sk-XXXXXX
@@ -62,96 +61,22 @@ export COHERE_API_KEY=XXXXXX
export NVIDIA_API_KEY=nvapi-XXXXXX
```

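A quick way to confirm the required keys are actually set before kicking off a run (a minimal sketch, assuming the variable names from the export block above; trim the list to the providers you plan to run):

```python
import os
import sys

# Key names taken from the export block above; adjust for the providers you use.
REQUIRED_KEYS = ["OPENAI_API_KEY", "COHERE_API_KEY", "NVIDIA_API_KEY"]

missing = [key for key in REQUIRED_KEYS if not os.environ.get(key)]
if missing:
    sys.exit(f"Missing API keys: {', '.join(missing)}")
print("All required API keys are set.")
```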
-To generate leaderboard statistics, there are two steps:
-
-1. LLM Inference of the evaluation data from specific models
-
-```bash
-python openfunctions_evaluation.py --model MODEL_NAME --test-category TEST_CATEGORY
-```
-For TEST_CATEGORY, we have `executable_simple`, `executable_parallel_function`, `executable_multiple_function`, `executable_parallel_multiple_function`, `simple`, `relevance`, `parallel_function`, `multiple_function`, `parallel_multiple_function`, `java`, `javascript`, `rest`, `sql`, `chatable`.
-
-If you want to run all evaluations at the same time, you can use `all` as the test category.
-
-Running proprietary models like GPTs, Claude, Mistral-X will require an API-Key which can be supplied in `openfunctions_evaluation.py`.
-
-If decided to run OSS model, openfunction evaluation uses vllm and therefore requires GPU for hosting and inferencing. If you have questions or concerns about evaluating OSS models, please reach out to us in our [discord channel](https://discord.gg/grXXvj9Whz).
+If you decide to run an OSS model, the generation script uses vLLM and therefore requires a GPU for hosting and inference. If you have questions or concerns about evaluating OSS models, please reach out to us in our [discord channel](https://discord.gg/grXXvj9Whz).

+### Generating LLM Responses

-
-
-## Checking the Evaluation Results
-
-### Running the Checker
-
-Navigate to the `gorilla/berkeley-function-call-leaderboard/eval_checker` directory and run the `eval_runner.py` script with the desired parameters. The basic syntax is as follows:
+Use the following command to run LLM inference on the evaluation dataset with a specific model:

```bash
-python eval_runner.py --model MODEL_NAME --test-category {TEST_CATEGORY,all,ast,executable,python,non-python}
-```
-
-- `MODEL_NAME`: Optional. The name of the model you wish to evaluate. This parameter can accept multiple model names separated by spaces. Eg, `--model gorilla-openfunctions-v2 gpt-4-0125-preview`.
-  - If no model name is provided, the script will run the checker on all models exist in the `result` folder. This path can be changed by modifying the `INPUT_PATH` variable in the `eval_runner.py` script.
-- `TEST_CATEGORY`: Optional. The category of tests to run. You can specify multiple categories separated by spaces. Available options include:
-  - `all`: Run all test categories.
-  - `ast`: Abstract Syntax Tree tests.
-  - `executable`: Executable code evaluation tests.
-  - `python`: Tests specific to Python code.
-  - `non-python`: Tests for code in languages other than Python, such as Java and JavaScript.
-  - Individual test categories:
-    - `simple`: Simple function calls.
-    - `parallel_function`: Multiple function calls in parallel.
-    - `multiple_function`: Multiple function calls in sequence.
-    - `parallel_multiple_function`: Multiple function calls in parallel and in sequence.
-    - `executable_simple`: Executable function calls.
-    - `executable_parallel_function`: Executable multiple function calls in parallel.
-    - `executable_multiple_function`: Executable multiple function calls in sequence.
-    - `executable_parallel_multiple_function`: Executable multiple function calls in parallel and in sequence.
-    - `java`: Java function calls.
-    - `javascript`: JavaScript function calls.
-    - `rest`: REST API function calls.
-    - `relevance`: Function calls with irrelevant function documentation.
-  - If no test category is provided, the script will run all available test categories.
-> If you want to run the `all` or `executable` or `python` category, make sure to register your REST API keys in `function_credential_config.json`. This is because Gorilla Openfunctions Leaderboard wants to test model's generated output on real world API!
-
-> If you do not wish to provide API keys for REST API testing, set `test-category` to `ast` or any non-executable category.
-
-> By setting the `--api-sanity-check` flag, or `-c` for short, if the test categories include `executable`, the evaluation process will perform the REST API sanity check first to ensure that all the API endpoints involved during the execution evaluation process are working properly. If any of them are not behaving as expected, we will flag those in the console and continue execution.
-
-### Example Usage
-
-If you want to run all tests for the `gorilla-openfunctions-v2` model, you can use the following command:
-
-```bash
-python eval_runner.py --model gorilla-openfunctions-v2
-
-```
-
-If you want to evaluate all offline tests (do not require RapidAPI keys) for OpenAI GPT-3.5, you can use the following command:
-
-```bash
-python eval_runner.py --model gpt-3.5-turbo-0125 --test-category ast
-```
-
-If you want to run `rest` tests for all GPT models, you can use the following command:
-
-```bash
-python eval_runner.py --model gpt-3.5-turbo-0125 gpt-4-0613 gpt-4-1106-preview gpt-4-0125-preview --test-category rest
-```
-
-If you want to run `rest` and `javascript` tests for all GPT models and `gorilla-openfunctions-v2`, you can use the following command:
-
-```bash
-python eval_runner.py --model gorilla-openfunctions-v2 gpt-3.5-turbo-0125 gpt-4-0613 gpt-4-1106-preview gpt-4-0125-preview --test-category rest javascript
+python openfunctions_evaluation.py --model MODEL_NAME --test-category TEST_CATEGORY
```

-### Model-Specific Optimization
-
-Some companies have proposed some optimization strategies in their models' handler, which we (BFCL) think is unfair to other models, as those optimizations are not generalizable to all models. Therefore, we have disabled those optimizations during the evaluation process by default. You can enable those optimizations by setting the `USE_{COMPANY}_OPTIMIZATION` flag to `True` in the `model_handler/constants.py` file.
+For available options for `MODEL_NAME` and `TEST_CATEGORY`, please refer to the [Models Available](#models-available) and [Available Test Category](#available-test-category) sections below.

+If no `MODEL_NAME` is provided, the model `gorilla-openfunctions-v2` will be used by default. If no `TEST_CATEGORY` is provided, all test categories will be run by default.

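The defaults described above can be pictured with a small `argparse` sketch. This is illustrative only; the actual flag definitions live in `openfunctions_evaluation.py` and may differ in detail:

```python
import argparse

# Illustrative sketch of the generation CLI described above, with the defaults
# stated in the README (not a copy of openfunctions_evaluation.py itself).
parser = argparse.ArgumentParser(description="Generate LLM responses for BFCL")
parser.add_argument("--model", default="gorilla-openfunctions-v2",
                    help="Model to run inference with")
parser.add_argument("--test-category", default="all",
                    help="Test category or collection name to generate for")
args = parser.parse_args()
print(f"Generating responses for {args.model} on category '{args.test_category}'")
```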
-## Models Available
-Below is *a table of models we support* to run our leaderboard evaluation against. If supported function calling (FC), we will follow its function calling format provided by official documentation. Otherwise, we will construct a system message to prompt the model to generate function calls in the right format.
+### Models Available
+Below is *a table of models we support* to run our leaderboard evaluation against. If a model supports function calling (FC), we follow the function calling format in its official documentation. Otherwise, we use a consistent system message to prompt the model to generate function calls in the right format.
|Model | Type |
|---|---|
|gorilla-openfunctions-v2 | Function Calling|
@@ -200,8 +125,85 @@ For model names with {.}, it means that the model has multiple versions. For exa
For Mistral large and small models, we provide evaluation on both of their `Any` and `Auto` settings. More information about this can be found [here](https://docs.mistral.ai/guides/function-calling/).


-For inferencing `Gemini-1.0-pro`, you need to fill in `model_handler/gemini_handler.py` with your GCP project ID that has access to Vertex AI endpoint.
-For inferencing `Databrick-DBRX-instruct`, you need to create a Databrick Azure workspace and setup an endpoint for inference.
+For `Gemini-1.0-pro`, you need to fill in `model_handler/gemini_handler.py` with your GCP project ID that has access to the Vertex AI endpoint.
+For `Databrick-DBRX-instruct`, you need to create a Databricks Azure workspace and set up an endpoint for inference.
+
+
+### Available Test Category
+In the following two sections, the optional `--test-category` parameter can be used to specify the category of tests to run. You can specify multiple categories separated by spaces. Available options include:
+
+- `all`: Run all test categories.
+  - This is the default option if no test category is provided.
+- `ast`: Abstract Syntax Tree tests.
+- `executable`: Executable code evaluation tests.
+- `python`: Tests specific to Python code.
+- `non-python`: Tests for code in languages other than Python, such as Java and JavaScript.
+- `python-ast`: Python Abstract Syntax Tree tests.
+- Individual test categories:
+  - `simple`: Simple function calls.
+  - `parallel_function`: Multiple function calls in parallel.
+  - `multiple_function`: Multiple function calls in sequence.
+  - `parallel_multiple_function`: Multiple function calls in parallel and in sequence.
+  - `executable_simple`: Executable function calls.
+  - `executable_parallel_function`: Executable multiple function calls in parallel.
+  - `executable_multiple_function`: Executable multiple function calls in sequence.
+  - `executable_parallel_multiple_function`: Executable multiple function calls in parallel and in sequence.
+  - `java`: Java function calls.
+  - `javascript`: JavaScript function calls.
+  - `rest`: REST API function calls.
+  - `relevance`: Function calls with irrelevant function documentation.
+- If no test category is provided, the script will run all available test categories (same as `all`).
+
+> If you want to run the `all`, `executable`, or `python` category, make sure to register your REST API keys in `function_credential_config.json`. This is because the Gorilla Openfunctions Leaderboard tests the model's generated output against real-world APIs!
+
+> If you do not wish to provide API keys for REST API testing, set `test-category` to `ast` or any other non-executable category.
+
+> By setting the `--api-sanity-check` flag (`-c` for short), if the test categories include `executable`, the evaluation process will first run a REST API sanity check to ensure that all API endpoints involved in execution evaluation are working properly. Any endpoints not behaving as expected are flagged in the console, and execution continues.
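To make the collection names above concrete: this commit adds a shared `TEST_COLLECTION_MAPPING` constant in `eval_checker/eval_checker_constant.py` (shown further down), which `eval_runner.py` uses to expand a collection into its individual categories. A minimal sketch of that expansion, using a subset of the mapping:

```python
# Minimal sketch of how a collection name expands into individual test
# categories; the values mirror a subset of TEST_COLLECTION_MAPPING.
TEST_COLLECTION_MAPPING = {
    "ast": ["simple", "multiple_function", "parallel_function",
            "parallel_multiple_function", "java", "javascript", "relevance"],
    "non-python": ["java", "javascript"],
}


def expand_categories(requested: list[str]) -> list[str]:
    expanded = []
    for category in requested:
        # Collection names expand to their members; plain categories pass through.
        expanded.extend(TEST_COLLECTION_MAPPING.get(category, [category]))
    return expanded


print(expand_categories(["non-python", "rest"]))
# ['java', 'javascript', 'rest']
```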
+
+
+## Evaluating the LLM generations
+
+### Running the Checker
+
+Navigate to the `gorilla/berkeley-function-call-leaderboard/eval_checker` directory and run the `eval_runner.py` script with the desired parameters. The basic syntax is as follows:
+
+```bash
+python eval_runner.py --model MODEL_NAME --test-category {TEST_CATEGORY,all,ast,executable,python,non-python}
+```
+
+For available options for `MODEL_NAME` and `TEST_CATEGORY`, please refer to the [Models Available](#models-available) and [Available Test Category](#available-test-category) sections.
+
+If no `MODEL_NAME` is provided, all available model results will be evaluated by default. If no `TEST_CATEGORY` is provided, all test categories will be run by default.
+
+### Example Usage
+
+If you want to run all tests for the `gorilla-openfunctions-v2` model, you can use the following command:
+
+```bash
+python eval_runner.py --model gorilla-openfunctions-v2
+```
+
+If you want to evaluate all offline tests (those that do not require RapidAPI keys) for OpenAI GPT-3.5, you can use the following command:
+
+```bash
+python eval_runner.py --model gpt-3.5-turbo-0125 --test-category ast
+```
+
+If you want to run `rest` tests for a few Claude models, you can use the following command:
+
+```bash
+python eval_runner.py --model claude-3-5-sonnet-20240620 claude-3-opus-20240229 claude-3-sonnet-20240229 --test-category rest
+```
+
+If you want to run `rest` and `javascript` tests for a few models and `gorilla-openfunctions-v2`, you can use the following command:
+
+```bash
+python eval_runner.py --model gorilla-openfunctions-v2 claude-3-5-sonnet-20240620 gpt-4-0125-preview gemini-1.5-pro-preview-0514 --test-category rest javascript
+```
+
+### Model-Specific Optimization
+
+Some companies have proposed optimization strategies in their models' handlers that we (BFCL) consider unfair to other models, since those optimizations are not generalizable. Therefore, we disable them during the evaluation process by default. You can enable them by setting the `USE_{COMPANY}_OPTIMIZATION` flag to `True` in the `model_handler/constants.py` file.
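As a rough illustration of how such a switch might gate vendor-specific behavior (the `USE_{COMPANY}_OPTIMIZATION` naming comes from the paragraph above; the concrete constant and function below are hypothetical, not the repository's actual code):

```python
# Hypothetical sketch only: the names here are placeholders, not the actual
# constants defined in model_handler/constants.py.
USE_EXAMPLECO_OPTIMIZATION = False  # disabled by default so all models are compared equally


def build_system_message(functions: list[dict]) -> str:
    message = f"You have access to the following functions: {functions}"
    if USE_EXAMPLECO_OPTIMIZATION:
        # Vendor-recommended prompt tweak, applied only when explicitly enabled.
        message += "\nAlways respond with a single JSON function call."
    return message
```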


## Changelog
@@ -254,7 +256,7 @@ For inferencing `Databrick-DBRX-instruct`, you need to create a Databrick Azure

## Contributing

-To add a new model to the Function Calling Leaderboard, here are a few things you need to do:
+We welcome additions to the Function Calling Leaderboard! To add a new model, here are a few things you need to do:

1. Take a look at `model_handler/handler.py`. This is the base handler object from which all handlers inherit. Also, feel free to take a look at the existing model handlers; very likely you can reuse some of the existing code if the new model outputs in a similar format.
2. Create your handler and define the following functions

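The list of functions to implement continues in the full README and in `model_handler/handler.py` itself; it is not reproduced in this diff. Purely as an illustrative sketch of the overall shape of a new handler (the method names below are assumptions, not the actual required interface):

```python
# Illustrative sketch only: a new-model handler subclasses the base handler in
# model_handler/handler.py. The method names here are placeholders; consult
# handler.py and the existing handlers for the functions you actually need.
class MyModelHandler:  # in the repo this would inherit from the base handler class
    def __init__(self, model_name: str, temperature: float = 0.0):
        self.model_name = model_name
        self.temperature = temperature

    def inference(self, prompt: str, functions: list[dict]) -> str:
        """Call the model (API or local vLLM server) and return its raw output."""
        raise NotImplementedError

    def decode(self, raw_output: str) -> list[dict]:
        """Convert raw model output into the leaderboard's function-call format."""
        raise NotImplementedError
```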
berkeley-function-call-leaderboard/apply_function_credential_config.py

Lines changed: 1 addition & 1 deletion
@@ -62,7 +62,7 @@ def process_file(input_file_path, output_file_path):
    with open(output_file_path, "w") as f:
        for i, modified_line in enumerate(modified_data):
            f.write(modified_line)
-            if i < len(data) - 1:
+            if i < len(modified_data) - 1:
                f.write("\n")

    print(f"All placeholders have been replaced for {input_file_path} 🦍.")
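The one-line fix above indexes the list actually being iterated (`modified_data`) rather than the original `data`, so the trailing-newline check cannot drift out of sync with what is written. A minimal standalone sketch of the same write-without-trailing-newline pattern (not the repository's function):

```python
# Standalone sketch of the corrected pattern: the length check must use the
# same list that is being iterated and written.
def write_lines(path: str, lines: list[str]) -> None:
    with open(path, "w") as f:
        for i, line in enumerate(lines):
            f.write(line)
            if i < len(lines) - 1:  # newline between lines, none after the last
                f.write("\n")


# An equivalent, simpler formulation:
def write_lines_joined(path: str, lines: list[str]) -> None:
    with open(path, "w") as f:
        f.write("\n".join(lines))
```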

berkeley-function-call-leaderboard/eval_checker/eval_checker_constant.py

Lines changed: 56 additions & 0 deletions
@@ -16,3 +16,59 @@
    "executable_parallel_multiple_function": (1760, 1799),
    "multiple_function": (1800, 1999),
}
+
+TEST_COLLECTION_MAPPING = {
+    "ast": [
+        "simple",
+        "multiple_function",
+        "parallel_function",
+        "parallel_multiple_function",
+        "java",
+        "javascript",
+        "relevance",
+    ],
+    "executable": [
+        "executable_simple",
+        "executable_multiple_function",
+        "executable_parallel_function",
+        "executable_parallel_multiple_function",
+        "rest",
+    ],
+    "all": [
+        "simple",
+        "multiple_function",
+        "parallel_function",
+        "parallel_multiple_function",
+        "java",
+        "javascript",
+        "relevance",
+        "executable_simple",
+        "executable_multiple_function",
+        "executable_parallel_function",
+        "executable_parallel_multiple_function",
+        "rest",
+    ],
+    "non-python": [
+        "java",
+        "javascript",
+    ],
+    "python": [
+        "simple",
+        "multiple_function",
+        "parallel_function",
+        "parallel_multiple_function",
+        "relevance",
+        "executable_simple",
+        "executable_multiple_function",
+        "executable_parallel_function",
+        "executable_parallel_multiple_function",
+        "rest",
+    ],
+    "python-ast": [
+        "simple",
+        "multiple_function",
+        "parallel_function",
+        "parallel_multiple_function",
+        "relevance",
+    ],
+}
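One property worth noting in the mapping above: `all` is exactly the union of the `ast` and `executable` collections. A quick standalone check (run from within the `eval_checker` directory so the module is importable):

```python
# Quick consistency check on the mapping added above.
from eval_checker_constant import TEST_COLLECTION_MAPPING

combined = set(TEST_COLLECTION_MAPPING["ast"]) | set(TEST_COLLECTION_MAPPING["executable"])
assert combined == set(TEST_COLLECTION_MAPPING["all"])
print("'all' covers exactly the 'ast' and 'executable' collections.")
```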

berkeley-function-call-leaderboard/eval_checker/eval_runner.py

Lines changed: 3 additions & 52 deletions
@@ -5,6 +5,7 @@
from checker import ast_checker, exec_checker, executable_checker_rest
from custom_exception import BadAPIStatusError
from eval_runner_helper import *
+from eval_checker_constant import TEST_COLLECTION_MAPPING
from tqdm import tqdm
import argparse

@@ -430,56 +431,6 @@ def runner(model_names, test_categories, api_sanity_check):
    print(f"🏁 Evaluation completed. See {os.path.abspath(OUTPUT_PATH + 'data.csv')} for evaluation results.")


-ARG_PARSE_MAPPING = {
-    "ast": [
-        "simple",
-        "multiple_function",
-        "parallel_function",
-        "parallel_multiple_function",
-        "java",
-        "javascript",
-        "relevance",
-    ],
-    "executable": [
-        "executable_simple",
-        "executable_multiple_function",
-        "executable_parallel_function",
-        "executable_parallel_multiple_function",
-        "rest",
-    ],
-    "all": [
-        "simple",
-        "multiple_function",
-        "parallel_function",
-        "parallel_multiple_function",
-        "java",
-        "javascript",
-        "relevance",
-        "executable_simple",
-        "executable_multiple_function",
-        "executable_parallel_function",
-        "executable_parallel_multiple_function",
-        "rest",
-    ],
-    "non-python": [
-        "java",
-        "javascript",
-    ],
-    "python": [
-        "simple",
-        "multiple_function",
-        "parallel_function",
-        "parallel_multiple_function",
-        "relevance",
-        "executable_simple",
-        "executable_multiple_function",
-        "executable_parallel_function",
-        "executable_parallel_multiple_function",
-        "rest",
-    ],
-}
-
-
INPUT_PATH = "../result/"
PROMPT_PATH = "../data/"
POSSIBLE_ANSWER_PATH = "../data/possible_answer/"
@@ -518,8 +469,8 @@ def runner(model_names, test_categories, api_sanity_check):
    if args.test_category is not None:
        test_categories = []
        for test_category in args.test_category:
-            if test_category in ARG_PARSE_MAPPING:
-                test_categories.extend(ARG_PARSE_MAPPING[test_category])
+            if test_category in TEST_COLLECTION_MAPPING:
+                test_categories.extend(TEST_COLLECTION_MAPPING[test_category])
            else:
                test_categories.append(test_category)

berkeley-function-call-leaderboard/eval_checker/eval_runner_helper.py

Lines changed: 1 addition & 1 deletion
@@ -621,7 +621,7 @@ def api_status_sanity_check_rest():
    ground_truth_dummy = load_file(REST_API_GROUND_TRUTH_FILE_PATH)

    # Use the ground truth data to make sure the API is working correctly
-    command = f"cd .. ; python apply_function_credential_config.py --input-file ./eval_checker/{REST_API_GROUND_TRUTH_FILE_PATH};"
+    command = f"cd .. ; python apply_function_credential_config.py --input-path ./eval_checker/{REST_API_GROUND_TRUTH_FILE_PATH};"
    try:
        subprocess.run(command, shell=True, capture_output=True, text=True, check=True)
    except subprocess.CalledProcessError as e:
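The flag rename above keeps the shell command in sync with the option that `apply_function_credential_config.py` now expects (`--input-path`). For reference, an equivalent invocation that avoids `shell=True` and the embedded `cd` might look like the sketch below; this is an illustrative alternative, not what the repository does, and the file path shown is a placeholder:

```python
import subprocess

# Placeholder path; the real value comes from REST_API_GROUND_TRUTH_FILE_PATH.
rest_api_ground_truth_file = "rest-eval-response_v5.jsonl"

# Pass arguments as a list and set the working directory explicitly instead of
# building a "cd .. ; ..." shell string.
subprocess.run(
    [
        "python",
        "apply_function_credential_config.py",
        "--input-path",
        f"./eval_checker/{rest_api_ground_truth_file}",
    ],
    cwd="..",
    capture_output=True,
    text=True,
    check=True,
)
```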
