[BFCL] Standardize TEST_CATEGORY Among eval_runner.py and openfunctions_evaluation.py #506

Merged: 18 commits, Jul 19, 2024
Changes from 11 commits
78 changes: 39 additions & 39 deletions berkeley-function-call-leaderboard/README.md
@@ -52,10 +52,41 @@ Then, use `eval_data_compilation.py` to compile all files by
```bash
python eval_data_compilation.py
```
## Berkeley Function-Calling Leaderboard Statistics

## Available Test Categories
In the following two sections, the optional `--test-category` parameter specifies which categories of tests to run. You can specify multiple categories separated by spaces (see the combined example after the notes below). Available options include:

- `all`: Run all test categories.
- `ast`: Abstract Syntax Tree tests.
- `executable`: Executable code evaluation tests.
- `python`: Tests specific to Python code.
- `non-python`: Tests for code in languages other than Python, such as Java and JavaScript.
- `python-ast`: Python Abstract Syntax Tree tests.
- Individual test categories:
- `simple`: Simple function calls.
- `parallel_function`: Multiple function calls in parallel.
- `multiple_function`: Multiple function calls in sequence.
- `parallel_multiple_function`: Multiple function calls in parallel and in sequence.
- `executable_simple`: Executable function calls.
- `executable_parallel_function`: Executable multiple function calls in parallel.
- `executable_multiple_function`: Executable multiple function calls in sequence.
- `executable_parallel_multiple_function`: Executable multiple function calls in parallel and in sequence.
- `java`: Java function calls.
- `javascript`: JavaScript function calls.
- `rest`: REST API function calls.
- `relevance`: Function calls with irrelevant function documentation.
- If no test category is provided, the script will run all available test categories (equivalent to `all`).

> If you want to run the `all`, `executable`, or `python` category, make sure to register your REST API keys in `function_credential_config.json`, since the Gorilla Openfunctions Leaderboard tests the model's generated output against real-world APIs!

> If you do not wish to provide API keys for REST API testing, set `test-category` to `ast` or any non-executable category.

> If the `--api-sanity-check` flag (`-c` for short) is set and the test categories include `executable`, the evaluation process first performs a REST API sanity check to ensure that all API endpoints involved in the execution evaluation are working properly. Any endpoints that are not behaving as expected are flagged in the console, and execution continues.
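
As a quick illustration of how a collection name and an individual category can be combined when running the checker (described in the "Checking the Evaluation Results" section below), here is a minimal sketch; `gorilla-openfunctions-v2` is just an example model name, and the `rest` category requires the REST API keys mentioned above:

```bash
# Run from the eval_checker directory. "ast" expands to all AST-based
# categories, while "rest" is an individual executable category.
python eval_runner.py --model gorilla-openfunctions-v2 --test-category ast rest
```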

Make sure the model API keys are included in your environment variables.

## Model Result Generation

Make sure the model API keys are included in your environment variables.

```bash
export OPENAI_API_KEY=sk-XXXXXX
@@ -73,17 +104,13 @@ To generate leaderboard statistics, there are two steps:
```bash
python openfunctions_evaluation.py --model MODEL_NAME --test-category TEST_CATEGORY
```
For TEST_CATEGORY, we have `executable_simple`, `executable_parallel_function`, `executable_multiple_function`, `executable_parallel_multiple_function`, `simple`, `relevance`, `parallel_function`, `multiple_function`, `parallel_multiple_function`, `java`, `javascript`, `rest`, `sql`, `chatable`.

If you want to run all evaluations at the same time, you can use `all` as the test category.
For available options for `MODEL_NAME` and `TEST_CATEGORY`, please refer to the [Models Available](#models-available) and [Available Test Categories](#available-test-categories) sections.

Running proprietary models such as the GPT, Claude, and Mistral series requires an API key, which can be supplied in `openfunctions_evaluation.py`.

If you decide to run an OSS model, the OpenFunctions evaluation uses vLLM and therefore requires a GPU for hosting and inference. If you have questions or concerns about evaluating OSS models, please reach out to us in our [Discord channel](https://discord.gg/grXXvj9Whz).




## Checking the Evaluation Results

### Running the Checker
@@ -94,41 +121,14 @@ Navigate to the `gorilla/berkeley-function-call-leaderboard/eval_checker` directory.
python eval_runner.py --model MODEL_NAME --test-category {TEST_CATEGORY,all,ast,executable,python,non-python}
```

- `MODEL_NAME`: Optional. The name of the model you wish to evaluate. This parameter can accept multiple model names separated by spaces, e.g. `--model gorilla-openfunctions-v2 gpt-4-0125-preview`.
- If no model name is provided, the script will run the checker on all models that exist in the `result` folder. This path can be changed by modifying the `INPUT_PATH` variable in the `eval_runner.py` script.
- `TEST_CATEGORY`: Optional. The category of tests to run. You can specify multiple categories separated by spaces. Available options include:
- `all`: Run all test categories.
- `ast`: Abstract Syntax Tree tests.
- `executable`: Executable code evaluation tests.
- `python`: Tests specific to Python code.
- `non-python`: Tests for code in languages other than Python, such as Java and JavaScript.
- Individual test categories:
- `simple`: Simple function calls.
- `parallel_function`: Multiple function calls in parallel.
- `multiple_function`: Multiple function calls in sequence.
- `parallel_multiple_function`: Multiple function calls in parallel and in sequence.
- `executable_simple`: Executable function calls.
- `executable_parallel_function`: Executable multiple function calls in parallel.
- `executable_multiple_function`: Executable multiple function calls in sequence.
- `executable_parallel_multiple_function`: Executable multiple function calls in parallel and in sequence.
- `java`: Java function calls.
- `javascript`: JavaScript function calls.
- `rest`: REST API function calls.
- `relevance`: Function calls with irrelevant function documentation.
- If no test category is provided, the script will run all available test categories.
> If you want to run the `all` or `executable` or `python` category, make sure to register your REST API keys in `function_credential_config.json`. This is because Gorilla Openfunctions Leaderboard wants to test model's generated output on real world API!

> If you do not wish to provide API keys for REST API testing, set `test-category` to `ast` or any non-executable category.

> By setting the `--api-sanity-check` flag, or `-c` for short, if the test categories include `executable`, the evaluation process will perform the REST API sanity check first to ensure that all the API endpoints involved during the execution evaluation process are working properly. If any of them are not behaving as expected, we will flag those in the console and continue execution.
For available options for `MODEL_NAME` and `TEST_CATEGORY`, please refer to the [Models Available](#models-available) and [Available Test Categories](#available-test-categories) sections.

### Example Usage

If you want to run all tests for the `gorilla-openfunctions-v2` model, you can use the following command:

```bash
python eval_runner.py --model gorilla-openfunctions-v2
```

If you want to evaluate all offline tests (those that do not require RapidAPI keys) for OpenAI GPT-3.5, you can use the following command:
@@ -137,16 +137,16 @@
python eval_runner.py --model gpt-3.5-turbo-0125 --test-category ast
```

If you want to run `rest` tests for all GPT models, you can use the following command:
If you want to run `rest` tests for a few Claude models, you can use the following command:

```bash
python eval_runner.py --model gpt-3.5-turbo-0125 gpt-4-0613 gpt-4-1106-preview gpt-4-0125-preview --test-category rest
python eval_runner.py --model claude-3-5-sonnet-20240620 claude-3-opus-20240229 claude-3-sonnet-20240229 --test-category rest
```

If you want to run `rest` and `javascript` tests for all GPT models and `gorilla-openfunctions-v2`, you can use the following command:
If you want to run `rest` and `javascript` tests for a few models and `gorilla-openfunctions-v2`, you can use the following command:

```bash
python eval_runner.py --model gorilla-openfunctions-v2 gpt-3.5-turbo-0125 gpt-4-0613 gpt-4-1106-preview gpt-4-0125-preview --test-category rest javascript
python eval_runner.py --model gorilla-openfunctions-v2 claude-3-5-sonnet-20240620 gpt-4-0125-preview gemini-1.5-pro-preview-0514 --test-category rest javascript
```

### Model-Specific Optimization
berkeley-function-call-leaderboard/eval_checker/eval_checker_constant.py
@@ -16,3 +16,59 @@
"executable_parallel_multiple_function": (1760, 1799),
"multiple_function": (1800, 1999),
}

TEST_COLLECTION_MAPPING = {
"ast": [
"simple",
"multiple_function",
"parallel_function",
"parallel_multiple_function",
"java",
"javascript",
"relevance",
],
"executable": [
"executable_simple",
"executable_multiple_function",
"executable_parallel_function",
"executable_parallel_multiple_function",
"rest",
],
"all": [
"simple",
"multiple_function",
"parallel_function",
"parallel_multiple_function",
"java",
"javascript",
"relevance",
"executable_simple",
"executable_multiple_function",
"executable_parallel_function",
"executable_parallel_multiple_function",
"rest",
],
"non-python": [
"java",
"javascript",
],
"python": [
"simple",
"multiple_function",
"parallel_function",
"parallel_multiple_function",
"relevance",
"executable_simple",
"executable_multiple_function",
"executable_parallel_function",
"executable_parallel_multiple_function",
"rest",
],
"python-ast": [
"simple",
"multiple_function",
"parallel_function",
"parallel_multiple_function",
"relevance",
],
}
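
To make the shared constant's intent concrete, here is a small sketch (not part of this PR; it assumes it is run from inside the `eval_checker` directory so the module can be imported directly) checking that the collections compose as expected. This is the same mapping that both `eval_runner.py` and `openfunctions_evaluation.py` now consult.

```python
# Quick sanity check of the shared mapping (illustrative sketch only).
from eval_checker_constant import TEST_COLLECTION_MAPPING

# "all" is exactly the AST collection plus the executable collection.
assert set(TEST_COLLECTION_MAPPING["all"]) == set(
    TEST_COLLECTION_MAPPING["ast"] + TEST_COLLECTION_MAPPING["executable"]
)

# Collection names map to lists of individual category names.
print(TEST_COLLECTION_MAPPING["non-python"])  # ['java', 'javascript']
```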
55 changes: 3 additions & 52 deletions berkeley-function-call-leaderboard/eval_checker/eval_runner.py
@@ -5,6 +5,7 @@
from checker import ast_checker, exec_checker, executable_checker_rest
from custom_exception import BadAPIStatusError
from eval_runner_helper import *
from eval_checker_constant import TEST_COLLECTION_MAPPING
from tqdm import tqdm
import argparse

@@ -430,56 +431,6 @@ def runner(model_names, test_categories, api_sanity_check):
print(f"🏁 Evaluation completed. See {os.path.abspath(OUTPUT_PATH + 'data.csv')} for evaluation results.")


ARG_PARSE_MAPPING = {
"ast": [
"simple",
"multiple_function",
"parallel_function",
"parallel_multiple_function",
"java",
"javascript",
"relevance",
],
"executable": [
"executable_simple",
"executable_multiple_function",
"executable_parallel_function",
"executable_parallel_multiple_function",
"rest",
],
"all": [
"simple",
"multiple_function",
"parallel_function",
"parallel_multiple_function",
"java",
"javascript",
"relevance",
"executable_simple",
"executable_multiple_function",
"executable_parallel_function",
"executable_parallel_multiple_function",
"rest",
],
"non-python": [
"java",
"javascript",
],
"python": [
"simple",
"multiple_function",
"parallel_function",
"parallel_multiple_function",
"relevance",
"executable_simple",
"executable_multiple_function",
"executable_parallel_function",
"executable_parallel_multiple_function",
"rest",
],
}


INPUT_PATH = "../result/"
PROMPT_PATH = "../data/"
POSSIBLE_ANSWER_PATH = "../data/possible_answer/"
@@ -518,8 +469,8 @@ def runner(model_names, test_categories, api_sanity_check):
if args.test_category is not None:
test_categories = []
for test_category in args.test_category:
if test_category in ARG_PARSE_MAPPING:
test_categories.extend(ARG_PARSE_MAPPING[test_category])
if test_category in TEST_COLLECTION_MAPPING:
test_categories.extend(TEST_COLLECTION_MAPPING[test_category])
else:
test_categories.append(test_category)

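For reference, here is a minimal sketch (a hypothetical standalone snippet, not code from this PR) of what the expansion above produces for a mixed `--test-category` value; it assumes the same working directory as `eval_runner.py` so the constant imports directly:

```python
# Mimics the --test-category expansion loop in eval_runner.py.
from eval_checker_constant import TEST_COLLECTION_MAPPING

requested = ["non-python", "rest"]  # e.g. --test-category non-python rest

test_categories = []
for test_category in requested:
    if test_category in TEST_COLLECTION_MAPPING:
        # Collection names expand into their member categories.
        test_categories.extend(TEST_COLLECTION_MAPPING[test_category])
    else:
        # Individual category names are kept as-is.
        test_categories.append(test_category)

print(test_categories)  # ['java', 'javascript', 'rest']
```
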
27 changes: 16 additions & 11 deletions berkeley-function-call-leaderboard/openfunctions_evaluation.py
@@ -3,7 +3,7 @@
from model_handler.handler_map import handler_map
from model_handler.model_style import ModelStyle
from model_handler.constant import USE_COHERE_OPTIMIZATION

from eval_checker.eval_checker_constant import TEST_COLLECTION_MAPPING

def get_args():
parser = argparse.ArgumentParser()
@@ -23,7 +23,7 @@ def get_args():
return args


test_categories = {
TEST_FILE_MAPPING = {
"executable_simple": "gorilla_openfunctions_v1_test_executable_simple.json",
"executable_parallel_function": "gorilla_openfunctions_v1_test_executable_parallel_function.json",
"executable_multiple_function": "gorilla_openfunctions_v1_test_executable_multiple_function.json",
@@ -45,14 +45,19 @@ def build_handler(model_name, temperature, top_p, max_tokens):
return handler


def load_file(test_category):
if test_category == "all":
test_cate, files_to_open = list(test_categories.keys()), list(
test_categories.values()
)
def load_file(test_categories):
test_to_run = []
files_to_open = []

if test_categories in TEST_COLLECTION_MAPPING:
test_to_run = TEST_COLLECTION_MAPPING[test_categories]
for test_name in test_to_run:
files_to_open.append(TEST_FILE_MAPPING[test_name])
else:
test_cate, files_to_open = [test_category], [test_categories[test_category]]
return test_cate, files_to_open
test_to_run.append(test_categories)
files_to_open.append(TEST_FILE_MAPPING[test_categories])

return test_to_run, files_to_open


if __name__ == "__main__":
@@ -69,8 +74,8 @@ def load_file(test_category):
for res in result[0]:
handler.write(res, "result.json")
else:
test_cate, files_to_open = load_file(args.test_category)
for test_category, file_to_open in zip(test_cate, files_to_open):
test_to_run, files_to_open = load_file(args.test_category)
for test_category, file_to_open in zip(test_to_run, files_to_open):
print("Generating: " + file_to_open)
test_cases = []
with open("./data/" + file_to_open) as f:
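
Finally, a hedged sketch of how the reworked `load_file` behaves (not code from this PR; it assumes it is run from the `berkeley-function-call-leaderboard` directory and that the repository's dependencies, such as the model handlers imported at module level, are installed):

```python
# load_file now accepts either a collection name or an individual category.
from openfunctions_evaluation import load_file

# A collection name expands into its member categories and their test files.
tests, files = load_file("executable")
print(tests[0], files[0])
# executable_simple gorilla_openfunctions_v1_test_executable_simple.json

# An individual category comes back as a singleton pair.
print(load_file("executable_simple"))
# (['executable_simple'], ['gorilla_openfunctions_v1_test_executable_simple.json'])
```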