
Commit 1fab38f

Merge pull request #2 from ShishirPatil/main
2 parents 2f57b41 + 00f2c67 commit 1fab38f

10 files changed, +156 -33 lines changed

10 files changed

+156
-33
lines changed

berkeley-function-call-leaderboard/README.md

Lines changed: 8 additions & 8 deletions
@@ -80,7 +80,7 @@ To run the executable test categories, there are 4 API keys to fill out:
 The `apply_function_credential_config.py` script takes an input file and, optionally, an output file. If the output file is not given as an argument, it will overwrite your original file with the cleaned data.
 
 ```bash
-python apply_function_credential_config.py --input_file ./data/gorilla_openfunctions_v1_test_rest.json
+python apply_function_credential_config.py --input-file ./data/gorilla_openfunctions_v1_test_rest.json
 ```
 
 Then, use `eval_data_compilation.py` to compile all files by using
@@ -106,7 +106,7 @@ To generate leaderboard statistics, there are two steps:
 1. Run inference on the evaluation data and obtain the results from specific models
 
 ```bash
-python openfunctions_evaluation.py --model MODEL_NAME --test_category TEST_CATEGORY
+python openfunctions_evaluation.py --model MODEL_NAME --test-category TEST_CATEGORY
 ```
 For TEST_CATEGORY, we have `executable_simple`, `executable_parallel_function`, `executable_multiple_function`, `executable_parallel_multiple_function`, `simple`, `relevance`, `parallel_function`, `multiple_function`, `parallel_multiple_function`, `java`, `javascript`, `rest`, `sql`, `chatable`.
 
@@ -185,9 +185,9 @@ Below is *a table of models we support* to run our leaderboard evaluation against
 |gorilla-openfunctions-v2 | Function Calling|
 |gpt-3.5-turbo-0125-FC| Function Calling|
 |gpt-3.5-turbo-0125| Prompt|
-|gpt-4-{0613,1106-preview,0125-preview}-FC| Function Calling|
-|gpt-4-{0613,1106-preview,0125-preview}|Prompt|
-|glaiveai/glaive-function-calling-v1 💻| Function Calling|
+|gpt-4-{0613,1106-preview,0125-preview,turbo-2024-04-09}-FC| Function Calling|
+|gpt-4-{0613,1106-preview,0125-preview,turbo-2024-04-09}| Prompt|
+|glaiveai/glaive-function-calling-v1 💻| Function Calling|
 |Nexusflow-Raven-v2 | Function Calling|
 |fire-function-v1-FC | Function Calling|
 |mistral-large-2402-FC-{Any,Auto} | Function Calling|
@@ -196,9 +196,8 @@ Below is *a table of models we support* to run our leaderboard evaluation against
 |mistral-small-2402-FC-{Any,Auto} | Function Calling|
 |mistral-small-2402 | Prompt|
 |mistral-tiny-2312 | Prompt|
-|claude-3-{opus,sonnet}-20240229-FC | Function Calling |
-|claude-3-haiku-20240307-FC | Function Calling |
-|claude-3-{opus,sonnet}-20240229 | Prompt |
+|claude-3-{opus-20240229,sonnet-20240229,haiku-20240307}-FC | Function Calling |
+|claude-3-{opus-20240229,sonnet-20240229,haiku-20240307} | Prompt |
 |claude-{2.1,instant-1.2}| Prompt|
 |gemini-1.0-pro | Function Calling|
 |databrick-dbrx-instruct | Prompt|
@@ -222,6 +221,7 @@ For inferencing `Databrick-DBRX-instruct`, you need to create a Databrick Azure
 
 ## Changelog
 
+* [April 16, 2024] [#366](https://github.com/ShishirPatil/gorilla/pull/366): Switch to use Anthropic's new Tool Use Beta `tools-2024-04-04` when generating Claude 3 FC series data. `gpt-4-turbo-2024-04-09` and `gpt-4-turbo-2024-04-09-FC` are also added to the leaderboard.
 * [April 11, 2024] [#347](https://github.com/ShishirPatil/gorilla/pull/347): Add the 95th percentile latency to the leaderboard statistics. This metric is useful for understanding the latency distribution of the models, especially the worst-case scenario.
 * [April 10, 2024] [#339](https://github.com/ShishirPatil/gorilla/pull/339): Introduce REST API sanity check for the executable test category. It ensures that all the API endpoints involved during the execution evaluation process are working properly. If any of them are not behaving as expected, the evaluation process will be stopped by default as the result will be inaccurate. Users can choose to bypass this check by setting the `--skip-api-sanity-check` flag.
 * [April 9, 2024] [#338](https://github.com/ShishirPatil/gorilla/pull/338): Bug fix in the evaluation datasets (including both prompts and function docs). Bug fix for possible answers as well.

berkeley-function-call-leaderboard/apply_function_credential_config.py

Lines changed: 2 additions & 2 deletions
@@ -3,8 +3,8 @@
 
 
 parser = argparse.ArgumentParser(description="Replace placeholders in the function credential config file.")
-parser.add_argument("--input_file", help="Path to the function credential config file.", required=True)
-parser.add_argument("--output_file", help="Path to the output file.", default="")
+parser.add_argument("--input-file", help="Path to the function credential config file.", required=True)
+parser.add_argument("--output-file", help="Path to the output file.", default="")
 args = parser.parse_args()
 
 # Load the configuration with actual API keys
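
A side note on the renamed flags: argparse maps dashed option names to underscore-separated attribute names, so code that reads `args.input_file` and `args.output_file` keeps working after the rename. A minimal sketch of that behavior (the sample path is illustrative):

```python
import argparse

parser = argparse.ArgumentParser(description="Replace placeholders in the function credential config file.")
parser.add_argument("--input-file", help="Path to the function credential config file.", required=True)
parser.add_argument("--output-file", help="Path to the output file.", default="")

# argparse converts dashes in long option names to underscores for attribute
# access, so the renamed "--input-file" flag is still read as args.input_file.
args = parser.parse_args(["--input-file", "./data/gorilla_openfunctions_v1_test_rest.json"])
print(args.input_file)         # ./data/gorilla_openfunctions_v1_test_rest.json
print(repr(args.output_file))  # '' (default: the input file is overwritten in place)
```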

berkeley-function-call-leaderboard/eval_checker/eval_runner_helper.py

Lines changed: 25 additions & 4 deletions
@@ -66,14 +66,26 @@
         "OpenAI",
         "Proprietary",
     ],
+    "gpt-4-turbo-2024-04-09-FC": [
+        "GPT-4-turbo-2024-04-09 (FC)",
+        "https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo",
+        "OpenAI",
+        "Proprietary",
+    ],
+    "gpt-4-turbo-2024-04-09": [
+        "GPT-4-turbo-2024-04-09 (Prompt)",
+        "https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo",
+        "OpenAI",
+        "Proprietary",
+    ],
     "gorilla-openfunctions-v2": [
         "Gorilla-OpenFunctions-v2 (FC)",
         "https://gorilla.cs.berkeley.edu/blogs/7_open_functions_v2.html",
         "Gorilla LLM",
         "Apache 2.0",
     ],
     "claude-3-opus-20240229-FC": [
-        "Claude-3-Opus-20240229 (FC)",
+        "Claude-3-Opus-20240229 (FC tools-2024-04-04)",
         "https://www.anthropic.com/news/claude-3-family",
         "Anthropic",
         "Proprietary",
@@ -103,7 +115,7 @@
         "Proprietary",
     ],
     "claude-3-sonnet-20240229-FC": [
-        "Claude-3-Sonnet-20240229 (FC)",
+        "Claude-3-Sonnet-20240229 (FC tools-2024-04-04)",
         "https://www.anthropic.com/news/claude-3-family",
         "Anthropic",
         "Proprietary",
@@ -115,7 +127,13 @@
         "Proprietary",
     ],
     "claude-3-haiku-20240307-FC": [
-        "Claude-3-Haiku-20240307 (FC)",
+        "Claude-3-Haiku-20240307 (FC tools-2024-04-04)",
+        "https://www.anthropic.com/news/claude-3-family",
+        "Anthropic",
+        "Proprietary",
+    ],
+    "claude-3-haiku-20240307": [
+        "Claude-3-Haiku-20240307 (Prompt)",
         "https://www.anthropic.com/news/claude-3-family",
         "Anthropic",
         "Proprietary",
@@ -279,6 +297,8 @@
     "gpt-4-1106-preview": 10,
     "gpt-4-0125-preview": 10,
     "gpt-4-0125-preview-FC": 10,
+    "gpt-4-turbo-2024-04-09-FC": 10,
+    "gpt-4-turbo-2024-04-09": 10,
     "gpt-4-0613": 30,
     "gpt-4-0613-FC": 30,
     "gpt-3.5-turbo-0125": 1.5,
@@ -302,6 +322,8 @@
     "mistral-small-2402-FC-Any": 6,
     "mistral-small-2402-FC-Auto": 6,
     "mistral-tiny-2312": 0.25,
+    "gpt-4-turbo-2024-04-09-FC": 30,
+    "gpt-4-turbo-2024-04-09": 30,
     "gpt-4-1106-preview": 30,
     "gpt-4-1106-preview-FC": 30,
     "gpt-4-0125-preview-FC": 30,
@@ -314,7 +336,6 @@
     "databricks-dbrx-instruct": 6.75,
 }
 
-
 # The latency of the open-source models are hardcoded here.
 # Because we do batching when generating the data, so the latency is not accurate from the result data.
 # This is the latency for the whole batch of data.

berkeley-function-call-leaderboard/model_handler/claude_fc_handler.py

Lines changed: 85 additions & 0 deletions

@@ -0,0 +1,85 @@
+from model_handler.handler import BaseHandler
+from anthropic import Anthropic
+from anthropic.types import TextBlock
+from anthropic.types.beta.tools import ToolUseBlock
+from model_handler.model_style import ModelStyle
+from model_handler.claude_prompt_handler import ClaudePromptingHandler
+from model_handler.utils import (
+    convert_to_tool,
+    augment_prompt_by_languge,
+    language_specific_pre_processing,
+    ast_parse,
+    convert_to_function_call
+)
+from model_handler.constant import GORILLA_TO_OPENAPI
+import os, time, json
+
+
+class ClaudeFCHandler(BaseHandler):
+    def __init__(self, model_name, temperature=0.7, top_p=1, max_tokens=1000) -> None:
+        super().__init__(model_name, temperature, top_p, max_tokens)
+        self.model_style = ModelStyle.Anthropic_Prompt
+
+        self.client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
+
+    def inference(self, prompt, functions, test_category):
+        if "FC" not in self.model_name:
+            handler = ClaudePromptingHandler(self.model_name, self.temperature, self.top_p, self.max_tokens)
+            return handler.inference(prompt, functions, test_category)
+        else:
+            prompt = augment_prompt_by_languge(prompt, test_category)
+            functions = language_specific_pre_processing(functions, test_category, True)
+            if type(functions) is not list:
+                functions = [functions]
+            claude_tool = convert_to_tool(
+                functions, GORILLA_TO_OPENAPI, self.model_style, test_category, True
+            )
+            message = [{"role": "user", "content": prompt}]
+            start_time = time.time()
+            response = self.client.beta.tools.messages.create(
+                model=self.model_name.strip("-FC"),
+                max_tokens=self.max_tokens,
+                tools=claude_tool,
+                messages=message,
+            )
+            latency = time.time() - start_time
+            text_outputs = []
+            tool_call_outputs = []
+            for content in response.content:
+                if isinstance(content, TextBlock):
+                    text_outputs.append(content.text)
+                elif isinstance(content, ToolUseBlock):
+                    tool_call_outputs.append({content.name: json.dumps(content.input)})
+            result = tool_call_outputs if tool_call_outputs else text_outputs[0]
+            return result, {"input_tokens": response.usage.input_tokens, "output_tokens": response.usage.output_tokens, "latency": latency}
+
+    def decode_ast(self,result,language="Python"):
+        if "FC" not in self.model_name:
+            decoded_output = ast_parse(result,language)
+        else:
+            decoded_output = []
+            for invoked_function in result:
+                name = list(invoked_function.keys())[0]
+                params = json.loads(invoked_function[name])
+                if language == "Python":
+                    pass
+                else:
+                    # all values of the json are casted to string for java and javascript
+                    for key in params:
+                        params[key] = str(params[key])
+                decoded_output.append({name: params})
+        return decoded_output
+
+    def decode_execute(self,result):
+        if "FC" not in self.model_name:
+            decoded_output = ast_parse(result)
+            execution_list = []
+            for function_call in decoded_output:
+                for key, value in function_call.items():
+                    execution_list.append(
+                        f"{key}({','.join([f'{k}={repr(v)}' for k, v in value.items()])})"
+                    )
+            return execution_list
+        else:
+            function_call = convert_to_function_call(result)
+            return function_call
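
For reference, the FC path of the new handler returns tool calls as a list of `{tool_name: json_encoded_arguments}` dicts, which `decode_ast` later parses back into plain dicts. A small sketch of that round trip, with a made-up tool name and arguments:

```python
import json

# Illustrative shape of the FC handler's raw result: one dict per tool call,
# mapping the tool name to its JSON-encoded arguments (values here are made up).
raw_result = [{"get_weather": json.dumps({"city": "Berkeley", "unit": "celsius"})}]

# Mirrors the non-Java/JavaScript branch of decode_ast: decode each tool call's
# JSON arguments back into a Python dict.
decoded_output = []
for invoked_function in raw_result:
    name = list(invoked_function.keys())[0]
    params = json.loads(invoked_function[name])
    decoded_output.append({name: params})

print(decoded_output)  # [{'get_weather': {'city': 'Berkeley', 'unit': 'celsius'}}]
```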

berkeley-function-call-leaderboard/model_handler/claude_handler.py renamed to berkeley-function-call-leaderboard/model_handler/claude_prompt_handler.py

Lines changed: 2 additions & 2 deletions
@@ -18,10 +18,10 @@
 from anthropic import Anthropic
 
 
-class ClaudeHandler(BaseHandler):
+class ClaudePromptingHandler(BaseHandler):
     def __init__(self, model_name, temperature=0.7, top_p=1, max_tokens=1000) -> None:
         super().__init__(model_name, temperature, top_p, max_tokens)
-        self.model_style = ModelStyle.Anthropic
+        self.model_style = ModelStyle.Anthropic_Prompt
 
         self.client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
 
berkeley-function-call-leaderboard/model_handler/constant.py

Lines changed: 6 additions & 1 deletion
@@ -112,11 +112,16 @@
     "any": str,
 }
 
+# If there is any underscore in folder name, you should change it to `/` in the following strings
 UNDERSCORE_TO_DOT = [
+    "gpt-4-turbo-2024-04-09-FC",
     "gpt-4-1106-preview-FC",
     "gpt-4-0125-preview-FC",
     "gpt-4-0613-FC",
     "gpt-3.5-turbo-0125-FC",
+    "claude-3-opus-20240229-FC",
+    "claude-3-sonnet-20240229-FC",
+    "claude-3-haiku-20240307-FC",
     "mistral-large-2402-FC",
     "mistral-large-2402-FC-Any",
     "mistral-large-2402-FC-Auto",
@@ -128,7 +133,7 @@
     "meetkai/functionary-medium-v2.2-FC",
     "meetkai/functionary-small-v2.4-FC",
     "meetkai/functionary-medium-v2.4-FC",
-    "NousResearch/Hermes-2-Pro-Mistral-7B"
+    "NousResearch/Hermes-2-Pro-Mistral-7B",
 ]
 
 TEST_CATEGORIES = {

berkeley-function-call-leaderboard/model_handler/handler_map.py

Lines changed: 12 additions & 8 deletions
@@ -1,6 +1,7 @@
 from model_handler.gorilla_handler import GorillaHandler
 from model_handler.gpt_handler import OpenAIHandler
-from model_handler.claude_handler import ClaudeHandler
+from model_handler.claude_fc_handler import ClaudeFCHandler
+from model_handler.claude_prompt_handler import ClaudePromptingHandler
 from model_handler.mistral_handler import MistralHandler
 from model_handler.firework_ai_handler import FireworkAIHandler
 from model_handler.nexus_handler import NexusHandler
@@ -16,6 +17,8 @@
 handler_map = {
     "gorilla-openfunctions-v0": GorillaHandler,
     "gorilla-openfunctions-v2": GorillaHandler,
+    "gpt-4-turbo-2024-04-09-FC": OpenAIHandler,
+    "gpt-4-turbo-2024-04-09": OpenAIHandler,
     "gpt-4-1106-preview-FC": OpenAIHandler,
     "gpt-4-1106-preview": OpenAIHandler,
     "gpt-4-0125-preview-FC": OpenAIHandler,
@@ -24,13 +27,14 @@
     "gpt-4-0613": OpenAIHandler,
     "gpt-3.5-turbo-0125-FC": OpenAIHandler,
     "gpt-3.5-turbo-0125": OpenAIHandler,
-    "claude-2.1": ClaudeHandler,
-    "claude-instant-1.2": ClaudeHandler,
-    "claude-3-opus-20240229": ClaudeHandler,
-    "claude-3-opus-20240229-FC": ClaudeHandler,
-    "claude-3-sonnet-20240229": ClaudeHandler,
-    "claude-3-sonnet-20240229-FC": ClaudeHandler,
-    "claude-3-haiku-20240307-FC": ClaudeHandler,
+    "claude-2.1": ClaudePromptingHandler,
+    "claude-instant-1.2": ClaudePromptingHandler,
+    "claude-3-opus-20240229": ClaudePromptingHandler,
+    "claude-3-opus-20240229-FC": ClaudeFCHandler,
+    "claude-3-sonnet-20240229": ClaudePromptingHandler,
+    "claude-3-sonnet-20240229-FC": ClaudeFCHandler,
+    "claude-3-haiku-20240307": ClaudePromptingHandler,
+    "claude-3-haiku-20240307-FC": ClaudeFCHandler,
     "mistral-large-2402": MistralHandler,
     "mistral-large-2402-FC-Any": MistralHandler,
     "mistral-large-2402-FC-Auto": MistralHandler,

berkeley-function-call-leaderboard/model_handler/model_style.py

Lines changed: 2 additions & 1 deletion
@@ -4,7 +4,8 @@
 class ModelStyle(Enum):
     Gorilla = "gorilla"
     OpenAI = "gpt"
-    Anthropic = "claude"
+    Anthropic_FC = "claude"
+    Anthropic_Prompt = "claude"
     Mistral = "mistral"
     Google = "google"
     FIREWORK_AI = "firework_ai"
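
One Python detail behind this diff: two Enum members that share a value collapse into a single member, with the later name acting as an alias for the first. A minimal sketch of that behavior using the two values above:

```python
from enum import Enum

# Minimal sketch: both names carry the value "claude", so Anthropic_Prompt
# becomes an alias of Anthropic_FC rather than a distinct member.
class ModelStyle(Enum):
    Anthropic_FC = "claude"
    Anthropic_Prompt = "claude"

print(ModelStyle.Anthropic_FC is ModelStyle.Anthropic_Prompt)  # True
print(ModelStyle.Anthropic_Prompt)                             # ModelStyle.Anthropic_FC
```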

berkeley-function-call-leaderboard/model_handler/utils.py

Lines changed: 11 additions & 4 deletions
@@ -63,6 +63,7 @@ def convert_to_tool(
         or model_style == ModelStyle.Mistral
         or model_style == ModelStyle.Google
         or model_style == ModelStyle.OSSMODEL
+        or model_style == ModelStyle.Anthropic_FC
     ):
         # OAI does not support "." in the function name so we replace it with "_". ^[a-zA-Z0-9_-]{1,64}$ is the regex for the name.
         item["name"] = re.sub(r"\.", "_", item["name"])
@@ -77,7 +78,8 @@
         ModelStyle.OpenAI,
         ModelStyle.Mistral,
         ModelStyle.Google,
-        ModelStyle.Anthropic,
+        ModelStyle.Anthropic_Prompt,
+        ModelStyle.Anthropic_FC,
         ModelStyle.FIREWORK_AI,
         ModelStyle.OSSMODEL,
     ]
@@ -92,21 +94,24 @@
         for key, value in properties.items():
             if value["type"] in JS_TYPE_CONVERSION:
                 properties[key]["type"] = "string"
+        if model_style == ModelStyle.Anthropic_FC:
+            item["input_schema"] = item["parameters"]
+            del item["parameters"]
         if model_style == ModelStyle.Google:
             # Remove fields that are not supported by Gemini today.
             for params in item["parameters"]["properties"].values():
                 if "default" in params:
-                    params["description"] += str(params["default"])
+                    params["description"] += "The Default is:" + str(params["default"])
                     del params["default"]
                 if "optional" in params:
                     del params["optional"]
                 if "maximum" in params:
                     del params["maximum"]
                 if "additionalProperties" in params:
-                    params["description"] += str(params["additionalProperties"])
+                    params["description"] += "The additional properties:" +str(params["additionalProperties"])
                     del params["additionalProperties"]
         if model_style in [
-            ModelStyle.Anthropic,
+            ModelStyle.Anthropic_Prompt,
             ModelStyle.Google,
             ModelStyle.OSSMODEL,
         ]:
@@ -261,6 +266,8 @@ def augment_prompt_by_languge(prompt, test_category)
 def language_specific_pre_processing(function, test_category, string_param):
     if type(function) is dict:
         function = [function]
+    if len(function) == 0:
+        return function
     for item in function:
         properties = item["parameters"]["properties"]
         if test_category == "java":
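
The net effect of the new `Anthropic_FC` branch in `convert_to_tool` is to move the OpenAI-style `parameters` schema under Anthropic's `input_schema` key, after the usual dot-to-underscore name cleanup. A minimal sketch with a made-up function doc:

```python
import re

# Made-up function doc in the OpenAI-style shape used elsewhere in the repo.
item = {
    "name": "weather.get_current",
    "description": "Get the current weather for a city.",
    "parameters": {"type": "object", "properties": {"city": {"type": "string"}}},
}

# Function names may not contain ".", so dots are replaced with underscores.
item["name"] = re.sub(r"\.", "_", item["name"])

# Anthropic's tool-use API expects the JSON schema under "input_schema",
# so the "parameters" key is renamed.
item["input_schema"] = item["parameters"]
del item["parameters"]

print(item["name"])         # weather_get_current
print(sorted(item.keys()))  # ['description', 'input_schema', 'name']
```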

berkeley-function-call-leaderboard/openfunctions_evaluation.py

Lines changed: 3 additions & 3 deletions
@@ -8,12 +8,12 @@ def get_args():
     # Refer to model_choice for supported models.
     parser.add_argument("--model", type=str, default="gorilla-openfunctions-v2")
     # Refer to test_categories for supported categories.
-    parser.add_argument("--test_category", type=str, default="all")
+    parser.add_argument("--test-category", type=str, default="all")
 
     # Parameters for the model that you want to test.
     parser.add_argument("--temperature", type=float, default=0.7)
-    parser.add_argument("--top_p", type=float, default=1)
-    parser.add_argument("--max_tokens", type=int, default=1200)
+    parser.add_argument("--top-p", type=float, default=1)
+    parser.add_argument("--max-tokens", type=int, default=1200)
     parser.add_argument("--num-gpus", default=1, type=int)
     parser.add_argument("--timeout", default=60, type=int)
