Commit 58f57e9

BFCL V4 Release (#1019)
> ❗️ **Important**: This PR introduces breaking changes and is **NOT** backward-compatible.

# BFCL V4 💥

At ICML 2025, we’re delighted to introduce BFCL V4 Agentic, a new benchmark focused on tool-calling in real-world agentic settings, including:

- 🔍 Web search with multi-hop reasoning and error recovery
- 🧠 Evaluating tool-calling for memory
- ⚠️ Evaluating format sensitivity

## Change Log

1. **New agentic domain**
   - Introduces the agentic domain with two categories: Web Search and Memory Management.
   - For more information, please see our accompanying [blog posts](https://gorilla.cs.berkeley.edu/blog.html).
2. **Revised overall-accuracy formula**
   - As single-turn tasks approach saturation, weighting now favors complex, multi-step agentic tasks.

   | Segment     | Old % |  New % |
   | ----------- | ----: | -----: |
   | Live        |    33 | **10** |
   | Non-Live    |    33 | **10** |
   | Irrelevance |     0 | **10** |
   | Multi-Turn  |    33 | **30** |
   | Agentic     |     0 | **40** |

3. **Leaderboard / model cleanup**
   - Retires several deprecated models from the leaderboard.
   - Removes unused model handlers to improve maintainability.
4. **Address #602**
   - `Non-Live Acc` and `Live Acc` score calculations now exclude the Irrelevance/Relevance category scores.
5. **Resolve #1094.**
6. **Codebase refactor**
   - Reorganizes the response-generation pipeline and related modules for easier maintenance.
   - Simplifies the response-generation pipeline logic for locally hosted models.
   - Introduces `enums.py`.
7. **Test category rename**
   The following categories have been renamed to avoid confusion. This applies to both dataset file names and leaderboard website columns.
   - `simple` --> `simple_python`
   - `java` --> `simple_java`
   - `javascript` --> `simple_javascript`
8. **Directory layout overhaul**
   Results and scores now use a _two-level_ hierarchy:

   ```text
   result/<model>/<general_category>/<category>.json
   score/<model>/<general_category>/<category>.json
   ```

   `general_category` ∈ { **non_live**, **live**, **multi_turn**, **agentic**, **format_sensitivity** }

   For _agentic-memory_ tasks, an extra level distinguishes the memory backend:

   ```text
   result/<model>/agentic/<memory_backend>/<category>.json
   ```

   Migrate existing outputs to this structure before upgrading; otherwise the evaluation pipeline will fail to locate files.
9. **New model support**
   Adds support for the following models:
   - `claude-opus-4-1-20250805`
   - `gpt-5-2025-08-07`
   - `gpt-5-mini-2025-08-07`
   - `gpt-5-nano-2025-08-07`
   - `Qwen/Qwen3-30B-A3B-Instruct-2507`
   - `Qwen/Qwen3-235B-A22B-Instruct-2507`
   - `Qwen/Qwen3-4B-Instruct-2507`
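The revised overall-accuracy formula above is a straight weighted average over per-segment accuracies. A minimal sketch of how it could be computed, assuming the weights from the table; the function name and the example accuracy values are illustrative, not BFCL's actual implementation:

```python
# Sketch of the revised V4 overall-accuracy weighting described above.
# The weights come from the table in this PR; the per-segment accuracies
# in `example` below are made-up numbers for illustration.

V4_WEIGHTS = {
    "live": 0.10,
    "non_live": 0.10,
    "irrelevance": 0.10,
    "multi_turn": 0.30,
    "agentic": 0.40,
}

def overall_accuracy(segment_acc: dict) -> float:
    """Weighted average of per-segment accuracies using the V4 weights."""
    if set(segment_acc) != set(V4_WEIGHTS):
        raise ValueError(f"expected segments {sorted(V4_WEIGHTS)}")
    return sum(V4_WEIGHTS[name] * acc for name, acc in segment_acc.items())

example = {
    "live": 0.80,
    "non_live": 0.90,
    "irrelevance": 0.70,
    "multi_turn": 0.60,
    "agentic": 0.40,
}
print(round(overall_accuracy(example), 3))  # → 0.58
```

Note how the 40% agentic weight dominates: a model that saturates the single-turn segments but fails agentic tasks can no longer score well overall.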
1 parent cd9429c commit 58f57e9

File tree: 150 files changed (+6808, −4000 lines)


berkeley-function-call-leaderboard/CHANGELOG.md

Lines changed: 63 additions & 4 deletions
@@ -2,7 +2,66 @@
 All notable changes to the Berkeley Function Calling Leaderboard will be documented in this file.

-- [Jul 8, 2025] [#1098](https://github.com/ShishirPatil/gorilla/pull/1098):
+- [Jul 17, 2025] [#1019](https://github.com/ShishirPatil/gorilla/pull/1019): BFCL V4 release:
+  1. **New agentic domain**
+     - Introduces the agentic domain with two categories: Web Search and Memory Management.
+     - For more information, please see our accompanying [blog posts](https://gorilla.cs.berkeley.edu/blog.html).
+  2. **Revised overall-accuracy formula**
+     - As single-turn tasks approach saturation, weighting now favors complex, multi-step agentic tasks.
+
+     | Segment     | Old % |  New % |
+     | ----------- | ----: | -----: |
+     | Live        |    33 | **10** |
+     | Non-Live    |    33 | **10** |
+     | Irrelevance |     0 | **10** |
+     | Multi-Turn  |    33 | **30** |
+     | Agentic     |     0 | **40** |
+
+  3. **Leaderboard / model cleanup**
+     - Retires several deprecated models from the leaderboard.
+     - Removes unused model handlers to improve maintainability.
+  4. **Address #602**
+     - `Non-Live Acc` and `Live Acc` score calculations now exclude the Irrelevance/Relevance category scores.
+  5. **Resolve #1094.**
+  6. **Codebase refactor**
+     - Reorganizes the response-generation pipeline and related modules for easier maintenance.
+     - Simplifies the response-generation pipeline logic for locally hosted models.
+     - Introduces `enums.py`.
+  7. **Test category rename**
+     The following categories have been renamed to avoid confusion. This applies to both dataset file names and leaderboard website columns.
+     - `simple` --> `simple_python`
+     - `java` --> `simple_java`
+     - `javascript` --> `simple_javascript`
+  8. **Directory layout overhaul**
+     Results and scores now use a _two-level_ hierarchy:
+
+     ```text
+     result/<model>/<general_category>/<category>.json
+     score/<model>/<general_category>/<category>.json
+     ```
+
+     `general_category` ∈ { **non_live**, **live**, **multi_turn**, **agentic**, **format_sensitivity** }
+
+     For _agentic-memory_ tasks, an extra level distinguishes the memory backend:
+
+     ```text
+     result/<model>/agentic/<memory_backend>/<category>.json
+     ```
+
+     Migrate existing outputs to this structure before upgrading; otherwise the evaluation pipeline will fail to locate files.
+  9. **New model support**
+     Adds support for the following models:
+     - `claude-opus-4-1-20250805`
+     - `gpt-5-2025-08-07`
+     - `gpt-5-mini-2025-08-07`
+     - `gpt-5-nano-2025-08-07`
+     - `Qwen/Qwen3-30B-A3B-Instruct-2507`
+     - `Qwen/Qwen3-235B-A22B-Instruct-2507`
+     - `Qwen/Qwen3-4B-Instruct-2507`
+
+- [Jul 8, 2025] [#1098](https://github.com/ShishirPatil/gorilla/pull/1098):
   - Re-introduce latency statistics for locally hosted models
   - Update cost calculation to cover the entire dataset batch, instead of the average cost per 1k function calls
 - [Jul 6, 2025] [#1100](https://github.com/ShishirPatil/gorilla/pull/1100): Add the following new models to the leaderboard:
@@ -225,11 +284,11 @@ All notable changes to the Berkeley Function Calling Leaderboard will be documented
 - [Nov 18, 2024] [#768](https://github.com/ShishirPatil/gorilla/pull/768), [#770](https://github.com/ShishirPatil/gorilla/pull/770): Resolve issues in Gemini models (FC mode) related to handling scenarios with no tools available and cases where the model output is empty.
 - [Nov 17, 2024] [#767](https://github.com/ShishirPatil/gorilla/pull/767): Fix price and latency calculation. A merge conflict resulted in a duplicate line, causing the input and output tokens for each entry to be counted multiple times.
 - [Nov 15, 2024] [#762](https://github.com/ShishirPatil/gorilla/pull/762): Supply `data_multi_turn.csv` for multi-turn evaluation results
 - [Nov 14, 2024] [#760](https://github.com/ShishirPatil/gorilla/pull/760), [#761](https://github.com/ShishirPatil/gorilla/pull/761): Upstream `google-cloud-aiplatform` library fixed typecasting bugs in Function Calling. Updated to version `1.72.0` and removed the workaround patch introduced in [#648](https://github.com/ShishirPatil/gorilla/pull/648).
 - [Nov 14, 2024] [#747](https://github.com/ShishirPatil/gorilla/pull/747): Minor grammatical corrections to the `DEFAULT_SYSTEM_PROMPT` that is supplied to all prompting models.
 - [Nov 13, 2024] [#737](https://github.com/ShishirPatil/gorilla/pull/737), [#739](https://github.com/ShishirPatil/gorilla/pull/739), [#740](https://github.com/ShishirPatil/gorilla/pull/740), [#763](https://github.com/ShishirPatil/gorilla/pull/763), [#772](https://github.com/ShishirPatil/gorilla/pull/772), [#789](https://github.com/ShishirPatil/gorilla/pull/789), [#804](https://github.com/ShishirPatil/gorilla/pull/804): Bug fix in the dataset and possible answers for the live and multi-turn categories.
 - [Nov 11, 2024] [#746](https://github.com/ShishirPatil/gorilla/pull/746): Improve inference log readability; the inference log is now included as part of the model result file. For details on how to interpret the inference log, please refer to the [LOG_GUIDE.md](https://github.com/ShishirPatil/gorilla/blob/main/berkeley-function-call-leaderboard/LOG_GUIDE.md).
-- [Nov 9, 2024] [#749](https://github.com/ShishirPatil/gorilla/pull/749): Remove `Llama-3.2-3B-Instruct-FC` and `Llama-3.2-1B-Instruct-FC` from the leaderboard. According to the [official Llama documentation](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2#-tool-calling-(1b/3b)-), these models perform function calling using the prompt-style chat template rather than the specialized function-calling format.
+- [Nov 9, 2024] [#749](https://github.com/ShishirPatil/gorilla/pull/749): Remove `Llama-3.2-3B-Instruct-FC` and `Llama-3.2-1B-Instruct-FC` from the leaderboard. According to the [official Llama documentation](<https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2#-tool-calling-(1b/3b)->), these models perform function calling using the prompt-style chat template rather than the specialized function-calling format.
 - [Nov 8, 2024] [#720](https://github.com/ShishirPatil/gorilla/pull/720): Add new model `BitAgent/GoGoAgent` to the leaderboard.
 - [Oct 30, 2024] [#725](https://github.com/ShishirPatil/gorilla/pull/725), [#733](https://github.com/ShishirPatil/gorilla/pull/733): Update evaluation metric for multi-turn categories:
   - Introduce a new response-based checker, which works alongside the existing state-based checker.
@@ -277,7 +336,7 @@ All notable changes to the Berkeley Function Calling Leaderboard will be documented
   - `microsoft/Phi-3-small-8k-instruct`
   - `microsoft/Phi-3-mini-128k-instruct`
   - `microsoft/Phi-3-mini-4k-instruct`
 - [Sept 25, 2024] [#660](https://github.com/ShishirPatil/gorilla/pull/660): Bug fix in `parse_nested_value` function to handle nested dictionary values properly.
 - [Sept 24, 2024] [#657](https://github.com/ShishirPatil/gorilla/pull/657): Add the following new models to the leaderboard:
   - `meta-llama/Llama-3.2-1B-Instruct`
   - `meta-llama/Llama-3.2-1B-Instruct-FC`
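The directory-layout overhaul in this release implies a one-time migration of existing flat `result/<model>/<category>.json` files into the new `result/<model>/<general_category>/<category>.json` hierarchy. A hypothetical sketch of that migration; the prefix-to-general-category mapping below is an assumption for illustration, not the official category list, so check it against your own categories before running anything like this:

```python
# Hypothetical migration from the flat layout to the V4 two-level layout.
# PREFIX_TO_GENERAL is an assumed, incomplete mapping for illustration only.
from pathlib import Path

PREFIX_TO_GENERAL = {
    "live": "live",
    "multi_turn": "multi_turn",
    "irrelevance": "non_live",
    "simple": "simple_python and friends map here",  # see next line
}
PREFIX_TO_GENERAL["simple"] = "non_live"  # single-turn categories assumed non_live

def general_category_for(category: str) -> str:
    """Guess the general category from the category file name (assumption)."""
    for prefix, general in PREFIX_TO_GENERAL.items():
        if category.startswith(prefix):
            return general
    return "non_live"  # assumed fallback

def migrate(result_root: Path) -> list:
    """Move each <model>/<category>.json under its general-category folder."""
    moved = []
    for model_dir in result_root.iterdir():
        if not model_dir.is_dir():
            continue
        # Snapshot the file list before creating subdirectories.
        for f in list(model_dir.glob("*.json")):
            dest_dir = model_dir / general_category_for(f.stem)
            dest_dir.mkdir(exist_ok=True)
            dest = dest_dir / f.name
            f.rename(dest)
            moved.append(dest)
    return moved
```

The agentic-memory categories would need the extra `<memory_backend>` level on top of this, which the sketch does not handle.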

berkeley-function-call-leaderboard/CONTRIBUTING.md

Lines changed: 1 addition & 1 deletion
@@ -142,7 +142,7 @@ Regardless of mode or model type, you should implement the following methods to

 ## Join Our Community

-- Have questions or need help? Join the [Gorilla Discord](https://discord.gg/grXXvj9Whz) and visit the `#leaderboard` channel.
+- Have questions or need help? Join the [Discord](https://discord.gg/grXXvj9Whz) and visit the `#leaderboard` channel.
 - Feel free to reach out if you have any questions, concerns, or would like guidance while adding your new model. We’re happy to assist!

 ---

berkeley-function-call-leaderboard/README.md

Lines changed: 15 additions & 10 deletions
@@ -11,6 +11,7 @@
   - [Extra Dependencies for Self-Hosted Models](#extra-dependencies-for-self-hosted-models)
   - [Configuring Project Root Directory](#configuring-project-root-directory)
   - [Setting up Environment Variables](#setting-up-environment-variables)
+  - [Configuring SerpAPI for Web Search Category](#configuring-serpapi-for-web-search-category)
 - [Running Evaluations](#running-evaluations)
   - [Generating LLM Responses](#generating-llm-responses)
     - [Selecting Models and Test Categories](#selecting-models-and-test-categories)
@@ -78,7 +79,7 @@ pip install bfcl-eval  # Be careful not to confuse with the unrelated `bfcl` pro

 For locally hosted models, choose one of the following backends, ensuring you have the right GPU and OS setup:

-`sglang` is *much faster* than `vllm` but only supports newer GPUs with SM 80+ (Ampere etc).
+`sglang` is *much faster* than `vllm` in our specific multi-turn use case, but it only supports newer GPUs with SM 80+ (Ampere etc).
 If you are using an older GPU (T4/V100), you should use `vllm` instead as it supports a much wider range of GPUs.

 **Using `vllm`:**
@@ -135,6 +136,10 @@ If you are running any proprietary models, make sure the model API keys are incl

 The library looks for the `.env` file in the project root, i.e. `$BFCL_PROJECT_ROOT/.env`.

+#### Configuring SerpAPI for Web Search Category
+
+For the `web_search` test category, we use the [SerpAPI](https://serpapi.com/) service to perform web search. You need to sign up for an API key and add it to your `.env` file. You can also switch to other web search APIs by changing the `search_engine_query` function in `bfcl_eval/eval_checker/multi_turn_eval/func_source_code/web_search.py`.
+
 ---

 ## Running Evaluations
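The new README section above notes that SerpAPI can be swapped for another search backend by replacing `search_engine_query`. A hedged sketch of what a drop-in replacement might look like; the signature, the return shape, and the `fetch` parameter are assumptions for illustration, not the actual interface in `bfcl_eval`:

```python
# Hypothetical replacement for the web-search helper mentioned in the README.
# The real `search_engine_query` may use a different signature and return
# shape; this sketch only illustrates injecting the HTTP layer so another
# search API (or a stub, for testing) can be used without touching callers.
from typing import Callable, Optional

def search_engine_query(
    query: str,
    top_k: int = 5,
    fetch: Optional[Callable[[str], list]] = None,
) -> list:
    """Return up to `top_k` results, each a dict like {"title": ..., "link": ...}."""
    if fetch is None:
        raise ValueError("wire `fetch` to your chosen search API")
    return fetch(query)[:top_k]

# Stubbed usage, no network: a fake fetcher stands in for the real API call.
def fake_fetch(q: str) -> list:
    return [{"title": f"result {i} for {q}", "link": f"https://example.com/{i}"}
            for i in range(10)]

print(len(search_engine_query("bfcl v4", top_k=3, fetch=fake_fetch)))  # → 3
```

Injecting the fetcher keeps the evaluation code independent of any one search provider's SDK.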
@@ -212,13 +217,13 @@ bfcl generate --model MODEL_NAME --test-category TEST_CATEGORY --num-threads 1
 bfcl generate \
   --model MODEL_NAME \
   --test-category TEST_CATEGORY \
-  --backend {vllm|sglang} \
+  --backend {sglang|vllm} \
   --num-gpus 1 \
   --gpu-memory-utilization 0.9 \
   --local-model-path /path/to/local/model  # ← optional
 ```

-- Choose your backend using `--backend vllm` or `--backend sglang`. The default backend is `vllm`.
+- Choose your backend using `--backend sglang` or `--backend vllm`. The default backend is `sglang`.
 - Control GPU usage by adjusting `--num-gpus` (default `1`, relevant for multi-GPU tensor parallelism) and `--gpu-memory-utilization` (default `0.9`), which can help avoid out-of-memory errors.
 - `--local-model-path` (optional): Point this flag at a directory that already contains the model's files (`config.json`, tokenizer, weights, etc.). Use it only when you've pre‑downloaded the model and the weights live somewhere other than the default `$HF_HOME` cache.

@@ -230,11 +235,11 @@ If you have a server already running (e.g., vLLM in a SLURM cluster), you can by
 bfcl generate --model MODEL_NAME --test-category TEST_CATEGORY --skip-server-setup
 ```

-In addition, you should specify the endpoint and port used by the server. By default, the endpoint is `localhost` and the port is `1053`. These can be overridden by the `VLLM_ENDPOINT` and `VLLM_PORT` environment variables in the `.env` file:
+In addition, you should specify the endpoint and port used by the local server. By default, the endpoint is `localhost` and the port is `1053`. These can be overridden by the `LOCAL_SERVER_ENDPOINT` and `LOCAL_SERVER_PORT` environment variables in the `.env` file:

 ```bash
-VLLM_ENDPOINT=localhost
-VLLM_PORT=1053
+LOCAL_SERVER_ENDPOINT=localhost
+LOCAL_SERVER_PORT=1053
 ```

 #### (Alternate) Script Execution for Generation
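The endpoint/port override renamed in this hunk amounts to reading two environment variables with documented defaults. A minimal sketch of that resolution; the variable names and defaults come from the README, while the helper name and the `/v1` path suffix are assumptions (the suffix being typical for OpenAI-compatible servers):

```python
# Sketch: resolve the local server's base URL from the environment,
# falling back to the README's documented defaults. LOCAL_SERVER_ENDPOINT
# and LOCAL_SERVER_PORT are the README's variable names; the helper name
# and the `/v1` suffix are assumptions, not bfcl_eval's actual code.
import os

def local_server_base_url(env=None) -> str:
    env = os.environ if env is None else env
    endpoint = env.get("LOCAL_SERVER_ENDPOINT", "localhost")
    port = env.get("LOCAL_SERVER_PORT", "1053")
    return f"http://{endpoint}:{port}/v1"

print(local_server_base_url({}))  # → http://localhost:1053/v1
```

Passing `env` explicitly keeps the helper testable without mutating the real process environment.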
@@ -312,9 +317,9 @@ For detailed steps, please see the [Contributing Guide](./CONTRIBUTING.md).

 ## Additional Resources

-- [Gorilla Discord](https://discord.gg/grXXvj9Whz) (`#leaderboard` channel)
-- [Project Website](https://gorilla.cs.berkeley.edu/)
+- [Discord](https://discord.gg/grXXvj9Whz) (`#leaderboard` channel)
+- [Project Website](https://gorilla.cs.berkeley.edu/leaderboard.html#leaderboard)

 All the leaderboard statistics, and data used to train the models are released under Apache 2.0.
-Gorilla is an open source effort from UC Berkeley and we welcome contributors.
-Please email us your comments, criticisms, and questions. More information about the project can be found at [https://gorilla.cs.berkeley.edu/](https://gorilla.cs.berkeley.edu/)
+BFCL is an open source effort from UC Berkeley and we welcome contributors.
+For any comments, criticisms, or questions, please feel free to raise an issue or a PR. You can also reach us via [email](mailto:huanzhimao@berkeley.edu).
