
Final Phase Result Submission Now Open!

  • The final round dataset is now available! Please download curebench_testset_phase2.jsonl from the Data section on Kaggle.
  • All data in this new release is private, meaning no results will be shown before the deadline.
  • You may still use the first-round test set for evaluation to receive immediate feedback and continue refining your model.
  • After the competition concludes, we will collect valid submissions for curebench_testset_phase2.jsonl and perform offline evaluation to determine the final rankings. Evaluation metrics will be announced along with the final results.
  • Please ensure that you submit your results for curebench_testset_phase2.jsonl before the final deadline, and that your meta_file includes all required information as outlined in our GitHub repository.

CURE-Bench Starter Kit

Project Page · Q&A

A simple inference framework for the CURE-Bench biomedical AI competition. This starter kit provides an easy-to-use interface for generating submission data in CSV format.

Updates

2025.08.08: Q&A page: We have created a Q&A page to share all our responses to participants' questions, ensuring a fair competition.

2025.09.10: Added starterkit code and tutorials for running GPT-OSS-20B, OpenAI’s 20B open-weight reasoning model.

Quick Start

Install Dependencies

pip install -r requirements.txt

Baseline Setup

If you want to use the ChatGPT baseline:

  1. Set up your Azure OpenAI resource
  2. Configure environment variables:
export AZURE_OPENAI_API_KEY_O1="your-api-key"
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
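Once these variables are set, the baseline can reach your Azure OpenAI resource. As a quick sanity check, here is a minimal sketch using the official openai Python package (the api_version and the deployment name gpt-4o-1120 are assumptions; substitute the values from your own resource):

import os
from openai import AzureOpenAI  # pip install openai

# Reads the same environment variables configured above
client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY_O1"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2024-06-01",  # assumption: use a version your resource supports
)

response = client.chat.completions.create(
    model="gpt-4o-1120",  # assumption: your deployment name may differ
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
)
print(response.choices[0].message.content)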

If you want to use open-weight models (e.g., Qwen, GPT-OSS-20B) locally, ensure you have sufficient GPU memory:

# Install CUDA-compatible PyTorch if needed
# pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers
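To confirm the local setup works, here is a minimal sketch with Hugging Face transformers (the Qwen checkpoint below is illustrative; any causal LM you have access to will do):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"  # illustrative; swap in the model you plan to evaluate

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",  # keep the checkpoint's native precision
    device_map="auto",   # spread layers across available GPUs
)

inputs = tokenizer("What drug class does metformin belong to?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))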

📁 Project Structure

├── eval_framework.py      # Main evaluation framework
├── dataset_utils.py       # Dataset loading utilities
├── run.py                 # Command-line evaluation script
├── metadata_config.json   # Example metadata configuration
├── requirements.txt       # Python dependencies
└── competition_results/   # Output directory for your results

Dataset Preparation

Download the val and test datasets from the Kaggle competition page:

https://www.kaggle.com/competitions/cure-bench
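If you prefer the command line, the Kaggle CLI can fetch the files directly (assuming the kaggle package is installed and your API token is configured under ~/.kaggle/):

pip install kaggle
kaggle competitions download -c cure-bench
unzip cure-bench.zip -d data/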

For the val set, configure the dataset in your metadata_config_val.json file with the following structure:

{
  "dataset": {
    "dataset_name": "cure_bench_pharse_1",
    "dataset_path": "/path/to/your/curebench_valset.jsonl",
    "description": "CureBench 2025 val questions"
  }
}

For the test set, configure the dataset in your metadata_config_test.json file with the following structure:

{
  "dataset": {
    "dataset_name": "cure_bench_pharse_1",
    "dataset_path": "/path/to/your/curebench_testset.jsonl",
    "description": "CureBench 2025 test questions"
  }
}
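Both files use the JSON Lines format: one JSON object per line. A minimal sketch for inspecting the first few records (the path is the placeholder from the config above; the field names are whatever your download actually contains):

import json

path = "/path/to/your/curebench_testset.jsonl"  # placeholder path from the config above

with open(path, "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        print(record.keys())  # inspect the available fields
        if i >= 2:            # first three records are enough for a sanity check
            break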

Usage Examples

Basic Evaluation with Config File

# Run with configuration file (recommended)
python run.py --config metadata_config_test.json

🔧 Configuration

Metadata Configuration

Create a metadata_config_val.json file. Below is an example:

{
  "metadata": {
    "model_name": "gpt-4o-1120",
    "model_type": "ChatGPTModel",
    "track": "internal_reasoning",
    "base_model_type": "API",
    "base_model_name": "gpt-4o-1120",
    "dataset": "cure_bench_phase_1",
    "additional_info": "Zero-shot ChatGPT run",
    "average_tokens_per_question": "",
    "average_tools_per_question": "",
    "tool_category_coverage": ""
  },
  "dataset": {
    "dataset_name": "cure_bench_phase_1",
    "dataset_path": "/path/to/curebench_valset.jsonl",
    "description": "CureBench 2025 val questions"
  },
  "output_dir": "competition_results",
  "output_file": "submission.csv"
}

Notes:

  • Other API models and open-weight models (e.g., Qwen) can be used in the same way
  • For a fine-tuned model (e.g., GPT-OSS-20B), replace "model_name" with your fine-tuned checkpoint, e.g.:
"model_name": "myuser/gpt-oss-20b-curebench-ft"

Required Metadata Fields

  • model_name: Display name of your model
  • track: Either "internal_reasoning" or "agentic_reasoning"
  • base_model_type: Either "API" or "OpenWeighted"
  • base_model_name: Name of the underlying model
  • dataset: Name of the dataset

Note: You can leave the following fields empty for the first round of submissions: additional_info, average_tokens_per_question, average_tools_per_question, and tool_category_coverage. Please ensure these fields are filled in for the final submission.

Question Type Support

The framework handles three distinct question types:

  1. Multiple Choice: Questions with lettered options (A, B, C, D, E)
  2. Open-ended Multiple Choice: Open-ended questions converted to multiple choice format
  3. Open-ended: Free-form text answers
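For the multiple-choice types, the submission expects a single option letter, so pipelines typically extract it from the raw model output. A hypothetical helper (not part of the starter kit) might look like this:

import re

def extract_choice(raw_answer: str) -> str | None:
    # Pull a standalone option letter (A-E) out of free-form model output.
    # Hypothetical helper for illustration; the starter kit may parse answers differently.
    match = re.search(r"\b([A-E])\b", raw_answer.upper())
    return match.group(1) if match else None

print(extract_choice("The correct answer is B."))  # -> "B"
print(extract_choice("I cannot tell."))            # -> None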

Output Format

The framework generates submission files in CSV format, packaged in a zip together with the metadata. The CSV structure includes:

  • id: Question identifier
  • prediction: Model's answer (choice for multiple choice, text for open-ended)
  • reasoning: Model's reasoning process
  • choice: The selected option letter for multiple-choice questions

The metadata structure (example):

{
  "meta_data": {
    "model_name": "gpt-4o-1120",
    "track": "internal_reasoning",
    "model_type": "ChatGPTModel",
    "base_model_type": "API", 
    "base_model_name": "gpt-4o-1120",
    "dataset": "cure_bench_pharse_1",
    "additional_info": "Zero-shot ChatGPT run",
    "average_tokens_per_question": "",
    "average_tools_per_question": "",
    "tool_category_coverage": ""
  }
}
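Since the final submission must bundle the CSV and the metadata into one zip, here is a minimal packaging sketch (the filenames and zip layout are assumptions; match whatever your run of run.py actually writes to competition_results/):

import json
import zipfile

csv_path = "competition_results/submission.csv"  # assumption: your output filename may differ
meta_path = "metadata_config_test.json"          # reuse the metadata block from your config

with open(meta_path, "r", encoding="utf-8") as f:
    meta = {"meta_data": json.load(f)["metadata"]}

with zipfile.ZipFile("competition_results/submission.zip", "w") as zf:
    zf.write(csv_path, arcname="submission.csv")
    zf.writestr("meta_data.json", json.dumps(meta, indent=2))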

Model Tutorials

Support

For issues and questions:

  1. Check the error messages (they're usually helpful!)
  2. Ensure all dependencies are installed
  3. Review the examples in this README
  4. Open a GitHub issue.

Happy competing!

About

CUREBench @ NeurIPS 2025: Benchmarking AI reasoning for therapeutic decision-making at scale
