- The final round dataset is now available! Please download `curebench_testset_phase2.jsonl` from the Data section on Kaggle.
- All data in this new release is private, meaning no results will be shown before the deadline.
- You may still use the first-round test set for evaluation to receive immediate feedback and continue refining your model.
- After the competition concludes, we will collect valid submissions for `curebench_testset_phase2.jsonl` and perform offline evaluation to determine the final rankings. Evaluation metrics will be announced along with the final results.
- Please ensure that you submit your results for `curebench_testset_phase2.jsonl` before the final deadline, and that your `meta_file` includes all required information as outlined in our GitHub repository.
A simple inference framework for the CURE-Bench bio-medical AI competition. This starter kit provides an easy-to-use interface for generating submission data in CSV format.
2025.08.08: Question & Answer page: We have created a Q&A page to share all our responses to questions from participants, ensuring fair competition.
2025.09.10: Added starter kit code and tutorials for running GPT-OSS-20B, OpenAI's 20B open-weight reasoning model.
pip install -r requirements.txt

If you want to use the ChatGPT baseline:
- Set up your Azure OpenAI resource
- Configure environment variables:
export AZURE_OPENAI_API_KEY_O1="your-api-key"
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"

If you want to use open-weight models (e.g., Qwen, GPT-OSS-20B), ensure you have sufficient GPU memory for local inference (a quick check is sketched after the project layout below):
# Install CUDA-compatible PyTorch if needed
# pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers

├── eval_framework.py # Main evaluation framework
├── dataset_utils.py # Dataset loading utilities
├── run.py # Command-line evaluation script
├── metadata_config.json # Example metadata configuration
├── requirements.txt # Python dependencies
└── competition_results/ # Output directory for your results
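If you plan to run local open-weight models, it can help to confirm that a CUDA device with enough memory is visible before downloading checkpoints. Below is a minimal sketch using PyTorch; the 40 GiB threshold is only an illustrative assumption for a 20B model in bf16, not an official requirement:

```python
# Quick sanity check before loading a large local model
import torch

if not torch.cuda.is_available():
    print("No CUDA device found; local open-weight models will be extremely slow on CPU.")
else:
    for idx in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(idx)
        total_gb = props.total_memory / 1024**3
        print(f"GPU {idx}: {props.name}, {total_gb:.1f} GiB total memory")
        # Illustrative assumption: a 20B-parameter model in bf16 needs roughly 40 GiB
        if total_gb < 40:
            print("  -> likely too small for a 20B model without quantization or offloading")
```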
Download the validation and test datasets from the Kaggle site:
https://www.kaggle.com/competitions/cure-bench
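Once downloaded, it can be useful to take a quick look at the JSONL files before pointing the config at them. Below is a minimal sketch; the `question_type` field name is an assumption based on the question types described later in this README, so check the actual keys in the file:

```python
# Peek at a downloaded JSONL file; adjust the path to your local copy
import json
from collections import Counter

path = "curebench_valset.jsonl"  # hypothetical local filename

with open(path, "r", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"{len(records)} questions loaded")
print("Keys in first record:", sorted(records[0].keys()))
# 'question_type' is an assumed field name; replace it with whatever the file actually uses
print(Counter(r.get("question_type", "unknown") for r in records))
```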
For the validation set, configure the dataset in your `metadata_config_val.json` file with the following structure:
{
  "dataset": {
    "dataset_name": "cure_bench_phase_1",
    "dataset_path": "/path/to/your/curebench_valset.jsonl",
    "description": "CureBench 2025 val questions"
  }
}

For the test set, configure the dataset in your `metadata_config_test.json` file with the following structure:
{
  "dataset": {
    "dataset_name": "cure_bench_phase_1",
    "dataset_path": "/path/to/your/curebench_testset.jsonl",
    "description": "CureBench 2025 test questions"
  }
}

# Run with configuration file (recommended)
python run.py --config metadata_config_test.json

Create a `metadata_config_val.json` file. Below is an example:
{
"metadata": {
"model_name": "gpt-4o-1120",
"model_type": "ChatGPTModel",
"track": "internal_reasoning",
"base_model_type": "API",
"base_model_name": "gpt-4o-1120",
"dataset": "cure_bench_phase_1",
"additional_info": "Zero-shot ChatGPT run",
"average_tokens_per_question": "",
"average_tools_per_question": "",
"tool_category_coverage": ""
},
"dataset": {
"dataset_name": "cure_bench_phase_1",
"dataset_path": "/path/to/curebench_valset.jsonl",
"description": "CureBench 2025 val questions"
},
"output_dir": "competition_results",
"output_file": "submission.csv"
}

Notes:
- Other API models and open-weight models (e.g. Qwen) can be used in the same way
- For a fine-tuned model (e.g. GPT-OSS-20B), replace `"model_name"` with your fine-tuned checkpoint, e.g. `"model_name": "myuser/gpt-oss-20b-curebench-ft"`

Field descriptions:
- `model_name`: Display name of your model
- `track`: Either "internal_reasoning" or "agentic_reasoning"
- `base_model_type`: Either "API" or "OpenWeighted"
- `base_model_name`: Name of the underlying model
- `dataset`: Name of the dataset
Note: You can leave the following fields empty for the first round of submissions:
`additional_info`, `average_tokens_per_question`, `average_tools_per_question`, and `tool_category_coverage`.
Please ensure these fields are filled for the final submission.
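Before the final submission, you may want to sanity-check that none of the required metadata fields are still empty. Below is a minimal sketch; the required-field list simply mirrors the example config above, not an official schema:

```python
# Warn about empty metadata fields before the final submission
import json
import sys

REQUIRED_FOR_FINAL = [
    "model_name", "track", "base_model_type", "base_model_name", "dataset",
    "additional_info", "average_tokens_per_question",
    "average_tools_per_question", "tool_category_coverage",
]

config_path = sys.argv[1] if len(sys.argv) > 1 else "metadata_config_test.json"
with open(config_path, "r", encoding="utf-8") as f:
    metadata = json.load(f)["metadata"]

missing = [key for key in REQUIRED_FOR_FINAL if not str(metadata.get(key, "")).strip()]
if missing:
    print("Empty or missing metadata fields:", ", ".join(missing))
else:
    print("All metadata fields are filled in.")
```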
The framework handles three distinct question types:
- Multiple Choice: Questions with lettered options (A, B, C, D, E)
- Open-ended Multiple Choice: Open-ended questions converted to multiple choice format
- Open-ended: Free-form text answers
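For the two multiple-choice formats, raw model output typically has to be mapped back to a single option letter. Below is a minimal sketch of one way to do this; the framework's own answer parsing may differ:

```python
# Map free-form model output to an option letter A-E
import re

def extract_choice(model_output: str) -> str | None:
    """Return the first standalone option letter (A-E) found in the output, if any."""
    # Handles patterns like "Answer: C", "(C)", or a bare "C"
    match = re.search(r"\b(?:answer\s*(?:is)?[:\s]*)?\(?([A-E])\)?\b", model_output, re.IGNORECASE)
    return match.group(1).upper() if match else None

print(extract_choice("The correct answer is (C) because ..."))  # -> C
print(extract_choice("I am not sure."))                         # -> None
```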
The framework generates submission files in CSV format with a zip package containing metadata. The CSV structure includes:
- `id`: Question identifier
- `prediction`: Model's answer (choice for multiple choice, text for open-ended)
- `reasoning`: Model's reasoning process
- `choice`: The selected choice for multiple-choice questions
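If you generate predictions outside the provided framework, you still need a CSV with exactly these columns. Below is a minimal sketch using the standard library; the rows are hypothetical, and the authoritative schema is whatever `run.py` writes:

```python
# Build a submission CSV with the columns described above
import csv
import os

# Hypothetical predictions; in practice these come from your model run
rows = [
    {"id": "q_0001", "prediction": "B", "reasoning": "Drug X inhibits ...", "choice": "B"},
    {"id": "q_0002", "prediction": "Supportive care and hydration.", "reasoning": "Open-ended ...", "choice": ""},
]

os.makedirs("competition_results", exist_ok=True)
with open("competition_results/submission.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "prediction", "reasoning", "choice"])
    writer.writeheader()
    writer.writerows(rows)
```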
The metadata structure (example):
{
"meta_data": {
"model_name": "gpt-4o-1120",
"track": "internal_reasoning",
"model_type": "ChatGPTModel",
"base_model_type": "API",
"base_model_name": "gpt-4o-1120",
"dataset": "cure_bench_pharse_1",
"additional_info": "Zero-shot ChatGPT run",
"average_tokens_per_question": "",
"average_tools_per_question": "",
"tool_category_coverage": ""
}
}

- Step-by-step tutorial for running OpenAI's open-weight 20B model on CUREBench: tutorials/gpt-oss-20b/tutorial_gptoss20b.md
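The framework normally assembles the zip package for you. If you ever need to rebuild it by hand, below is a minimal sketch of bundling the CSV and metadata together; the file names inside the archive are assumptions, so mirror whatever `run.py` actually produces:

```python
# Bundle the submission CSV and metadata into a single zip for upload
import json
import zipfile

meta = {
    "meta_data": {
        "model_name": "gpt-4o-1120",
        "track": "internal_reasoning",
        # ... remaining fields as shown in the metadata example above ...
    }
}

with zipfile.ZipFile("competition_results/submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("competition_results/submission.csv", arcname="submission.csv")
    # "meta_data.json" is an assumed archive name
    zf.writestr("meta_data.json", json.dumps(meta, indent=2))
```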
For issues and questions:
- Check the error messages (they're usually helpful!)
- Ensure all dependencies are installed
- Review the examples in this README
- Open a GitHub Issue.
Happy competing!