add nejm ai benchmark #2063
Conversation
opencompass/datasets/nejmaibench.py (Outdated)

        return dataset

    class nejmaibenchEvaluator(BaseEvaluator):
Capitalize the class name (CapWords).
Do not upload CSV files directly into the OpenCompass repository; see https://opencompass.readthedocs.io/zh-cn/latest/advanced_guides/new_dataset.html
Following https://opencompass.readthedocs.io/zh-cn/latest/advanced_guides/new_dataset.html, add the dataset to dataset-index.yml
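A minimal sketch of the rename the reviewer requested (CapWords class names per Python convention). `BaseEvaluator` is stubbed here standing in for the OpenCompass base class, and the `score` signature and accuracy logic are illustrative assumptions, not the PR's actual implementation:

```python
class BaseEvaluator:
    # Stub for illustration; in OpenCompass this base class is imported
    # from the evaluator module, not redefined.
    pass


class NejmaibenchEvaluator(BaseEvaluator):  # was: nejmaibenchEvaluator
    """Scores exact-match accuracy between predictions and references."""

    def score(self, predictions, references):
        if len(predictions) != len(references):
            return {'error': 'predictions and references differ in length'}
        correct = sum(p.strip() == r.strip()
                      for p, r in zip(predictions, references))
        return {'accuracy': 100 * correct / len(references)}
```

For example, `NejmaibenchEvaluator().score(['A', 'B'], ['A', 'C'])` yields an accuracy of 50.0.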
The requested changes have been made; please review again.
NEJM_All_Questions_And_Answers.csv |
Resolved the merge conflicts.
LGTM
Squash commit message:
* support nejm ai benchmark
* add dataset files
* revise gen name
* revise gen name
* revise class name & remove csv file & add dataset-index.yml info
* update
* update

Co-authored-by: MaiziXiao <[email protected]>
Motivation
We would like to add support for the NEJM-AI Benchmark (https://huggingface.co/datasets/SeanWu25/NEJM-AI_Benchmarking_Medical_Language_Models) in OpenCompass. This will enable systematic evaluation of existing LLMs on clinical question-answering tasks drawn from New England Journal of Medicine articles, driving advances in model performance in the medical domain and promoting broader adoption of LLMs in scientific research.
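As a rough illustration of the parsing step such a dataset loader performs, CSV rows can be mapped to question/answer examples. The column names `Question` and `Answer` are assumptions for illustration; the actual schema of the benchmark CSV may differ:

```python
import csv
import io


def rows_to_examples(csv_text):
    """Parse CSV text into a list of {'question', 'answer'} dicts.

    Column names 'Question'/'Answer' are assumed, not confirmed
    against the real NEJM-AI benchmark file.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{'question': row['Question'], 'answer': row['Answer']}
            for row in reader]


sample = "Question,Answer\nWhich drug treats condition X?,DrugY\n"
examples = rows_to_examples(sample)
```

Here `examples` holds one dict whose `answer` field is `DrugY`.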
Modification
- datasets/nejmaibench.py defining NEJMAIBenchmarkDataset, including download, parsing, and example generation.
- datasets/__init__.py updated to register the new dataset.
- configs/datasets/nejm_ai_benchmark/nejmaibench_gen.py to define prompt templates and generation settings for zero-shot and few-shot evaluation.
- configs/datasets/nejm_ai_benchmark/nejmaibench_llmjudge_gen.py to configure pairwise LLMJudge evaluation for the benchmark.
- configs/datasets/nejm_ai_benchmark/ containing all configuration files needed to run both generation and LLMJudge pipelines on NEJM-AI.
- The data file of this benchmark: NEJM_All_Questions_And_Answers.csv
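As a hedged sketch only, an OpenCompass-style dataset config for the generation pipeline might follow the usual abbr/type/path-plus-reader shape shown below. The class name, path, and column names here are illustrative assumptions, not the merged PR's actual config:

```python
# Hypothetical dataset-config fragment in the common OpenCompass shape.
# Values are placeholders for illustration, not the PR's real settings.
nejmaibench_datasets = [
    dict(
        abbr='nejmaibench',
        type='NejmaibenchDataset',          # assumed registered dataset class
        path='opencompass/nejmaibench',     # placeholder dataset path
        reader_cfg=dict(
            input_columns=['question'],     # assumed input column
            output_column='answer',         # assumed reference column
        ),
    )
]
```

Evaluation runs would then reference `nejmaibench_datasets` from a top-level config.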
Checklist
Before PR:
After PR: