add nejm ai benchmark #2063
Conversation
opencompass/datasets/nejmaibench.py (Outdated)

        return dataset

    class nejmaibenchEvaluator(BaseEvaluator):
Capitalize the class name (CapWords).
Do not upload CSV files directly into the OpenCompass repository; see https://opencompass.readthedocs.io/zh-cn/latest/advanced_guides/new_dataset.html
Following https://opencompass.readthedocs.io/zh-cn/latest/advanced_guides/new_dataset.html, add the dataset to dataset-index.yml
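A minimal sketch of the rename the reviewer requested (CapWords class names per Python convention). `BaseEvaluator` is stubbed here standing in for the OpenCompass base class, and the `score` signature and accuracy logic are illustrative assumptions, not the PR's actual implementation:

```python
class BaseEvaluator:
    # Stub for illustration; in OpenCompass this base class is imported
    # from the evaluator module, not redefined.
    pass


class NejmaibenchEvaluator(BaseEvaluator):  # was: nejmaibenchEvaluator
    """Scores exact-match accuracy between predictions and references."""

    def score(self, predictions, references):
        if len(predictions) != len(references):
            return {'error': 'predictions and references differ in length'}
        correct = sum(p.strip() == r.strip()
                      for p, r in zip(predictions, references))
        return {'accuracy': 100 * correct / len(references)}
```

For example, `NejmaibenchEvaluator().score(['A', 'B'], ['A', 'C'])` yields an accuracy of 50.0.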
The requested changes have been made; please review again.
NEJM_All_Questions_And_Answers.csv |
Resolved the merge conflicts.
LGTM
Squash commit message:
* support nejm ai benchmark
* add dataset files
* revise gen name
* revise gen name
* revise class name & remove csv file & add dataset-index.yml info
* update
* update

Co-authored-by: MaiziXiao <[email protected]>
Motivation
We would like to add support for the NEJM-AI Benchmark (https://huggingface.co/datasets/SeanWu25/NEJM-AI_Benchmarking_Medical_Language_Models) in OpenCompass. This will enable systematic evaluation of existing LLMs on clinical question-answering tasks drawn from New England Journal of Medicine articles, driving advances in model performance in the medical domain and promoting broader adoption of LLMs in scientific research.
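As a rough illustration of the parsing step such a dataset loader performs, CSV rows can be mapped to question/answer examples. The column names `Question` and `Answer` are assumptions for illustration; the actual schema of the benchmark CSV may differ:

```python
import csv
import io


def rows_to_examples(csv_text):
    """Parse CSV text into a list of {'question', 'answer'} dicts.

    Column names 'Question'/'Answer' are assumed, not confirmed
    against the real NEJM-AI benchmark file.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{'question': row['Question'], 'answer': row['Answer']}
            for row in reader]


sample = "Question,Answer\nWhich drug treats condition X?,DrugY\n"
examples = rows_to_examples(sample)
```

Here `examples` holds one dict whose `answer` field is `DrugY`.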
Modification
- datasets/nejmaibench.py defining NEJMAIBenchmarkDataset, including download, parsing, and example generation.
- datasets/__init__.py updated to register the new dataset.
- configs/datasets/nejm_ai_benchmark/nejmaibench_gen.py to define prompt templates and generation settings for zero-shot and few-shot evaluation.
- configs/datasets/nejm_ai_benchmark/nejmaibench_llmjudge_gen.py to configure pairwise LLMJudge evaluation for the benchmark.
- configs/datasets/nejm_ai_benchmark/ containing all configuration files needed to run both generation and LLMJudge pipelines on NEJM-AI.
- The data file of this benchmark: NEJM_All_Questions_And_Answers.csv
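As a hedged sketch only, an OpenCompass-style dataset config for the generation pipeline might follow the usual abbr/type/path-plus-reader shape shown below. The class name, path, and column names here are illustrative assumptions, not the merged PR's actual config:

```python
# Hypothetical dataset-config fragment in the common OpenCompass shape.
# Values are placeholders for illustration, not the PR's real settings.
nejmaibench_datasets = [
    dict(
        abbr='nejmaibench',
        type='NejmaibenchDataset',          # assumed registered dataset class
        path='opencompass/nejmaibench',     # placeholder dataset path
        reader_cfg=dict(
            input_columns=['question'],     # assumed input column
            output_column='answer',         # assumed reference column
        ),
    )
]
```

Evaluation runs would then reference `nejmaibench_datasets` from a top-level config.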
Checklist
Before PR:
After PR: