@mar-cry mar-cry commented Apr 29, 2025

Motivation

We would like to add support for the NEJM-AI Benchmark (https://huggingface.co/datasets/SeanWu25/NEJM-AI_Benchmarking_Medical_Language_Models) in OpenCompass. This will enable systematic evaluation of existing LLMs on clinical question-answering tasks drawn from New England Journal of Medicine articles, driving advances in model performance in the medical domain and promoting broader adoption of LLMs in scientific research.

Modification

  • New dataset implementation
    • Added datasets/nejmaibench.py defining NEJMAIBenchmarkDataset, including download, parsing, and example generation.
    • Updated datasets/__init__.py to register the new dataset.
  • Generation script
    • Created configs/datasets/nejm_ai_benchmark/nejmaibench_gen.py to define prompt templates and generation settings for zero-shot and few-shot evaluation.
  • LLMJUDGE config
    • Added configs/datasets/nejm_ai_benchmark/nejmaibench_llmjudge_gen.py to configure pairwise LLMJudge evaluation for the benchmark.
  • Config registration
    • Introduced a new folder configs/datasets/nejm_ai_benchmark/ containing all configuration files needed to run both generation and LLMJudge pipelines on NEJM-AI.
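As a rough illustration of the "parsing and example generation" step listed above, here is a minimal standalone sketch. It is not the code in datasets/nejmaibench.py: the column names (`Question`, `A` to `D`, `Answer`), the helper names `parse_examples` and `build_prompt`, and the sample row are all assumptions made up for this example, and the real CSV header may differ.

```python
import csv
import io

# Hypothetical column names; the real CSV header may differ.
QUESTION_COL = "Question"
ANSWER_COL = "Answer"
OPTION_COLS = ("A", "B", "C", "D")


def parse_examples(csv_text):
    """Turn raw CSV text into a list of example dicts."""
    examples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Keep only the option columns that are present and non-empty.
        options = {k: row[k].strip() for k in OPTION_COLS if row.get(k)}
        examples.append({
            "question": row[QUESTION_COL].strip(),
            "options": options,
            "label": row[ANSWER_COL].strip().upper(),
        })
    return examples


def build_prompt(example):
    """Render a zero-shot multiple-choice prompt for one example."""
    lines = [f"Question: {example['question']}"]
    lines += [f"{key}. {text}" for key, text in example["options"].items()]
    lines.append("Answer:")
    return "\n".join(lines)


# Invented sample row, used only to exercise the functions above.
SAMPLE = (
    "Question,A,B,C,D,Answer\n"
    "Which vitamin deficiency causes scurvy?,Vitamin A,Vitamin C,Vitamin D,Vitamin K,b\n"
)

examples = parse_examples(SAMPLE)
print(build_prompt(examples[0]))
```

In the actual PR the prompt template lives in the config file rather than in the dataset class; the sketch merges the two only to keep the example short.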

Dataset file for this benchmark:
NEJM_All_Questions_And_Answers.csv

Checklist

Before PR:

  • [√] Pre-commit or other linting tools are used to fix the potential lint issues.
  • [√] Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests.
  • [√] The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  • [√] The documentation has been modified accordingly, like docstring or example tutorials.

After PR:

  • [√] If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects.
  • [√] CLA has been signed and all committers have signed the CLA in this PR.

return dataset


class nejmaibenchEvaluator(BaseEvaluator):

Class names should start with an uppercase letter.
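For reference, a small sketch of what a CapWords-named evaluator could look like. The name `NejmaibenchEvaluator` is a hypothetical choice (the thread does not state the final name), `BaseEvaluator` here is a local stand-in rather than the OpenCompass class, and the letter-matching logic and the A to E option range are simplifying assumptions, not the PR's implementation.

```python
import re


class BaseEvaluator:
    """Stand-in for the framework base class so the sketch runs on its own."""

    def score(self, predictions, references):
        raise NotImplementedError


class NejmaibenchEvaluator(BaseEvaluator):  # CapWords, not nejmaibenchEvaluator
    """Accuracy based on the first standalone uppercase option letter."""

    # Assumes options are labelled A to E; adjust to the real answer set.
    _LETTER = re.compile(r"\b([A-E])\b")

    def score(self, predictions, references):
        if len(predictions) != len(references):
            return {"error": "predictions and references differ in length"}
        correct = 0
        for pred, ref in zip(predictions, references):
            match = self._LETTER.search(pred)
            if match and match.group(1) == ref.strip().upper():
                correct += 1
        total = len(references)
        return {"accuracy": 100.0 * correct / total if total else 0.0}
```

Calling `NejmaibenchEvaluator().score(["The answer is B"], ["B"])` returns an accuracy of 100.0 under these assumptions.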

@MaiziXiao

Do not upload the CSV file directly into the OC (OpenCompass) repository; see https://opencompass.readthedocs.io/zh-cn/latest/advanced_guides/new_dataset.html

@mar-cry

mar-cry commented May 7, 2025

I have made the requested changes. Could you please review again?

@mar-cry

mar-cry commented May 7, 2025

NEJM_All_Questions_And_Answers.csv
The dataset file for this benchmark.

@mar-cry

mar-cry commented May 8, 2025

Resolved the merge conflicts.

@MaiziXiao

  1. Used generic_llmjudge_postprocess in place of the dataset-specific custom llmjudge_postprocess.
  2. Uploaded the relevant dataset files to the OC OSS.
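To illustrate the general idea behind a shared judge post-processing step, here is a minimal standalone sketch. It is not OpenCompass's generic_llmjudge_postprocess: the function names `parse_verdict` and `summarize` are hypothetical, and the convention that the judge replies "A" for a correct answer and "B" for an incorrect one is an assumption made for this example.

```python
import re


def parse_verdict(judge_output):
    """Return 'A' (correct), 'B' (incorrect), or None if no verdict is found."""
    # Assumed convention: the judge emits a standalone uppercase A or B.
    match = re.search(r"\b([AB])\b", judge_output)
    return match.group(1) if match else None


def summarize(judge_outputs):
    """Aggregate per-sample verdicts into an accuracy percentage."""
    verdicts = [parse_verdict(text) for text in judge_outputs]
    total = len(verdicts)
    correct = sum(v == "A" for v in verdicts)
    return {
        "accuracy": 100.0 * correct / total if total else 0.0,
        "unparsed": sum(v is None for v in verdicts),
    }
```

Keeping this logic generic means each dataset config only has to supply its judge prompt, which is the motivation for dropping a per-dataset post-processor.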

@MaiziXiao MaiziXiao left a comment

LGTM

@MaiziXiao MaiziXiao merged commit a685ed7 into open-compass:main May 8, 2025
8 checks passed
stephen-nju pushed a commit to stephen-nju/opencompass that referenced this pull request May 14, 2025
* support nejm ai benchmark

* add dataset files

* revise gen name

* revise gen name

* revise class name & remove csv file & add dataset-index.yml info

* update

* update

---------

Co-authored-by: MaiziXiao <[email protected]>