Add PHYBench #2125

suencgo · 2025-05-27T08:46:38Z

## Motivation
This PR adds support for the PHYBench dataset to the OpenCompass framework. PHYBench is a benchmark designed for evaluating large language models on symbolic physics problems with structured LaTeX answers. The goal is to enable high-fidelity evaluation of models' reasoning capabilities in physics using expression-level symbolic comparison.

## Modification
Added PhyBenchDataset for loading the dataset from a local JSON file.

Implemented a custom evaluator MathEEDEvaluator using the EED (Expression Edit Distance) metric for symbolic similarity.

Integrated three utility files: EED.py, extended_zss.py, and latex_pre_process.py, which are used by the evaluator to process and compare symbolic math expressions.

Registered the dataset and evaluator in phybench_gen.py under the configs/datasets/PHYBench directory.

Configured the dataset's metadata in datasets_info.py to support local loading.
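For readers unfamiliar with expression-level comparison, here is a minimal sketch of the general idea: parse both answers into expression trees, then measure a tree edit distance relative to the size of the reference tree. It is not the code in EED.py or extended_zss.py. It assumes Python-syntax expressions instead of LaTeX, uses a simple top-down tree edit distance (whole-subtree inserts and deletes) as a stand-in for the evaluator's actual algorithm, and the names `to_tree`, `size`, `distance`, and `relative_distance` are hypothetical.

```python
import ast


def to_tree(expr):
    """Parse a Python-syntax arithmetic expression into (label, children) tuples.

    Assumption: the real evaluator parses LaTeX; Python syntax is a stand-in.
    """
    def convert(node):
        if isinstance(node, ast.BinOp):
            return (type(node.op).__name__,
                    [convert(node.left), convert(node.right)])
        if isinstance(node, ast.UnaryOp):
            return (type(node.op).__name__, [convert(node.operand)])
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            return (node.func.id, [convert(arg) for arg in node.args])
        if isinstance(node, ast.Name):
            return (node.id, [])
        if isinstance(node, ast.Constant):
            return (repr(node.value), [])
        raise ValueError(f"unsupported syntax: {ast.dump(node)}")

    return convert(ast.parse(expr, mode="eval").body)


def size(tree):
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree[1])


def distance(a, b):
    """Top-down tree edit distance.

    Relabeling a node costs 1; inserting or deleting a subtree costs its size.
    Child lists are aligned with a standard sequence edit-distance table.
    """
    relabel = 0 if a[0] == b[0] else 1
    xs, ys = a[1], b[1]
    dp = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        dp[i][0] = dp[i - 1][0] + size(xs[i - 1])
    for j in range(1, len(ys) + 1):
        dp[0][j] = dp[0][j - 1] + size(ys[j - 1])
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(xs[i - 1]),                    # delete subtree
                dp[i][j - 1] + size(ys[j - 1]),                    # insert subtree
                dp[i - 1][j - 1] + distance(xs[i - 1], ys[j - 1]), # match children
            )
    return relabel + dp[-1][-1]


def relative_distance(prediction, reference):
    """Edit distance divided by the size of the reference tree."""
    ref_tree = to_tree(reference)
    return distance(to_tree(prediction), ref_tree) / size(ref_tree)


if __name__ == "__main__":
    # One wrong symbol in a five-node reference tree.
    print(relative_distance("m*g*x", "m*g*h"))
```

A relative distance of 0 means the trees are identical, and a small nonzero value means the prediction is structurally close to the reference. How the merged evaluator turns such a distance into a final score is not shown here.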

## Checklist

Before PR:

Pre-commit or other linting tools are used to fix the potential lint issues.
Bug fixes are fully covered by unit tests; the case that causes the bug should be added to the unit tests.
The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
The documentation has been modified accordingly, like docstring or example tutorials.

After PR:

If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects.
CLA has been signed and all committers have signed the CLA in this PR.
PHYBench-fullques_v1.json

MaiziXiao

LGTM

add phybench

811161c

mm-assistant bot assigned liushz May 27, 2025

suencgo temporarily deployed to prod June 4, 2025 07:16 — with GitHub Actions Inactive

phybench fix

8cb2b28

MaiziXiao approved these changes Jun 4, 2025


Myhs-phz approved these changes Jun 4, 2025


Myhs-phz temporarily deployed to prod June 4, 2025 08:42 — with GitHub Actions Inactive

update

ed20293

MaiziXiao had a problem deploying to prod June 4, 2025 10:14 — with GitHub Actions Failure

update

a6d706d

MaiziXiao temporarily deployed to prod June 4, 2025 10:35 — with GitHub Actions Inactive

MaiziXiao merged commit 80ec846 into open-compass:main Jun 4, 2025
8 checks passed
