
@suencgo suencgo commented May 27, 2025

## Motivation
This PR adds support for the PHYBench dataset to the OpenCompass framework. PHYBench is a benchmark designed for evaluating large language models on symbolic physics problems with structured LaTeX answers. The goal is to enable high-fidelity evaluation of models' reasoning capabilities in physics using expression-level symbolic comparison.

## Modification
Added PhyBenchDataset for loading the dataset from a local JSON file.
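
As a rough illustration of local JSON loading (not the actual `PhyBenchDataset` code), the sketch below reads a list of records from a file and maps them to input/target pairs. The function name `load_phybench_records` and the key names `content` and `answer` are assumptions made for this example; the real schema of the attached JSON file may differ.

```python
import json
from pathlib import Path


def load_phybench_records(path, question_key="content", answer_key="answer"):
    """Read a local JSON file that holds a list of problem records.

    The key names are hypothetical defaults, not a confirmed schema.
    """
    with Path(path).open(encoding="utf-8") as f:
        raw = json.load(f)

    records = []
    for item in raw:
        # Skip entries missing either field instead of guessing a default.
        if question_key not in item or answer_key not in item:
            continue
        records.append({"input": item[question_key], "target": item[answer_key]})
    return records
```

Passing the key names as parameters keeps the sketch usable if the file uses different field names.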

Implemented a custom evaluator, MathEEDEvaluator, which scores symbolic similarity with the EED (Expression Edit Distance) metric.

Integrated three utility files: EED.py, extended_zss.py, and latex_pre_process.py, which are used by the evaluator to process and compare symbolic math expressions.
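
To give a feel for expression-level comparison, here is a minimal sketch of a unit-cost edit distance between ordered expression trees. It is not the code in EED.py or extended_zss.py, which may use different operations and costs; the helper names (`node`, `tree_edit_distance`, `tree_similarity`) and the normalization by the larger tree size are assumptions chosen for illustration.

```python
from functools import lru_cache


def node(label, *children):
    """Build an immutable ordered tree as (label, (child, ...))."""
    return (label, tuple(children))


def _size(forest):
    """Count all nodes in a forest (a tuple of trees)."""
    return sum(1 + _size(children) for _, children in forest)


@lru_cache(maxsize=None)
def _forest_distance(f1, f2):
    """Unit-cost edit distance between two ordered forests."""
    if not f1:
        return _size(f2)  # insert every remaining node
    if not f2:
        return _size(f1)  # delete every remaining node
    (l1, c1), (l2, c2) = f1[-1], f2[-1]
    # Delete the root of the last tree in f1; its children move up.
    delete = _forest_distance(f1[:-1] + c1, f2) + 1
    # Insert the root of the last tree in f2.
    insert = _forest_distance(f1, f2[:-1] + c2) + 1
    # Match the two roots, relabeling if the labels differ.
    match = (_forest_distance(c1, c2)
             + _forest_distance(f1[:-1], f2[:-1])
             + (l1 != l2))
    return min(delete, insert, match)


def tree_edit_distance(t1, t2):
    return _forest_distance((t1,), (t2,))


def tree_similarity(pred, ref):
    """Hypothetical normalization: 1.0 for identical trees, clamped at 0.0."""
    d = tree_edit_distance(pred, ref)
    return max(0.0, 1.0 - d / max(_size((pred,)), _size((ref,))))


# a + b versus a + c differ by one relabeled leaf.
pred = node("+", node("a"), node("b"))
ref = node("+", node("a"), node("c"))
print(tree_edit_distance(pred, ref))  # 1
```

A tree-based distance gives partial credit to answers that are structurally close to the reference, which a plain string match cannot do.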

Registered the dataset and evaluator in phybench_gen.py under the configs/datasets/PHYBench directory.

Configured the dataset's metadata in datasets_info.py to support local loading.

## Checklist

Before PR:

  • Pre-commit or other linting tools are used to fix potential lint issues.
  • Bug fixes are fully covered by unit tests; the case that causes the bug should be added to the unit tests.
  • The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  • The documentation has been modified accordingly, e.g. docstrings or example tutorials.

After PR:

  • If the modification could affect downstream or other related projects, this PR should be tested with those projects.
  • CLA has been signed and all committers have signed the CLA in this PR.
    PHYBench-fullques_v1.json


@MaiziXiao MaiziXiao left a comment


LGTM

@MaiziXiao MaiziXiao merged commit 80ec846 into open-compass:main Jun 4, 2025
8 checks passed