## Motivation
This PR adds support for the PHYBench dataset to the OpenCompass framework. PHYBench is a benchmark designed for evaluating large language models on symbolic physics problems with structured LaTeX answers. The goal is to enable high-fidelity evaluation of models' reasoning capabilities in physics using expression-level symbolic comparison.
## Modification
- Added `PhyBenchDataset` for loading the dataset from a local JSON file.
- Implemented a custom evaluator, `MathEEDEvaluator`, which uses the EED (Expression Edit Distance) metric to score symbolic similarity.
- Integrated three utility files, `EED.py`, `extended_zss.py`, and `latex_pre_process.py`, which the evaluator uses to preprocess and compare symbolic math expressions.
- Registered the dataset and evaluator in `phybench_gen.py` under the `configs/datasets/PHYBench` directory.
- Configured the dataset's metadata in `datasets_info.py` to support local loading.
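
For reviewers unfamiliar with edit-distance scoring on expression trees, the idea behind the evaluator can be sketched roughly as follows. This is a minimal, self-contained illustration, not the code in `EED.py` or `extended_zss.py`: the tuple-based tree representation, the helper names (`leaf`, `tree_size`, `forest_distance`, `eed_style_score`), and the scoring curve (full marks for an exact match, partial credit that falls off linearly with relative distance, zero beyond a cutoff) are all assumptions made for the example. It also uses plain unit-cost node edits and omits LaTeX parsing and any subtree-level discounts.

```python
from functools import lru_cache

# Hypothetical representation: a tree is (label, children),
# where children is a tuple of trees. Not the PR's actual data structure.


def leaf(label):
    """Build a node with no children."""
    return (label, ())


def tree_size(tree):
    """Count the nodes in a tree."""
    _, children = tree
    return 1 + sum(tree_size(child) for child in children)


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f and not g:
        return 0
    if not f:
        return sum(tree_size(t) for t in g)  # insert everything in g
    if not g:
        return sum(tree_size(t) for t in f)  # delete everything in f
    (v_label, v_children), (w_label, w_children) = f[-1], g[-1]
    # Delete the rightmost root of f; its children move up into the forest.
    delete = forest_distance(f[:-1] + v_children, g) + 1
    # Insert the rightmost root of g.
    insert = forest_distance(f, g[:-1] + w_children) + 1
    # Match the two roots, relabelling if the labels differ.
    match = (
        forest_distance(v_children, w_children)
        + forest_distance(f[:-1], g[:-1])
        + (v_label != w_label)
    )
    return min(delete, insert, match)


def eed_style_score(gt, pred, base=60.0, cutoff=0.6):
    """Map relative tree distance to a 0-100 score (assumed curve)."""
    distance = forest_distance((gt,), (pred,))
    if distance == 0:
        return 100.0
    relative = distance / tree_size(gt)
    return base - 100.0 * relative if relative < cutoff else 0.0


# Reference answer 2*m*g versus a prediction that dropped the factor 2.
gt = ("Mul", (leaf("2"), leaf("m"), leaf("g")))
pred = ("Mul", (leaf("m"), leaf("g")))
print(eed_style_score(gt, pred))
```

The point of scoring this way is that a prediction that is structurally close to the reference answer earns partial credit, instead of being marked as a plain miss by a binary exact-match check. The `base` and `cutoff` values are placeholders and should be checked against the thresholds actually used in `EED.py`.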
PHYBench-fullques_v1.json