[Feature] Add a new example config for intern-s1 benchmarks #2220

Open · wants to merge 8 commits into base: main

Changes from 6 commits
1 change: 1 addition & 0 deletions README.md
@@ -57,6 +57,7 @@ Just like a compass guides us on our journey, OpenCompass will guide you through

## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>

- **\[2025.07.26\]** OpenCompass now supports Intern-S1 related general and scientific evaluation benchmarks. Please check [Intern-S1 Configs](examples/eval_bench_intern_s1.py) for more details! 🔥🔥🔥
- **\[2025.04.01\]** OpenCompass now supports `CascadeEvaluator`, a flexible evaluation mechanism that allows multiple evaluators to work in sequence. This enables creating customized evaluation pipelines for complex assessment scenarios. Check out the [documentation](docs/en/advanced_guides/llm_judge.md) for more details! 🔥🔥🔥
- **\[2025.03.11\]** We now support evaluation for `SuperGPQA`, a great benchmark for measuring LLM knowledge ability 🔥🔥🔥
- **\[2025.02.28\]** We have added a tutorial for `DeepSeek-R1` series model, please check [Evaluating Reasoning Model](docs/en/user_guides/deepseek_r1.md) for more details! 🔥🔥🔥
1 change: 1 addition & 0 deletions README_zh-CN.md
@@ -57,6 +57,7 @@

## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>

- **\[2025.07.26\]** OpenCompass now supports Intern-S1 related general and scientific evaluation benchmarks. See the [Intern-S1 benchmark configs](examples/eval_bench_intern_s1.py) for details! 🔥🔥🔥
- **\[2025.04.01\]** OpenCompass now supports `CascadeEvaluator`, which lets multiple evaluators work in sequence and enables custom evaluation pipelines for more complex scenarios. See the [documentation](docs/zh_cn/advanced_guides/llm_judge.md) for usage details! 🔥🔥🔥
- **\[2025.03.11\]** `SuperGPQA`, a knowledge benchmark covering 285 graduate-level disciplines, is now supported. Give it a try! 🔥🔥🔥
- **\[2025.02.28\]** We have added a tutorial for the `DeepSeek-R1` series of models, please check [Evaluating Reasoning Models](docs/zh_cn/user_guides/deepseek_r1.md) for more details! 🔥🔥🔥
170 changes: 170 additions & 0 deletions examples/eval_bench_intern_s1.py
@@ -0,0 +1,170 @@
# flake8: noqa

from mmengine.config import read_base

from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask


#######################################################################
# PART 0 Essential Configs #
#######################################################################
with read_base():
    # Datasets
    from opencompass.configs.datasets.aime2025.aime2025_cascade_eval_gen_5e9f4f import aime2025_datasets
    from opencompass.configs.datasets.gpqa.gpqa_cascade_eval_gen_772ea0 import (
        gpqa_datasets,
    )
    from opencompass.configs.datasets.mmlu_pro.mmlu_pro_0shot_nocot_genericllmeval_gen_08c1de import (
        mmlu_pro_datasets,
    )
    from opencompass.configs.datasets.IFEval.IFEval_gen_353ae7 import (
        ifeval_datasets,
    )
    from opencompass.configs.datasets.SmolInstruct.smolinstruct_0shot_instruct_gen import (
        smolinstruct_datasets_0shot_instruct as smolinstruct_datasets,
    )
    from opencompass.configs.datasets.ChemBench.ChemBench_llmjudge_gen_c584cf import (
        chembench_datasets,
    )
    from opencompass.configs.datasets.matbench.matbench_llm_judge_gen_0e9276 import (
        matbench_datasets,
    )
    from opencompass.configs.datasets.ProteinLMBench.ProteinLMBench_llmjudge_gen_a67965 import (
        proteinlmbench_datasets,
    )

    # Summary Groups
    from opencompass.configs.summarizers.groups.mmlu_pro import (
        mmlu_pro_summary_groups,
    )

    # Models
    from opencompass.configs.models.qwen3.lmdeploy_qwen3_0_6b import (
        models as qwen3_model,
    )
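# Note: read_base() makes the variables defined in the imported configs
# available in this module's namespace, which is what the locals() scans
# below rely on to collect every *_datasets, *_summary_groups and *_model
# variable automatically.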

#######################################################################
# PART 1 Datasets List #
#######################################################################
# Datasets list for evaluation: collect every imported `*_datasets` variable.
datasets = sum(
    (v for k, v in locals().items() if k.endswith('_datasets')), [])

# LLM judge config: using an LLM to evaluate predictions. Left empty here;
# fill in your judge model config before running
# (see docs/en/advanced_guides/llm_judge.md).
judge_cfg = dict()
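# A minimal sketch of what judge_cfg might contain, assuming an
# OpenAI-compatible judge endpoint; the values below are placeholders,
# not part of this PR:
#
#   from opencompass.models import OpenAISDK
#   judge_cfg = dict(
#       abbr='my-judge-model',                        # hypothetical judge name
#       type=OpenAISDK,
#       path='YOUR_JUDGE_MODEL_NAME',                 # model served by the endpoint
#       key='YOUR_API_KEY',
#       openai_api_base='http://localhost:8000/v1',   # hypothetical endpoint
#       query_per_second=8,
#       batch_size=128,
#       temperature=0.001,
#       max_out_len=8192,
#   )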

for item in datasets:
    if 'judge_cfg' in item['eval_cfg']['evaluator']:
        item['eval_cfg']['evaluator']['judge_cfg'] = judge_cfg
    if ('llm_evaluator' in item['eval_cfg']['evaluator']
            and 'judge_cfg' in item['eval_cfg']['evaluator']['llm_evaluator']):
        item['eval_cfg']['evaluator']['llm_evaluator']['judge_cfg'] = judge_cfg


#######################################################################
# PART 2 Dataset Summarizer #
#######################################################################

summary_groups = sum(
    [v for k, v in locals().items() if k.endswith('_summary_groups')], []
)

summary_groups.extend(
    [
        {
            'name': 'ChemBench',
            'subsets': [
                'ChemBench_Name_Conversion',
                'ChemBench_Property_Prediction',
                'ChemBench_Mol2caption',
                'ChemBench_Caption2mol',
                'ChemBench_Product_Prediction',
                'ChemBench_Retrosynthesis',
                'ChemBench_Yield_Prediction',
                'ChemBench_Temperature_Prediction',
            ],
        },
    ]
)

summarizer = dict(
    dataset_abbrs=[
        'Knowledge',
        ['mmlu_pro', 'accuracy'],
        '',
        'Instruction Following',
        ['IFEval', 'Prompt-level-strict-accuracy'],
        '',
        'General Reasoning',
        ['GPQA_diamond', 'accuracy (8 runs average)'],
        ['GPQA_diamond', 'G-Pass@8_0.0'],
        '',
        'Math Calculation',
        ['aime2025', 'accuracy (32 runs average)'],
        ['aime2025', 'G-Pass@32_0.0'],
        '',
        'Academic',
        ['ChemBench', 'naive_average'],
        ['ProteinLMBench', 'accuracy'],
        '',
        'SmolInstruct',
        ['NC-I2F-0shot-instruct', 'score'],
        ['NC-I2S-0shot-instruct', 'score'],
        ['NC-S2F-0shot-instruct', 'score'],
        ['NC-S2I-0shot-instruct', 'score'],
        ['PP-ESOL-0shot-instruct', 'score'],
        ['PP-Lipo-0shot-instruct', 'score'],
        ['PP-BBBP-0shot-instruct', 'accuracy'],
        ['PP-ClinTox-0shot-instruct', 'accuracy'],
        ['PP-HIV-0shot-instruct', 'accuracy'],
        ['PP-SIDER-0shot-instruct', 'accuracy'],
        ['MC-0shot-instruct', 'score'],
        ['MG-0shot-instruct', 'score'],
        ['FS-0shot-instruct', 'score'],
        ['RS-0shot-instruct', 'score'],
        '',
        ['matbench_expt_gap', 'mae'],
        ['matbench_steels', 'mae'],
        ['matbench_expt_is_metal', 'accuracy'],
        ['matbench_glass', 'accuracy'],
        '',
    ],
    summary_groups=summary_groups,
)
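# Each entry in `dataset_abbrs` is either a bare string (used here as a section
# label, or a blank separator row in the case of '') or a [dataset_abbr, metric]
# pair selecting which metric of that dataset to report.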

#######################################################################
# PART 3 Models List #
#######################################################################

models = sum([v for k, v in locals().items() if k.endswith('_model')], [])

#######################################################################
# PART 4 Inference/Evaluation Configuration #
#######################################################################

# infer with local runner
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
    runner=dict(
        type=LocalRunner,
        max_num_workers=16,
        retry=0,  # Modify if needed
        task=dict(type=OpenICLInferTask),
    ),
)

# eval with local runner
eval = dict(
    partitioner=dict(type=NaivePartitioner, n=10),
    runner=dict(
        type=LocalRunner,
        max_num_workers=16,
        task=dict(type=OpenICLEvalTask),
    ),
)

#######################################################################
# PART 5 Utils Configuration #
#######################################################################

work_dir = './outputs/oc_bench_intern_s1'
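# A typical way to launch this config (assuming the usual OpenCompass setup)
# is from the repository root:
#
#   python run.py examples/eval_bench_intern_s1.py
#
# or, with the installed CLI entry point:
#
#   opencompass examples/eval_bench_intern_s1.py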