1 change: 1 addition & 0 deletions README.md
@@ -57,6 +57,7 @@ Just like a compass guides us on our journey, OpenCompass will guide you through

## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>

- **\[2025.04.01\]** OpenCompass now supports `CascadeEvaluator`, a flexible evaluation mechanism that allows multiple evaluators to work in sequence. This enables creating customized evaluation pipelines for complex assessment scenarios. Check out the [documentation](docs/en/advanced_guides/llm_judge.md) for more details! 🔥🔥🔥
- **\[2025.03.11\]** We now support evaluation for `SuperGPQA`, a benchmark for measuring LLM knowledge across 285 graduate-level disciplines 🔥🔥🔥
- **\[2025.02.28\]** We have added a tutorial for the `DeepSeek-R1` series models; please check [Evaluating Reasoning Model](docs/en/user_guides/deepseek_r1.md) for more details! 🔥🔥🔥
- **\[2025.02.15\]** We have added two powerful evaluation tools: `GenericLLMEvaluator` for LLM-as-judge evaluations and `MATHEvaluator` for mathematical reasoning assessments. Check out the documentation for [LLM Judge](docs/en/advanced_guides/llm_judge.md) and [Math Evaluation](docs/en/advanced_guides/general_math.md) for more details! 🔥🔥🔥
3 changes: 2 additions & 1 deletion README_zh-CN.md
@@ -57,8 +57,9 @@

## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>

- **\[2025.04.01\]** OpenCompass now supports `CascadeEvaluator`, which lets multiple evaluators work in sequence and makes it possible to build custom evaluation pipelines for more complex scenarios. See the [documentation](docs/zh_cn/advanced_guides/llm_judge.md) for usage details! 🔥🔥🔥
- **\[2025.03.11\]** We now support `SuperGPQA`, a knowledge benchmark covering 285 graduate-level disciplines. Feel free to try it! 🔥🔥🔥
- **\[2025.02.28\]** We have added a tutorial for the `DeepSeek-R1` series models; please check [Evaluating Reasoning Model](docs/en/user_guides/deepseek_r1.md) for more details! 🔥🔥🔥
- **\[2025.02.28\]** We have added a tutorial for the `DeepSeek-R1` series models; please check [Evaluating Reasoning Model](docs/zh_cn/user_guides/deepseek_r1.md) for more details! 🔥🔥🔥
- **\[2025.02.15\]** We have added two practical evaluation tools: `GenericLLMEvaluator` for LLM-as-judge evaluations and `MATHEvaluator` for mathematical reasoning assessments. See the [LLM Judge](docs/zh_cn/advanced_guides/llm_judge.md) and [Math Evaluation](docs/zh_cn/advanced_guides/general_math.md) documentation for more details! 🔥🔥🔥
- **\[2025.01.16\]** We now support the [InternLM3-8B-Instruct](https://huggingface.co/internlm/internlm3-8b-instruct) model, which achieves the best performance among models of comparable size on reasoning and knowledge tasks. Feel free to try it.
- **\[2024.12.17\]** We provide the December CompassAcademic leaderboard evaluation script [CompassAcademic](configs/eval_academic_leaderboard_202412.py); with a simple configuration you can reproduce the official evaluation results.
106 changes: 104 additions & 2 deletions docs/en/advanced_guides/llm_judge.md
@@ -49,7 +49,7 @@ export OC_JUDGE_API_BASE=http://172.30.56.1:4000/v1

Note that by default, OpenCompass will use these three environment variables, but if you use configuration files to configure the evaluation service, these environment variables will not take effect.

### ### Using LLM for Evaluation via Configuration Files
### Using LLM for Evaluation via Configuration Files

To set up an LLM judge evaluation, you'll need to configure three main components:

@@ -264,6 +264,108 @@ Example evaluation output:
}
```

## CascadeEvaluator

OpenCompass also provides a CascadeEvaluator that combines the strengths of rule-based evaluation and LLM-based evaluation. The cascade evaluator has two modes:

1. **Cascade Mode (parallel=False)**: First evaluates all samples with a rule-based evaluator, then only sends samples that were deemed incorrect by the rule-based evaluation to an LLM judge for re-evaluation. This approach reduces reliance on LLM judgments while maintaining accuracy, thus lowering evaluation costs and time.

2. **Parallel Mode (parallel=True)**: Evaluates all samples with both the rule-based evaluator and LLM judge, then considers a sample correct if either method marks it as correct. This approach can increase the leniency of evaluation but may result in higher costs since all samples require LLM evaluation.
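
For intuition, here is a minimal, illustrative sketch of how the two modes combine per-sample results. It is not the actual `CascadeEvaluator` implementation; `rule_correct` and `llm_judge` are hypothetical stand-ins for the rule evaluator's verdicts and the LLM judge call.

```python
# Illustrative only: how cascade vs. parallel mode combine per-sample verdicts.
def combine_results(rule_correct, llm_judge, parallel=False):
    """rule_correct: list[bool] produced by the rule-based evaluator.
    llm_judge: callable(sample_index) -> bool, a hypothetical LLM-judge call.
    """
    final = list(rule_correct)
    for i, ok in enumerate(rule_correct):
        if parallel:
            # Parallel mode: every sample also goes to the LLM judge; either pass counts.
            final[i] = ok or llm_judge(i)
        elif not ok:
            # Cascade mode: only rule failures are re-evaluated by the LLM judge.
            final[i] = llm_judge(i)
    return final
```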

### Configuring CascadeEvaluator

Here's an example of how to configure the CascadeEvaluator:

```python
# Imports for the evaluators used below (import paths as in examples/eval_cascade_evaluator.py)
from opencompass.evaluator import CascadeEvaluator, GenericLLMEvaluator
from opencompass.openicl.icl_evaluator import MATHEvaluator
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.datasets import generic_llmjudge_postprocess

# Define a rule-based evaluator
rule_evaluator = dict(type=MATHEvaluator)

# Define an LLM judge evaluator
llm_judge_evaluator = dict(
    type=GenericLLMEvaluator,
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            begin=[
                dict(
                    role='SYSTEM',
                    fallback_role='HUMAN',
                    prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
                )
            ],
            round=[
                dict(role='HUMAN', prompt=YOUR_JUDGE_TEMPLATE),
            ],
        ),
    ),
    dataset_cfg=dict(
        type=YourDataset,
        path='path/to/your/dataset',
        reader_cfg=reader_cfg,
    ),
    judge_cfg=dict(),  # Can use environment variables to configure the judge model
    dict_postprocessor=dict(type=generic_llmjudge_postprocess),
)

# Configure cascade evaluator (cascade mode)
cascade_evaluator = dict(
    type=CascadeEvaluator,
    # Review question (Collaborator): Where can I import this CascadeEvaluator?
    # Answer: from opencompass.evaluator (see the imports at the top of this example).
    llm_evaluator=llm_judge_evaluator,
    rule_evaluator=rule_evaluator,
    parallel=False  # Cascade mode
)

# For parallel mode, set parallel=True
parallel_evaluator = dict(
    type=CascadeEvaluator,
    llm_evaluator=llm_judge_evaluator,
    rule_evaluator=rule_evaluator,
    parallel=True  # Parallel mode
)

# Use the cascade evaluator in your dataset evaluation config
eval_cfg = dict(evaluator=cascade_evaluator)
```
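
In the configuration above, `YOUR_JUDGE_TEMPLATE` is a placeholder for your grading prompt. As an illustration, a shortened judge prompt in the style of the `GRADER_TEMPLATE` from `examples/eval_cascade_evaluator.py` (shown later in this PR) might look like:

```python
# Illustrative judge prompt; see GRADER_TEMPLATE in examples/eval_cascade_evaluator.py
# for the full grading criteria. {problem}, {solution} and {prediction} are filled in
# from the dataset fields and the model prediction.
YOUR_JUDGE_TEMPLATE = """
Judge whether the candidate's final answer is consistent with the standard answer.
Grade the predicted answer as one of:
A: CORRECT
B: INCORRECT
Just return the letter "A" or "B", with no text around it.

<Original Question Begin>: {problem} <Original Question End>
<Gold Target Begin>: {solution} <Gold Target End>
<Predicted Answer Begin>: {prediction} <Predicted End>
""".strip()
```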

### Evaluation Results

The cascade evaluator outputs detailed evaluation statistics including:

- Accuracy of the rule-based evaluation
- Accuracy of the LLM evaluation (for samples that failed rule-based evaluation in cascade mode)
- Final combined accuracy

Example output:

```python
{
    'accuracy': 85.0,  # Final accuracy
    'cascade_stats': {
        'total_samples': 100,
        'rule_correct': 70,  # Number of samples correct by rule evaluation
        'rule_accuracy': 70.0,  # Accuracy of rule evaluation
        'llm_evaluated': 30,  # Number of samples evaluated by LLM (failed samples in cascade mode)
        'llm_correct': 15,  # Number of samples correct by LLM evaluation
        'llm_accuracy': 50.0,  # Accuracy of LLM evaluation
        'final_correct': 85,  # Total correct samples
        'final_accuracy': 85.0,  # Final accuracy
        'parallel_mode': False,  # Whether parallel mode was used
    },
    'details': [
        # Detailed evaluation results for each sample
    ]
}
```
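
To make the bookkeeping explicit: in cascade mode only rule failures are sent to the LLM judge, and the final tally is the rule passes plus the LLM passes. A quick sanity check against the numbers above:

```python
# Sanity check of the cascade-mode statistics shown above (illustrative).
stats = {'total_samples': 100, 'rule_correct': 70, 'llm_evaluated': 30, 'llm_correct': 15}

# Only samples the rule evaluator marked wrong are sent to the LLM judge.
assert stats['llm_evaluated'] == stats['total_samples'] - stats['rule_correct']

# The final count is the union of rule passes and LLM passes on the rule failures.
final_correct = stats['rule_correct'] + stats['llm_correct']      # 85
final_accuracy = 100.0 * final_correct / stats['total_samples']   # 85.0
```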

The cascade evaluator is particularly useful for:

1. Scenarios that require balancing evaluation cost and accuracy
2. Cases where rule-based evaluators are available but might not be comprehensive
3. Evaluation tasks that need more nuanced judgment for edge cases

## Complete Example

For a complete working example, refer to the `eval_llm_judge.py` file in the examples directory, which demonstrates how to evaluate mathematical problem-solving using an LLM judge.
For a complete working example using GenericLLMEvaluator, refer to the `eval_llm_judge.py` file in the examples directory, which demonstrates how to evaluate mathematical problem-solving with an LLM judge.

For a complete working example using CascadeEvaluator, refer to the `eval_cascade_evaluator.py` file in the examples directory, which demonstrates how to evaluate mathematical problem-solving with the cascade evaluator.
103 changes: 102 additions & 1 deletion docs/zh_cn/advanced_guides/llm_judge.md
@@ -263,6 +263,107 @@ GenericLLMEvaluator is designed for evaluating model outputs using an LLM as a judge.
}
```

## CascadeEvaluator

OpenCompass also provides a `CascadeEvaluator`, which combines the strengths of rule-based evaluation and LLM-based evaluation. The cascade evaluator has two modes:

1. **Cascade Mode (parallel=False)**: First evaluates all samples with a rule-based evaluator, then sends only the samples judged incorrect by the rule-based evaluation to an LLM judge for re-evaluation. This reduces reliance on LLM judgments while maintaining accuracy, lowering evaluation cost and time.

2. **Parallel Mode (parallel=True)**: Evaluates all samples with both the rule-based evaluator and the LLM judge, and counts a sample as correct if either evaluator marks it correct. This makes the evaluation more lenient, but may incur higher cost because every sample requires LLM evaluation.

### Configuring CascadeEvaluator

Here is an example of how to configure the `CascadeEvaluator`:

```python
# Define a rule-based evaluator
rule_evaluator = dict(type=MATHEvaluator)

# Define an LLM judge evaluator
llm_judge_evaluator = dict(
    type=GenericLLMEvaluator,
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            begin=[
                dict(
                    role='SYSTEM',
                    fallback_role='HUMAN',
                    prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
                )
            ],
            round=[
                dict(role='HUMAN', prompt=YOUR_JUDGE_TEMPLATE),
            ],
        ),
    ),
    dataset_cfg=dict(
        type=YourDataset,
        path='path/to/your/dataset',
        reader_cfg=reader_cfg,
    ),
    judge_cfg=dict(),  # Can use environment variables to configure the judge model
    dict_postprocessor=dict(type=generic_llmjudge_postprocess),
)

# Configure the cascade evaluator (cascade mode)
cascade_evaluator = dict(
    type=CascadeEvaluator,
    llm_evaluator=llm_judge_evaluator,
    rule_evaluator=rule_evaluator,
    parallel=False  # Cascade mode
)

# For parallel mode, set parallel=True
parallel_evaluator = dict(
    type=CascadeEvaluator,
    llm_evaluator=llm_judge_evaluator,
    rule_evaluator=rule_evaluator,
    parallel=True  # Parallel mode
)

# Use the cascade evaluator in your dataset evaluation config
eval_cfg = dict(evaluator=cascade_evaluator)
```

### Evaluation Results

The cascade evaluator outputs detailed evaluation statistics, including:

- Accuracy of the rule-based evaluation
- Accuracy of the LLM evaluation (for samples that failed the rule-based evaluation)
- Final combined accuracy

Example output:

```python
{
    'accuracy': 85.0,  # Final accuracy
    'cascade_stats': {
        'total_samples': 100,
        'rule_correct': 70,  # Number of samples marked correct by the rule evaluation
        'rule_accuracy': 70.0,  # Accuracy of the rule evaluation
        'llm_evaluated': 30,  # Number of samples evaluated by the LLM (rule failures in cascade mode)
        'llm_correct': 15,  # Number of samples marked correct by the LLM evaluation
        'llm_accuracy': 50.0,  # Accuracy of the LLM evaluation
        'final_correct': 85,  # Total number of correct samples
        'final_accuracy': 85.0,  # Final accuracy
        'parallel_mode': False,  # Whether parallel mode was used
    },
    'details': [
        # Detailed evaluation results for each sample
    ]
}
```

The cascade evaluator is particularly useful for:

1. Scenarios that require balancing evaluation cost and accuracy
2. Cases where a rule-based evaluator is available but may not be comprehensive
3. Evaluation tasks that need more precise judgment for edge cases

## Complete Example

For a complete working example, refer to the `eval_llm_judge.py` file in the examples directory, which demonstrates how to evaluate mathematical problem-solving using an LLM judge.
To see how to evaluate mathematical problem-solving with an LLM judge, refer to the `eval_llm_judge.py` file in the examples directory.

To see how to evaluate mathematical problem-solving with the cascade evaluator, refer to the `eval_cascade_evaluator.py` file in the examples directory.
127 changes: 127 additions & 0 deletions examples/eval_cascade_evaluator.py
@@ -0,0 +1,127 @@

from mmengine.config import read_base

from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.evaluator import GenericLLMEvaluator, CascadeEvaluator
from opencompass.datasets import generic_llmjudge_postprocess
from opencompass.openicl.icl_evaluator import MATHEvaluator
from opencompass.datasets import (
MATHDataset,
math_postprocess_v2,
normalize_final_answer,
)
#######################################################################
# PART 0 Essential Configs #
#######################################################################

with read_base():
    # Models
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_7b_instruct import (
        models as lmdeploy_qwen2_5_7b_instruct_model,
    )

reader_cfg = dict(input_columns=['problem'], output_column='solution')

infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role='HUMAN',
prompt='{problem}\nPlease reason step by step, and put your final answer within \\boxed{}.',
),
]
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)

########################## Evaluator #################################
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.

Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.

Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.

Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.


<Original Question Begin>: \n{problem}\n<Original Question End>\n\n
<Gold Target Begin>: \n{solution}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n

Judging the correctness of candidates' answers:
""".strip()

llm_judge_evaluator = dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=MATHDataset,
path='opencompass/math',
file_name='test_prm800k_500.json',
),
judge_cfg=dict(),
)

rule_evaluator = dict(type=MATHEvaluator)
cascade_evaluator = dict(
    type=CascadeEvaluator,
    llm_evaluator=llm_judge_evaluator,
    rule_evaluator=rule_evaluator,
    parallel=False,
)
########################## #################################
eval_cfg = dict()

# eval_cfg['evaluator'] = rule_evaluator
# eval_cfg['evaluator'] = llm_judge_evaluator
eval_cfg['evaluator'] = cascade_evaluator

math_datasets = [
dict(
abbr='math_prm800k_500',
type=MATHDataset,
path='opencompass/math',
file_name='test_prm800k_500.json',
reader_cfg=reader_cfg,
infer_cfg=infer_cfg,
eval_cfg=eval_cfg,
)
]


datasets = math_datasets
models = lmdeploy_qwen2_5_7b_instruct_model


work_dir = 'math_prm800k_500_cascade_evaluator'