
[Fix] Set correct paths for the examples #2198

Open
wants to merge 1 commit into base: main
2 changes: 1 addition & 1 deletion docs/en/advanced_guides/circular_eval.md
@@ -110,4 +110,4 @@ summarizer = dict(
)

For more complex evaluation examples, refer to this sample code: https://github.com/open-compass/opencompass/tree/main/configs/eval_circular.py
For more complex evaluation examples, refer to this sample code: https://github.com/open-compass/opencompass/tree/main/examples/eval_circular.py
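As background, CircularEval evaluates each multiple-choice question under circular rotations of its options and, in its strict form, only counts a question as solved when every rotation is answered correctly. A toy sketch of generating the rotated variants (illustrative only, not OpenCompass's implementation):

```python
def circular_variants(question: str, options: list[str]) -> list[tuple[str, list[str]]]:
    """Return one copy of the question per circular rotation of its options."""
    variants = []
    for shift in range(len(options)):
        rotated = options[shift:] + options[:shift]
        variants.append((question, rotated))
    return variants

# Example: a 4-option question yields 4 rotated variants (ABCD, BCDA, CDAB, DABC)
for _, opts in circular_variants("2 + 2 = ?", ["3", "4", "5", "6"]):
    print(opts)
```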
4 changes: 2 additions & 2 deletions docs/en/advanced_guides/code_eval.md
@@ -52,7 +52,7 @@ We also need model responses with randomness, thus setting the `generation_kwargs`

Note: `num_return_sequences` must be greater than or equal to k, as pass@k itself is a probability estimate.

You can specifically refer to the following configuration file [configs/eval_code_passk.py](https://github.com/open-compass/opencompass/blob/main/configs/eval_code_passk.py)
You can specifically refer to the following configuration file [examples/eval_code_passk.py](https://github.com/open-compass/opencompass/blob/main/examples/eval_code_passk.py)
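As background, pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval, which is why at least k samples per problem are needed; a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c passed the tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 sampled completions, 3 correct, estimate pass@5
print(pass_at_k(10, 3, 5))  # ~0.917
```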

### For Models That Do Not Support Multiple Responses

@@ -101,4 +101,4 @@ For `mbpp`, modify the `type`, `eval_cfg.evaluator.type`, `reader_cfg.output_column`

We also need model responses with randomness, thus setting the `generation_kwargs` parameter is necessary.

You can specifically refer to the following configuration file [configs/eval_code_passk_repeat_dataset.py](https://github.com/open-compass/opencompass/blob/main/configs/eval_code_passk_repeat_dataset.py)
You can specifically refer to the following configuration file [examples/eval_code_passk_repeat_dataset.py](https://github.com/open-compass/opencompass/blob/main/examples/eval_code_passk_repeat_dataset.py)
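For sampling-based pass@k, the model entry typically enables stochastic decoding and requests k return sequences. A hedged sketch of such a `generation_kwargs` block (the exact fields accepted depend on the model wrapper; the linked example config is authoritative):

```python
# Illustrative values only -- tune them to your model and budget.
generation_kwargs = dict(
    do_sample=True,           # enable stochastic decoding
    top_p=0.95,               # nucleus sampling
    temperature=0.8,          # randomness across the k samples
    num_return_sequences=10,  # must be >= k for the pass@k estimate
)
```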
2 changes: 1 addition & 1 deletion docs/en/advanced_guides/code_eval_service.md
@@ -62,7 +62,7 @@ When the model inference and code evaluation services are running on the same host

### Configuration File

We provide [the configuration file](https://github.com/open-compass/opencompass/blob/main/configs/eval_codegeex2.py) of using `humanevalx` for evaluation on `codegeex2` as reference.
We provide [the configuration file](https://github.com/open-compass/opencompass/blob/main/examples/eval_codegeex2.py) of using `humanevalx` for evaluation on `codegeex2` as reference.

The dataset and related post-processing configuration files can be found at this [link](https://github.com/open-compass/opencompass/tree/main/configs/datasets/humanevalx); pay attention to the `evaluator` field in `humanevalx_eval_cfg_dict`.
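The `evaluator` field is what points OpenCompass at the running code-evaluation service; a rough sketch of its shape (the type name and field names below are assumptions for illustration, the linked dataset configs hold the real ones):

```python
# Hypothetical evaluator entry -- check configs/datasets/humanevalx for the
# actual class and parameters used by the project.
humanevalx_eval_cfg_dict = dict(
    evaluator=dict(
        type='HumanevalXEvaluator',  # assumed evaluator class name
        ip_address='localhost',      # host running the code eval service
        port=5001,                   # port the service listens on
        language='python',           # language split being scored
    ),
)
```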

2 changes: 1 addition & 1 deletion docs/en/advanced_guides/contamination_eval.md
@@ -72,7 +72,7 @@ will report the accuracy or perplexity of ceval on subsets composed of these three

- If the performance of the three is relatively close, the contamination level of the model on that test set is light; otherwise, it is heavy.

The following configuration file can be referenced [link](https://github.com/open-compass/opencompass/blob/main/configs/eval_contamination.py):
The following configuration file can be referenced [link](https://github.com/open-compass/opencompass/blob/main/examples/eval_contamination.py):

```python
from mmengine.config import read_base
...
```
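The comparison rule stated above is easy to express in code; a toy illustration over the three subset scores (the 0.02 tolerance is arbitrary, chosen only to make the idea concrete):

```python
def contamination_level(acc_clean: float, acc_input: float, acc_input_label: float,
                        tol: float = 0.02) -> str:
    """Close scores across the three subsets suggest light contamination."""
    scores = (acc_clean, acc_input, acc_input_label)
    spread = max(scores) - min(scores)
    return 'light' if spread <= tol else 'heavy'

print(contamination_level(0.48, 0.49, 0.62))  # 'heavy': labels were likely seen
```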
2 changes: 1 addition & 1 deletion docs/en/advanced_guides/evaluation_lightllm.md
@@ -63,7 +63,7 @@ else:
### Step-2: Evaluate the above model using OpenCompass.

```shell
python run.py configs/eval_lightllm.py
python run.py examples/eval_lightllm.py
```

You are expected to get the evaluation results after the inference and evaluation.
2 changes: 1 addition & 1 deletion docs/en/advanced_guides/needleinahaystack_eval.md
@@ -89,7 +89,7 @@ For other models, it is recommended to write your own config file (such as `exam
You can then run evaluation with:

```bash
python run.py configs/eval_needlebench_v2.py --slurm -p partition_name -q reserved --max-num-workers 16
python run.py examples/eval_needlebench_v2.py --slurm -p partition_name -q reserved --max-num-workers 16
```

No need to manually specify `--dataset`, `--models`, or `--summarizer` again.
2 changes: 1 addition & 1 deletion docs/en/advanced_guides/objective_judgelm_evaluation.md
@@ -19,7 +19,7 @@ OpenCompass currently supports most datasets that use `GenInferencer` for inference

### Step One: Building Evaluation Configurations, Using MATH as an Example

Below is the Config for evaluating the MATH dataset with JudgeLLM, with the evaluation model being *Llama3-8b-instruct* and the JudgeLLM being *Llama3-70b-instruct*. For more detailed config settings, please refer to `configs/eval_math_llm_judge.py`. The following is a brief version of the annotations to help users understand the meaning of the configuration file.
Below is the Config for evaluating the MATH dataset with JudgeLLM, with the evaluation model being *Llama3-8b-instruct* and the JudgeLLM being *Llama3-70b-instruct*. For more detailed config settings, please refer to `examples/eval_math_llm_judge.py`. The following is a brief version of the annotations to help users understand the meaning of the configuration file.

```python
# Most of the code in this file is copied from https://github.com/openai/simple-evals/blob/main/math_eval.py
...
```
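Conceptually, a JudgeLLM pipeline sends the candidate answer together with the reference to the judge model and parses a verdict out of its reply; a deliberately generic sketch (not the actual prompt or parsing logic used by the config above):

```python
import re

JUDGE_PROMPT = (
    "Problem:\n{problem}\n\nReference answer:\n{reference}\n\n"
    "Candidate answer:\n{prediction}\n\n"
    "Reply with exactly one word: CORRECT or INCORRECT."
)

def parse_judge_verdict(judge_reply: str) -> bool:
    """Return True if the judge declared the candidate answer correct."""
    match = re.search(r"\b(CORRECT|INCORRECT)\b", judge_reply.upper())
    return bool(match) and match.group(1) == "CORRECT"

print(parse_judge_verdict("The derivation matches the reference. CORRECT"))  # True
```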
2 changes: 1 addition & 1 deletion docs/en/advanced_guides/prompt_attack.md
@@ -90,7 +90,7 @@ attack = dict(
Please use `--mode infer` when running the attack experiment, and set the `PYTHONPATH` environment variable.

```shell
python run.py configs/eval_attack.py --mode infer
python run.py examples/eval_attack.py --mode infer
```

All the results will be saved in the `attack` folder.
2 changes: 1 addition & 1 deletion docs/en/advanced_guides/subjective_evaluation.md
@@ -25,7 +25,7 @@ We support the use of GPT-4 (or other JudgeLLM) for the subjective evaluation of

## Initiating Subjective Evaluation

Similar to existing objective evaluation methods, you can configure related settings in `configs/eval_subjective.py`.
Similar to existing objective evaluation methods, you can configure related settings in `examples/eval_subjective.py`.

### Basic Parameters: Specifying models, datasets, and judgemodels

2 changes: 1 addition & 1 deletion docs/zh_cn/advanced_guides/circular_eval.md
@@ -108,4 +108,4 @@ summarizer = dict(
)

For more complex evaluation examples, refer to this sample code: https://github.com/open-compass/opencompass/tree/main/configs/eval_circular.py
For more complex evaluation examples, refer to this sample code: https://github.com/open-compass/opencompass/tree/main/examples/eval_circular.py
4 changes: 2 additions & 2 deletions docs/zh_cn/advanced_guides/code_eval.md
@@ -53,7 +53,7 @@ models = [
Note: `num_return_sequences` must be greater than or equal to k, since pass@k itself is a probability estimate.

For details, refer to the following configuration file
[configs/eval_code_passk.py](https://github.com/open-compass/opencompass/blob/main/configs/eval_code_passk.py)
[examples/eval_code_passk.py](https://github.com/open-compass/opencompass/blob/main/examples/eval_code_passk.py)

### For Models That Do Not Support Multiple Responses

@@ -103,4 +103,4 @@ models = [
We also need randomness in the model's responses, so the `generation_kwargs` parameter must be set as well.

For details, refer to the following configuration file
[configs/eval_code_passk_repeat_dataset.py](https://github.com/open-compass/opencompass/blob/main/configs/eval_code_passk_repeat_dataset.py)
[examples/eval_code_passk_repeat_dataset.py](https://github.com/open-compass/opencompass/blob/main/examples/eval_code_passk_repeat_dataset.py)
2 changes: 1 addition & 1 deletion docs/zh_cn/advanced_guides/code_eval_service.md
@@ -62,7 +62,7 @@ telnet your_service_ip_address your_service_port

### Configuration File

We provide [the configuration file](https://github.com/open-compass/opencompass/blob/main/configs/eval_codegeex2.py) for evaluating `codegeex2` with `humanevalx` as a reference.
We provide [the configuration file](https://github.com/open-compass/opencompass/blob/main/examples/eval_codegeex2.py) for evaluating `codegeex2` with `humanevalx` as a reference.
The dataset and related post-processing configuration files can be found at this [link](https://github.com/open-compass/opencompass/tree/main/configs/datasets/humanevalx); pay attention to the `evaluator` field in `humanevalx_eval_cfg_dict`.

```python
...
```
2 changes: 1 addition & 1 deletion docs/zh_cn/advanced_guides/contamination_eval.md
@@ -70,7 +70,7 @@ gsm8k-ref-ppl f729ba average_ppl unknown 1.55 1.2

- If the performance on the three subsets is relatively close, the contamination of the model on that test set is light; otherwise, it is heavy.

The following configuration file can be used as a reference [link](https://github.com/open-compass/opencompass/blob/main/configs/eval_contamination.py):
The following configuration file can be used as a reference [link](https://github.com/open-compass/opencompass/blob/main/examples/eval_contamination.py):

```python
from mmengine.config import read_base
...
```
2 changes: 1 addition & 1 deletion docs/zh_cn/advanced_guides/evaluation_lightllm.md
@@ -63,7 +63,7 @@ else:
### Step 2: Evaluate the above model using OpenCompass

```shell
python run.py configs/eval_lightllm.py
python run.py examples/eval_lightllm.py
```

Once the model has finished inference and metric computation, we obtain its evaluation results.
2 changes: 1 addition & 1 deletion docs/zh_cn/advanced_guides/needleinahaystack_eval.md
@@ -92,7 +92,7 @@ pip install vllm
Once the test `config` file is written, we can pass its path to `run.py` on the command line, for example:

```bash
python run.py configs/eval_needlebench_v2.py --slurm -p partition_name -q reserved --max-num-workers 16
python run.py examples/eval_needlebench_v2.py --slurm -p partition_name -q reserved --max-num-workers 16
```

Note that we do not need to pass `--dataset`, `--models`, `--summarizer`, etc. here, since these settings are already defined in the config file. You can manually adjust `--max-num-workers` to control the number of parallel workers.
2 changes: 1 addition & 1 deletion docs/zh_cn/advanced_guides/objective_judgelm_evaluation.md
@@ -19,7 +19,7 @@

### Step 1: Build the Evaluation Config, Using MATH as an Example

Below is the config for evaluating the MATH dataset with JudgeLLM, with the evaluated model being *Llama3-8b-instruct* and the JudgeLLM being *Llama3-70b-instruct*. For more detailed config settings, please refer to `configs/eval_math_llm_judge.py`. Brief annotations follow to help users understand the meaning of the configuration file.
Below is the config for evaluating the MATH dataset with JudgeLLM, with the evaluated model being *Llama3-8b-instruct* and the JudgeLLM being *Llama3-70b-instruct*. For more detailed config settings, please refer to `examples/eval_math_llm_judge.py`. Brief annotations follow to help users understand the meaning of the configuration file.

```python
# Most of the code in this file is copied from https://github.com/openai/simple-evals/blob/main/math_eval.py
...
```
2 changes: 1 addition & 1 deletion docs/zh_cn/advanced_guides/prompt_attack.md
@@ -90,7 +90,7 @@ attack = dict(
Please use the `--mode infer` option when running the attack experiment, and set `PYTHONPATH`.

```shell
python run.py configs/eval_attack.py --mode infer
python run.py examples/eval_attack.py --mode infer
```

All results will be saved in a folder named `attack`.
4 changes: 2 additions & 2 deletions docs/zh_cn/advanced_guides/subjective_evaluation.md
@@ -25,7 +25,7 @@

## Launching a Subjective Evaluation

Similar to the existing objective evaluation methods, the related settings can be configured in configs/eval_subjective.py
Similar to the existing objective evaluation methods, the related settings can be configured in examples/eval_subjective.py

### Basic Parameters: Specifying models, datasets, and judgemodels

@@ -134,7 +134,7 @@ The judgemodel is usually set to a strong model such as GPT-4 and can be used directly as configured in the config file
### Step 3: Launch the Evaluation and Output the Results

```shell
python run.py configs/eval_subjective.py -r
python run.py examples/eval_subjective.py -r
```

- The `-r` flag reuses existing model inference and evaluation results.
18 changes: 9 additions & 9 deletions docs/zh_cn/get_started/quick_start.md
@@ -12,7 +12,7 @@

**Visualization**: Once the evaluation is complete, OpenCompass organizes the results into an easy-to-read table and saves them as CSV and TXT files. You can also enable Lark (Feishu) status reporting to receive timely evaluation status reports in the Lark client.

Next, we demonstrate the basic usage of OpenCompass by evaluating the base model [InternLM2-1.8B](https://huggingface.co/internlm/internlm2-1_8b) and the chat models [InternLM2-Chat-1.8B](https://huggingface.co/internlm/internlm2-chat-1_8b) and [Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) on subsampled versions of the [GSM8K](https://github.com/openai/grade-school-math) and [MATH](https://github.com/hendrycks/math) datasets. Their configuration files can be found in [configs/eval_chat_demo.py](https://github.com/open-compass/opencompass/blob/main/configs/eval_chat_demo.py) and [configs/eval_base_demo.py](https://github.com/open-compass/opencompass/blob/main/configs/eval_base_demo.py).
Next, we demonstrate the basic usage of OpenCompass by evaluating the base model [InternLM2-1.8B](https://huggingface.co/internlm/internlm2-1_8b) and the chat models [InternLM2-Chat-1.8B](https://huggingface.co/internlm/internlm2-chat-1_8b) and [Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) on subsampled versions of the [GSM8K](https://github.com/openai/grade-school-math) and [MATH](https://github.com/hendrycks/math) datasets. Their configuration files can be found in [examples/eval_chat_demo.py](https://github.com/open-compass/opencompass/blob/main/examples/eval_chat_demo.py) and [examples/eval_base_demo.py](https://github.com/open-compass/opencompass/blob/main/examples/eval_base_demo.py).

Before running this experiment, make sure you have installed OpenCompass locally. This example should run successfully on a single _GTX-1660-6G_ GPU.

@@ -136,7 +136,7 @@ python tools/list_configs.py llama mmlu

Besides configuring an experiment via the command line, OpenCompass also allows users to write the complete experiment configuration in a config file and run it directly through `run.py`. The config file is organized in Python format and must include the `datasets` and `models` fields.

The configuration for this test is in [configs/eval_chat_demo.py](https://github.com/open-compass/opencompass/blob/main/configs/eval_chat_demo.py). This config imports the required dataset and model configurations through the [inheritance mechanism](../user_guides/config.md#继承机制) and combines the `datasets` and `models` fields in the required format.
The configuration for this test is in [examples/eval_chat_demo.py](https://github.com/open-compass/opencompass/blob/main/examples/eval_chat_demo.py). This config imports the required dataset and model configurations through the [inheritance mechanism](../user_guides/config.md#继承机制) and combines the `datasets` and `models` fields in the required format.

```python
from mmengine.config import read_base
...
```
@@ -154,7 +154,7 @@ models = hf_qwen2_1_5b_instruct_models + hf_internlm2_chat_1_8b_models
When running the task, we simply pass the path of the config file to `run.py`:

```bash
python run.py configs/eval_chat_demo.py --debug
python run.py examples/eval_chat_demo.py --debug
```
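For orientation, such a demo config simply combines the imported dataset and model lists; a hypothetical minimal sketch (the import paths below are assumptions for illustration, the actual contents live in examples/eval_chat_demo.py):

```python
# Illustrative sketch only -- module paths are assumed, not copied from the repo.
from mmengine.config import read_base

with read_base():
    # dataset configs for the subsampled GSM8K / MATH demo sets
    from .datasets.demo.demo_gsm8k_chat_gen import gsm8k_datasets
    from .datasets.demo.demo_math_chat_gen import math_datasets
    # model configs for the two chat models under test
    from .models.qwen.hf_qwen2_1_5b_instruct import models as hf_qwen2_1_5b_instruct_models
    from .models.hf_internlm.hf_internlm2_chat_1_8b import models as hf_internlm2_chat_1_8b_models

datasets = gsm8k_datasets + math_datasets
models = hf_qwen2_1_5b_instruct_models + hf_internlm2_chat_1_8b_models
```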

:::{dropdown} About `models`
@@ -190,7 +190,7 @@ models = [

Similar to models, dataset configuration files are provided under `configs/datasets`. Users can specify them with `--datasets` on the command line, or import the relevant configurations in a config file via inheritance.

Below is the dataset-related configuration snippet from `configs/eval_chat_demo.py`:
Below is the dataset-related configuration snippet from `examples/eval_chat_demo.py`:

```python
from mmengine.config import read_base  # use mmengine.read_base() to read the base configurations
...
```
@@ -270,7 +270,7 @@ python run.py \

Besides configuring an experiment via the command line, OpenCompass also allows users to write the complete experiment configuration in a config file and run it directly through `run.py`. The config file is organized in Python format and must include the `datasets` and `models` fields.

The configuration for this test is in [configs/eval_base_demo.py](https://github.com/open-compass/opencompass/blob/main/configs/eval_base_demo.py). This config imports the required dataset and model configurations through the [inheritance mechanism](../user_guides/config.md#继承机制) and combines the `datasets` and `models` fields in the required format.
The configuration for this test is in [examples/eval_base_demo.py](https://github.com/open-compass/opencompass/blob/main/examples/eval_base_demo.py). This config imports the required dataset and model configurations through the [inheritance mechanism](../user_guides/config.md#继承机制) and combines the `datasets` and `models` fields in the required format.

```python
from mmengine.config import read_base
...
```
@@ -288,7 +288,7 @@ models = hf_qwen2_1_5b_models + hf_internlm2_1_8b_models
When running the task, we simply pass the path of the config file to `run.py`:

```bash
python run.py configs/eval_base_demo.py --debug
python run.py examples/eval_base_demo.py --debug
```

:::{dropdown} About `models`
@@ -324,7 +324,7 @@ models = [

Similar to models, dataset configuration files are provided under `configs/datasets`. Users can specify them with `--datasets` on the command line, or import the relevant configurations in a config file via inheritance.

Below is the dataset-related configuration snippet from `configs/eval_base_demo.py`:
Below is the dataset-related configuration snippet from `examples/eval_base_demo.py`:

```python
from mmengine.config import read_base  # use mmengine.read_base() to read the base configurations
...
```
@@ -358,7 +358,7 @@ OpenCompass generally assumes that the running environment has network access. If you encounter network
Since OpenCompass launches evaluation processes in parallel by default, we can run the evaluation in `--debug` mode the first time and check for problems. We have used the `--debug` switch throughout all of the preceding documentation. In `--debug` mode, tasks are executed sequentially and their output is printed in real time.

```bash
python run.py configs/eval_chat_demo.py -w outputs/demo --debug
python run.py examples/eval_chat_demo.py -w outputs/demo --debug
```

The chat models 'internlm/internlm2-chat-1_8b' and 'Qwen/Qwen2-1.5B-Instruct' will be automatically downloaded from HuggingFace during the first run.
@@ -371,7 +371,7 @@ python run.py configs/eval_chat_demo.py -w outputs/demo --debug
You can then press `Ctrl+C` to interrupt the program and run the following command in normal mode:

```bash
python run.py configs/eval_chat_demo.py -w outputs/demo
python run.py examples/eval_chat_demo.py -w outputs/demo
```

In normal mode, evaluation tasks are executed in parallel in the background, and their output is redirected to the output directory `outputs/demo/{TIMESTAMP}`. The progress bar in the frontend only indicates the number of completed tasks, regardless of success or failure. **Any backend task failure will only trigger a warning message in the terminal.**
8 changes: 4 additions & 4 deletions opencompass/configs/datasets/CHARM/README.md
@@ -95,11 +95,11 @@ ln -snf ${path_to_CHARM_repo}/data/CHARM ./data/CHARM
```bash
cd ${path_to_opencompass}

# modify config file `configs/eval_charm_rea.py`: uncomment or add models you want to evaluate
python run.py configs/eval_charm_rea.py -r --dump-eval-details
# modify config file `examples/eval_charm_rea.py`: uncomment or add models you want to evaluate
python run.py examples/eval_charm_rea.py -r --dump-eval-details

# modify config file `configs/eval_charm_mem.py`: uncomment or add models you want to evaluate
python run.py configs/eval_charm_mem.py -r --dump-eval-details
# modify config file `examples/eval_charm_mem.py`: uncomment or add models you want to evaluate
python run.py examples/eval_charm_mem.py -r --dump-eval-details
```
The inference and evaluation results would be in `${path_to_opencompass}/outputs`, like this:
```bash
...
```
8 changes: 4 additions & 4 deletions opencompass/configs/datasets/CHARM/README_ZH.md
@@ -93,11 +93,11 @@ ln -snf ${path_to_CHARM_repo}/data/CHARM ./data/CHARM
```bash
cd ${path_to_opencompass}

# modify config file `configs/eval_charm_rea.py`: uncomment or add the models you want to evaluate
python run.py configs/eval_charm_rea.py -r --dump-eval-details
# modify config file `examples/eval_charm_rea.py`: uncomment or add the models you want to evaluate
python run.py examples/eval_charm_rea.py -r --dump-eval-details

# modify config file `configs/eval_charm_mem.py`: uncomment or add the models you want to evaluate
python run.py configs/eval_charm_mem.py -r --dump-eval-details
# modify config file `examples/eval_charm_mem.py`: uncomment or add the models you want to evaluate
python run.py examples/eval_charm_mem.py -r --dump-eval-details
```
The inference and evaluation results are located under `${path_to_opencompass}/outputs`, as shown below:
```bash
...
```
2 changes: 1 addition & 1 deletion opencompass/configs/datasets/babilong/README.md
@@ -11,7 +11,7 @@ BABILong paper provides in total 20 tasks, we provide 10 tasks configurations in
OpenCompass provides a demo for evaluating language models on the BABILong dataset.

```bash
opencompass configs/eval_babilong.py
opencompass examples/eval_babilong.py
```
OpenCompass provides the results of some models on the BABILong dataset. The evaluation results were obtained with LMDeploy using default model settings.

4 changes: 2 additions & 2 deletions opencompass/configs/datasets/chinese_simpleqa/README.md
@@ -84,9 +84,9 @@ We provide three evaluation methods.


- Step 3: Configure your launch in configs/eval_chinese_simpleqa.py: set the models to be evaluated, set your judge model (we recommend GPT-4o), and launch it!
- Step 3: Configure your launch in examples/eval_chinese_simpleqa.py: set the models to be evaluated, set your judge model (we recommend GPT-4o), and launch it!
```
python run.py configs/eval_chinese_simpleqa.py
python run.py examples/eval_chinese_simpleqa.py
```
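A judge model is configured like any other model entry; a rough sketch of pointing it at GPT-4o through an OpenAI-compatible wrapper (parameter names here are assumptions, the example config is authoritative):

```python
# Hypothetical judge-model entry -- field names are illustrative assumptions.
from opencompass.models import OpenAI

judge_models = [
    dict(
        type=OpenAI,
        path='gpt-4o',     # judge model name
        key='ENV',         # read the API key from an environment variable
        max_out_len=2048,
        batch_size=8,
    )
]
```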


2 changes: 1 addition & 1 deletion opencompass/configs/datasets/inference_ppl/README.md
@@ -13,7 +13,7 @@ where Eq. (1) is the normal mean ppl computation formula, for inference-ppl, we

```shell
cd opencompass
python run.py configs/eval_inference_ppl.py
python run.py examples/eval_inference_ppl.py
```
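For context, the mean perplexity referred to above is the exponential of the average negative log-likelihood over the scored tokens; a small self-contained sketch (illustrative, not the repository's implementation):

```python
import math

def mean_ppl(token_logprobs: list[float]) -> float:
    """Perplexity = exp(-(1/N) * sum of token log-probabilities)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Example with natural-log probabilities of four tokens
print(mean_ppl([-0.5, -1.2, -0.3, -2.0]))  # ~2.72
```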

# Some results