This repository contains the code and released models for the TASLP 2025 paper Learning Evaluation Models from Large Language Models for Sequence Generation 📝. We propose CSEM (Customized Sequence Evaluation Metric), a three-stage training framework that leverages large language models to automatically generate labeled data for training evaluation metrics, thus eliminating reliance on human annotations. CSEM supports diverse evaluation settings, including single-aspect, multi-aspect, reference-based, and reference-free, enabling flexible and effective assessment of sequence generation across varied scenarios.
The code in this repo is adapted from Unbabel/COMET 🌹🌹🌹. If you encounter installation issues (e.g., related to PyTorch or CUDA), we recommend first checking the COMET issue tracker for potential solutions. If the problem persists, please feel free to open an issue in this repository.
```bash
git clone https://gitee.com/wangclnlp/CSEM
cd CSEM
pip install poetry
poetry install
```
You can train the model with facebookresearch/fairseq or any other framework for training large language models.
Prepare queries, corresponding answers, and responses from generative models, then label the responses with a specified template (taking "Single-aspect Evaluation for Machine Translation" as an example):
```
Based on the human reference, score the following translation from [Source Language] to [Target Language] with respect to [Aspect] with one to five stars, where one star means [Description of the Worst Translation on a Single Aspect] and five stars mean [Description of the Perfect Translation on a Single Aspect].
Note that [Definition of the Used Single Evaluation Aspect].
[Source Language] source: [Source]
[Target Language] human reference: [Reference]
[Target Language] translation: [Translation]
Stars:
```
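As a concrete illustration, the snippet below sketches how this template might be filled and sent to an LLM to collect star ratings. The language pair, aspect wording, and the `query_llm` helper are hypothetical placeholders, not part of this repository; any LLM backend can be substituted.

```python
# Hypothetical sketch of the LLM-labeling step; query_llm is a placeholder for
# whatever LLM backend you use and is NOT provided by this repository.
import re
from typing import Optional

TEMPLATE = (
    "Based on the human reference, score the following translation from {src_lang} to {tgt_lang} "
    "with respect to {aspect} with one to five stars, where one star means {worst_desc} "
    "and five stars mean {best_desc}.\n"
    "Note that {aspect_definition}.\n"
    "{src_lang} source: {src}\n"
    "{tgt_lang} human reference: {ref}\n"
    "{tgt_lang} translation: {mt}\n"
    "Stars:"
)

def build_prompt(example: dict) -> str:
    """Render one labeling prompt from a raw (src, ref, mt) triple."""
    return TEMPLATE.format(
        src_lang="German",
        tgt_lang="English",
        aspect="adequacy",
        worst_desc="the translation omits or distorts most of the source meaning",
        best_desc="the translation fully preserves the source meaning",
        aspect_definition="adequacy measures how much of the source meaning is preserved",
        **example,
    )

def parse_stars(llm_output: str) -> Optional[int]:
    """Extract the first digit in 1-5 from the LLM answer; None if no valid score is found."""
    match = re.search(r"[1-5]", llm_output)
    return int(match.group()) if match else None

# Usage sketch:
# example = {"src": "...", "ref": "...", "mt": "..."}
# score = parse_stars(query_llm(build_prompt(example)))
```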
The data should be in CSV format, with different columns depending on whether a reference is available.

With a reference, the columns include `src`, `mt`, `ref`, and `score`. Example:

| src | mt | ref | score |
| --- | --- | --- | --- |
| und wieder ins haus zurück bringen. | then they had to bring them in. | putting back in the house. | 2 |
Without a reference, the columns include `src`, `mt`, and `score`. Example:

| src | mt | score |
| --- | --- | --- |
| das ist sehr praktisch und extrem toll. | this is very practical and extremely awesome. | 4 |
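As a minimal sketch of producing these files (the file names are hypothetical and the rows reuse the examples above), the labeled examples can be written out with pandas:

```python
# Minimal sketch: write LLM-labeled examples into the CSV layout described above.
# File names are hypothetical; adapt them to your own data.
import pandas as pd

# Reference-based data: src, mt, ref, score
pd.DataFrame(
    [
        {
            "src": "und wieder ins haus zurück bringen.",
            "mt": "then they had to bring them in.",
            "ref": "putting back in the house.",
            "score": 2,
        }
    ]
).to_csv("train_with_ref.csv", index=False)

# Reference-free data: src, mt, score
pd.DataFrame(
    [
        {
            "src": "das ist sehr praktisch und extrem toll.",
            "mt": "this is very practical and extremely awesome.",
            "score": 4,
        }
    ]
).to_csv("train_without_ref.csv", index=False)
```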
Training arguments are managed in YAML format in the `configs/` subdirectory. After configuring the arguments in a config file, you can train the model-based metric with the following command:

```bash
python comet/cli/train.py --cfg /path/to/config/file
```
- Training w/ Reference: the example config file for training with a reference is located at `configs/completeness_diff_train_size/reference_model.yaml`.
- Training w/o Reference: the example config file for training without a reference is located at `configs/coherence_diff_train_size/referenceless_model.yaml`.
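After training, the resulting checkpoint can presumably be used for scoring in the same way as an upstream COMET model. The sketch below assumes this fork keeps COMET's Python API (`load_from_checkpoint` / `predict`); the checkpoint path is a hypothetical placeholder.

```python
# Sketch: score new outputs with a trained checkpoint, assuming the upstream
# COMET Python API (load_from_checkpoint / predict) is unchanged in this fork.
from comet import load_from_checkpoint

# Hypothetical path; use the checkpoint written by your training run.
model = load_from_checkpoint("path/to/checkpoints/your_model.ckpt")

data = [
    {
        "src": "das ist sehr praktisch und extrem toll.",
        "mt": "this is very practical and extremely awesome.",
        # add a "ref" field here if the metric was trained with references
    }
]

output = model.predict(data, batch_size=8, gpus=1)
print(output.scores)        # per-segment scores
print(output.system_score)  # corpus-level average
```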
```bibtex
@misc{learning2025wang,
  title={Learning Evaluation Models from Large Language Models for Sequence Generation},
  author={Chenglong Wang and Hang Zhou and Kaiyan Chang and Tongran Liu and Chunliang Zhang and Quan Du and Tong Xiao and Yue Zhang and Jingbo Zhu},
  year={2025},
  eprint={2308.04386},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2308.04386},
}
```