This repository contains the code and released models for the TASLP 2025 paper Learning Evaluation Models from Large Language Models for Sequence Generation 📝. We propose CSEM (Customized Sequence Evaluation Metric), a three-stage training framework that leverages large language models to automatically generate labeled data for training evaluation metrics, thus eliminating reliance on human annotations. CSEM supports diverse evaluation settings, including single-aspect, multi-aspect, reference-based, and reference-free, enabling flexible and effective assessment of sequence generation across varied scenarios.
The code in this repo is adapted from Unbabel/COMET 🌹🌹🌹. If you encounter installation issues (e.g., related to PyTorch or CUDA), we recommend first checking the COMET issue tracker for potential solutions. If the problem persists, please feel free to open an issue in this repository.
```bash
git clone https://gitee.com/wangclnlp/CSEM
cd CSEM
pip install poetry
poetry install
```
You can train the model with facebookresearch/fairseq or any other framework for training large language models.
Prepare queries, corresponding answers, and responses from generative models, then label the responses with a specified template (taking "Single-aspect Evaluation for Machine Translation" as an example):
```
Based on the human reference, score the following translation from [Source Language] to [Target Language] with respect to [Aspect] with one to five stars, where one star means [Description of the Worst Translation on a Single Aspect] and five stars mean [Description of the Perfect Translation on a Single Aspect].
Note that [Definition of the Used Single Evaluation Aspect].
[Source Language] source: [Source]
[Target Language] human reference: [Reference]
[Target Language] translation: [Translation]
Stars:
```
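As a concrete illustration, the snippet below sketches how this template might be filled and sent to an LLM to collect star ratings. The language pair, aspect wording, and the `query_llm` helper are hypothetical placeholders, not part of this repository; any LLM backend can be substituted.

```python
# Hypothetical sketch of the LLM-labeling step; query_llm is a placeholder for
# whatever LLM backend you use and is NOT provided by this repository.
import re
from typing import Optional

TEMPLATE = (
    "Based on the human reference, score the following translation from {src_lang} to {tgt_lang} "
    "with respect to {aspect} with one to five stars, where one star means {worst_desc} "
    "and five stars mean {best_desc}.\n"
    "Note that {aspect_definition}.\n"
    "{src_lang} source: {src}\n"
    "{tgt_lang} human reference: {ref}\n"
    "{tgt_lang} translation: {mt}\n"
    "Stars:"
)

def build_prompt(example: dict) -> str:
    """Render one labeling prompt from a raw (src, ref, mt) triple."""
    return TEMPLATE.format(
        src_lang="German",
        tgt_lang="English",
        aspect="adequacy",
        worst_desc="the translation omits or distorts most of the source meaning",
        best_desc="the translation fully preserves the source meaning",
        aspect_definition="adequacy measures how much of the source meaning is preserved",
        **example,
    )

def parse_stars(llm_output: str) -> Optional[int]:
    """Extract the first digit in 1-5 from the LLM answer; None if no valid score is found."""
    match = re.search(r"[1-5]", llm_output)
    return int(match.group()) if match else None

# Usage sketch:
# example = {"src": "...", "ref": "...", "mt": "..."}
# score = parse_stars(query_llm(build_prompt(example)))
```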
The data should be in CSV format, with different columns depending on whether a reference is available.

With a reference, the columns include `src`, `mt`, `ref`, and `score`. Example:

| src | mt | ref | score |
| --- | --- | --- | --- |
| und wieder ins haus zurück bringen. | then they had to bring them in. | putting back in the house. | 2 |
Without a reference, the columns include `src`, `mt`, and `score`. Example:

| src | mt | score |
| --- | --- | --- |
| das ist sehr praktisch und extrem toll. | this is very practical and extremely awesome. | 4 |
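As a minimal sketch of producing these files (the file names are hypothetical and the rows reuse the examples above), the labeled examples can be written out with pandas:

```python
# Minimal sketch: write LLM-labeled examples into the CSV layout described above.
# File names are hypothetical; adapt them to your own data.
import pandas as pd

# Reference-based data: src, mt, ref, score
pd.DataFrame(
    [
        {
            "src": "und wieder ins haus zurück bringen.",
            "mt": "then they had to bring them in.",
            "ref": "putting back in the house.",
            "score": 2,
        }
    ]
).to_csv("train_with_ref.csv", index=False)

# Reference-free data: src, mt, score
pd.DataFrame(
    [
        {
            "src": "das ist sehr praktisch und extrem toll.",
            "mt": "this is very practical and extremely awesome.",
            "score": 4,
        }
    ]
).to_csv("train_without_ref.csv", index=False)
```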
Training arguments are managed in YAML format in the `configs/` subdirectory. After configuring the arguments in a config file, you can train the model-based metric with the following command:

```bash
python comet/cli/train.py --cfg /path/to/config/file
```
- Training w/ Reference: the example config file for training with a reference is located at `configs/completeness_diff_train_size/reference_model.yaml`.
- Training w/o Reference: the example config file for training without a reference is located at `configs/coherence_diff_train_size/referenceless_model.yaml`.
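After training, the resulting checkpoint can presumably be used for scoring in the same way as an upstream COMET model. The sketch below assumes this fork keeps COMET's Python API (`load_from_checkpoint` / `predict`); the checkpoint path is a hypothetical placeholder.

```python
# Sketch: score new outputs with a trained checkpoint, assuming the upstream
# COMET Python API (load_from_checkpoint / predict) is unchanged in this fork.
from comet import load_from_checkpoint

# Hypothetical path; use the checkpoint written by your training run.
model = load_from_checkpoint("path/to/checkpoints/your_model.ckpt")

data = [
    {
        "src": "das ist sehr praktisch und extrem toll.",
        "mt": "this is very practical and extremely awesome.",
        # add a "ref" field here if the metric was trained with references
    }
]

output = model.predict(data, batch_size=8, gpus=1)
print(output.scores)        # per-segment scores
print(output.system_score)  # corpus-level average
```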
```bibtex
@misc{learning2025wang,
  title={Learning Evaluation Models from Large Language Models for Sequence Generation},
  author={Chenglong Wang and Hang Zhou and Kaiyan Chang and Tongran Liu and Chunliang Zhang and Quan Du and Tong Xiao and Yue Zhang and Jingbo Zhu},
  year={2025},
  eprint={2308.04386},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2308.04386},
}
```