
Customized Sequence Evaluation Metric (CSEM)

This repository contains the code and released models for the TASLP 2025 paper Learning Evaluation Models from Large Language Models for Sequence Generation 📝. We propose CSEM (Customized Sequence Evaluation Metric), a three-stage training framework that leverages large language models to automatically generate labeled data for training evaluation metrics, thus eliminating reliance on human annotations. CSEM supports diverse evaluation settings, including single-aspect, multi-aspect, reference-based, and reference-free, enabling flexible and effective assessment of sequence generation across varied scenarios.

Installation Guide

The code in this repository is adapted from Unbabel/COMET 🌹🌹🌹. If you encounter installation issues (e.g., related to PyTorch or CUDA), we recommend first checking the COMET issues for potential solutions. If the problem persists, please feel free to submit an issue in this repository.

git clone https://gitee.com/wangclnlp/CSEM
cd CSEM
pip install poetry
poetry install

Preparing Datasets

Training a Generative Language Model

You can train the generative model with facebookresearch/fairseq or any other framework for training large language models.

Sampling from the Generative Language Model and Labeling the Data

Prepare queries, their corresponding answers, and responses from the generative models, then label the responses with the specified template (taking "Single-aspect Evaluation for Machine Translation" as an example):

Based on the human reference, score the following translation from [Source Language] to [Target Language] with respect to [Aspect] with one to five stars, where one star means [Description of the Worst Translation on a Single Aspect] and five stars mean [Description of the Perfect Translation on a Single Aspect].
Note that [Definition of the Used Single Evaluation Aspect].

[Source Language] source: [Source]
[Target Language] human reference: [Reference]
[Target Language] translation: [Translation]
Stars:
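
For illustration only, below is a minimal sketch of how one might fill in this template and collect star labels, assuming an OpenAI-compatible chat API is used as the labeling LLM; the field names, model name, and score-parsing regex are assumptions for this example, not part of the released code.

import re
from openai import OpenAI  # assumption: any OpenAI-compatible endpoint works here

TEMPLATE = (
    "Based on the human reference, score the following translation from {src_lang} to {tgt_lang} "
    "with respect to {aspect} with one to five stars, where one star means {worst_desc} "
    "and five stars mean {best_desc}.\n"
    "Note that {aspect_def}.\n\n"
    "{src_lang} source: {src}\n"
    "{tgt_lang} human reference: {ref}\n"
    "{tgt_lang} translation: {mt}\n"
    "Stars:"
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def label_response(example: dict, model: str = "gpt-4o-mini") -> int:
    """Fill the template for one example and return the parsed 1-5 star score."""
    prompt = TEMPLATE.format(**example)
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    text = reply.choices[0].message.content
    match = re.search(r"[1-5]", text)  # take the first digit in the reply as the star score
    if match is None:
        raise ValueError(f"Could not parse a star rating from: {text!r}")
    return int(match.group())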

Post-processing Data

The data should be in CSV format; the required columns differ depending on whether a reference is available.

Data w/ Reference

The columns include src, mt, ref, and score.

Example:

src:   und wieder ins haus zurück bringen.
mt:    then they had to bring them in.
ref:   putting back in the house.
score: 2

Data w/o Reference

The columns include src, mt, and score.

Example:

src:   das ist sehr praktisch und extrem toll.
mt:    this is very practical and extremely awesome.
score: 4
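
As a small illustration (not part of the released code), labeled examples like the one above could be written into the expected CSV layout with Python's csv module; the file name below is a placeholder:

import csv

rows = [
    {"src": "das ist sehr praktisch und extrem toll.",
     "mt": "this is very practical and extremely awesome.",
     "score": 4},
]

with open("train.csv", "w", newline="", encoding="utf-8") as f:
    # add "ref" to fieldnames (and to each row) for reference-based data
    writer = csv.DictWriter(f, fieldnames=["src", "mt", "score"])
    writer.writeheader()
    writer.writerows(rows)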

Training Scripts

Training arguments are managed in YAML files in the configs/ subdirectory. After configuring the arguments in a config file, you can train the model-based metric with the following command:

python comet/cli/train.py --cfg /path/to/config/file
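
Since CSEM is built on COMET, a trained checkpoint can presumably be loaded and used for scoring through COMET's Python API. A minimal sketch, assuming a recent COMET version; the checkpoint path and batch settings are placeholders, and reference-based models additionally expect a ref field in each input dict:

from comet import load_from_checkpoint

model = load_from_checkpoint("/path/to/checkpoint.ckpt")  # placeholder path

data = [
    {"src": "das ist sehr praktisch und extrem toll.",
     "mt": "this is very practical and extremely awesome."},
]

output = model.predict(data, batch_size=8, gpus=1)
print(output.scores)        # per-segment scores
print(output.system_score)  # corpus-level average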

Citation

@misc{learning2025wang,
      title={Learning Evaluation Models from Large Language Models for Sequence Generation}, 
      author={Chenglong Wang and Hang Zhou and Kaiyan Chang and Tongran Liu and Chunliang Zhang and Quan Du and Tong Xiao and Yue Zhang and Jingbo Zhu},
      year={2025},
      eprint={2308.04386},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2308.04386}, 
}
