Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization

This repository provides an implementation of the Parity-Aware BPE algorithm, introduced in the paper "Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization" (arXiv, 2025).

Overview

Parity-aware BPE learns a tokenization that aims for parity in token lengths across languages, measured on a multi-parallel development set. Unlike standard BPE, which selects merges purely from global frequency statistics over a single corpus, this approach explicitly accounts for cross-lingual fairness while learning the merge rules.
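At a high level, each merge step first identifies the language whose development-set tokenization is currently the longest, then picks the merge that benefits that language. The snippet below is a minimal, illustrative sketch of this loop under simplified assumptions (pair counts taken directly from the dev data, no window variant); it is not the repository's implementation, and all names in it are hypothetical.

from collections import Counter

def parity_aware_bpe_sketch(dev_corpora, num_merges):
    # dev_corpora: {language: list of words, each word a list of symbols}
    merges = []
    for _ in range(num_merges):
        # Language with the longest current tokenization (worst compression)
        worst = max(dev_corpora, key=lambda lang: sum(len(w) for w in dev_corpora[lang]))
        # Most frequent adjacent symbol pair in that language
        pairs = Counter()
        for word in dev_corpora[worst]:
            pairs.update(zip(word, word[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge to every language
        for lang, words in dev_corpora.items():
            dev_corpora[lang] = [apply_merge(w, best) for w in words]
    return merges

def apply_merge(word, pair):
    # Replace every adjacent occurrence of `pair` in `word` with the merged symbol
    out, i = [], 0
    while i < len(word):
        if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out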

Installation

You can install this package directly from GitHub:

pip install git+https://github.com/swiss-ai/parity-aware-bpe.git

For development installation:

git clone https://github.com/swiss-ai/parity-aware-bpe.git
cd parity-aware-bpe
pip install -e .

Usage Instructions

The arguments of parity_aware_learn_bpe.py are as follows:

  • --variant: Parity-aware BPE variant. Options:
    • base – standard parity-aware BPE (default)
    • window – moving-window balancing version
  • --input: Space-separated list of training corpora (one per language).
  • --dev: Space-separated list of development texts used for parity computation (multi-parallel). The tool assumes that the language of the nth input corpus corresponds to the nth dev corpus (same order as --input).
  • --ratio: Space-separated list of desired compression ratios (floats), relative to the pre-tokenized training-set length, one per input language. Can be used for parity computation on the training data in lieu of a development set.
  • --global-merges: Number of initial merge operations to perform based on global frequency statistics (equivalent to standard BPE) before switching to parity-optimizing merges (hybrid parity-aware BPE); see the hybrid example below.
  • --symbols: Total number of BPE merges to perform.
  • --output: Path to the output file where BPE merge rules will be saved (one per line).
  • --total-symbols: Adjusts the number of merges by subtracting character counts (so --symbols approximates total symbols needed).
  • --min-frequency: Minimum pair frequency to continue merging (default: 2).
  • --window-size: Context window size for the window-balancing variant (default: 100).
  • --alpha: Parameter controlling the moving-window balancing behavior (default: 2).

Example Usage

python3 parity_aware_bpe/parity_aware_learn_bpe.py \
        --symbols {num_operations} \
        --variant {"base" or "window"} \
        --input {train_files} \
        --dev {development_files} \
        --output {output_file}
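
For example, a hybrid run that performs the first merges using global frequency statistics (standard BPE) before switching to parity-aware merges could look like the following; all file paths and counts here are purely illustrative:

python3 parity_aware_bpe/parity_aware_learn_bpe.py \
        --symbols 32000 \
        --global-merges 8000 \
        --variant base \
        --input data/train.en data/train.de data/train.sw \
        --dev data/dev.en data/dev.de data/dev.sw \
        --output merges.txt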

Classical BPE

To run the classical BPE algorithm, use learn_bpe.py:

python3 parity_aware_bpe/learn_bpe.py \
        --symbols {num_operations} \
        --input {train_files} \
        --dev {development_files} \
        --output {output_file}

Generating a Vocabulary

After learning the merges, you can build a vocabulary file using the build_vocab_from_merges function in HF_tokenizer.py. To create a Hugging Face-compatible tokenizer:

python3 parity_aware_bpe/HF_tokenizer.py \
        --merges_file_path {merge_file_path} \
        --tokenizer_path {tokenizer_save_folder}

Loading the Tokenizer

import os

from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel, Whitespace
from transformers import PreTrainedTokenizerFast

tokenizer_path = "path/to/tokenizer"  # folder created by HF_tokenizer.py

merge_file = os.path.join(tokenizer_path, "merges.txt")
vocab_file = os.path.join(tokenizer_path, "vocab.json")
tokenizer = Tokenizer(BPE.from_file(vocab_file, merge_file))
# Use the same pre-tokenizer as the one used during BPE training
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([Whitespace(), ByteLevel(use_regex=False)])

wrapped_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
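
As a quick sanity check, the wrapped tokenizer can be used like any Hugging Face fast tokenizer (the exact tokens depend on the learned merges):

text = "Parity-aware tokenization"
print(wrapped_tokenizer.tokenize(text))
print(wrapped_tokenizer(text)["input_ids"])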

Intrinsic Evaluation

For our intrinsic evaluation, we use Tok##Suite to analyze and compare tokenizers across multiple languages and metrics. You can find the evaluation suite here.

Citation

If you use this code for your research, please cite our paper:

@article{foroutan-meister-et-al-2025-parity-aware-bpe,
  title={Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization},
  author={Foroutan, Negar and Meister, Clara and Paul, Debjit and Niklaus, Joel and Ahmadi, Sina and Bosselut, Antoine and Sennrich, Rico},
  journal={arXiv preprint arXiv:2508.04796},
  url={https://arxiv.org/abs/2508.04796},
  year={2025}
}
