TF-Locoformer: Transformer with Local Modeling by Convolution for Speech Separation and Enhancement

This repository includes source code of the TF-Locoformer model proposed in the following paper:

@InProceedings{Saijo2024_TFLoco,
  author    =  {Saijo, Kohei and Wichern, Gordon and Germain, Fran\c{c}ois G. and Pan, Zexu and {Le Roux}, Jonathan},
  title     =  {TF-Locoformer: Transformer with Local Modeling by Convolution for Speech Separation and Enhancement},
  booktitle =  {Proc. International Workshop on Acoustic Signal Enhancement (IWAENC)},
  year      =  2024,
  month     =  sep
}

In addition to the original TF-Locoformer, the repository also supports some variants of TF-Locoformer:

TF-Locoformer with no positional encoding (TF-Locoformer-NoPE) introduced in [Saijo2025Comparative]
TF-Locoformer with band-split encoder (BS-Locoformer) introduced in [Saijo2025Task]

@InProceedings{Saijo2025Comparative,
  author    =  {Saijo, Kohei and Ogawa, Tetsuji},
  title     =  {A Comparative Study on Positional Encoding for Time-frequency Domain Dual-path Transformer-based Source Separation Models},
  booktitle =  {Proc. European Signal Processing Conference (EUSIPCO)},
  year      =  2025,
  month     =  sep
}

@InProceedings{Saijo2025Task,
  author    =  {Saijo, Kohei and Ebbers, Janek and Germain, Fran\c{c}ois G. and Wichern, Gordon and {Le Roux}, Jonathan},
  title     =  {Task-Aware Unified Source Separation},
  booktitle =  {Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year      =  2025,
  month     =  april
}

ESPnet compatible version and standalone version
Environmental setup: Installing ESPnet from source and injecting the TF-Locoformer code
Using a pre-trained model
Example of training and inference
Instructions for running training on each dataset in the ESPnet pipeline (Librimix, WHAMR! and DNS)
Contributing
Copyright and license

ESPnet compatible version and standalone version

This repository includes the TF-Locoformer model and its variants with two formats:

1. ESPnet comppatible code

ESPnet compatible version is placed under espnet2/enh/separator/tflocoformer_separator.py. This code can be used in ESPnet without any modification.

If you would like to reproduce the results in [Saijo2024_TFLoco], please follow the instructions in the next section

2. Standalone code

The ESPnet compatible version needs to be modified when you would like to use it in your own code, as it imports some funcions/classes from ESPnet. We prepared the standalone version without any dependency on ESPnet, and placed it under the standalone/ directory.

This standalone code is compatible with the pre-trained models provided in the repo (see this section for more details on pre-trained models), but needs a slight modification of the keys in the state dict when loading the weights. Please refer to the example code below or tests/test_tflocoformer_load_pretrained_weights.py to know how to load the pre-trained weights.

state_dict = torch.load("./egs2/whamr/enh1/exp/enh_train_enh_tflocoformer_raw/valid.loss.ave_5best.pth")

# provided pre-trained weights have keys starting with 'separator.', but this has to be removed.
new_state_dict = {}
for k, v in state_dict.items():
    new_key = ".".join(k.split(".")[1:])
    new_state_dict[new_key] = v
model.load_state_dict(new_state_dict, strict=True)

Please note that we do not provide any training recipe for this version. You can setup the ESPnet environment as instructed below if you need a training recipe.

Environmental setup: Installing ESPnet from source and injecting the TF-Locoformer code

In this repo, we provide the code for TF-Locoformer along with scripts to run training and inference in ESPnet. The following commands install ESPnet from source and copy the TF-Locoformer code to the appropriate directories in ESPnet.

For more details on installing ESPnet, please refer to https://espnet.github.io/espnet/installation.html.

# Clone espnet code.
git clone https://github.com/espnet/espnet.git

# Checkout the commit where we tested our code.
cd ./espnet && git checkout 90eed8e53498e7af682bc6ff39d9067ae440d6a4

# Set up conda environment.
# ./setup_anaconda /path/to/conda environment-name python-version
cd ./tools && ./setup_anaconda.sh /path/to/conda tflocoformer 3.10.8

# Install espnet from source with other dependencies. We used torch 2.1.0 and cuda 11.8.
# NOTE: torch version must be 2.x.x for other dependencies.
# If you encounter issues with the ESPnet installation, this issue might be helpful https://github.com/espnet/espnet/issues/6106#issuecomment-2841967451
make TH_VERSION=2.1.0 CUDA_VERSION=11.8

# Install the RoPE package.
conda activate tflocoformer && pip install rotary-embedding-torch==0.6.1

# Copy the TF-Locoformer code to ESPnet.
# NOTE: ./copy_files_to_espnet.sh changes `espnet2/tasks/enh.py`. Please be careful when using your existing ESPnet environment.
cd ../../ && git clone https://github.com/merlresearch/tf-locoformer.git && cd tf-locoformer
./copy_files_to_espnet.sh /path/to/espnet-root

Using a pre-trained model

This repo supports speech separation/enhancement on 4 datasets:

WSJ0-2mix (egs2/wsj0_2mix/enh1)
Libri2mix (egs2/librimix/enh1)
WHAMR! (egs2/whamr/enh1)
DNS-Interspeech2020 dataset (egs2/dns_ins20/enh1)

In each egs2 directory, you can find the pre-trained model under the exp directory.

One can easily use the pre-trained model to separate an audio mixture as follows:

# assuming you are now at ./egs2/wsj0_2mix/enh1
python separate.py \
    --model_path ./exp/enh_train_enh_tflocoformer_pretrained/valid.loss.ave_5best.pth \
    --audio_path /path/to/input_audio \
    --audio_output_dir /path/to/output_directory

Example of training and inference

Here are example commands to run the WSJ0-2mix recipe. Other dataset recipes are similar, but require additional steps (refer to the next section).

# Go to the corresponding example directory.
cd ../espnet/egs2/wsj0_2mix/enh1

# Data preparation and stats collection if necessary.
# NOTE: please fill the corresponding part of db.sh for data preparation.
./run.sh --stage 1 --stop_stage 5

# Training. We used 4 GPUs for training (batch size was 1 on each GPU; GPU RAM depends on dataset).
./run.sh --stage 6 --stop_stage 6 --enh_config conf/tuning/train_enh_tflocoformer.yaml --ngpu 4

# Inference.
./run.sh --stage 7 --stop_stage 7 --enh_config conf/tuning/train_enh_tflocoformer.yaml --ngpu 1 --gpu_inference true --inference_model valid.loss.ave_5best.pth

# Scoring. Scores are written in RESULT.md.
./run.sh --stage 8 --stop_stage 8 --enh_config conf/tuning/train_enh_tflocoformer.yaml

Instructions for running training on each dataset in the ESPnet pipeline

Some recipe changes are required to run the experiments as in the paper. After finishing the processes below, you can run the recipe in a normal way as described above.

WHAMR!

First, please install pyroomacoustics: pip install pyroomacoustics==0.2.0.

The default task in ESPnet is noisy reverberant speech enhancement without dereverberation (using mix_single_reverb subset), while we did noisy reverberant speech separation with dereverberation. To do the same task as in the paper, please run the following commands in egs2/whamr/enh1:

# Speech enhancement task -> speech separation task.
sed -i '13,15s|single|both|' run.sh
sed -i '23s|1|2|' run.sh

# Modify the url of the WHAM! noise and the WHAMR! script.
sed -i '42s|.*|  wham_noise_url=https://my-bucket-a8b4b49c25c811ee9a7e8bba05fa24c7.s3.amazonaws.com/wham_noise.zip|' local/whamr_create_mixture.sh
sed -i '52s|.*|script_url=https://my-bucket-a8b4b49c25c811ee9a7e8bba05fa24c7.s3.amazonaws.com/whamr_scripts.tar.gz|' local/whamr_create_mixture.sh

cd local && patch -b < whamr_data_prep.patch && cd ..

Then, you can start running the recipe from stage 1.

Libri2mix

The default task in ESPnet is noisy speech separation, while we did noise-free speech separation. To do the same task as in the paper, run the following commands in egs2/librimix/enh1:

# Apply the patch file to data.sh.
cd local && patch -b < data.patch && cd ..

# Use only train-360. By default, both train-100 and train-360 are used.
sed -i '12s|"train"|"train-360"|' run.sh

# Noisy separation -> noise-free separation.
sed -i '17s|true|false|' run.sh

# Data preparation in the "clean" condition (noise-free separation).
./run.sh --stage 1 --stop_stage 5 --local_data_opts "--sample_rate 8k --min_or_max min --cond clean"

DNS interspeech2020 dataset

In the paper, we simulated 3000 hours of noisy speech: 2700 h for training and 300 h for validation. To reproduce the paper's result, run the following commands in egs2/dns_ins20/enh1:

sed -i '18s|.*|total_hours=3000|' local/dns_create_mixture.sh
sed -i '19s|.*|snr_lower=-5|' local/dns_create_mixture.sh
sed -i '20s|.*|snr_upper=15|' local/dns_create_mixture.sh

We recommend reducing the size of the validation data to save training time since the validation loop with 300 h takes a very long time.

Contributing

See CONTRIBUTING.md for our policy on contributions.

Copyright and license

Released under Apache-2.0 license, as found in the LICENSE.md file.

All files, except as noted below:

Copyright (C) 2024 Mitsubishi Electric Research Laboratories (MERL)

SPDX-License-Identifier: Apache-2.0

The following patch files:

espnet2/tasks/enh.patch
egs2/librimix/local/data.patch
egs2/whamr/local/whamr_data_prep.patch

include code from https://github.com/espnet/espnet (license included in LICENSES/Apache-2.0.md)

Copyright (C) 2024 Mitsubishi Electric Research Laboratories (MERL)
Copyright (C) 2017 ESPnet Developers

SPDX-License-Identifier: Apache-2.0
SPDX-License-Identifier: Apache-2.0

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.github/workflows		.github/workflows
.reuse		.reuse
LICENSES		LICENSES
egs2		egs2
espnet2		espnet2
standalone		standalone
tests		tests
.gitattributes		.gitattributes
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE.md		LICENSE.md
README.md		README.md
copy_files_to_espnet.sh		copy_files_to_espnet.sh
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Uh oh!

Repository files navigation

TF-Locoformer: Transformer with Local Modeling by Convolution for Speech Separation and Enhancement

Table of contents

ESPnet compatible version and standalone version

1. ESPnet comppatible code

2. Standalone code

Environmental setup: Installing ESPnet from source and injecting the TF-Locoformer code

Using a pre-trained model

Example of training and inference

Instructions for running training on each dataset in the ESPnet pipeline

WHAMR!

Libri2mix

DNS interspeech2020 dataset

Contributing

Copyright and license

About

Uh oh!

Releases 3

Packages

Contributors 2

Uh oh!

Languages

Uh oh!

License

Uh oh!

merlresearch/tf-locoformer

Folders and files

Latest commit

History

Repository files navigation

TF-Locoformer: Transformer with Local Modeling by Convolution for Speech Separation and Enhancement

Table of contents

ESPnet compatible version and standalone version

1. ESPnet comppatible code

2. Standalone code

Environmental setup: Installing ESPnet from source and injecting the TF-Locoformer code

Using a pre-trained model

Example of training and inference

Instructions for running training on each dataset in the ESPnet pipeline

WHAMR!

Libri2mix

DNS interspeech2020 dataset

Contributing

Copyright and license

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 3

Packages 0

Contributors 2

Uh oh!

Languages

Packages