
[tts][ernie_sat]add ernie sat synthesize_e2e #2287


Merged · 1 commit · Aug 24, 2022
1 change: 1 addition & 0 deletions examples/aishell3/README.md
@@ -10,3 +10,4 @@
 * voc3 - MultiBand MelGAN
 * vc0 - Tacotron2 Voice Cloning with GE2E
 * vc1 - FastSpeech2 Voice Cloning with GE2E
+* ernie_sat - ERNIE-SAT
152 changes: 151 additions & 1 deletion examples/aishell3/ernie_sat/README.md
@@ -1 +1,151 @@
# ERNIE-SAT with AISHELL3 dataset

ERNIE-SAT is a cross-lingual speech-language cross-modal pretrained model that handles Chinese and English at the same time, and it achieves leading results on multiple tasks such as speech editing, personalized speech synthesis, and cross-lingual speech synthesis. It can be applied to a range of scenarios such as speech editing, personalized synthesis, voice cloning, and simultaneous interpretation. This project is provided for research use.

## Model Framework
We propose two innovations in ERNIE-SAT:
- During pretraining, the phonemes corresponding to Chinese and English text are used as input, achieving cross-lingual, personalized soft phoneme mapping
- Joint masked learning over language and speech is used to align language and speech

<p align="center">
<img src="https://user-images.githubusercontent.com/24568452/186110814-1b9c6618-a0ab-4c0c-bb3d-3d860b0e8cc2.png" />
</p>

## Dataset
### Download and Extract
Download AISHELL-3 from its [Official Website](http://www.aishelltech.com/aishell_3) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/data_aishell3`.

### Get MFA Result and Extract
We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get the phoneme durations for AISHELL-3.
You can download the precomputed alignments from [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model by following the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (which currently uses MFA1.x) in our repo.
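For example, to fetch and unpack the precomputed alignments (a minimal sketch; it assumes the tarball extracts to `./aishell3_alignment_tone`, the path used in Get Started below):
```bash
wget https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz
tar xf aishell3_alignment_tone.tar.gz
```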

## Get Started
Assume the path to the dataset is `~/datasets/data_aishell3`.
Assume the path to the MFA result of AISHELL-3 is `./aishell3_alignment_tone`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize wavs.
- synthesize waveform from `metadata.jsonl`.
- synthesize waveform from text file.

```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage. For example, running the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the `dump` folder is listed below.

```text
dump
├── dev
│   ├── norm
│   └── raw
├── phone_id_map.txt
├── speaker_id_map.txt
├── test
│   ├── norm
│   └── raw
└── train
    ├── norm
    ├── raw
    └── speech_stats.npy
```
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and a `raw` subfolder. The `raw` folder contains the speech features of each utterance, while the `norm` folder contains the normalized ones. The statistics used to normalize the features are computed from the training set and stored in `dump/train/*_stats.npy`.

Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains the phones, text lengths, speech lengths, durations, speech feature path, speaker, and id of each utterance; you can inspect it as shown below.
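A quick way to sanity-check the preprocessed data (a minimal sketch, assuming preprocessing has finished):
```bash
# pretty-print the first record of the normalized training metadata
head -n 1 dump/train/norm/metadata.jsonl | python3 -m json.tool
# inspect the statistics used for normalization (computed on the training set)
python3 -c "import numpy as np; print(np.load('dump/train/speech_stats.npy'))"
```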

### Model Training
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
`./local/train.sh` calls `${BIN_DIR}/train.py`.
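For example, with the defaults from `run.sh` (a usage sketch; `conf/default.yaml` and `exp/default` are the values set in `run.sh`, and the provided config assumes 8 GPUs):
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./local/train.sh conf/default.yaml exp/default
```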

### Synthesizing
We use [HiFiGAN](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc5) as the neural vocoder.

Download pretrained HiFiGAN model from [hifigan_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip) and unzip it.
```bash
unzip hifigan_aishell3_ckpt_0.2.0.zip
```
The HiFiGAN checkpoint contains the files listed below.
```text
hifigan_aishell3_ckpt_0.2.0
├── default.yaml # default config used to train HiFiGAN
├── feats_stats.npy # statistics used to normalize spectrogram when training HiFiGAN
└── snapshot_iter_2500000.pdz # generator parameters of HiFiGAN
```
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveforms from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
## Speech Synthesis and Speech Editing
### Prepare
**prepare aligner**
```bash
mkdir -p tools/aligner
cd tools
# download MFA
wget https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.0.1/montreal-forced-aligner_linux.tar.gz
# extract MFA
tar xvf montreal-forced-aligner_linux.tar.gz
# fix .so of MFA
cd montreal-forced-aligner/lib
ln -snf libpython3.6m.so.1.0 libpython3.6m.so
cd -
# download align models and dicts
cd aligner
wget https://paddlespeech.bj.bcebos.com/MFA/ernie_sat/aishell3_model.zip
wget https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/simple.lexicon
wget https://paddlespeech.bj.bcebos.com/MFA/ernie_sat/vctk_model.zip
wget https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/cmudict-0.7b
cd ../../
```
**prepare pretrained FastSpeech2 models**

ERNIE-SAT uses FastSpeech2 as the phoneme duration predictor:
```bash
mkdir download
cd download
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_conformer_baker_ckpt_0.5.zip
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip
unzip fastspeech2_conformer_baker_ckpt_0.5.zip
unzip fastspeech2_nosil_ljspeech_ckpt_0.5.zip
cd ../
```
**prepare source data**
```bash
mkdir source
cd source
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/SSB03540307.wav
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/SSB03540428.wav
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/LJ050-0278.wav
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/p243_313.wav
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/p299_096.wav
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/this_was_not_the_show_for_me.wav
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/README.md
cd ../
```

You can check the transcripts of the downloaded wavs in `source/README.md`.
### Speech Synthesis and Speech Editing
```bash
./run.sh --stage 3 --stop-stage 3 --gpus 0
```
Stage 3 of `run.sh` calls `local/synthesize_e2e.sh`; inside that script, stage 0 performs **Speech Synthesis** and stage 1 performs **Speech Editing**.
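To run only one of the two inner stages, you can override the hardcoded `stage`/`stop_stage` variables at the top of `local/synthesize_e2e.sh` (a hedged sketch using GNU sed; the script does not expose them as CLI flags):
```bash
# switch the inner start stage from 0 (synthesis) to 1 (editing), then run stage 3
sed -i 's/^stage=0$/stage=1/' local/synthesize_e2e.sh
./run.sh --stage 3 --stop-stage 3 --gpus 0
```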

You can modify `--wav_path`, `--old_str`, and `--new_str` yourself: `--old_str` should be the transcript of the audio at `--wav_path`, `--new_str` should be written to match `--task_name`, and both `--source_lang` and `--target_lang` should be `zh` for a model trained on the AISHELL3 dataset, as in the sketch below.
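For instance, a custom speech-editing call might look like this (a hedged sketch assembled from the flags in `local/synthesize_e2e.sh`; the replacement sentence `'明天天气很好'` and the output name are hypothetical examples, and `${BIN_DIR}` is assumed to be set as in the local scripts):
```bash
python3 ${BIN_DIR}/synthesize_e2e.py \
    --task_name=edit \
    --wav_path=source/SSB03540428.wav \
    --old_str='今天天气很好' \
    --new_str='明天天气很好' \
    --source_lang=zh \
    --target_lang=zh \
    --erniesat_config=conf/default.yaml \
    --phones_dict=dump/phone_id_map.txt \
    --erniesat_ckpt=exp/default/checkpoints/snapshot_iter_289500.pdz \
    --erniesat_stat=dump/train/speech_stats.npy \
    --voc=hifigan_aishell3 \
    --voc_config=hifigan_aishell3_ckpt_0.2.0/default.yaml \
    --voc_ckpt=hifigan_aishell3_ckpt_0.2.0/snapshot_iter_2500000.pdz \
    --voc_stat=hifigan_aishell3_ckpt_0.2.0/feats_stats.npy \
    --output_name=exp/pred_edit_custom.wav
```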
## Pretrained Model
Pretrained ERNIE-SAT model:
- [erniesat_aishell3_ckpt_1.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/erniesat_aishell3_ckpt_1.2.0.zip)

Model | Step | eval/mlm_loss | eval/loss
:-------------:| :------------:| :-----: | :-----:
default| 8(gpu) x 289500|51.723782|51.723782
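To try it without training (a hedged sketch; the layout inside the zip is assumed to follow the usual PaddleSpeech checkpoint convention of a config, a stats file, and a `snapshot_iter_*.pdz`):
```bash
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/erniesat_aishell3_ckpt_1.2.0.zip
unzip erniesat_aishell3_ckpt_1.2.0.zip
```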
9 changes: 6 additions & 3 deletions examples/aishell3/ernie_sat/conf/default.yaml
@@ -1,3 +1,6 @@
# This configuration was tested on 8 GPUs (A100) with 80 GB GPU memory.
# It takes around 3 days to finish the training. You can adjust
# batch_size and num_workers here, and ngpu in local/train.sh, for your machine.
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
@@ -21,8 +24,8 @@ mlm_prob: 0.8
###########################################################
# DATA SETTING #
###########################################################
-batch_size: 20
-num_workers: 2
+batch_size: 40
+num_workers: 8

###########################################################
# MODEL SETTING #
@@ -280,4 +283,4 @@ token_list:
- o3
- iang5
- ei5
-- <sos/eos>
\ No newline at end of file
+- <sos/eos>
23 changes: 3 additions & 20 deletions examples/aishell3/ernie_sat/local/synthesize.sh
@@ -4,28 +4,11 @@ config_path=$1
train_output_path=$2
ckpt_name=$3

-stage=1
-stop_stage=1
-
-# pwgan
-if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
-    FLAGS_allocator_strategy=naive_best_fit \
-    FLAGS_fraction_of_gpu_memory_to_use=0.01 \
-    python3 ${BIN_DIR}/synthesize.py \
-        --erniesat_config=${config_path} \
-        --erniesat_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
-        --erniesat_stat=dump/train/speech_stats.npy \
-        --voc=pwgan_aishell3 \
-        --voc_config=pwg_aishell3_ckpt_0.5/default.yaml \
-        --voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
-        --voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \
-        --test_metadata=dump/test/norm/metadata.jsonl \
-        --output_dir=${train_output_path}/test \
-        --phones_dict=dump/phone_id_map.txt
-fi
+stage=0
+stop_stage=0

# hifigan
-if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    FLAGS_allocator_strategy=naive_best_fit \
    FLAGS_fraction_of_gpu_memory_to_use=0.01 \
    python3 ${BIN_DIR}/synthesize.py \
52 changes: 52 additions & 0 deletions examples/aishell3/ernie_sat/local/synthesize_e2e.sh
@@ -0,0 +1,52 @@
#!/bin/bash

config_path=$1
train_output_path=$2
ckpt_name=$3

stage=0
stop_stage=1

if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    echo 'speech synthesis!'
    FLAGS_allocator_strategy=naive_best_fit \
    FLAGS_fraction_of_gpu_memory_to_use=0.01 \
    python3 ${BIN_DIR}/synthesize_e2e.py \
        --task_name=synthesize \
        --wav_path=source/SSB03540307.wav \
        --old_str='请播放歌曲小苹果。' \
        --new_str='歌曲真好听。' \
        --source_lang=zh \
        --target_lang=zh \
        --erniesat_config=${config_path} \
        --phones_dict=dump/phone_id_map.txt \
        --erniesat_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
        --erniesat_stat=dump/train/speech_stats.npy \
        --voc=hifigan_aishell3 \
        --voc_config=hifigan_aishell3_ckpt_0.2.0/default.yaml \
        --voc_ckpt=hifigan_aishell3_ckpt_0.2.0/snapshot_iter_2500000.pdz \
        --voc_stat=hifigan_aishell3_ckpt_0.2.0/feats_stats.npy \
        --output_name=exp/pred_gen.wav
fi

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    echo 'speech editing!'
    FLAGS_allocator_strategy=naive_best_fit \
    FLAGS_fraction_of_gpu_memory_to_use=0.01 \
    python3 ${BIN_DIR}/synthesize_e2e.py \
        --task_name=edit \
        --wav_path=source/SSB03540428.wav \
        --old_str='今天天气很好' \
        --new_str='今天心情很好' \
        --source_lang=zh \
        --target_lang=zh \
        --erniesat_config=${config_path} \
        --phones_dict=dump/phone_id_map.txt \
        --erniesat_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
        --erniesat_stat=dump/train/speech_stats.npy \
        --voc=hifigan_aishell3 \
        --voc_config=hifigan_aishell3_ckpt_0.2.0/default.yaml \
        --voc_ckpt=hifigan_aishell3_ckpt_0.2.0/snapshot_iter_2500000.pdz \
        --voc_stat=hifigan_aishell3_ckpt_0.2.0/feats_stats.npy \
        --output_name=exp/pred_edit.wav
fi
4 changes: 2 additions & 2 deletions examples/aishell3/ernie_sat/local/train.sh
@@ -8,5 +8,5 @@ python3 ${BIN_DIR}/train.py \
    --dev-metadata=dump/dev/norm/metadata.jsonl \
    --config=${config_path} \
    --output-dir=${train_output_path} \
-    --ngpu=2 \
-    --phones-dict=dump/phone_id_map.txt
\ No newline at end of file
+    --ngpu=8 \
+    --phones-dict=dump/phone_id_map.txt
6 changes: 5 additions & 1 deletion examples/aishell3/ernie_sat/run.sh
@@ -9,7 +9,7 @@ stop_stage=100

conf_path=conf/default.yaml
train_output_path=exp/default
-ckpt_name=snapshot_iter_153.pdz
+ckpt_name=snapshot_iter_289500.pdz

# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
@@ -30,3 +30,7 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    # synthesize, vocoder is pwgan
    CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
+
+if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
+    CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
+fi
1 change: 1 addition & 0 deletions examples/aishell3_vctk/README.md
@@ -1 +1,2 @@
# Mixed Chinese and English TTS with AISHELL3 and VCTK datasets
+* ernie_sat - ERNIE-SAT