This repo contains the training code for Phoneme-level ASR for Voice Conversion (VC) and TTS (Text-Mel Alignment) used in StarGANv2-VC and StyleTTS.
- Python >= 3.7
- Clone this repository:
    git clone https://github.com/yl4579/AuxiliaryASR.git
    cd AuxiliaryASR
- Install Python requirements:
    pip install SoundFile torchaudio torch jiwer pyyaml click matplotlib g2p_en librosa
- Prepare your own dataset and put the train_list.txt and val_list.txt in the Data folder (see the training instructions below for more details).
To start training, run:
    python train.py --config_path ./Configs/config.yml
Please specify the training and validation data in the config.yml file. The data list format must be filename.wav|label|speaker_number; see train_list.txt for an example (a subset of LJSpeech). Note that speaker_number can simply be 0 for ASR, but it is useful to set a meaningful number for TTS training (if you need to use this repo for StyleTTS).
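Before launching a long run, it can help to check that every line of a data list follows the filename.wav|label|speaker_number format. The following is a minimal sketch, not part of this repo: the names Entry, parse_list_line and load_list are hypothetical, and it assumes that the label itself never contains a | character and that speaker_number is an integer.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    """One line of a data list (hypothetical helper type, not from the repo)."""
    wav_path: str
    label: str
    speaker: int


def parse_list_line(line: str) -> Entry:
    """Parse a single 'filename.wav|label|speaker_number' line.

    Assumes the label contains no '|' and the speaker field is an integer.
    """
    parts = line.rstrip("\n").split("|")
    if len(parts) != 3:
        raise ValueError(
            f"expected 3 '|'-separated fields, got {len(parts)}: {line!r}"
        )
    wav_path, label, speaker = (p.strip() for p in parts)
    if not wav_path:
        raise ValueError(f"empty filename in line: {line!r}")
    # int() raises ValueError if the speaker field is not an integer.
    return Entry(wav_path, label, int(speaker))


def load_list(path: str) -> List[Entry]:
    """Read a list file, skipping blank lines, and validate every entry."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                entries.append(parse_list_line(line))
    return entries
```

For example, load_list("Data/train_list.txt") would raise a ValueError naming the first malformed line, or return the parsed entries if the whole file is well formed.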
Checkpoints and TensorBoard logs will be saved at log_dir. To speed up training, you may want to make batch_size as large as your GPU memory allows. Note, however, that batch_size = 64 takes around 10 GB of GPU memory.
This repo is set up for English with the g2p_en package, but you can train it on other languages. To train on datasets in a different language, you will need to modify the meldataset.py file (L86-93) to use your own phonemizer. You also need to change the vocabulary file (word_index_dict.txt) and change n_token in config.yml to reflect the number of tokens. A recommended phonemizer for other languages is the phonemizer package.
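When switching languages, the vocabulary and n_token have to stay in sync with whatever symbols the new phonemizer emits. The sketch below shows one way to derive both from a set of transcripts. It rests on several assumptions: build_vocab, vocab_to_csv and toy_phonemize are hypothetical names, the special symbols are placeholders, and the quoted-symbol,index CSV layout is a guess that should be checked against the word_index_dict.txt shipped with the repo. The phonemizer is passed in as a plain function, so a real one can replace the toy stand-in.

```python
import csv
import io
from typing import Callable, Dict, Iterable, List


def build_vocab(
    texts: Iterable[str],
    phonemize: Callable[[str], List[str]],
    specials: Iterable[str] = ("<pad>", "<unk>"),  # placeholder symbols (assumption)
) -> Dict[str, int]:
    """Map every symbol produced by `phonemize` to a unique integer index."""
    vocab: Dict[str, int] = {}
    # Special symbols get the lowest indices.
    for sym in specials:
        vocab.setdefault(sym, len(vocab))
    # Collect all symbols first, then sort so the indices are reproducible.
    symbols = set()
    for text in texts:
        symbols.update(phonemize(text))
    for sym in sorted(symbols):
        vocab.setdefault(sym, len(vocab))
    return vocab


def vocab_to_csv(vocab: Dict[str, int]) -> str:
    """Serialize as '"symbol",index' lines (assumed layout, verify before use)."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    for sym, idx in sorted(vocab.items(), key=lambda kv: kv[1]):
        writer.writerow([sym, idx])
    return buf.getvalue()


def toy_phonemize(text: str) -> List[str]:
    """Stand-in phonemizer: one symbol per non-space character (illustration only)."""
    return [ch for ch in text.lower() if not ch.isspace()]


if __name__ == "__main__":
    vocab = build_vocab(["ab", "ba c"], toy_phonemize)
    print(vocab_to_csv(vocab), end="")
    # Assuming n_token should equal the number of vocabulary entries.
    print("n_token:", len(vocab))
```

Under these assumptions, the value to put in n_token is len(vocab). If the repo reserves additional indices beyond those in the vocabulary file, n_token would need to be correspondingly larger.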
The author would like to thank @tosaka-m for his great repository and valuable discussions.