This repository provides a comprehensive framework for fine-tuning and evaluating speech translation models, including Whisper and Seamless, on the FLEURS dataset. The framework supports both monolingual and multilingual configurations with easy-to-use training and testing scripts.
```bash
# Clone the repository
git clone https://github.com/jonahdvt/ast-lrl-speech.git
cd ast-lrl-speech

# Install dependencies
conda env create -f environment.yaml
conda activate speech-processing-env
```
The project uses the FLEURS dataset with the following structure:
All dataset files are stored in the `lang_aggregate_data/` directory.

- Language-specific info: `[langname]_fleurs_info.csv` contains the metadata for each language
- English translations: `en_translation.csv` holds the reference translations in English
- Audio mapping: each audio file is mapped to its corresponding translations across all languages via the `codes` attribute

The `codes` attribute serves as the primary key for mapping audio files to their translations across different languages in the processing scripts.
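As a rough illustration of how that key can be used, the sketch below joins one language's metadata with the English references using pandas. The file name and all column names other than `codes` are assumptions for illustration and may not match the actual CSV layout.

```python
import pandas as pd

# Load one language's FLEURS metadata and the English reference translations.
# The "hausa" file name is assumed here for illustration.
lang_info = pd.read_csv("lang_aggregate_data/hausa_fleurs_info.csv")
en_ref = pd.read_csv("lang_aggregate_data/en_translation.csv")

# `codes` is the shared primary key, so a simple merge pairs each audio file
# with its English reference translation.
paired = lang_info.merge(en_ref, on="codes", suffixes=("_lang", "_en"))
print(paired.head())
```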
Navigate to the appropriate Whisper fine-tuning script based on your dataset requirements:
`train_scripts/whisper/whisper_finetune_xxxxx.py`
- Select script: Choose the appropriate fine-tuning script for your target dataset
- Language selection: Specify the languages you want to fine-tune on
- Training iterations: Set the number of iterations (especially important for streaming)
- Base model: Choose the pre-trained Whisper model to fine-tune from
- Training arguments: Configure your desired `training_args` (a hedged sketch follows the example below)
- Model deployment: Set up `kwargs` for pushing and saving the model to the Hugging Face Hub
```python
languages = [
    "ig_ng",
    "lg_ug",
    "sw_ke",
    "yo_ng",
    "ha_ng",
]  # Target languages (example from the African multilingual fine-tuning)

whisper_model = 'openai/whisper-large-v3'
```
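For orientation, here is a minimal sketch of what the `training_args` and Hub `kwargs` steps might look like with Hugging Face's `Seq2SeqTrainingArguments`; every value and name below is a placeholder, not the configuration used in the scripts.

```python
from transformers import Seq2SeqTrainingArguments

# Illustrative settings only; tune batch size, learning rate, and max_steps
# to your hardware and streaming setup.
training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-large-v3-african-multi",  # placeholder repo/dir name
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    max_steps=5000,    # an explicit iteration count matters when streaming
    fp16=True,
    push_to_hub=True,  # push checkpoints to the Hugging Face Hub
)

# Metadata typically passed to trainer.push_to_hub(**kwargs); placeholder values.
kwargs = {
    "dataset_tags": "google/fleurs",
    "finetuned_from": "openai/whisper-large-v3",
    "tasks": "automatic-speech-recognition",
    "model_name": "whisper-large-v3-african-multi",  # placeholder
}
```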
Navigate to the appropriate Seamless fine-tuning script based on your configuration:
`train_scripts/seamless/seamless_finetune_xxxxx.py`
- Mono configuration: Single language fine-tuning
- Multi configuration: Multi-language fine-tuning
- Select configuration: Choose between mono/multi configuration scripts
- Language selection: Specify target languages
- Training iterations: Set the iteration count for streaming scenarios (see the streaming sketch after this list)
- Base model: Select the base Seamless model
- Training arguments: Configure training parameters
- Model deployment: Set up Hugging Face Hub integration in `kwargs`
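Because a streamed dataset has no fixed length, the trainer needs an explicit step count rather than an epoch count. The sketch below shows how a FLEURS split can be streamed with the `datasets` library; the language code is only an example.

```python
from datasets import load_dataset

# Stream one FLEURS training split instead of downloading it in full.
fleurs_train = load_dataset("google/fleurs", "ig_ng", split="train", streaming=True)

# An iterable (streamed) dataset has no __len__, which is why the fine-tuning
# scripts expose an explicit iteration / max_steps setting.
for i, example in enumerate(fleurs_train.take(2)):
    print(i, example["transcription"])
```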
```bash
python testing_scripts/whisper_test.py
```
- Model selection: Choose between the base model or a custom fine-tuned model via `model_id` (a minimal inference sketch follows this list)
- Language selection: Select desired languages for direct inference
- Dataset: Runs inference on the FLEURS test dataset
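This is not the script's actual code, but a hedged sketch of the kind of direct inference it performs, using the `transformers` pipeline API; the checkpoint and audio path are placeholders.

```python
from transformers import pipeline

# Base checkpoint or a fine-tuned model pushed to the Hub (placeholder).
model_id = "openai/whisper-large-v3"
asr = pipeline("automatic-speech-recognition", model=model_id)

# Whisper translates source-language speech into English when task="translate".
result = asr("path/to/fleurs_sample.wav", generate_kwargs={"task": "translate"})
print(result["text"])
```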
```bash
python testing_scripts/seamless_test.py
```
- Model selection: Choose between the base model or a custom fine-tuned model via `model_id` (a minimal inference sketch follows this list)
- Language selection: Select desired languages for direct inference
- Dataset: Runs inference on the FLEURS test dataset
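Again as a hedged sketch rather than the script's actual code, speech-to-text translation with a Seamless checkpoint might look roughly as follows; the checkpoint and language choices are placeholders.

```python
from datasets import load_dataset
from transformers import AutoProcessor, SeamlessM4Tv2ForSpeechToText

# Base checkpoint or a fine-tuned model pushed to the Hub (placeholder).
model_id = "facebook/seamless-m4t-v2-large"
processor = AutoProcessor.from_pretrained(model_id)
model = SeamlessM4Tv2ForSpeechToText.from_pretrained(model_id)

# One streamed FLEURS test example (FLEURS audio is 16 kHz).
sample = next(iter(load_dataset("google/fleurs", "sw_ke", split="test", streaming=True)))

inputs = processor(audios=sample["audio"]["array"], sampling_rate=16000, return_tensors="pt")
tokens = model.generate(**inputs, tgt_lang="eng")
print(processor.decode(tokens[0], skip_special_tokens=True))
```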
```bash
python testing_scripts/gemini.py
```
- Requires a valid Gemini API key
- Configure target languages at the top of the script
- Runs translation inference with Gemini 2.0 Flash on the FLEURS test dataset (see the sketch below)
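A hedged sketch of the kind of call such a script makes with the `google-generativeai` client; the prompt wording, language, and sentence are placeholders.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")  # supply your own key
model = genai.GenerativeModel("gemini-2.0-flash")

transcript = "Example FLEURS sentence in the source language."  # placeholder
prompt = f"Translate the following Hausa sentence into English:\n{transcript}"

response = model.generate_content(prompt)
print(response.text)
```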
```bash
python testing_scripts/chatgpt.py
```
- Requires a valid OpenAI API key
- Configure target languages at the top of the script
- Runs translation inference with GPT-4 on the FLEURS test dataset (see the sketch below)
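Correspondingly, a hedged sketch with the official `openai` client; the prompt, language, and sentence are placeholders.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

transcript = "Example FLEURS sentence in the source language."  # placeholder
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a translation assistant."},
        {"role": "user", "content": f"Translate the following Yoruba sentence into English:\n{transcript}"},
    ],
)
print(response.choices[0].message.content)
```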
Results from all testing scripts are automatically saved and can be found in the designated output directories.
To calculate metrics (WER/BLEU), see `metrics.py`.
To extract the best and worst samples based on WER differentials, see `dataset/sample_finder.py`.
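For reference, WER and BLEU can be computed with the `evaluate` library roughly as shown below; whether `metrics.py` uses this exact library is an assumption.

```python
import evaluate

wer_metric = evaluate.load("wer")
bleu_metric = evaluate.load("sacrebleu")

predictions = ["the cat sat on the mat"]
references = ["the cat is on the mat"]

wer = wer_metric.compute(predictions=predictions, references=references)
bleu = bleu_metric.compute(predictions=predictions, references=[[r] for r in references])
print(f"WER: {wer:.3f}  BLEU: {bleu['score']:.2f}")
```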
- Memory Management: Adjust batch sizes to fit your GPU memory. Streaming is also advised to avoid downloading large datasets, and pushing trained models to the Hugging Face Hub avoids storing large checkpoints locally.
- Streaming: Use appropriate iteration counts for streaming datasets
- Model Versioning: Use descriptive names when pushing to Hugging Face Hub
- API Limits: Be mindful of API rate limits when using Gemini or GPT testing
- Ensure all API keys are properly configured before running external API tests
- Monitor training progress and adjust hyperparameters as needed
- Consider using different base models for optimal performance on specific languages
For detailed configuration examples and troubleshooting, please refer to the individual script documentation within each file.