What is your question?
I'm trying to pre-train a HuBERT model from scratch following the example instructions. Apart from the number of training steps, I'm following the description in the original HuBERT paper for the parameters of the first and second training iterations, specifically:
- To generate labels for the first iteration HuBERT training [...], we run k-means clustering with 100 clusters on 39-dimensional MFCC features
- To generate better targets for the subsequent iterations, we run k-means clustering with 500 clusters on the latent features extracted from the HuBERT model pre-trained in the previous iteration
I've finished pretraining for 100k steps on KM labels extracted from MFCC features (n_clusters = 100, label_rate = 50), and I've extracted new KM labels based on layer 6 HuBERT features following the simple_kmeans instructions (n_clusters = 500, label_rate = 50).
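For reference, the second-iteration label extraction amounts to roughly the sketch below (paths are placeholders; the actual run used the simple_kmeans scripts on dumped layer-6 features):

```python
# Minimal sketch of the second-iteration label extraction; "layer6_feats.npy"
# and "train.km" are placeholder paths, not the files from my actual run.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

feats = np.load("layer6_feats.npy")  # (num_frames, feat_dim) layer-6 HuBERT features

km = MiniBatchKMeans(n_clusters=500, batch_size=10000, n_init=20)
km.fit(feats)

labels = km.predict(feats)  # one integer label per frame, i.e. label_rate = 50
with open("train.km", "w") as f:
    f.write(" ".join(map(str, labels)) + "\n")  # in practice, one line per utterance
```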
I would now like to continue pre-training with the new labels.
When running the command below, I encounter the following:
RuntimeError: Error(s) in loading state_dict for HubertModel:
size mismatch for label_embs_concat: copying a param with shape torch.Size([104, 256]) from checkpoint, the shape in current model is torch.Size([504, 256]).
And subsequently:
Exception: Cannot load model parameters from checkpoint <path_to_ckpt>/checkpoint_last.pt; please ensure that the architectures match.
Based on the param shapes, I expect this error is caused by the size of the second-iteration label set (500) not matching the size of the first-iteration label set (100): with the 4 special symbols that the fairseq dictionary adds, that gives 504 vs. 104 embedding rows.
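A quick check of the new dummy dictionary (assuming it holds the 500 cluster ids) seems consistent with this:

```python
from fairseq.data import Dictionary

# The fairseq dictionary prepends 4 special symbols (<s>, <pad>, </s>, <unk>),
# so 500 cluster labels give 504 entries, matching the 504x256 shape above.
d = Dictionary.load("dict.km.txt")  # dummy dict for the new label set
print(len(d))  # 504
```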
However, I'm not sure how to adapt the pre-training command to allow for a new label set with more clusters, and I could not find this in the instructions, documentation or existing issues.
Code
fairseq-hydra-train \
distributed_training.distributed_world_size=$num_GPUS +optimization.update_freq=$update_freq \
distributed_training.distributed_port="$PORT" \
optimization.max_update=100100 \
checkpoint.save_interval_updates=1 \
task.data=$path_to_tsv \
task.label_dir=$path_to_labels \
task.labels='["km"]' \
model.label_rate=50 \
checkpoint.save_dir=$path_to_ckpt \
--config-dir $path_to_config \
--config-name $config
(Saving a checkpoint at every update is intentional; I want one for each of the first 100 steps of this run.)
The $path_to_labels points to my new labels directory, containing a dummy dict.km.txt for the 500 labels, as well as the train.km and valid.km files for the new label set.
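The dummy dictionary was generated along these lines (the second column is just a dummy count, as suggested in the simple_kmeans instructions):

```python
# Write a dummy dict.km.txt with one line per cluster id and a dummy count.
with open("dict.km.txt", "w") as f:
    for i in range(500):
        f.write(f"{i} 1\n")
```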
What have you tried?
I've tried adding checkpoint.reset_dataloader=true to the command, but this results in the same error. I think I need to reset something else to initialize label_embs_concat from scratch, but I'm unsure how.
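One idea I've considered (but not tested, and I'm not sure it's the intended approach) is to strip the mismatched parameter from the checkpoint before resuming, roughly:

```python
import torch

# Untested idea: drop the 104x256 embedding table so it would have to be
# re-initialized for the new 504-entry label set; I don't know whether
# fairseq's checkpoint loading tolerates the missing key.
ckpt = torch.load("checkpoint_last.pt", map_location="cpu")
print(ckpt["model"]["label_embs_concat"].shape)  # torch.Size([104, 256])

del ckpt["model"]["label_embs_concat"]
torch.save(ckpt, "checkpoint_last_stripped.pt")
```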
What's your environment?
- fairseq Version: 0.12.2
- PyTorch Version: 2.5.1
- OS: Linux
- How you installed fairseq: pip
- Python version: 3.10.4
- CUDA/cuDNN version: 12.6.0