What is your question?
I'm trying to pre-train a HuBERT model from scratch following the example instructions. Apart from the number of training steps, I'm following the description in the original HuBERT paper for the parameters of the first and second training iterations, specifically:
- To generate labels for the first iteration HuBERT training [...], we run k-means clustering with 100 clusters on 39-dimensional MFCC features
- To generate better targets for the subsequent iterations, we run k-means clustering with 500 clusters on the latent features extracted from the HuBERT model pre-trained in the previous iteration
I've finished pretraining for 100k steps on KM labels extracted from MFCC features (n_clusters = 100, label_rate = 50), and I've extracted new KM labels based on layer 6 HuBERT features following the simple_kmeans instructions (n_clusters = 500, label_rate = 50).
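For reference, the second-iteration label extraction amounts to roughly the sketch below (paths are placeholders; the actual run used the simple_kmeans scripts on dumped layer-6 features):

```python
# Minimal sketch of the second-iteration label extraction; "layer6_feats.npy"
# and "train.km" are placeholder paths, not the files from my actual run.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

feats = np.load("layer6_feats.npy")  # (num_frames, feat_dim) layer-6 HuBERT features

km = MiniBatchKMeans(n_clusters=500, batch_size=10000, n_init=20)
km.fit(feats)

labels = km.predict(feats)  # one integer label per frame, i.e. label_rate = 50
with open("train.km", "w") as f:
    f.write(" ".join(map(str, labels)) + "\n")  # in practice, one line per utterance
```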
I would now like to continue pre-training with the new labels.
When running the command below, I encounter the following:
RuntimeError: Error(s) in loading state_dict for HubertModel:
size mismatch for label_embs_concat: copying a param with shape torch.Size([104, 256]) from checkpoint, the shape in current model is torch.Size([504, 256]).
And subsequently:
Exception: Cannot load model parameters from checkpoint <path_to_ckpt>/checkpoint_last.pt; please ensure that the architectures match.
Based on the param shapes, I expect this error is caused by the size of the second-iteration label set (500) not matching the size of the first-iteration label set (100): with the 4 special symbols that the fairseq dictionary adds, that gives 504 vs. 104 embedding rows.
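A quick check of the new dummy dictionary (assuming it holds the 500 cluster ids) seems consistent with this:

```python
from fairseq.data import Dictionary

# The fairseq dictionary prepends 4 special symbols (<s>, <pad>, </s>, <unk>),
# so 500 cluster labels give 504 entries, matching the 504x256 shape above.
d = Dictionary.load("dict.km.txt")  # dummy dict for the new label set
print(len(d))  # 504
```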
However, I'm not sure how to adapt the pre-training command to allow for a new label set with more clusters, and I could not find this in the instructions, documentation or existing issues.
Code
fairseq-hydra-train \
distributed_training.distributed_world_size=$num_GPUS +optimization.update_freq=$update_freq \
distributed_training.distributed_port="$PORT" \
optimization.max_update=100100 \
checkpoint.save_interval_updates=1 \
task.data=$path_to_tsv \
task.label_dir=$path_to_labels \
task.labels='["km"]' \
model.label_rate=50 \
checkpoint.save_dir=$path_to_ckpt \
--config-dir $path_to_config \
--config-name $config
(Saving a checkpoint at every update is intentional; I want one for each of the first 100 steps of this run.)
The $path_to_labels points to my new labels directory, containing a dummy dict.km.txt for the 500 labels, as well as the train.km and valid.km files for the new label set.
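The dummy dictionary was generated along these lines (the second column is just a dummy count, as suggested in the simple_kmeans instructions):

```python
# Write a dummy dict.km.txt with one line per cluster id and a dummy count.
with open("dict.km.txt", "w") as f:
    for i in range(500):
        f.write(f"{i} 1\n")
```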
What have you tried?
I've tried adding checkpoint.reset_dataloader=true to the command, but this results in the same error. I think I need to reset something else to initialize label_embs_concat from scratch, but I'm unsure how.
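One idea I've considered (but not tested, and I'm not sure it's the intended approach) is to strip the mismatched parameter from the checkpoint before resuming, roughly:

```python
import torch

# Untested idea: drop the 104x256 embedding table so it would have to be
# re-initialized for the new 504-entry label set; I don't know whether
# fairseq's checkpoint loading tolerates the missing key.
ckpt = torch.load("checkpoint_last.pt", map_location="cpu")
print(ckpt["model"]["label_embs_concat"].shape)  # torch.Size([104, 256])

del ckpt["model"]["label_embs_concat"]
torch.save(ckpt, "checkpoint_last_stripped.pt")
```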
What's your environment?
- fairseq Version: 0.12.2
- PyTorch Version: 2.5.1
- OS: Linux
- How you installed fairseq: pip
- Python version: 3.10.4
- CUDA/cuDNN version: 12.6.0