Multi-speaker trained model produces different voices at inference

I trained a model on a dataset with multiple speakers.
The quality is OK, but the model produces a random speaker's voice at inference.
Is there any way to control this? Is it possible to choose the voice?
How does the model choose which one to use for inference?
Interestingly, the model picks the same voice every time for a given text (unless I edit anything in it, even a period or a comma).