Ask for help about net_vocal #24

Hello, I have been looking at how `net_vocal_attributes` behaves within the whole model framework.

At present, for the embeddings extracted from the predicted sound, the distance between a negative sample pair (`audio_embedding_A1_pred` and `audio_embedding_B1_pred`) can reach 2, and the distance between a positive sample pair (`audio_embedding_A1_pred` and `audio_embedding_A2_pred`) drops to about 0.

But after I changed the input of `net_vocal` to pure real sound, the distance between negative sample pairs (`audio_embedding_A1_gt` and `audio_embedding_B_gt`) only reaches 1. In other words, the sound feature extraction is worse **when I train `net_vocal` alone.**
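To make the numbers above concrete, here is a rough sketch of how such distances could be measured. It assumes the embeddings are L2-normalized and compared with Euclidean distance, which would explain why 2 is the upper bound; the repository may compute this differently. The helper names and the toy vectors are hypothetical and only stand in for the real embeddings.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (assumed to match the embedding post-processing)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def embedding_distance(a, b):
    """Euclidean distance between two L2-normalized embeddings; lies in [0, 2]."""
    a, b = l2_normalize(a), l2_normalize(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy 2-D vectors standing in for real embeddings (hypothetical values).
emb_a1 = [1.0, 0.0]    # speaker A, clip 1
emb_a2 = [2.0, 0.0]    # speaker A, clip 2: same direction as emb_a1
emb_b1 = [-1.0, 0.0]   # speaker B: opposite direction
emb_orth = [0.0, 3.0]  # orthogonal direction

print(embedding_distance(emb_a1, emb_a2))    # positive pair: 0.0
print(embedding_distance(emb_a1, emb_b1))    # fully separated negative pair: 2.0
print(embedding_distance(emb_a1, emb_orth))  # orthogonal pair: about 1.414
```

Under this assumption, d^2 = 2 - 2*cos, so a negative-pair distance of 1 would correspond to a cosine similarity of 0.5, meaning the two speakers' embeddings still point in fairly similar directions.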

It stands to reason that features should be easier to extract from pure ground-truth voices than from predicted voices. I tried changing the training parameters (batch size, learning rate, etc.), but none of these changes solved the problem. Do you know what the reason might be?

Looking forward to your reply!

