
Commit 64e6ece

Update loss overview docs tables for distillation
1 parent af58c7f commit 64e6ece

4 files changed (+36, −28 lines)

docs/cross_encoder/loss_overview.md

Lines changed: 7 additions & 4 deletions
@@ -31,10 +31,13 @@ Loss functions play a critical role in the performance of your fine-tuned Cross
 These loss functions are specifically designed to be used when distilling the knowledge from one model into another.
 For example, when finetuning a small model to behave more like a larger & stronger one, or when finetuning a model to become multi-lingual.
 
-| Texts | Labels | Appropriate Loss Functions |
-|----------------------------------------------|---------------------------------------------------------------|--------------------------------------------------------------------------------------------|
-| `(sentence_A, sentence_B) pairs` | `similarity score` | <a href="../package_reference/cross_encoder/losses.html#mseloss">`MSELoss`</a> |
-| `(query, passage_one, passage_two) triplets` | `gold_sim(query, passage_one) - gold_sim(query, passage_two)` | <a href="../package_reference/cross_encoder/losses.html#marginmseloss">`MarginMSELoss`</a> |
+| Texts | Labels | Appropriate Loss Functions |
+|---------------------------------------------------|---------------------------------------------------------------------------|--------------------------------------------------------------------------------------------|
+| `(sentence_A, sentence_B) pairs` | `similarity score` | <a href="../package_reference/cross_encoder/losses.html#mseloss">`MSELoss`</a> |
+| `(query, passage_one, passage_two) triplets` | `gold_sim(query, passage_one) - gold_sim(query, passage_two)` | <a href="../package_reference/cross_encoder/losses.html#marginmseloss">`MarginMSELoss`</a> |
+| `(query, positive, negative_1, ..., negative_n)` | `[gold_sim(query, positive) - gold_sim(query, negative_i) for i in 1..n]` | <a href="../package_reference/cross_encoder/losses.html#marginmseloss">`MarginMSELoss`</a> |
+| `(query, positive, negative)` | `[gold_sim(query, positive), gold_sim(query, negative)]` | <a href="../package_reference/cross_encoder/losses.html#marginmseloss">`MarginMSELoss`</a> |
+| `(query, positive, negative_1, ..., negative_n) ` | `[gold_sim(query, positive), gold_sim(query, negative_i)...] ` | <a href="../package_reference/cross_encoder/losses.html#marginmseloss">`MarginMSELoss`</a> |
 
 ## Commonly used Loss Functions
 In practice, not all loss functions get used equally often. The most common scenarios are:
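
For reference, the `(query, passage_one, passage_two)` + margin setup from the updated table, as a minimal runnable sketch. This assumes the `CrossEncoderTrainer` API (sentence-transformers v4+); the base model, toy texts, and teacher scores are illustrative, not part of this commit.

```python
from datasets import Dataset
from sentence_transformers.cross_encoder import CrossEncoder, CrossEncoderTrainer
from sentence_transformers.cross_encoder.losses import MarginMSELoss

# Toy example: the label is the teacher margin,
# gold_sim(query, passage_one) - gold_sim(query, passage_two).
train_dataset = Dataset.from_dict({
    "query": ["how do I bake bread?"],
    "passage_one": ["Knead the dough, let it rise, then bake at 220C."],
    "passage_two": ["Paris is the capital of France."],
    "label": [7.3],  # hypothetical teacher score difference
})

student = CrossEncoder("microsoft/MiniLM-L12-H384-uncased", num_labels=1)
loss = MarginMSELoss(student)

trainer = CrossEncoderTrainer(model=student, train_dataset=train_dataset, loss=loss)
trainer.train()
```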

docs/sentence_transformer/loss_overview.md

Lines changed: 8 additions & 8 deletions
@@ -37,14 +37,14 @@ For example, models trained with <a href="../package_reference/sentence_transfor
 These loss functions are specifically designed to be used when distilling the knowledge from one model into another.
 For example, when finetuning a small model to behave more like a larger & stronger one, or when finetuning a model to become multi-lingual.
 
-| Texts | Labels | Appropriate Loss Functions |
-|----------------------------------------------|---------------------------------------------------------------|---------------------------------------------------------------------------------------------------|
-| `sentence` | `model sentence embeddings` | <a href="../package_reference/sentence_transformer/losses.html#mseloss">`MSELoss`</a> |
-| `sentence_1, sentence_2, ..., sentence_N` | `model sentence embeddings` | <a href="../package_reference/sentence_transformer/losses.html#mseloss">`MSELoss`</a> |
-| `(query, passage_one, passage_two) triplets` | `gold_sim(query, passage_one) - gold_sim(query, passage_two)` | <a href="../package_reference/sentence_transformer/losses.html#marginmseloss">`MarginMSELoss`</a> |
-| `(query, positive, negative_1, ..., negative_n)` | `[gold_sim(query, positive) - gold_sim(query, negative_i) for i in 1..n]` | <a href="../package_reference/sentence_transformer/losses.html#marginmseloss">`MarginMSELoss`</a> |
-| `(query, positive, negative)` | `[gold_sim(query, positive), gold_sim(query, negative)]` | <a href="../package_reference/sentence_transformer/losses.html#distilkldivloss">`DistillKLDivLoss`</a> |
-| `(query, positive, negative_1, ..., negative_n) ` | `[gold_sim(query, positive), gold_sim(query, negative_i)...] ` | <a href="../package_reference/sentence_transformer/losses.html#distilkldivloss">`DistillKLDivLoss`</a> |
+| Texts | Labels | Appropriate Loss Functions |
+|---------------------------------------------------|---------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `sentence` | `model sentence embeddings` | <a href="../package_reference/sentence_transformer/losses.html#mseloss">`MSELoss`</a> |
+| `(sentence_1, sentence_2, ..., sentence_N)` | `model sentence embeddings` | <a href="../package_reference/sentence_transformer/losses.html#mseloss">`MSELoss`</a> |
+| `(query, passage_one, passage_two)` | `gold_sim(query, passage_one) - gold_sim(query, passage_two)` | <a href="../package_reference/sentence_transformer/losses.html#marginmseloss">`MarginMSELoss`</a> |
+| `(query, positive, negative_1, ..., negative_n)` | `[gold_sim(query, positive) - gold_sim(query, negative_i) for i in 1..n]` | <a href="../package_reference/sentence_transformer/losses.html#marginmseloss">`MarginMSELoss`</a> |
+| `(query, positive, negative)` | `[gold_sim(query, positive), gold_sim(query, negative)]` | <a href="../package_reference/sentence_transformer/losses.html#distillkldivloss">`DistillKLDivLoss`</a><br><a href="../package_reference/sentence_transformer/losses.html#marginmseloss">`MarginMSELoss`</a> |
+| `(query, positive, negative_1, ..., negative_n) ` | `[gold_sim(query, positive), gold_sim(query, negative_i)...] ` | <a href="../package_reference/sentence_transformer/losses.html#distillkldivloss">`DistillKLDivLoss`</a><br><a href="../package_reference/sentence_transformer/losses.html#marginmseloss">`MarginMSELoss`</a> |
 
 ## Commonly used Loss Functions
 In practice, not all loss functions get used equally often. The most common scenarios are:
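
The difference between the two rows that now list both losses: `DistillKLDivLoss` consumes the raw teacher scores `[gold_sim(query, positive), gold_sim(query, negative)]`, while `MarginMSELoss` would take their difference. A minimal sketch of the KL-divergence variant, assuming the `SentenceTransformerTrainer` API (v3+); the model name, texts, and scores are illustrative.

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

# Toy example: the label holds the teacher's scores for the
# positive and the negative, in that order.
train_dataset = Dataset.from_dict({
    "query": ["how do I bake bread?"],
    "positive": ["Knead the dough, let it rise, then bake at 220C."],
    "negative": ["Paris is the capital of France."],
    "label": [[8.7, 1.2]],  # hypothetical [gold_sim(q, pos), gold_sim(q, neg)]
})

student = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
loss = losses.DistillKLDivLoss(student)

trainer = SentenceTransformerTrainer(model=student, train_dataset=train_dataset, loss=loss)
trainer.train()
```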

docs/sparse_encoder/loss_overview.md

Lines changed: 10 additions & 11 deletions
@@ -12,7 +12,7 @@
 
 The <a href="../package_reference/sparse_encoder/losses.html#spladeloss"><code>SpladeLoss</code></a> implements a specialized loss function for SPLADE (Sparse Lexical and Expansion) models. It combines a main loss function with regularization terms to control efficiency:
 
-- Supports all the losses mention below as main loss but three principal loss types: <a href="../package_reference/sparse_encoder/losses.html#sparsemultiplenegativesrankingloss"><code>SparseMultipleNegativesRankingLoss</code></a>, <a href="../package_reference/sparse_encoder/losses.html#sparsemarginmseloss"><code>SparseMarginMSELoss</code></a> and <a href="../package_reference/sparse_encoder/losses.html#sparsedistilkldivloss"><code>SparseDistillKLDivLoss</code></a>.
+- Supports all the losses mention below as main loss but three principal loss types: <a href="../package_reference/sparse_encoder/losses.html#sparsemultiplenegativesrankingloss"><code>SparseMultipleNegativesRankingLoss</code></a>, <a href="../package_reference/sparse_encoder/losses.html#sparsemarginmseloss"><code>SparseMarginMSELoss</code></a> and <a href="../package_reference/sparse_encoder/losses.html#sparsedistillkldivloss"><code>SparseDistillKLDivLoss</code></a>.
 - Uses <a href="../package_reference/sparse_encoder/losses.html#flopsloss"><code>FlopsLoss</code></a> for regularization to control sparsity by default, but supports custom regularizers.
 - Balances effectiveness (via the main loss) with efficiency by regularizing both query and document representations.
 - Allows using different regularizers for queries and documents via the `query_regularizer` and `document_regularizer` parameters, enabling fine-grained control over sparsity patterns for different types of inputs.
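
A minimal sketch of the wrapping this hunk describes, assuming the v5 `SparseEncoderTrainer` API; the checkpoint and the regularizer-weight parameter names (`query_regularizer_weight`, `document_regularizer_weight`) are assumptions for illustration, not values from this commit.

```python
from datasets import Dataset
from sentence_transformers import SparseEncoder, SparseEncoderTrainer
from sentence_transformers.sparse_encoder.losses import (
    SpladeLoss,
    SparseMultipleNegativesRankingLoss,
)

# Unlabeled (anchor, positive) pairs; in-batch negatives supply the contrast.
train_dataset = Dataset.from_dict({
    "query": ["how do I bake bread?", "what is the capital of France?"],
    "document": ["Knead the dough, let it rise, then bake at 220C.",
                 "Paris is the capital of France."],
})

model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
loss = SpladeLoss(
    model=model,
    loss=SparseMultipleNegativesRankingLoss(model),  # main effectiveness loss
    query_regularizer_weight=5e-5,     # FlopsLoss strength on queries (assumed name)
    document_regularizer_weight=3e-5,  # FlopsLoss strength on documents (assumed name)
)

trainer = SparseEncoderTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```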
@@ -51,23 +51,22 @@ Loss functions play a critical role in the performance of your fine-tuned model.
 ## Distillation
 These loss functions are specifically designed to be used when distilling the knowledge from one model into another. This is rather commonly used when training Sparse embedding models.
 
-| Texts | Labels | Appropriate Loss Functions |
-|---------------------------------------------------|---------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| `sentence` | `model sentence embeddings` | <a href="../package_reference/sparse_encoder/losses.html#sparsemseloss">`SparseMSELoss`</a> |
-| `sentence_1, sentence_2, ..., sentence_N` | `model sentence embeddings` | <a href="../package_reference/sparse_encoder/losses.html#sparsemseloss">`SparseMSELoss`</a> |
-| `(query, passage_one, passage_two) triplets` | `gold_sim(query, passage_one) - gold_sim(query, passage_two)` | <a href="../package_reference/sparse_encoder/losses.html#sparsemarginmseloss">`SparseMarginMSELoss`</a> |
-| `(query, positive, negative) triplets` | `[gold_sim(query, positive), gold_sim(query, negative)]` | <a href="../package_reference/sparse_encoder/losses.html#sparsedistilkldivloss">`SparseDistillKLDivLoss`</a><br><a href="../package_reference/sparse_encoder/losses.html#sparsemarginmseloss">`SparseMarginMSELoss`</a> |
-| `(query, positive, negative_1, ..., negative_n)` | `[gold_sim(query, positive) - gold_sim(query, negative_i) for i in 1..n]` | <a href="../package_reference/sparse_encoder/losses.html#sparsemarginmseloss">`SparseMarginMSELoss`</a> |
-| `(query, positive, negative_1, ..., negative_n) ` | `[gold_sim(query, positive), gold_sim(query, negative_i)...] ` | <a href="../package_reference/sparse_encoder/losses.html#sparsedistilkldivloss">`SparseDistillKLDivLoss`</a><br><a href="../package_reference/sparse_encoder/losses.html#sparsemarginmseloss">`SparseMarginMSELoss`</a> |
-
+| Texts | Labels | Appropriate Loss Functions |
+|---------------------------------------------------|---------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `sentence` | `model sentence embeddings` | <a href="../package_reference/sparse_encoder/losses.html#sparsemseloss">`SparseMSELoss`</a> |
+| `(sentence_1, sentence_2, ..., sentence_N)` | `model sentence embeddings` | <a href="../package_reference/sparse_encoder/losses.html#sparsemseloss">`SparseMSELoss`</a> |
+| `(query, passage_one, passage_two)` | `gold_sim(query, passage_one) - gold_sim(query, passage_two)` | <a href="../package_reference/sparse_encoder/losses.html#sparsemarginmseloss">`SparseMarginMSELoss`</a> |
+| `(query, positive, negative_1, ..., negative_n)` | `[gold_sim(query, positive) - gold_sim(query, negative_i) for i in 1..n]` | <a href="../package_reference/sparse_encoder/losses.html#sparsemarginmseloss">`SparseMarginMSELoss`</a> |
+| `(query, positive, negative)` | `[gold_sim(query, positive), gold_sim(query, negative)]` | <a href="../package_reference/sparse_encoder/losses.html#sparsedistillkldivloss">`SparseDistillKLDivLoss`</a><br><a href="../package_reference/sparse_encoder/losses.html#sparsemarginmseloss">`SparseMarginMSELoss`</a> |
+| `(query, positive, negative_1, ..., negative_n) ` | `[gold_sim(query, positive), gold_sim(query, negative_i)...] ` | <a href="../package_reference/sparse_encoder/losses.html#sparsedistillkldivloss">`SparseDistillKLDivLoss`</a><br><a href="../package_reference/sparse_encoder/losses.html#sparsemarginmseloss">`SparseMarginMSELoss`</a> |
 
 ## Commonly used Loss Functions
 
 In practice, not all loss functions get used equally often. The most common scenarios are:
 
 * `(anchor, positive) pairs` without any labels: <a href="../package_reference/sparse_encoder/losses.html#sparsemultiplenegativesrankingloss"><code>SparseMultipleNegativesRankingLoss</code></a> (a.k.a. InfoNCE or in-batch negatives loss) is commonly used to train the top performing embedding models. This data is often relatively cheap to obtain, and the models are generally very performant. Here for our sparse retrieval tasks, this format works well with <a href="../package_reference/sparse_encoder/losses.html#spladeloss"><code>SpladeLoss</code></a> or <a href="../package_reference/sparse_encoder/losses.html#csrloss"><code>CSRLoss</code></a>, both typically using InfoNCE as their underlying loss function.
 
-* `(query, positive, negative_1, ..., negative_n)` format: This structure with multiple negatives is particularly effective with <a href="../package_reference/sparse_encoder/losses.html#spladeloss"><code>SpladeLoss</code></a> configured with <a href="../package_reference/sparse_encoder/losses.html#sparsemarginmseloss"><code>SparseMarginMSELoss</code></a>, especially in knowledge distillation scenarios where a teacher model provides similarity scores. The strongest models are trained with distillation losses like <a href="../package_reference/sparse_encoder/losses.html#sparsedistilkldivloss"><code>SparseDistillKLDivLoss</code></a> or <a href="../package_reference/sparse_encoder/losses.html#sparsemarginmseloss"><code>SparseMarginMSELoss</code></a>.
+* `(query, positive, negative_1, ..., negative_n)` format: This structure with multiple negatives is particularly effective with <a href="../package_reference/sparse_encoder/losses.html#spladeloss"><code>SpladeLoss</code></a> configured with <a href="../package_reference/sparse_encoder/losses.html#sparsemarginmseloss"><code>SparseMarginMSELoss</code></a>, especially in knowledge distillation scenarios where a teacher model provides similarity scores. The strongest models are trained with distillation losses like <a href="../package_reference/sparse_encoder/losses.html#sparsedistillkldivloss"><code>SparseDistillKLDivLoss</code></a> or <a href="../package_reference/sparse_encoder/losses.html#sparsemarginmseloss"><code>SparseMarginMSELoss</code></a>.
 
 ## Custom Loss Functions
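
Tying the two sparse-encoder hunks together: a distillation sketch pairing `SpladeLoss` with `SparseMarginMSELoss` on the `(query, positive, negative_1, ..., negative_n)` format, as the last bullet recommends. Same caveats as above: v5 API assumed, checkpoint, data, and weights hypothetical.

```python
from datasets import Dataset
from sentence_transformers import SparseEncoder, SparseEncoderTrainer
from sentence_transformers.sparse_encoder.losses import SpladeLoss, SparseMarginMSELoss

# Labels are teacher margins:
# [gold_sim(query, positive) - gold_sim(query, negative_i) for i in 1..n]
train_dataset = Dataset.from_dict({
    "query": ["how do I bake bread?"],
    "positive": ["Knead the dough, let it rise, then bake at 220C."],
    "negative_1": ["Paris is the capital of France."],
    "negative_2": ["Stock markets fell sharply on Monday."],
    "label": [[7.2, 8.1]],  # hypothetical teacher margins
})

model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
loss = SpladeLoss(
    model=model,
    loss=SparseMarginMSELoss(model),   # distillation main loss
    query_regularizer_weight=5e-5,     # assumed parameter names, as above
    document_regularizer_weight=3e-5,
)

trainer = SparseEncoderTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```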
