Fix error (also in the original Facebook XLM model): scale the qk.T dot product, not only the q matrix (qk.T / sqrt(dim_per_head))

BenoitDalFerro · web-flow · commit c78e320b4a06 · 2023-02-14T15:54:31.000+01:00
As per Vaswani et al., 2017, p. 4 - https://arxiv.org/pdf/1912.05372.pdf
The scaling should be torch.matmul(q, k.transpose(2, 3)) / math.sqrt(dim_per_head), not q / math.sqrt(dim_per_head).
The current code effectively scales the queries only, not the query-key dot product as it should.
Mentioned in:
- original facebookresearch/XLM#357
- dependent original FlauBERT getalp/Flaubert@6d17688
- dependent Huggingface FlauBERT huggingface#21627
diff --git a/src/transformers/models/xlm/modeling_xlm.py b/src/transformers/models/xlm/modeling_xlm.py
@@ -176,8 +176,8 @@ def unshape(x):
                     k, v = cache[self.layer_id]
             cache[self.layer_id] = (k, v)
 
-        q = q / math.sqrt(dim_per_head)  # (bs, n_heads, qlen, dim_per_head)
         scores = torch.matmul(q, k.transpose(2, 3))  # (bs, n_heads, qlen, klen)
+        scores = scores / math.sqrt(dim_per_head) # (bs, n_heads, qlen, klen)
         mask = (mask == 0).view(mask_reshape).expand_as(scores)  # (bs, n_heads, qlen, klen)
         scores.masked_fill_(mask, torch.finfo(scores.dtype).min)  # (bs, n_heads, qlen, klen)
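
For reviewers who want to compare the two orderings touched by this diff, here is a minimal pure-Python sketch for a single attention head. It is not part of modeling_xlm.py: the helper names (`matmul_qk`, `scores_scale_after`, `scores_scale_before`, `max_abs_diff`) and the toy inputs are hypothetical and chosen only for illustration. `scores_scale_after` follows the order written in the formula, (q @ k^T) / sqrt(dim_per_head), while `scores_scale_before` follows the removed line. Since dividing by a scalar commutes with matrix multiplication, the two should differ by floating-point rounding at most.

```python
import math


def matmul_qk(q, k):
    """Return q @ k^T for q of shape (qlen, d) and k of shape (klen, d)."""
    return [[sum(qi * ki for qi, ki in zip(q_row, k_row)) for k_row in k]
            for q_row in q]


def scores_scale_after(q, k):
    """Scale the dot product: (q @ k^T) / sqrt(d), as in the '+' line."""
    d = len(q[0])
    return [[s / math.sqrt(d) for s in row] for row in matmul_qk(q, k)]


def scores_scale_before(q, k):
    """Scale the queries first: (q / sqrt(d)) @ k^T, as in the '-' line."""
    d = len(q[0])
    q_scaled = [[x / math.sqrt(d) for x in row] for row in q]
    return matmul_qk(q_scaled, k)


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two score matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Toy single-head inputs (hypothetical values): qlen=2, klen=3, dim_per_head=4.
q = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]]
k = [[1.0, 0.0, 1.0, 0.0], [0.0, 2.0, 0.0, 2.0], [1.0, 1.0, 1.0, 1.0]]

after = scores_scale_after(q, k)
before = scores_scale_before(q, k)
gap = max_abs_diff(after, before)
print(after)
print(before)
print("max abs difference:", gap)
```

The same comparison could be run on real tensors with `torch.matmul`, which is what the patched code uses. The sketch avoids that dependency so it runs with the standard library alone.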