Hi,
I am using part of your code for a particular implementation of a transformer architecture that I need for my master's thesis research in RL. I noticed in the original paper (Parisotto et al., 2019) that the authors re-order the LayerNorms so that they are placed at the input of both the multi-head attention and the feed-forward sub-modules. I saw that you also implement this in your code, via the `config["layer_norm"]` setting.

However, the paper also mentions, and I quote: "Because the layer norm reordering causes a path where two linear layers are applied in sequence, we apply a ReLU activation to each sub-module output before the residual connection (see Appendix C for equations)." In those equations they indeed apply a ReLU to the output of both the multi-head attention and the feed-forward sub-modules before performing the residual connection. I did not see that specific step in your code (just the standard residual connection), so I wonder whether there is a particular reason for that, or whether I am missing something (I'm still quite new to these implementations).

In any case, congratulations on your great work; it is helping me a lot to understand the inner workings of such architectures. Thanks!
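For reference, here is a minimal PyTorch sketch of how I understand that passage of the paper (the class and parameter names are just placeholders of mine, not taken from your repo): LayerNorm is moved to each sub-module's input, and a ReLU is applied to each sub-module's output right before the residual add.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReorderedTransformerBlock(nn.Module):
    """Hypothetical sketch of the re-ordered block described in the paper
    (Appendix C): LayerNorm at the sub-module inputs, ReLU on each
    sub-module output before the residual connection."""

    def __init__(self, dim: int, num_heads: int, ff_dim: int):
        super().__init__()
        self.norm_attn = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_ff = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, ff_dim), nn.ReLU(), nn.Linear(ff_dim, dim)
        )

    def forward(self, x: torch.Tensor, attn_mask=None) -> torch.Tensor:
        # Multi-head attention sub-module: LayerNorm applied to the input ...
        h = self.norm_attn(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask)
        # ... and ReLU applied to the output before the residual connection.
        x = x + F.relu(attn_out)

        # Feed-forward sub-module, same pattern.
        h = self.norm_ff(x)
        x = x + F.relu(self.ff(h))
        return x


# Quick shape check with dummy data
block = ReorderedTransformerBlock(dim=64, num_heads=4, ff_dim=256)
out = block(torch.randn(2, 10, 64))  # (batch, seq, dim)
```

Is the `x + F.relu(attn_out)` / `x + F.relu(self.ff(h))` step the part that is intentionally left out in your implementation, or am I misreading the code?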