@Nandan91 @rajatsaini0294 Hi,
For each subspace, the input is HxWxG. It passes through DW + MaxPool + PW, producing an intermediate attention map of size HxWx1, which then goes through Softmax + Expand to give the final HxWxG attention map.
Since the output dimension of the PW operation is 1, the final attention map is effectively a single weight map shared by all G channels. Why use this PW at all? Why is it designed so that all channels share one weight?
If the PW operation were removed, i.e. the output of the MaxPool were treated directly as the attention map, then every spatial position and every channel would have its own independent weight. Why not design it this way? (A minimal sketch of both variants is given below.)
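For concreteness, here is a minimal PyTorch sketch of the two variants I am comparing: the described pipeline (DW -> MaxPool -> PW -> Softmax -> Expand), where all G channels of a subspace share one attention map, and the alternative without the PW step, where each channel keeps its own map. Kernel sizes, padding, and the final multiplication with the input are my assumptions for illustration, not necessarily the exact implementation in this repo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SubspaceAttentionShared(nn.Module):
    """Sketch of the described pipeline: the PW conv collapses the G channels
    to a single HxW map, which is then softmax-normalized and expanded,
    so all channels share one attention map."""

    def __init__(self, g: int):
        super().__init__()
        self.dw = nn.Conv2d(g, g, kernel_size=1, groups=g)       # DW (assumed 1x1)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)  # keeps HxW (assumed)
        self.pw = nn.Conv2d(g, 1, kernel_size=1)                  # PW: G -> 1 channel

    def forward(self, x):                          # x: (B, G, H, W)
        b, g, h, w = x.shape
        a = self.pw(self.pool(self.dw(x)))         # (B, 1, H, W)
        a = F.softmax(a.view(b, 1, -1), dim=-1).view(b, 1, h, w)
        a = a.expand(-1, g, -1, -1)                # shared across all G channels
        return a * x                               # applying the map (assumed)


class SubspaceAttentionPerChannel(nn.Module):
    """Alternative raised in the question: drop the PW conv and use the pooled
    (B, G, H, W) tensor directly, so every channel gets its own attention map."""

    def __init__(self, g: int):
        super().__init__()
        self.dw = nn.Conv2d(g, g, kernel_size=1, groups=g)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

    def forward(self, x):                          # x: (B, G, H, W)
        b, g, h, w = x.shape
        a = self.pool(self.dw(x))                  # (B, G, H, W), no collapse to 1 channel
        a = F.softmax(a.view(b, g, -1), dim=-1).view(b, g, h, w)
        return a * x                               # independent weight per channel
```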
Many thanks!