Discussion about this post

User's avatar
The AI Architect's avatar

Great curation of papers here. The outlier rescaling framing makes a lot of sense when you consider how normalization layers are basically learning to route signal dynamically. We ran into similr behavior when experimenting with quantization last year, and it was alaways unclear whether to treat it as a bug or feature until seeing work like this.

No posts

Ready for more?