
Autoregressive Transformers have taken over the world of language modeling (GPT-3). However, in order to train them, people use causal masking and sample parallelism, which means computation only happens in a feedforward manner. As a result, higher-layer information that would in principle be available is never used in the lower layers of subsequent tokens, which limits the computational capabilities of the overall model. Feedback Transformers trade off training speed for access to these representations and demonstrate remarkable improvements in complex reasoning and long-range dependency tasks.
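
To make the idea concrete, here is a minimal sketch (not the authors' code) of the feedback-memory mechanism: all layer outputs of a past token are mixed via learned softmax weights into a single memory vector, and every layer of the current token attends to those memory vectors instead of to same-layer states. The class name, dimensions, and the placement of normalization are illustrative assumptions; the explicit loop over timesteps also shows why training can no longer be parallelized across positions.

```python
import torch
import torch.nn as nn

class FeedbackTransformerSketch(nn.Module):
    """Toy sketch of feedback memory (illustrative, not the paper's implementation):
    one shared memory vector per past token mixes that token's outputs from ALL
    layers, and every layer of the current token attends to these memory vectors."""

    def __init__(self, d_model=64, n_layers=4, n_heads=4):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.ModuleDict({
                "attn":  nn.MultiheadAttention(d_model, n_heads, batch_first=True),
                "ff":    nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                       nn.Linear(4 * d_model, d_model)),
                "norm1": nn.LayerNorm(d_model),
                "norm2": nn.LayerNorm(d_model),
            }) for _ in range(n_layers)
        ])
        # Learned softmax weights that mix all layer states into one memory vector.
        self.layer_mix = nn.Parameter(torch.zeros(n_layers + 1))

    def forward(self, x):                                # x: (batch, seq, d_model)
        B, T, D = x.shape
        memory, outputs = [], []
        for t in range(T):                               # sequential: no timestep parallelism
            h = x[:, t:t + 1, :]                         # current token, (B, 1, D)
            states = [h]
            for blk in self.blocks:
                # keys/values: memory vectors of past tokens plus the current state
                kv = torch.cat(memory + [h], dim=1)      # (B, t + 1, D)
                a, _ = blk["attn"](blk["norm1"](h), kv, kv)
                h = h + a
                h = h + blk["ff"](blk["norm2"](h))
                states.append(h)
            outputs.append(h)
            # Compress every layer's state of token t into a single memory vector,
            # so lower layers of later tokens can see higher-layer information.
            w = torch.softmax(self.layer_mix, dim=0)
            m = (torch.stack(states, dim=0) * w.view(-1, 1, 1, 1)).sum(dim=0)
            memory.append(m)
        return torch.cat(outputs, dim=1)                 # (B, T, D)

if __name__ == "__main__":
    model = FeedbackTransformerSketch()
    print(model(torch.randn(2, 10, 64)).shape)           # torch.Size([2, 10, 64])
```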
https://youtu.be/zdb8MM94A5c
https://arxiv.org/abs/2002.09402