Sparse Universal Transformer | Cool Papers

#1 Sparse Universal Transformer

Authors: Shawn Tan ; Yikang Shen ; Zhenfang Chen ; Aaron Courville ; Chuang Gan

The Universal Transformer (UT) is a variant of the Transformer that shares parameters across its layers and is Turing-complete under certain assumptions. Empirical evidence also shows that UTs have better compositional generalization than Vanilla Transformers (VTs) in formal language tasks. Parameter sharing also makes UTs more parameter-efficient than VTs. Despite these advantages, most state-of-the-art NLP systems use VTs rather than UTs as their backbone model, mainly because scaling up a UT's parameters is more compute- and memory-intensive than scaling up a VT. This paper proposes the Sparse Universal Transformer (SUT), which leverages Sparse Mixture of Experts (SMoE) to reduce the UT's computational complexity while retaining its parameter efficiency and generalization ability. Experiments show that SUT combines the best of both worlds, achieving strong generalization results on formal language tasks (logical inference and CFQ) and impressive parameter and computation efficiency on standard natural language benchmarks such as WMT’14.
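The two ideas the abstract combines, one layer reused at every depth step and sparse routing to a few experts per step, can be illustrated with a small toy sketch. This is not the authors' implementation: the class `SharedSparseLayer`, the helpers `make_expert`, `top_k_route`, and `run`, the tanh experts, and the softmax-over-top-k gating are all hypothetical choices made for illustration, and attention and anything else not stated in the abstract are omitted. It operates on a single token vector using only the Python standard library.

```python
import math
import random


def make_expert(dim, rng):
    """Hypothetical expert: one dim x dim linear map followed by tanh."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]

    def expert(x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

    return expert


def top_k_route(scores, k):
    """Return (index, weight) pairs for the k highest scores.

    Weights are a softmax over the selected scores only (an assumption).
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


class SharedSparseLayer:
    """One layer whose parameters are reused at every depth step (UT-style),
    with only k of num_experts experts evaluated per step (SMoE-style)."""

    def __init__(self, dim, num_experts, k, seed=0):
        rng = random.Random(seed)
        self.k = k
        self.experts = [make_expert(dim, rng) for _ in range(num_experts)]
        # Router: one score vector per expert.
        self.router = [[rng.uniform(-1, 1) for _ in range(dim)]
                       for _ in range(num_experts)]
        self.calls = 0  # number of expert evaluations performed so far

    def __call__(self, x):
        scores = [sum(w * xi for w, xi in zip(row, x)) for row in self.router]
        out = list(x)  # residual connection
        for idx, weight in top_k_route(scores, self.k):
            self.calls += 1
            y = self.experts[idx](x)
            out = [o + weight * yi for o, yi in zip(out, y)]
        return out


def run(layer, x, depth):
    """Apply the same layer object `depth` times: parameters are shared across depth."""
    for _ in range(depth):
        x = layer(x)
    return x


if __name__ == "__main__":
    layer = SharedSparseLayer(dim=4, num_experts=8, k=2)
    out = run(layer, [0.1, 0.2, 0.3, 0.4], depth=6)
    print(len(out), layer.calls)
```

In this toy setup the layer holds 8 experts' worth of parameters, but each depth step evaluates only 2 of them, so per-step compute depends on `k` rather than on the total number of experts.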

2023.emnlp-main.12@ACL
