Simons Institute | Re-thinking Transformers: Searching for Efficient Linear Layers over a Continuous Space of...
Andrew Gordon Wilson (New York University)
https://simons.berkeley.edu/talks/andrew-gordon-wilson-new-york-university-2024-09-27
Transformers as a Computational Model

Dense linear layers are the dominant computational bottleneck in large neural networks, creating a critical need for more efficient alternatives. Previous efforts have focused on a small number of hand-crafted structured matrices and have neglected to investigate whether these structures can surpass dense layers in terms of compute-optimal scaling laws, where both model size and the number of training examples are optimally allocated. In this work, we present a unifying framework that enables searching among all linear operators expressible via an Einstein summation. This framework encompasses many previously proposed structures, such as low-rank, Kronecker, Tensor-Train, and Monarch, along with many novel structures. We develop a taxonomy of all such operators based on their computational and algebraic properties, which provides insights into their scaling laws. Combining these insights with empirical evaluation, we identify a subset of structures that achieve better performance than dense layers as a function of training compute. To further improve their compute efficiency, we develop a natural extension of these structures that converts them into a sparse mixture-of-experts layer. The resulting layer significantly outperforms dense layers in compute-optimal training efficiency for large language models.
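As a rough illustration of what "a linear operator expressible via an Einstein summation" means, the sketch below implements a Kronecker-structured layer with torch.einsum, one of the structured families mentioned in the abstract. The class name, shapes, and initialization are assumptions for illustration only, not the authors' implementation.

```python
# Minimal sketch (assumed, not the authors' code): a Kronecker-structured
# linear layer W = A ⊗ B expressed as a single einsum, avoiding the dense
# (d_in1*d_in2) x (d_out1*d_out2) matrix entirely.
import torch
import torch.nn as nn


class KroneckerLinear(nn.Module):
    def __init__(self, d_in1, d_in2, d_out1, d_out2):
        super().__init__()
        self.d_in = (d_in1, d_in2)
        # Two small factors replace one large dense weight matrix.
        self.A = nn.Parameter(torch.randn(d_out1, d_in1) / d_in1**0.5)
        self.B = nn.Parameter(torch.randn(d_out2, d_in2) / d_in2**0.5)

    def forward(self, x):
        # x: (batch, d_in1 * d_in2) -> expose the two factored axes.
        b = x.shape[0]
        x = x.view(b, *self.d_in)
        # y[b,o,p] = sum_{i,j} A[o,i] * B[p,j] * x[b,i,j], i.e. (A ⊗ B) vec(x).
        y = torch.einsum("oi,pj,bij->bop", self.A, self.B, x)
        return y.reshape(b, -1)


# Usage: a 1024 -> 1024 map with two 32x32 factors (~2K parameters)
# in place of a dense layer (~1M parameters).
layer = KroneckerLinear(32, 32, 32, 32)
out = layer(torch.randn(8, 1024))  # -> shape (8, 1024)
```

Other structures in the framework (low-rank, Tensor-Train, Monarch) correspond to different einsum index patterns and factor shapes over the same kind of factored parameterization.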