Simons Institute | Re-thinking Transformers: Searching for Efficient Linear Layers over a Continuous Space of...
Andrew Gordon Wilson (New York University)
https://simons.berkeley.edu/talks/andrew-gordon-wilson-new-york-university-2024-09-27
Transformers as a Computational Model

Dense linear layers are the dominant computational bottleneck in large neural networks, creating a critical need for more efficient alternatives. Previous efforts have focused on a small number of hand-crafted structured matrices and have neglected to investigate whether these structures can surpass dense layers in terms of compute-optimal scaling laws, where both model size and the number of training examples are optimally allocated. In this work, we present a unifying framework that enables searching among all linear operators expressible via an Einstein summation. This framework encompasses many previously proposed structures, such as low-rank, Kronecker, Tensor-Train, and Monarch, along with many novel structures. We develop a taxonomy of all such operators based on their computational and algebraic properties, which provides insights into their scaling laws. Combining these insights with empirical evaluation, we identify a subset of structures that achieve better performance than dense layers as a function of training compute. To further improve their compute efficiency, we develop a natural extension of these structures that converts them into a sparse mixture-of-experts layer. The resulting layer significantly outperforms dense layers in compute-optimal training efficiency for large language models.
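As a rough illustration of what "a linear operator expressible via an Einstein summation" means, the sketch below implements a Kronecker-structured layer with torch.einsum, one of the structured families mentioned in the abstract. The class name, shapes, and initialization are assumptions for illustration only, not the authors' implementation.

```python
# Minimal sketch (assumed, not the authors' code): a Kronecker-structured
# linear layer W = A ⊗ B expressed as a single einsum, avoiding the dense
# (d_in1*d_in2) x (d_out1*d_out2) matrix entirely.
import torch
import torch.nn as nn


class KroneckerLinear(nn.Module):
    def __init__(self, d_in1, d_in2, d_out1, d_out2):
        super().__init__()
        self.d_in = (d_in1, d_in2)
        # Two small factors replace one large dense weight matrix.
        self.A = nn.Parameter(torch.randn(d_out1, d_in1) / d_in1**0.5)
        self.B = nn.Parameter(torch.randn(d_out2, d_in2) / d_in2**0.5)

    def forward(self, x):
        # x: (batch, d_in1 * d_in2) -> expose the two factored axes.
        b = x.shape[0]
        x = x.view(b, *self.d_in)
        # y[b,o,p] = sum_{i,j} A[o,i] * B[p,j] * x[b,i,j], i.e. (A ⊗ B) vec(x).
        y = torch.einsum("oi,pj,bij->bop", self.A, self.B, x)
        return y.reshape(b, -1)


# Usage: a 1024 -> 1024 map with two 32x32 factors (~2K parameters)
# in place of a dense layer (~1M parameters).
layer = KroneckerLinear(32, 32, 32, 32)
out = layer(torch.randn(8, 1024))  # -> shape (8, 1024)
```

Other structures in the framework (low-rank, Tensor-Train, Monarch) correspond to different einsum index patterns and factor shapes over the same kind of factored parameterization.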