Simons Institute | ML Efficiency for Large Models: From Data Efficiency to Faster Transformers
Vahab Mirrokni (Google Research, NYC)
https://simons.berkeley.edu/talks/vahab-mirrokni-google-research-nyc-2024-06-18
ML Efficiency for Large Models: From Data Efficiency to Faster Transformers
Scaling large models efficiently for faster training and inference is a fundamental challenge. In this talk, we present a number of algorithmic challenges and potential solutions, from theory to practice. First, we discuss data-efficiency and model-efficiency problems that can be formalized as subset selection. For model efficiency, we present sequential attention for feature selection and sparsification [ICLR'23, arXiv]. For data efficiency, we present a sensitivity-sampling technique that improves both the quality and the efficiency of the models [ICML'24]. Furthermore, we discuss the intrinsic quadratic complexity of attention models, as well as of token generation. We first discuss HyperAttention, a technique for developing linear-time attention algorithms under mild assumptions [ICLR'24]. We then present PolySketchFormer, a technique that bypasses the hardness results for achieving sub-quadratic attention by applying sketching to polynomial functions [ICML'24]. Finally, we show how to address the complexity of token generation via clustering techniques [arXiv].
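To illustrate the sub-quadratic idea behind polynomial attention: replacing the softmax score with a degree-p polynomial (q·k)^p makes the score an inner product of lifted feature vectors, so matrix associativity lets attention be computed without ever forming the n×n score matrix. The toy sketch below (not the PolySketchFormer implementation itself, which additionally sketches the polynomial feature map to keep its dimension small) uses an exact tensor-product feature map, here called `phi`, to show that the O(n·d^p) route matches the naive O(n²·d) computation.

```python
import numpy as np

def quadratic_poly_attention(Q, K, V, p=2):
    # Naive polynomial attention: materializes the n x n score matrix,
    # costing O(n^2 d) time and O(n^2) memory.
    S = (Q @ K.T) ** p
    return (S / S.sum(axis=1, keepdims=True)) @ V

def linear_poly_attention(Q, K, V, p=2):
    # Sub-quadratic route: lift each row to the degree-p tensor-product
    # features phi(x), so that phi(q) . phi(k) = (q . k)^p, then use
    # associativity: phi(Q) @ (phi(K).T @ V) costs O(n d^p), never n x n.
    def phi(X):
        feats = X
        for _ in range(p - 1):
            # Outer product with X raises the polynomial degree by one.
            feats = np.einsum('ni,nj->nij', feats, X).reshape(X.shape[0], -1)
        return feats

    Qp, Kp = phi(Q), phi(K)
    num = Qp @ (Kp.T @ V)            # numerator, d^p x d intermediate
    den = Qp @ Kp.sum(axis=0)        # matching row normalizer
    return num / den[:, None]

rng = np.random.default_rng(0)
n, d = 64, 8
Q, K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d))
assert np.allclose(quadratic_poly_attention(Q, K, V),
                   linear_poly_attention(Q, K, V))
```

With an even degree p the scores are nonnegative, so the row normalizer plays the role softmax plays in standard attention; sketching phi (as in the talk) would trade this exact d^p-dimensional lift for a low-dimensional approximation.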