<node id="670854">
  <nid>670854</nid>
  <type>event</type>
  <uid>
    <user id="27707"><![CDATA[27707]]></user>
  </uid>
  <created>1698939984</created>
  <changed>1698939984</changed>
  <title><![CDATA[PhD Defense by Aran Komatsuzaki]]></title>
  <body><![CDATA[<p><strong>Title:</strong> Improving Foundation Models</p>

<p><strong>Date:</strong> Tuesday, November 14th</p>

<p><strong>Time:</strong> 6:30pm EST</p>

<p><strong>Location:</strong> Zoom: <a href="https://gatech.zoom.us/j/96067185652?pwd=MkptcWhRZm5KZ3dpZEQ4ZHpVVlg2dz09">https://gatech.zoom.us/j/96067185652?pwd=MkptcWhRZm5KZ3dpZEQ4ZHpVVlg2dz09</a></p>

<p><strong>Aran Komatsuzaki</strong></p>

<p>Machine Learning Ph.D. Student<br />
School of Mathematics<br />
Georgia Institute of Technology</p>

<p><strong>Committee</strong><br />
Dr. Heinrich Matzinger (Advisor) - School of Mathematics, Georgia Institute of Technology<br />
Dr. Wenjing Liao - School of Mathematics, Georgia Institute of Technology<br />
Dr. Hannah Choi - School of Mathematics, Georgia Institute of Technology<br />
Dr. Mayya Zhilova - School of Mathematics, Georgia Institute of Technology<br />
Dr. Alexander Lerch - School of Music, Georgia Institute of Technology</p>

<p><strong>Abstract</strong><br />
Foundation models (e.g., GPT-4, CLIP) are models trained on massive datasets that can perform a wide range of downstream tasks, usually with zero- or few-shot learning and optionally after fine-tuning. This dissertation presents a series of contributions that make foundation models more efficient, performant, and versatile, focusing on three axes of improvement: architecture, dataset, and training. We first present our findings on how to optimally scale language models, which lead to significant performance improvements. We then present GPT-J, one of the earliest open-source large language models. Next, we show that the performance of ViT and T5, both Transformer-based foundation models, can be greatly improved for a given compute budget using Sparse Upcycling: resuming training of a sparsely gated model initialized from pretrained dense models. We also briefly discuss the LAION datasets, massive open-source datasets of roughly one billion text-image pairs used to train various state-of-the-art multimodal models, and the ARB benchmark, a highly challenging benchmark for evaluating state-of-the-art LLMs such as GPT-4. On the theoretical side, we prove that the feedforward layers of a Transformer cannot be compressed without information loss, which may explain the power of sparsely gated models such as mixture-of-experts.</p>

]]></body>
  <field_summary_sentence>
    <item>
      <value><![CDATA[Improving Foundation Models]]></value>
    </item>
  </field_summary_sentence>
  <field_summary>
    <item>
      <value><![CDATA[<p><strong>Improving Foundation Models</strong></p>
]]></value>
    </item>
  </field_summary>
  <field_time>
    <item>
      <value><![CDATA[2023-11-14T18:30:00-05:00]]></value>
      <value2><![CDATA[2023-11-14T20:00:00-05:00]]></value2>
      <rrule><![CDATA[]]></rrule>
      <timezone><![CDATA[America/New_York]]></timezone>
    </item>
  </field_time>
  <field_fee>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_fee>
  <field_extras>
      </field_extras>
  <field_audience>
          <item>
        <value><![CDATA[Public]]></value>
      </item>
      </field_audience>
  <field_media>
      </field_media>
  <field_contact>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_contact>
  <field_location>
    <item>
      <value><![CDATA[Zoom: ]]></value>
    </item>
  </field_location>
  <field_sidebar>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_sidebar>
  <field_phone>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_phone>
  <field_url>
    <item>
      <url><![CDATA[]]></url>
      <title><![CDATA[]]></title>
            <attributes><![CDATA[]]></attributes>
    </item>
  </field_url>
  <field_email>
    <item>
      <email><![CDATA[]]></email>
    </item>
  </field_email>
  <field_boilerplate>
    <item>
      <nid><![CDATA[]]></nid>
    </item>
  </field_boilerplate>
  <links_related>
      </links_related>
  <files>
      </files>
  <og_groups>
          <item>221981</item>
      </og_groups>
  <og_groups_both>
          <item><![CDATA[Graduate Studies]]></item>
      </og_groups_both>
  <field_categories>
          <item>
        <tid>1788</tid>
        <value><![CDATA[Other/Miscellaneous]]></value>
      </item>
      </field_categories>
  <field_keywords>
          <item>
        <tid>100811</tid>
        <value><![CDATA[PhD Defense]]></value>
      </item>
      </field_keywords>
  <userdata><![CDATA[]]></userdata>
</node>
