I recently completed my PhD in theoretical physics at Harvard, where I was very fortunate to be advised by Cengiz Pehlevan. My research focuses on developing a science of deep learning by leveraging insights from statistical physics, random matrix theory, and extensive empirical studies on real models and datasets. Key objects of study include the neural tangent kernel, the maximal update parameterization (μP), and the dichotomy between rich and lazy training. Lately, we have been thinking a lot about scaling laws. Prior to this, I had the pleasure of collaborating with Andy Strominger and colleagues on topics adjacent to string theory and quantum field theory. My work was funded by the NDSEG Fellowship (2019-2022) and the Hertz Fellowship (2019-2024).
I received my M.S. in mathematics and B.S. in physics from Yale in 2018. There, I did research under David Poland, studying conformal field theories using the bootstrap program, and under Philsang Yoo on the mathematical aspects of quantum field theory and its connection with the Langlands program. Concurrently, I was part of John Murray's lab, where we built tools to study working memory in recurrent neural networks.
Over the course of graduate school, I've consulted as a machine learning scientist for two biotech firms: Protein Evolution and Quantum Si. I've also worked at Jane Street as a quantitative research intern. While an undergrad, I interned at Google, applying deep learning to computer vision for Internet of Things devices. I also worked at the Perimeter Institute under Erik Schnetter on tackling the curse of dimensionality in numerical partial differential equations.
My earliest exposure to scientific research was under Dr. James Ellenbogen at MITRE, and with Drs. John Dell and Jonathan Osborne at Thomas Jefferson High School.
Abstract: This thesis develops a theoretical framework for understanding the scaling properties of information processing systems in the regime of large data, large model size, and large computational resources. The goal is to understand the impressive performance that deep neural networks have exhibited.
The first part of this thesis examines models linear in their parameters but nonlinear in their inputs. This includes linear regression, kernel regression, and random feature models. Utilizing random matrix theory and free probability, I provide precise characterizations of their training dynamics, generalization capabilities, and out-of-distribution performance, alongside a detailed analysis of sources of variance. A variety of scaling laws observed in state-of-the-art large language and vision models are already present in this simple setting.
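To give a flavor of this setting, here is a minimal, self-contained sketch of a random feature regression experiment. It is an illustrative toy of my own (a tanh teacher on Gaussian inputs, frozen ReLU random features, ridge regression), not code or results from the thesis; the point is simply that such a model is nonlinear in its inputs but linear in its fitted weights, and that test error typically improves steadily as the number of random features grows.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test = 50, 4000, 2000

# Toy teacher: a simple nonlinear function of Gaussian inputs (purely illustrative).
w_star = rng.standard_normal(d) / np.sqrt(d)

def f_star(X):
    return np.tanh(X @ w_star)

X_tr = rng.standard_normal((n_train, d))
X_te = rng.standard_normal((n_test, d))
y_tr, y_te = f_star(X_tr), f_star(X_te)

def relu_features(X, W):
    """Random ReLU features: nonlinear in the inputs, linear in the fitted weights."""
    return np.maximum(X @ W, 0.0)

for p in [16, 64, 256, 1024]:
    W = rng.standard_normal((d, p)) / np.sqrt(d)   # frozen random first layer
    Phi = relu_features(X_tr, W)
    # Ridge regression on the random features (modest ridge for numerical stability).
    coef = np.linalg.solve(Phi.T @ Phi + 1.0 * np.eye(p), Phi.T @ y_tr)
    test_mse = np.mean((relu_features(X_te, W) @ coef - y_te) ** 2)
    print(f"p = {p:4d} random features: test MSE = {test_mse:.4f}")
```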
The second part of this thesis focuses on representation learning. Leveraging insights from models linear in inputs but nonlinear in parameters, I present a theory of early-stage representation learning in which a network initialized with small weights can learn features without altering the loss. This phenomenon, termed silent alignment, is empirically validated across various architectures and datasets. The idea of starting at small initialization leads naturally to the maximal update parameterization, μP, which allows for feature learning at infinite width. I present empirical studies showing that practical networks can approach their theoretical infinite-width feature-learning limits. Finally, I consider down-scaling the output of a neural network by a fixed constant. When this constant is small, the network behaves as a linear model in parameters; when large, it induces silent alignment. I present theoretical and empirical results on the influence of this hyperparameter on feature learning, performance, and dynamics.
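As a rough illustration of the last point, the sketch below (again a toy of my own, not the thesis setup) trains a small two-layer ReLU network whose centered output is divided by a constant gamma, with the learning rate rescaled by gamma squared so that different values of gamma train at a comparable speed in function space. One would expect the first-layer weights to barely move when gamma is small (the lazy, effectively linear regime) and to move substantially when gamma is large (the rich, feature-learning regime).

```python
import torch

torch.manual_seed(0)
d, width, n = 20, 256, 512

# Toy regression task: Gaussian inputs, simple nonlinear teacher (illustrative choice).
X = torch.randn(n, d)
y = torch.tanh(X @ (torch.randn(d, 1) / d**0.5))

def train(gamma, steps=2000, base_lr=0.05):
    """Two-layer ReLU net whose centered output is divided by gamma.
    Returns the final training loss and the relative movement of the first-layer weights."""
    W1 = (torch.randn(d, width) / d**0.5).requires_grad_()
    W2 = torch.randn(width, 1).requires_grad_()
    W1_init = W1.detach().clone()

    def h(inputs):
        return torch.relu(inputs @ W1) @ W2 / width**0.5

    h0 = h(X).detach()           # subtract the function at initialization
    lr = base_lr * gamma**2      # rescale the step size so all gammas train at comparable speed
    for _ in range(steps):
        pred = (h(X) - h0) / gamma
        loss = ((pred - y) ** 2).mean()
        g1, g2 = torch.autograd.grad(loss, [W1, W2])
        with torch.no_grad():
            W1 -= lr * g1
            W2 -= lr * g2
    movement = (W1.detach() - W1_init).norm() / W1_init.norm()
    return loss.item(), movement.item()

for gamma in [0.01, 1.0, 10.0]:
    loss, movement = train(gamma)
    print(f"gamma = {gamma:>5}: final loss {loss:.4f}, relative first-layer movement {movement:.3f}")
```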
Committee: Cengiz Pehlevan, Haim Sompolinsky, and Michael Brenner.
(2024) Accepted
(2024) In Submission [arXiv]
(2024) In Submission [arXiv]
(2024) In Submission [arXiv]
(2024) In Submission [arXiv]
(2024) ICML 2024 [arXiv]
(2023) NeurIPS 2023 [OpenReview] [arXiv]
(2022) ICLR 2023 [OpenReview] [arXiv]
(2021) ICLR 2022 [OpenReview] [arXiv]
(2021) eNeuro [Journal Link] [bioRxiv] [Code Repository]
(2022) Journal of High Energy Physics [Journal Link] [arXiv]
(2021) Physical Review D [Journal Link] [arXiv] [IAS Talk]
(2021) Journal of High Energy Physics [Journal Link] [arXiv] [Princeton Talk]
(2018) Journal of High Energy Physics [Journal Link] [arXiv] [Code Repository]
(2018) Yale Senior Thesis [PDF] [Presentation Slides]
(2017) [arXiv] [PDF] [Code Repository]
(2017) Physical Review A [Journal Link] [PDF]
(Fall 2015) [PDF]
(2021-2023) [Bouchaud & Potters: Random Matrix Theory] [MacKay: Information Theory, Inference, and Learning Algorithms Part 0, Part 1, Part 2] [HST: Elements of Statistical Learning] [Engel and Van Den Broeck: Statistical Mechanics of Learning] [Kardar: Statistical Physics of Fields] [Mezard and Montanari: Information, Physics, and Computation Chapter 1, Chapter 4, Chapter 5, Chapter 8]
(2019-2020) [PDF]
(Spring 2020) [PDF]
(Fall 2018) [PDF]
(Fall 2018) [PDF] [Chapter 1: Black Holes and the Holographic Principle] [Chapter 2: Matrices and Strings] [Chapter 3: Holographic Duality]
(Spring 2018) [Lecture 1] [Lecture 2]
(Fall 2017) [PDF]
(Spring, Fall 2017) [Full Notes] [Part 1: Categorical Harmonic Analysis] [Part 2: Moduli Space of Bundles] [Part 3: Geometric Satake] [Part 4: Geometric Representation Theory] [Part 5: Intro to Derived Algebraic Geometry] [Part 6: Back to Basics] [Part 7: Singular Support] [Part 8: Revisiting D(Bun_G)] [Part 9: How to study D(Bun_G)] [Part 10: Factorization Structures] [Part 11: Fundamental Local Equivalence]
(Spring, Fall 2017) [Spring Talk] [Fall Talk]
(Fall 2016) [PDF]
(Fall 2016) [Review notes on Fiber Bundles] [Talk 1] [Talk 2]
(Spring 2016) [PDF]
(Spring 2016) [PDF]
(Summer 2024) [Final Paper]
(Summer 2020) [Final Paper]
(Summer 2019) [Final Presentation]
(Summer 2016) [Lecture]
(Fall 2013) [PDF]
(Fall 2013) [PDF]