Abstract:
Scaling up LLM training requires a reliable understanding of how hyperparameters and data composition shape learning dynamics. This talk presents two key insights. First, we introduce a multi-power law that accurately predicts the entire loss curve in LLM pretraining. Our law explicitly quantifies the effect of learning rate schedules and is precise enough to identify an optimal schedule that outperforms cosine decay. Second, we show that when LLMs are pretrained on data mixtures, they do not always acquire knowledge from high-quality, knowledge-dense data in a smooth, linear fashion. Instead, knowledge acquisition can exhibit a predictable phase transition, where small shifts in model size or data mixing ratios trigger abrupt changes. Together, these results suggest potential avenues for improving the predictability of LLM training, especially in hyperparameter scheduling and data mixing.
Speaker Bio:
Kaifeng Lyu is a postdoc at the Simons Institute for the Theory of Computing and will be joining the Institute for Interdisciplinary Information Sciences (IIIS) at Tsinghua University as an Assistant Professor this June. He obtained his PhD in Computer Science from Princeton University, advised by Sanjeev Arora. His primary research interests include machine learning theory, AI safety/alignment, and optimization.
Related papers:
A Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules
https://arxiv.org/abs/2503.12811
Data Mixing can Induce Phase Transitions in Knowledge Acquisition
arXiv preprint coming soon