Abstract:
Scaling up LLM training requires a reliable understanding of how hyperparameters and data composition shape learning dynamics. This talk presents two key insights. First, we introduce a multi-power law that accurately predicts the entire loss curve in LLM pretraining. Our law explicitly quantifies the effect of learning rate schedules and is precise enough to identify an optimal schedule that outperforms cosine decay. Second, we show that when LLMs are pretrained on data mixtures, they do not always acquire knowledge from high-quality, knowledge-dense data in a smooth, linear fashion. Instead, knowledge acquisition can exhibit a predictable phase transition, where small shifts in model size or data mixing ratios trigger abrupt changes. Together, these results suggest potential avenues for improving the predictability of LLM training, especially in hyperparameter scheduling and data mixing.
Speaker Bio:
Kaifeng Lyu is a postdoc at the Simons Institute for the Theory of Computing and will be joining the Institute for Interdisciplinary Information Sciences (IIIS) at Tsinghua University as an Assistant Professor this June. He obtained his PhD in Computer Science from Princeton University, advised by Sanjeev Arora. His primary research interests include machine learning theory, AI safety/alignment, and optimization.
Related papers:
A Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules
https://arxiv.org/abs/2503.12811
Data Mixing can Induce Phase Transitions in Knowledge Acquisition
arXiv preprint coming soon