No.15: Large Context Window with Blockwise Parallel Transformers & Ring Attention

2023-11-10 11:20:00
Speaker: Hao Liu, University of California, Berkeley

Bio: Hao Liu is a PhD student at UC Berkeley, advised by Pieter Abbeel. His research focuses on agents that autonomously interact with their environment and learn continually from experience, and on building scalable architectures for them.

Abstract: Transformers are widely used as the backbone of state-of-the-art AI models, showing exceptional performance across a wide range of AI applications, from ChatGPT to AlphaFold. However, the memory demands of Transformers limit their ability to handle long sequences, creating challenges for tasks involving long sequences or long-term dependencies. In this talk, I will present some of our recent work on reducing memory cost and enabling long sequences. The first paper, Blockwise Parallel Transformer for Large Context Models, reorganizes the computation of Transformers without making approximations: it computes the feedforward and self-attention layers block by block, reducing memory cost by four times and allowing training on sequences four times longer. The subsequent work, Ring Attention with Blockwise Transformers for Near-Infinite Context, builds on this and reorganizes the computation across devices: the query block stays on each device while the key and value blocks circulate in a ring of devices. This allows computing Transformers over much longer sequences, with the maximum sequence length scaling linearly with the number of devices. For example, if one could train a GPT that processes 16K words on 256 GPUs, these two works enable training a GPT that processes up to 16M words, without overhead or approximations.
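To make the two ideas in the abstract concrete, here is a minimal JAX sketch, not the authors' released implementation; the function names (blockwise_attention, ring_step), the block size, the device count, and the axis name are illustrative assumptions. The first function computes exact attention block by block with a running softmax, which is the memory-saving core of the Blockwise Parallel Transformer; the second shows the ring communication step assumed in Ring Attention, where each device keeps its query block and passes its key/value block to the next device.

import jax
import jax.numpy as jnp

def blockwise_attention(q, k, v, block_size):
    # Exact softmax attention computed block by block: only one query block and
    # one key/value block are materialized at a time, so peak activation memory
    # scales with block_size rather than with the full sequence length.
    seq_len, head_dim = q.shape
    num_blocks = seq_len // block_size
    outputs = []
    for i in range(num_blocks):
        q_blk = q[i * block_size:(i + 1) * block_size]
        acc = jnp.zeros((block_size, head_dim))      # running weighted sum of values
        row_max = jnp.full((block_size,), -jnp.inf)  # running max for stable softmax
        row_sum = jnp.zeros((block_size,))           # running softmax normalizer
        for j in range(num_blocks):
            k_blk = k[j * block_size:(j + 1) * block_size]
            v_blk = v[j * block_size:(j + 1) * block_size]
            scores = q_blk @ k_blk.T / jnp.sqrt(head_dim)
            new_max = jnp.maximum(row_max, scores.max(axis=-1))
            scale = jnp.exp(row_max - new_max)       # rescale earlier contributions
            probs = jnp.exp(scores - new_max[:, None])
            acc = acc * scale[:, None] + probs @ v_blk
            row_sum = row_sum * scale + probs.sum(axis=-1)
            row_max = new_max
        outputs.append(acc / row_sum[:, None])
    return jnp.concatenate(outputs, axis=0)

def ring_step(kv_block, num_devices, axis_name="ring"):
    # Rotate key/value blocks one step around the device ring (to be called
    # inside pmap/shard_map over axis_name): device i's block moves to
    # device (i + 1) % num_devices.
    perm = [(i, (i + 1) % num_devices) for i in range(num_devices)]
    return jax.lax.ppermute(kv_block, axis_name, perm)

Under these assumptions, a full Ring Attention layer would have each device run the inner key/value loop of blockwise_attention on whichever block it currently holds and call ring_step num_devices - 1 times, so that fetching the next key/value block can be overlapped with attention compute on the current one.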
https://dlo-seminar.github.io/