Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference

Authors: Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti

Transformers have emerged as the backbone of large language models (LLMs). However, generation remains inefficient due to the need to store in memory a cache of key-value representations for past tokens, whose size scales linearly with the input sequence length and batch size. As a solution, we propose Dynamic Memory Compression (DMC), a method for online key-value cache compression at inference time. Most importantly, the model learns to apply different compression ratios in different heads and layers. We retrofit pre-trained LLMs such as Llama 2 (7B, 13B and 70B) into DMC Transformers, achieving up to 7x throughput increase during auto-regressive inference on an NVIDIA H100 GPU. DMC is applied via continued pre-training on a negligible percentage of the original data without adding any extra parameters. DMC preserves the original downstream performance with up to 4x cache compression, outperforming up-trained grouped-query attention (GQA) and key-value eviction policies (H$_2$O, TOVA). GQA and DMC can even be combined to obtain compounded gains. Hence, DMC can serve as a drop-in replacement for KV caching in existing LLMs to fit longer contexts and larger batches within any given memory budget.
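To make the scaling argument in the abstract concrete, here is a minimal back-of-the-envelope sketch of KV cache size in Python. It is not taken from the paper. The function names (`kv_cache_bytes`, `max_batch_size`) are hypothetical, and the single uniform `compression_ratio` is a simplifying assumption: DMC learns different ratios per head and layer, so this stands in for an average. The Llama 2 7B shape used in the example (32 layers, 32 key-value heads, head dimension 128) and 16-bit storage are likewise assumptions of the sketch.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2, compression_ratio=1.0):
    """Estimate KV cache size in bytes.

    The factor of 2 accounts for storing both keys and values.
    compression_ratio is an assumed average ratio (1.0 = no compression).
    """
    if compression_ratio < 1.0:
        raise ValueError("compression_ratio must be >= 1.0")
    full = (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)
    return full / compression_ratio


def max_batch_size(budget_bytes, num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, compression_ratio=1.0):
    """Largest batch whose KV cache fits in budget_bytes (weights excluded)."""
    per_sequence = kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                                  batch_size=1, bytes_per_elem=bytes_per_elem,
                                  compression_ratio=compression_ratio)
    return int(budget_bytes // per_sequence)


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Assumed Llama 2 7B shape: 32 layers, 32 KV heads, head_dim 128, fp16.
    shape = dict(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
    for ratio in (1.0, 4.0):
        size = kv_cache_bytes(batch_size=1, compression_ratio=ratio, **shape)
        batch = max_batch_size(16 * GIB, compression_ratio=ratio, **shape)
        print(f"ratio {ratio:g}x: {size / GIB:.2f} GiB per sequence, "
              f"max batch {batch} in a 16 GiB cache budget")
```

Under these assumptions, a 4096-token sequence needs about 2 GiB of cache uncompressed and about 0.5 GiB at 4x compression, so the same memory budget holds roughly four times as many sequences. The estimate is linear in both `seq_len` and `batch_size`, which is the scaling the abstract describes.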

Subject: Computation and Language

Published: 2024-03-14 17:59:26 UTC

arXiv: 2403.09636
