The dramatic improvements in Large Language Models (LLMs) come at the cost of increased computational resources for inference. Recent studies reduce the computational costs of LLMs by increasing their activation sparsity but suffer from significant performance degradation on downstream tasks. In this work, we introduce a new framework for sparsifying the activations of LLMs and reducing inference costs, dubbed $\underline{C}$ontextually $\underline{A}$ware $\underline{T}$hresholding for $\underline{S}$parsity (CATS). CATS is a relatively simple algorithm that is easy to implement and highly effective. At the heart of our framework is a new non-linear activation function. We demonstrate that CATS can be applied to various models, including Mistral-7B and Llama2-7B \& 13B, and outperforms existing sparsification techniques across multiple tasks. More precisely, CATS-based models retain $\sim$99\% of their base models' downstream task performance at activation sparsity levels of 50\%, even without any fine-tuning. Moreover, with fine-tuning that targets only 1\% of the parameters, CATS-based models not only converge faster but also achieve better task performance than competing techniques. Finally, we develop a custom GPU kernel for the efficient implementation of CATS that translates its activation sparsity into real wall-clock speedups. Our custom kernel implementation of CATS results in a $\sim$15\% improvement in wall-clock inference latency of token generation. We release our code, experiments, and datasets at https://github.com/ScalingIntelligence/CATS.
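The abstract does not spell out the activation function, so the following is only a minimal illustrative sketch of the general idea of thresholding activations to reach a target sparsity level. It assumes (this is an assumption, not taken from the text above) that a SiLU output is zeroed whenever its magnitude falls below a cutoff, and that the cutoff is calibrated as a percentile of activation magnitudes on sample inputs. The names `calibrate_threshold` and `thresholded_activation` are hypothetical and are not drawn from the released code; the custom GPU kernel is not reproduced here.

```python
import math
import random


def silu(x: float) -> float:
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def calibrate_threshold(activations, sparsity):
    """Hypothetical helper: return a cutoff t such that roughly a
    `sparsity` fraction of the activation magnitudes lie below t."""
    mags = sorted(abs(a) for a in activations)
    idx = min(int(sparsity * len(mags)), len(mags) - 1)
    return mags[idx]


def thresholded_activation(x: float, t: float) -> float:
    """Assumed form: apply SiLU, then zero any output with magnitude below t."""
    y = silu(x)
    return y if abs(y) >= t else 0.0


# Calibrate the cutoff on synthetic pre-activations (stand-in for real data).
random.seed(0)
calib = [silu(random.gauss(0.0, 1.0)) for _ in range(10_000)]
t = calibrate_threshold(calib, sparsity=0.5)

# Apply the thresholded activation to fresh inputs and measure sparsity.
fresh = [thresholded_activation(random.gauss(0.0, 1.0), t) for _ in range(10_000)]
observed_sparsity = sum(1 for y in fresh if y == 0.0) / len(fresh)
print(f"threshold={t:.4f}, observed sparsity={observed_sparsity:.3f}")
```

In this sketch, the zeros by themselves save no computation. A latency gain would only appear if the downstream matrix multiplications skip the rows or columns associated with zeroed activations, which is the role the abstract assigns to the custom kernel.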