Performance

#1 QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving [PDF²³] [Copy] [Kimi²⁸]

Authors: Yujun Lin ; Haotian Tang ; Shang Yang ; Zhekai Zhang ; Guangxuan Xiao ; Chuang Gan ; Song Han

Quantization can accelerate large language model (LLM) inference. Going beyond INT8 quantization, the research community is actively exploring even lower precision, such as INT4. Nonetheless, state-of-the-art INT4 quantization techniques only accelerate low-batch, edge LLM inference, failing to deliver performance gains in large-batch, cloud-based LLM serving. We uncover a critical issue: existing INT4 quantization methods suffer from significant runtime overhead (20-90%) when dequantizing either weights or partial sums on GPUs. To address this challenge, we introduce QoQ, a W4A8KV4 quantization algorithm with 4-bit weight, 8-bit activation, and 4-bit KV cache. QoQ stands for quattuor-octo-quattuor, which represents 4-8-4 in Latin. QoQ is implemented by the QServe inference library that achieves measured speedup. The key insight driving QServe is that the efficiency of LLM serving on GPUs is critically influenced by operations on low-throughput CUDA cores. Building upon this insight, in QoQ algorithm, we introduce progressive quantization that can allow low dequantization overhead in W4A8 GEMM. Additionally, we develop SmoothAttention to effectively mitigate the accuracy degradation incurred by 4-bit KV quantization. In the QServe system, we perform compute-aware weight reordering and take advantage of register-level parallelism to reduce dequantization latency. We also make fused attention memory-bound, harnessing the performance gain brought by KV4 quantization. As a result, QServe improves the maximum achievable serving throughput of Llama-3-8B by 1.2x on A100, 1.4x on L40S; and Qwen1.5-72B by 2.4x on A100, 3.5x on L40S, compared to TensorRT-LLM. Remarkably, QServe on L40S GPU can achieve even higher throughput than TensorRT-LLM on A100. Thus, QServe effectively reduces the dollar cost of LLM serving by 3x. Code is available at https://github.com/mit-han-lab/qserve.

#2 QR factorization of ill-conditioned tall-and-skinny matrices on distributed-memory systems [PDF] [Copy] [Kimi]

Authors: Nenad Mijić ; Abhiram Kaushik ; Davor Davidović

In this paper we present a novel algorithm developed for computing the QR factorisation of extremely ill-conditioned tall-and-skinny matrices on distributed memory systems. The algorithm is based on the communication-avoiding CholeskyQR2 algorithm and its block Gram-Schmidt variant. The latter improves the numerical stability of the CholeskyQR2 algorithm and significantly reduces the loss of orthogonality even for matrices with condition numbers up to $10^{15}$. Currently, there is no distributed GPU version of this algorithm available in the literature which prevents the application of this method to very large matrices. In our work we provide a distributed implementation of this algorithm and also introduce a modified version that improves the performance, especially in the case of extremely ill-conditioned matrices. The main innovation of our approach lies in the interleaving of the CholeskyQR steps with the Gram-Schmidt orthogonalisation, which ensures that update steps are performed with fully orthogonalised panels. The obtained orthogonality and numerical stability of our modified algorithm is equivalent to CholeskyQR2 with Gram-Schmidt and other state-of-the-art methods. Weak scaling tests performed with our test matrices show significant performance improvements. In particular, our algorithm outperforms state-of-the-art Householder-based QR factorisation algorithms available in ScaLAPACK by a factor of $6$ on CPU-only systems and up to $80\times$ on GPU-based systems with distributed memory.

#3 Analysis of Markovian Arrivals and Service with Applications to Intermittent Overload [PDF] [Copy] [Kimi]

Authors: Isaac Grosof ; Yige Hong ; Mor Harchol-Balter

Almost all queueing analysis assumes i.i.d. arrivals and service. In reality, arrival and service rates fluctuate over time. In particular, it is common for real systems to intermittently experience overload, where the arrival rate temporarily exceeds the service rate, which an i.i.d. model cannot capture. We consider the MAMS system, where the arrival and service rates each vary according to an arbitrary finite-state Markov chain, allowing intermittent overload to be modeled. We derive the first explicit characterization of mean queue length in the MAMS system, with explicit bounds for all arrival and service chains at all loads. Our bounds are tight in heavy traffic. We prove even stronger bounds for the important special case of two-level arrivals with intermittent overload. Our key contribution is an extension to the drift method, based on the novel concepts of relative arrivals and relative completions. These quantities allow us to tractably capture the transient correlational effect of the arrival and service processes on the mean queue length.

#1 QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving [PDF23] [Copy] [Kimi28]

#2 QR factorization of ill-conditioned tall-and-skinny matrices on distributed-memory systems [PDF] [Copy] [Kimi]

#3 Analysis of Markovian Arrivals and Service with Applications to Intermittent Overload [PDF] [Copy] [Kimi]

#1 QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving [PDF²³] [Copy] [Kimi²⁸]