LLMQ: Efficient Lower-Precision Pretraining for Consumer GPUs

Authors: Erik Schultheis, Dan Alistarh

We present LLMQ, an end-to-end CUDA/C++ implementation for training medium-sized language models, e.g., 3B to 32B parameters, on affordable commodity GPUs. These devices are characterized by limited memory and slow communication compared to datacentre-grade GPUs. Consequently, we showcase a range of optimizations that target these bottlenecks, including activation checkpointing, offloading, and copy-engine-based collectives. LLMQ is able to train or fine-tune a 7B model on a single 16GB mid-range gaming card, or a 32B model on a workstation equipped with 4 RTX 4090s. This is achieved while executing a standard 8-bit training pipeline, without additional algorithmic approximations, and while maintaining FLOP utilization of around 50%. The efficiency of LLMQ rivals that of production-scale systems on much more expensive cloud-grade GPUs.
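To make the memory bottleneck concrete, the following back-of-the-envelope sketch estimates the persistent training state of a 7B-parameter model. The byte counts are illustrative assumptions, not figures from the paper: 1 byte per weight (8-bit), 2 bytes per gradient, and 8 bytes per parameter for two 32-bit Adam moments. The helper names `training_state_gib` and `fits_on_device` are hypothetical, and activations are ignored entirely.

```python
GIB = 1024 ** 3

def training_state_gib(n_params, weight_bytes=1, grad_bytes=2, optim_bytes=8):
    """Estimate persistent training state in GiB, excluding activations.

    Byte counts are illustrative assumptions: 8-bit weights, 16-bit gradients,
    and two 32-bit Adam moments per parameter. LLMQ's real layout may differ.
    """
    return {
        "weights": n_params * weight_bytes / GIB,
        "grads": n_params * grad_bytes / GIB,
        "optimizer": n_params * optim_bytes / GIB,
    }

def fits_on_device(state, device_gib):
    """Return True if the summed state fits in the given device memory."""
    return sum(state.values()) <= device_gib

state = training_state_gib(7e9)
for name, size in state.items():
    print(f"{name:>9}: {size:6.1f} GiB")
print("fits in 16 GiB without offloading:", fits_on_device(state, 16))
```

Under these assumptions the optimizer state alone is several times larger than a 16GB card, which suggests why offloading and activation checkpointing are among the optimizations the abstract lists.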

Subjects: Distributed, Parallel, and Cluster Computing; Machine Learning

Publish: 2025-12-17 10:51:45 UTC

2512.15306
