CSV-Decode: Certifiable Sub-Vocabulary Decoding for Efficient Large Language Model Inference

Authors: Dong Liu, Yanxuan Yu, Ben Lengerich

Large language models face significant computational bottlenecks during inference because the output layer must be computed over a large vocabulary. We present CSV-Decode, a novel approach that uses geometric upper bounds to construct a small sub-vocabulary for each decoding step, enabling efficient sparse computation while maintaining dual correctness guarantees: exact top-$k$ certification and $\varepsilon$-certified softmax approximations. Our method clusters vocabulary embeddings offline and uses centroid-plus-radius bounds to identify which tokens can be safely omitted from computation. We provide a complete system implementation with sparse GEMV kernels, multi-GPU sharding, and CUDA Graph optimization. Experimental results demonstrate significant speedup over full-vocabulary decoding while maintaining distributional guarantees and low fallback rates. Our code is available at https://github.com/FastLM/CSV-Decode.
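
The centroid-plus-radius idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`build_clusters`, `certified_top_k`), the fixed cluster assignment, and the stopping rule are assumptions made for illustration. It relies only on the Cauchy-Schwarz inequality: for a token embedding $w$ in a cluster with centroid $\mu$ and radius $r$, the logit satisfies $w \cdot h \le \mu \cdot h + r\,\lVert h \rVert$. Clusters are scored in descending order of this bound, and scoring stops once the $k$-th best exact logit is at least the bound of every remaining cluster.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_clusters(embeddings, assignment, num_clusters):
    """Offline step: centroid, radius, and member token ids per cluster.
    The assignment is taken as given (hypothetical; any clustering works)."""
    dim = len(embeddings[0])
    members = [[] for _ in range(num_clusters)]
    for tok, c in enumerate(assignment):
        members[c].append(tok)
    clusters = []
    for toks in members:
        if not toks:
            continue
        centroid = [sum(embeddings[t][d] for t in toks) / len(toks)
                    for d in range(dim)]
        radius = max(math.dist(embeddings[t], centroid) for t in toks)
        clusters.append((centroid, radius, toks))
    return clusters

def certified_top_k(embeddings, clusters, h, k):
    """Return (top-k token ids, number of exact logits computed)."""
    h_norm = math.sqrt(dot(h, h))
    # Upper bound on every logit in a cluster: centroid.h + radius * ||h||
    bounds = sorted(((dot(c, h) + r * h_norm, toks) for c, r, toks in clusters),
                    key=lambda b: b[0], reverse=True)
    scored = []  # (exact logit, token id), kept sorted in descending order
    for upper, toks in bounds:
        # Certified: no unscored token can beat the current k-th best logit.
        if len(scored) >= k and scored[k - 1][0] >= upper:
            break
        scored.extend((dot(embeddings[t], h), t) for t in toks)
        scored.sort(reverse=True)
    return [t for _, t in scored[:k]], len(scored)

# Synthetic demo data (assumption): 8 tight clusters of 50 tokens, dimension 16.
rng = random.Random(0)
centers = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(8)]
assignment = [tok % 8 for tok in range(400)]
embeddings = [[x + rng.gauss(0, 0.05) for x in centers[c]] for c in assignment]
clusters = build_clusters(embeddings, assignment, 8)
h = [rng.gauss(0, 1) for _ in range(16)]
top, computed = certified_top_k(embeddings, clusters, h, k=5)
print(top, computed)
```

The second return value counts how many exact logits were evaluated, which gives a rough sense of how much of the vocabulary the bound lets a decoder skip. When the bounds are loose, the loop simply scores every cluster and reduces to full-vocabulary decoding.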

Subjects: Computation and Language, Artificial Intelligence

Published: 2025-11-16 14:02:41 UTC

2511.21702
