TLX: Hardware-Native, Evolvable MIMW GPU Compiler for Large-scale Production Environments

Authors: Yue Guan, Hongtao Yu, Peng Chen, Daohang Shi, Karthik Manivannan, Nicholas J Riasanovsky, Manman Ren, Lei Wang, Shane Nay, Partha Kanuparthy, Zaifeng Pan, Zhengding Hu, Yufei Ding

Modern GPUs increasingly rely on specialized hardware units and asynchronous coordination mechanisms, so performance depends on orchestrating data movement, tensor-core computation, and synchronization rather than on exposing more thread-level parallelism. This creates a programming-model tension: if too much execution structure is hidden, the compiler must catch up to new hardware mechanisms; if too much is exposed, the burden of orchestration falls back onto the programmer. We present TLX (Triton Low-level Language Extensions), built around MIMW (Multi-Instruction, Multi-Warp), which expresses orchestration at warp-group granularity while preserving Triton's productive blocked programming model for regular computation. TLX realizes this idea as an embedded extension to Triton, exposing explicit interfaces for multi-warp execution, local-memory orchestration, asynchronous operations, and cluster-aware control. Our evaluation shows that TLX supports substantial customization with limited development effort while remaining competitive with state-of-the-art implementations. TLX-authored kernels have been deployed in large-scale training and inference production systems. Our code is open-sourced at https://github.com/facebookexperimental/triton.

Subject: Hardware Architecture

Published: 2026-05-11 17:46:01 UTC

2605.10905
