Universal Model Routing for Efficient LLM Inference

Authors: Wittawat Jitkrittum, Harikrishna Narasimhan, Ankit Singh Rawat, Jeevesh Juneja, Congchao Wang, Zifeng Wang, Alec Go, Chen-Yu Lee, Pradeep Shenoy, Rina Panigrahy, Aditya Krishna Menon, Sanjiv Kumar

Model routing is a simple technique for reducing the inference cost of large language models (LLMs): one maintains a pool of candidate LLMs and learns to route each prompt to the smallest feasible LLM. Existing works focus on learning a router for a fixed pool of LLMs. In this paper, we consider the problem of dynamic routing, where new, previously unobserved LLMs are available at test time. We propose UniRoute, a new approach to this problem that represents each LLM as a feature vector derived from its predictions on a set of representative prompts. Based on this, we detail two effective instantiations of UniRoute, relying on cluster-based routing and a learned cluster map respectively. We show that these are estimates of a theoretically optimal routing rule, and quantify their errors via an excess risk bound. Experiments on a range of public benchmarks show the effectiveness of UniRoute in routing amongst more than 30 unseen LLMs.
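The cluster-based idea in the abstract can be illustrated with a small sketch. This is one plausible reading of the abstract, not the authors' implementation: it assumes prompt embeddings and cluster centroids are already available, represents each LLM by its per-cluster error rate on the representative prompts, and routes a new prompt to the LLM with the lowest estimated error plus a cost penalty. The function names, the cost-weighted score, and the trade-off parameter `lam` are all hypothetical.

```python
import numpy as np

def assign_clusters(embeddings, centroids):
    """Return the index of the nearest centroid for each embedding."""
    # Squared Euclidean distances, shape (n_prompts, n_clusters).
    dists = ((embeddings[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def llm_feature_vector(correct, clusters, n_clusters):
    """Per-cluster error rate of one LLM on the representative prompts.

    `correct` holds 1.0 where the LLM answered a prompt correctly, else 0.0.
    Empty clusters get a pessimistic error of 1.0 (an assumption of this sketch).
    """
    errors = np.ones(n_clusters)
    for k in range(n_clusters):
        mask = clusters == k
        if mask.any():
            errors[k] = 1.0 - correct[mask].mean()
    return errors

def route(prompt_embedding, centroids, llm_features, costs, lam):
    """Pick the LLM minimizing estimated error + lam * cost for this prompt."""
    k = assign_clusters(prompt_embedding[None, :], centroids)[0]
    scores = llm_features[:, k] + lam * costs
    return int(scores.argmin())

# Toy data (hypothetical): two clusters, four representative prompts.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
val_embeddings = np.array([[0.0, 1.0], [1.0, 0.0], [9.0, 10.0], [10.0, 9.0]])
clusters = assign_clusters(val_embeddings, centroids)

# A new LLM only needs to be run on the representative prompts
# to obtain its feature vector; the router itself is not retrained.
small_llm = llm_feature_vector(np.array([1.0, 1.0, 0.0, 0.0]), clusters, 2)
large_llm = llm_feature_vector(np.array([1.0, 1.0, 1.0, 1.0]), clusters, 2)
llm_features = np.stack([small_llm, large_llm])
costs = np.array([1.0, 10.0])  # hypothetical per-query costs

easy_choice = route(np.array([0.5, 0.5]), centroids, llm_features, costs, lam=0.01)
hard_choice = route(np.array([9.5, 9.5]), centroids, llm_features, costs, lam=0.01)
```

In this toy setup the prompt near the first centroid goes to the cheaper model, since both models have the same estimated error there, while the prompt near the second centroid goes to the larger model, since the small model fails on that cluster.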

Subjects: Computation and Language, Machine Learning

Publish: 2025-02-12 20:30:28 UTC

arXiv: 2502.08773
