2508.11291

Total: 1

#1 Dynamic Quality-Latency Aware Routing for LLM Inference in Wireless Edge-Device Networks [PDF] [Copy] [Kimi1] [REL]

Authors: Rui Bao, Nan Xue, Yaping Sun, Zhiyong Chen

The integration of wireless communications and Large Language Models (LLMs) is poised to unlock ubiquitous intelligent services, yet deploying them in wireless edge-device collaborative environments presents a critical trade-off between inference quality and end-to-end latency. A fundamental mismatch exists between task complexity and resource allocation: offloading simple queries invites prohibitive latency, while on-device models lack the capacity for demanding computations. To address this challenge, we propose a dynamic, quality-latency aware routing framework that orchestrates inference between a lightweight model on the mobile device and a powerful model on the edge server. Our framework employs two distinct cost models: for single-turn queries, it fuses a BERT-predicted semantic score with communication and computation overheads; for multi-turn dialogues, it further quantifies context-aware costs arising from model switching and KV-cache management. While maintaining full inference quality, extensive experiments demonstrate that our framework cuts average response latency by 5-15% and reduces large model invocations by 10-20% against competitive baselines on MMLU, GSM8K, and MT-Bench-101 benchmarks.

Subjects: Information Theory , Artificial Intelligence , Machine Learning

Publish: 2025-08-15 07:55:05 UTC