FLYING SERVING: On-the-Fly Parallelism Switching for Large Language Model Serving

Authors: Shouwei Gao, Junqi Yin, Feiyi Wang, Wenqian Dong

Production LLM serving must simultaneously deliver high throughput, low latency, and sufficient context capacity under non-stationary traffic and mixed request requirements. Data parallelism (DP) maximizes throughput by running independent replicas, while tensor parallelism (TP) reduces per-request latency and pools memory for long-context inference. However, existing serving stacks typically commit to a static parallelism configuration at deployment; adapting to bursts, priorities, or long-context requests is often disruptive and slow. We present Flying Serving, a vLLM-based system that enables online DP-TP switching without restarting engine workers. Flying Serving makes reconfiguration practical by virtualizing the state that would otherwise force data movement: (i) a zero-copy Model Weights Manager that exposes TP shard views on demand, (ii) a KV Cache Adaptor that preserves request KV state across DP/TP layouts, (iii) an eagerly initialized Communicator Pool to amortize collective setup, and (iv) a deadlock-free scheduler that coordinates safe transitions under execution skew. Across three popular LLMs and realistic serving scenarios, Flying Serving improves performance by up to $4.79\times$ under high load and $3.47\times$ under low load while supporting latency- and memory-driven requests.
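The idea behind exposing TP shard views without moving data can be illustrated with a toy example. The sketch below is not taken from Flying Serving or vLLM; it assumes a weight matrix stored row-major in one flat host buffer and uses Python's built-in `memoryview` to hand each TP rank a slice of that buffer. The helper name `shard_views` and its parameters are hypothetical.

```python
from array import array


def shard_views(weights: memoryview, rows: int, cols: int, tp_degree: int):
    """Return one row-shard view per TP rank (hypothetical helper).

    Slicing a memoryview does not copy the underlying bytes, so every
    returned view aliases the same buffer as `weights`.
    """
    if rows % tp_degree != 0:
        raise ValueError("rows must divide evenly across TP ranks")
    span = (rows // tp_degree) * cols  # elements owned by each rank
    return [weights[r * span:(r + 1) * span] for r in range(tp_degree)]


# Toy 4x2 weight matrix stored row-major in a single flat buffer.
full = array("f", [float(i) for i in range(8)])
buf = memoryview(full)

# DP layout: a single rank sees the whole matrix.
dp_views = shard_views(buf, rows=4, cols=2, tp_degree=1)
# TP=2 layout: two ranks each see half of the rows.
tp_views = shard_views(buf, rows=4, cols=2, tp_degree=2)

# A write to the underlying buffer is visible through every view,
# which shows that switching layouts created no copies.
full[4] = 42.0
```

Under these assumptions, moving between the DP and TP layouts only changes which slices are handed out; the buffer itself is never duplicated. A real system would additionally have to deal with device memory, column-wise shards, and collective communication, none of which this sketch covers.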

Subject: Distributed, Parallel, and Cluster Computing

Publish: 2026-02-26 03:55:51 UTC

2602.22593
