Scaling Enterprise Agent Routing: Degradation, Diagnosis, and Recovery


Authors: Kellen Gillespie, Robyn Perry

Production LLM assistants route user requests to growing libraries of specialized tools, but how does routing accuracy degrade as the catalog scales? We study single-step routing on a 110-agent, 584-tool catalog from a deployed enterprise productivity assistant, evaluating three frontier models at catalog sizes from 10 to 110 agents. Routing F1 on under-specified requests drops 16-23 percentage points (pp) across models. An oracle analysis decomposes the degradation into a retrieval gap (the model cannot surface the right tool) and a confusion gap (even with perfect retrieval, the oracle ceiling drops 10pp). Embedding-based shortlisting recovers +10-11pp F1 at full scale across all three models and two providers. A production annotation study (1,435 human-labeled utterances, three annotators) confirms the recovery on real traffic at +10-17pp, despite 10-15pp lower absolute performance.
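The abstract does not say how the embedding-based shortlisting is implemented. As a minimal sketch of the general idea only: rank catalog entries by similarity to the request, then pass just the top k to the routing model. The `embed` function here is a toy bag-of-words stand-in for a real embedding model, and the agent names, descriptions, and `k` are hypothetical, not taken from the paper.

```python
from __future__ import annotations

import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy bag-of-words vector; an assumption standing in for a real embedding model.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def shortlist(request: str, catalog: dict[str, str], k: int = 3) -> list[str]:
    # Rank agents by similarity of their description to the request, keep the top k.
    query = embed(request)
    ranked = sorted(
        catalog,
        key=lambda name: cosine(query, embed(catalog[name])),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical four-agent catalog (illustrative only).
catalog = {
    "calendar_agent": "schedule meetings and manage calendar events",
    "email_agent": "send read and search email messages",
    "expense_agent": "submit and approve expense reports",
    "travel_agent": "book flights hotels and travel itineraries",
}

top = shortlist("schedule a meeting on my calendar", catalog, k=2)
print(top)
```

Only the shortlisted agents would then be shown to the routing model, so it chooses among k candidates rather than the full catalog.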

Subjects: Computation and Language, Artificial Intelligence

Published: 2026-06-16 04:55:06 UTC

2606.17519
