2025.acl-long.1305@ACL


#1 StitchLLM: Serving LLMs, One Block at a Time

Authors: Bodun Hu, Shuozhe Li, Saurabh Agarwal, Myungjin Lee, Akshay Jajoo, Jiamin Li, Le Xu, Geon-Woo Kim, Donghyun Kim, Hong Xu, Amy Zhang, Aditya Akella

The rapid evolution of large language models (LLMs) has revolutionized natural language processing (NLP) tasks such as text generation, translation, and comprehension. However, the increasing computational demands and inference costs of these models present significant challenges. This study investigates the dynamic and efficient reuse of pre-trained weights from open-source LLMs of varying parameter sizes to balance computational efficiency and task performance. Drawing inspiration from the dual-process theory of human cognition, we introduce StitchLLM: a dynamic model routing framework that employs a powerful bottom model to process all queries and a lightweight routing mechanism to allocate additional computation only where it is needed. The framework leverages a trainable stitching layer to seamlessly integrate decoder layers across different LLMs, optimizing efficiency while maintaining performance. Experimental results demonstrate that StitchLLM improves system throughput while minimizing performance degradation, offering a flexible solution for deploying LLMs in resource-constrained settings.
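To make the stitching idea concrete, below is a minimal PyTorch sketch of the two components the abstract names: a trainable stitching layer that projects hidden states from one model's decoder stack into the hidden dimension expected by another model's decoder layers, and a lightweight per-query router. The class names, dimensions, and pooling choice are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class StitchingLayer(nn.Module):
    """Hypothetical trainable adapter mapping hidden states from a bottom
    model's decoder stack into the hidden size of another model's layers."""

    def __init__(self, src_dim: int, dst_dim: int):
        super().__init__()
        self.proj = nn.Linear(src_dim, dst_dim)
        self.norm = nn.LayerNorm(dst_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, src_dim) from the bottom model
        return self.norm(self.proj(hidden_states))


class QueryRouter(nn.Module):
    """Lightweight router scoring how much extra computation a query needs,
    based on the bottom model's pooled hidden state (assumed design)."""

    def __init__(self, dim: int, num_options: int):
        super().__init__()
        self.scorer = nn.Linear(dim, num_options)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        pooled = hidden_states.mean(dim=1)           # (batch, dim)
        return self.scorer(pooled).argmax(dim=-1)    # chosen upper stack per query


# Toy usage with made-up dimensions: each query either stops early or
# continues through a larger model's upper decoder layers via the stitch.
bottom_hidden = torch.randn(4, 128, 2048)            # bottom model output
router = QueryRouter(dim=2048, num_options=2)
stitch = StitchingLayer(src_dim=2048, dst_dim=4096)

choice = router(bottom_hidden)                        # per-query routing decision
upper_input = stitch(bottom_hidden)                   # input for larger model's layers
print(choice.shape, upper_input.shape)
```

In this sketch the stitching layer is just a linear projection with normalization; the key point is that it is the only trained component bridging frozen decoder blocks from different pre-trained models, which is what allows blocks of varying sizes to be composed per query.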

Subject: ACL.2025 - Long Papers