Operating Systems

#1 vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention

Authors: Ramya Prabhu ; Ajay Nayak ; Jayashree Mohan ; Ramachandran Ramjee ; Ashish Panwar

Efficient use of GPU memory is essential for high-throughput LLM inference. Prior systems reserved memory for the KV-cache ahead of time, resulting in wasted capacity due to internal fragmentation. Inspired by OS-based virtual memory systems, vLLM proposed PagedAttention to enable dynamic memory allocation for the KV-cache. This approach eliminates fragmentation, enabling high-throughput LLM serving with larger batch sizes. However, to be able to allocate physical memory dynamically, PagedAttention changes the layout of the KV-cache from contiguous virtual memory to non-contiguous virtual memory. This change requires attention kernels to be rewritten to support paging, and requires the serving framework to implement a memory manager. Thus, the PagedAttention model leads to software complexity, portability issues, redundancy and inefficiency. In this paper, we propose vAttention for dynamic KV-cache memory management. In contrast to PagedAttention, vAttention retains the KV-cache in contiguous virtual memory and leverages existing low-level system support for demand paging to enable on-demand physical memory allocation. Thus, vAttention unburdens the attention kernel developer from having to explicitly support paging and avoids re-implementation of memory management in the serving framework. We show that vAttention enables seamless dynamic memory management for unchanged implementations of various attention kernels. vAttention also generates tokens up to 1.97x faster than vLLM, while processing input prompts up to 3.92x and 1.45x faster than the PagedAttention variants of FlashAttention and FlashInfer, respectively.
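The core idea in this abstract, a contiguous virtual range whose physical backing grows on demand, can be illustrated with a toy model. The sketch below is only a simulation in plain Python: the class name `ContiguousKVBuffer`, its parameters, and the page-counting logic are hypothetical and are not taken from the paper, which relies on low-level system support rather than anything shown here.

```python
class ContiguousKVBuffer:
    """Toy model of a contiguous virtual range backed lazily by physical pages.

    Hypothetical illustration only; not the vAttention implementation.
    """

    def __init__(self, max_tokens, bytes_per_token, page_size):
        self.bytes_per_token = bytes_per_token
        self.page_size = page_size
        # Virtual reservation sized for the worst case. In this model it
        # consumes no physical memory until pages are mapped.
        self.virtual_bytes = max_tokens * bytes_per_token
        self.mapped_pages = 0  # physical pages currently backing the range
        self.tokens = 0

    def append_token(self):
        needed_bytes = (self.tokens + 1) * self.bytes_per_token
        if needed_bytes > self.virtual_bytes:
            raise MemoryError("sequence exceeds the reserved virtual range")
        # Ceiling division: pages required to hold all tokens so far.
        needed_pages = -(-needed_bytes // self.page_size)
        # Map more physical pages only when a page boundary is crossed.
        self.mapped_pages = max(self.mapped_pages, needed_pages)
        self.tokens += 1

    def physical_bytes(self):
        return self.mapped_pages * self.page_size
```

Because token offsets stay contiguous within the reserved range, code that indexes the buffer does not need to know which pages are mapped; in this model only the allocator tracks that.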

#2 sqlelf: a SQL-centric Approach to ELF Analysis

Authors: Farid Zakaria ; Zheyuan Chen ; Andrew Quinn ; Thomas R. W. Scogland

The exploration and understanding of Executable and Linkable Format (ELF) objects underpin various critical activities in computer systems, from debugging to reverse engineering. Traditional UNIX tools such as readelf, nm, and objdump have served the community reliably over the years. However, as the complexity and scale of software projects have grown, a need has arisen for more intuitive, flexible, and powerful methods to investigate ELF objects. In this paper, we introduce sqlelf, a tool that lets users probe ELF objects through the expressive power of SQL. By modeling ELF objects as relational databases, sqlelf offers advantages over conventional methods. Our evaluations demonstrate that sqlelf not only provides more nuanced and comprehensive insights into ELF objects but also significantly reduces the effort and time traditionally required for ELF exploration tasks.
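To make the relational framing concrete, here is a minimal sketch using Python's standard `sqlite3` module. The `symbols` table, its columns, the sample rows, and the `unresolved_symbols` helper are all hypothetical and chosen for illustration; they are not sqlelf's actual schema or API.

```python
import sqlite3

# Hypothetical schema: one row per symbol per ELF object.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE symbols (path TEXT, name TEXT, defined INTEGER)")
conn.executemany(
    "INSERT INTO symbols VALUES (?, ?, ?)",
    [
        ("app", "main", 1),
        ("app", "printf", 0),    # referenced by app, defined elsewhere
        ("app", "helper", 0),    # referenced by app, defined nowhere
        ("libc.so", "printf", 1),
    ],
)


def unresolved_symbols(conn, path):
    """Return names that `path` references but no loaded object defines."""
    query = """
        SELECT u.name
        FROM symbols AS u
        WHERE u.path = ? AND u.defined = 0
          AND NOT EXISTS (
              SELECT 1 FROM symbols AS d
              WHERE d.name = u.name AND d.defined = 1
          )
        ORDER BY u.name
    """
    return [name for (name,) in conn.execute(query, (path,))]


print(unresolved_symbols(conn, "app"))
```

A question like "which symbols does this binary need that nothing provides?" becomes a single query across objects, which is the kind of cross-file analysis that is awkward to assemble from per-file text output.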

#3 uTNT: Unikernels for Efficient and Flexible Internet Probing

Authors: Maxime Letemple ; Gaulthier Gain ; Sami Ben Mariem ; Laurent Mathy ; Benoit Donnet

The last twenty years have seen the development and growing popularity of network measurement infrastructures. Internet measurement platforms have become common and have demonstrated their relevance for understanding the Internet and observing its security. However, despite their popularity, those platforms lack flexibility and reactivity, as they are usually used for longitudinal measurements. As a consequence, they may fail to detect security-related or Internet-related events. During the same period, operating systems have evolved toward virtual machines (VMs) as self-contained units for running applications, with the recent rise of unikernels, ultra-lightweight VMs tailored for specific applications, eliminating the need for a host OS. In this paper, we advocate that measurement infrastructures could take advantage of unikernels to become more flexible and efficient. We propose uTNT, a proof-of-concept unikernel-based implementation of TNT, a traceroute extension able to reveal MPLS tunnels. This paper documents the full toolchain for porting TNT into a unikernel and evaluates uTNT performance with respect to more traditional approaches. The paper also discusses a use case for which uTNT would be well suited. uTNT source code is publicly available on GitLab.
