Deriving Semantic Checkers from Tests to Detect Silent Failures in Production Distributed Systems

Authors: Chang Lou, Dimas Shidqi Parikesit, Yujin Huang, Zhewen Yang, Senapati Diwangkara, Yuzhuo Jing, Achmad Imam Kistijantoro, Ding Yuan, Suman Nath, Peng Huang

Production distributed systems provide rich features, but various defects can cause a system to silently violate its semantics without raising explicit errors. Such failures cause serious consequences, yet they are extremely challenging to detect because writing good checkers requires deep domain knowledge and substantial manual effort. In this paper, we explore a novel approach that derives semantic checkers directly from system test code. We first present a large-scale study of existing system test cases. Guided by the study findings, we develop T2C, a framework that uses static and dynamic analysis to transform and generalize a test into a runtime checker. We apply T2C to four large, popular distributed systems and successfully derive tens to hundreds of checkers. These checkers detect 15 of the 20 real-world silent failures we reproduce and incur small runtime overhead.

Subject: OSDI.2025

lou@osdi25@USENIX
