huang-lexiang@osdi22@USENIX

#1 Metastable Failures in the Wild

Authors: Lexiang Huang ; Matthew Magnusson ; Abishek Bangalore Muralikrishna ; Salman Estyak ; Rebecca Isaacs ; Abutalib Aghayev ; Timothy Zhu ; Aleksey Charapko

Recently, Bronson et al. introduced a framework for understanding a class of failures in distributed systems called metastable failures. The examples of metastable failures presented in that work are simplified versions of failures observed at Facebook. In this work, we study the prevalence of such failures in the wild by scouring publicly available incident reports from many organizations, ranging from hyperscalers to small companies. Our main findings are threefold. First, metastable failures are universally observed: we present an in-depth study of 22 metastable failures from 11 different organizations. Second, metastable failures are a recurring pattern in many severe outages; for example, at least 4 out of 15 major outages at Amazon Web Services in the last decade were caused by metastable failures. Third, we extend the model by Bronson et al. to better reflect the metastable failures seen in the wild by categorizing two types of triggers and two types of amplification mechanisms, which we confirm by developing multiple example applications that reproduce different types of metastable failures in a controlled environment. We believe our work will aid in a deeper understanding of metastable failures and in developing solutions to them.