alfatafta@osdi20@USENIX


Toward a Generic Fault Tolerance Technique for Partial Network Partitioning

Authors: Mohammed Alfatafta ; Basil Alkhatib ; Ahmed Alquraan ; Samer Al-Kiswany

We present an extensive study focused on partial network partitioning. Partial network partitions disrupt the communication between some but not all nodes in a cluster. First, we conduct a comprehensive study of system failures caused by this fault in 12 popular systems. Our study reveals that the studied failures are catastrophic (e.g., lead to data loss), manifest easily, and can be triggered by partially partitioning a single node. Second, we dissect the design of eight popular systems and identify four principled approaches for tolerating partial partitions. Unfortunately, our analysis shows that implemented fault tolerance techniques are inadequate for modern systems; they either patch a particular mechanism or lead to a complete cluster shutdown, even when alternative network paths exist. Finally, our findings motivate us to build Nifty, a transparent communication layer that masks partial network partitions. Nifty builds an overlay between nodes to detour packets around partial partitions. Our prototype evaluation with six popular systems shows that Nifty overcomes the shortcomings of current fault tolerance approaches and effectively masks partial partitions while imposing negligible overhead.
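
To make the detour idea concrete, the following is a minimal sketch of how a path around a partial partition could be found. It is not Nifty's implementation; the abstract does not describe the mechanism in detail. The function name `find_detour`, the `reachable` map (each node mapped to the set of nodes it can still contact directly), and the three-node cluster are all hypothetical, and a plain breadth-first search stands in for whatever routing the real system uses.

```python
from collections import deque


def find_detour(reachable, src, dst):
    """Return the shortest node path from src to dst over working links, or None.

    `reachable` is a hypothetical map: node -> set of nodes it can contact directly.
    """
    parents = {src: None}  # also serves as the visited set
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk the parent pointers back to src to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in sorted(reachable.get(node, ())):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None  # complete partition: no alternative path exists


# Hypothetical cluster: the A<->B link is cut, but both nodes still reach C.
links = {"A": {"C"}, "B": {"C"}, "C": {"A", "B"}}
print(find_detour(links, "A", "B"))  # ['A', 'C', 'B']
```

In this example A and B cannot talk directly, which is exactly the partial partition the abstract defines, yet traffic can still be relayed through C. When no intermediate node connects the two sides, the function returns `None`, which corresponds to a complete rather than partial partition.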