ji@fast19@USENIX

#1 Automatic, Application-Aware I/O Forwarding Resource Allocation

Authors: Xu Ji, Bin Yang, Tianyu Zhang, Xiaosong Ma, Xiupeng Zhu, Xiyang Wang, Nosayba El-Sayed, Jidong Zhai, Weiguo Liu, Wei Xue

The I/O forwarding architecture is widely adopted on modern supercomputers, with a layer of intermediate nodes sitting between the many compute nodes and the backend storage nodes. This allows compute nodes to run more efficiently and stably with a leaner OS, offloads I/O coordination and communication with the backend from the compute nodes, maintains fewer concurrent connections to storage systems, and provides additional resources for effective caching, prefetching, write buffering, and I/O aggregation. However, on many existing machines, these forwarding nodes are assigned to serve a fixed set of compute nodes. We explore an automatic mechanism, DFRA, for application-adaptive dynamic forwarding resource allocation. Using I/O monitoring data that proves affordable to acquire in real time and to maintain for long-term history analysis, DFRA conducts a history-based study upon each job's dispatch to determine whether the job should be granted more forwarding resources or given dedicated forwarding nodes. Such customized I/O forwarding lets the small fraction of I/O-intensive applications achieve higher I/O performance and scalability, while effectively isolating disruptive I/O activities. We implemented, evaluated, and deployed DFRA on Sunway TaihuLight, the current No. 2 supercomputer in the world. It improves applications' I/O performance by up to 16.0x, eliminates most of the inter-application I/O interference, and has saved over 200 million core-hours during its deployment on TaihuLight over the past 8 months. Finally, our proposed DFRA design is not platform-dependent, making it applicable to the management of existing and future I/O forwarding or burst buffer resources.
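
To make the dispatch-time decision described in the abstract concrete, the following is a minimal sketch of a history-based allocation check. It is not DFRA's actual algorithm, which the abstract does not specify. The record schema (`JobRecord`), the function `decide_allocation`, and the constants `DEFAULT_RATIO` and `NODE_CAPACITY_MBPS` are all hypothetical, chosen only to illustrate the idea of choosing between the default mapping, extra forwarding nodes, and dedicated forwarding nodes.

```python
import math
from dataclasses import dataclass
from statistics import mean


@dataclass
class JobRecord:
    """One past run of an application (hypothetical history schema)."""
    app: str
    compute_nodes: int
    peak_io_mbps: float  # peak aggregate I/O bandwidth observed in that run
    disruptive: bool     # whether the run was flagged as disturbing other jobs


# Illustrative constants only; these are assumptions, not values from the paper.
DEFAULT_RATIO = 512           # compute nodes per forwarding node in a static mapping
NODE_CAPACITY_MBPS = 2500.0   # bandwidth one forwarding node is assumed to sustain


def decide_allocation(app, compute_nodes, history):
    """Return a forwarding-node count and a dedicated/shared flag for a new job."""
    # Baseline: what the fixed compute-to-forwarding mapping would give the job.
    default_nodes = max(1, math.ceil(compute_nodes / DEFAULT_RATIO))

    runs = [r for r in history if r.app == app]
    if not runs:
        # No history for this application: fall back to the default mapping.
        return {"forwarding_nodes": default_nodes, "dedicated": False}

    # Grant more forwarding nodes when past I/O demand exceeds the default capacity.
    demand = mean(r.peak_io_mbps for r in runs)
    needed = max(default_nodes, math.ceil(demand / NODE_CAPACITY_MBPS))

    # Isolate the job on dedicated nodes if it previously disturbed its neighbours.
    dedicated = any(r.disruptive for r in runs)
    return {"forwarding_nodes": needed, "dedicated": dedicated}


history = [
    JobRecord("climate", 1024, 9000.0, False),
    JobRecord("climate", 1024, 11000.0, False),
    JobRecord("solver", 512, 100.0, True),
]
print(decide_allocation("climate", 1024, history))
print(decide_allocation("solver", 512, history))
print(decide_allocation("unseen_app", 100, history))
```

In this sketch, an I/O-heavy application receives more forwarding nodes than the static mapping would provide, a previously disruptive one is isolated, and an application with no history keeps the default. A real policy would also have to account for how many forwarding nodes are currently free, which is omitted here.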