2503.21013

AllReduce Scheduling with Hierarchical Deep Reinforcement Learning

Authors: Yufan Wei, Mickel Liu, Wenfei Wu

AllReduce is a distributed-computing technique used in many critical deep learning applications. Existing AllReduce scheduling methods often lack flexibility because they are topology-specific or rely on extensive handcrafted designs that require domain-specific knowledge. In this work, we aim to alleviate this inflexibility by proposing a deep-reinforcement-learning (DRL)-based pipeline that can generate AllReduce schedules for various network topologies without topology-specific design features. The flow scheduling module of this pipeline consists of two hierarchically structured DRL policies that work cooperatively to find optimal schedules. We compare the performance of our method against baseline methods on three topologies: BCube, DCell, and Jellyfish. Finally, we contribute a Python-based environment that simulates AllReduce scheduling on these network topologies.
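For readers unfamiliar with the operation being scheduled, the sketch below simulates a sum AllReduce with the well-known ring algorithm, a hand-designed schedule tied to one topology. It is only an illustration of what AllReduce computes and of what a fixed schedule looks like. It is not the DRL pipeline proposed in the paper, and the function name `ring_allreduce` and its data layout are assumptions made for this example.

```python
from typing import List


def ring_allreduce(data: List[List[float]]) -> List[List[float]]:
    """Simulate a sum AllReduce over n workers arranged in a ring.

    data[i] is worker i's local vector; its length must be divisible by n.
    Returns the buffers of all workers after the operation (all identical).
    Illustrative only: a fixed, hand-designed schedule, not a learned one.
    """
    n = len(data)
    bufs = [list(v) for v in data]  # copy so the inputs are not modified
    length = len(bufs[0])
    assert length % n == 0, "vector length must be divisible by worker count"
    chunk = length // n

    def span(c: int) -> slice:
        # Index range of chunk c inside a worker's buffer.
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: in step s, worker i sends chunk (i - s) mod n
    # to its right neighbour, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        # Collect all payloads first so the sends behave as simultaneous.
        sends = [(i, (i - s) % n, bufs[i][span((i - s) % n)]) for i in range(n)]
        for i, c, payload in sends:
            dst = (i + 1) % n
            bufs[dst][span(c)] = [a + b for a, b in zip(bufs[dst][span(c)], payload)]

    # Worker i now holds the fully reduced chunk (i + 1) mod n.
    # Phase 2, all-gather: in step s, worker i forwards chunk (i + 1 - s) mod n,
    # and the right neighbour overwrites its copy with the reduced values.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, bufs[i][span((i + 1 - s) % n)]) for i in range(n)]
        for i, c, payload in sends:
            bufs[(i + 1) % n][span(c)] = payload

    return bufs


if __name__ == "__main__":
    workers = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    for rank, buf in enumerate(ring_allreduce(workers)):
        print(rank, buf)
```

Each worker sends 2(n - 1) chunks in total, and the order of those sends is fixed by the ring. A scheduler for BCube, DCell, or Jellyfish has to choose such send orders and routes for a different link structure, which is the decision problem the abstract assigns to the learned policies.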

Subjects: Networking and Internet Architecture; Distributed, Parallel, and Cluster Computing

Publish: 2025-03-26 22:01:49 UTC