Total: 1
In an era when the performance of a single compute device plateaus, software must be designed to scale on a massively parallel system for better runtime performance. However, in the context of training deep learning models, the commonly used back-propagation (BP) algorithm imposes a strong sequential dependency in the process of gradient computation. Under model parallelism, BP has a theoretical step complexity of Theta(n) which hinders its scalability in a parallel computing environment, where n represents the number of compute devices into which a model is partitioned.