17.3 Distributed training and data parallelism

3 min read · July 25, 2024

Distributed training is a game-changer in deep learning, enabling faster iterations and bigger models. It tackles challenges like handling massive datasets and complex architectures, but introduces new hurdles in communication and synchronization.

Key techniques like data parallelism and optimization strategies help overcome these challenges. From parameter servers to efficient algorithms like Ring-AllReduce, these methods make distributed training more practical and scalable across multiple GPUs and nodes.

Distributed Training Fundamentals

Challenges of distributed training

  • Faster training times enable quicker model iterations and deployment
  • Ability to handle larger datasets improves model generalization (ImageNet, Common Crawl)
  • Increased model capacity allows for more complex architectures (GPT-3, BERT)
  • Communication overhead slows down training as data transfer between nodes increases
  • Synchronization issues arise when coordinating updates across multiple devices
  • Load balancing ensures efficient resource utilization across heterogeneous hardware
  • Fault tolerance mechanisms prevent training failures due to node crashes or network issues
  • Strong scaling keeps problem size fixed while increasing resources for faster completion
  • Weak scaling grows problem size proportionally with resources to maintain efficiency
  • Data parallelism distributes same model across devices, each processing different data subsets
  • Model parallelism splits the model architecture across devices, suitable for very large models (a minimal sketch follows this list)
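As a rough illustration of the model-parallel split, here is a toy PyTorch module, assuming two visible GPUs (the layer sizes are placeholders, not values from the text):

```python
import torch
import torch.nn as nn

class TwoDeviceModel(nn.Module):
    """Toy model parallelism: the first layer lives on cuda:0, the second on cuda:1."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(1024, 512).to("cuda:0")
        self.part2 = nn.Linear(512, 10).to("cuda:1")

    def forward(self, x):
        x = torch.relu(self.part1(x.to("cuda:0")))
        return self.part2(x.to("cuda:1"))     # activations hop between devices

model = TwoDeviceModel()
out = model(torch.randn(32, 1024))            # output ends up on cuda:1
```

In data parallelism, by contrast, every device keeps a full copy of the model and only the data (and the resulting gradients) are split across devices.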

Data parallelism implementation techniques

  • Data parallelism overview splits data across multiple devices, each computing gradients on its subset
  • Model averaging aggregates gradients from all devices and updates model parameters with averaged values
  • Parameter server architecture uses a central server to store global model parameters; workers compute and send updates
  • TensorFlow tf.distribute.Strategy enables data parallelism with MirroredStrategy for single-machine multi-GPU setups (see the TensorFlow sketch after this list)
  • TensorFlow MultiWorkerMirroredStrategy extends data parallelism to multi-machine environments
  • PyTorch torch.nn.DataParallel facilitates single-machine multi-GPU training with automatic data distribution
  • PyTorch DistributedDataParallel supports multi-machine training with manual process group initialization
  • Synchronous updates wait for all workers to complete before applying changes, ensuring consistency
  • Asynchronous updates apply changes as soon as a worker completes, potentially increasing throughput at the cost of staleness
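A minimal single-machine MirroredStrategy sketch, assuming TensorFlow 2.x with one or more visible GPUs and MNIST as placeholder data (layer sizes and the base batch size are illustrative choices):

```python
import tensorflow as tf

# MirroredStrategy replicates the model onto every visible GPU and
# all-reduces gradients at each step (synchronous data parallelism).
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():                       # variables created here are mirrored
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(784,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# scale the global batch size with the number of replicas (a common heuristic)
global_batch = 64 * strategy.num_replicas_in_sync
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
model.fit(x_train, y_train, batch_size=global_batch, epochs=1)
```

MultiWorkerMirroredStrategy extends the same pattern to multiple machines, with worker addresses supplied through the TF_CONFIG environment variable.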

Optimization of communication overhead

  • Gradient compression techniques reduce data transfer volume (a sparsification sketch follows this list):
    1. Quantization lowers the precision of gradients (32-bit to 8-bit)
    2. Sparsification transmits only significant gradients (top-k selection)
    3. Error feedback accumulates quantization errors for future corrections
  • Asynchronous updates with the stale-synchronous parallel model allow bounded staleness in parameter updates
  • Ring-AllReduce algorithm efficiently aggregates gradients, minimizing network bandwidth usage
  • Gradient accumulation reduces communication frequency by aggregating gradients over multiple mini-batches
  • Local SGD performs multiple local updates before synchronization, balancing communication cost and convergence speed
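As a rough sketch of techniques 2 and 3 above (top-k sparsification with error feedback), here is a self-contained PyTorch function; the 1% keep-ratio is an illustrative assumption, and a real system would also encode the sparse indices for transmission:

```python
import torch

def topk_sparsify(grad, residual, k_ratio=0.01):
    """Keep only the largest-magnitude fraction of gradient entries;
    fold the dropped remainder into a residual for the next step (error feedback)."""
    corrected = grad + residual                 # re-inject previously dropped error
    flat = corrected.flatten()
    k = max(1, int(flat.numel() * k_ratio))
    _, idx = torch.topk(flat.abs(), k)
    sparse = torch.zeros_like(flat)
    sparse[idx] = flat[idx]                     # only these values would be sent
    sparse = sparse.view_as(grad)
    new_residual = corrected - sparse           # error kept locally for future correction
    return sparse, new_residual

grad = torch.randn(1000)
residual = torch.zeros_like(grad)
sparse_grad, residual = topk_sparsify(grad, residual)   # ~10 of 1000 values are non-zero
```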

Scaling to multiple GPUs and nodes

  • Horovod, an open-source framework, facilitates distributed training across TensorFlow, PyTorch, and MXNet
  • Horovod implementation uses hvd.init() for initialization and hvd.DistributedOptimizer for wrapping optimizers
  • PyTorch Distributed leverages the torch.distributed package for native multi-machine training support
  • PyTorch Distributed implementation requires dist.init_process_group() for initialization and DistributedDataParallel for model wrapping (a runnable sketch follows this list)
  • Multi-GPU training considerations include efficient GPU memory utilization and proper batch size scaling
  • Multi-node training setup involves network configuration, firewall settings, and job scheduling across clusters
  • Monitoring distributed training requires distributed logging and metrics collection (TensorBoard, Weights & Biases)
  • Debugging distributed setups involves identifying bottlenecks and performance issues across nodes
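A minimal DistributedDataParallel sketch, assuming a CUDA machine and a launch via torchrun (the model, batch, and learning rate are placeholders):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every spawned process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 10).to(f"cuda:{local_rank}")   # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # one illustrative step; a real loop would draw batches via a DistributedSampler
    x = torch.randn(32, 128, device=f"cuda:{local_rank}")
    y = torch.randint(0, 10, (32,), device=f"cuda:{local_rank}")
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()                 # DDP all-reduces gradients across processes here
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with, for example, `torchrun --nproc_per_node=4 train.py`, which starts one process per GPU and sets the environment variables that init_process_group reads.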

Key Terms to Review (28)

Asynchronous updates: Asynchronous updates refer to a method of updating model parameters in distributed training where each worker node computes gradients independently and updates the model without waiting for others to finish. This technique is particularly useful in data parallelism, as it allows for faster convergence and better resource utilization by overlapping computation and communication. Asynchronous updates can reduce the overall training time but may introduce challenges like stale gradients, which occur when different workers use outdated model parameters for their calculations.
Data parallelism: Data parallelism is a computing paradigm that involves distributing data across multiple processing units to perform the same operation on each subset of data simultaneously. This technique is crucial for speeding up the training and inference processes in deep learning, allowing models to handle large datasets more efficiently by taking advantage of the computational power offered by GPUs and distributed systems.
Dynamic load balancing: Dynamic load balancing is a technique used in distributed computing systems to efficiently distribute workloads across multiple processing units in real-time. By continuously monitoring the performance and current workload of each unit, this method allows for the adjustment of task allocation to optimize resource utilization and minimize processing time, which is crucial for maintaining performance in scenarios like distributed training and data parallelism.
Error feedback: Error feedback is a mechanism in machine learning and deep learning systems where the error or discrepancy between the predicted output and the actual target is used to adjust and improve the model. This process is essential for optimizing the model's performance during training, as it helps the system learn from its mistakes and refine its predictions. It involves calculating the gradient of the error with respect to the model parameters and using this information to update the weights in order to minimize future errors.
GPUs: GPUs, or Graphics Processing Units, are specialized hardware designed to accelerate the rendering of images and graphics. Beyond their initial purpose in gaming and visual processing, GPUs are pivotal in deep learning due to their ability to handle parallel processing efficiently, making them ideal for training complex neural networks. Their architecture allows for the simultaneous execution of multiple operations, which is crucial in both distributed training environments and when deploying models on edge devices or mobile platforms.
Gradient accumulation: Gradient accumulation is a technique used in training deep learning models where gradients are computed over multiple mini-batches before updating the model's weights. This approach helps manage memory constraints and improves training efficiency by allowing larger effective batch sizes without increasing the actual memory footprint of the training process. It plays a significant role in distributed training and data parallelism, where gradients can be aggregated from multiple devices before performing a single update.
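A small self-contained PyTorch sketch of gradient accumulation, with toy data and an assumed accumulation factor of 4:

```python
import torch
from torch import nn

model = nn.Linear(20, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = [(torch.randn(8, 20), torch.randint(0, 2, (8,))) for _ in range(16)]  # toy batches

accumulation_steps = 4                        # effective batch size = 8 * 4
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = nn.functional.cross_entropy(model(x), y)
    (loss / accumulation_steps).backward()    # gradients add up in the .grad buffers
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                      # one parameter update per 4 mini-batches
        optimizer.zero_grad()
```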
Gradient compression: Gradient compression is a technique used in distributed training to reduce the amount of data transmitted between different computing nodes by encoding gradients more efficiently. This method is essential when working with large models or datasets, as it minimizes communication overhead and speeds up the training process. By compressing gradients, the overall communication costs can be decreased, which is particularly beneficial in environments where bandwidth is limited or costly.
Horovod: Horovod is an open-source framework designed to make distributed deep learning faster and easier by enabling data parallelism across multiple GPUs and nodes. It achieves this by simplifying the process of scaling TensorFlow, PyTorch, and other frameworks, allowing users to train models on large datasets more efficiently. Horovod uses a technique called ring-allreduce for gradient synchronization, which optimizes communication between GPUs, reducing the overhead typically seen in distributed training.
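A rough Horovod-with-PyTorch sketch, assuming Horovod is installed with its PyTorch extension and the script is launched with horovodrun (model, data, and the learning-rate scaling are placeholders):

```python
import torch
import horovod.torch as hvd

hvd.init()                                   # one process per GPU
torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Linear(128, 10).cuda()      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())  # LR-scaling heuristic

# wrap the optimizer so gradients are averaged via ring-allreduce on each step
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# start every worker from identical weights and optimizer state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

x = torch.randn(32, 128).cuda()
y = torch.randint(0, 10, (32,)).cuda()
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
```

A command such as `horovodrun -np 4 python train.py` would start four such processes.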
Latency: Latency refers to the time delay between an input or request and the corresponding output or response in a system. In the context of deep learning, low latency is crucial for real-time applications where quick feedback is necessary, such as in inference tasks and interactive systems. It is influenced by various factors including hardware performance, network conditions, and software optimizations.
Local sgd: Local SGD (Stochastic Gradient Descent) is an optimization technique used in distributed machine learning where each worker node performs multiple updates on its local data before synchronizing with other nodes. This approach reduces communication overhead and allows for faster convergence by enabling workers to work with their own subsets of data independently for a certain number of iterations.
MirroredStrategy: MirroredStrategy is a method in distributed training where each worker node has an identical copy of the model, allowing for synchronized training across multiple devices. This technique enhances data parallelism by ensuring that each worker processes different data batches while keeping the model parameters consistent through regular updates. It effectively reduces training time and increases computational efficiency, especially in large-scale deep learning tasks.
Model parallelism: Model parallelism is a strategy in deep learning where different parts of a neural network model are distributed across multiple processing units, such as GPUs. This approach is particularly useful when the model is too large to fit into the memory of a single device, allowing for efficient utilization of resources by splitting computation tasks. By executing separate components of the model simultaneously, model parallelism enhances training speed and scalability, especially in complex architectures.
MultiWorkerMirroredStrategy: The multi-worker mirrored strategy is a technique used in distributed training for deep learning models where multiple workers (computational nodes) synchronize their model updates and gradients while maintaining identical copies of the model. This approach ensures that all workers are consistently improving the model by combining their individual updates, which can significantly speed up the training process and enhance performance. This strategy directly ties into the concepts of data parallelism and distributed training, allowing for more efficient use of computational resources across multiple devices.
Parameter Server: A parameter server is a distributed system that manages the storage and updating of model parameters in machine learning tasks, particularly during training processes. It acts as a centralized repository for parameters, allowing multiple workers to retrieve and update these parameters efficiently, thus supporting distributed training and data parallelism. By enabling scalable training across multiple machines, the parameter server architecture significantly accelerates the learning process for large datasets and complex models.
PyTorch: PyTorch is an open-source machine learning library used for applications such as computer vision and natural language processing, developed by Facebook's AI Research lab. It is known for its dynamic computation graph, which allows for flexible model building and debugging, making it a favorite among researchers and developers.
Quantization: Quantization is the process of mapping a large set of input values to a smaller set, typically used to reduce the precision of numerical values in deep learning models. This reduction helps to decrease the model size and improve computational efficiency, which is especially important for deploying models on resource-constrained devices. By simplifying the representation of weights and activations, quantization can lead to faster inference times and lower power consumption without significantly affecting model accuracy.
Ring-allreduce: Ring-allreduce is a collective communication operation used in distributed computing where each node in a network contributes its data, processes it, and then shares the result with all other nodes in a ring topology. This method is particularly efficient for parallelizing tasks, as it minimizes communication overhead and balances the load across participating nodes, making it a key technique in optimizing distributed training and data parallelism.
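To make the mechanics concrete, here is a small single-process simulation of the two phases (scatter-reduce, then allgather) using NumPy arrays in place of real network transfers; it is a didactic sketch, not how Horovod or NCCL implement it internally:

```python
import numpy as np

def ring_allreduce(node_data):
    """Simulate ring-allreduce: every 'node' ends up with the elementwise sum."""
    n = len(node_data)
    # each node splits its vector into n chunks
    chunks = [np.array_split(v.astype(float), n) for v in node_data]

    # scatter-reduce: after n-1 steps, node i holds the fully reduced chunk (i + 1) % n
    for step in range(n - 1):
        for i in range(n):
            k = (i - step) % n
            chunks[(i + 1) % n][k] += chunks[i][k]      # "send" chunk k to the next node

    # allgather: circulate the reduced chunks until every node has all of them
    for step in range(n - 1):
        for i in range(n):
            k = (i + 1 - step) % n
            chunks[(i + 1) % n][k] = chunks[i][k].copy()

    return [np.concatenate(c) for c in chunks]

# three nodes, each contributing [1, 2, 3, 4]; every node should end with [3, 6, 9, 12]
result = ring_allreduce([np.array([1, 2, 3, 4])] * 3)
```

Each of the 2(n-1) steps moves only about 1/n of the data per node, which is why the per-node bandwidth cost stays roughly constant as more workers are added.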
Sparsification: Sparsification is a technique used to reduce the amount of data or parameters in a model while retaining its essential properties and performance. This approach is particularly relevant in distributed training and data parallelism, where the aim is to minimize communication costs and memory usage across multiple devices without significantly sacrificing the accuracy of the learning process.
Stale-synchronous parallel: Stale-synchronous parallel refers to a distributed training approach where the workers in a system update their model parameters asynchronously, but they do so in a way that ensures that all workers eventually synchronize their updates at certain intervals. This method allows for faster training times since workers can continue their computations without waiting for others, but it introduces the challenge of using outdated or 'stale' information, which can affect the convergence and accuracy of the training process.
Staleness: Staleness refers to the delay in updating model parameters in distributed training, particularly when multiple workers process data in parallel. This delay can lead to inconsistencies between the models maintained by different workers, causing them to operate based on outdated information. Understanding staleness is crucial for optimizing convergence rates and ensuring efficient learning across multiple nodes in a distributed setup.
Straggler Mitigation: Straggler mitigation refers to techniques used in distributed systems to address the issue of slow or underperforming nodes, often referred to as stragglers, during training processes. These stragglers can delay overall performance, making it crucial to implement strategies that ensure faster nodes can proceed without waiting for slower ones. Effective straggler mitigation enhances the efficiency and speed of distributed training, which is essential for scaling deep learning applications effectively.
Synchronous updates: Synchronous updates refer to a method of updating model parameters in a distributed training environment, where all nodes compute gradients and apply updates at the same time, ensuring that every participant has the same view of the model at each iteration. This approach helps maintain consistency across multiple devices, making it easier to converge to an optimal solution, though overall speed can be limited by the slowest worker.
TensorFlow: TensorFlow is an open-source deep learning framework developed by Google that allows developers to create and train machine learning models efficiently. It provides a flexible architecture for deploying computations across various platforms, making it suitable for both research and production environments.
Throughput: Throughput refers to the amount of data processed or transmitted in a given amount of time, typically measured in operations per second or data per second. It is a crucial performance metric in computing and networking that indicates how efficiently a system can handle tasks or operations. High throughput is essential for deep learning applications, where large amounts of data need to be processed quickly and efficiently.
Torch.distributed: torch.distributed is a package in PyTorch that provides support for distributed training, enabling multiple processes to communicate with each other during the training of deep learning models. This feature is essential for scaling up training workloads across multiple machines and GPUs, allowing for efficient data parallelism where different model replicas are trained on different subsets of the data. It is crucial for optimizing performance and reducing training time in large-scale machine learning applications.
Torch.nn.DataParallel: torch.nn.DataParallel is a PyTorch utility that enables easy distribution of a neural network across multiple GPUs, allowing for parallel processing of data during training. This method helps in speeding up training times by splitting the input data into smaller chunks that can be processed simultaneously on different GPUs, making efficient use of available hardware resources.
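A minimal usage sketch, assuming at least one CUDA GPU is visible (the layer sizes are placeholders):

```python
import torch
from torch import nn

base_model = nn.Linear(512, 10)               # placeholder model
# DataParallel replicates the module on every visible GPU, splits each input
# batch along dimension 0, and gathers the outputs on the default device
model = nn.DataParallel(base_model).cuda()

x = torch.randn(64, 512).cuda()
logits = model(x)                             # the 64 samples run in parallel chunks
```

For multi-machine setups or higher throughput, DistributedDataParallel (next term) is generally preferred.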
Torch.nn.parallel.distributeddataparallel: The `torch.nn.parallel.distributeddataparallel` is a PyTorch module designed to facilitate distributed training of deep learning models across multiple GPUs and nodes. It efficiently manages the distribution of data and synchronization of model gradients, enabling faster training times and improved scalability. This tool connects closely with concepts like custom hardware acceleration and parallelism in training processes.
TPUs: TPUs, or Tensor Processing Units, are specialized hardware accelerators designed by Google specifically for deep learning tasks. These chips are optimized for running tensor operations efficiently, which makes them highly effective for training and inference in neural networks. TPUs enable faster computations compared to traditional CPUs and GPUs, making them particularly useful in image classification tasks and for scaling distributed training processes across multiple devices.