1 point | by traceopt-ai 5 hours ago
1 comment
This tool focuses on finding stragglers in multi-GPU PyTorch (DDP) training. Because DDP synchronizes gradients across all ranks on every step, one slow rank often gates the entire step, but it is hard to see which GPU is lagging and why.
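To make the straggler idea concrete, here is a minimal sketch, not the tool's actual implementation. It assumes per-rank step times have already been collected somehow (for example with CUDA events and a gather to rank 0). The names `find_stragglers`, `effective_step_time`, and the 1.2x threshold are made up for illustration.

```python
from statistics import median


def effective_step_time(step_times_ms):
    """In synchronous training the step ends when the slowest rank ends."""
    return max(step_times_ms.values())


def find_stragglers(step_times_ms, threshold=1.2):
    """Return (rank, time_ms, ratio_to_median) for ranks slower than
    threshold x the median step time. The threshold is an arbitrary choice."""
    med = median(step_times_ms.values())
    if med <= 0:
        return []  # no meaningful baseline to compare against
    return [
        (rank, t, t / med)
        for rank, t in sorted(step_times_ms.items())
        if t > threshold * med
    ]


# Hypothetical timings for one step on a 4-GPU node (rank -> milliseconds).
times = {0: 102.0, 1: 99.0, 2: 148.0, 3: 101.0}

for rank, t, ratio in find_stragglers(times):
    print(f"rank {rank}: {t:.1f} ms ({ratio:.2f}x median)")
print(f"step gated at {effective_step_time(times):.1f} ms")
```

This only answers "which rank is slow", not "why". Attributing the delay to data loading, compute, or communication would need per-phase timings rather than one number per step.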
This is early and single-node only for now. Feedback welcome.