The main thing to note with distributed training is that each worker (a GPU, a CPU, a different machine, etc.) gets a different slice of the data. Each worker then uses the same copy of the model to compute a forward pass, calculate the loss, and compute the gradient via the backward pass. This leaves us with one gradient per worker, so how do we perform the update? We need a way to **reduce** the multiple gradients (each a tensor, frequently a matrix) to a single gradient. This is often done via *averaging*:

![](Screen%20Shot%202022-11-28%20at%207.04.53%20AM.png)

This can happen *synchronously*, where all workers wait for each other's gradients before averaging (a toy sketch of this is included at the end of this note):

![](A%20friendly%20introduction%20to%20distributed%20training%20(ML%20Tech%20Talks)%2010-33%20screenshot.png)

A good walkthrough of how this works can be found in [this video](https://www.youtube.com/watch?v=S1tN9a4Proc). The reduction can also occur *asynchronously*, as in the **parameter-server architecture**, where workers push gradients to and pull updated weights from a central server without waiting for one another.

---
Date: 20221128
Links to: 
Tags: #review

References:
* [A friendly introduction to distributed training (ML Tech Talks) - YouTube](https://www.youtube.com/watch?v=S1tN9a4Proc)
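
Below is a minimal sketch of the synchronous case described above: several simulated workers each run a forward/backward pass on their own shard of the data with identical weights, then their gradients are averaged into a single gradient before one update. The worker count, the toy linear model, and the `worker_gradient` helper are all assumptions for illustration; a real setup would use something like `torch.distributed` or a parameter server rather than a single-process loop.

```python
# Toy single-process simulation of synchronous data-parallel gradient averaging.
# (Illustrative only; not how a real multi-GPU/multi-node job is launched.)
import torch

torch.manual_seed(0)

NUM_WORKERS = 4          # pretend each "worker" is a separate GPU/machine
BATCH_PER_WORKER = 8

# One shared linear model: y = x @ w, replicated on every worker.
w = torch.randn(3, 1, requires_grad=True)

# Synthetic dataset, split into one shard per worker (each worker sees a different slice).
X = torch.randn(NUM_WORKERS * BATCH_PER_WORKER, 3)
y = X @ torch.tensor([[1.0], [-2.0], [0.5]])
shards = list(zip(X.chunk(NUM_WORKERS), y.chunk(NUM_WORKERS)))

def worker_gradient(w, x_shard, y_shard):
    """Forward + backward pass on one worker's slice of the data (hypothetical helper)."""
    w_local = w.detach().clone().requires_grad_(True)   # same weights on every worker
    loss = ((x_shard @ w_local - y_shard) ** 2).mean()  # local loss on the local shard
    loss.backward()
    return w_local.grad

# Each worker produces its own gradient tensor...
grads = [worker_gradient(w, xs, ys) for xs, ys in shards]

# ...and the "reduce" step averages them into a single gradient.
avg_grad = torch.stack(grads).mean(dim=0)

# One synchronous SGD update applied to the shared weights.
lr = 0.1
with torch.no_grad():
    w -= lr * avg_grad

print("averaged gradient:\n", avg_grad)
```

In the asynchronous parameter-server variant, the averaging/update step would instead live on the server, with each worker sending its gradient as soon as it finishes rather than waiting for the others.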