The main thing to note with distributed training is that each worker (a GPU, a CPU, or a different machine entirely) gets a different slice of the data. Each worker uses the same copy of the model to compute a forward pass, calculate the loss, and then compute the gradient via the backward pass. This means we end up with one gradient per worker, so how do we perform the update?
We simply need a way to **reduce** the multiple gradients (each a tensor, frequently a matrix) down to a single gradient. This is often done by *averaging* them:
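Concretely, assuming $N$ workers where worker $i$ produces gradient $g_i$ for the shared parameters $\theta$ (notation introduced here just for this sketch), the averaged gradient and the resulting shared update are:

$$
g = \frac{1}{N} \sum_{i=1}^{N} g_i, \qquad \theta \leftarrow \theta - \eta \, g
$$

where $\eta$ is the learning rate.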

This can happen *synchronously*: every worker finishes its backward pass, the gradients are reduced (e.g. averaged), and all workers apply the same update before moving on to the next batch.
A good walkthrough of how this works can be found in [this video](https://www.youtube.com/watch?v=S1tN9a4Proc).
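As a rough illustration of the synchronous case, the toy sketch below simulates data-parallel SGD for linear regression in a single process: each "worker" gets a different slice of the batch, computes its own gradient with the same weights, and the gradients are averaged before one shared update. The data, function, and variable names are made up for this example; a real setup would use a framework's collective ops instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data: y = X @ w_true + noise
X = rng.normal(size=(128, 4))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true + 0.01 * rng.normal(size=128)

n_workers = 4
lr = 0.1
w = np.zeros(4)  # shared model parameters (replicated on every worker)

def worker_gradient(w, X_shard, y_shard):
    """Forward pass, MSE loss, and backward pass on one worker's data slice."""
    error = X_shard @ w - y_shard
    return 2.0 * X_shard.T @ error / len(y_shard)

for step in range(100):
    # Each worker gets a different slice of the data
    X_shards = np.array_split(X, n_workers)
    y_shards = np.array_split(y, n_workers)

    # Every worker computes a gradient using the *same* model weights
    grads = [worker_gradient(w, Xs, ys) for Xs, ys in zip(X_shards, y_shards)]

    # Synchronous reduce: average the gradients, then apply one shared update
    g = np.mean(grads, axis=0)
    w = w - lr * g

print("learned:", np.round(w, 2), "true:", w_true)
```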
This can also occur *asynchronously*, as in the **parameter-server architecture**: each worker pushes its gradient to a central server and pulls the latest parameters whenever it is ready, without waiting for the other workers.
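Below is a minimal single-machine sketch of the parameter-server idea, using threads to stand in for workers (the `ParameterServer` class and its `pull`/`push_gradient` methods are invented for illustration, not a real library's API). Each worker pulls the current, possibly stale, weights, computes a gradient on its own data slice, and pushes it back as soon as it finishes; the server applies each gradient on arrival rather than averaging a full set.

```python
import threading
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 4))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true + 0.01 * rng.normal(size=128)

class ParameterServer:
    """Holds the shared weights; workers pull them and push gradients asynchronously."""
    def __init__(self, dim, lr=0.05):
        self.w = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def pull(self):
        with self.lock:
            return self.w.copy()

    def push_gradient(self, grad):
        # Apply each worker's gradient as soon as it arrives (no averaging step)
        with self.lock:
            self.w -= self.lr * grad

def worker(server, X_shard, y_shard, steps=200):
    for _ in range(steps):
        w = server.pull()  # pull possibly-stale weights
        grad = 2.0 * X_shard.T @ (X_shard @ w - y_shard) / len(y_shard)
        server.push_gradient(grad)  # push without waiting for other workers

server = ParameterServer(dim=4)
threads = [
    threading.Thread(target=worker, args=(server, Xs, ys))
    for Xs, ys in zip(np.array_split(X, 4), np.array_split(y, 4))
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("learned:", np.round(server.pull(), 2), "true:", w_true)
```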
---
Date: 20221128
Links to:
Tags: #review
References:
* [A friendly introduction to distributed training (ML Tech Talks) - YouTube](https://www.youtube.com/watch?v=S1tN9a4Proc)