Distributed Data Parallel (DDP)


June 26, 2025 · Deependu

🧠 Concept: Training Workflows

Single-GPU (No Parallelism)

DataLoader → batch → model → loss → backward → optimizer.step()
  • Everything runs in one process, on one device.
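A minimal sketch of that loop; the model, dataset, and hyperparameters below are toy placeholders for illustration, not from any specific project:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy dataset and model standing in for real ones
dataset = TensorDataset(torch.randn(1024, 20), torch.randint(0, 2, (1024,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

model = nn.Linear(20, 2).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for inputs, targets in loader:                 # DataLoader → batch
    inputs, targets = inputs.to(device), targets.to(device)
    loss = criterion(model(inputs), targets)   # → model → loss
    optimizer.zero_grad()
    loss.backward()                            # → backward
    optimizer.step()                           # → optimizer.step()
```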

DDP

Multiple processes, each with its **own model replica and DataLoader**, running on a **separate GPU**.


Each process does the following:

  1. Loads a different subset of data using its own DataLoader (thanks to DistributedSampler).
  2. Performs forward + backward pass independently on its own model replica.
  3. During backward(), gradients are synchronized across all processes using all-reduce.
  4. After the all-reduce, every process holds the same averaged gradient, ensuring consistent updates.
  5. Each optimizer step updates its local model, but all replicas remain in sync because gradients were averaged.

✅ This gives you data parallelism with minimal communication overhead and no parameter divergence; a minimal end-to-end sketch follows below.
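A hedged sketch of such a DDP script, assuming it is launched with `torchrun --nproc_per_node=<num_gpus> train.py` on a machine with NVIDIA GPUs (NCCL backend); the model, dataset, and hyperparameters are toy placeholders:

```python
import os

import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process it spawns
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)

    # Toy dataset; DistributedSampler hands each process a different shard
    dataset = TensorDataset(torch.randn(1024, 20), torch.randint(0, 2, (1024,)))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    model = nn.Linear(20, 2).to(device)              # one replica per process
    ddp_model = DDP(model, device_ids=[local_rank])
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for epoch in range(2):
        sampler.set_epoch(epoch)                     # reshuffle shards each epoch
        for inputs, targets in loader:
            inputs, targets = inputs.to(device), targets.to(device)
            loss = criterion(ddp_model(inputs), targets)
            optimizer.zero_grad()
            loss.backward()       # gradients are all-reduced (averaged) here
            optimizer.step()      # local update; replicas stay identical

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launching with `torchrun` spawns one process per GPU and sets `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` for each of them, which is what `init_process_group` and `DistributedSampler` rely on above.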


DDP over DataParallel (DP)

DataParallel is an **older approach** to data parallelism. DP is trivially simple (just one extra line of code), but it is much less performant; the table and the sketch below show the contrast.
| DataParallel | DistributedDataParallel |
| --- | --- |
| More overhead; the model is replicated and destroyed at each forward pass | The model is replicated only once |
| Only supports single-node parallelism | Supports scaling to multiple machines |
| Slower; uses multithreading in a single process and runs into Global Interpreter Lock (GIL) contention | Faster (no GIL contention) because it uses multiprocessing |
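To make the "one extra line" concrete, here is an illustrative sketch; the `model` below is a hypothetical placeholder:

```python
import torch
from torch import nn

model = nn.Linear(20, 2)

if torch.cuda.is_available():
    # DataParallel: the "one extra line". A single process; at every forward
    # pass the batch is scattered and the model replicated across visible GPUs.
    dp_model = nn.DataParallel(model.cuda())

# DistributedDataParallel: one process per GPU, wrapped once; requires
# torch.distributed.init_process_group() and a torchrun-style launch
# (see the full sketch in the DDP section above):
# ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```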