Distributed Data Parallel (DDP)


June 26, 2025 · Deependu

🧠 Concept: Training Workflows

Single-GPU (No Parallelism)

DataLoader → batch → model → loss → backward → optimizer.step()
  • Everything runs in one process, on one device.
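A minimal sketch of that loop; the model, dataset, and hyperparameters below are toy placeholders for illustration, not from any specific project:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy dataset and model standing in for real ones
dataset = TensorDataset(torch.randn(1024, 20), torch.randint(0, 2, (1024,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

model = nn.Linear(20, 2).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for inputs, targets in loader:                 # DataLoader → batch
    inputs, targets = inputs.to(device), targets.to(device)
    loss = criterion(model(inputs), targets)   # → model → loss
    optimizer.zero_grad()
    loss.backward()                            # → backward
    optimizer.step()                           # → optimizer.step()
```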

DDP

Multiple processes, each with its **own model replica and DataLoader**, running on a **separate GPU**.


Each process does the following:

  1. Loads a different subset of data using its own DataLoader (thanks to DistributedSampler).
  2. Performs forward + backward pass independently on its own model replica.
  3. During backward(), gradients are synchronized across all processes using all-reduce.
  4. After the all-reduce, every process holds the same averaged gradient, ensuring consistent updates.
  5. Each optimizer step updates its local model, but all replicas remain in sync because gradients were averaged.

✅ This gives you data parallelism with minimal communication overhead and no parameter divergence; a minimal end-to-end sketch follows below.
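A hedged sketch of such a DDP script, assuming it is launched with `torchrun --nproc_per_node=<num_gpus> train.py` on a machine with NVIDIA GPUs (NCCL backend); the model, dataset, and hyperparameters are toy placeholders:

```python
import os

import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process it spawns
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)

    # Toy dataset; DistributedSampler hands each process a different shard
    dataset = TensorDataset(torch.randn(1024, 20), torch.randint(0, 2, (1024,)))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    model = nn.Linear(20, 2).to(device)              # one replica per process
    ddp_model = DDP(model, device_ids=[local_rank])
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for epoch in range(2):
        sampler.set_epoch(epoch)                     # reshuffle shards each epoch
        for inputs, targets in loader:
            inputs, targets = inputs.to(device), targets.to(device)
            loss = criterion(ddp_model(inputs), targets)
            optimizer.zero_grad()
            loss.backward()       # gradients are all-reduced (averaged) here
            optimizer.step()      # local update; replicas stay identical

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launching with `torchrun` spawns one process per GPU and sets `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` for each of them, which is what `init_process_group` and `DistributedSampler` rely on above.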


DDP over DataParallel (DP)

DataParallel is an **older approach** to data parallelism. DP is trivially simple (just one extra line of code), but it is much less performant; the table and the sketch below show the contrast.
| DataParallel | DistributedDataParallel |
| --- | --- |
| More overhead; the model is replicated and destroyed at each forward pass | The model is replicated only once |
| Only supports single-node parallelism | Supports scaling to multiple machines |
| Slower; uses multithreading in a single process and runs into Global Interpreter Lock (GIL) contention | Faster (no GIL contention) because it uses multiprocessing |
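To make the "one extra line" concrete, here is an illustrative sketch; the `model` below is a hypothetical placeholder:

```python
import torch
from torch import nn

model = nn.Linear(20, 2)

if torch.cuda.is_available():
    # DataParallel: the "one extra line". A single process; at every forward
    # pass the batch is scattered and the model replicated across visible GPUs.
    dp_model = nn.DataParallel(model.cuda())

# DistributedDataParallel: one process per GPU, wrapped once; requires
# torch.distributed.init_process_group() and a torchrun-style launch
# (see the full sketch in the DDP section above):
# ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```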