CUDA Memory Model

Unified Memory is CUDA’s way of presenting CPU and GPU memory as one shared address space. The pointer you get from cudaMallocManaged is valid in both host and device code. The real machinery underneath is page migration: the driver moves data between CPU RAM and GPU VRAM in page-sized chunks (the base page is typically 4 KB or 64 KB depending on the platform, and the driver may group neighboring pages into larger transfers).

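Left alone, on GPUs that support on-demand migration (Pascal and newer, on Linux), the system migrates pages only when they are touched. Each first touch from the GPU raises a page fault, which stalls the faulting threads until the page arrives. cudaMemPrefetchAsync exists to move the data before the kernel needs it.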
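Whether pages really migrate on touch depends on the device and the OS, so it can help to ask the runtime before relying on it. The sketch below queries the `cudaDevAttrConcurrentManagedAccess` attribute; the helper name `supports_on_demand_migration` is made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns 1 if the device reports concurrent managed
// access (needed for on-demand page migration), 0 if it does not, and -1 if
// the query itself fails.
int supports_on_demand_migration(int device) {
    int value = 0;
    cudaError_t err = cudaDeviceGetAttribute(
        &value, cudaDevAttrConcurrentManagedAccess, device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "attribute query failed: %s\n",
                     cudaGetErrorString(err));
        return -1;
    }
    return value ? 1 : 0;
}
```

Calling `supports_on_demand_migration(0)` from `main` reports on the first GPU. When it returns 0, prefetch calls are likely to be rejected and managed memory tends to move wholesale at kernel launch instead of page by page.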


CUDA Unified Memory Notes

1. Allocate Unified Memory

cudaMallocManaged allocates memory accessible from both CPU and GPU.

int *data;
int N = 1024;

cudaMallocManaged(&data, N * sizeof(int));

Same pointer works everywhere.

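__global__ void add_one(int* data) {
    int i = threadIdx.x;   // single block, so threadIdx.x is the element index
    data[i] += 1;
}

int main() {
    const int N = 1024;
    int* data;
    cudaMallocManaged(&data, N * sizeof(int));
    for (int i = 0; i < N; i++) data[i] = i;  // CPU writes through the pointer

    add_one<<<1, N>>>(data);                  // GPU uses the same pointer
    cudaDeviceSynchronize();                  // finish before the CPU reads data
}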

Free memory normally:

cudaFree(data);
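The snippets above skip error handling. A minimal sketch of the same allocate, launch, free cycle with every runtime call checked might look like this; the `CHECK` macro and the `managed_lifecycle` function are hypothetical names, and the kernel takes an extra length argument so the grid can be larger than the array.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper macro: print the CUDA error and return false from the
// enclosing function when a runtime call fails (no cleanup on that path).
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            std::fprintf(stderr, "%s failed: %s\n", #call,         \
                         cudaGetErrorString(err_));                \
            return false;                                          \
        }                                                          \
    } while (0)

__global__ void add_one(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;  // bounds check: the grid may overshoot n
}

// Allocates managed memory, fills it on the CPU, increments it on the GPU,
// then verifies the result on the CPU.
bool managed_lifecycle(int n) {
    int* data = nullptr;
    CHECK(cudaMallocManaged(&data, n * sizeof(int)));
    for (int i = 0; i < n; ++i) data[i] = i;

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add_one<<<blocks, threads>>>(data, n);
    CHECK(cudaGetLastError());       // errors from the launch itself
    CHECK(cudaDeviceSynchronize());  // errors from execution; CPU may read now

    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (data[i] == i + 1);
    CHECK(cudaFree(data));
    return ok;
}
```

Call `managed_lifecycle(1024)` from `main` and compile with `nvcc`. The early returns leak the allocation on failure, which is acceptable in a sketch but not in long-running code.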

2. Why Prefetch Exists

Without prefetch:

Kernel accesses memory
GPU triggers page fault
Driver migrates page CPU → GPU
Kernel resumes

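For a large buffer this can repeat thousands of times within a single kernel, once for each group of pages the GPU touches for the first time.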

Prefetch moves memory ahead of execution, avoiding faults.
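One way to see the effect is to time the same kernel with and without a prefetch. The sketch below uses CUDA events for timing; `touch` and `time_touch` are hypothetical names, and the prefetch call assumes the integer-device signature from CUDA 12 and earlier. No specific speedup is claimed, since it depends on the GPU, the interconnect, and the buffer size.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(int* data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Returns the kernel time in milliseconds, or a negative value on error.
// With prefetch == true the pages are moved to the GPU before timing starts.
float time_touch(size_t n, bool prefetch) {
    int device = 0;
    int* data = nullptr;
    if (cudaGetDevice(&device) != cudaSuccess) return -1.0f;
    if (cudaMallocManaged(&data, n * sizeof(int)) != cudaSuccess) return -1.0f;
    for (size_t i = 0; i < n; ++i) data[i] = 0;  // first touch on the CPU

    if (prefetch) {
        // Assumption: int-device form of cudaMemPrefetchAsync (CUDA <= 12).
        cudaError_t err = cudaMemPrefetchAsync(data, n * sizeof(int), device);
        if (err == cudaSuccess) err = cudaDeviceSynchronize();
        if (err != cudaSuccess) { cudaFree(data); return -1.0f; }
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    unsigned int blocks = (unsigned int)((n + 255) / 256);

    cudaEventRecord(start);
    touch<<<blocks, 256>>>(data, n);
    cudaEventRecord(stop);
    cudaError_t err = cudaGetLastError();        // launch errors
    if (err == cudaSuccess) err = cudaEventSynchronize(stop);

    float ms = -1.0f;
    if (err == cudaSuccess) cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(data);
    return ms;
}
```

Comparing `time_touch(n, false)` with `time_touch(n, true)` for a buffer of a few hundred megabytes shows how much of the unprefetched run is spent servicing faults on a given machine.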

3. cudaMemPrefetchAsync

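Prefetch unified memory to a specific device. The call is asynchronous and stream-ordered; an optional fourth argument selects the stream and defaults to the default stream.

cudaMemPrefetchAsync(ptr, size, device_id);

This is the integer-device form from CUDA 12 and earlier. Some newer toolkit versions take a cudaMemLocation struct instead, so check the runtime API reference for your version.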

Example:

int device;
cudaGetDevice(&device);

cudaMemPrefetchAsync(data, N * sizeof(int), device);

add_one<<<1, N>>>(data);
cudaDeviceSynchronize();

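Meaning:

Move the pages to the GPU’s VRAM before the kernel runs. Because the prefetch and the kernel are issued to the same stream, the kernel starts only after the migration has finished.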

4. Prefetch Back to CPU

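After GPU work finishes, you can migrate memory back.

cudaMemPrefetchAsync(data, N * sizeof(int), cudaCpuDeviceId);
cudaDeviceSynchronize();  // the prefetch is asynchronous; wait for it

Once the prefetch has completed, CPU reads will not trigger page faults.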
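Putting both directions together, a round trip can be queued in a single stream so each step runs in order. This is a sketch under assumptions: `prefetch_round_trip` is a hypothetical helper, the kernel takes an extra length argument, and the prefetch calls use the integer-device signature from CUDA 12 and earlier.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add_one(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Hypothetical helper: the GPU increments every element, then the CPU checks
// the result. Returns true when every element matches.
bool prefetch_round_trip(int n) {
    int device = 0;
    int* data = nullptr;
    cudaStream_t stream;
    size_t size = n * sizeof(int);
    if (cudaGetDevice(&device) != cudaSuccess) return false;
    if (cudaMallocManaged(&data, size) != cudaSuccess) return false;
    if (cudaStreamCreate(&stream) != cudaSuccess) {
        cudaFree(data);
        return false;
    }
    for (int i = 0; i < n; ++i) data[i] = i;

    // All three operations are queued in one stream and run in order.
    // Prefetch return codes are ignored here: a prefetch is an optimization,
    // and the kernel is still correct if the device rejects it.
    cudaMemPrefetchAsync(data, size, device, stream);           // CPU -> GPU
    add_one<<<(n + 255) / 256, 256, 0, stream>>>(data, n);
    cudaMemPrefetchAsync(data, size, cudaCpuDeviceId, stream);  // GPU -> CPU

    // The CPU must not read the data until the stream has drained.
    bool ok = (cudaStreamSynchronize(stream) == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) ok = (data[i] == i + 1);

    cudaStreamDestroy(stream);
    cudaFree(data);
    return ok;
}
```

Using a dedicated stream instead of the default one lets other streams keep working while the migration and kernel are in flight.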

5. Multi-GPU Prefetch

Unified memory supports migration between GPUs.

Example system:

GPU0
GPU1
CPU

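Make GPU1 current and move the memory to it:

cudaSetDevice(1);
cudaMemPrefetchAsync(data, size, 1);

Then run a kernel on GPU1:

kernel<<<grid, block>>>(data);

Setting the device first puts the prefetch and the kernel in GPU1’s default stream, so the kernel runs after the migration. Pages that are currently resident on GPU0 migrate GPU0 → GPU1.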

6. Multi-GPU Example

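const int N = 1024;
int *data;
size_t size = N * sizeof(int);

cudaMallocManaged(&data, size);

// initialize on CPU
for (int i = 0; i < N; i++)
    data[i] = i;

// make GPU1 current, then move memory to it in GPU1's default stream
cudaSetDevice(1);
cudaMemPrefetchAsync(data, size, 1);

// kernel, grid, and block stand in for your own launch
kernel<<<grid, block>>>(data);

// wait for GPU1 before the CPU touches data again
cudaDeviceSynchronize();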
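A self-contained variant of the multi-GPU flow could look like the sketch below. It is written under assumptions: `scale` and `run_on_second_gpu` are hypothetical names, the code falls back to device 0 when only one GPU is present, and the prefetch uses the integer-device signature from CUDA 12 and earlier.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(int* data, int n, int factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: initializes on the CPU, doubles every element on the
// second GPU (or the only GPU), and verifies the result on the CPU.
bool run_on_second_gpu(int n) {
    int count = 0, previous = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) return false;
    if (cudaGetDevice(&previous) != cudaSuccess) return false;
    int target = (count > 1) ? 1 : 0;

    int* data = nullptr;
    size_t size = n * sizeof(int);
    if (cudaMallocManaged(&data, size) != cudaSuccess) return false;
    for (int i = 0; i < n; ++i) data[i] = i;

    // Select the target first so the prefetch and the kernel share its
    // default stream and therefore run in order.
    cudaSetDevice(target);
    cudaMemPrefetchAsync(data, size, target);  // result ignored: optional hint
    scale<<<(n + 255) / 256, 256>>>(data, n, 2);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    for (int i = 0; ok && i < n; ++i) ok = (data[i] == 2 * i);

    cudaFree(data);
    cudaSetDevice(previous);  // restore the caller's device
    return ok;
}
```

Checking the device count at run time avoids hard-coding device 1 on machines that only have one GPU.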

7. cudaMemAdvise (placement hints)

You can guide CUDA’s migration strategy.

Example: prefer GPU memory.

cudaMemAdvise(data,
              size,
              cudaMemAdviseSetPreferredLocation,
              device);

Useful hints:

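Advice	Meaning
`cudaMemAdviseSetPreferredLocation`	where the pages should live; the driver resists migrating them away
`cudaMemAdviseSetReadMostly`	data is mostly read; each accessing processor may keep its own read-only copy
`cudaMemAdviseSetAccessedBy`	the given device will access the range; keep it mapped there to avoid faults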
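As an illustration of the read-mostly hint, the sketch below marks a lookup table that is written once on the CPU and then only read by a kernel. The names `gather` and `read_mostly_lookup` are hypothetical, and both `cudaMemAdvise` and `cudaMemPrefetchAsync` are called with the integer-device signatures from CUDA 12 and earlier.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void gather(const float* table, const int* idx, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = table[idx[i]];
}

// Hypothetical helper: builds a table on the CPU, marks it read-mostly,
// gathers from it on the GPU, and checks the gathered values on the CPU.
bool read_mostly_lookup(int table_size, int n) {
    int device = 0;
    float* table = nullptr;
    int* idx = nullptr;
    float* out = nullptr;
    if (cudaGetDevice(&device) != cudaSuccess) return false;
    if (cudaMallocManaged(&table, table_size * sizeof(float)) != cudaSuccess ||
        cudaMallocManaged(&idx, n * sizeof(int)) != cudaSuccess ||
        cudaMallocManaged(&out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(table); cudaFree(idx); cudaFree(out);  // freeing null is a no-op
        return false;
    }
    for (int i = 0; i < table_size; ++i) table[i] = 0.5f * i;
    for (int i = 0; i < n; ++i) idx[i] = i % table_size;

    // The table will not be written again. The device argument is ignored
    // for this particular advice. Return codes are not checked: both calls
    // are optional hints.
    cudaMemAdvise(table, table_size * sizeof(float),
                  cudaMemAdviseSetReadMostly, device);
    // Prefetching a read-mostly range creates a read-only copy on the GPU
    // while the CPU copy stays valid.
    cudaMemPrefetchAsync(table, table_size * sizeof(float), device);

    gather<<<(n + 255) / 256, 256>>>(table, idx, out, n);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    // Reading the table on the CPU does not invalidate the GPU copy.
    for (int i = 0; ok && i < n; ++i) ok = (out[i] == table[idx[i]]);

    cudaFree(table); cudaFree(idx); cudaFree(out);
    return ok;
}
```

The hint pays off only while the data stays unmodified. A write from any processor invalidates the other copies, so it is a poor fit for buffers that are updated often.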

8. Multi-GPU Access Hint

Tell the driver that a particular GPU will access the range, so the pages stay mapped into that GPU’s page tables.

cudaMemAdvise(data,
              size,
              cudaMemAdviseSetAccessedBy,
              gpu_id);

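The hint does not move data by itself. It avoids page faults on access and, combined with a preferred location, can avoid repeated migrations when several processors share the data.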
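A common pairing is to keep the pages in CPU memory and map them into the GPU, so the GPU reads them over the interconnect without migrating anything. The sketch below shows that combination under assumptions: `host_resident_with_gpu_mapping` is a hypothetical name, the kernel takes an extra length argument, and `cudaMemAdvise` is called with the integer-device signature from CUDA 12 and earlier.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add_one(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Hypothetical helper: asks the driver to keep the buffer in CPU memory while
// the current GPU accesses it through an established mapping.
bool host_resident_with_gpu_mapping(int n) {
    int device = 0;
    int* data = nullptr;
    size_t size = n * sizeof(int);
    if (cudaGetDevice(&device) != cudaSuccess) return false;
    if (cudaMallocManaged(&data, size) != cudaSuccess) return false;
    for (int i = 0; i < n; ++i) data[i] = i;

    // Prefer CPU memory as the home of these pages...
    cudaError_t a = cudaMemAdvise(data, size,
                                  cudaMemAdviseSetPreferredLocation,
                                  cudaCpuDeviceId);
    // ...and keep them mapped into this GPU's page tables.
    cudaError_t b = cudaMemAdvise(data, size,
                                  cudaMemAdviseSetAccessedBy, device);
    if (a != cudaSuccess || b != cudaSuccess) {
        // Hints are optional; some devices and platforms reject them.
        std::fprintf(stderr, "hints not applied, continuing without them\n");
    }

    add_one<<<(n + 255) / 256, 256>>>(data, n);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) ok = (data[i] == i + 1);

    cudaFree(data);
    return ok;
}
```

This layout suits data that the GPU touches sparsely or only once. For data the GPU reads repeatedly, migrating it into VRAM is usually the better choice.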

9. When Unified Memory Works Well

Good cases:

complex pointer structures (trees, graphs)
quick CUDA prototyping
multi-GPU research workloads
datasets larger than VRAM (oversubscription, which relies on on-demand migration)
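The pointer-structure case is the easiest to show. In the sketch below, a linked list is built on the CPU and walked on the GPU with no translation of its `next` pointers, because they are valid on both sides. `Node`, `sum_list`, and `gpu_sum_of_list` are hypothetical names.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

struct Node {
    int value;
    Node* next;
};

// A single GPU thread follows the same pointers the CPU wrote.
__global__ void sum_list(const Node* head, int* result) {
    int total = 0;
    for (const Node* p = head; p != nullptr; p = p->next) total += p->value;
    *result = total;
}

// Hypothetical helper: builds the list 1, 2, ..., length in managed memory
// and returns the sum computed on the GPU, or -1 on error.
int gpu_sum_of_list(int length) {
    Node* head = nullptr;
    int* result = nullptr;
    bool ok = (cudaMallocManaged(&result, sizeof(int)) == cudaSuccess);
    if (ok) *result = 0;

    for (int i = length; ok && i >= 1; --i) {
        Node* node = nullptr;
        ok = (cudaMallocManaged(&node, sizeof(Node)) == cudaSuccess);
        if (ok) {
            node->value = i;
            node->next = head;  // push to the front
            head = node;
        }
    }

    if (ok) {
        sum_list<<<1, 1>>>(head, result);
        ok = (cudaDeviceSynchronize() == cudaSuccess);
    }
    int total = ok ? *result : -1;

    while (head != nullptr) {  // free every node, also after a failure
        Node* next = head->next;
        cudaFree(head);
        head = next;
    }
    cudaFree(result);
    return total;
}
```

For example, `gpu_sum_of_list(10)` returns 55. One managed allocation per node keeps the sketch short but is wasteful; a real structure would more likely carve nodes out of one larger managed buffer.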

10. When It’s Not Ideal

High-performance kernels often avoid it because:

page faults stall warps
migration latency is unpredictable
frameworks prefer explicit memory control

Production systems often use:

cudaMalloc
cudaMemcpyAsync
custom memory pools

for deterministic performance.
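For contrast, the explicit style looks roughly like the sketch below: separate host and device buffers, with every transfer spelled out. `explicit_copy_add_one` is a hypothetical name, and pinned host memory is used on the assumption that the copies should be truly asynchronous.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add_one(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Hypothetical helper: same computation as the managed-memory examples, but
// every transfer is an explicit, stream-ordered copy.
bool explicit_copy_add_one(int n) {
    size_t size = n * sizeof(int);
    int* host = nullptr;
    int* dev = nullptr;
    cudaStream_t stream;
    if (cudaMallocHost(&host, size) != cudaSuccess) return false;  // pinned
    if (cudaMalloc(&dev, size) != cudaSuccess) {
        cudaFreeHost(host);
        return false;
    }
    if (cudaStreamCreate(&stream) != cudaSuccess) {
        cudaFree(dev);
        cudaFreeHost(host);
        return false;
    }
    for (int i = 0; i < n; ++i) host[i] = i;

    // Copy in, compute, copy out: all queued in one stream, so they run in
    // order without any page faults or driver-initiated migration.
    cudaMemcpyAsync(dev, host, size, cudaMemcpyHostToDevice, stream);
    add_one<<<(n + 255) / 256, 256, 0, stream>>>(dev, n);
    cudaMemcpyAsync(host, dev, size, cudaMemcpyDeviceToHost, stream);

    bool ok = (cudaStreamSynchronize(stream) == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) ok = (host[i] == i + 1);

    cudaStreamDestroy(stream);
    cudaFree(dev);
    cudaFreeHost(host);
    return ok;
}
```

The cost is two pointers per buffer and manual bookkeeping. The benefit is that data moves only when the code says so, which makes timing easier to reason about.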
