
Notes Overview
- Graph-level (e.g., operator fusion, kernel scheduling, memory planning)
- Kernel-level (e.g., CUDA, Triton, custom operators for specialized hardware)
- System-level (e.g., distributed training across GPUs/TPUs, inference serving at scale)
Projects
- PyTorch Compiler (torch.fx, TorchInductor, IR Graph, functorch)
- CUDA programming
- Triton
- LeetGPU
- CUDA Graphs
- nvFuser
- Modular (Mojo)