Explain the term CUDA. How does it relate to GPU programming?
CUDA (Compute Unified Device Architecture) is a parallel computing platform and API developed by NVIDIA. It allows developers to write programs that execute on NVIDIA GPUs using C, C++, and Python (via libraries like PyCUDA or CuPy).
CUDA exposes the GPU's low-level hardware for general-purpose computation (GPGPU).
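A minimal sketch of what this looks like in practice (the vecAdd kernel, array sizes, and launch configuration below are illustrative, not taken from any particular codebase): the __global__ qualifier marks code that runs on the GPU, and the <<<blocks, threads>>> syntax launches it across many threads in parallel.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Each thread computes one element of the output vector.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host (CPU) data.
    float *hA = new float[n], *hB = new float[n], *hC = new float[n];
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // Device (GPU) memory and host-to-device copies.
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);

    // Copy the result back and spot-check it.
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hC[0]);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    delete[] hA; delete[] hB; delete[] hC;
    return 0;
}
```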
What are the typical use cases of GPUs outside graphics rendering?
GPUs are widely used in:
- Deep Learning & AI training/inference
- Cryptocurrency mining
- Scientific simulations (e.g., weather, molecular dynamics)
- Financial modeling
- Video encoding/decoding
- Autonomous vehicles (sensor fusion, image recognition)
How does GPU memory hierarchy differ from CPU memory hierarchy?
GPU memory has the following hierarchy:
- Global Memory: Large but slow, accessible by all threads.
- Shared Memory: Fast, shared within a thread block.
- Constant Memory: Read-only and cached.
- Local Memory: Private to each thread (register spilling).
CPUs rely on deep, hardware-managed cache hierarchies (L1, L2, L3), while GPUs depend on the programmer making efficient use of shared and global memory.
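As a hedged illustration of this difference, the kernel below stages data from global memory into shared memory so threads in a block can exchange values cheaply (the kernel name, the 256-thread block size, and the assumption that n is a multiple of the block size are mine):

```cuda
// Illustrative kernel: each block reverses its slice of the array by staging
// it in shared memory instead of re-reading global memory.
__global__ void reverseWithinBlock(float* data, int n) {
    __shared__ float tile[256];                     // fast, per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // index into slow global memory

    if (i < n) tile[threadIdx.x] = data[i];         // one read from global memory
    __syncthreads();                                // wait until the tile is fully loaded

    int j = blockDim.x - 1 - threadIdx.x;           // read another thread's slot
    if (i < n) data[i] = tile[j];                   // one write back to global memory
}
```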
What are the key components of a GPU architecture?
A typical GPU has:
- SMs (Streaming Multiprocessors): Contain multiple CUDA cores and control units.
- CUDA Cores: Execute instructions in parallel.
- Warp Scheduler: Manages execution of thread warps.
- Memory Units: Global, shared, and texture memory blocks.
- Control Logic: For managing execution flow and memory access.
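Many of these architectural parameters can be queried at runtime through cudaGetDeviceProperties; a small sketch, assuming device 0 is the GPU of interest:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of GPU 0

    printf("Device name:          %s\n", prop.name);
    printf("SM count:             %d\n", prop.multiProcessorCount);
    printf("Warp size:            %d\n", prop.warpSize);
    printf("Shared memory/block:  %zu bytes\n", prop.sharedMemPerBlock);
    printf("Global memory:        %zu bytes\n", prop.totalGlobalMem);
    return 0;
}
```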
How do you optimize a deep learning model to run efficiently on a GPU?
- Use GPU-accelerated libraries such as cuDNN, TensorRT, and the GPU builds of PyTorch or TensorFlow.
- Minimize CPU-GPU memory transfers; in PyTorch, move models and tensors to the GPU with .to(device) once rather than copying data back and forth (see the sketch after this list).
- Increase the batch size to better utilize the GPU's parallelism.
- Profile with tools such as Nsight Systems or the NVIDIA Visual Profiler.
- Use mixed-precision training (e.g., FP16) to speed up training and reduce memory usage.
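At the CUDA level, the advice to minimize transfers typically means batching copies and using pinned (page-locked) host memory with asynchronous transfers; this is also the mechanism behind PyTorch's pin_memory=True DataLoader option. A rough sketch (buffer size and stream usage are illustrative):

```cuda
#include <cuda_runtime.h>

int main() {
    const int n = 1 << 22;
    const size_t bytes = n * sizeof(float);

    // Pinned (page-locked) host memory allows faster, asynchronous transfers.
    float* hData;
    cudaMallocHost((void**)&hData, bytes);
    float* dData;
    cudaMalloc((void**)&dData, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // One large asynchronous copy instead of many small synchronous ones,
    // issued on a stream so it can overlap with other work.
    cudaMemcpyAsync(dData, hData, bytes, cudaMemcpyHostToDevice, stream);

    // ... launch kernels on the same stream here ...

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(dData);
    cudaFreeHost(hData);
    return 0;
}
```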
What are warps in GPU, and how do they affect performance?
A warp is a group of 32 threads that execute the same instruction simultaneously on CUDA GPUs.
If threads in a warp diverge (take different branch paths), the divergent paths are executed serially, reducing performance.
Keeping warp divergence low and ensuring memory coalescing helps maximize GPU efficiency.
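A hedged sketch of the difference (kernel names are illustrative): in the first kernel, even and odd lanes of the same warp take different branches and are serialized; in the second, the branch condition is uniform across each 32-thread warp, so there is no divergence.

```cuda
// Divergent: even and odd lanes in the same warp take different branches,
// so the hardware executes the two paths one after the other.
__global__ void divergentKernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) x[i] = x[i] * 2.0f;   // half of each warp
    else            x[i] = x[i] + 1.0f;   // the other half
}

// Uniform: the condition is identical for all 32 threads of a warp,
// so every warp takes a single path and no serialization occurs.
__global__ void uniformKernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((i / 32) % 2 == 0) x[i] = x[i] * 2.0f;   // whole warp takes this branch
    else                   x[i] = x[i] + 1.0f;   // whole warp takes this branch
}
```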
Explain the concept of memory coalescing in GPU programming.
Memory coalescing refers to organizing memory access patterns such that threads in a warp access contiguous memory addresses.
Coalesced access reduces memory latency and increases bandwidth utilization.
Example: Thread 0 accesses A[0], thread 1 accesses A[1], ..., thread 31 accesses A[31].
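Two illustrative kernels contrasting the patterns (the names and the stride parameter are assumptions for the example):

```cuda
// Coalesced: consecutive threads read consecutive addresses, so a warp's
// 32 loads are serviced by a few wide memory transactions.
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads read addresses far apart (e.g. stride = 32),
// so each load touches a different memory segment and bandwidth is wasted.
__global__ void copyStrided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(i * stride) % n];
}
```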
How do you handle synchronization in GPU kernels?
In CUDA, __syncthreads() is used to synchronize threads within the same thread block.
It acts as a barrier: every thread in the block must reach it before any thread is allowed to continue past it.
For inter-block synchronization, you usually need to end the kernel and launch a new one, or use CUDA Cooperative Groups (advanced).
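A sketch of a block-level sum reduction that relies on __syncthreads() as a barrier (the 256-thread, power-of-two block size is an assumption); each block writes one partial sum, and combining them across blocks requires the second kernel launch mentioned above:

```cuda
// Block-level sum reduction: __syncthreads() guarantees no thread reads a
// shared-memory slot before the thread responsible for it has written it.
__global__ void reduceSum(const float* in, float* blockSums, int n) {
    __shared__ float partial[256];                 // assumes 256 threads per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    partial[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                               // all loads finished before reducing

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();                           // barrier after every reduction step
    }

    // One partial sum per block; summing these requires a second kernel launch
    // (or cooperative groups), since __syncthreads() cannot synchronize blocks.
    if (threadIdx.x == 0) blockSums[blockIdx.x] = partial[0];
}
```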
Compare CUDA and OpenCL. When would you prefer one over the other?
| Feature | CUDA | OpenCL |
| --- | --- | --- |
| Vendor | NVIDIA only | Cross-platform (AMD, Intel, NVIDIA) |
| Performance | Often better on NVIDIA GPUs | Slightly slower on NVIDIA |
| Maturity | Very mature, good ecosystem | Good for portability |
| Use case | When targeting NVIDIA GPUs and you want performance and tool support | When you need cross-vendor support and portability |