GPU – QA

CUDA (Compute Unified Device Architecture) is a parallel computing platform and API developed by NVIDIA. It allows developers to write programs that execute on NVIDIA GPUs using C, C++, and Python (via libraries like PyCUDA or CuPy).
CUDA exposes the GPU's low-level hardware to developers so it can be used for general-purpose computation (GPGPU).
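
A minimal sketch of the programming model (the kernel name, array sizes, and launch configuration below are illustrative): each thread computes one element of a vector sum, and the host allocates device memory, copies data over, and launches a grid of thread blocks.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread computes one element of c = a + b.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host buffers.
    float* h_a = (float*)malloc(bytes);
    float* h_b = (float*)malloc(bytes);
    float* h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Device (global memory) buffers.
    float *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMalloc((void**)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);  // expected: 3.000000

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```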

GPUs are widely used in:

  • Deep Learning & AI training/inference

  • Cryptocurrency mining

  • Scientific simulations (e.g., weather, molecular dynamics)

  • Financial modeling

  • Video encoding/decoding

  • Autonomous vehicles (sensor fusion, image recognition)

GPU memory has the following hierarchy:

  • Global Memory: Large but slow, accessible by all threads.

  • Shared Memory: Fast, shared within a thread block.

  • Constant Memory: Read-only and cached.

  • Local Memory: Private to each thread; used when registers spill (it physically resides in device memory).

CPUs rely more on their cache hierarchy (L1, L2, L3), while GPUs rely on efficient use of shared and global memory; the kernel sketch below shows where each of these spaces appears.
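
A kernel-only sketch of how these spaces show up in practice (the 1D smoothing kernel, its name, and the block size are illustrative; it would be launched with BLOCK threads per block, like the vector-add example above): read-only coefficients sit in constant memory, each block stages its tile of the input from global into shared memory, and per-thread scalars live in registers.

```cuda
#include <cuda_runtime.h>

#define BLOCK 256

__constant__ float coeff[3];   // constant memory: read-only, cached, visible to all threads

__global__ void smooth(const float* in, float* out, int n) {
    __shared__ float tile[BLOCK + 2];   // shared memory: fast, shared within one thread block

    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // index into global memory
    int lid = threadIdx.x + 1;                        // index into the shared tile (offset for halo)

    // Stage slow global memory into fast shared memory, including one halo cell per side.
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x == 0)         tile[0]         = (gid > 0)     ? in[gid - 1] : 0.0f;
    if (threadIdx.x == BLOCK - 1) tile[BLOCK + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();   // wait until the whole tile has been loaded

    // v lives in a register; it would spill to local memory only under register pressure.
    if (gid < n) {
        float v = coeff[0] * tile[lid - 1] + coeff[1] * tile[lid] + coeff[2] * tile[lid + 1];
        out[gid] = v;   // write the result back to global memory
    }
}
```

The host would fill coeff with cudaMemcpyToSymbol before launching the kernel.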

A typical GPU has:

  • SMs (Streaming Multiprocessors): Contain multiple CUDA cores and control units.

  • CUDA Cores: Execute instructions in parallel.

  • Warp Scheduler: Manages execution of thread warps.

  • Memory Units: Global, shared, and texture memory blocks.

  • Control Logic: For managing execution flow and memory access.
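
Several of these characteristics can be queried at runtime through the CUDA runtime API; a small sketch using cudaGetDeviceProperties (the output formatting is arbitrary):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("Device:                    %s\n", prop.name);
    printf("Streaming Multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Warp size:                 %d threads\n", prop.warpSize);
    printf("Max threads per block:     %d\n", prop.maxThreadsPerBlock);
    printf("Shared memory per block:   %zu bytes\n", prop.sharedMemPerBlock);
    printf("Constant memory:           %zu bytes\n", prop.totalConstMem);
    printf("Global memory:             %zu bytes\n", prop.totalGlobalMem);
    return 0;
}
```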

Common ways to make better use of the GPU in deep learning workloads:

  • Use GPU-accelerated libraries such as cuDNN, TensorRT, or the GPU builds of PyTorch and TensorFlow.

  • Minimize CPU-GPU memory transfers; in PyTorch, move models and tensors to the GPU once with .to(device) rather than copying data back and forth every step (a CUDA-level sketch of this idea follows this list).

  • Increase batch size to better utilize parallelism.

  • Profile with tools like Nsight Systems or the NVIDIA Visual Profiler.

  • Use mixed-precision training (e.g., FP16) to speed up training and reduce memory usage.
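
At the CUDA level, most of these framework tips reduce to the same principle: keep data resident on the GPU and make the transfers you cannot avoid asynchronous. A minimal sketch of that idea (the scale kernel, sizes, and iteration count are illustrative), using pinned host memory and a stream so the data is copied once and then reused by several kernel launches:

```cuda
#include <cuda_runtime.h>

__global__ void scale(float* x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Pinned (page-locked) host memory enables truly asynchronous copies.
    float* h_x;
    cudaMallocHost((void**)&h_x, bytes);
    for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

    float* d_x;
    cudaMalloc((void**)&d_x, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Copy to the GPU once, run many kernels on the resident data,
    // and copy back once -- no CPU round trips in between.
    cudaMemcpyAsync(d_x, h_x, bytes, cudaMemcpyHostToDevice, stream);
    for (int step = 0; step < 10; ++step) {
        scale<<<(n + 255) / 256, 256, 0, stream>>>(d_x, 1.01f, n);
    }
    cudaMemcpyAsync(h_x, d_x, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);   // wait for all queued work to finish

    cudaStreamDestroy(stream);
    cudaFree(d_x);
    cudaFreeHost(h_x);
    return 0;
}
```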

A warp is a group of 32 threads that execute the same instruction simultaneously on CUDA GPUs.

  • If threads within a warp diverge (take different branch paths), the paths are executed serially, reducing performance (illustrated in the sketch after this list).

  • Keeping warp divergence low and ensuring memory coalescing helps maximize GPU efficiency.
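
A kernel-only sketch of divergence (both kernels are illustrative and assume the block size is a multiple of 32): branching on threadIdx.x % 2 splits every warp into two serialized paths, while branching on a condition that is uniform across each warp does not.

```cuda
// Divergent: within every 32-thread warp, even and odd lanes take different
// branches, so the two paths run one after the other.
__global__ void divergent(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0) x[i] = x[i] * 2.0f;
    else                      x[i] = x[i] + 1.0f;
}

// Warp-uniform: i / 32 is the same for all 32 lanes of a warp (assuming the
// block size is a multiple of 32), so no warp is split and nothing serializes.
__global__ void uniform(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((i / 32) % 2 == 0) x[i] = x[i] * 2.0f;
    else                   x[i] = x[i] + 1.0f;
}
```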

Memory coalescing refers to organizing memory access patterns such that threads in a warp access contiguous memory addresses.

  • Coalesced access reduces memory latency and increases bandwidth utilization.

  • Example: Thread 0 accesses A[0], thread 1 accesses A[1], ..., thread 31 accesses A[31] (contrast the coalesced and strided kernels sketched below).
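
A kernel-only sketch of the difference (both copy kernels are illustrative): in the first, consecutive threads of a warp read consecutive addresses and the warp's loads are served by a few wide transactions; in the second, the same threads read addresses 32 elements apart and the loads scatter across many transactions.

```cuda
// Coalesced: thread k of a warp reads in[base + k] -- contiguous addresses.
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads read addresses 32 floats apart, so one warp's
// accesses are spread over many memory transactions.
__global__ void copyStrided(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = (i * 32) % n;   // artificial stride of 32 elements
    if (i < n) out[i] = in[j];
}
```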

In CUDA, __syncthreads() is used to synchronize threads within the same thread block.

  • It acts as a barrier: every thread in the block must reach it, and shared/global memory writes made before the barrier are visible to the whole block afterwards (see the sketch after this list).

  • For inter-block synchronization, you usually need to end the kernel and launch a new one, or use CUDA Cooperative Groups (advanced).
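
A kernel-only sketch of why the barrier matters (the tile-reversal kernel is illustrative; launch it with BLOCK threads per block): every thread writes one slot of a shared-memory tile and then reads a different slot, so all writes must be finished before any read starts.

```cuda
#define BLOCK 256

// Each block reverses its own BLOCK-element tile of the input.
__global__ void reverseTiles(const float* in, float* out, int n) {
    __shared__ float tile[BLOCK];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) tile[threadIdx.x] = in[gid];

    // Without this barrier, a thread could read a tile slot that the
    // responsible thread has not written yet.
    __syncthreads();

    int rev = blockDim.x - 1 - threadIdx.x;    // mirrored index within the block
    int src = blockIdx.x * blockDim.x + rev;   // global element that lands here
    if (gid < n && src < n) out[gid] = tile[rev];
}
```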

CUDA vs. OpenCL:

  • Vendor: CUDA runs on NVIDIA GPUs only; OpenCL is cross-platform (AMD, Intel, NVIDIA).

  • Performance: CUDA is often faster on NVIDIA GPUs; OpenCL is typically slightly slower on NVIDIA hardware.

  • Maturity: CUDA is very mature with a strong ecosystem; OpenCL's main strength is portability.

  • Use case: Choose CUDA when targeting NVIDIA GPUs and you want performance and tool support; choose OpenCL when you need cross-vendor support and portability.