GPU – QA

CUDA (Compute Unified Device Architecture) is a parallel computing platform and API developed by NVIDIA. It allows developers to write programs that execute on NVIDIA GPUs using C, C++, and Python (via libraries like PyCUDA or CuPy).
CUDA exposes the GPU's low-level hardware to developers so it can be used for general-purpose computation (GPGPU).
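
A minimal sketch of the programming model (the kernel name, array sizes, and launch configuration below are illustrative): each thread computes one element of a vector sum, and the host allocates device memory, copies data over, and launches a grid of thread blocks.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread computes one element of c = a + b.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host buffers.
    float* h_a = (float*)malloc(bytes);
    float* h_b = (float*)malloc(bytes);
    float* h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Device (global memory) buffers.
    float *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMalloc((void**)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);  // expected: 3.000000

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```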

GPUs are widely used in:

  • Deep Learning & AI training/inference

  • Cryptocurrency mining

  • Scientific simulations (e.g., weather, molecular dynamics)

  • Financial modeling

  • Video encoding/decoding

  • Autonomous vehicles (sensor fusion, image recognition)

GPU memory has the following hierarchy:

  • Global Memory: Large but slow, accessible by all threads.

  • Shared Memory: Fast, shared within a thread block.

  • Constant Memory: Read-only and cached.

  • Local Memory: Private to each thread; used when registers spill (it physically resides in device memory).

CPUs rely more on their cache hierarchy (L1, L2, L3), while GPUs rely on efficient use of shared and global memory; the kernel sketch below shows where each of these spaces appears.
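
A kernel-only sketch of how these spaces show up in practice (the 1D smoothing kernel, its name, and the block size are illustrative; it would be launched with BLOCK threads per block, like the vector-add example above): read-only coefficients sit in constant memory, each block stages its tile of the input from global into shared memory, and per-thread scalars live in registers.

```cuda
#include <cuda_runtime.h>

#define BLOCK 256

__constant__ float coeff[3];   // constant memory: read-only, cached, visible to all threads

__global__ void smooth(const float* in, float* out, int n) {
    __shared__ float tile[BLOCK + 2];   // shared memory: fast, shared within one thread block

    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // index into global memory
    int lid = threadIdx.x + 1;                        // index into the shared tile (offset for halo)

    // Stage slow global memory into fast shared memory, including one halo cell per side.
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x == 0)         tile[0]         = (gid > 0)     ? in[gid - 1] : 0.0f;
    if (threadIdx.x == BLOCK - 1) tile[BLOCK + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();   // wait until the whole tile has been loaded

    // v lives in a register; it would spill to local memory only under register pressure.
    if (gid < n) {
        float v = coeff[0] * tile[lid - 1] + coeff[1] * tile[lid] + coeff[2] * tile[lid + 1];
        out[gid] = v;   // write the result back to global memory
    }
}
```

The host would fill coeff with cudaMemcpyToSymbol before launching the kernel.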

A typical GPU has:

  • SMs (Streaming Multiprocessors): Contain multiple CUDA cores and control units.

  • CUDA Cores: Execute instructions in parallel.

  • Warp Scheduler: Manages execution of thread warps.

  • Memory Units: Global, shared, and texture memory blocks.

  • Control Logic: For managing execution flow and memory access.
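
Several of these characteristics can be queried at runtime through the CUDA runtime API; a small sketch using cudaGetDeviceProperties (the output formatting is arbitrary):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("Device:                    %s\n", prop.name);
    printf("Streaming Multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Warp size:                 %d threads\n", prop.warpSize);
    printf("Max threads per block:     %d\n", prop.maxThreadsPerBlock);
    printf("Shared memory per block:   %zu bytes\n", prop.sharedMemPerBlock);
    printf("Constant memory:           %zu bytes\n", prop.totalConstMem);
    printf("Global memory:             %zu bytes\n", prop.totalGlobalMem);
    return 0;
}
```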

Common ways to make better use of the GPU in deep learning workloads:

  • Use GPU-accelerated libraries such as cuDNN, TensorRT, or the GPU builds of PyTorch and TensorFlow.

  • Minimize CPU-GPU memory transfers; in PyTorch, move models and tensors to the GPU once with .to(device) rather than copying data back and forth every step (a CUDA-level sketch of this idea follows this list).

  • Increase batch size to better utilize parallelism.

  • Profile with tools like Nsight Systems or the NVIDIA Visual Profiler.

  • Use mixed-precision training (e.g., FP16) to speed up training and reduce memory usage.
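
At the CUDA level, most of these framework tips reduce to the same principle: keep data resident on the GPU and make the transfers you cannot avoid asynchronous. A minimal sketch of that idea (the scale kernel, sizes, and iteration count are illustrative), using pinned host memory and a stream so the data is copied once and then reused by several kernel launches:

```cuda
#include <cuda_runtime.h>

__global__ void scale(float* x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Pinned (page-locked) host memory enables truly asynchronous copies.
    float* h_x;
    cudaMallocHost((void**)&h_x, bytes);
    for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

    float* d_x;
    cudaMalloc((void**)&d_x, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Copy to the GPU once, run many kernels on the resident data,
    // and copy back once -- no CPU round trips in between.
    cudaMemcpyAsync(d_x, h_x, bytes, cudaMemcpyHostToDevice, stream);
    for (int step = 0; step < 10; ++step) {
        scale<<<(n + 255) / 256, 256, 0, stream>>>(d_x, 1.01f, n);
    }
    cudaMemcpyAsync(h_x, d_x, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);   // wait for all queued work to finish

    cudaStreamDestroy(stream);
    cudaFree(d_x);
    cudaFreeHost(h_x);
    return 0;
}
```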

A warp is a group of 32 threads that execute the same instruction simultaneously on CUDA GPUs.

  • If threads within a warp diverge (take different branch paths), the paths are executed serially, reducing performance (illustrated in the sketch after this list).

  • Keeping warp divergence low and ensuring memory coalescing helps maximize GPU efficiency.
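
A kernel-only sketch of divergence (both kernels are illustrative and assume the block size is a multiple of 32): branching on threadIdx.x % 2 splits every warp into two serialized paths, while branching on a condition that is uniform across each warp does not.

```cuda
// Divergent: within every 32-thread warp, even and odd lanes take different
// branches, so the two paths run one after the other.
__global__ void divergent(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0) x[i] = x[i] * 2.0f;
    else                      x[i] = x[i] + 1.0f;
}

// Warp-uniform: i / 32 is the same for all 32 lanes of a warp (assuming the
// block size is a multiple of 32), so no warp is split and nothing serializes.
__global__ void uniform(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((i / 32) % 2 == 0) x[i] = x[i] * 2.0f;
    else                   x[i] = x[i] + 1.0f;
}
```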

Memory coalescing refers to organizing memory access patterns such that threads in a warp access contiguous memory addresses.

  • Coalesced access reduces memory latency and increases bandwidth utilization.

  • Example: Thread 0 accesses A[0], thread 1 accesses A[1], ..., thread 31 accesses A[31] (contrast the coalesced and strided kernels sketched below).
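
A kernel-only sketch of the difference (both copy kernels are illustrative): in the first, consecutive threads of a warp read consecutive addresses and the warp's loads are served by a few wide transactions; in the second, the same threads read addresses 32 elements apart and the loads scatter across many transactions.

```cuda
// Coalesced: thread k of a warp reads in[base + k] -- contiguous addresses.
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads read addresses 32 floats apart, so one warp's
// accesses are spread over many memory transactions.
__global__ void copyStrided(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = (i * 32) % n;   // artificial stride of 32 elements
    if (i < n) out[i] = in[j];
}
```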

In CUDA, __syncthreads() is used to synchronize threads within the same thread block.

  • It acts as a barrier: every thread in the block must reach it, and shared/global memory writes made before the barrier are visible to the whole block afterwards (see the sketch after this list).

  • For inter-block synchronization, you usually need to end the kernel and launch a new one, or use CUDA Cooperative Groups (advanced).
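
A kernel-only sketch of why the barrier matters (the tile-reversal kernel is illustrative; launch it with BLOCK threads per block): every thread writes one slot of a shared-memory tile and then reads a different slot, so all writes must be finished before any read starts.

```cuda
#define BLOCK 256

// Each block reverses its own BLOCK-element tile of the input.
__global__ void reverseTiles(const float* in, float* out, int n) {
    __shared__ float tile[BLOCK];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) tile[threadIdx.x] = in[gid];

    // Without this barrier, a thread could read a tile slot that the
    // responsible thread has not written yet.
    __syncthreads();

    int rev = blockDim.x - 1 - threadIdx.x;    // mirrored index within the block
    int src = blockIdx.x * blockDim.x + rev;   // global element that lands here
    if (gid < n && src < n) out[gid] = tile[rev];
}
```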

CUDA vs. OpenCL:

  • Vendor: CUDA runs on NVIDIA GPUs only; OpenCL is cross-platform (AMD, Intel, NVIDIA).

  • Performance: CUDA is often faster on NVIDIA GPUs; OpenCL is typically slightly slower on NVIDIA hardware.

  • Maturity: CUDA is very mature with a strong ecosystem; OpenCL's main strength is portability.

  • Use case: Choose CUDA when targeting NVIDIA GPUs and you want performance and tool support; choose OpenCL when you need cross-vendor support and portability.