The same SQL that takes 60 seconds on a CPU database runs in 200ms on a GPU database — and the reason is not that GPUs are faster processors, it is that the execution model changes what happens between query plan and result.

Situation

Every database engineer has seen a query that looks harmless in code review and painful in production:

SELECT country, SUM(revenue)
FROM events
GROUP BY country;

At 10,000 rows, nobody cares. At 10 billion rows, this becomes a serious execution problem. CPU-based execution engines process this query through a bounded number of threads, each handling a sequential slice of the data. The query is I/O-intensive and compute-intensive, but the CPU serializes its work in ways that GPU execution does not.

The Problem

The structural gap is parallelism. A CPU-based database runs this query with dozens to hundreds of parallel workers. A GPU-based engine runs it with thousands to tens of thousands of parallel threads, each processing a slice of columnar data simultaneously. The difference in wall time is not incremental — it is a category change for the right workload shape.

The engineering question is not “why is this fast?” but rather “which queries change category, and which don’t?” Getting this wrong leads to GPU infrastructure that produces no benefit for the actual hot paths, because the bottleneck is I/O or coordination, not compute throughput.

Step-by-Step: How the Query Executes

10B row GPU query timeline

Step 1: CPU plans the query

The request starts as a normal SQL path: parse SQL, resolve objects, build logical plan, choose physical plan. CPU remains the control plane for planning, scheduling, and orchestration.

Step 2: Engine isolates the heavy path

The planner identifies operators suitable for acceleration. In most systems, this is hybrid execution — CPU keeps control-flow-heavy tasks, GPU takes scan/compute-heavy operators. The right model is not “GPU-only database” but “GPU-accelerated execution.”

Step 3: Columnar data minimizes work

For this query, the engine only needs country and revenue. Columnar layouts avoid moving irrelevant columns and align better with parallel arithmetic over dense vectors.

Step 4: GPU fan-out across threads

The heavy scan/compute path is fanned out across many threads:

Thread 1     -> rows 1-1M
Thread 2     -> rows 1M-2M
Thread 3     -> rows 2M-3M
...
Thread 10000 -> rows 9.9B-10B

Each thread performs repeated, regular work over a slice of data.

Step 5: Partial aggregation and reduction

Each worker builds partial aggregates, then the engine reduces them into final grouped totals. This is familiar database behavior, but at much higher degrees of parallelism.

Step 6: Finalize on CPU

After heavy compute, final result shaping and response serialization return through CPU-side control flow.

The complete flow:

SQL query
-> CPU planner
-> column selection
-> GPU scan + compute
-> GPU partial aggregates
-> GPU reduction
-> CPU final return

Stage ownership summary

StageCPU-centric pathGPU-accelerated path
Parse + optimizeCPUCPU
Column selectionCPUCPU
Large scanCPU workersGPU threads
Partial aggregationCPU workersGPU threads
ReductionCPU mergeGPU reduction + CPU finalize
Result shapingCPUCPU

Inside a GPU database engine

In Practice

NVIDIA RAPIDS cuDF documents the execution pattern for DataFrame aggregations: the GPU receives a columnar memory representation, applies the projection and filter kernels in parallel across all rows, builds partial hash aggregates per thread block, then reduces across blocks. The documented behavior is that this execution model is fastest when the working set fits in GPU VRAM — data spills to system RAM through NVLink or PCIe, and the bandwidth of that interconnect becomes the new bottleneck when the query exceeds VRAM capacity.

BlazeIT and similar GPU-accelerated SQL engines (documented in academic literature, e.g., He et al., VLDB 2008) established the baseline behavior: scan-heavy queries with low selectivity (reading most of a table) see the largest speedups because the GPU’s memory bandwidth advantage over CPU memory bandwidth is largest for sequential reads. Selective point lookups see no benefit because GPU thread management overhead dominates the per-row compute time.

Where It Breaks

ScenarioWhat breaksWhy
Query workload is OLTPNo speedup, higher latencyGPU kernel overhead is larger than the compute savings for small, indexed lookups
Working set exceeds GPU VRAMSpeedup collapses to CPU-level or slowerPCIe/NVLink transfer becomes the bottleneck; GPU’s internal bandwidth advantage disappears
Query is I/O-bound, not compute-boundAdding GPU does not helpThe storage read is the bottleneck; GPU sits idle waiting for data
Write-heavy workloadIncorrect fitTransactional writes require coordination machinery that GPUs do not accelerate
Irregular or sparse data accessLower GPU utilizationBranching access patterns lead to thread divergence, reducing GPU parallelism efficiency

What to Do Next

  • Problem: At 10B row scale, CPU-based analytical engines hit a parallelism ceiling that cannot be solved by adding CPU cores — the bottleneck is the number of simultaneous arithmetic operations, not the sophistication of the logic.
  • Solution: Move scan-heavy, aggregate-heavy SQL workloads to a GPU-accelerated execution engine; verify the query is compute-bound (not I/O-bound) before attributing speedup to GPU offload.
  • Proof: Run EXPLAIN ANALYZE on the target query and confirm the majority of time is in scan, aggregate, or join operators (not in network or storage I/O), then benchmark on a GPU-enabled instance with the same query and data volume.
  • Action: Identify your three slowest analytical queries this week and profile whether the bottleneck is CPU compute, memory bandwidth, or storage I/O — only CPU compute bottlenecks are GPU-offload candidates.