How a 10 Billion Row SQL Query Runs in 200ms on a GPU Database

The same SQL that takes 60 seconds on a CPU database runs in 200ms on a GPU database — and the reason is not that GPUs are faster processors, it is that the execution model changes what happens between query plan and result.

Situation

Every database engineer has seen a query that looks harmless in code review and painful in production:

SELECT country, SUM(revenue)
FROM events
GROUP BY country;

At 10,000 rows, nobody cares. At 10 billion rows, this becomes a serious execution problem. CPU-based execution engines process this query through a bounded number of threads, each handling a sequential slice of the data. The query is I/O-intensive and compute-intensive, but the CPU serializes its work in ways that GPU execution does not.

The Problem

The structural gap is parallelism. A CPU-based database runs this query with dozens to hundreds of parallel workers. A GPU-based engine runs it with thousands to tens of thousands of parallel threads, each processing a slice of columnar data simultaneously. The difference in wall time is not incremental — it is a category change for the right workload shape.

The engineering question is not “why is this fast?” but rather “which queries change category, and which don’t?” Getting this wrong leads to GPU infrastructure that produces no benefit for the actual hot paths, because the bottleneck is I/O or coordination, not compute throughput.

Step-by-Step: How the Query Executes

10B row GPU query timeline

Step 1: CPU plans the query

The request starts as a normal SQL path: parse SQL, resolve objects, build logical plan, choose physical plan. CPU remains the control plane for planning, scheduling, and orchestration.

Step 2: Engine isolates the heavy path

The planner identifies operators suitable for acceleration. In most systems, this is hybrid execution — CPU keeps control-flow-heavy tasks, GPU takes scan/compute-heavy operators. The right model is not “GPU-only database” but “GPU-accelerated execution.”

Step 3: Columnar data minimizes work

For this query, the engine only needs country and revenue. Columnar layouts avoid moving irrelevant columns and align better with parallel arithmetic over dense vectors.

Step 4: GPU fan-out across threads

The heavy scan/compute path is fanned out across many threads:

Thread 1     -> rows 1-1M
Thread 2     -> rows 1M-2M
Thread 3     -> rows 2M-3M
...
Thread 10000 -> rows 9.9B-10B

Each thread performs repeated, regular work over a slice of data.

Step 5: Partial aggregation and reduction

Each worker builds partial aggregates, then the engine reduces them into final grouped totals. This is familiar database behavior, but at much higher degrees of parallelism.

Step 6: Finalize on CPU

After heavy compute, final result shaping and response serialization return through CPU-side control flow.

The complete flow:

SQL query
-> CPU planner
-> column selection
-> GPU scan + compute
-> GPU partial aggregates
-> GPU reduction
-> CPU final return

Stage ownership summary

Stage	CPU-centric path	GPU-accelerated path
Parse + optimize	CPU	CPU
Column selection	CPU	CPU
Large scan	CPU workers	GPU threads
Partial aggregation	CPU workers	GPU threads
Reduction	CPU merge	GPU reduction + CPU finalize
Result shaping	CPU	CPU

Inside a GPU database engine

In Practice

NVIDIA RAPIDS cuDF documents the execution pattern for DataFrame aggregations: the GPU receives a columnar memory representation, applies the projection and filter kernels in parallel across all rows, builds partial hash aggregates per thread block, then reduces across blocks. The documented behavior is that this execution model is fastest when the working set fits in GPU VRAM — data spills to system RAM through NVLink or PCIe, and the bandwidth of that interconnect becomes the new bottleneck when the query exceeds VRAM capacity.

BlazeIT and similar GPU-accelerated SQL engines (documented in academic literature, e.g., He et al., VLDB 2008) established the baseline behavior: scan-heavy queries with low selectivity (reading most of a table) see the largest speedups because the GPU’s memory bandwidth advantage over CPU memory bandwidth is largest for sequential reads. Selective point lookups see no benefit because GPU thread management overhead dominates the per-row compute time.

Where It Breaks

Scenario	What breaks	Why
Query workload is OLTP	No speedup, higher latency	GPU kernel overhead is larger than the compute savings for small, indexed lookups
Working set exceeds GPU VRAM	Speedup collapses to CPU-level or slower	PCIe/NVLink transfer becomes the bottleneck; GPU’s internal bandwidth advantage disappears
Query is I/O-bound, not compute-bound	Adding GPU does not help	The storage read is the bottleneck; GPU sits idle waiting for data
Write-heavy workload	Incorrect fit	Transactional writes require coordination machinery that GPUs do not accelerate
Irregular or sparse data access	Lower GPU utilization	Branching access patterns lead to thread divergence, reducing GPU parallelism efficiency

What to Do Next

Problem: At 10B row scale, CPU-based analytical engines hit a parallelism ceiling that cannot be solved by adding CPU cores — the bottleneck is the number of simultaneous arithmetic operations, not the sophistication of the logic.
Solution: Move scan-heavy, aggregate-heavy SQL workloads to a GPU-accelerated execution engine; verify the query is compute-bound (not I/O-bound) before attributing speedup to GPU offload.
Proof: Run EXPLAIN ANALYZE on the target query and confirm the majority of time is in scan, aggregate, or join operators (not in network or storage I/O), then benchmark on a GPU-enabled instance with the same query and data volume.
Action: Identify your three slowest analytical queries this week and profile whether the bottleneck is CPU compute, memory bandwidth, or storage I/O — only CPU compute bottlenecks are GPU-offload candidates.

Situation

The Problem

Step-by-Step: How the Query Executes

In Practice

Where It Breaks

What to Do Next

Rajiv

Related Posts

Why Databases Are Moving Toward GPU Execution Engines

Vector Search on GPU Databases

CPU vs GPU vs TPU Explained for Database Engineers