How a 10 Billion Row SQL Query Runs in 200ms on a GPU Database
The same SQL that takes 60 seconds on a CPU database runs in 200ms on a GPU database — and the reason is not that GPUs are faster processors, it is that the execution model changes what happens between query plan and result.
Situation
Every database engineer has seen a query that looks harmless in code review and painful in production:
SELECT country, SUM(revenue)
FROM events
GROUP BY country;
At 10,000 rows, nobody cares. At 10 billion rows, this becomes a serious execution problem. CPU-based execution engines process this query through a bounded number of threads, each handling a sequential slice of the data. The query is I/O-intensive and compute-intensive, but the CPU serializes its work in ways that GPU execution does not.
The Problem
The structural gap is parallelism. A CPU-based database runs this query with dozens to hundreds of parallel workers. A GPU-based engine runs it with thousands to tens of thousands of parallel threads, each processing a slice of columnar data simultaneously. The difference in wall time is not incremental — it is a category change for the right workload shape.
The engineering question is not “why is this fast?” but rather “which queries change category, and which don’t?” Getting this wrong leads to GPU infrastructure that produces no benefit for the actual hot paths, because the bottleneck is I/O or coordination, not compute throughput.
Step-by-Step: How the Query Executes
Step 1: CPU plans the query
The request starts as a normal SQL path: parse SQL, resolve objects, build logical plan, choose physical plan. CPU remains the control plane for planning, scheduling, and orchestration.
Step 2: Engine isolates the heavy path
The planner identifies operators suitable for acceleration. In most systems, this is hybrid execution — CPU keeps control-flow-heavy tasks, GPU takes scan/compute-heavy operators. The right model is not “GPU-only database” but “GPU-accelerated execution.”
Step 3: Columnar data minimizes work
For this query, the engine only needs country and revenue. Columnar layouts avoid moving irrelevant columns and align better with parallel arithmetic over dense vectors.
Step 4: GPU fan-out across threads
The heavy scan/compute path is fanned out across many threads:
Thread 1 -> rows 1-1M
Thread 2 -> rows 1M-2M
Thread 3 -> rows 2M-3M
...
Thread 10000 -> rows 9.9B-10B
Each thread performs repeated, regular work over a slice of data.
Step 5: Partial aggregation and reduction
Each worker builds partial aggregates, then the engine reduces them into final grouped totals. This is familiar database behavior, but at much higher degrees of parallelism.
Step 6: Finalize on CPU
After heavy compute, final result shaping and response serialization return through CPU-side control flow.
The complete flow:
SQL query
-> CPU planner
-> column selection
-> GPU scan + compute
-> GPU partial aggregates
-> GPU reduction
-> CPU final return
Stage ownership summary
| Stage | CPU-centric path | GPU-accelerated path |
|---|---|---|
| Parse + optimize | CPU | CPU |
| Column selection | CPU | CPU |
| Large scan | CPU workers | GPU threads |
| Partial aggregation | CPU workers | GPU threads |
| Reduction | CPU merge | GPU reduction + CPU finalize |
| Result shaping | CPU | CPU |
In Practice
NVIDIA RAPIDS cuDF documents the execution pattern for DataFrame aggregations: the GPU receives a columnar memory representation, applies the projection and filter kernels in parallel across all rows, builds partial hash aggregates per thread block, then reduces across blocks. The documented behavior is that this execution model is fastest when the working set fits in GPU VRAM — data spills to system RAM through NVLink or PCIe, and the bandwidth of that interconnect becomes the new bottleneck when the query exceeds VRAM capacity.
BlazeIT and similar GPU-accelerated SQL engines (documented in academic literature, e.g., He et al., VLDB 2008) established the baseline behavior: scan-heavy queries with low selectivity (reading most of a table) see the largest speedups because the GPU’s memory bandwidth advantage over CPU memory bandwidth is largest for sequential reads. Selective point lookups see no benefit because GPU thread management overhead dominates the per-row compute time.
Where It Breaks
| Scenario | What breaks | Why |
|---|---|---|
| Query workload is OLTP | No speedup, higher latency | GPU kernel overhead is larger than the compute savings for small, indexed lookups |
| Working set exceeds GPU VRAM | Speedup collapses to CPU-level or slower | PCIe/NVLink transfer becomes the bottleneck; GPU’s internal bandwidth advantage disappears |
| Query is I/O-bound, not compute-bound | Adding GPU does not help | The storage read is the bottleneck; GPU sits idle waiting for data |
| Write-heavy workload | Incorrect fit | Transactional writes require coordination machinery that GPUs do not accelerate |
| Irregular or sparse data access | Lower GPU utilization | Branching access patterns lead to thread divergence, reducing GPU parallelism efficiency |
What to Do Next
- Problem: At 10B row scale, CPU-based analytical engines hit a parallelism ceiling that cannot be solved by adding CPU cores — the bottleneck is the number of simultaneous arithmetic operations, not the sophistication of the logic.
- Solution: Move scan-heavy, aggregate-heavy SQL workloads to a GPU-accelerated execution engine; verify the query is compute-bound (not I/O-bound) before attributing speedup to GPU offload.
- Proof: Run
EXPLAIN ANALYZEon the target query and confirm the majority of time is in scan, aggregate, or join operators (not in network or storage I/O), then benchmark on a GPU-enabled instance with the same query and data volume. - Action: Identify your three slowest analytical queries this week and profile whether the bottleneck is CPU compute, memory bandwidth, or storage I/O — only CPU compute bottlenecks are GPU-offload candidates.