How a 10 Billion Row SQL Query Runs in 200ms on a GPU Database
Every database engineer has seen a query that looks harmless in code review and turns painful in production.
SELECT country, SUM(revenue)
FROM events
GROUP BY country;
At 10,000 rows, nobody cares. At 10 billion rows, this becomes a serious execution problem.
A GPU-accelerated system can sometimes execute the same logical work in a fraction of the time, not because SQL changed, but because the execution model changed.
The Query Shape
For large analytical workloads, the hot path often looks like:
SCAN -> PROJECT -> AGGREGATE
At this scale, the core challenge is throughput:
- Read large volumes efficiently
- Apply repeated operations
- Aggregate partial results quickly
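The hot path above can be sketched in a few lines of plain Python. This is a minimal, illustrative sketch, not a real engine: the function and the sample `events` data are invented for this example.

```python
# Minimal sketch of the SCAN -> PROJECT -> AGGREGATE hot path,
# using plain Python in place of a real columnar engine.
from collections import defaultdict

def scan_project_aggregate(rows):
    """rows: iterable of (country, revenue, ...) tuples."""
    totals = defaultdict(float)
    for row in rows:                       # SCAN: visit every row
        country, revenue = row[0], row[1]  # PROJECT: keep only 2 fields
        totals[country] += revenue         # AGGREGATE: running sum per group
    return dict(totals)

events = [("US", 10.0, "ignored"), ("DE", 5.0, "ignored"), ("US", 2.5, "ignored")]
print(scan_project_aggregate(events))  # {'US': 12.5, 'DE': 5.0}
```

At 10 billion rows the loop body stays exactly this simple; the entire problem is how fast you can drive it.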
Step 1: CPU Plans the Query
The request still starts as a normal SQL path:
- Parse SQL
- Resolve objects
- Build logical plan
- Choose physical plan
CPU remains the control plane for planning, scheduling, and orchestration.
Step 2: Engine Isolates the Heavy Path
The planner identifies operators suitable for acceleration. In most systems, this is hybrid execution:
- CPU keeps control-flow-heavy tasks
- GPU takes scan/compute-heavy operators
This is why the right mental model is not a GPU-only database, but GPU-accelerated execution.
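One way to picture the hybrid split is a planner pass that tags each physical operator with the device that should run it. The operator names and the `GPU_FRIENDLY` set below are hypothetical, a sketch of the routing idea rather than any real planner's API.

```python
# Hypothetical sketch of hybrid operator routing: scan/compute-heavy
# operators go to the GPU, control-flow-heavy ones stay on the CPU.
GPU_FRIENDLY = {"SCAN", "PROJECT", "FILTER", "PARTIAL_AGG", "REDUCE"}

def route(plan):
    """plan: list of operator names in execution order."""
    return [(op, "GPU" if op in GPU_FRIENDLY else "CPU") for op in plan]

plan = ["PARSE", "SCAN", "PROJECT", "PARTIAL_AGG", "REDUCE", "RESULT_SHAPE"]
print(route(plan))
# [('PARSE', 'CPU'), ('SCAN', 'GPU'), ('PROJECT', 'GPU'),
#  ('PARTIAL_AGG', 'GPU'), ('REDUCE', 'GPU'), ('RESULT_SHAPE', 'CPU')]
```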
Step 3: Columnar Data Minimizes Work
For this query, the engine mainly needs:
- country
- revenue
Columnar layouts avoid moving irrelevant columns and align better with parallel arithmetic over dense vectors.
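The contrast is easy to see with a toy columnar table. In the sketch below (column names invented for illustration), pruning down to the two needed columns means the wide `user_agent` column is never touched at all.

```python
# Sketch of column pruning on a toy columnar table: the engine reads
# only the arrays this query needs and skips every other column.
table = {
    "country":    ["US", "DE", "US"],
    "revenue":    [10.0, 5.0, 2.5],
    "user_agent": ["very long string...", "...", "..."],  # never read here
}

needed = ("country", "revenue")
pruned = {name: table[name] for name in needed}  # column pruning

print(sorted(pruned))  # ['country', 'revenue']
```

With a row-oriented layout, every byte of every row would pass through the scan regardless of which fields the query asks for.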
Step 4: GPU Fan-Out Across Threads
The heavy scan/compute path is fanned out across many threads. Conceptually:
Thread 1 -> rows 0-1M
Thread 2 -> rows 1M-2M
Thread 3 -> rows 2M-3M
...
Thread 10000 -> rows 9,999M-10,000M
Each thread performs repeated, regular work over a slice of data.
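The fan-out itself is just range partitioning: split `[0, n)` into contiguous, non-overlapping slices, one per worker. A minimal sketch (the helper name is invented):

```python
# Sketch of fanning a row range out across workers: contiguous,
# non-overlapping [start, end) slices that together cover all rows.
def partition(n_rows, n_workers):
    base, extra = divmod(n_rows, n_workers)
    slices, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)  # spread the remainder
        slices.append((start, start + size))
        start += size
    return slices

print(partition(10, 3))  # [(0, 4), (4, 7), (7, 10)]
```

For 10B rows over 10,000 workers this yields exactly the 1M-row slices shown above.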
Step 5: Partial Aggregation + Reduction
Each worker builds partial aggregates, then the engine reduces them into final grouped totals.
This is familiar database behavior, but at much higher degrees of parallelism.
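A small sketch of the two phases, using `collections.Counter` as a stand-in for a per-worker hash table (function names invented; a real engine would run the partial phase on thousands of GPU threads in parallel):

```python
# Sketch of partial aggregation + reduction: each worker aggregates its
# slice into a small hash table, then the partials are merged.
from collections import Counter

def partial_agg(countries, revenues, start, end):
    part = Counter()
    for i in range(start, end):        # one worker's slice
        part[countries[i]] += revenues[i]
    return part

def merge(partials):
    out = Counter()
    for p in partials:
        out.update(p)                  # Counter.update adds values per key
    return out

countries = ["US", "DE", "US", "DE"]
revenues  = [1.0, 2.0, 3.0, 4.0]
partials = [partial_agg(countries, revenues, 0, 2),
            partial_agg(countries, revenues, 2, 4)]
print(dict(merge(partials)))  # {'US': 4.0, 'DE': 6.0}
```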
Step 6: Finalize on CPU
After heavy compute, final result shaping and response serialization return through CPU-side control flow.
The full flow is:
SQL query
-> CPU planner
-> column selection
-> GPU scan + compute
-> GPU partial aggregates
-> GPU reduction
-> CPU final return
CPU vs GPU Stage Ownership
| Stage | CPU-centric path | GPU-accelerated path |
|---|---|---|
| Parse + optimize | CPU | CPU |
| Column selection | CPU | CPU |
| Large scan | CPU workers | GPU threads |
| Partial aggregation | CPU workers | GPU threads |
| Reduction | CPU merge | GPU reduction + CPU finalize |
| Result shaping | CPU | CPU |
Inside a GPU Database Engine
The key architecture pattern is stable:
- CPU for planning and coordination
- GPU for repeated high-throughput compute
10B Row Query Timeline
Across the timeline of a 10B row query, the acceleration comes from combining:
- Massive parallelism
- Columnar access patterns
- Repeated arithmetic
- Efficient parallel reduction
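Efficient parallel reduction usually means a pairwise (tree-shaped) merge: combining k partials in O(log k) rounds rather than one sequential pass. A serial sketch of the shape (the helper is illustrative, not any particular GPU API):

```python
# Sketch of pairwise (tree) reduction: merge neighbors each round,
# halving the number of partial results until one remains. GPU block
# reductions follow this shape, with each round running in parallel.
def tree_reduce(values, combine):
    while len(values) > 1:
        paired = []
        for i in range(0, len(values) - 1, 2):
            paired.append(combine(values[i], values[i + 1]))
        if len(values) % 2:          # odd element carries to the next round
            paired.append(values[-1])
        values = paired
    return values[0]

print(tree_reduce([1, 2, 3, 4, 5], lambda a, b: a + b))  # 15
```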
Where This Works Best
GPU acceleration is strongest when workloads are:
- Scan-heavy
- Repetitive
- Arithmetic-heavy
- Throughput-oriented
- Columnar/vector-friendly
Where It Usually Does Not Help
GPU acceleration is usually weak for:
- Tiny indexed lookups
- Write-heavy OLTP paths
- Branch-heavy procedural logic
- Coordination-dominated workloads
Example:
UPDATE users
SET status = 'active'
WHERE id = 42;
That is still mostly CPU territory.
Key Takeaways
- A 10B row analytical query is primarily a throughput problem.
- GPUs accelerate this by widening parallel execution dramatically.
- Columnar storage is a major enabler of GPU-friendly execution.
- Most real systems are hybrid: CPU control plane + GPU data plane.
- Same SQL, different execution shape.