How a 10 Billion Row SQL Query Runs in 200ms on a GPU Database
Every database engineer has seen a query that looks harmless in code review and turns painful in production.
SELECT country, SUM(revenue)
FROM events
GROUP BY country;
At 10,000 rows, nobody cares. At 10 billion rows, this becomes a serious execution problem.
A GPU-accelerated system can sometimes execute the same logical work in a fraction of the time, not because SQL changed, but because the execution model changed.
The Query Shape
For large analytical workloads, the hot path often looks like:
SCAN -> PROJECT -> AGGREGATE
At this scale, the core challenge is throughput:
- Read large volumes efficiently
- Apply repeated operations
- Aggregate partial results quickly
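The hot path above can be sketched in a few lines of plain Python. This is a minimal, illustrative sketch, not a real engine: the function and the sample `events` data are invented for this example.

```python
# Minimal sketch of the SCAN -> PROJECT -> AGGREGATE hot path,
# using plain Python in place of a real columnar engine.
from collections import defaultdict

def scan_project_aggregate(rows):
    """rows: iterable of (country, revenue, ...) tuples."""
    totals = defaultdict(float)
    for row in rows:                       # SCAN: visit every row
        country, revenue = row[0], row[1]  # PROJECT: keep only 2 fields
        totals[country] += revenue         # AGGREGATE: running sum per group
    return dict(totals)

events = [("US", 10.0, "ignored"), ("DE", 5.0, "ignored"), ("US", 2.5, "ignored")]
print(scan_project_aggregate(events))  # {'US': 12.5, 'DE': 5.0}
```

At 10 billion rows the loop body stays exactly this simple; the entire problem is how fast you can drive it.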
Step 1: CPU Plans the Query
The request still starts as a normal SQL path:
- Parse SQL
- Resolve objects
- Build logical plan
- Choose physical plan
CPU remains the control plane for planning, scheduling, and orchestration.
Step 2: Engine Isolates the Heavy Path
The planner identifies operators suitable for acceleration. In most systems, this is hybrid execution:
- CPU keeps control-flow-heavy tasks
- GPU takes scan/compute-heavy operators
This is why the right mental model is not a GPU-only database, but GPU-accelerated execution.
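One way to picture the hybrid split is a planner pass that tags each physical operator with the device that should run it. The operator names and the `GPU_FRIENDLY` set below are hypothetical, a sketch of the routing idea rather than any real planner's API.

```python
# Hypothetical sketch of hybrid operator routing: scan/compute-heavy
# operators go to the GPU, control-flow-heavy ones stay on the CPU.
GPU_FRIENDLY = {"SCAN", "PROJECT", "FILTER", "PARTIAL_AGG", "REDUCE"}

def route(plan):
    """plan: list of operator names in execution order."""
    return [(op, "GPU" if op in GPU_FRIENDLY else "CPU") for op in plan]

plan = ["PARSE", "SCAN", "PROJECT", "PARTIAL_AGG", "REDUCE", "RESULT_SHAPE"]
print(route(plan))
# [('PARSE', 'CPU'), ('SCAN', 'GPU'), ('PROJECT', 'GPU'),
#  ('PARTIAL_AGG', 'GPU'), ('REDUCE', 'GPU'), ('RESULT_SHAPE', 'CPU')]
```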
Step 3: Columnar Data Minimizes Work
For this query, the engine mainly needs:
- country
- revenue
Columnar layouts avoid moving irrelevant columns and align better with parallel arithmetic over dense vectors.
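The contrast is easy to see with a toy columnar table. In the sketch below (column names invented for illustration), pruning down to the two needed columns means the wide `user_agent` column is never touched at all.

```python
# Sketch of column pruning on a toy columnar table: the engine reads
# only the arrays this query needs and skips every other column.
table = {
    "country":    ["US", "DE", "US"],
    "revenue":    [10.0, 5.0, 2.5],
    "user_agent": ["very long string...", "...", "..."],  # never read here
}

needed = ("country", "revenue")
pruned = {name: table[name] for name in needed}  # column pruning

print(sorted(pruned))  # ['country', 'revenue']
```

With a row-oriented layout, every byte of every row would pass through the scan regardless of which fields the query asks for.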
Step 4: GPU Fan-Out Across Threads
The heavy scan/compute path is fanned out across many threads. Conceptually:
Thread 1 -> rows 0-1M
Thread 2 -> rows 1M-2M
Thread 3 -> rows 2M-3M
...
Thread 10000 -> rows 9,999M-10,000M
Each thread performs repeated, regular work over a slice of data.
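The fan-out itself is just range partitioning: split `[0, n)` into contiguous, non-overlapping slices, one per worker. A minimal sketch (the helper name is invented):

```python
# Sketch of fanning a row range out across workers: contiguous,
# non-overlapping [start, end) slices that together cover all rows.
def partition(n_rows, n_workers):
    base, extra = divmod(n_rows, n_workers)
    slices, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)  # spread the remainder
        slices.append((start, start + size))
        start += size
    return slices

print(partition(10, 3))  # [(0, 4), (4, 7), (7, 10)]
```

For 10B rows over 10,000 workers this yields exactly the 1M-row slices shown above.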
Step 5: Partial Aggregation + Reduction
Each worker builds partial aggregates, then the engine reduces them into final grouped totals.
This is familiar database behavior, but at much higher degrees of parallelism.
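A small sketch of the two phases, using `collections.Counter` as a stand-in for a per-worker hash table (function names invented; a real engine would run the partial phase on thousands of GPU threads in parallel):

```python
# Sketch of partial aggregation + reduction: each worker aggregates its
# slice into a small hash table, then the partials are merged.
from collections import Counter

def partial_agg(countries, revenues, start, end):
    part = Counter()
    for i in range(start, end):        # one worker's slice
        part[countries[i]] += revenues[i]
    return part

def merge(partials):
    out = Counter()
    for p in partials:
        out.update(p)                  # Counter.update adds values per key
    return out

countries = ["US", "DE", "US", "DE"]
revenues  = [1.0, 2.0, 3.0, 4.0]
partials = [partial_agg(countries, revenues, 0, 2),
            partial_agg(countries, revenues, 2, 4)]
print(dict(merge(partials)))  # {'US': 4.0, 'DE': 6.0}
```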
Step 6: Finalize on CPU
After heavy compute, final result shaping and response serialization return through CPU-side control flow.
The full flow is:
SQL query
-> CPU planner
-> column selection
-> GPU scan + compute
-> GPU partial aggregates
-> GPU reduction
-> CPU final return
CPU vs GPU Stage Ownership
| Stage | CPU-centric path | GPU-accelerated path |
|---|---|---|
| Parse + optimize | CPU | CPU |
| Column selection | CPU | CPU |
| Large scan | CPU workers | GPU threads |
| Partial aggregation | CPU workers | GPU threads |
| Reduction | CPU merge | GPU reduction + CPU finalize |
| Result shaping | CPU | CPU |
Inside a GPU Database Engine
The key architecture pattern is stable:
- CPU for planning and coordination
- GPU for repeated high-throughput compute
10B Row Query Timeline
Across the timeline of a 10B row query, the acceleration comes from combining:
- Massive parallelism
- Columnar access patterns
- Repeated arithmetic
- Efficient parallel reduction
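Efficient parallel reduction usually means a pairwise (tree-shaped) merge: combining k partials in O(log k) rounds rather than one sequential pass. A serial sketch of the shape (the helper is illustrative, not any particular GPU API):

```python
# Sketch of pairwise (tree) reduction: merge neighbors each round,
# halving the number of partial results until one remains. GPU block
# reductions follow this shape, with each round running in parallel.
def tree_reduce(values, combine):
    while len(values) > 1:
        paired = []
        for i in range(0, len(values) - 1, 2):
            paired.append(combine(values[i], values[i + 1]))
        if len(values) % 2:          # odd element carries to the next round
            paired.append(values[-1])
        values = paired
    return values[0]

print(tree_reduce([1, 2, 3, 4, 5], lambda a, b: a + b))  # 15
```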
Where This Works Best
GPU acceleration is strongest when workloads are:
- Scan-heavy
- Repetitive
- Arithmetic-heavy
- Throughput-oriented
- Columnar/vector-friendly
Where It Usually Does Not Help
GPU acceleration is usually weak for:
- Tiny indexed lookups
- Write-heavy OLTP paths
- Branch-heavy procedural logic
- Coordination-dominated workloads
Example:
UPDATE users
SET status = 'active'
WHERE id = 42;
That is still mostly CPU territory.
Key Takeaways
- A 10B row analytical query is primarily a throughput problem.
- GPUs accelerate this by widening parallel execution dramatically.
- Columnar storage is a major enabler of GPU-friendly execution.
- Most real systems are hybrid: CPU control plane + GPU data plane.
- Same SQL, different execution shape.