CPU vs GPU vs TPU Explained for Database Engineers

Database infrastructure conversations break down the moment hardware enters the room because engineers ask the wrong question: “Which is faster: CPU, GPU, or TPU?” The right question is the same one you already apply to query plans: what execution pattern does this workload need, and what hardware is optimized for that pattern?

Situation

OLTP systems are adding vector similarity, analytical aggregates, and AI inference to their workloads. Infrastructure teams are being asked to provision GPU instances without a framework for deciding when a GPU is the right choice versus a larger CPU instance or a purpose-built accelerator. The same confusion that once surrounded row-store vs column-store has returned at the hardware layer.

The Problem

Engineers who treat CPU, GPU, and TPU as a linear performance hierarchy make the wrong call in both directions: they over-provision GPUs for workloads that remain CPU-bound (transactions, connection management, control flow), and they under-provision accelerators for workloads that are genuinely scan-heavy or tensor-heavy. The result is either wasted capacity or incorrect assumptions that “the GPU is faster” without a workload-specific basis.

If you already understand OLTP vs OLAP, row vs column execution, and latency vs throughput, you already have the right mental model for this hardware decision.

Matching Execution Patterns to Hardware

CPU vs GPU vs TPU mental model

Hardware	DBA Mental Model	Best At
CPU	OLTP execution brain	Branching, coordination, transactions, mixed workloads
GPU	Parallel analytics engine	Scans, filters, joins, aggregations, vector math
TPU	Matrix math appliance	Dense AI tensor operations and model inference/training

What a CPU Is

A CPU is designed to be general-purpose. It handles many kinds of work efficiently: branching, pointer chasing, transaction logic, conditional execution, scheduling and interrupts, and complex control flow.

Think of a CPU as a traditional relational engine running OLTP traffic.

SELECT *
FROM orders
WHERE customer_id = 123
AND status = 'SHIPPED';

This is CPU-friendly because it involves index lookups, branching, and low-latency response patterns.
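
A minimal Python sketch of why this query shape suits a CPU. It assumes a hypothetical in-memory index keyed by customer_id; the names `ORDERS`, `build_index`, and `lookup_shipped` are illustrative and do not come from any database API. The work is a pointer-chasing lookup followed by a data-dependent branch on each candidate row:

```python
from collections import defaultdict

# Hypothetical rows; in a real engine these would live in heap pages.
ORDERS = [
    {"order_id": 1, "customer_id": 123, "status": "SHIPPED"},
    {"order_id": 2, "customer_id": 123, "status": "PENDING"},
    {"order_id": 3, "customer_id": 456, "status": "SHIPPED"},
]

def build_index(rows):
    """Build a customer_id -> rows mapping (a stand-in for a B-tree index)."""
    index = defaultdict(list)
    for row in rows:
        index[row["customer_id"]].append(row)
    return index

def lookup_shipped(index, customer_id):
    """Point lookup, then one data-dependent branch per candidate row."""
    matches = []
    for row in index.get(customer_id, []):
        if row["status"] == "SHIPPED":  # outcome depends on the data itself
            matches.append(row)
    return matches

orders_by_customer = build_index(ORDERS)
shipped = lookup_shipped(orders_by_customer, 123)
```

Only a handful of rows are touched, and which rows survive depends on the data, so there is little identical work to spread across thousands of threads.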

CPUs win when the workload is transactional, branch-heavy, latency-sensitive, coordination-heavy, or dominated by smaller irregular queries.

What a GPU Is

A GPU is not a faster CPU. It is built for repeating the same operation across massive data volumes in parallel.

Think of a GPU as a massively parallel analytics engine optimized for huge scans, repeated arithmetic, columnar execution, vector operations, and parallel filtering.

SELECT SUM(price * quantity)
FROM sales;

With billions of rows, this operation is repetitive and parallelizable — it maps well to GPU threads. GPUs win when the workload is scan-heavy, arithmetic-heavy, batch-oriented, highly parallelizable, or throughput-driven.
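
The shape of that aggregate can be sketched in plain Python. This is an illustration of the data-parallel pattern only, not GPU code: the chunks are processed one after another here, whereas a GPU would assign them to many threads at once. The function names and the sample data are hypothetical:

```python
def partial_revenue(prices, quantities):
    """Apply the same multiply-add to every row in one chunk."""
    return sum(p * q for p, q in zip(prices, quantities))

def parallel_style_sum(prices, quantities, chunk_size):
    """Split the columns into independent chunks, reduce each, then combine."""
    partials = []
    for start in range(0, len(prices), chunk_size):
        end = start + chunk_size
        # No chunk depends on any other chunk, which is what makes
        # this step a candidate for parallel hardware.
        partials.append(partial_revenue(prices[start:end], quantities[start:end]))
    return sum(partials)

prices = [10, 20, 30, 40, 50]
quantities = [1, 2, 3, 4, 5]
total = parallel_style_sum(prices, quantities, chunk_size=2)
```

The result is the same for any chunk size, because every row gets the same operation and the partial sums combine in any order.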

What a TPU Is

A TPU is more specialized than CPU or GPU. It is designed for dense matrix and tensor math used heavily in neural networks. Think of a TPU as a purpose-built model-math execution appliance.

TPUs are not general database accelerators. They are strongest when model computation itself is the bottleneck: neural network training, large-scale inference, dense tensor operations, and repeated matrix multiplications with regular shapes.
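
To make "dense matrix math with regular shapes" concrete, here is a small pure-Python sketch of the core operation, a matrix multiplication of a batch of inputs by a weight matrix. It is illustrative only and is not how a TPU is programmed; the names `matmul`, `batch`, and `weights` are hypothetical:

```python
def matmul(a, b):
    """Multiply an (m x k) matrix by a (k x n) matrix, both as nested lists."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions must match")
    # Every output cell is the same fixed-length dot product: no branching
    # on the data and no irregular memory access.
    return [
        [sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]

batch = [[1, 2], [3, 4]]            # 2 inputs, 2 features each
weights = [[1, 0, 2], [0, 1, 3]]    # 2 features -> 3 outputs
activations = matmul(batch, weights)
```

The loop bounds are known before any data is read, which is the property that lets specialized hardware lay the computation out in advance.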

Dimension	CPU	GPU	TPU
Flexibility	Highest	Medium	Lowest
Best workload	Mixed/general-purpose	Parallel analytics	AI tensor math
Latency	Strong	Moderate	Workload-specific
Throughput	Moderate	Very high	Very high for AI
Branch-heavy logic	Excellent	Weak	Poor fit
OLTP	Best	Poor	Poor
Analytics	Decent	Excellent	Poor fit
ML inference	Decent	Strong	Excellent
Matrix multiplication	Okay	Strong	Best

In Practice

PostgreSQL’s execution model runs on CPUs: its buffer manager, lock manager, and MVCC machinery are built around sequential per-backend processing with branching logic. The documented behavior when you add GPU-accelerated extensions (such as PG-Strom for scan offload) is that the optimizer continues to handle query planning on CPU while the GPU handles the data-parallel scan and aggregation phases. This division of labor, with the CPU handling control and the GPU handling data-parallel computation, is the documented design pattern for heterogeneous database systems.

NVIDIA’s RAPIDS cuDF library (Apache 2.0, documented at developer.nvidia.com/rapids) processes pandas-like DataFrame operations on GPU. The documented design note is that data transfer between CPU memory and GPU memory (limited by PCIe bandwidth) is the dominant latency cost for small-to-medium datasets, so GPU acceleration pays off only once the working set is large enough to amortize the transfer overhead.
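
The transfer-cost trade-off can be written as a back-of-the-envelope model. The sketch below is an assumption-laden simplification (single pass over the data, fixed throughput figures, one fixed launch overhead), and every number in it is made up for illustration rather than measured on real hardware:

```python
def offload_pays_off(n_bytes, pcie_bytes_per_s, cpu_bytes_per_s,
                     gpu_bytes_per_s, launch_overhead_s):
    """Return True when transfer + GPU compute beats CPU compute."""
    cpu_time = n_bytes / cpu_bytes_per_s
    gpu_time = (launch_overhead_s
                + n_bytes / pcie_bytes_per_s   # copy host memory -> GPU memory
                + n_bytes / gpu_bytes_per_s)   # process on the GPU
    return gpu_time < cpu_time

def break_even_bytes(pcie_bytes_per_s, cpu_bytes_per_s,
                     gpu_bytes_per_s, launch_overhead_s):
    """Working-set size above which offload wins, or None if it never does."""
    saved_per_byte = (1 / cpu_bytes_per_s
                      - 1 / pcie_bytes_per_s
                      - 1 / gpu_bytes_per_s)
    if saved_per_byte <= 0:
        return None  # the transfer alone costs more than the CPU scan
    return launch_overhead_s / saved_per_byte

# Hypothetical throughput figures, chosen only to show the shape of the curve.
HW = dict(pcie_bytes_per_s=20e9, cpu_bytes_per_s=5e9,
          gpu_bytes_per_s=200e9, launch_overhead_s=1e-3)

small = offload_pays_off(1e6, **HW)   # 1 MB working set
large = offload_pays_off(1e9, **HW)   # 1 GB working set
threshold = break_even_bytes(**HW)
```

With these made-up figures the 1 MB case loses to the CPU and the 1 GB case wins, with a break-even point of a few megabytes. Real break-even points depend on the query, the data layout, and whether data already resides in GPU memory, so measure rather than rely on the model.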

Google’s TPU documentation is explicit that TPUs are optimized for matrix multiplications with regular, statically shaped tensors, and that irregular control flow, sparse operations, and dynamic shapes fall back to CPU or GPU. A DBA already knows a boundary of the same kind: the difference between a full table scan (regular and GPU-friendly) and a complex multi-join query plan (irregular and CPU-friendly).

Where It Breaks

Scenario	What breaks	Why
GPU for OLTP	Latency increases, no throughput gain	GPU launch overhead and PCIe transfer cost exceed the per-request compute savings
CPU for large scans	Query can run 10–100x slower than a GPU equivalent at large data volumes	A CPU has tens of cores rather than thousands, so it cannot apply the same scan operation to as many rows at once
TPU for database workloads	Misfit — most DB operations are not dense tensor math	TPU lacks general-purpose branching and irregular memory access support
Heterogeneous system with small working set	GPU transfer overhead dominates	PCIe bandwidth makes GPU offload slower than in-memory CPU execution until data volume is large enough
Assuming GPU = faster for all AI workloads	Inference latency spikes at low concurrency	TPU is faster for batched dense inference; GPU wins for moderate concurrency; CPU wins for single-request light inference

What to Do Next

Problem: Adding GPU or TPU infrastructure without a workload-to-hardware mapping wastes capacity on the wrong execution pattern.
Solution: Classify hot paths by execution pattern before choosing hardware — transactions and coordination stay on CPU, scan-heavy analytics move to GPU, dense model math goes to TPU.
Proof: Run your heaviest analytical query on a GPU-enabled instance with a GPU-accelerated columnar engine (RAPIDS or a GPU database) and compare elapsed time and I/O throughput against the same query on your current CPU-only setup, or on a CPU columnar engine such as DuckDB. Expect a clear gap for scan-heavy query shapes; the gap narrows or disappears for CPU-bound query shapes.
Action: This week, identify the three highest-CPU-cost queries in your monitoring dashboard and classify each as branch-heavy (CPU-bound) or scan-heavy (GPU candidate). That classification determines whether GPU provisioning is justified.
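
One rough way to run that classification is sketched below. It assumes you can export call counts and total rows scanned per query from your monitoring tool; the `QueryStats` fields, the sample queries, and the one-million-row threshold are all hypothetical and should be calibrated against your own workload:

```python
from dataclasses import dataclass

@dataclass
class QueryStats:
    """Hypothetical per-query counters copied from a monitoring dashboard."""
    name: str
    calls: int
    rows_scanned: int  # total rows read across all calls

def classify(stats, scan_threshold=1_000_000):
    """Label a query by average rows scanned per call (an assumed heuristic)."""
    rows_per_call = stats.rows_scanned / max(stats.calls, 1)
    if rows_per_call >= scan_threshold:
        return "scan-heavy (GPU candidate)"
    return "branch-heavy (CPU-bound)"

top_queries = [
    QueryStats("order_lookup", calls=500_000, rows_scanned=1_500_000),
    QueryStats("daily_revenue_rollup", calls=24, rows_scanned=48_000_000_000),
    QueryStats("cart_update", calls=200_000, rows_scanned=200_000),
]
labels = {q.name: classify(q) for q in top_queries}
```

Rows scanned per call is only a first-pass signal. A query that scans many rows but spends its time in complex per-row logic may still be CPU-bound, so confirm each label against the query plan.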

Rajiv
