SIMD vs SIMT Explained for Database Engineers
A lot of GPU and vectorized execution discussions get confusing because people jump straight into terms like lanes, warps, thread blocks, and vector units, leaving database engineers to translate hardware jargon into query plans.
Situation
As analytical workloads grow and latency SLAs shrink, relying solely on row-by-row CPU execution is no longer viable. The industry has firmly shifted toward hardware acceleration for query execution. Systems are increasingly utilizing both CPU vector extensions (like AVX-512) and GPU offloading to process massive datasets faster. A lot of CPU-side gains in modern analytical engines come from vectorized execution and cache-friendly data layouts, while GPUs drive high throughput by maintaining massive thread pools for regular operations.
The Problem
When teams transition to hardware-accelerated databases, they often struggle to predict which workloads will actually benefit. A query that screams on a GPU might crawl if slightly modified, and CPU vectorization sometimes fails to engage at all due to data layout or branch-heavy logic. This unpredictability stems from treating “acceleration” as a black box without understanding the fundamental differences in how CPUs and GPUs parallelize work. If we don’t understand the execution model—specifically what gets parallelized and how branching affects the pipeline—how can we design schemas and write queries that actually leverage the hardware?
Core Concept
To understand the mechanics, we need to look at how a single operation is applied over large amounts of data. If you already understand vectorized query execution, row-at-a-time vs batch-at-a-time processing, and scan-heavy analytics, you already understand most of SIMD and SIMT.
flowchart TD
A[Query Operator] --> B[SIMD CPU Execution]
A --> C[SIMT GPU Execution]
B --> D[Single worker — Wide vector registers]
D --> E[Batch of rows processed in one instruction]
C --> F[Thousands of lightweight workers]
F --> G[Each thread handles a slice concurrently]
- SIMD (Single Instruction, Multiple Data): This is vertical widening inside the CPU. A single CPU worker uses wide vector registers to apply one instruction across a batch of values simultaneously. If a standard engine evaluates a filter one row at a time, a SIMD-enabled vectorized executor processes a batch (for example, 1024 rows) in a single CPU instruction step. SIMD usually helps with vectorized scans, arithmetic-heavy expressions, and batched comparisons.
- SIMT (Single Instruction, Multiple Threads): This is horizontal scaling inside a GPU. The hardware runs the same logical program across thousands of independent threads simultaneously. Instead of widening one worker, SIMT spawns a massive grid of lightweight workers, each applying the same operation to different data slices. SIMT usually helps with large scans, parallel filtering, aggregations, and vector similarity calculations.
If you remember one principle, remember this: SIMD widens a worker, whereas SIMT multiplies workers.
In Practice
We can observe how these execution models dictate database behavior in production systems. The documented pattern is that databases exhibit wildly different performance profiles depending on how their execution engine maps to the underlying hardware.
Example 1: CPU-friendly vectorized query (SIMD)
SELECT SUM(price)
FROM fact_sales
WHERE date_key BETWEEN 20240101 AND 20240131;
ClickHouse and SIMD: The documented pattern is that ClickHouse heavily utilizes SIMD instructions (like SSE4.2 and AVX-512) for this type of query. By storing data in contiguous columnar blocks, ClickHouse feeds vector registers directly. A single core filters thousands of integers in a handful of clock cycles, relying on vectorized predicate evaluation and batched accumulation.
Example 2: GPU-friendly scan and aggregate (SIMT)
SELECT country, SUM(revenue)
FROM events
GROUP BY country;
HEAVY.AI and SIMT: For GPU-native systems like HEAVY.AI (formerly OmniSci), the engine compiles SQL queries into LLVM IR and then to PTX code for NVIDIA GPUs. The SIMT model excels here because the massive scan volume and repeated per-row work maps perfectly to millions of GPU threads executing the partial aggregations in parallel.
Example 3: Bad acceleration candidate
SELECT *
FROM users
WHERE user_id = 42;
PostgreSQL and Row-at-a-Time: PostgreSQL historically processes queries row-by-row. While ideal for tiny indexed lookups where latency dominates, applying hardware acceleration here is counterproductive. Neither SIMD nor SIMT helps with single-row lookups because there is no batched data to widen and no parallel work to distribute.
Where It Breaks
Both models improve performance but have strict constraints, particularly around branching. CPUs handle irregular control flow well, but hardware accelerators lose efficiency when logic diverges.
| Execution Model | Strength | Failure Mode |
|---|---|---|
| SIMD (CPU) | Highly efficient for contiguous columnar scans with simple, repetitive predicates. | Branch Divergence: Performance collapses if the data requires complex, unpredictable IF — ELSE branching. The vector pipeline must evaluate both sides and mask out unused lanes, wasting CPU cycles. |
| SIMT (GPU) | Massive throughput for large aggregations, parallel joins, and heavy vector math. | Thread Divergence: If threads in the same hardware group take different execution paths, the GPU serializes execution, destroying performance. Additionally, tiny indexed lookups suffer heavily due to PCIe data transfer latency. |
What to Do Next
- Problem: Unpredictable performance when migrating standard analytical workloads to accelerated database engines due to a mismatch between query logic and hardware execution models.
- Solution: Map the workload shape to the hardware—use SIMD-optimized columnar stores for general, batch-oriented analytics, and SIMT-based GPU engines for massive, regular, math-heavy scans.
- Proof: Systems like ClickHouse achieve their speed through rigorous SIMD utilization on contiguous columnar data, while GPU databases like HEAVY.AI leverage SIMT to brute-force billion-row aggregates through parallel thread pools.
- Action: Audit slow analytical queries for heavy branching or scattered memory access. Refactor schema layouts to be columnar and contiguous, and replace row-at-a-time loop logic with vector-friendly bulk operations.