Engineering a Columnar Database in Rust: Lessons on io_uring, SIMD, and why I avoided Async/Await

I recently released the core engine for Frigatebird, an OLAP (Columnar) database built from scratch. While building it, I made a few architectural decisions that go against the "standard" Rust web/systems path. I wanted to share the rationale and the performance implications of those choices.

1. Why I ditched Async/Await for a Custom Runtime
The standard advice in Rust is "just use Tokio." However, generic async runtimes are designed primarily for IO-bound tasks with many idle connections. In a database execution pipeline, tasks are often CPU-heavy (scanning/filtering compressed pages).

I found that mixing heavy compute with standard async executors led to unpredictable scheduling latency. Instead, I implemented a Morsel-Driven Parallelism model (inspired by DuckDB/HyPer):

  • Queries are broken into "morsels" (fixed-size row groups).
  • Instead of a central scheduler, worker threads use lock-free work stealing.
  • A query job holds an AtomicUsize counter. Threads race to increment it (CAS), effectively "claiming" the next morsel of work (sketched after this list).
  • This keeps worker threads pinned to their cores and maximizes instruction cache locality, as each thread tends to stick to one tight loop (scanning vs. filtering).
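A minimal sketch of that claiming loop, assuming illustrative names and sizes, and using fetch_add in place of an explicit CAS loop (the shape is the same either way):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

const MORSEL_SIZE: usize = 16_384; // rows per morsel (illustrative)

/// Shared cursor over the morsels of one scan.
struct MorselQueue {
    next: AtomicUsize, // index of the next unclaimed morsel
    total: usize,      // total morsels in this scan
}

impl MorselQueue {
    /// Claim the next morsel, or None when the scan is exhausted.
    /// One atomic RMW instruction is the entire "scheduler".
    fn claim(&self) -> Option<usize> {
        let idx = self.next.fetch_add(1, Ordering::Relaxed);
        (idx < self.total).then_some(idx)
    }
}

fn main() {
    let queue = Arc::new(MorselQueue { next: AtomicUsize::new(0), total: 64 });
    let workers: Vec<_> = (0..4)
        .map(|_| {
            let q = Arc::clone(&queue);
            thread::spawn(move || {
                while let Some(m) = q.claim() {
                    let _rows = m * MORSEL_SIZE..(m + 1) * MORSEL_SIZE;
                    // scan + filter this row range, then loop for more work
                }
            })
        })
        .collect();
    for w in workers {
        w.join().unwrap();
    }
}
```

Because a worker keeps pulling morsels into the same operator loop until the queue runs dry, its hot path stays resident in the instruction cache.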

2. Batched io_uring vs. Standard Syscalls
For the WAL (Write-Ahead Log), fsync latency is the killer. I built a custom storage engine ("Walrus") to leverage Linux's io_uring.

  • Instead of issuing pwrite syscalls one by one, the writer constructs a submission queue of ~2,000 entries in userspace.
  • It then issues a single submit_and_wait syscall to flush them all (see the sketch after this list).
  • This reduced the context-switching overhead significantly, allowing the engine to saturate NVMe bandwidth on a single thread.
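Here is a compressed sketch of the batching pattern using the io-uring crate; the file name, batch layout, and queue depth are my own placeholders, not Walrus internals:

```rust
// Cargo.toml: io-uring = "0.6"   (Linux only)
use io_uring::{opcode, types, IoUring};
use std::fs::OpenOptions;
use std::os::unix::io::AsRawFd;

/// Write a batch of (offset, payload) WAL records with one syscall
/// instead of one pwrite per record.
fn flush_batch(records: &[(u64, Vec<u8>)]) -> std::io::Result<()> {
    let file = OpenOptions::new().create(true).write(true).open("wal.log")?;
    let fd = types::Fd(file.as_raw_fd());
    let mut ring = IoUring::new(4096)?; // queue depth >= batch size

    // Stage every write in the userspace submission queue: no syscalls yet.
    for (i, (offset, buf)) in records.iter().enumerate() {
        let sqe = opcode::Write::new(fd, buf.as_ptr(), buf.len() as u32)
            .offset(*offset)
            .build()
            .user_data(i as u64);
        unsafe { ring.submission().push(&sqe).expect("submission queue full") };
    }

    // One io_uring_enter submits the whole batch and waits for all completions.
    ring.submit_and_wait(records.len())?;
    for cqe in ring.completion() {
        assert!(cqe.result() >= 0, "write {} failed: {}", cqe.user_data(), cqe.result());
    }
    Ok(())
}
```

Durability still requires an fsync; an opcode::Fsync entry linked to the tail of the batch keeps even that inside the same submission.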

3. The "Spin-Lock" Allocator
This was the riskiest decision. A standard OS mutex (pthread_mutex) parks contended threads in the kernel, and the sleep/wake round trip costs microseconds.

  • For the disk block allocator, I implemented a custom AtomicBool spin-lock.
  • It spins in a tight loop (std::hint::spin_loop()) for nanoseconds.
  • Trade-off: If the OS preempts the thread holding the lock, every other thread stalls until it is rescheduled. But because the critical section is simple integer math (computing block offsets), it completes in nanoseconds, orders of magnitude under a scheduler quantum, making preemption inside the lock statistically rare and the lock extremely fast (sketched below).
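A minimal version of the pattern (the allocator's fields are illustrative; the real one presumably tracks more state):

```rust
use std::cell::UnsafeCell;
use std::hint::spin_loop;
use std::sync::atomic::{AtomicBool, Ordering};

const BLOCK_SIZE: u64 = 4096; // illustrative

/// Spin-locked disk block allocator. The critical section is a single
/// addition, so the lock is held for a handful of nanoseconds.
struct BlockAllocator {
    locked: AtomicBool,
    next_offset: UnsafeCell<u64>,
}

// Safe to share across threads: next_offset is only touched under the lock.
unsafe impl Sync for BlockAllocator {}

impl BlockAllocator {
    fn allocate(&self) -> u64 {
        // Acquire: test-and-test-and-set to avoid hammering the cache line.
        while self.locked.swap(true, Ordering::Acquire) {
            while self.locked.load(Ordering::Relaxed) {
                spin_loop(); // hint to the CPU (PAUSE on x86)
            }
        }
        // Critical section: plain integer math, far shorter than a quantum.
        let off = unsafe {
            let p = self.next_offset.get();
            let off = *p;
            *p += BLOCK_SIZE;
            off
        };
        self.locked.store(false, Ordering::Release); // release the lock
        off
    }
}

fn main() {
    let alloc = BlockAllocator {
        locked: AtomicBool::new(false),
        next_offset: UnsafeCell::new(0),
    };
    assert_eq!(alloc.allocate(), 0);
    assert_eq!(alloc.allocate(), BLOCK_SIZE);
}
```

For a lone counter, a bare fetch_add would do; the lock earns its keep once the critical section must update several fields (free list, high-water mark) together.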

4. Zero-Copy Serialization
I used rkyv instead of serde. Serde is great, but it generally requires a deserialization step (parsing bytes into owned structs). rkyv guarantees that the in-memory representation is identical to the on-disk representation, allowing true zero-copy access: you validate and cast a pointer into the raw buffer.
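A small sketch of what that buys you (struct and field names invented for illustration; rkyv 0.7 API with the validation feature):

```rust
// Cargo.toml: rkyv = { version = "0.7", features = ["validation"] }
use rkyv::{Archive, Deserialize, Serialize};

#[derive(Archive, Serialize, Deserialize)]
#[archive(check_bytes)] // enables validated, still zero-copy, access
struct PageFooter {
    row_count: u32,
    min_value: i64,
    max_value: i64,
}

fn main() {
    let footer = PageFooter { row_count: 16_384, min_value: -5, max_value: 900 };

    // "Serialization" lays the struct out in its final on-disk form.
    let bytes = rkyv::to_bytes::<_, 256>(&footer).unwrap();

    // "Deserialization" is a validated pointer cast over the raw buffer:
    // no parsing, no allocation, no copies.
    let view = rkyv::check_archived_root::<PageFooter>(&bytes).unwrap();
    assert_eq!(view.row_count, 16_384);
}
```

The access call is where zero-copy cashes out: reading a footer off a mapped page costs a validity check, not parsing or allocation.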

I'm curious if others here have hit similar walls with Tokio in CPU-bound contexts, or if I just failed to tune it correctly?

submitted by /u/Ok_Marionberry8922