sqlite_heap is a pluggable Postgres storage engine that, instead of using
Postgres's built-in on-disk format, keeps each table's rows inside a private SQLite database
file — while still behaving like a normal Postgres table (transactions, MVCC,
indexes, VACUUM, the planner, all of it).
This document explains how it works from the ground up — including the
Postgres internals you need — so you can read the code with confidence. You write
CREATE TABLE t (...) USING sqlite_heap; and from then on every row you insert into
t lands in a SQLite file on disk. Queries, joins, indexes, and transactions all
still work — Postgres doesn't know or care that the bytes live in SQLite.
Mostly: because Postgres lets you, and it's a fantastic way to understand how Postgres's storage layer actually works. It's also a real, useful pattern — SQLite files are trivially copyable, embeddable, and inspectable. But the honest framing is: this is a serious prototype and a learning vehicle, not a production storage engine (see §16).
Three facts to anchor on:
- One SQLite file per Postgres table. When you create a
sqlite_heaptable, the extension creates a file at$PGDATA/sqlite_heap/<database-oid>/<table-oid>.sqlite. - Inside that file is a table called
storagewith one row per row of your Postgres table — plus bookkeeping columns. The actual Postgres row is stored as an opaqueBLOB; sqlite_heap never looks inside it. Postgres formats the bytes, sqlite_heap just holds them. - The extension is ~3,700 lines of Rust built with pgrx, compiled into a shared library that Postgres loads. It implements the ~43 callbacks Postgres requires of a storage engine.
Here is the SQLite schema (schema version 4):
CREATE TABLE storage (
rowid INTEGER PRIMARY KEY, -- SQLite's own row id; becomes the Postgres TID
xmin INTEGER NOT NULL, -- id of the transaction that created this row
cmin INTEGER NOT NULL, -- command id within that transaction
xmax INTEGER NOT NULL, -- id of the transaction that deleted it (0 = still live)
cmax INTEGER NOT NULL, -- command id of the delete
tuple BLOB NOT NULL -- the raw Postgres row bytes, opaque to us
);
CREATE TABLE meta (key TEXT PRIMARY KEY, value BLOB NOT NULL); -- small key/value scratch
Every column except tuple exists to reproduce Postgres's transactional behaviour.
The rest of this document is, essentially, the story of those columns.
Before the theory, here it is working. This walkthrough shows a sqlite_heap table
behaving like any other Postgres table — then opens the hood: the SQLite file on disk, and
the MVCC bookkeeping Postgres normally hides.
Setup
SQL
CREATE EXTENSION pg_sqlite_heap;
That registers a new access method called sqlite_heap.
A table that lives in SQLite
SQL
CREATE TABLE demo (id int primary key, label text) USING sqlite_heap;
INSERT INTO demo VALUES (1, 'a'), (2, 'b'), (3, 'c');
SELECT * FROM demo ORDER BY id;
Result
id | label
----+-------
1 | a
2 | b
3 | c
From here on it is an ordinary Postgres table — SELECT, JOIN,
indexes, and transactions all work, and nothing in your SQL knows the difference. But the bytes
are not in Postgres's heap. They are in a SQLite file:
SQL
SELECT sqlite_heap_file_path('demo'::regclass) AS path,
sqlite_heap_file_size('demo'::regclass) AS bytes,
sqlite_heap_schema_version('demo'::regclass) AS schema_v;
Result
path | bytes | schema_v
-------------------------------------------------+-------+----------
$PGDATA/sqlite_heap/<db-oid>/<table-oid>.sqlite | 4096 | 4
One SQLite database file per Postgres table.
The reveal — MVCC bookkeeping Postgres hides
Run an UPDATE and a DELETE:
SQL
UPDATE demo SET label = 'B' WHERE id = 2;
DELETE FROM demo WHERE id = 3;
SELECT * FROM demo ORDER BY id;
Result
id | label
----+-------
1 | a
2 | B
Postgres shows two rows. But the SQLite file still physically holds four —
sqlite_heap_storage() dumps the raw storage table, including the per-row
MVCC header columns (xmin/xmax) that Postgres keeps out of sight:
SQL
SELECT * FROM sqlite_heap_storage('demo'::regclass);
Result
rowid | xmin | cmin | xmax | cmax | tuple_bytes | live
-------+---------+------+---------+------+-------------+------
1 | 3851268 | 0 | 0 | 0 | 30 | t
2 | 3851268 | 0 | 3851269 | 0 | 30 | f <- old version of id=2
3 | 3851268 | 0 | 3851270 | 0 | 30 | f <- the deleted id=3
4 | 3851269 | 0 | 0 | 0 | 30 | t <- new version of id=2
The story the file tells: row 2 was the original id=2 — the
UPDATE (txn 3851269) stamped its xmax and wrote
row 4 as the new version. Row 3 was id=3 — the
DELETE (txn 3851270) stamped its xmax; dead, but still on
disk. This is exactly how Postgres's own heap works (§3);
sqlite_heap just lets you see it.
VACUUM reclaims the dead rows
SQL
VACUUM demo;
SELECT * FROM sqlite_heap_storage('demo'::regclass);
Result
rowid | xmin | cmin | xmax | cmax | tuple_bytes | live
-------+---------+------+------+------+-------------+------
1 | 3851273 | 0 | 0 | 0 | 30 | t
4 | 3851274 | 0 | 0 | 0 | 30 | t
The two dead versions are physically gone — VACUUM issued a
DELETE against the SQLite file for every row whose xmax is committed and
no longer visible to anyone (§13).
Open the file with the plain sqlite3 CLI
Nothing about the file is special — it is an ordinary SQLite database. Point the standard
sqlite3 tool at the path from above:
Shell
sqlite3 "$PGDATA/sqlite_heap/<db-oid>/<table-oid>.sqlite"
sqlite> SELECT rowid, xmin, xmax, length(tuple) FROM storage;
Result
1|3851273|0|30
4|3851274|0|30
The tuple column is the opaque Postgres tuple — the actual column values, in
Postgres's on-disk format. Everything else is the MVCC header.
| Function | Returns | Shows |
|---|---|---|
sqlite_heap_file_path(regclass) | text |
on-disk path of the table's SQLite file |
sqlite_heap_file_size(regclass) | bigint |
its size in bytes |
sqlite_heap_schema_version(regclass) | int |
the SQLite PRAGMA user_version |
sqlite_heap_storage(regclass) |
TABLE |
raw row dump with the hidden MVCC header
(rowid, xmin, cmin, xmax, cmax, tuple_bytes, live) |
sqlite_heap_files() | TABLE |
every sqlite_heap file in the database (table_oid, size_bytes) |
sqlite_heap_physical_rows(regclass) | bigint |
row count including dead versions |
sqlite_heap_live_rows(regclass) | bigint |
row count of live (xmax = 0) versions |
All take a table OID — pass a table name with
::regclass, e.g. sqlite_heap_storage('demo'::regclass).
You can't understand the code without three Postgres concepts. This is the first.
Postgres's normal storage format — the way it lays rows out in 8 KB pages on disk — is called the heap. Since Postgres 12, the heap is not hard-wired: it's just one implementation of a pluggable interface called a Table Access Method (TAM). You can register your own.
A table access method is a C struct called TableAmRoutine — a
big bag of function pointers, about 43 of them. Postgres calls these whenever
it needs to touch table storage: "begin a scan", "fetch the next row", "insert this tuple",
"delete the row at this location", "vacuum", "estimate how big this table is for the planner",
and so on.
To make a storage engine you fill in all 43 function pointers and register the struct under a
name. sqlite_heap's 43 functions live in tam.rs; they are
what translate "Postgres wants the next row" into "run a query against SQLite".
Registration is plain SQL, emitted by the extension at install time:
CREATE FUNCTION sqlite_heap_handler(internal) RETURNS table_am_handler
LANGUAGE c AS 'MODULE_PATHNAME', 'sqlite_heap_handler';
CREATE ACCESS METHOD sqlite_heap TYPE TABLE HANDLER sqlite_heap_handler;
The sqlite_heap_handler function (in lib.rs) just
returns a pointer to our filled-in TableAmRoutine. After that,
USING sqlite_heap on a CREATE TABLE routes that table's storage through
our 43 functions.
This is the second concept, and it is the conceptual heart of the whole project.
Postgres lets many transactions read and write concurrently without readers blocking writers. It does this with MVCC — Multi-Version Concurrency Control. The core idea:
Rows are never updated or deleted in place. Every row carries two transaction ids:
xmin— the id of the transaction that created this row version.xmax— the id of the transaction that deleted it (0means "still alive").
An UPDATE is therefore a delete + an insert: it sets xmax on
the old row version and writes a brand-new row version with a fresh xmin. Both
versions physically coexist on disk.
A DELETE just sets xmax — the row stays on disk, marked
dead.
Later, VACUUM physically removes versions that no transaction can possibly need
anymore.
So how does a transaction know which versions of a row it should see? With a snapshot.
When a transaction (or, in READ COMMITTED, each statement) starts, Postgres
takes a snapshot: a record of which transactions had already committed at that
instant.
A row version is visible to a snapshot when, roughly:
- its
xmintransaction had committed as of the snapshot — the row exists yet — and - its
xmaxis0, or thexmaxtransaction had not committed as of the snapshot — the row isn't deleted yet, from this snapshot's point of view.
This is why two transactions can see different "current" contents of the same table at the same time, and why an old read never blocks a new write.
There is one more wrinkle, command ids (cmin/cmax):
within a single transaction, a later command must see the effects of earlier commands,
but not its own. So each row version also records the command number that created
(cmin) and deleted (cmax) it. Without this,
INSERT INTO t SELECT * FROM t would loop forever, re-reading the rows it just
inserted.
How sqlite_heap does this
sqlite_heap is, fundamentally, a re-implementation of Postgres's MVCC visibility rules
on top of SQLite. Look back at the storage schema: xmin,
cmin, xmax, cmax are exactly those four MVCC values, stored
as ordinary SQLite columns.
When Postgres asks sqlite_heap for "the rows of table t visible to this snapshot",
sqlite_heap:
- reads all physical row versions out of the SQLite file (every
xmin/xmaxgeneration), - and for each one calls
visibility::row_visible(header, snapshot)— which applies the visibility rule above — keeping only the versions this snapshot should see.
That function, in visibility.rs, is small but load-bearing. It
dispatches on the kind of snapshot Postgres handed us — regular MVCC is the common
one, but Postgres also has SnapshotSelf, SnapshotDirty,
SnapshotAny, and SnapshotNonVacuumable, each with different rules
— and for the regular case implements the xmin/xmax + cmin/cmax logic described above.
A few smaller concepts that come up constantly in the code.
Postgres identifies a row's physical location with a TID: a pair
(block number, offset within block). Indexes don't store rows — they store
TIDs. "Look up id = 5" means "the index says id = 5 lives at TID
(0, 6); go fetch the row there."
SQLite, on the other hand, identifies a row with a single 64-bit rowid. So
sqlite_heap has to pack a SQLite rowid into a Postgres TID and unpack it again.
It splits the rowid into (block, offset) with 256 rows per "block" — the
function ffi::rowid_to_tid / ffi::tid_to_rowid. The 256 isn't
arbitrary: Postgres validates that offsets stay under MaxHeapTuplesPerPage (~291),
so a naive "just cast the rowid" overflows and Postgres rejects it once a table grows past a few
hundred rows.
Transaction ids in xmin/xmax are just numbers. To know whether
transaction #4217 actually committed (vs. aborted, vs. still running), Postgres
keeps the CLOG — a commit log mapping transaction id → outcome.
The visibility check needs this constantly ("did the transaction that wrote this row
commit?"), so sqlite_heap asks Postgres via TransactionIdDidCommit and caches the
answers per backend in CLOG_CACHE (committed/aborted are permanent, so they're safe
to cache).
Before Postgres changes any data on disk it writes the change to the WAL
first, and fsyncs it. That's what makes a committed transaction survive a crash.
SQLite has its own WAL, working the same way. The fact that there are now two
write-ahead logs that must stay consistent with each other is the source of the trickiest
problem in this codebase — see §10.
Putting it together:
Key points:
- Per-backend, per-table SQLite connections. Each Postgres backend process
opens its own SQLite connection to each table file it touches, kept in a thread-local map
(
CONNS). Connections are opened lazily on first use. - The files live under
$PGDATA. That means they're caught by a filesystem-level backup and follow Postgres's directory permissions — but it also means streaming replication (which only ships the Postgres WAL) won't replicate them. That's a fundamental trade-off of the design. - Postgres still owns the catalog. The table's existence, its columns, its indexes, its statistics — all of that is normal Postgres catalog data. Only the row bytes live in SQLite.
- Indexes are normal Postgres indexes. A B-tree index on a sqlite_heap table is a regular Postgres B-tree. It stores TIDs that point back into our SQLite file (via the rowid↔TID packing).
The crate is split into six files. Read them roughly in this order:
| File | ~Lines | What lives here |
|---|---|---|
| fxhash.rs | 30 | A tiny fast hasher for u32 keys. The hot maps (CONNS,
TX_RELS) are keyed by relation ids; Rust's default DoS-resistant hasher is overkill
here, so we use a one-multiply hash instead. |
| sqlite.rs | 624 | The SQLite backend. Connection management, the schema, every SQL operation
(insert, select_all, update_row, …), and the transaction
handshake between Postgres and SQLite. If you only read one file, read this one. |
| visibility.rs | 146 | MVCC. row_visible — the per-snapshot-type visibility rules from
§3. |
| ffi.rs | 173 | Safe wrappers around the raw C pointers Postgres hands us: Rel (a relation),
Slot (a tuple slot), OwnedHeapTuple. Also the TID↔rowid packing
and the code that builds a Postgres HeapTuple from our stored bytes. |
| tam.rs | 1053 | The ~43 TableAmRoutine callbacks, the scan-state struct, and the routine table
itself. This is the "translate Postgres requests into SQLite calls" layer. |
| lib.rs | 122 | Wiring only: the handler function, the extension_sql! that registers the access
method and the drop-trigger, a couple of debug functions, and the test module. |
tests.rs (~1,080 lines) holds the 51-case in-process
test suite, include!d into lib.rs.
CREATE TABLE … USING sqlite_heapWhen you run CREATE TABLE t (...) USING sqlite_heap, Postgres creates the catalog
entry as usual, then calls our relation_set_new_filelocator callback to set up
storage. sqlite_heap:
- Computes the file path:
$PGDATA/sqlite_heap/<db-oid>/<table-oid>.sqlite. The directory ismkdir'd once per backend and cached. - Opens (creating) the SQLite file via
rusqlite. - Applies connection PRAGMAs (WAL mode, cache size,
synchronous=FULL, …). - If the file is new, runs
install_schema— theCREATE TABLE storage …from §1 — and stamps a schema version into SQLite'sPRAGMA user_version. - Calls
sqlite::resetto make surestorageis empty.
The user_version stamp is how migrations work: if a backend opens an older file,
migrate_schema upgrades it in place (e.g. v2→v3 added the prev
column with ALTER TABLE).
DROP TABLE is the mirror image: a sql_drop event trigger calls
sqlite_heap_drop_storage, which schedules the .sqlite file (and its
-wal/-shm siblings) to be unlinked at commit time — so a
rolled-back DROP leaves the file intact.
INSERTPostgres calls our tuple_insert callback with a "tuple slot" — a container
holding the row to insert. The flow:
tuple_insert(rel, slot, cid, …) // tam.rs — the raw C entry point
└─ do_tuple_insert(rel, slot, cid)
├─ let tup = slot.copy_heap_tuple() // materialise the row as raw bytes
├─ let xid = current_xid_u32() // our transaction's id
├─ let rowid = sqlite::insert(rel_oid, xid, cid, tup.bytes())
└─ slot.set_inserted_tid(rel, rowid) // tell Postgres where it went
And sqlite::insert is just:
INSERT INTO storage (xmin, cmin, tuple) VALUES (?xid, ?cid, ?bytes)
xmax/cmax/prev default to 0 — the row
is alive, not deleted, and a standalone version. SQLite assigns a rowid; sqlite_heap
packs it into a TID and stamps it onto the slot so that any indexes, RETURNING
clauses, or triggers see where the row landed.
tup.bytes() is the raw Postgres on-disk tuple image — Postgres's own
format, with its null bitmap and packed column values. sqlite_heap stores it as a
BLOB and never parses it. When Postgres later reads the row back, sqlite_heap hands
those exact bytes back and Postgres decodes them.
SELECTThere are two ways Postgres reads a table, and sqlite_heap handles both.
Sequential scan
"Read every row." Postgres calls scan_begin, then scan_getnextslot
repeatedly. sqlite_heap:
- On first need,
sqlite::select_allrunsSELECT … FROM storage ORDER BY rowidand loads every physical row version into aVecheld in the scan's cursor. (This is lazy — a scan that's actually an index scan, or a bitmap scan that only needs a few rows, never pays for it.) - Each
scan_getnextslotcall walks thatVec, and for each version callsvisibility::row_visible(header, snapshot). Versions the snapshot shouldn't see are skipped; the first visible one is decoded into the slot and returned.
This is the simplest path: a sequential scan sees every physical row version — every generation an UPDATE or DELETE left behind — and the visibility filter sorts out which one this snapshot should see.
Index scan
"Find the rows where id = 5." The Postgres B-tree index returns one or more TIDs;
Postgres calls our index_fetch_tuple with each. sqlite_heap unpacks the TID to a
rowid and calls fetch_one_into_slot → sqlite::fetch_one_ref:
SELECT rowid, xmin, cmin, xmax, cmax, tuple
FROM storage WHERE rowid = ?rowid
If that row version is visible to the snapshot, decode it into the slot and we're done;
otherwise the index scan moves on to the next TID. (Every live and dead version has its
own index entries until VACUUM reaps it, so an old snapshot reaching an
already-updated row still finds an index entry pointing at the still-present old version.)
The index-fetch path is hot, so fetch_one_ref hands the caller a borrowed view
of the tuple BLOB pointing straight into SQLite's page cache (via
get_ref(…).as_blob()). The bytes are copied exactly once — into the Postgres
HeapTuple the slot needs — instead of the two copies a naive
SELECT-into-Vec would do.
UPDATE & DELETEBoth fall straight out of the MVCC model from §3 — there's nothing clever here, which is the point.
UPDATE
An UPDATE is, by MVCC's definition, a delete plus an insert. tuple_update calls
sqlite::update_row, which does exactly that, in two SQL statements:
UPDATE storage SET xmax = ?xid, cmax = ?cid WHERE rowid = ?old AND xmax = 0;
INSERT INTO storage (xmin, cmin, tuple) VALUES (?xid, ?cid, ?new_bytes);
The old row version stays physically on disk, now stamped dead by our transaction; the new
version is a fresh row at a fresh rowid. tuple_update then tells Postgres
TU_All — "the row moved to a new TID, update every index to point at it".
So both versions are independently reachable through the index, and
visibility::row_visible decides which one a given snapshot sees — a new
transaction gets "new", a transaction whose snapshot predates the update still gets
"old". No special chains, no in-place rewriting: just two ordinary rows and the
visibility rule.
The native heap optimises this with HOT updates: when an UPDATE changes no
indexed column, it avoids touching any index at all. sqlite_heap does not — every
UPDATE writes a new index entry and leaves a dead one behind, so UPDATE-heavy workloads run
~1.5× the native heap. Closing that gap means a "version chain": a prev
column linking each row to its previous version at a stable TID, plus the index-maintenance
logic to exploit it. That machinery is intricate enough to work against this project's main goal
— being understandable — so sqlite_heap keeps the plain delete-old +
insert-new path and accepts the gap as a deliberate trade.
DELETE
Even simpler. tuple_delete calls sqlite::set_xmax:
UPDATE storage SET xmax = ?xid, cmax = ?cid WHERE rowid = ?rowid AND xmax = 0;
The row stays on disk, just marked dead. Its index entries stay valid (a scan still fetches it,
and row_visible hides it from new snapshots). VACUUM physically removes
it later, once no snapshot can need it (§13).
COMMIT & the durability problemThis is the subtlest correctness issue in the whole project, so it gets its own section.
There are two independent storage engines that must agree: Postgres (which owns the catalog and its own WAL) and SQLite (which owns our row data). A transaction that touches a sqlite_heap table has to commit both. If they commit at different moments and the process or machine dies in between, they diverge.
The ordering that makes it sound
sqlite_heap registers a callback on Postgres's PreCommit event — which fires
just before Postgres writes its own commit record. At that point, sqlite_heap commits the
SQLite transaction.
Why this ordering? Consider the only crash window — between the two fsyncs:
- SQLite has committed and is durable; Postgres has not yet recorded the commit.
- On recovery, Postgres sees no commit record → it aborts that transaction.
- So we're left with rows in SQLite whose
xminbelongs to a transaction Postgres's CLOG says aborted. - And
visibility::row_visiblealready hides any row whosexminaborted. So those rows are simply invisible — the divergence self-heals, andVACUUMcan later reap them.
The reverse ordering (commit SQLite after Postgres) would be unsound: a crash in the window would leave Postgres thinking the transaction committed while SQLite is missing the data — committed-but-missing rows, with no way to recover.
For the ordering to actually be safe, SQLite's commit must be durable before
Postgres's — so SQLite runs with synchronous=FULL (fsync the WAL on every
commit). Postgres also fsyncs its own WAL on every commit. So a write transaction now does
two fsyncs where the native heap does one. That is why INSERT/
DELETE on sqlite_heap run ~1.3–1.5× the native heap — it is the
honest, unavoidable price of a sound two-engine design. It's the same durability/throughput knob
as Postgres's own synchronous_commit; FULL is the safe default.
Lazy transactions
One important optimisation: the SQLite transaction is opened lazily, on the first
write. A read-only Postgres transaction never issues a SQLite BEGIN/
COMMIT at all — SQLite's per-statement autocommit gives an identical consistent
read, and MVCC visibility is decided by us against the Postgres snapshot regardless. The
SQLite transaction was only ever there to give writes atomicity. This is tracked by the
in_txn flag on each connection, and it's why SELECT performance is
unaffected by synchronous=FULL.
Rollback is the easy direction: on Postgres Abort, sqlite_heap issues a SQLite
ROLLBACK (or, for savepoints, ROLLBACK TO SAVEPOINT) — the SQLite
transaction simply never happened. Postgres sub-transactions (savepoints, PL/pgSQL
EXCEPTION blocks) map onto SQLite SAVEPOINTs.
Each backend keeps a thread-local CONNS: HashMap<table-oid, RelConn>.
A RelConn is just an open SQLite Connection plus an in_txn
flag (the lazy-transaction state from §10) — and it
Derefs to the Connection, so call sites read naturally.
SQL statements are run through rusqlite's prepare_cached — it compiles a
statement once and reuses it, keyed by the SQL text. Holding the compiled
sqlite3_stmts directly on the RelConn would skip even that lookup (~17%
faster on the fetch path), but it makes RelConn self-referential — a
Statement borrows its Connection — which needs an
unsafe lifetime transmute to express. sqlite_heap favours
understandability over that speed: prepare_cached is plenty, and there is no
non-obvious unsafe anywhere in the crate.
Error handling follows one rule: any rusqlite call that fails uses
.unwrap_sql() (a small extension trait), which panics with the
SQLite error message. The panic unwinds — running destructors, so the RefCell
borrow guards release cleanly — and pgrx's #[pg_guard] boundary converts it
into a normal Postgres ERROR. A SQLite failure here (locked database, I/O error,
corruption) is genuinely unrecoverable on the spot, so failing the statement loudly is the correct
behaviour.
The Postgres planner needs to know roughly how big a table is to choose a good plan. It asks
via the relation_estimate_size callback — on every query
plan.
estimate_size answers it plainly: one stat of the SQLite file for a
byte size, plus a SELECT count(*) for the row count. That count(*) on
every plan isn't free, and it could be cached — but the planner leans on
ANALYZE's real statistics anyway, and the plain version is far easier to follow.
(This is one of several spots where sqlite_heap deliberately chooses the clear implementation over
the fast one.)
ANALYZE is more involved. Postgres drives table sampling through a callback
pair (scan_analyze_next_block / scan_analyze_next_tuple) and a
"ReadStream" abstraction built for real disk pages — which sqlite_heap doesn't have. The
trick: model the whole table as a single logical block, and advance the ReadStream with
read_stream_next_block (which moves Postgres's internal sampler counter without doing
any actual page I/O). Skipping that step left Postgres dividing by a zero "blocks scanned" count,
which produced a nonsense n_distinct of -Infinity. With it, ANALYZE
produces correct row counts and column statistics, and the planner can pick good plans.
Dead row versions — ones with xmax set, left behind by updates and deletes
— accumulate in the SQLite file. VACUUM reclaims them.
Postgres calls our relation_vacuum callback. sqlite_heap computes the "oldest
transaction id any running snapshot could still need"
(GetOldestNonRemovableTransactionId) and runs:
DELETE FROM storage WHERE xmax != 0 AND xmax < ?oldest_xmin
That removes exactly the versions that are deleted and globally invisible. A small meta-table watermark records how far the last VACUUM reached, so a repeated VACUUM with nothing new to do skips the scan entirely.
There's also index_delete_tuples — Postgres's B-tree calls it during
"bottom-up deletion" to ask "which of these index entries point to reclaimable rows?". sqlite_heap
answers by batch-checking the rows' xmax values in a single query.
Two layers:
- 51 in-process tests (tests.rs, run by
cargo pgrx test). These run real SQL inside a real Postgres backend and assert on results — round-trips, NULL handling, the MVCC rules, rollback, savepoints, index/bitmap scans, a cursor test that proves an old snapshot still sees the pre-update row, ANALYZE statistics, and a set of regression tests, each pinning a specific bug that was found and fixed. - A multi-backend concurrency suite
(tests/concurrency.sh). The in-process tests each run in a single
transaction-scoped backend, so they can't exercise cross-backend behaviour. This script
drives 8 real
psqlconnections at once through five phases — concurrent inserts, concurrent updates, delete/re-insert churn, readers-versus-a-writer, and aREPEATABLE READsnapshot held across concurrent updates — then checks the table is consistent and runsPRAGMA integrity_checkon the SQLite file.
Measured with pgbench over a warm connection, against the native Postgres heap.
Numbers are ratios — lower is better, 1.0× means "as fast as the native heap":
| Operation | sqlite_heap vs native heap | Why |
|---|---|---|
| SELECT (point lookup) | ~1.0× — effectively matched | read-only: no commit fsync, single-copy fetch path |
| UPDATE | ~1.5× | delete-old + insert-new touches every index; no HOT optimisation (see §9) |
| DELETE | ~1.2–1.3× | one statement + the write fsync |
| INSERT | ~1.3–1.5× | the second fsync (§10) — the price of two-engine durability |
The headline: reads are essentially free; writes carry a ~1.3–1.5× tax. Part of that is fundamental — fsyncing two write-ahead logs instead of one (§10) — and part of it is a deliberate trade: closing the rest would mean machinery (HOT-style version chains, held prepared statements, extra caches) intricate enough to work against this project's goal of being understandable. A faster build of this idea is possible; this one optimises for being readable.
The benchmark Postgres is a debug/assertion build, which dilutes the absolute numbers; the ratios are a fair apples-to-apples comparison but a release build would shift them.
This is an honest list. sqlite_heap is a strong prototype; it is not production-ready, and these are why:
- No row-level locking.
tuple_update/tuple_deletealways report success. Two backends updating the same row get a lost update — no serialization failure is raised. The concurrency suite uses disjoint row sets precisely because same-row contention isn't handled. The native heap does properTM_Updated+ EvalPlanQual; sqlite_heap doesn't. This is the most important gap. - Incompatible with streaming replication / PITR. The SQLite files live under
$PGDATA, so a filesystem-level base backup catches them — but streaming replication ships the Postgres WAL, which knows nothing about our files. A streaming replica would have the catalog entry for a sqlite_heap table and zero data. This is fundamental to the architecture, not a bug. - Crash safety is reasoned, not crash-tested. The argument in §10 is sound
on paper; nobody has actually killed the process mid-commit under power-loss simulation and
verified. Also, rows from an aborted insert leak in SQLite — invisible forever, but VACUUM
only reaps
xmax-set rows. - Several callbacks are simplified.
tuple_get_latest_tidis a no-op,tuple_lock(SELECT FOR UPDATE) doesn't really lock, and REINDEX CONCURRENTLY / TABLESAMPLE have documented simplifications. - Unaudited, unsoaked, single-author. No fuzzing, no long-running soak tests, no second pair of eyes, and benchmarks only on a debug Postgres.
If you wanted to actually ship this: row-level locking is the must-fix, and you would have to either accept "no streaming replication" as a documented constraint or rethink where the bytes live.
| Term | Meaning |
|---|---|
| TAM | Table Access Method — Postgres's pluggable
storage-engine interface; a struct of ~43 function pointers
(TableAmRoutine). |
| heap | Postgres's native on-disk table format; also the generic word for "a table's stored rows". sqlite_heap is a replacement for it. |
| MVCC | Multi-Version Concurrency Control — keeping multiple versions of each row so readers never block writers. |
| xmin / xmax | The transaction ids that created / deleted a row
version. xmax = 0 means "alive". |
| cmin / cmax | Command ids within a transaction — for correct visibility between statements of the same transaction. |
| snapshot | A transaction's frozen view of "which transactions had committed" — drives what rows it can see. |
| TID | Tuple identifier — a row's physical address,
(block, offset). Indexes store TIDs. |
| tuple | Postgres's word for a row (specifically, a row's on-disk byte image). |
| slot | TupleTableSlot — the container Postgres
passes rows around in. |
| CLOG | Commit log — maps transaction id → committed/aborted. |
| WAL | Write-ahead log — durability mechanism; both Postgres and SQLite have one. |
| HOT update | Heap-Only Tuple update — a native-heap optimisation where an UPDATE that changes no indexed column touches no index. sqlite_heap does not implement it (see §9). |
| VACUUM | Background reclamation of dead row versions. |
| pgrx | The Rust framework for writing Postgres extensions; provides
the pg_sys bindings, #[pg_guard], the test harness, etc. |
| rusqlite | The Rust SQLite binding sqlite_heap uses to talk to its storage files. |