sqlite_heap  / Engineering Reference
Overview

sqlite_heap is a pluggable Postgres storage engine that, instead of using Postgres's built-in on-disk format, keeps each table's rows inside a private SQLite database file — while still behaving like a normal Postgres table (transactions, MVCC, indexes, VACUUM, the planner, all of it).

This document explains how it works from the ground up — including the Postgres internals you need — so you can read the code with confidence. You write CREATE TABLE t (...) USING sqlite_heap; and from then on every row you insert into t lands in a SQLite file on disk. Queries, joins, indexes, and transactions all still work — Postgres doesn't know or care that the bytes live in SQLite.

Note — Why would anyone do this?

Mostly: because Postgres lets you, and it's a fantastic way to understand how Postgres's storage layer actually works. It's also a real, useful pattern — SQLite files are trivially copyable, embeddable, and inspectable. But the honest framing is: this is a serious prototype and a learning vehicle, not a production storage engine (see §16).

01 What it is, concretely

Three facts to anchor on:

  1. One SQLite file per Postgres table. When you create a sqlite_heap table, the extension creates a file at $PGDATA/sqlite_heap/<database-oid>/<table-oid>.sqlite.
  2. Inside that file is a table called storage with one row per row of your Postgres table — plus bookkeeping columns. The actual Postgres row is stored as an opaque BLOB; sqlite_heap never looks inside it. Postgres formats the bytes, sqlite_heap just holds them.
  3. The extension is ~3,700 lines of Rust built with pgrx, compiled into a shared library that Postgres loads. It implements the ~43 callbacks Postgres requires of a storage engine.

Here is the SQLite schema (schema version 4):

CREATE TABLE storage (
    rowid INTEGER PRIMARY KEY,   -- SQLite's own row id; becomes the Postgres TID
    xmin  INTEGER NOT NULL,      -- id of the transaction that created this row
    cmin  INTEGER NOT NULL,      -- command id within that transaction
    xmax  INTEGER NOT NULL,      -- id of the transaction that deleted it (0 = still live)
    cmax  INTEGER NOT NULL,      -- command id of the delete
    tuple BLOB    NOT NULL       -- the raw Postgres row bytes, opaque to us
);
CREATE TABLE meta (key TEXT PRIMARY KEY, value BLOB NOT NULL);  -- small key/value scratch

Every column except tuple exists to reproduce Postgres's transactional behaviour. The rest of this document is, essentially, the story of those columns.

1.5 A five-minute demo

Before the theory, here it is working. This walkthrough shows a sqlite_heap table behaving like any other Postgres table — then opens the hood: the SQLite file on disk, and the MVCC bookkeeping Postgres normally hides.

Setup

SQL

CREATE EXTENSION pg_sqlite_heap;

That registers a new access method called sqlite_heap.

A table that lives in SQLite

SQL

CREATE TABLE demo (id int primary key, label text) USING sqlite_heap;
INSERT INTO demo VALUES (1, 'a'), (2, 'b'), (3, 'c');
SELECT * FROM demo ORDER BY id;

Result

 id | label
----+-------
  1 | a
  2 | b
  3 | c

From here on it is an ordinary Postgres table — SELECT, JOIN, indexes, and transactions all work, and nothing in your SQL knows the difference. But the bytes are not in Postgres's heap. They are in a SQLite file:

SQL

SELECT sqlite_heap_file_path('demo'::regclass)      AS path,
       sqlite_heap_file_size('demo'::regclass)      AS bytes,
       sqlite_heap_schema_version('demo'::regclass) AS schema_v;

Result

                      path                       | bytes | schema_v
-------------------------------------------------+-------+----------
 $PGDATA/sqlite_heap/<db-oid>/<table-oid>.sqlite |  4096 |        4

One SQLite database file per Postgres table.

The reveal — MVCC bookkeeping Postgres hides

Run an UPDATE and a DELETE:

SQL

UPDATE demo SET label = 'B' WHERE id = 2;
DELETE FROM demo WHERE id = 3;
SELECT * FROM demo ORDER BY id;

Result

 id | label
----+-------
  1 | a
  2 | B

Postgres shows two rows. But the SQLite file still physically holds foursqlite_heap_storage() dumps the raw storage table, including the per-row MVCC header columns (xmin/xmax) that Postgres keeps out of sight:

SQL

SELECT * FROM sqlite_heap_storage('demo'::regclass);

Result

 rowid |  xmin   | cmin |  xmax   | cmax | tuple_bytes | live
-------+---------+------+---------+------+-------------+------
     1 | 3851268 |    0 |       0 |    0 |          30 | t
     2 | 3851268 |    0 | 3851269 |    0 |          30 | f   <- old version of id=2
     3 | 3851268 |    0 | 3851270 |    0 |          30 | f   <- the deleted id=3
     4 | 3851269 |    0 |       0 |    0 |          30 | t   <- new version of id=2

The story the file tells: row 2 was the original id=2 — the UPDATE (txn 3851269) stamped its xmax and wrote row 4 as the new version. Row 3 was id=3 — the DELETE (txn 3851270) stamped its xmax; dead, but still on disk. This is exactly how Postgres's own heap works (§3); sqlite_heap just lets you see it.

VACUUM reclaims the dead rows

SQL

VACUUM demo;
SELECT * FROM sqlite_heap_storage('demo'::regclass);

Result

 rowid |  xmin   | cmin | xmax | cmax | tuple_bytes | live
-------+---------+------+------+------+-------------+------
     1 | 3851273 |    0 |    0 |    0 |          30 | t
     4 | 3851274 |    0 |    0 |    0 |          30 | t

The two dead versions are physically gone — VACUUM issued a DELETE against the SQLite file for every row whose xmax is committed and no longer visible to anyone (§13).

Open the file with the plain sqlite3 CLI

Nothing about the file is special — it is an ordinary SQLite database. Point the standard sqlite3 tool at the path from above:

Shell

sqlite3 "$PGDATA/sqlite_heap/<db-oid>/<table-oid>.sqlite"
sqlite> SELECT rowid, xmin, xmax, length(tuple) FROM storage;

Result

1|3851273|0|30
4|3851274|0|30

The tuple column is the opaque Postgres tuple — the actual column values, in Postgres's on-disk format. Everything else is the MVCC header.

Inspection function reference
FunctionReturnsShows
sqlite_heap_file_path(regclass)text on-disk path of the table's SQLite file
sqlite_heap_file_size(regclass)bigint its size in bytes
sqlite_heap_schema_version(regclass)int the SQLite PRAGMA user_version
sqlite_heap_storage(regclass) TABLE raw row dump with the hidden MVCC header (rowid, xmin, cmin, xmax, cmax, tuple_bytes, live)
sqlite_heap_files()TABLE every sqlite_heap file in the database (table_oid, size_bytes)
sqlite_heap_physical_rows(regclass)bigint row count including dead versions
sqlite_heap_live_rows(regclass)bigint row count of live (xmax = 0) versions

All take a table OID — pass a table name with ::regclass, e.g. sqlite_heap_storage('demo'::regclass).

02 Background — Table Access Methods

You can't understand the code without three Postgres concepts. This is the first.

Postgres's normal storage format — the way it lays rows out in 8 KB pages on disk — is called the heap. Since Postgres 12, the heap is not hard-wired: it's just one implementation of a pluggable interface called a Table Access Method (TAM). You can register your own.

Postgres Concept — the TableAmRoutine

A table access method is a C struct called TableAmRoutine — a big bag of function pointers, about 43 of them. Postgres calls these whenever it needs to touch table storage: "begin a scan", "fetch the next row", "insert this tuple", "delete the row at this location", "vacuum", "estimate how big this table is for the planner", and so on.

To make a storage engine you fill in all 43 function pointers and register the struct under a name. sqlite_heap's 43 functions live in tam.rs; they are what translate "Postgres wants the next row" into "run a query against SQLite".

Registration is plain SQL, emitted by the extension at install time:

CREATE FUNCTION sqlite_heap_handler(internal) RETURNS table_am_handler
    LANGUAGE c AS 'MODULE_PATHNAME', 'sqlite_heap_handler';
CREATE ACCESS METHOD sqlite_heap TYPE TABLE HANDLER sqlite_heap_handler;

The sqlite_heap_handler function (in lib.rs) just returns a pointer to our filled-in TableAmRoutine. After that, USING sqlite_heap on a CREATE TABLE routes that table's storage through our 43 functions.

Fig. 2-1 — Request path, index lookup
SELECT * FROM t WHERE id = 5 │ ▼ ┌──────────────────────────────┐ │ Postgres planner + executor │ └───────────────┬──────────────┘ │ index says id=5 lives at TID (0,5); │ Postgres calls index_fetch_tuple(TID) ▼ ┌──────────────────────────────┐ │ sqlite_heap (TableAmRoutine)│ └───────────────┬──────────────┘ │ SELECT … FROM storage WHERE rowid = 5 ▼ ┌──────────────────────────────┐ │ <table-oid>.sqlite (SQLite) │ └──────────────────────────────┘
03 Background — MVCC & snapshots

This is the second concept, and it is the conceptual heart of the whole project.

Postgres lets many transactions read and write concurrently without readers blocking writers. It does this with MVCC — Multi-Version Concurrency Control. The core idea:

Postgres Concept — MVCC

Rows are never updated or deleted in place. Every row carries two transaction ids:

  • xmin — the id of the transaction that created this row version.
  • xmax — the id of the transaction that deleted it (0 means "still alive").

An UPDATE is therefore a delete + an insert: it sets xmax on the old row version and writes a brand-new row version with a fresh xmin. Both versions physically coexist on disk.

A DELETE just sets xmax — the row stays on disk, marked dead.

Later, VACUUM physically removes versions that no transaction can possibly need anymore.

So how does a transaction know which versions of a row it should see? With a snapshot.

Postgres Concept — Snapshots

When a transaction (or, in READ COMMITTED, each statement) starts, Postgres takes a snapshot: a record of which transactions had already committed at that instant.

A row version is visible to a snapshot when, roughly:

  • its xmin transaction had committed as of the snapshot — the row exists yet — and
  • its xmax is 0, or the xmax transaction had not committed as of the snapshot — the row isn't deleted yet, from this snapshot's point of view.

This is why two transactions can see different "current" contents of the same table at the same time, and why an old read never blocks a new write.

There is one more wrinkle, command ids (cmin/cmax): within a single transaction, a later command must see the effects of earlier commands, but not its own. So each row version also records the command number that created (cmin) and deleted (cmax) it. Without this, INSERT INTO t SELECT * FROM t would loop forever, re-reading the rows it just inserted.

How sqlite_heap does this

sqlite_heap is, fundamentally, a re-implementation of Postgres's MVCC visibility rules on top of SQLite. Look back at the storage schema: xmin, cmin, xmax, cmax are exactly those four MVCC values, stored as ordinary SQLite columns.

When Postgres asks sqlite_heap for "the rows of table t visible to this snapshot", sqlite_heap:

  1. reads all physical row versions out of the SQLite file (every xmin/ xmax generation),
  2. and for each one calls visibility::row_visible(header, snapshot) — which applies the visibility rule above — keeping only the versions this snapshot should see.

That function, in visibility.rs, is small but load-bearing. It dispatches on the kind of snapshot Postgres handed us — regular MVCC is the common one, but Postgres also has SnapshotSelf, SnapshotDirty, SnapshotAny, and SnapshotNonVacuumable, each with different rules — and for the regular case implements the xmin/xmax + cmin/cmax logic described above.

3.5 Background — TIDs, the CLOG, and the WAL

A few smaller concepts that come up constantly in the code.

Postgres Concept — TID (tuple identifier)

Postgres identifies a row's physical location with a TID: a pair (block number, offset within block). Indexes don't store rows — they store TIDs. "Look up id = 5" means "the index says id = 5 lives at TID (0, 6); go fetch the row there."

SQLite, on the other hand, identifies a row with a single 64-bit rowid. So sqlite_heap has to pack a SQLite rowid into a Postgres TID and unpack it again. It splits the rowid into (block, offset) with 256 rows per "block" — the function ffi::rowid_to_tid / ffi::tid_to_rowid. The 256 isn't arbitrary: Postgres validates that offsets stay under MaxHeapTuplesPerPage (~291), so a naive "just cast the rowid" overflows and Postgres rejects it once a table grows past a few hundred rows.

Postgres Concept — the CLOG (commit log)

Transaction ids in xmin/xmax are just numbers. To know whether transaction #4217 actually committed (vs. aborted, vs. still running), Postgres keeps the CLOG — a commit log mapping transaction id → outcome.

The visibility check needs this constantly ("did the transaction that wrote this row commit?"), so sqlite_heap asks Postgres via TransactionIdDidCommit and caches the answers per backend in CLOG_CACHE (committed/aborted are permanent, so they're safe to cache).

Postgres Concept — the WAL (write-ahead log)

Before Postgres changes any data on disk it writes the change to the WAL first, and fsyncs it. That's what makes a committed transaction survive a crash. SQLite has its own WAL, working the same way. The fact that there are now two write-ahead logs that must stay consistent with each other is the source of the trickiest problem in this codebase — see §10.

04 Architecture

Putting it together:

Fig. 4-1 — Process & file layout
┌──────────────────────────────────────────────┐ │ one Postgres backend process │ │ executor ──▶ sqlite_heap ──▶ rusqlite │ └───────────────────────┬──────────────────────┘ │ every write splits two ways ┌───────────┴───────────┐ ▼ ▼ Postgres's own files sqlite_heap's files ─────────────────── ─────────────────── base/, pg_wal/ $PGDATA/sqlite_heap/<db-oid>/<table-oid>.sqlite the catalog + its WAL the actual row data (+ its own -wal/-shm)

Key points:

  • Per-backend, per-table SQLite connections. Each Postgres backend process opens its own SQLite connection to each table file it touches, kept in a thread-local map (CONNS). Connections are opened lazily on first use.
  • The files live under $PGDATA. That means they're caught by a filesystem-level backup and follow Postgres's directory permissions — but it also means streaming replication (which only ships the Postgres WAL) won't replicate them. That's a fundamental trade-off of the design.
  • Postgres still owns the catalog. The table's existence, its columns, its indexes, its statistics — all of that is normal Postgres catalog data. Only the row bytes live in SQLite.
  • Indexes are normal Postgres indexes. A B-tree index on a sqlite_heap table is a regular Postgres B-tree. It stores TIDs that point back into our SQLite file (via the rowid↔TID packing).
05 The six modules

The crate is split into six files. Read them roughly in this order:

File~LinesWhat lives here
fxhash.rs30 A tiny fast hasher for u32 keys. The hot maps (CONNS, TX_RELS) are keyed by relation ids; Rust's default DoS-resistant hasher is overkill here, so we use a one-multiply hash instead.
sqlite.rs624 The SQLite backend. Connection management, the schema, every SQL operation (insert, select_all, update_row, …), and the transaction handshake between Postgres and SQLite. If you only read one file, read this one.
visibility.rs146 MVCC. row_visible — the per-snapshot-type visibility rules from §3.
ffi.rs173 Safe wrappers around the raw C pointers Postgres hands us: Rel (a relation), Slot (a tuple slot), OwnedHeapTuple. Also the TID↔rowid packing and the code that builds a Postgres HeapTuple from our stored bytes.
tam.rs1053 The ~43 TableAmRoutine callbacks, the scan-state struct, and the routine table itself. This is the "translate Postgres requests into SQLite calls" layer.
lib.rs122 Wiring only: the handler function, the extension_sql! that registers the access method and the drop-trigger, a couple of debug functions, and the test module.

tests.rs (~1,080 lines) holds the 51-case in-process test suite, include!d into lib.rs.

06 Walkthrough — CREATE TABLE … USING sqlite_heap

When you run CREATE TABLE t (...) USING sqlite_heap, Postgres creates the catalog entry as usual, then calls our relation_set_new_filelocator callback to set up storage. sqlite_heap:

  1. Computes the file path: $PGDATA/sqlite_heap/<db-oid>/<table-oid>.sqlite. The directory is mkdir'd once per backend and cached.
  2. Opens (creating) the SQLite file via rusqlite.
  3. Applies connection PRAGMAs (WAL mode, cache size, synchronous=FULL, …).
  4. If the file is new, runs install_schema — the CREATE TABLE storage … from §1 — and stamps a schema version into SQLite's PRAGMA user_version.
  5. Calls sqlite::reset to make sure storage is empty.

The user_version stamp is how migrations work: if a backend opens an older file, migrate_schema upgrades it in place (e.g. v2→v3 added the prev column with ALTER TABLE).

DROP TABLE is the mirror image: a sql_drop event trigger calls sqlite_heap_drop_storage, which schedules the .sqlite file (and its -wal/-shm siblings) to be unlinked at commit time — so a rolled-back DROP leaves the file intact.

07 Walkthrough — INSERT

Postgres calls our tuple_insert callback with a "tuple slot" — a container holding the row to insert. The flow:

tuple_insert(rel, slot, cid, …)            // tam.rs — the raw C entry point
  └─ do_tuple_insert(rel, slot, cid)
       ├─ let tup = slot.copy_heap_tuple()  // materialise the row as raw bytes
       ├─ let xid = current_xid_u32()       // our transaction's id
       ├─ let rowid = sqlite::insert(rel_oid, xid, cid, tup.bytes())
       └─ slot.set_inserted_tid(rel, rowid) // tell Postgres where it went

And sqlite::insert is just:

INSERT INTO storage (xmin, cmin, tuple) VALUES (?xid, ?cid, ?bytes)

xmax/cmax/prev default to 0 — the row is alive, not deleted, and a standalone version. SQLite assigns a rowid; sqlite_heap packs it into a TID and stamps it onto the slot so that any indexes, RETURNING clauses, or triggers see where the row landed.

Note — the row bytes are opaque

tup.bytes() is the raw Postgres on-disk tuple image — Postgres's own format, with its null bitmap and packed column values. sqlite_heap stores it as a BLOB and never parses it. When Postgres later reads the row back, sqlite_heap hands those exact bytes back and Postgres decodes them.

08 Walkthrough — SELECT

There are two ways Postgres reads a table, and sqlite_heap handles both.

Sequential scan

"Read every row." Postgres calls scan_begin, then scan_getnextslot repeatedly. sqlite_heap:

  1. On first need, sqlite::select_all runs SELECT … FROM storage ORDER BY rowid and loads every physical row version into a Vec held in the scan's cursor. (This is lazy — a scan that's actually an index scan, or a bitmap scan that only needs a few rows, never pays for it.)
  2. Each scan_getnextslot call walks that Vec, and for each version calls visibility::row_visible(header, snapshot). Versions the snapshot shouldn't see are skipped; the first visible one is decoded into the slot and returned.

This is the simplest path: a sequential scan sees every physical row version — every generation an UPDATE or DELETE left behind — and the visibility filter sorts out which one this snapshot should see.

Index scan

"Find the rows where id = 5." The Postgres B-tree index returns one or more TIDs; Postgres calls our index_fetch_tuple with each. sqlite_heap unpacks the TID to a rowid and calls fetch_one_into_slotsqlite::fetch_one_ref:

SELECT rowid, xmin, cmin, xmax, cmax, tuple
FROM storage WHERE rowid = ?rowid

If that row version is visible to the snapshot, decode it into the slot and we're done; otherwise the index scan moves on to the next TID. (Every live and dead version has its own index entries until VACUUM reaps it, so an old snapshot reaching an already-updated row still finds an index entry pointing at the still-present old version.)

Detail — zero-copy reads

The index-fetch path is hot, so fetch_one_ref hands the caller a borrowed view of the tuple BLOB pointing straight into SQLite's page cache (via get_ref(…).as_blob()). The bytes are copied exactly once — into the Postgres HeapTuple the slot needs — instead of the two copies a naive SELECT-into-Vec would do.

09 Walkthrough — UPDATE & DELETE

Both fall straight out of the MVCC model from §3 — there's nothing clever here, which is the point.

UPDATE

An UPDATE is, by MVCC's definition, a delete plus an insert. tuple_update calls sqlite::update_row, which does exactly that, in two SQL statements:

UPDATE storage SET xmax = ?xid, cmax = ?cid WHERE rowid = ?old AND xmax = 0;
INSERT INTO storage (xmin, cmin, tuple) VALUES (?xid, ?cid, ?new_bytes);

The old row version stays physically on disk, now stamped dead by our transaction; the new version is a fresh row at a fresh rowid. tuple_update then tells Postgres TU_All — "the row moved to a new TID, update every index to point at it".

Fig. 9-1 — UPDATE, both versions reachable
UPDATE t SET val = 'new' WHERE id = 5 storage, afterwards: index on `id`, afterwards: rowid xmin xmax val id=5 -> TID(rowid 5) (old, dead) ----- ---- ---- ----- id=5 -> TID(rowid 6) (new, live) 5 10 20 "old" <- dead 6 20 0 "new" <- live a scan fetches both TIDs; row_visible keeps the right one for each snapshot.

So both versions are independently reachable through the index, and visibility::row_visible decides which one a given snapshot sees — a new transaction gets "new", a transaction whose snapshot predates the update still gets "old". No special chains, no in-place rewriting: just two ordinary rows and the visibility rule.

Trade-off — the cost, and why

The native heap optimises this with HOT updates: when an UPDATE changes no indexed column, it avoids touching any index at all. sqlite_heap does not — every UPDATE writes a new index entry and leaves a dead one behind, so UPDATE-heavy workloads run ~1.5× the native heap. Closing that gap means a "version chain": a prev column linking each row to its previous version at a stable TID, plus the index-maintenance logic to exploit it. That machinery is intricate enough to work against this project's main goal — being understandable — so sqlite_heap keeps the plain delete-old + insert-new path and accepts the gap as a deliberate trade.

DELETE

Even simpler. tuple_delete calls sqlite::set_xmax:

UPDATE storage SET xmax = ?xid, cmax = ?cid WHERE rowid = ?rowid AND xmax = 0;

The row stays on disk, just marked dead. Its index entries stay valid (a scan still fetches it, and row_visible hides it from new snapshots). VACUUM physically removes it later, once no snapshot can need it (§13).

10 Walkthrough — COMMIT & the durability problem

This is the subtlest correctness issue in the whole project, so it gets its own section.

There are two independent storage engines that must agree: Postgres (which owns the catalog and its own WAL) and SQLite (which owns our row data). A transaction that touches a sqlite_heap table has to commit both. If they commit at different moments and the process or machine dies in between, they diverge.

The ordering that makes it sound

sqlite_heap registers a callback on Postgres's PreCommit event — which fires just before Postgres writes its own commit record. At that point, sqlite_heap commits the SQLite transaction.

Fig. 10-1 — Commit timeline (top to bottom = time)
… transaction does its work … │ ▼ PreCommit fires │ sqlite_heap COMMITs the SQLite txn + fsync ◀── SQLite durable here ▼ Postgres writes its own commit record + fsync ◀── Postgres durable here │ ▼ committed

Why this ordering? Consider the only crash window — between the two fsyncs:

  • SQLite has committed and is durable; Postgres has not yet recorded the commit.
  • On recovery, Postgres sees no commit record → it aborts that transaction.
  • So we're left with rows in SQLite whose xmin belongs to a transaction Postgres's CLOG says aborted.
  • And visibility::row_visible already hides any row whose xmin aborted. So those rows are simply invisible — the divergence self-heals, and VACUUM can later reap them.

The reverse ordering (commit SQLite after Postgres) would be unsound: a crash in the window would leave Postgres thinking the transaction committed while SQLite is missing the data — committed-but-missing rows, with no way to recover.

Cost — two fsyncs per write

For the ordering to actually be safe, SQLite's commit must be durable before Postgres's — so SQLite runs with synchronous=FULL (fsync the WAL on every commit). Postgres also fsyncs its own WAL on every commit. So a write transaction now does two fsyncs where the native heap does one. That is why INSERT/ DELETE on sqlite_heap run ~1.3–1.5× the native heap — it is the honest, unavoidable price of a sound two-engine design. It's the same durability/throughput knob as Postgres's own synchronous_commit; FULL is the safe default.

Lazy transactions

One important optimisation: the SQLite transaction is opened lazily, on the first write. A read-only Postgres transaction never issues a SQLite BEGIN/ COMMIT at all — SQLite's per-statement autocommit gives an identical consistent read, and MVCC visibility is decided by us against the Postgres snapshot regardless. The SQLite transaction was only ever there to give writes atomicity. This is tracked by the in_txn flag on each connection, and it's why SELECT performance is unaffected by synchronous=FULL.

Rollback is the easy direction: on Postgres Abort, sqlite_heap issues a SQLite ROLLBACK (or, for savepoints, ROLLBACK TO SAVEPOINT) — the SQLite transaction simply never happened. Postgres sub-transactions (savepoints, PL/pgSQL EXCEPTION blocks) map onto SQLite SAVEPOINTs.

11 Connections & error handling

Each backend keeps a thread-local CONNS: HashMap<table-oid, RelConn>. A RelConn is just an open SQLite Connection plus an in_txn flag (the lazy-transaction state from §10) — and it Derefs to the Connection, so call sites read naturally.

SQL statements are run through rusqlite's prepare_cached — it compiles a statement once and reuses it, keyed by the SQL text. Holding the compiled sqlite3_stmts directly on the RelConn would skip even that lookup (~17% faster on the fetch path), but it makes RelConn self-referential — a Statement borrows its Connection — which needs an unsafe lifetime transmute to express. sqlite_heap favours understandability over that speed: prepare_cached is plenty, and there is no non-obvious unsafe anywhere in the crate.

Error handling follows one rule: any rusqlite call that fails uses .unwrap_sql() (a small extension trait), which panics with the SQLite error message. The panic unwinds — running destructors, so the RefCell borrow guards release cleanly — and pgrx's #[pg_guard] boundary converts it into a normal Postgres ERROR. A SQLite failure here (locked database, I/O error, corruption) is genuinely unrecoverable on the spot, so failing the statement loudly is the correct behaviour.

12 Planner statistics & ANALYZE

The Postgres planner needs to know roughly how big a table is to choose a good plan. It asks via the relation_estimate_size callback — on every query plan.

estimate_size answers it plainly: one stat of the SQLite file for a byte size, plus a SELECT count(*) for the row count. That count(*) on every plan isn't free, and it could be cached — but the planner leans on ANALYZE's real statistics anyway, and the plain version is far easier to follow. (This is one of several spots where sqlite_heap deliberately chooses the clear implementation over the fast one.)

ANALYZE is more involved. Postgres drives table sampling through a callback pair (scan_analyze_next_block / scan_analyze_next_tuple) and a "ReadStream" abstraction built for real disk pages — which sqlite_heap doesn't have. The trick: model the whole table as a single logical block, and advance the ReadStream with read_stream_next_block (which moves Postgres's internal sampler counter without doing any actual page I/O). Skipping that step left Postgres dividing by a zero "blocks scanned" count, which produced a nonsense n_distinct of -Infinity. With it, ANALYZE produces correct row counts and column statistics, and the planner can pick good plans.

13 VACUUM

Dead row versions — ones with xmax set, left behind by updates and deletes — accumulate in the SQLite file. VACUUM reclaims them.

Postgres calls our relation_vacuum callback. sqlite_heap computes the "oldest transaction id any running snapshot could still need" (GetOldestNonRemovableTransactionId) and runs:

DELETE FROM storage WHERE xmax != 0 AND xmax < ?oldest_xmin

That removes exactly the versions that are deleted and globally invisible. A small meta-table watermark records how far the last VACUUM reached, so a repeated VACUUM with nothing new to do skips the scan entirely.

There's also index_delete_tuples — Postgres's B-tree calls it during "bottom-up deletion" to ask "which of these index entries point to reclaimable rows?". sqlite_heap answers by batch-checking the rows' xmax values in a single query.

14 How it's tested

Two layers:

  • 51 in-process tests (tests.rs, run by cargo pgrx test). These run real SQL inside a real Postgres backend and assert on results — round-trips, NULL handling, the MVCC rules, rollback, savepoints, index/bitmap scans, a cursor test that proves an old snapshot still sees the pre-update row, ANALYZE statistics, and a set of regression tests, each pinning a specific bug that was found and fixed.
  • A multi-backend concurrency suite (tests/concurrency.sh). The in-process tests each run in a single transaction-scoped backend, so they can't exercise cross-backend behaviour. This script drives 8 real psql connections at once through five phases — concurrent inserts, concurrent updates, delete/re-insert churn, readers-versus-a-writer, and a REPEATABLE READ snapshot held across concurrent updates — then checks the table is consistent and runs PRAGMA integrity_check on the SQLite file.
15 Performance

Measured with pgbench over a warm connection, against the native Postgres heap. Numbers are ratios — lower is better, 1.0× means "as fast as the native heap":

Operationsqlite_heap vs native heapWhy
SELECT (point lookup)~1.0× — effectively matched read-only: no commit fsync, single-copy fetch path
UPDATE~1.5× delete-old + insert-new touches every index; no HOT optimisation (see §9)
DELETE~1.2–1.3× one statement + the write fsync
INSERT~1.3–1.5× the second fsync (§10) — the price of two-engine durability

The headline: reads are essentially free; writes carry a ~1.3–1.5× tax. Part of that is fundamental — fsyncing two write-ahead logs instead of one (§10) — and part of it is a deliberate trade: closing the rest would mean machinery (HOT-style version chains, held prepared statements, extra caches) intricate enough to work against this project's goal of being understandable. A faster build of this idea is possible; this one optimises for being readable.

Caveat — benchmark conditions

The benchmark Postgres is a debug/assertion build, which dilutes the absolute numbers; the ratios are a fair apples-to-apples comparison but a release build would shift them.

16 Limitations — what "production" would still need

This is an honest list. sqlite_heap is a strong prototype; it is not production-ready, and these are why:

  1. No row-level locking. tuple_update/tuple_delete always report success. Two backends updating the same row get a lost update — no serialization failure is raised. The concurrency suite uses disjoint row sets precisely because same-row contention isn't handled. The native heap does proper TM_Updated + EvalPlanQual; sqlite_heap doesn't. This is the most important gap.
  2. Incompatible with streaming replication / PITR. The SQLite files live under $PGDATA, so a filesystem-level base backup catches them — but streaming replication ships the Postgres WAL, which knows nothing about our files. A streaming replica would have the catalog entry for a sqlite_heap table and zero data. This is fundamental to the architecture, not a bug.
  3. Crash safety is reasoned, not crash-tested. The argument in §10 is sound on paper; nobody has actually killed the process mid-commit under power-loss simulation and verified. Also, rows from an aborted insert leak in SQLite — invisible forever, but VACUUM only reaps xmax-set rows.
  4. Several callbacks are simplified. tuple_get_latest_tid is a no-op, tuple_lock (SELECT FOR UPDATE) doesn't really lock, and REINDEX CONCURRENTLY / TABLESAMPLE have documented simplifications.
  5. Unaudited, unsoaked, single-author. No fuzzing, no long-running soak tests, no second pair of eyes, and benchmarks only on a debug Postgres.

If you wanted to actually ship this: row-level locking is the must-fix, and you would have to either accept "no streaming replication" as a documented constraint or rethink where the bytes live.

17 Glossary

TermMeaning
TAMTable Access Method — Postgres's pluggable storage-engine interface; a struct of ~43 function pointers (TableAmRoutine).
heapPostgres's native on-disk table format; also the generic word for "a table's stored rows". sqlite_heap is a replacement for it.
MVCCMulti-Version Concurrency Control — keeping multiple versions of each row so readers never block writers.
xmin / xmaxThe transaction ids that created / deleted a row version. xmax = 0 means "alive".
cmin / cmaxCommand ids within a transaction — for correct visibility between statements of the same transaction.
snapshotA transaction's frozen view of "which transactions had committed" — drives what rows it can see.
TIDTuple identifier — a row's physical address, (block, offset). Indexes store TIDs.
tuplePostgres's word for a row (specifically, a row's on-disk byte image).
slotTupleTableSlot — the container Postgres passes rows around in.
CLOGCommit log — maps transaction id → committed/aborted.
WALWrite-ahead log — durability mechanism; both Postgres and SQLite have one.
HOT updateHeap-Only Tuple update — a native-heap optimisation where an UPDATE that changes no indexed column touches no index. sqlite_heap does not implement it (see §9).
VACUUMBackground reclamation of dead row versions.
pgrxThe Rust framework for writing Postgres extensions; provides the pg_sys bindings, #[pg_guard], the test harness, etc.
rusqliteThe Rust SQLite binding sqlite_heap uses to talk to its storage files.
Specsheet
DocumentSH-EXPLAINER
SchemaVersion 4
Built withRust / pgrx
Backendrusqlite
Modules6 files
Source~3,700 LOC
Callbacks~43 TAM fns
Tests51 in-process
ClassPrototype / Study
Reading Order
1stsqlite.rs
2ndvisibility.rs
3rdtam.rs
Alongside§3 + §10