Everything sqlite_heap adds to a Postgres database, function by
function. Eight pg_extern functions, four SQL-level objects, and the 44
TableAmRoutine callbacks — that is the entire surface. This page walks each
one: its signature, what it does, and how it is implemented.
The named functions split three ways. Inspection functions are
read-only windows onto what sqlite_heap keeps on disk — call them from psql to
poke at MVCC state. Lifecycle is the single plumbing function that cleans up
files. Registration is the SQL emitted at CREATE EXTENSION time that
wires the access method into Postgres. Below all of that, the TAM callbacks
(§04) are the real engine — the function pointers Postgres itself calls for every scan,
fetch, insert, vacuum, and plan.
Every inspection function takes a table OID — pass a table name with
::regclass, e.g. sqlite_heap_storage('demo'::regclass). For how
the engine works underneath, see the Engineering Reference.
| Function | Returns | Group |
|---|---|---|
sqlite_heap_physical_rows(oid) | bigint | Inspection |
sqlite_heap_live_rows(oid) | bigint | Inspection |
sqlite_heap_file_path(oid) | text | Inspection |
sqlite_heap_file_size(oid) | bigint | Inspection |
sqlite_heap_schema_version(oid) | integer | Inspection |
sqlite_heap_storage(oid) | TABLE(…) | Inspection |
sqlite_heap_files() | TABLE(…) | Inspection |
sqlite_heap_drop_storage(oid) | void | Lifecycle |
sqlite_heap_handler(internal) | table_am_handler | Registration |
sqlite_heap_drop_handler() | event_trigger | Registration |
Six read-only functions that expose what sqlite_heap keeps on disk. None of them write; all of them are thin wrappers over helpers in sqlite.rs.
sqlite_heap_physical_rows(rel oid) RETURNS bigint
The number of rows physically present in the SQLite file backing rel
— including dead, xmax-set versions. Compare against
sqlite_heap_live_rows to see how much dead weight a table
is carrying before a VACUUM.
Implementation
One line: sqlite::select_all(rel).len(). select_all runs
SELECT rowid, xmin, cmin, xmax, cmax, tuple FROM storage ORDER BY rowid and
collects every physical row version into a Vec<StoredRow>; the function
returns its length. It does not filter on visibility — that is the whole point.
sqlite_heap_live_rows(rel oid) RETURNS bigint
The number of live rows in the SQLite file — versions with
xmax = 0, i.e. no deleter stamped on them.
Implementation
The same sqlite::select_all(rel) dump, then
.iter().filter(|r| r.xmax == 0).count(). Note this is a physical
liveness test, not a snapshot-relative one: a row whose xmax belongs to a
transaction that later aborted still counts as alive here, because xmax = 0 is
the only thing checked. True per-snapshot visibility lives in
visibility.rs.
sqlite_heap_file_path(rel oid) RETURNS text
The on-disk path of the SQLite file backing rel:
$PGDATA/sqlite_heap/<database-oid>/<table-oid>.sqlite. Hand this
straight to the sqlite3 CLI to open the file yourself.
Implementation
sqlite::file_path → path_for →
with_dir(|d| d.join("{rel_id}.sqlite")). with_dir computes the
directory $PGDATA/sqlite_heap/<MyDatabaseId>/ once per backend, caches it in
a thread-local (SQLITE_DIR), and create_dir_alls it. The result is
returned as a lossy UTF-8 String.
This returns the path a table's file would have; it does not check the file
exists. Calling it does create the per-database directory as a side effect, but never the
.sqlite file itself.
sqlite_heap_file_size(rel oid) RETURNS bigint
Size in bytes of the SQLite file backing rel, or 0 if the file
has not been created yet.
Implementation
std::fs::metadata(sqlite::file_path(rel)).map(|m| m.len()).unwrap_or(0). A
plain stat on the path from file_path; any
error (most commonly "file does not exist") collapses to 0. This is also the byte
figure the planner's relation_estimate_size callback reports — see
Engineering Reference §12.
sqlite_heap_schema_version(rel oid) RETURNS integer
The SQLite schema version stamped into rel's file — the value of
PRAGMA user_version. The current on-disk schema is version 4.
Implementation
with_conn(rel, |conn| conn.query_row("PRAGMA user_version", …).unwrap_or(0)).
Going through with_conn means the file is opened lazily on first touch — and
open_conn will install_schema a brand-new file, or
migrate_schema an older one, before the query runs. So in practice this almost
always returns 4; a lower number would mean a file mid-migration, and
0 means the pragma could not be read at all.
sqlite_heap_storage(rel oid)
RETURNS TABLE(rowid bigint, xmin bigint, cmin bigint,
xmax bigint, cmax bigint, tuple_bytes integer, live boolean)
A raw dump of the SQLite storage table for rel — every
physical row version, with the per-row MVCC header columns Postgres normally keeps out of
sight. This is the function that powers the "reveal" in the demo: an UPDATE plus a
DELETE leaving four physical rows where SELECT shows two.
Example
SELECT * FROM sqlite_heap_storage('demo'::regclass);
rowid | xmin | cmin | xmax | cmax | tuple_bytes | live
-------+---------+------+---------+------+-------------+------
1 | 3851268 | 0 | 0 | 0 | 30 | t
2 | 3851268 | 0 | 3851269 | 0 | 30 | f
3 | 3851268 | 0 | 3851270 | 0 | 30 | f
4 | 3851269 | 0 | 0 | 0 | 30 | t
Implementation
sqlite::select_all(rel) for the row dump, then each StoredRow is
mapped into the output tuple. Two columns are derived rather than stored:
tuple_bytes is r.tuple.len() — the size of the opaque
Postgres tuple blob, never its contents — and live is the boolean
r.xmax == 0. The result is returned as a pgrx TableIterator, so
Postgres sees it as a set-returning function.
sqlite_heap_files()
RETURNS TABLE(table_oid oid, size_bytes bigint)
Every sqlite_heap file in the current database's storage directory.
Unlike the other functions it takes no argument — it lists the whole directory. Join
table_oid to pg_class.oid to recover table names:
Example
SELECT c.relname, f.size_bytes
FROM sqlite_heap_files() f
JOIN pg_class c ON c.oid = f.table_oid
ORDER BY 1;
Implementation
sqlite::list_files() reads the directory
$PGDATA/sqlite_heap/<db-oid>/ via with_dir. For each entry it
keeps only files with a .sqlite extension, parses the file stem as a
u32 table OID (skipping anything that does not parse — e.g.
-wal/-shm siblings), stats each for its byte size, and
sorts the results. lib.rs then maps the raw u32 back into a
pg_sys::Oid for the table_oid column.
One function. It is exposed as a pg_extern only because an event trigger has to be
able to call it — it is plumbing, not an inspection tool.
sqlite_heap_drop_storage(rel oid) RETURNS void
Schedules the SQLite file that backed a now-dropped table to be unlinked. It is called from
the sql_drop event trigger (see below), once per
dropped table. Safe to call for any OID — a non-sqlite_heap table simply has no
file at our path, and the unlink is a harmless no-op.
Implementation
sqlite::drop_relation(rel) does not unlink immediately. It
calls register_xact_callback(Commit, …) — registering a callback that runs
only if and when the surrounding transaction commits. That callback,
unlink_storage, removes the connection from the per-backend CONNS map
and unlinks the .sqlite file together with its -wal/-shm
siblings.
A DROP TABLE can be rolled back. If the file were unlinked the instant the
command ran, a ROLLBACK would leave Postgres with a live catalog entry pointing
at a deleted file. Deferring the unlink to the Commit xact callback means a
rolled-back DROP leaves the file completely intact — the callback never
fires.
These objects are created by extension_sql! blocks in
lib.rs when you run CREATE EXTENSION pg_sqlite_heap. They
are what actually wire sqlite_heap into Postgres — everything above is just observability on
top of this.
CREATE FUNCTION sqlite_heap_handler(internal)
RETURNS table_am_handler
LANGUAGE c AS 'MODULE_PATHNAME', 'sqlite_heap_handler';
The access-method handler. Postgres calls it once to obtain the function-pointer table that defines the storage engine.
Implementation
A C-ABI function (PG v1 calling convention) whose entire body is
Datum::from(tam::am_routine() as *const pg_sys::TableAmRoutine) — it returns
a pointer to sqlite_heap's static, filled-in TableAmRoutine struct. That struct,
built in tam.rs, holds the ~43 callbacks Postgres invokes for scans,
inserts, updates, vacuum, and so on. See
Engineering Reference §2.
CREATE ACCESS METHOD sqlite_heap
TYPE TABLE HANDLER sqlite_heap_handler;
Registers sqlite_heap as a named table access method backed by the handler
above. After this runs, CREATE TABLE … USING sqlite_heap routes that table's
storage through sqlite_heap's callbacks instead of the native heap.
This and the handler are emitted together in the extension_sql! block named
sqlite_heap_access_method.
CREATE FUNCTION sqlite_heap_drop_handler() RETURNS event_trigger
LANGUAGE plpgsql AS $$ … $$;
CREATE EVENT TRIGGER sqlite_heap_on_drop
ON sql_drop EXECUTE FUNCTION sqlite_heap_drop_handler();
A PL/pgSQL event-trigger function plus the trigger that fires it. Together they make sure the SQLite file is cleaned up whenever a table is dropped.
Implementation
The trigger fires on the sql_drop event. The handler loops over
pg_event_trigger_dropped_objects(), filters to rows where
object_type = 'table', and calls
sqlite_heap_drop_storage(obj.objid) for each —
which (as above) schedules the unlink for commit time. The extension_sql! block
is named sqlite_heap_drop_trigger and declares
requires = [sqlite_heap_drop_storage] so pgrx emits the Rust function first.
The trigger does not (and cannot easily) check whether a dropped table used
sqlite_heap. It calls sqlite_heap_drop_storage for every
dropped table; the function is a no-op for tables that have no file at our path. Simpler, and
correct.
The 44 function pointers that make up sqlite_heap's TableAmRoutine — the
struct returned by sqlite_heap_handler. Postgres calls these,
never you; they are the entire storage engine. Each is a #[pg_guard]'d
extern "C-unwind" function in tam.rs that wraps the raw
pointers Postgres hands it in ffi::Rel / ffi::Slot and delegates to a
safe inner function. They are grouped below exactly as they are wired up in
am_routine().
Every scan callback operates on a SqliteScanState (a palloc'd struct Postgres
treats as the opaque TableScanDesc) wrapping a boxed SqliteCursor. The
cursor holds the relation's rows in a Vec<StoredRow>, loaded lazily
by ensure_rows_loaded — a sequential, sample, or analyze scan needs the whole
table; an exact bitmap scan must not pay for it. It also carries the per-page bitmap-scan state
and the one-logical-block flag used by sampling.
slot_callbacks(rel) -> *const TupleTableSlotOps
Returns &TTSOpsHeapTuple. sqlite_heap reconstructs every row as a standard
Postgres HeapTuple from its stored bytes, so all slots use the built-in
heap-tuple slot ops — no custom slot type.
scan_begin(rel, snapshot, nkeys, key, pscan, flags) -> TableScanDesc
Allocates a SqliteScanState with palloc0, boxes a fresh
SqliteCursor inside it, and stashes the snapshot, scan keys, and flags into the
base descriptor. Rows are not loaded here — that is deferred to first use.
scan_end(scan)
Tears the scan down: reclaims the boxed SqliteCursor (running its destructor)
and pfrees the state struct.
scan_rescan(scan, key, set_params, allow_strat, allow_sync, allow_pagemode)
Restarts the scan from the top: resets the cursor position to 0. The loaded
rows Vec is kept — a rescan re-walks the same snapshot's data.
scan_getnextslot(scan, direction, slot) -> bool
The sequential-scan workhorse. On first call ensure_rows_loaded runs
sqlite::select_all to pull every physical row version into the cursor. Each call
then walks forward from cursor.pos, running
visibility::row_visible(header, snapshot) on each version; the first one the
snapshot should see is decoded into the slot and true is returned. Returns
false at end of table.
scan_set_tidrange(scan, mintid, maxtid)
For WHERE ctid BETWEEN … scans: stores the min/max TID bounds in the scan
descriptor and resets the cursor position, so the next getnextslot_tidrange call
starts fresh within the new range.
scan_getnextslot_tidrange(scan, direction, slot) -> bool
Like scan_getnextslot, but additionally skips any row whose rowid
falls outside the [min, max] range stashed by scan_set_tidrange
before applying the visibility check.
parallelscan_estimate(rel) -> Size
parallelscan_initialize(rel, pscan) -> Size
parallelscan_reinitialize(rel, pscan)
All three delegate straight to Postgres's generic block-based parallel-scan helpers
(table_block_parallelscan_*). sqlite_heap does not actually divide a scan across
workers — but providing valid implementations means a parallel plan over a sqlite_heap
table is well-formed rather than a crash.
index_fetch_begin(rel) -> *mut IndexFetchTableData
palloc0s an IndexFetchTableData and stores the relation pointer in
it. There is no per-scan SQLite state to set up — each fetch is an independent
point lookup.
index_fetch_reset(data)
Empty. Nothing is cached between fetches, so there is nothing to reset.
index_fetch_end(data)
pfrees the IndexFetchTableData allocated by
index_fetch_begin.
index_fetch_tuple(scan, tid, snapshot, slot, call_again, all_dead) -> bool
The hot index-scan path. Unpacks the index-supplied TID to a SQLite rowid via
ffi::tid_to_rowid, calls sqlite::fetch_one_ref for a zero-copy view
of that row, checks visibility::row_visible, and stores it into the slot if
visible. Sets call_again = false and all_dead = false —
sqlite_heap has no HOT-style version chains, so each TID is a single independent version.
tuple_fetch_row_version(rel, tid, snapshot, slot) -> bool
Fetches one specific row version by TID (used for ctid re-fetches, trigger
re-reads, and similar). Same fetch_one_into_slot path as
index_fetch_tuple: unpack TID → read by rowid → visibility-check →
store.
tuple_tid_valid(scan, tid) -> bool
Does a row physically exist at this TID? Unpacks the TID and returns
sqlite::select_one(rowid).is_some() — a physical-presence test, no
visibility involved.
tuple_get_latest_tid(scan, tid)
A no-op. The native heap walks an in-place update chain to find a row's newest version; sqlite_heap has no such chains (an UPDATE writes an independent new row), so "the latest TID" is just the TID passed in. This is a documented simplification — see Engineering Reference §16.
tuple_satisfies_snapshot(rel, slot, snapshot) -> bool
Re-checks whether the row currently in a slot is visible to a given snapshot. Reads the
slot's stored TID, re-fetches that row by rowid with sqlite::select_one, and runs
visibility::row_visible against it; false if the row is gone.
index_delete_tuples(rel, delstate) -> TransactionId
btree's "bottom-up deletion" hook: given a batch of index TIDs from one leaf page, mark
which point at reclaimable rows. sqlite_heap batch-reads the xmax of every rowid
in one WHERE rowid IN (…) query (select_xmax_batch), then per entry
decides deletable = the row is physically gone, or its xmax is
committed and older than the global oldest snapshot (the same bar as VACUUM). It sets
status[id].knowndeletable accordingly; if nothing is deletable it shrinks
ndeltids to 0 (otherwise btree's
_bt_delitems_delete would assert). Returns the newest reclaimed xid for the
recovery-conflict horizon.
tuple_insert(rel, slot, cid, options, bistate)
Materializes the slot's row as raw bytes (copy_heap_tuple), reads the current
transaction id, calls sqlite::insert(rel, xid, cid, bytes), and stamps the
SQLite-assigned rowid back onto the slot as a TID so indexes / RETURNING /
triggers see where it landed.
tuple_insert_speculative(rel, slot, cid, options, bistate, spec_token)
For INSERT … ON CONFLICT. Inserts the row exactly like a normal
tuple_insert; if the speculation loses,
tuple_complete_speculative retracts it.
tuple_complete_speculative(rel, slot, spec_token, succeeded)
If succeeded, nothing to do — the speculative row is now a normal row. If
the speculation lost, sqlite::physical_delete removes the speculative row
outright (a real delete, not an xmax stamp, since it never logically existed).
multi_insert(rel, slots, nslots, cid, options, bistate)
The COPY / multi-row INSERT path. Materializes every slot once,
then issues a single batched sqlite::insert_batch rather than N trips through the
per-row path, and stamps each assigned TID back onto its slot.
tuple_delete(rel, tid, cid, snapshot, crosscheck, wait, tmfd, changing_part) -> TM_Result
An MVCC delete: sqlite::set_xmax stamps the row at tid dead with
our xid/cid (it stays physically on disk). Returns TM_Ok, or
TM_Deleted if no live row was there to stamp.
tuple_update(rel, otid, slot, cid, snapshot, crosscheck, wait, tmfd, lockmode, update_indexes) -> TM_Result
An MVCC update = delete + insert. sqlite::update_row stamps the old row
(otid) dead and inserts the new version at a fresh rowid; the new TID is stamped
onto the slot, and update_indexes is set to TU_All so Postgres
repoints every index. Returns TM_Ok. No HOT optimisation — see
Engineering Reference §9.
tuple_lock(rel, tid, snapshot, slot, cid, mode, wait_policy, flags, tmfd) -> TM_Result
SELECT FOR UPDATE / row locking. Minimal: it re-fetches the row into the slot
and returns TM_Ok (or TM_Deleted). It holds no real row
lock — that would need a lock table — so it is best-effort under multi-backend
contention. This is the most significant documented limitation; see
Engineering Reference §16.
finish_bulk_insert(rel, options)
Empty. There is no buffered bulk-insert state to flush —
multi_insert already commits each batch through the normal SQLite path.
relation_set_new_filelocator(rel, newrlocator, persistence, freeze_xid, minmulti)
Called when a table needs fresh storage — CREATE TABLE, and the
rewriting form of TRUNCATE. Calls sqlite::reset, which opens (and, if
new, creates + schema-installs) the SQLite file and empties its storage table.
See Engineering Reference §6.
relation_nontransactional_truncate(rel)
The fast TRUNCATE path, taken when the table was created in the same
transaction. Also just sqlite::reset — empty the storage
table.
relation_copy_data(rel, newrlocator)
A no-op. This fires for ALTER TABLE … SET TABLESPACE, but sqlite_heap's files
are keyed by relation OID — which is stable across a tablespace move — so the data
stays exactly where it is.
relation_copy_for_cluster(old, new, old_index, use_sort, oldest_xmin, …, num_tuples, tups_vacuumed, tups_recently_dead)
CLUSTER / VACUUM FULL. The new table's storage was already created
fresh by relation_set_new_filelocator; this streams every live row
(xmax = 0) from the old file into the new one with sqlite::insert,
counting live rows into num_tuples and dead ones into
tups_vacuumed.
relation_vacuum(rel, params, bstrategy)
Computes the global "oldest transaction id any snapshot could still need"
(GetOldestNonRemovableTransactionId) and calls sqlite::vacuum_dead,
which physically DELETEs every row whose xmax is set, committed, and
older than that bound. See Engineering Reference
§13.
scan_analyze_next_block(scan, stream) -> bool
sqlite_heap models the whole table as a single logical block. The subtle part: it still
must advance the ReadStream with read_stream_next_block
(which moves the sampler counter acquire_sample_rows divides by, doing no buffer
I/O). Skipping it left that counter at 0, so ANALYZE extrapolated 0 rows and
n_distinct came out -Infinity. Resets the cursor and returns
true once, then false.
scan_analyze_next_tuple(scan, oldest_xmin, liverows, deadrows, slot) -> bool
Walks the cursor's physical rows; each live row (xmax = 0) is stored into the
slot and counted into liverows, each dead one into deadrows. The
sampled live rows feed Postgres's column-statistics estimator.
index_build_range_scan(table, index, index_info, …, callback, callback_state, scan) -> f64
Powers CREATE INDEX. Scans every live row of the table, builds the index datums
for each with FormIndexDatum, and invokes the supplied btree callback with the
row's (rowid → TID). Returns the count of indexed tuples. Visibility is
deliberately not checked here — it is applied later at index-fetch time.
index_validate_scan(table, index, index_info, snapshot, state)
The validate pass of REINDEX CONCURRENTLY. A no-op: index_build_range_scan
already saw every committed row. (Limitation: a multi-writer REINDEX CONCURRENTLY
may miss rows committed mid-build.)
relation_size(rel, fork) -> u64
sqlite_heap has no real Postgres relfile. It reports the main fork as exactly one
BLCKSZ block — matching the single logical block
scan_analyze_next_block yields, so ANALYZE's row-count arithmetic works —
and every other fork (FSM, visibility map, init) as 0.
relation_needs_toast_table(rel) -> bool
Always false. Row bytes are stored as SQLite BLOBs with no size
ceiling, so there is never a need for a Postgres TOAST side-table.
relation_toast_am(rel) -> Oid
Returns InvalidOid — there is no TOAST access method, consistent with
relation_needs_toast_table returning false.
relation_fetch_toast_slice(toastrel, valueid, attrsize, sliceoffset, slicelength, result)
Empty. Since sqlite_heap never creates a TOAST table, Postgres never has a reason to call this — it is present only to keep the routine table complete.
relation_estimate_size(rel, attr_widths, pages, tuples, allvisfrac)
The planner's per-plan size hook, called on every query plan. sqlite::estimate_size
returns the file's byte size and row count; the bytes are converted to a page count
(div_ceil(8192), minimum 1). allvisfrac is set to 1.0
— sqlite_heap keeps no visibility map, and claiming "all visible" keeps index-only scans
from being penalized. See Engineering Reference
§12.
scan_bitmap_next_tuple(scan, slot, recheck, lossy_pages, exact_pages) -> bool
Drives a bitmap heap scan. It drains the current page's offsets, fetching each row
visibility-filtered; when a page is exhausted it pulls the next from the TID-bitmap iterator.
For an exact page it extracts the page's offsets with
tbm_extract_page_tuple; for a lossy page (which only names a block) it
enumerates every loaded rowid that maps into that block. Tallies
lossy_pages / exact_pages as it goes.
scan_sample_next_block(scan, scanstate) -> bool
For TABLESAMPLE. The whole table is one logical block: this yields it exactly
once (setting sample_block_consumed and resetting the cursor), then returns
false.
scan_sample_next_tuple(scan, scanstate, slot) -> bool
Yields every visible live row of the table. TABLESAMPLE on sqlite_heap
therefore returns the whole table rather than the requested fraction —
imprecise, but never empty. A documented simplification.