GitHub - lucasgelfond/puffgres: Logical replication to keep Postgres entities updated in Turbopuffer

Puffgres is a change data capture pipeline that syncs changes in Postgres to Turbopuffer.

Quick Start

brew install puffgres
puffgres init

This will take you through an interactive setup. Puffgres should largely be able to configure itself; it will add a few __puffgres tables to your database for its own maintenance, and a puffgres folder to whatever project you are in to hold its configs.

Why

pgvector is not very good at scale, and there are considerable performance hits to keeping large vectors on a main database instance. Turbopuffer is excellent, fast, and easy to use.

In several projects, I found myself mirroring data from Postgres to Turbopuffer for search. Every time, I wrote clunky, bespoke logic. Usually this meant some sort of tracking row or separate table (updated_in_turbopuffer_at), which effectively meant polling to see if a row had changed on each run of a data pipeline. This is much less efficient than something event-based, and I kept rewriting the same brittle, annoying-to-maintain batching/retry logic. In the toolmaking tradition, I decided I should build something more generic for this purpose, which might also solve others' similar problems.

Fundamentals

Puffgres has two primary surfaces: the CLI, for configuring your puffgres setup, and a runner service that mirrors changes. You run the CLI for dev setup and the runner wherever your main services live.

The two primitives in Puffgres are:

migrations, much like those of a regular database. These are structured as .toml files and are immutable. I felt this was the best solution for configuration, because it signals that changes DO NOT apply retroactively by default.
transforms, a TypeScript API for specifying how rows are changed before they are upserted to Turbopuffer. I did this because I found I often was not simply upserting (or even embedding) rows as-is before they went up. Sometimes I would combine two columns in the text I embedded, add some sort of prompt or guidance before embedding, truncate it (based on tokenization), or use nonstandard embedding models. Leaving these steps as (highly flexible) code makes them easy to maintain.
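
To make the transform idea concrete, here is a minimal sketch of the kind of row-shaping logic described above: combining two columns, prepending a prompt, and truncating before embedding. It is not the puffgres transform API. The ArticleRow and SearchDocument types, the function names, the prompt text, and the character-based limit (a stand-in for real token-based truncation) are all assumptions made for illustration.

```typescript
// Hypothetical row shape for illustration; not a puffgres type.
interface ArticleRow {
  id: number;
  title: string;
  body: string;
}

// Hypothetical shape of the document that would later be embedded and upserted.
interface SearchDocument {
  id: number;
  embeddingInput: string;
}

// Assumed guidance text prepended before embedding.
const PROMPT_PREFIX = "Represent this article for retrieval: ";

// Assumed limit. Real truncation would count tokens with the embedding
// model's tokenizer; characters are used here to keep the sketch self-contained.
const MAX_CHARS = 2000;

// Combine two columns, add the prompt, and truncate to the limit.
function buildEmbeddingInput(row: ArticleRow, maxChars: number = MAX_CHARS): string {
  const combined = `${row.title}\n\n${row.body}`;
  const text = PROMPT_PREFIX + combined;
  return text.length > maxChars ? text.slice(0, maxChars) : text;
}

// Turn a Postgres row into the document to be embedded and upserted.
function transformRow(row: ArticleRow): SearchDocument {
  return { id: row.id, embeddingInput: buildEmbeddingInput(row) };
}
```

Keeping this logic in a plain function means the prompt, the choice of columns, or the truncation rule can each change in one place without touching the sync machinery.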

Acknowledgements

This project was inspired by reading Martin Kleppmann’s Designing Data-Intensive Applications and, in particular, his thinking around unbundling databases and using change data capture in Turning the database inside out with Apache Samza.

This package makes extensive use of prior art; wal2json, pgwire-replication, and supabase/etl were particularly useful.

There were no good turbopuffer Rust clients (the others I saw were largely untyped), so I cloned the Go/TS/Ruby/Java clients and had Claude build (and test!) a Rust one. I figured it was better to split it out than to keep it in this repo, because it is more broadly useful. You can find it at rs-puff or install it with cargo add rs-puff.

Areas of future work (open to PRs!)

Support for package managers other than pnpm
