FRESH

Hacker News

Home

Show HN: Turbolite – a SQLite VFS serving sub-250ms cold JOIN queries from S3

174 points by russellthehippo

by carlsverre

1 subcomments

You might be interested in taking a look at Graft (https://graft.rs/). I have been iterating in this space for the last year, and have learned a lot about it. Graft has a slightly different set of goals, one of which is to keep writes fast and small and optimize for partial replication. That said, Graft shares several design decisions, including the use of framed ZStd compression to store pages.
I do like the B-tree aware grouping idea. This seems like a useful optimization for larger scan-style workloads. It helps eliminate the need to vacuum as much.
Have you considered doing other kinds of optimizations? Empty pages, free pages, etc.

by edinetdb

0 subcomment

The "mostly-cold databases" framing maps well to something I ran into with financial data APIs: most listed companies are queried rarely — the long tail of small-caps sees < 1 req/day, while the top ~200 companies take the majority of traffic.
We ended up going in a different direction: BigQuery as canonical store with Gold-layer summary tables (~26MB) loaded into memory on startup and refreshed every 30 minutes. The in-memory path keeps p99 under 100ms for the hot tier, and BQ handles the cold tail without S3 round-trips. Simple to reason about, though it only works because the Gold tables are small enough to fit in memory.
Turbolite feels like it targets a different point in the design space — where the data is too large to cache fully but still read-heavy. The write amplification trade-off seems acceptable for append-only daily ETL workloads where you're writing once and reading many times. Is the target workload truly immutable-after-creation, or do you expect point updates?

by wgjordan

1 subcomments

Nice set of experiments! I appreciate that you're running benchmarks on real object storage setups to validate rapid design variations. (Meta-note: I love how agents have recently made this kind of experimental-research work possible with much less human time investment.)
I've been doing some experiments of my own in a relatively similar space, also focusing on S3/Tigris-backed SQLite on ephemeral compute, also with B-tree aware prefetching (see https://github.com/wjordan/sqlite-prefetch).
I think the idea of storing grouped pages together to optimize read-locality is interesting. Note that it steers in the opposite direction of the temporal locality that a format like LTX/Litestream uses to provide transaction-aware features like point-in time restore. The tradeoff also involves significantly greater write amplification (re-upload the entire page group every time a single page dirties), heavily favoring cold-read-heavy workloads over mixed-write or warm-read workloads.
The query-plan frontrunning is a very novel experiment as well, discovering in advance that SQLite is about to run a full-table scan seems like a very useful optimization hint to work with. I'd love to see experiments validating how much of an improvement that offers compared to simple reactive prefetch (which takes at least a couple page faults to get up to speed).

by russellthehippo

1 subcomments

A bit more color on what I found interesting building this:
The motivating question for me was less “can SQLite read over the network?” and more “what assumptions break once the storage layer is object storage instead of a filesystem?”
The biggest conceptual shift was around *layout*.
What felt most wrong in naive designs was that SQLite page numbers are not laid out in a way that matches how you want to fetch data remotely. If an index is scattered across many unrelated page ranges, then “prefetch nearby pages” is kind of a fake optimization. Nearby in the file is not the same thing as relevant to the query.
That pushed me toward B-tree-aware grouping. Once the storage layer starts understanding which table or index a page belongs to, a lot of other things get cleaner: more targeted prefetch, better scan behavior, less random fetching, and much saner request economics.
Another thing that became much more important than I expected is that *different page types matter a lot*. Interior B-tree pages are tiny in footprint but disproportionately important, because basically every query traverses them. That changed how I thought about the system: much less as “a database file” and much more as “different classes of pages with very different value on the critical path.”
The query-plan-aware “frontrun” part came from the same instinct. Reactive prefetch is fine, but SQLite often already knows a lot about what it is about to touch. If the storage layer can see enough of that early, it can start warming the right structures before the first miss fully cascades. That’s still pretty experimental, but it was one of the more fun parts of the project.
A few things I learned building this:
1. *Cold point reads and small joins seem more plausible than I expected.* Not local-disk fast, obviously, but plausible for the “many mostly-cold DBs” niche.
2. *The real enemy is request count more than raw bytes.* Once I leaned harder into grouping and prefetch by tree, the design got much more coherent.
3. *Scans are still where reality bites.* They got much less bad, but they are still the place where remote object storage most clearly reminds you that it is not a local SSD.
4. *The storage backend is super important.* Different storage backends (S3, S3 Express, Tigris) have verg different round trip latencies and it's the single most important thing in determining how to tune prefetching.
Anyway, happy to talk about the architecture, the benchmark setup, what broke, or why I chose this shape instead of raw-file range GETs / replication-first approaches / etc.

by agosta

2 subcomments

This is awesome! With all of the projects/teams working on improving sqlite, it feels like it's just a matter of time before it becomes a better default than postgres for serious projects.
I do wonder - for projects that do ultimately enforce single writer sqlite setups - it still feels to me as if it would always be better to keep the sqlite db local (and then rsync/stream backups to whatever S3 storage one prefers).
The nut I've yet to see anyone crack on such setup is to figure out a way to achieve zero downtime deploys. For instance, adding a persistent disk to VMs on Render prevents zero downtime deploys (see https://render.com/docs/disks#disk-limitations-and-considera...) which is a real unfortunate side effect. I understand that the reason for this is because a VM instance is attached to the volume and needs to be swapped with the new version of said instance...
There are so many applications where merely scaling up a single VM as your product grows simplifies devops / product maintenance so much that it's a very compelling choice vs managing a cluster/separate db server. But getting forced downtime between releases to achieve that isn't acceptable in a lot of cases.
Not sure if it's truly a cheaply solvable problem. One potential option is to use a tool like turbolite as a parallel data store and, only during deployments, use it to keep the application running for the 10 to 60 seconds during a release swap. During this time, writes to the db are slower than usual but entirely online. And then, when your new release is live, it can sync the difference of data written to s3 back to the local db. In this way, during regular operation, we get the performance of local IO and fallback onto s3 backed sqlite during upgrades for persistent uptime.
Sounds like a fraught thing to build. But man it really is hard/impossible to beat the speed of local reads!

by russellthehippo

0 subcomment

Also I want to acknowledge the other projects in adjacent parts of this space — raw SQLite range-request VFSes, Litestream/LiteFS-style replication approaches, libSQL/Turso, Neon, mvsqlite, etc. I took a lot of inspiration from them, thanks!

by bob1029

2 subcomments

Have you considered using techniques like conditional PUT to enable multiple writers?
https://aws.amazon.com/about-aws/whats-new/2024/08/amazon-s3...
https://docs.aws.amazon.com/AmazonS3/latest/userguide/condit...

by manudaro

1 subcomments

Sub 250ms for cold queries from S3 is impressive, but curious about the consistency of those numbers. Are you doing any prefetching of table schemas or statistics? With geographic datasets we often see huge variance in S3 latency depending on object size and region - a 10Mb spatial index file might take 400ms to fetch while smaller lookup tables stay under 100ms.

by swaminarayan

1 subcomments

Very cool approach. I’m working on a related KV storage project (snkv - https://github.com/hash-anu/snkv)
How do you handle manifest consistency and updates—append-only or rewrite-heavy?

by ovaistariq

1 subcomments

Ovais - co-founder of Tigris here.
This is very cool. I have been thinking about embedded databases running on Tigris. Specially from an agent perspective, agents can suspend and continue their sessions. Would love to collaborate.

by mememememememo

1 subcomments

It's great. Well done. 250ms is slow for a DB query though but for what you are doing and how you donit it is surprisingly fast.

by pdyc

1 subcomments

this is a great project, does it support wasm? i want to use it in browser with sqlite wasm.

by alex_hirner

2 subcomments

What are your thoughts on eviction, re how easy to add some basic policy?

by michaeljelly

0 subcomment

Really cool

by jijji

2 subcomments

i wonder how much that costs per hour to run any normal load? what benefit does this have versuss using mysql (or any similar rdbms) for the queries? mysql/pgsql/etc is free remember, so using S3 obviously charges by the request, or am i wrong?

by inferense

0 subcomment

very cool!

by ryanjso

0 subcomment

super sick

by yalvhe2009

0 subcomment

[dead]

by yalvhe2009

0 subcomment

[dead]

by edinetdb

0 subcomment

[flagged]