Hacker News

Do you even need a database?

148 points by upmostly

by devilsdata

0 subcomment

This is a cool exercise, but I would hesitate to choose files over SQLite or another Dockerised relational database in production.
They are overoptimising for the simplest part of writing the application; the beginning. They've half-implemented an actual database, with none of the safety features. There are a lot of potential headaches that this article has avoided talking about; perhaps because they haven't experienced them yet.
See: https://danluu.com/file-consistency/
What happens when you need to start expanding the scope of this feature? Joining users on profiles, or users on orgs?
Ask yourself: how many shops have seriously written an application backed by files and stuck with it over the long-run? The answer is likely very few. Therefore, this is likely doubling up the work required.
There is a reason people reach for a database first. I'd strongly encourage anyone to avoid doing stuff like this.

by ozgrakkurt

7 subcomments

You need databases if you need any kind of atomicity. Doing atomic writes is extremely fragile if you are just on top of the filesystem.
This is also why many databases have persistence issues and can easily corrupt on-disk data on crash. RocksDB on Windows a couple of years back is a simple example: it regularly had corruption issues when doing development with it.
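To make the atomicity point concrete, here is a minimal sketch of the usual write-to-temp-then-rename pattern on a plain filesystem. It is not from the article or from any commenter; the function name `atomic_write_json` is hypothetical, and the sketch assumes a POSIX-like filesystem where a rename within one directory is atomic. Even this much leaves out concurrent writers and partial multi-file updates, which is part of why it is fragile.

```python
import json
import os
import tempfile


def atomic_write_json(path, data):
    """Replace `path` with `data` so readers see the old or the new file, never half of one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(data, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # ask the OS to push the contents to stable storage
        os.replace(tmp_path, path)  # swap the new file into place
    except BaseException:
        # Remove the temp file if anything failed before the rename happened.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Also fsync the directory so the rename itself is durable (skipped where unsupported).
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

Calling `atomic_write_json("users.json", {"id": 1})` rewrites the whole file each time, so this only suits small files with a single writer.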

by z3ugma

8 subcomments

At some point, don't you just end up making a low-quality, poorly-tested reinvention of SQLite by doing this and adding features?

by koliber

3 subcomments

I love this article as it shows how fast computers really are.
There is one conclusion that I do not agree with. Near the end, the author lists cases where you will outgrow flat files. He then says that "None of these constraints apply to a lot of applications."
One of the constraints is "Multiple processes need to write at the same time." It turns out many early stage products need crons and message queues that execute on a separate worker. These multiple processes often need to write at the same time. You could finagle it so that the main server is the only one writing, but you'd introduce architectural complexity.
So while from the pure scale perspective I agree with the author, if you take a wider perspective, it's best to go with a database. And sqlite is a very sane choice.
If you need scale, cache the most often accessed data in memory and you have the best of both worlds.
My winning combo is sqlite + in-memory cache.
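A rough sketch of what an "sqlite + in-memory cache" combo could look like, using only the Python standard library. The `users` table, the function names, and the coarse clear-everything invalidation are assumptions made for illustration, not the commenter's actual setup.

```python
import sqlite3
from functools import lru_cache

# An in-memory database keeps the sketch self-contained; a real app would pass a file path.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)", [(1, "ada"), (2, "grace")])
conn.commit()


@lru_cache(maxsize=1024)
def get_user_name(user_id):
    """Read-through cache: hit SQLite only on a cache miss."""
    row = conn.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()
    return row[0] if row else None


def rename_user(user_id, new_name):
    """Write to SQLite first, then drop cached reads so they cannot go stale."""
    with conn:  # commits on success, rolls back if the statement raises
        conn.execute("UPDATE users SET name = ? WHERE id = ?", (new_name, user_id))
    get_user_name.cache_clear()  # coarse invalidation: clears every cached entry
```

The cache here lives inside one process, so separate worker processes would each hold their own copy; SQLite remains the source of truth.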

by kabir_daki

2 subcomments

We built a PDF processing tool and faced this exact question early on.
For our use case — merge, split, compress — we went fully stateless. Files are processed in memory and never stored. No database needed at all.
The only time a database becomes necessary is when you need user accounts, history, or async jobs for large files. For simple tools, a database is often just added complexity.
The real question isn't "do you need a database" but "do you need state" — and often the answer is no.

by allthetime

0 subcomment

Funny, I was just hard at work on my new article “Just use a database”.

by orthogonal_cube

0 subcomment

SQLite did decently well but I think they should’ve done an additional benchmark with the database loaded completely into memory.
Since they’re using Go to accept requests and forwarding them to their SQLite connection, it may have been worthwhile to produce the same interface with Rust to demonstrate whether or not SQLite itself was hitting its performance limit or if Go had some hand in that.
Other than that, it’s a good demonstration of how a custom solution for a lightweight task can pay off. Keep it simple but don’t reinvent the wheel if the needs are very general.

by forinti

1 subcomments

Many eons ago I wrote a small sales web application in Perl. I couldn't install anything on the ISP's machine, so I used file-backed hashes: one for users, one for orders, another for something else.
As the years went by, I expected the client to move to something better, but he just stuck with it until he died after about 20 years; the family then took over and had everything redone (it now runs Wordpress).
The last time I checked, it had hundreds of thousands of orders and still had good performance. The evolution of hardware made this hack keep its performance well past what I had expected it to endure. I'm pretty sure SQLite would be just fine nowadays.

by shafoshaf

1 subcomments

Relational Databases Aren’t Dinosaurs, They’re Sharks. https://www.simplethread.com/relational-databases-arent-dino...
The very small bonus you get on small apps is hardly worth the time you spend redeveloping the wheel.

by ktzar

0 subcomment

Writing your own storage is a great way to understand how databases work (if you do it efficiently, keeping indexes, correct data structures, etc.) and to come to the conclusion that if your intention wasn't just tinkering, you should've used a database from day 1.

by throwway120385

0 subcomment

I dunno. Even in embedded systems every time I've started without a database I've eventually come to need something like a database, and in every case I've found myself building essentially an ad-hoc poorly managed database into the application including marshalling/unmarshalling, file management, notification, and so on because each new feature over the top of regular files was just that much easier to add versus switching to a database system.
However the driving motivation for adding a database is not necessarily managing data, but the fact that the database system creates a nice abstraction layer around storing data of relational or non-relational form in non-volatile memory and controlling access to it while other systems are updating it. And because it's a nice abstraction, there are a lot of existing libraries that can take advantage of it in your language of choice without requiring you to completely invent all of that stuff over the top of the filesystem. That has knock-on effects when you're trying to add new functionality or new interaction patterns to an existing system.
And in cases where two or more processes need to communicate using the same data, a database gives you some good abstractions and synchronization primitives that make sense, whereas regular files or IPC require you to invent a lot of that stuff. You could use messaging to communicate updates to data but now you have two copies of everything, and you have to somehow atomize the updates so that either copy is consistent for a point in time. Why not use a database?
Knowing what I know today I would start with some kind of database abstraction even if it's not necessarily designed for transactional data, and I would make sure it handled the numerous concerns I have around data sharing, consistency, atomicity, and notification because if I don't have those things I eventually have to invent them to solve the reliability problems I otherwise run into without them.

by jrecursive

4 subcomments

I suggest every developer write a database from scratch at least once, and use it for something real. Or, even better, let somebody else use it for something real. Then you will know "why database".

by jmull

0 subcomment

> Binary search beats SQLite... For a pure ID lookup, you're paying for machinery you're not using.
You'll likely end up quite a chump if you follow this logic.
sqlite has pretty strong durability and consistency mechanisms that their toy disk binary search doesn't have.
(And it is just a toy. It waves away the maintenance of the index, for god's sake, which is almost the entire issue with indexes!)
Typically, people need to change things over time as well, without losing all their data, so backwards compatibility and other aspects of flexibility that sqlite has are likely to matter too.
I think once you move beyond a single file read/written atomically, you might as well go straight to sqlite (or other db) rather than write your own really crappy db.

by agustechbro

0 subcomment

Not to tear down the article's author, and I appreciate his effort to prove something: this might be useful in an extreme case of optimization with a limited amount of data and NO NEED to update/write the files. Just a read-only cache.
If you ever need to update a single byte in your data, please USE A PROPER DATABASE. Databases do a lot of fancy things to ensure you are not going to corrupt/break your data on disk, among other safety features.

by waldrews

0 subcomment

File systems are nice if you need to do manual or transparent script-based manipulations. Like 'oh hey, I just want to duplicate this entry and hand-modify it, and put these others in an archive.' Or use your OS's access control and network sharing easily with heterogeneous tools accessing the data from multiple machines. Or if you've got a lot of large blobs that aren't going to get modified in place.
What the world needs is a hybrid - database ACID/transaction semantics with the ability to cd/mv/cp file-like objects.

by Joeboy

0 subcomment

Don't know if it counts, but my London cinema listings website just uses static json files that I upload every weekend. All of the searching and stuff is done client side. Although I do use sqlite to create the files locally.
Total hosting costs are £0 ($0) other than the domain name.

by nishagr

0 subcomment

The real question - do you really need to hack around with in-memory maps and files when you could just use a database?

by cold_tom

0 subcomment

you can get surprisingly far with files, but the moment you care about things like concurrent writes or not losing data on crash, the whole thing changes. at that point you're not choosing speed vs simplicity anymore - you're choosing how much risk you're willing to carry

by vovanidze

2 subcomments

people wildly underestimate the os page cache and modern nvme drives tbh. disk io today is basically ram speeds from 10 years ago. seeing startups spin up managed postgres + redis clusters + prisma on day 1 just to collect waitlist emails is peak feature vomit.
a jsonl file and a single go binary will literally outlive most startup runways.
also, the irony of a database gui company writing a post about how you dont actually need a database is pretty based.

by randusername

1 subcomments

Separate from performance, I feel like databases are a sub-specialty that has its own cognitive load.
I can use databases just fine, but will never be able to make wise decisions about table layouts, ORMs, migrations, backups, scaling.
I don't understand the culture of "oh we need to use this tool because that's what professionals use" when the team doesn't have the knowledge or discipline to do it right and the scale doesn't justify the complexity.

by rglover

1 subcomments

A few months back I decided to write an embedded db for my firm's internal JS framework. Learned a lot about how/why databases work the way they do. I use stuff like reading memory cached markdown files for static sites, but there are certain things that a database gives you (chief of which for me was query ergonomics—I loved MongoDB's query language but grew too frustrated with the actual runtime) that you'll miss once you move past a trivial data set.
I think a better way to ask this question is "does this application and its constraints necessitate a database? And if so, which database is the correct tool for this context?"

by jmaw

0 subcomment

Very interesting, I'd never heard of JSONL before: https://jsonlines.org/
Also notable mention for JSON5 which supports comments!: https://json5.org/

by swiftcoder

0 subcomment

I feel like someone who works for a DB company ought to mention at least some of the pitfalls in file-based backing stores (data loss due to crashes, file truncation, fsync weirdness, etc)

by ghc

1 subcomments

I'm so old I remember working on databases that were designed to use RAW, not files. I'm betting some databases still do, but probably only for mainframe systems nowadays.

by theshrike79

2 subcomments

I have a vague recollection that 4chan (At least at one point) didn't use any kind of backend database, they just rewrote the static pages with new content and that was it.
That's why it could handle massive traffic with very few issues.

by jmaw

0 subcomment

While this is certainly cool to see, and I love seeing how fast webservers can go, the counter-question "Do you even need 25,000 RPS and sub-ms latency?" comes to mind.
I don't choose a DB over a flat file for its speed. I choose a DB for the consistent interface and redundancy.

by oliviergg

2 subcomments

Please … Every few years the pendulum swings. First it was “relational databases are too rigid, just use NoSQL.” Then “NoSQL is a mess, just go back to Postgres.” Now: “do you even need a database at all, just use flat files.” Each wave is partially right. But… each wave is about to rediscover, the hard way, exactly why the previous generation made the choices they did. SQLite is the answer to every painful lesson learned, every scar from a long debugging night the last time someone thought “a JSON file is basically a database.”

by 827a

1 subcomments

I'm a big fan of using S3 as a database. A lot of apps can get a lot of mileage just doing that for a good chunk of their data; that which just needs lookup by a single field (usually ID, but doesn't have to be).

by zkmon

0 subcomment

Sure. Go ahead and use JSONL files and implement every feature of SQL query. Congrats, you just reinvented a database, while trying to prove you don't need database.

by tracker1

1 subcomments

I'd argue for using LevelDB or similar if I just wanted to store arbitrary data based on a single indexable value like TFA. That said, I'd probably just default to SQLite myself since the access, backup, restore patterns are relatively well known and that you can port/grow your access via service layers that include Turso or Cloudflare D1, etc.

by the_inspector

1 subcomments

In many cases not. E.g. for caching with python, diskcache is a good choice. For small amounts of data, a JSON file does the job (you pointed to JSONL as an option). But for larger collections, that should be searchable/processable, postgres is a good choice.
Memory of course, as you wrote, also seems reasonable in many cases.

by chuckadams

1 subcomments

I need a filesystem that does some database things. We got teased with that with WinFS and BeOS's BFS, but it seems the football always gets yanked away, and the mainstream of filesystems always reverts back to the APIs established in the 1980s.

by matja

0 subcomment

If you think files are easier than a database, check out https://danluu.com/file-consistency/

by gavinray

0 subcomment

Not to nitpick, but it would be interesting to see profiling info of the benchmarks
Different languages and stdlib methods can often spend time doing unexpected things that make what look like apples-to-apples comparisons not quite equivalent

by winrid

2 subcomments

My recent project - a replacement for CodeMaster's RaceNet, runs on flat files! https://dirtforever.net/
Just have to use locks to be careful with writes.
I figured I'd migrate it to a database after maybe 10k users or so.
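For readers wondering what "use locks to be careful with writes" might look like, here is a minimal sketch using an advisory lock from Python's standard `fcntl` module. It is an assumption about the general technique, not the code behind the project mentioned above; `append_record` is a hypothetical name, and `fcntl.flock` is Unix-only and only protects against other writers that also take the lock.

```python
import fcntl
import json


def append_record(path, record):
    """Append one JSON line while holding an exclusive advisory lock on the file."""
    line = json.dumps(record) + "\n"
    with open(path, "a", encoding="utf-8") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until no other cooperating process holds the lock
        try:
            f.write(line)
            f.flush()  # hand the bytes to the OS before the lock is released
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

This keeps two cooperating writers from interleaving lines, but it does nothing for durability after a crash; that would need an `fsync` as well.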

by traderj0e

0 subcomment

The "database" in this article is only a read-only KV-store. Mind that the hard part of a KV store is writing. Still the benchmarks are interesting.

by goerch

0 subcomment

Pretty sure the origin should be `dbunpro.app`, no? I'd think the consensus should be: do you even need the file system?

by stackskipton

2 subcomments

SRE here. My "Huh, neat" side of my brain is very interested. The SRE side of my brain is screaming "GOD NO, PLEASE NO"
Overhead in any project is understanding it and onboarding new people to it. Staying on the "mainline" path is key to lowering friction here. All 3 languages have well-supported ORMs that support SQLite.

by jwitchel

1 subcomments

This is a great, incredibly well-written piece. Nice work showing the under-the-hood build-up of how a db works. It makes you think.

by jbiason

1 subcomments

Honestly, I have been thinking about the same topic for some time, and I do realize that direct files could be faster.
In my (hypothetical, 'cause I never actually sat down and wrote that) case, I wanted the personal transactions in a month, and I realized I could just keep one single file per month, and read the whole thing at once (also 'cause the application would display the whole month at once).
Filesystems can be considered a key-value (or key-document) database. The funny thing about the example used in the link is that one could simply create a structure like `user/[id]/info.json` and directly access the user by ID instead of scanning some file to find them. Then again, given the examples used, search by name would be a pain, and that is one point where databases would handle things better.
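A small sketch of that `user/[id]/info.json` idea, to show both halves of the trade-off: lookup by ID is a single path, while lookup by name has to open every file. The helper names and the `name` field are made up for the example.

```python
import json
from pathlib import Path


def user_path(root, user_id):
    """The filesystem path acts as the key: user/<id>/info.json under `root`."""
    return Path(root) / "user" / str(user_id) / "info.json"


def save_user(root, user):
    path = user_path(root, user["id"])
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(user), encoding="utf-8")


def load_user(root, user_id):
    """Direct access by ID: no scan, just one file read."""
    path = user_path(root, user_id)
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))


def find_by_name(root, name):
    """Search by name has no index, so it reads every info.json."""
    matches = []
    for path in sorted(Path(root).glob("user/*/info.json")):
        user = json.loads(path.read_text(encoding="utf-8"))
        if user.get("name") == name:
            matches.append(user)
    return matches
```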

by JohnMakin

0 subcomment

everyone thinks this is a great idea until they learn about file descriptor limits the hard way

by freedomben

1 subcomments

I avoided DBs like the plague early in my career, in favor of serialized formats on disk. I still think there's a lot of merit to that, but at this point in my career I see a lot more use case for sqlite and the relational features it comes with. At the least, I've spent a lot less time chasing down data corruption bugs since changing philosophy.
Now that said, if there's value to the "database" being human readable/editable, json is still well worth a consideration. Dealing with even sqlite is a pain in the ass when you just need to tweak or read something, especially if you're not the dev.

by charcircuit

0 subcomment

>So the question is not whether to use files. You're always using files. The question is whether to use a database's files or your own.
It's the opposite. A file system is a database. And databases can recursively store their data within another database.

by XorNot

0 subcomment

I've just built myself a useful tool which now really would benefit from a database and I'm deeply regretting not doing that from the get-go.
So my opinion has thoroughly shifted to "start with a database, and if you _really_ don't need one it'll be obvious."
But you probably do.

by srslyTrying2hlp

0 subcomment

I tried doing this with csv files (and for an online solution, Google Sheets)
I ended up just buying a VPS, putting openclaw on it, and letting it Postgres my app.
I feel like this article is outdated since the invention of OpenClaw/Claude Opus level AI Agents. The difficulty is no longer programming.

by 0x457

0 subcomment

> Do you even need a database?
Then proceeds to (poorly) implement a database on files.
Sure, a hash map that takes ~400 MB in memory is going to offer you fast lookups. You could argue that some workloads will never reach this size, but what are you losing by using SQLite?
What happens when the service shuts down mid-write? Corruption that later results in a (poorly) implemented WAL being added?
SQLite also showed something important - it was consistent in all benchmarks regardless of dataset size.

by FpUser

0 subcomment

I think this whole article and post is an attention / points seeking exercise. It is hard to imagine a programmer who would not know the difference between a DBMS and just a bunch of files, and when to use which.

by hnlmorg

0 subcomment

> Every database you have ever used reads and writes to the filesystem, exactly like your code does when it calls open().
Nope. There are non-persistent in-memory databases too.
In fact, a database can be a plethora of things and the stuff they were building is just a subset of a subset (persistent, local relational database)

by fifilura

2 subcomments

Isn't this the same case the NoSQL movement made?

by allknowingfrog

0 subcomment

I've used foreign keys and unique indexes to enforce validity on even the smallest, most disposable toy applications I've ever written. These benchmarks are really interesting, but the idea that performance is the only consideration is kind of silly.
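As an illustration of how little code that takes, here is a sketch with Python's built-in `sqlite3` module. The `users`/`orders` schema and the helper names are invented for the example; note that SQLite only enforces foreign keys when `PRAGMA foreign_keys = ON` is set on the connection.

```python
import sqlite3


def open_db():
    conn = sqlite3.connect(":memory:")
    # SQLite leaves foreign key enforcement off unless it is enabled per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE users (
            id    INTEGER PRIMARY KEY,
            email TEXT NOT NULL UNIQUE
        );
        CREATE TABLE orders (
            id      INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL REFERENCES users(id)
        );
        """
    )
    return conn


def try_insert(conn, sql, params):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False
```

With this in place a duplicate email or an order pointing at a missing user is rejected by the database itself, which is validation a flat file would need hand-written code for.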

by tonymet

0 subcomment

If the cloud is just someone else’s hard disks (etc) then RDBMS is just someone else’s btree

by pstuart

1 subcomments

In order to ask this question it's important to understand the lifecycle of the data in question. If it is constantly being updated and requires "liveness" (updates are reflected in queries immediately), the simple answer is: yes, you need a database.
But if you have data that is static or effectively static (data that is updated occasionally or batched), then serving via custom file handling can have its place.
If the records are fixed width and sorted on the key value, then it becomes trivial to do a binary search on the mmapped file. It's about as lightweight as could be asked for.
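A minimal sketch of that fixed-width, sorted, mmapped layout. The record format (an 8-byte little-endian key followed by a 24-byte padded payload) and the function names are assumptions chosen for the example, not the article's format. It also sidesteps the hard part other commenters point out: keeping the file sorted as data changes.

```python
import mmap
import struct

# Assumed record layout: 8-byte unsigned key + 24-byte payload, 32 bytes per record.
RECORD = struct.Struct("<Q24s")


def write_records(path, rows):
    """Write (key, payload_bytes) pairs sorted by key; struct pads short payloads with NULs."""
    with open(path, "wb") as f:
        for key, payload in sorted(rows):
            f.write(RECORD.pack(key, payload))


def lookup(path, key):
    """Binary search over the mmapped file; returns the payload or None.

    Assumes a non-empty file, since mmap refuses to map zero bytes.
    """
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        lo, hi = 0, len(mm) // RECORD.size
        while lo < hi:
            mid = (lo + hi) // 2
            mid_key, payload = RECORD.unpack_from(mm, mid * RECORD.size)
            if mid_key == key:
                return payload.rstrip(b"\x00")  # strip the fixed-width padding
            if mid_key < key:
                lo = mid + 1
            else:
                hi = mid
    return None
```

Because every record is the same size, record `i` starts at byte `i * RECORD.size`, which is what makes the search possible without any separate index.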

by rasengan

0 subcomment

Sounds like a good way to waste the only scarce resource: time.

by cratermoon

0 subcomment

I worked one place that shoehorned SQL Server into a system to hold a small amount of static data that could easily have been a config file or even (eek) hard-coded.

by ForHackernews

1 subcomments

Surprised to see this beating SQLite after previously reading https://sqlite.org/fasterthanfs.html

by MattRogish

0 subcomment

"Do not cite the deep magic to me, witch, I was there when it was written"
If you want to do this for fun or for learning? Absolutely! I did my CS Masters thesis on SQL JOINS and tried building my own new JOIN indexing system (tl;dr: mine wasn't better). Learning is fun! Just don't recommend people build production systems like this.
Is this article trolling? It feels like trolling. I struggle to take an article seriously that conflates databases with database management systems.
A JSON file is a database. A CSV is a database. XML (shudder) is a database. PostgreSQL data files, I guess, are a database (and indexes and transaction logs).
They never actually posit a scenario in which rolling your own DBMS makes sense (the only pro is "hand rolled binary search is faster than SQLite"), and their "When you might need" a DBMS misses all the scenarios, the addition of which would cause the conclusion to round to "just start with SQLite".
It should basically be "if you have an entirely read-only system on a single server/container/whatever" then use JSON files. I won't even argue with that.
Nobody - and I mean nobody - is running a production system processing hundreds of thousands of requests per second off of a single JSON file. I mean, if req/sec is the only consideration, at that point just cache everything to flat HTML files! Node and Typescript and code at all is unnecessary complexity.
PostgreSQL (MySQL, et al) is a DBMS (DataBase Management System). It might sound pedantic but the "MS" part is the thing you're building in code:
concurrency, access controls, backups, transactions: recovery, rollback, committing, etc., ability to do aggregations, joins, indexing, arbitrary queries, etc. etc.
These are not just "nice to have" in the vast, vast majority of projects.
"The cases where you'll outgrow flat files:"
Please add "you just want to get shit done and never have to build your own database management system". Which should be just about everybody.
If your app is meaningfully successful - and I mean more than just like a vibe-coded prototype - it will break. It will break in both spectacular ways that wake you up at 2AM and it will break in subtle ways that you won't know about until you realize something terrible has happened and you lost your data.
Didn't we just have this discussion like yesterday (https://ultrathink.art/blog/sqlite-in-production-lessons)?
It feels like we're throwing away 50 years of collective knowledge, skills, and experience because it "is faster" (and in the same breath note that nobody is gonna hit these req/sec.)
I know, it's really, really hard to type `yarn add sqlite3` and then `SELECT * FROM foo WHERE bar='baz'`. You're right, it's so much easier writing your own binary search and indexing logic and reordering files and query language.
Not to mention now you need an AGENTS.md that says "We use our own home-grown database nonsense if you want to query the JSON file in a different way just generate more code." - NOT using standard components that LLMs know backwards-and-forwards? Gonna have a bad time. Enjoy burning your token budget on useless, counter-productive code.
This is madness.

by SyndicateLinks

0 subcomment

[dead]

by amw-zero

0 subcomment

I think so, yea.

by fatih-erikli-cg

3 subcomments

I agree. Databases are useless. You don't even need to load the data into memory; reading it from disk whenever something needs to be read should be fine. I don't buy the argument that there are billions of records, so the database must be something optimized for handling them. That many records are most likely something like access logs, which I think should not be stored at all.
Even if it's Postgres, it is still a file on disk. If there is a need for something like partitioning the data, it is much easier to write the code that partitions it.
If there is a need to add something with text inputs, checkboxes, etc., a database with its admin tools may be a good thing. If the data is imported and exported, a database may be a good thing too. But I still don't believe in such cases; in my ten-or-so years as a software developer, nothing like that has ever come up.

by linuxhansl

0 subcomment

Hmm... Sure, if you do not need a database then do not use a database.
Don't use a sports-car to haul furniture or a garbage truck as an ambulance. For the use case and scale mentioned in the article it's obvious not to use a database.
Am I missing something? I guess many people are using the tools they are familiar with and rarely question whether they are really applicable. Is that the message?
I think a more interesting question is whether you will need a single source of truth. If you don't you can scale on many small data sets without a database.
I will say this before I shut up with my rant: If you start with a design that scales, you will have an easier time scaling when the time comes, without re-engineering your stack. Whether you think you will need to scale depends on your projected growth and the nature of your problem (do you need a single source of truth, etc.)
Edits: Spelling