by TexanFeller
1 subcomment
- Ofc I wouldn't use it for extremely high-scale event processing, but it's a great default for a message/task queue for 90% of business apps. If you're processing under a few hundred million events/tasks per day with fewer than ~10k concurrent processes dequeuing from it, it's what I'd default to.
I work on apps that use such a PG-based queue system, and it provides indispensable features we couldn't achieve easily/cleanly with a normal queue system, such as dynamically adjusting the priority/order of tasks being processed and easily querying/reporting on the contents of the queue. We have many other interesting features built into it that are more specific to our needs, which I'm more hesitant to describe in detail here.
- The biggest thing to watch out for with this approach is that you will inevitably have some failure or bug that will 10x, 100x, or 1000x the rate of dead messages and overload your DLQ database. You need a circuit breaker or rate limit on it.
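One way to sketch such a guard inside Postgres itself (table, trigger, and threshold names are all hypothetical, and a per-row count like this is itself costly under load, so a real system would more likely rate-limit in the application layer or sample this check):

```sql
-- Hypothetical guard: refuse DLQ inserts once the last minute's volume
-- passes a threshold. Table, trigger, and threshold are illustrative.
CREATE OR REPLACE FUNCTION dlq_rate_guard() RETURNS trigger AS $$
BEGIN
  IF (SELECT count(*) FROM dead_letters
      WHERE created_at > now() - interval '1 minute') > 10000 THEN
    RAISE EXCEPTION 'DLQ rate limit exceeded';
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER dlq_rate_limit
  BEFORE INSERT ON dead_letters
  FOR EACH ROW EXECUTE FUNCTION dlq_rate_guard();
```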
- > FOR UPDATE SKIP LOCKED
Learned something new today. I knew what FOR UPDATE did, but somehow I've never RTFM'd hard enough to know about the SKIP LOCKED directive. That's pretty cool.
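For anyone else seeing it for the first time, the generic claim-one-job pattern looks roughly like this (table and column names are illustrative):

```sql
-- Each concurrent worker grabs a different row, because rows another
-- transaction holds locked are skipped rather than waited on.
UPDATE jobs
SET status = 'processing', claimed_at = now()
WHERE id = (
  SELECT id
  FROM jobs
  WHERE status = 'pending'
  ORDER BY id
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
RETURNING id, payload;
```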
- I maintain a small Postgres-native job queue for Python called PGQueuer: https://github.com/janbjorge/pgqueuer
It uses the same core primitives people are discussing here (FOR UPDATE SKIP LOCKED for claiming work; LISTEN/NOTIFY to wake workers), plus priorities, scheduled jobs, retries, heartbeats/visibility timeouts, and SQL-friendly observability. If you’re already on Postgres and want a pragmatic “just use Postgres” queue, it might be a useful reference / drop-in.
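This isn't PGQueuer's actual code — just a generic sketch of the NOTIFY half of that pattern, with illustrative table and channel names:

```sql
-- An AFTER INSERT trigger pings a channel so idle workers wake up
-- instead of polling.
CREATE OR REPLACE FUNCTION notify_new_job() RETURNS trigger AS $$
BEGIN
  PERFORM pg_notify('job_ready', NEW.id::text);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER jobs_notify
  AFTER INSERT ON jobs
  FOR EACH ROW EXECUTE FUNCTION notify_new_job();

-- Each worker session runs `LISTEN job_ready;`, blocks until a
-- notification arrives, then claims work with FOR UPDATE SKIP LOCKED.
```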
- Great application of first principles. I think it's totally reasonable, even at most production loads. (Example: my last workplace had a service that constantly roared at 30k events per second, and our DLQs would at most have on the order of hundreds of messages in them.) We would get paged if a message's age in the queue exceeded an hour.
The idea is that if your DLQ has consistently high volume, there is something wrong with your upstream data, or data handling logic, not the architecture.
- Why use ShedLock and select-for-update-skip-locked together? ShedLock stops things running in parallel (sort of), but the other makes parallel processing possible.
by cmgriffing
0 subcomments
- Only slightly related, but I have been using Oban as a Postgres native message queue in the elixir ecosystem and loving it.
For my use case, it’s so much simpler than spinning up another piece of infrastructure like Kafka or RabbitMQ.
- Hmm that raises a question for me.
I haven't done a project that uses a database (be it SQL or NoSQL) where the number of deletes is comparable to the number of inserts (and far larger than, say, tens per day, of course).
How does your average DB server cope with that, performance-wise? Intuitively I'd think it's optimized more for inserts than for deletes, but of course I may be wrong.
- re: SKIP LOCKED, introduced in Postgres 9.5, here's an archived copy [†] of the excellent 2016 2ndQuadrant post discussing it
https://web.archive.org/web/20240309030618/https://www.2ndqu...
corresponding HN discussion thread from 2016 https://news.ycombinator.com/item?id=14676859
[†] it seems that all the old 2ndquadrant.com blog post links were broken after their acquisition by EnterpriseDB
- We did this at Chargify, but with MySQL. If Redis was unavailable, it would dump the job as a JSON blob to a MySQL table. A cron job would periodically clean it out by re-enqueuing jobs, and it worked well.
- Why use a string as the status, instead of a boolean? That just wastes space for no discernible benefit, especially since the status is indexed. Also, consider turning event_type into an integer if possible, for similar reasons.
Furthermore, why have two indexes with the same leading field (status)?
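One way to address both points at once (an illustrative schema, not the article's: an enum costs a fixed 4 bytes on disk versus a variable-length text tag, and a single partial index covers only the rows workers actually scan):

```sql
CREATE TYPE job_status AS ENUM ('pending', 'processing', 'done', 'dead');

CREATE TABLE jobs (
  id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  event_type int NOT NULL,
  status     job_status NOT NULL DEFAULT 'pending',
  payload    jsonb,
  created_at timestamptz NOT NULL DEFAULT now()
);

-- One partial index instead of two status-leading indexes: it only
-- contains pending rows, which is all the dequeue query ever touches.
CREATE INDEX jobs_pending_idx ON jobs (created_at)
  WHERE status = 'pending';
```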
- https://github.com/pgmq/pgmq
by nicoritschel
1 subcomment
- lol a FOR UPDATE SKIP LOCKED post hits the HN homepage every few months it feels like
by renewiltord
0 subcomments
- Segment uses MySQL as a queue, not even just as a DLQ. It works at their scale. So there are many (not all) systems that can tolerate this as a queue.
I have a simple flow: tasks on the order of thousands an hour. I just use PostgreSQL. High visibility, easy requeue, durable store. With an appropriate index, it’s perfectly fine. An LLM will write the SKIP LOCKED code right the first time. Easy local dev. I always reach for Postgres as the event bus in low-volume systems.
by gytisgreitai
0 subcomments
- Would be interesting to see the numbers this system processes. My bet is that they are not that high.
by awesome_dude
0 subcomments
- I think that using Postgres as the message/event broker is valid, and having a DLQ on that Postgres system is also valid, and usable.
Having SEPARATE DLQ and Event/Message broker systems is not (IMO) valid - because a new point of failure is being introduced into the architecture.
- This is logging.
by reactordev
7 subcomments
- Another day, another “Using PostgreSQL for…” thing it wasn’t designed for. This isn’t a good idea. What happens when the queue goes down and all messages are dead lettered? What happens when you end up with competing messages? This is not the way.
- Postgres is essentially a b-tree with a remote interface. Would you use a b-tree to store a dead letter queue? What is the big-O of insert and delete? What happens when it grows?
Postgres has a query interface, replication, backup and many other great utilities. And it’s well supported, so it will work for low-demand applications.
Regardless, you’re using the wrong data structure with the wrong performance profile, and at the margins you will spend a lot more money and time than necessary running it. And the service will suffer.