Imo that's the killer feature of database-based queues, because it dramatically simplifies reasoning about retries, i.e. "did my endpoint logic commit _and_ my background operation enqueue both atomically commit, or atomically fail"?
Same thing for performing jobs, if my worker's business logic commits, but the job later retries (b/c marking the job as committed is a separate transaction), then oof, that's annoying.
And I might as well be using SQS at that point.
Or how would you scale this to support thousands of events per second?
I’m curious: When you say FOR UPDATE SKIP LOCKED does not scale to 25k queries/s, did you observe a threshold at which it became untenable for you?
I’m also curious about the two points of:
- buffered reads and writes
- switching all high-volume tables to use identity columns
What do you mean by these? Were those (part of) the solution to scale FOR UPDATE SKIP LOCKED up to your needs?
Only fix we could find was using unlogged tables and a full vacuum on a schedule. We aren’t big Postgres experts but since you are I was wondering if you have fixed this issue/this framework works well for large payloads.
More importantly: can this be used to run untrusted jobs? E.g. user-supplied or AI supplied code?
Would love to see some sort of architecture overview in the docs
The top-level docs have a section on "Deploying workers" but I think there are more components than that?
It's cool there's a Helm chart but the docs don't really say what resources it would deploy
https://docs.hatchet.run/self-hosting/docker-compose
...shows four different Hatchet services plus, unexpectedly, both a Postgres server and RabbitMQ. Can't see anywhere that describes what each one of those does
Also in much of the docs it's not very clear where the boundary between Hatchet Cloud and Hatchet the self-hostable OSS part lies
The open source support and QuickStart are excellent. The engineering work put into the system is very noticeable!
All I ever want is a queue where I submit a message and then it hits an HTTP endpoint with that message as POST. It is such a better system than dedicated long running worker listeners, because then you can just scale your HTTP workers as needed. Pairs extremely well with autoscaling Cloud Functions, but could be anything really.
I also find that DAGs tend to get ugly really fast because it generally involves logic. I'd prefer that logic to not be tied into the queue implementation because it becomes harder to unit test. Much easier reason about if you have the HTTP endpoint create a new task, if it needs to.
[0] https://github.com/oneapplab/lq
P.S: far from being alternative to Hatchet product
But that requires you to keep the job history around, which at scale starts to impact performance.
Although there was support for pydantic validation in v0, now that the v1 SDK has arrived, I would definitely say that the #1 distinguishing feature (at least from a dx perspective) for anyone thinking of switching from Celery or working on a greenfield project is the type safety that comes with the first class pydantic support in v1. That is a huge boon in my opinion.
Another big boon for me was that the combo of both Python and Typescript SDKs - being able to integrate things into frontend demos without having to set up a separate Python api is great.
There are a couple rough edges around asyncio/single worker concurrency IMO - for instance, choosing between 100 workers each with capacity for 8 concurrent task runs vs 800 workers each with capacity for 1 concurrent task run. In Celery it’s a little bit easier to launch a worker node which uses separate processes to handle its concurrent tasks, whereas right now with Hatchet, that’s not possible as far as I am aware, due to how asyncio is used to handle the concurrent task runs which a single worker may be processing. If most of your work is IO bound or already asyncio friendly, this does not really affect you and you can safely use eg a worker with 8x task run capacity, but if you are CPU bound there might be some cases where you would prefer the full process isolation and feel more assured that you are maximally utilizing all your compute in a given node, and right now the best way to do that is only through horizontal scaling or 1x task workers I think. Generally, if you do not have a great mental model already of how Python handles asyncio, threads, pools, etc, the right way to think about this stuff can be a little confusing IMO, but the docs on this from Hatchet have improved. In the future though, I’d love to see an option to launch a Python worker with capacity for multiple simultaneous task runs in separate processes, even if it’s just a thin wrapper around launching separate workers under the hood.
There are also a couple of rough edges in the dashboard right now, but the team has been fixing them, and coming from celery/flower or SQS, it’s already such an improved dashboard/monitoring experience that I can’t complain!
It’s hard to describe, but there is just something fun about working with Hatchet for me, compared to Celery or my previous SQS system. Almost all of the design decision just align with what I would desire, and feel natural.