FRESH

Hacker News

Home

Show HN: Hatchet v1 – A task orchestration platform built on Postgres

240 points by abelanger

by followben

3 subcomments

How does this compare to other pg-backed python job runners like Procrastinate [0] or Chancy [1]?
[0] https://github.com/procrastinate-org/procrastinate/
[1] https://github.com/TkTech/chancy

by diarrhea

1 subcomments

This is very exciting stuff.
I’m curious: When you say FOR UPDATE SKIP LOCKED does not scale to 25k queries/s, did you observe a threshold at which it became untenable for you?
I’m also curious about the two points of:
- buffered reads and writes
- switching all high-volume tables to use identity columns
What do you mean by these? Were those (part of) the solution to scale FOR UPDATE SKIP LOCKED up to your needs?

by morsecodist

0 subcomment

This is great timing. I am in the process of designing an event/workflow driven application and nothing I looked at felt quite right for my use case. This feels really promising. Temporal was close but it just felt like not the perfect fit. I like the open source license a lot it gives me more confidence designing an application around it. The conditionals are also great. I have been looking for something just like CEL and despite my research I had never heard of it. It is exactly how I want my expressions implemented, I was on the verge of trying to build something like this myself.

by stephen

2 subcomments

Do queue operations (enqueue a job & mark this job as complete) happen in the same transaction as my business logic?
Imo that's the killer feature of database-based queues, because it dramatically simplifies reasoning about retries, i.e. "did my endpoint logic commit _and_ my background operation enqueue both atomically commit, or atomically fail"?
Same thing for performing jobs, if my worker's business logic commits, but the job later retries (b/c marking the job as committed is a separate transaction), then oof, that's annoying.
And I might as well be using SQS at that point.

by nik736

0 subcomment

The readme assumes users with darkmode outweigh users without (the logo is white, invisible without darkmode). Would be interesting to see stats from Github for this!

by lysecret

1 subcomments

This is awesome and I will take a closer look! One question: We ran into issue with using Postgres as a message queue with messages that need to be toasted/have large payloads (50mb+).
Only fix we could find was using unlogged tables and a full vacuum on a schedule. We aren’t big Postgres experts but since you are I was wondering if you have fixed this issue/this framework works well for large payloads.

by fabcairo

0 subcomment

This looks super promising, really like the deep PostgreSQL integration and the effort toward durable execution.
One aspect I’d be curious to hear more about (and might be worth expanding on in docs or future posts) is how hatchet holds up operationally in production. For example, what does a typical alerting setup look like for common failure modes? And since the system relies on partitioned tables and tuned schemas, how do you approach migrations or schema changes without downtime?
A lot of open-source job orchestration systems shine at the core execution model but fall short when it comes to observability and smooth day-2 operations. If Hatchet nails that too, it’s a huge win.

by sgarland

0 subcomment

> Improved performance across every dimension we’ve tested, which we attribute to six improvements to the Hatchet architecture: range-based partitioning of time series tables, hash-based partitioning of task events (for updating task statuses), separating our monitoring tables from our queue, buffered reads and writes, switching all high-volume tables to use identity columns, and aggressive use of Postgres triggers.
Amazing what you can do when you read the manual, eh?
Seriously though, that’s awesome, and I’m very happy to see someone leaning hard into RDBMS features like triggers instead of shying away.

by kianN

0 subcomment

Congratulations on the v1 launch! I’ve been tinkering with hatchet for almost a year, deployed it in production about 6 months ago.
The open source support and QuickStart are excellent. The engineering work put into the system is very noticeable!

by bosky101

1 subcomments

Here is my feedback after spending 15 mins on your docs.
Nice work on the lite mode, open source, logging, dx interface.
You may want to replace Hello world examples with real world scenarios.
The workflows that involve multiple steps tasks, dag in your terminology - the code simply isn't intuitive.
You now have to get into the hatchets mindset, patterns, terminology. Eg: the random number example is riddled with too many. How many of the logos on your homepage did you have to write code for? Be honest.
Knowing to program should be 90% enough. Eg for js:
```
   // send("hi", user => user.signed_up_today)
   //  .waitFor("7d")
   //  .send("upgrade", user => !user.upgraded)
```
Just made this up, but something like this is more readable. (PS:would love to be proved wrong by an implementation of exactly the above example here in the comments). The whole point of being smart is for your team at hatchet to absorb difficulty at the benefit of an easy interface that looks simple and magic. Your 5 line examples has types to learn, functions to learn, arguments to know, 5-10 kinds of things to learn. It showed little effort to make it easy for customers.
An engineering post on what's under the hood makes sense. But customers really don't care about your cloud infra flexes in a post introducing your company pitching the product. It's just koolaid.
Same with complete rewrite so early. I'm glad you are open to change. But the workflow market today with so many options, i don't belive this is the last rewrite or pivot to come.
The DAGs itself aren't very readable. You are better off switching to something like react flow that lets you nocode edit as well.
Focus on automation journeys that are common. Like cookbooks. And allow folks to just import them or change some configurations. like drip marketing, renewals, expired cards, forgot password handlers, shortlink creators, maybe pdf merging, turning a bunch of saved links to a daily blog post, etc
How does a workflow replace a saas they are paying $99 for. That's powerful.
Tough to serialize a worflow to json . Or atleast didn't see it. this makes it easy to have workflows as code, create nocode editors in your own roadmap. You want people to hop from 1 company to another taking their hatchet workflows with them
Good luck, and sorry for coming off as rude. It's just a space I am very passionate about.

by krainboltgreene

1 subcomments

A lot of these tools show off what a full success backlog looks like, in reality I care significantly more about what failure looks like, debugging, etc.

by latchkey

4 subcomments

Cool project. Every time one of these projects comes up, I'm always somewhat disappointed it isn't an open source / postgres version of GCP Cloud Tasks.
All I ever want is a queue where I submit a message and then it hits an HTTP endpoint with that message as POST. It is such a better system than dedicated long running worker listeners, because then you can just scale your HTTP workers as needed. Pairs extremely well with autoscaling Cloud Functions, but could be anything really.
I also find that DAGs tend to get ugly really fast because it generally involves logic. I'd prefer that logic to not be tied into the queue implementation because it becomes harder to unit test. Much easier reason about if you have the HTTP endpoint create a new task, if it needs to.

by drdaeman

0 subcomment

Looks nice on the first glance, congrats on the launch! May I ask a few questions, please?
- Does it support durable tasks that should be essentially ran forever and produce an endless "stream" of events, self-healing in case of intermittent failures? Or would those be a better fit for some different kind of orchestrator?
- Where and how task inputs and outputs are stored? Are there any conveniences to make passing "weird" (that is, not some simple and reasonably-small JSON-encoded objects) things around easier (like Dagster's I/O managers) or is it all out of scope for Hatchet?
- Assuming that I can get ballpark estimates for the desirable number of tasks, their average input and output sizes, and my PostgreSQL instance's size and I/O metrics, can I somehow make a reasonable guesstimate on how many tasks per second the whole system can put through safely?
I'm currently in search of the Holy Grail (haha), evaluating all sorts of tools (Temporal, Dagster, Prefect, Faust, now looking at Hatchet) to find something that I would like the most. My project is a synchronization+processing system that has a bunch of dynamically-defined workflows that continuously work with external services (stores), look for updates (determine new, updated, or deleted products) and spawn product-level workflows to process those updates (standardize store-specific data into an unified shape, match against the canonical product catalog, etc etc). Surely, this kind of a pipeline can be built on nearly anything - I'm just trying to get a gist of how each of those system feels like to work with, what it's actually good at and what are the gotchas and limitations, and which tool would allow me to have least amount of boilerplate.
Thanks!

by avan1

1 subcomments

Don't want to steal your topic but I had written a lightweight task runner to learn GoLang [0]. Would be great to have your and others' comments. It works only as a Go library.
[0] https://github.com/oneapplab/lq
P.S: far from being alternative to Hatchet product

by themanmaran

1 subcomments

How does queue observability work in hatchet? I've used pg as a queueing system before, and that was one of my favorite aspects. Just run a few SQL queries to have a dashboard for latency/throughput/etc.
But that requires you to keep the job history around, which at scale starts to impact performance.

by programmarchy

0 subcomment

Wow, this looks awesome. Been using Temporal, but this fits so perfectly into my stack (Postgres, Pydantic), and the first-class support for DAG workflows is chef's kiss. Going to take a stab at porting over some of my workflows.

by hyuuu

0 subcomment

i have been looking for something like this, the closest I could find by googling was celery workflow, i think you should do better marketing, I didn't even realize that hatchet existed!

by rohan_

1 subcomments

How close to Postgres does this need to be? Like could you host this on Aurora DSQL and have unlimited scalability?
Or how would you scale this to support thousands of events per second?

by lysecret

0 subcomment

I would appreciate a comparison to cloud tasks in your docs.

by digdugdirk

2 subcomments

Interesting! How does it compare with DBOS? I noticed it's not in the readme comparisons, and they seem to be trying to solve a similar problem.

by anentropic

1 subcomments

Quick feedback:
Would love to see some sort of architecture overview in the docs
The top-level docs have a section on "Deploying workers" but I think there are more components than that?
It's cool there's a Helm chart but the docs don't really say what resources it would deploy
https://docs.hatchet.run/self-hosting/docker-compose
...shows four different Hatchet services plus, unexpectedly, both a Postgres server and RabbitMQ. Can't see anywhere that describes what each one of those does
Also in much of the docs it's not very clear where the boundary between Hatchet Cloud and Hatchet the self-hostable OSS part lies

by bluelightning2k

0 subcomment

Is this Python only?
More importantly: can this be used to run untrusted jobs? E.g. user-supplied or AI supplied code?

by wilted-iris

1 subcomments

This looks very cool! I see a lot of Python in the docs; is it usable in other languages?

by pkiv

0 subcomment

Congrats on the launch guys!

by bomewish

1 subcomments

Why not fix all the broken doc links and make sure you have the full sdk spec down first, ready to go? Then drop it all at once, when it’s actually ready. That’s better and more respectful of users. I love the product and want y’all to succeed but this came off as extremely unprofessional.

by szvsw

0 subcomment

I’ve been using Hatchet since the summer, and really do love it over celery. I’ve been using Hatchet for academic research experiments with embarrassingly parallel tasks - ie thousands of simultaneous tasks just with different inputs, each CPU bound and on the order of 10s-2min, totaling in the millions of tasks per experiment - and it’s been going great. I think the team is putting together a very promising product. Switching from a roll-my-own SQS+AWS batch system to Hatchet has made my research life so much better. Though part of that also probably comes from the forced improvements you get when re-designing a system a second time.
Although there was support for pydantic validation in v0, now that the v1 SDK has arrived, I would definitely say that the #1 distinguishing feature (at least from a dx perspective) for anyone thinking of switching from Celery or working on a greenfield project is the type safety that comes with the first class pydantic support in v1. That is a huge boon in my opinion.
Another big boon for me was that the combo of both Python and Typescript SDKs - being able to integrate things into frontend demos without having to set up a separate Python api is great.
There are a couple rough edges around asyncio/single worker concurrency IMO - for instance, choosing between 100 workers each with capacity for 8 concurrent task runs vs 800 workers each with capacity for 1 concurrent task run. In Celery it’s a little bit easier to launch a worker node which uses separate processes to handle its concurrent tasks, whereas right now with Hatchet, that’s not possible as far as I am aware, due to how asyncio is used to handle the concurrent task runs which a single worker may be processing. If most of your work is IO bound or already asyncio friendly, this does not really affect you and you can safely use eg a worker with 8x task run capacity, but if you are CPU bound there might be some cases where you would prefer the full process isolation and feel more assured that you are maximally utilizing all your compute in a given node, and right now the best way to do that is only through horizontal scaling or 1x task workers I think. Generally, if you do not have a great mental model already of how Python handles asyncio, threads, pools, etc, the right way to think about this stuff can be a little confusing IMO, but the docs on this from Hatchet have improved. In the future though, I’d love to see an option to launch a Python worker with capacity for multiple simultaneous task runs in separate processes, even if it’s just a thin wrapper around launching separate workers under the hood.
There are also a couple of rough edges in the dashboard right now, but the team has been fixing them, and coming from celery/flower or SQS, it’s already such an improved dashboard/monitoring experience that I can’t complain!
It’s hard to describe, but there is just something fun about working with Hatchet for me, compared to Celery or my previous SQS system. Almost all of the design decision just align with what I would desire, and feel natural.

by revskill

2 subcomments

Confusing docs as there is no setup self hosted for postgres.

by curtisszmania

0 subcomment

[dead]

by tombhowl

0 subcomment

[dead]