- One thing that needs to be emphasized with “durable execution” engines is that they don’t actually get you out of having to handle errors, rollbacks, etc. Even in the canonical example everyone uses: you’re using a DE engine to restart a sales transaction, but the part of that transaction that failed was “charging the customer” - did it fail before or after the charge went through? You failed while updating the inventory system - did the product get marked out of stock or not? All of these problems are tractable, but once you’ve solved them - once you’ve built sufficient atomicity into your system to handle the actual failure cases - the benefits of taking on the complexity of a DE system are substantially lower than the marketing pitch suggests.
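The usual mitigation for the “did the charge go through?” case is an idempotency key derived from the workflow and step identity, so a replayed step either finds the existing charge or creates it exactly once. A minimal sketch of that idea in Java - the `PaymentGateway` interface and the key scheme are hypothetical, not from the thread or the article:

```java
import java.util.Optional;

// Hypothetical gateway interface; real providers expose an idempotency-key
// parameter with their own method signatures.
interface PaymentGateway {
    Optional<Charge> findByIdempotencyKey(String key);
    Charge charge(String customerId, long amountCents, String idempotencyKey);
}

record Charge(String id, long amountCents) {}

class CheckoutSteps {
    private final PaymentGateway gateway;

    CheckoutSteps(PaymentGateway gateway) {
        this.gateway = gateway;
    }

    // Safe to re-run after a crash: the key is derived from the workflow and
    // step identity, so a duplicate invocation returns the original charge
    // instead of billing the customer twice.
    Charge chargeCustomer(String workflowId, String customerId, long amountCents) {
        String key = workflowId + ":charge-customer";
        return gateway.findByIdempotencyKey(key)
                .orElseGet(() -> gateway.charge(customerId, amountCents, key));
    }
}
```

The inventory example needs the same treatment, e.g. a conditional update keyed on the order id rather than a blind decrement; the DE engine only replays the workflow, it doesn’t make the individual steps idempotent for you.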
by qianli_cs
1 subcomment
- I really enjoyed this post and love seeing more lightweight approaches! The deep dive on the tradeoffs between different durable-execution approaches was great. For me, the most interesting part is Persistasaurus’s (cool name btw) use of bytecode generation via ByteBuddy, which is a clever way to improve DX: it can transparently intercept step functions and capture execution state without requiring explicit API calls (the general pattern is sketched below).
(Disclosure: I work on DBOS [1].) The author’s point about the friction from explicit step wrappers is fair; we don’t use bytecode generation today, but we’re actively exploring it to improve DX.
[1]: https://github.com/dbos-inc
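For readers who haven’t used ByteBuddy: it can subclass a workflow class and route annotated step methods through an interceptor, which is where an engine can replay recorded results instead of re-executing a step. A rough sketch of that general pattern, assuming a hypothetical `@Step` annotation and an in-memory step log (Persistasaurus’s actual implementation will differ):

```java
import net.bytebuddy.ByteBuddy;
import net.bytebuddy.implementation.MethodDelegation;
import net.bytebuddy.implementation.bind.annotation.Origin;
import net.bytebuddy.implementation.bind.annotation.RuntimeType;
import net.bytebuddy.implementation.bind.annotation.SuperCall;
import net.bytebuddy.matcher.ElementMatchers;

import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Method;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;

public class StepInterception {

    @Retention(RetentionPolicy.RUNTIME)
    public @interface Step {}

    // Illustrative in-memory log; a durable engine would persist this instead.
    static final Map<String, Object> STEP_LOG = new ConcurrentHashMap<>();

    public static class StepInterceptor {
        @RuntimeType
        public static Object intercept(@Origin Method method,
                                       @SuperCall Callable<?> original) throws Exception {
            String key = method.getName();
            if (STEP_LOG.containsKey(key)) {
                return STEP_LOG.get(key);    // replay the recorded result
            }
            Object result = original.call(); // run the real step once
            STEP_LOG.put(key, result);
            return result;
        }
    }

    // Returns a proxy whose @Step methods are transparently recorded/replayed.
    public static <T> T instrument(Class<T> workflowClass) throws Exception {
        return new ByteBuddy()
                .subclass(workflowClass)
                .method(ElementMatchers.isAnnotatedWith(Step.class))
                .intercept(MethodDelegation.to(StepInterceptor.class))
                .make()
                .load(workflowClass.getClassLoader())
                .getLoaded()
                .getDeclaredConstructor()
                .newInstance();
    }
}
```

Calling something like `instrument(MyWorkflow.class)` then hands back an instance whose steps are checkpointed without any explicit wrapper calls, which is the DX improvement the parent is pointing at.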
by the_mitsuhiko
2 subcomments
- I think this is great. We should see more simple solutions to this problem.
I recently started doing something very similar on Postgres [1] and I'm greatly enjoying using it. The total solution I ended up with is under 3,000 lines of code for the SQL and the TypeScript SDK combined, and it's much easier to use and to operate than many of the solutions on the market today. (A sketch of the core checkpointing idea follows below.)
[1]: https://github.com/earendil-works/absurd
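I haven’t looked at absurd’s internals, but the core of a Postgres-backed approach like this usually reduces to one table keyed by (workflow id, step name) that records each step’s result, so a re-run skips work that already happened. A rough JDBC sketch of that shape - the table name, schema, and string-encoded results are invented for illustration, not absurd’s actual design:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.concurrent.Callable;

// Illustrative checkpoint table, not absurd's real schema.
public class PgCheckpointedSteps {

    private final Connection conn;

    public PgCheckpointedSteps(Connection conn) throws Exception {
        this.conn = conn;
        try (var stmt = conn.createStatement()) {
            stmt.execute("""
                CREATE TABLE IF NOT EXISTS step_results (
                  workflow_id TEXT NOT NULL,
                  step_name   TEXT NOT NULL,
                  result      TEXT,
                  PRIMARY KEY (workflow_id, step_name)
                )""");
        }
    }

    // Runs the step at most once per workflow id; a replay returns the stored result.
    public String step(String workflowId, String stepName, Callable<String> body) throws Exception {
        try (PreparedStatement select = conn.prepareStatement(
                "SELECT result FROM step_results WHERE workflow_id = ? AND step_name = ?")) {
            select.setString(1, workflowId);
            select.setString(2, stepName);
            try (ResultSet rs = select.executeQuery()) {
                if (rs.next()) {
                    return rs.getString(1); // step already ran before the crash/restart
                }
            }
        }
        String result = body.call(); // do the real work exactly once per key
        try (PreparedStatement insert = conn.prepareStatement(
                "INSERT INTO step_results (workflow_id, step_name, result) VALUES (?, ?, ?) "
                        + "ON CONFLICT DO NOTHING")) {
            insert.setString(1, workflowId);
            insert.setString(2, stepName);
            insert.setString(3, result);
            insert.executeUpdate();
        }
        return result;
    }
}
```

A workflow function then wraps each side-effecting call in `step(...)`, and re-running the whole function after a crash only executes the steps that have no recorded result yet.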
by adamzwasserman
0 subcomments
- Reminds me of IBM TPF (Transaction Processing Facility) - the system that powered airline reservations for decades. TPF used per-transaction logging with restart/recovery semantics at massive scale. You could literally unplug the power mid-transaction, plug it back in, and resume exactly where you left off.
The embedded database approach here is interesting though - low latency, no network calls, perfect for single-agent workflows. TPF assumed massive concurrent load across distributed terminals. Different problems, similar durability patterns.
by fiddlerwoaroof
4 subcomments
- Every several years people reinvent serializable continuations
by websiteapi
5 subcomments
- there's a lot of hype around durable execution these days. why do that instead of regular use of queues? is it the dev ergonomics that's cool here?
you can (and people already do) model the steps of an arbitrarily large workflow, process their results in a modular fashion, and have whatever process kicks off the workflow check the state of the necessary preconditions before taking any action, so it goes straight to the step that's currently needed, retries the ones that failed, and so forth.
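For concreteness, the queue-based version described above tends to look something like this: each message names a workflow and the step to run, the worker skips steps that already completed, retries failed ones, and enqueues the successor. An in-process `BlockingQueue` stands in for a real broker here, and the step names and retry policy are invented for illustration:

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueWorker {

    record StepMessage(String workflowId, String step, int attempt) {}

    static final BlockingQueue<StepMessage> queue = new LinkedBlockingQueue<>();
    // Completed steps per workflow -- the "state of the necessary preconditions".
    static final Map<String, Set<String>> done = new ConcurrentHashMap<>();

    public static void main(String[] args) throws InterruptedException {
        queue.put(new StepMessage("wf-123", "charge", 0));
        while (!queue.isEmpty()) {
            StepMessage msg = queue.take();
            Set<String> completed =
                    done.computeIfAbsent(msg.workflowId(), id -> ConcurrentHashMap.newKeySet());
            if (completed.contains(msg.step())) {
                continue; // duplicate delivery: this step already ran
            }
            try {
                runStep(msg);                       // may throw on transient failure
                completed.add(msg.step());
                nextStep(msg.step()).ifPresent(next ->
                        queue.add(new StepMessage(msg.workflowId(), next, 0)));
            } catch (Exception e) {
                if (msg.attempt() < 3) {            // retry only the failed step
                    queue.add(new StepMessage(msg.workflowId(), msg.step(), msg.attempt() + 1));
                }
            }
        }
    }

    static void runStep(StepMessage msg) { System.out.println("running " + msg); }

    static Optional<String> nextStep(String step) {
        return switch (step) {
            case "charge" -> Optional.of("update-inventory");
            case "update-inventory" -> Optional.of("send-receipt");
            default -> Optional.empty();
        };
    }
}
```

The bookkeeping (which steps completed, what to retry) lives in application code here; packaging exactly that bookkeeping up is essentially what durable-execution engines are selling.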
- Sorry for going off-topic, but I have lately been seeing a lot of hype around durable execution.
I still cannot figure out how this is any different from launching a workflow in something like Airflow. Is the novel thing here that it can be done using the same DB you already have running?
- Serious question: How does "Durable Execution" differ from "Atomic Transaction"? At most, it seems that DE refers to more concrete details around implementing Atomic Transactions.
by 9642370096647
0 subcomments
- This is irrelevant for local LLMs where a seed value may be specified and generation is totally deterministic. This only helps with online LLMs.
by throwaway290
1 subcomment
- > A workflow engine running a BPMN job
Does anyone really do this?