I think this is great. We should see more simple solutions to this problem.
I recently started doing something very similar on Postgres [1] and I'm greatly enjoying using it. I think the total solution I ended up with is under 3000 lines of code for both the SQL and the TypeScript SDK combined, and it's much easier to use and to operate than many of the solutions on the market today.
[1]: https://github.com/earendil-works/absurd
Every several years, people reinvent serializable continuations.
Yupp, making that same point in the post :)
> You could think of [Durable Execution] as a persistent implementation of the memoization pattern, or a persistent form of continuations.
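That memoization framing can be made concrete. Below is a minimal, illustrative sketch (not any particular library's API) of durable execution as persistent memoization: each step's result is recorded under a name, and re-running the function replays recorded results instead of re-executing the work.

```typescript
// A sketch of durable execution as persistent memoization.
// `Store` stands in for a durable table; everything here is illustrative.
type Store = Map<string, unknown>; // swap for a real persistent store

// Run a named step exactly once: if a result was already recorded,
// return it without re-executing; otherwise execute and checkpoint.
async function step<T>(store: Store, name: string, fn: () => Promise<T>): Promise<T> {
  if (store.has(name)) return store.get(name) as T; // replay path
  const result = await fn();
  store.set(name, result); // checkpoint before continuing
  return result;
}

// Re-running the whole function after a crash replays completed steps
// from the store and resumes at the first step with no recorded result.
async function workflow(store: Store): Promise<string> {
  const order = await step(store, "createOrder", async () => ({ id: 42 }));
  const charge = await step(store, "chargeCard", async () => `charge:${order.id}`);
  return step(store, "sendReceipt", async () => `receipt for ${charge}`);
}
```

The "persistent continuation" view falls out of this: the set of recorded results is the serialized position in the function's control flow.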
Haha so true. Shame image based programming never really caught on.
Janet lang lets you serialize coroutines, which is fun. It makes this sort of stuff trivial.
Is this reinvention somehow "transactional" in nature?
there's a lot of hype around durable execution these days. why do that instead of regular use of queues? is it the dev ergonomics that's cool here?
You can (and people already do) model the steps of an arbitrarily large workflow, have their results processed in a modular fashion, and have whatever process begins the workflow check the state of the necessary preconditions before taking any action, so it can jump to the currently needed step, retry steps that failed, and so forth.
> is it the dev ergonomics that's cool here?
Yup. Being able to write imperative code that automatically resumes where it left off is very valuable. It's best to represent durable Turing-completeness using the modern approach to authoring such logic: programming languages. Being able to loop, try/catch, apply advanced conditional logic, etc. in a crash-proof algorithm that can run for weeks, months, or years and is introspectable has a lot of value over just using queues.
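As a sketch of what that ergonomic win looks like (hypothetical helpers, no real SDK assumed): ordinary loops and try/catch layered over checkpointed steps, where a crash between iterations loses no completed work because each finished step is persisted before the loop advances.

```typescript
// Sketch only: checkpointed steps under plain imperative control flow.
type Checkpoints = Map<string, unknown>; // stand-in for a durable store

async function step<T>(cp: Checkpoints, key: string, fn: () => Promise<T>): Promise<T> {
  if (cp.has(key)) return cp.get(key) as T; // already done: skip
  const v = await fn();
  cp.set(key, v); // persist the result before moving on
  return v;
}

let calls = 0;
// A stand-in for an unreliable external call: fails on its first attempt.
async function flakyUpload(item: string): Promise<string> {
  if (calls++ === 0) throw new Error("transient failure");
  return `uploaded:${item}`;
}

// Imperative control flow: a for-loop over items, with in-step retries
// expressed as an ordinary try/catch rather than queue redelivery config.
async function uploadAll(cp: Checkpoints, items: string[]): Promise<string[]> {
  const results: string[] = [];
  for (const [i, item] of items.entries()) {
    results.push(
      await step(cp, `upload:${i}`, async () => {
        for (let attempt = 1; ; attempt++) {
          try {
            return await flakyUpload(item);
          } catch (err) {
            if (attempt >= 3) throw err; // give up after 3 attempts
          }
        }
      }),
    );
  }
  return results;
}
```

The same logic expressed as queue messages would scatter the loop, the retry policy, and the "where was I?" bookkeeping across handlers and queue configuration.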
Durable execution is all just queues and task processing and event sourcing under the hood though.
As you say, it can be done, but it's an anti-pattern to use a message queue as a database, which is essentially what you're doing for these kinds of long-running tasks. The reason is that there's a lot of state you're likely going to want to report on as a task runs, and to persist and checkpoint. Yes, you can carefully string together a series of database calls chained with message transactions so you don't lose anything when an issue happens, but then you also need bespoke logic to restart or retry each step, and it can turn into a bit of a mess.
Message queues (e.g. SQS) are inappropriate for tracking long-running tasks/workflows, due to operational requirements such as:
- Checking the status of a task (queued, pending, failed, cancelled, completed)
- Cancelling a queued task (or pending task if the execution environment supports it)
- Re-prioritizing queued tasks
- Searching for tasks based off an attribute (e.g. tag)
You really do need a database for this.
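To make the contrast concrete, here's an illustrative sketch (an in-memory stand-in, not a real database or queue API) of how those four operations map naturally onto rows in a task table; a plain queue exposes none of these queries or updates on in-flight messages.

```typescript
// Illustrative only: an array stands in for a database table of tasks.
type Status = "queued" | "pending" | "failed" | "cancelled" | "completed";

interface Task {
  id: number;
  status: Status;
  priority: number;
  tags: string[];
}

class TaskTable {
  private rows: Task[] = [];

  insert(task: Task): void {
    this.rows.push(task);
  }

  // Checking the status of a task.
  status(id: number): Status | undefined {
    return this.rows.find((r) => r.id === id)?.status;
  }

  // Cancelling a task that has not started running yet.
  cancel(id: number): boolean {
    const t = this.rows.find((r) => r.id === id);
    if (t?.status === "queued") {
      t.status = "cancelled";
      return true;
    }
    return false;
  }

  // Re-prioritizing a queued task.
  reprioritize(id: number, priority: number): void {
    const t = this.rows.find((r) => r.id === id);
    if (t?.status === "queued") t.priority = priority;
  }

  // Searching for tasks based off an attribute.
  byTag(tag: string): Task[] {
    return this.rows.filter((r) => r.tags.includes(tag));
  }
}
```

Each of these is a one-line query or update against a table, whereas a queue like SQS only lets you receive, delete, or delay messages you currently hold.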
We built what is effectively a durable execution "engine" for our orchestrator (ours is backed by boltdb and not SQLite, which I objected to, correctly). The steps in our workflows build running virtual machines and include things like allocating addresses, loading BPF programs, preparing root filesystems, and registering services.
Short answer: we need to be able to redeploy and bounce the orchestrator without worrying about what stage each running VM on our platform is in.
JP, the dev that built this out for us, talks a bit about the design rationale (search for "Cadence") here:
https://fly.io/blog/the-exit-interview-jp/
The library itself is open:
https://github.com/superfly/fsm