
Introduction to durable execution

Flawless is a durable execution engine for Rust. But what is durable execution actually?

The easiest way to think about durable execution is as code that runs until completion, even in the presence of external failure. The external failure part is very important.

Rust as a language already pushes developers to explicitly handle errors. Almost every API touching the operating system can fail and returns a Result, letting the developer decide what should happen in that case. This is great! But there are some failure scenarios that we can't handle directly from code, for example when the process executing the code is shut down. You can't if-else a kill -9. That's what I like to call an external failure. Flawless gives us tools to deal with such failures directly from code.

Let's look at some code and see how external failures can affect your system!

// extend_subscription.rs

fn extend_subscription(user: User) {
    // 1. Charge user's credit card.
    let transaction = stripe_api::charge_card(user.card());

    // 2. Send invoice to user.
    let invoice = generate_invoice(&user, transaction);
    loops_api::send_invoice(user.email(), invoice);

    // 3. Extend subscription.
    subscription_service::extend(user.id(), Month::new(1));
}

This is a fairly straightforward function. It charges the credit card of a user, sends an invoice to the same user, and extends their subscription. In the majority of cases it will work just fine. However, there are a few cases where it might fail or, even worse, leave the system in an inconsistent state.

One such case would be the VM being restarted by an administrator to apply a security update, right at the moment when this function is executing! Maybe the credit card was charged, but the code to extend the subscription was never called, resulting in a very upset customer.

The most obvious way to protect against such scenarios is to create some kind of state machine and persist it to a database or queue. That way, the next time the app starts up, it knows exactly how far it got and can continue.
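To make the idea concrete, here is a minimal sketch of such a hand-rolled state machine. Everything in it is an assumption made for illustration: the Step enum, the in-memory StepStore standing in for a database table, and the crash_after parameter that simulates the process dying right after a step was persisted.

```rust
use std::collections::HashMap;

/// Steps of the subscription workflow, in execution order.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Step {
    Start,
    Charged,
    InvoiceSent,
    Extended,
}

/// Hypothetical stand-in for a database table: user id -> last completed step.
#[derive(Default)]
struct StepStore {
    rows: HashMap<u64, Step>,
}

impl StepStore {
    fn load(&self, user_id: u64) -> Step {
        self.rows.get(&user_id).copied().unwrap_or(Step::Start)
    }

    fn save(&mut self, user_id: u64, step: Step) {
        self.rows.insert(user_id, step);
    }
}

/// Runs the workflow from the last persisted step and returns the actions it
/// performed. `crash_after` simulates the process dying right after that
/// step was persisted.
fn run(store: &mut StepStore, user_id: u64, crash_after: Option<Step>) -> Vec<&'static str> {
    let mut performed = Vec::new();
    loop {
        let (action, next) = match store.load(user_id) {
            Step::Start => ("charge card", Step::Charged),
            Step::Charged => ("send invoice", Step::InvoiceSent),
            Step::InvoiceSent => ("extend subscription", Step::Extended),
            Step::Extended => return performed,
        };
        performed.push(action); // the real side effect would happen here
        store.save(user_id, next);
        if crash_after == Some(next) {
            return performed; // simulated crash
        }
    }
}

fn main() {
    let mut store = StepStore::default();
    let first = run(&mut store, 42, Some(Step::Charged));
    let second = run(&mut store, 42, None); // restart: resumes after the charge
    println!("first run: {first:?}");
    println!("second run: {second:?}");
}
```

Even in this toy version, a lot of code exists only to track progress, and the gap between performing an action and saving the new step is still unprotected.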

Of course, this is not enough. Failures can come at the most inconvenient of times. What happens if we persisted that an HTTP call to the Stripe API is about to be made, but don't know whether it finished? Is it safe to continue executing and repeat the call? Is the user going to be charged twice? Idempotence and retry safety are further topics that we need to care about. At this point we are developing a very sophisticated state machine, with complex resume rules.
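One common way to make a repeated call safe is an idempotency key: the caller attaches a stable key to the request, and the receiver performs the operation at most once per key. The sketch below illustrates the idea with a made-up, in-memory PaymentProvider. It is not the Stripe API, and all names in it are assumptions.

```rust
use std::collections::HashMap;

/// Hypothetical payment provider that remembers requests by idempotency key.
#[derive(Default)]
struct PaymentProvider {
    charges: HashMap<String, u64>, // idempotency key -> transaction id
    next_transaction_id: u64,
}

impl PaymentProvider {
    /// Charges at most once per key. A repeated key returns the original
    /// transaction id instead of charging again.
    fn charge(&mut self, idempotency_key: &str) -> u64 {
        if let Some(&id) = self.charges.get(idempotency_key) {
            return id;
        }
        self.next_transaction_id += 1;
        self.charges
            .insert(idempotency_key.to_string(), self.next_transaction_id);
        self.next_transaction_id
    }

    /// Number of charges that were actually performed.
    fn total_charges(&self) -> usize {
        self.charges.len()
    }
}

fn main() {
    let mut provider = PaymentProvider::default();
    let key = "user-42:2024-12"; // made-up key: one charge per user per month

    let first = provider.charge(key);
    // Simulated retry after a crash: we don't know if the first call went through.
    let retry = provider.charge(key);

    println!("first={first} retry={retry} charges={}", provider.total_charges());
}
```

With such a key the retry becomes harmless, but only if the remote service supports it and the key itself survives the crash.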

Flawless takes this burden away from us. It allows us to just write business logic instead, and gives us tools to deal with such failures directly from the code. Similar to how a database abstracts away all the little quirks of the file-system API and gives the developer a robust way to store their data, durable execution does the same for running code. It allows you to resume the execution from any arbitrary point and gives you tools to correctly model retries.

How does it work?

The most naive way of implementing a durable execution system would be to snapshot the whole thing. Stack, heap and registers. After every instruction!

This would also be very inefficient and resource-intensive. Flawless takes a much leaner approach. It relies on the fact that modern CPUs are very fast, and usually a much cheaper resource than memory, storage or networking. In the case of failure, it will re-execute the code from the beginning, but only the deterministic parts of it.

Everything that has a side effect, like HTTP calls, is executed only once and the result of the operation is persisted to a log file. If we ever need to re-execute the function, the logged results are used instead of repeating the side effects, which makes the re-execution deterministic.

The following workflow demonstrates this, together with the kinds of entries recorded in its side effect log.

// workflow.rs

let user = "Adele Goldberg";
let comic_id: u32 = flawless::rand::random();
let url = format!("https://xkcd.com/{comic_id}/");
let content = flawless_http::get(url).send();
let quote = parse_comic(content);
let greeting = format!("Hi {user}! // '{quote}'");
...

Side effect log: recv msg {...}, HTTP request, send msg {...}, get clock time

Only a minimal amount of data is persisted to disk and everything else is recalculated on demand. Flawless also uses WebAssembly as the compilation target, to guarantee determinism across operating systems and CPU architectures. You can start a workflow on one machine and finish it on another. But from the perspective of an outside observer, it seems as if the code just executes from start to finish. Only the side effects, which are guaranteed to execute once, can be observed from the outside.
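The replay idea can be sketched in a few lines of plain Rust. This is a simplified model under stated assumptions, not how Flawless is implemented: EffectLog, effect and rewind are hypothetical names, the log lives in memory instead of on disk, and the two "side effects" are fake closures that bump a counter so we can see how often they really run.

```rust
/// Append-only log of side effect results, plus a replay cursor.
struct EffectLog {
    entries: Vec<String>,
    cursor: usize,
}

impl EffectLog {
    fn new() -> Self {
        EffectLog { entries: Vec::new(), cursor: 0 }
    }

    /// Prepares the log for a re-execution from the beginning.
    fn rewind(&mut self) {
        self.cursor = 0;
    }

    /// Returns the recorded result if this step already ran. Otherwise runs
    /// the side effect once and records its result.
    fn effect(&mut self, run: impl FnOnce() -> String) -> String {
        if self.cursor == self.entries.len() {
            self.entries.push(run());
        }
        let result = self.entries[self.cursor].clone();
        self.cursor += 1;
        result
    }
}

/// Deterministic code wrapped around two (fake) side effects.
fn workflow(log: &mut EffectLog, real_calls: &mut u32) -> String {
    let user = "Adele Goldberg";
    // Stand-in for a random number: logged, so a replay sees the same value.
    let comic_id = log.effect(|| {
        *real_calls += 1;
        "353".to_string()
    });
    let url = format!("https://xkcd.com/{comic_id}/");
    // Stand-in for an HTTP request.
    let content = log.effect(|| {
        *real_calls += 1;
        format!("<html>{url}</html>")
    });
    format!("Hi {user}! {content}")
}

fn main() {
    let mut log = EffectLog::new();
    let mut real_calls = 0;

    let first = workflow(&mut log, &mut real_calls);
    log.rewind(); // simulate a restart: re-execute from the beginning
    let replayed = workflow(&mut log, &mut real_calls);

    println!("same result: {}", first == replayed);
    println!("side effects actually executed: {real_calls}");
}
```

The second run recomputes every format! call but never re-runs a closure, which mirrors the split between cheap deterministic code and logged side effects.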

Flawless uses this log mechanism to hook back into the code. This allows us to programmatically handle external failures. Let's look at the following example.

let result
    = flawless_http::post("https://takes-a-long-time.com")
        .body(form)
        .send();

If a service takes a very long time to respond to our HTTP request, an external failure could occur while we are still waiting for the response. Flawless uses a double commit system to detect, when re-running a side effect, whether we actually got a response. If we did not, the call will return an error of kind ErrorKind::RequestInterrupted. With this, we are bringing external failures back into our code.

If we know that it's safe to retry the request, we can signal this directly from the code.

let result
    = flawless_http::post("https://takes-a-long-time.com")
        .body(form)
        .idempotent()
        .send();

Now, Flawless will always retry the request, even if it was interrupted mid-request.
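One way to picture a double commit is as two log entries per request: one written before the request is sent and one written after the response arrives. The sketch below is a guess at the general shape of such a scheme, not Flawless's actual implementation. Entry, CallError and call are hypothetical names, and a crashed run is modeled simply as a log that ends in a Started entry.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Entry {
    Started,           // first commit: written before the request is sent
    Completed(String), // second commit: written once the response arrived
}

#[derive(Debug, PartialEq)]
enum CallError {
    RequestInterrupted,
}

/// Executes one logged request. `log` holds whatever survived earlier runs.
fn call(
    log: &mut Vec<Entry>,
    idempotent: bool,
    send: impl FnOnce() -> String,
) -> Result<String, CallError> {
    match log.last().cloned() {
        // Response already recorded: replay it without sending anything.
        Some(Entry::Completed(response)) => return Ok(response),
        // Started but never completed: an external failure hit mid-request,
        // and we were not told that repeating the request is safe.
        Some(Entry::Started) if !idempotent => return Err(CallError::RequestInterrupted),
        // Started but never completed, and marked idempotent: retry.
        Some(Entry::Started) => {}
        // First execution: record that we are about to send.
        None => log.push(Entry::Started),
    }
    let response = send();
    log.push(Entry::Completed(response.clone()));
    Ok(response)
}

fn main() {
    // Log left behind by a run that died while waiting for the response.
    let mut crashed = vec![Entry::Started];

    let plain = call(&mut crashed, false, || "200 OK".to_string());
    println!("without idempotent: {plain:?}");

    let retried = call(&mut crashed, true, || "200 OK".to_string());
    println!("with idempotent: {retried:?}");
}
```

In this model the non-idempotent call surfaces the interruption as an ordinary error value, while the idempotent one quietly sends the request again.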

Use cases

Once you have this guarantee, that a piece of code will run until the end, even if temporarily interrupted, the abstractions you can build on top of it are much more interesting. Let's look at two of them, long-running workflows and transactional behavior.

Long-running workflows

The longer something is executing, the more likely it is that it will fail in the middle of the execution, and the harder it becomes to manually construct a state machine to resume from an arbitrary point. That's where durable execution shines.

Functions that need to run for months, years or even forever are a valid pattern when it comes to durable execution. Throwing a casual sleep(1 year) into your business logic is not a big deal, because we can completely shut off this one "thread" for a year. We already have the means to resume it from any point. During this period it will use 0 CPU and 0 memory resources.
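The reason a long sleep can be free is that the only thing that has to survive is the wake-up time. A rough sketch of that idea follows, with a hypothetical DurableTimer type and plain Unix timestamps passed in by hand. How Flawless actually schedules timers is not described here, so treat this purely as an illustration.

```rust
const YEAR_SECS: u64 = 365 * 24 * 60 * 60;

/// The only state a sleeping workflow needs to persist in this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
struct DurableTimer {
    wake_at: u64, // seconds since the Unix epoch
}

impl DurableTimer {
    /// Records when the workflow should continue. Nothing keeps running.
    fn start(now: u64, duration_secs: u64) -> Self {
        DurableTimer { wake_at: now + duration_secs }
    }

    /// Seconds left to wait, as seen by whoever loads the timer later.
    fn remaining(&self, now: u64) -> u64 {
        self.wake_at.saturating_sub(now)
    }

    fn is_due(&self, now: u64) -> bool {
        now >= self.wake_at
    }
}

fn main() {
    let started_at = 1_733_702_400; // arbitrary example timestamp
    let timer = DurableTimer::start(started_at, YEAR_SECS);

    // The server restarts 100 days later and reloads the persisted timer.
    let after_restart = started_at + 100 * 24 * 60 * 60;
    println!(
        "remaining: {}s, due: {}",
        timer.remaining(after_restart),
        timer.is_due(after_restart)
    );
}
```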

Transactional behavior

You have probably heard of database transactions. They ensure that either all database operations inside a transaction are successfully performed or none are executed, leaving the system in a consistent state. But what do you do when your operations are spread across different microservices, external APIs or different databases? You can't have a transaction start in PostgreSQL and finish in MySQL.

Durable execution lets you build such a system using the Saga pattern. If you can guarantee that all steps of a transaction or the reverting logic (in case of failure) are going to be executed, it becomes "trivial" to build very robust transactional systems.
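As a rough illustration of the Saga pattern, the sketch below runs a list of steps in order and, when one fails, reverts the already completed steps in reverse order. The Step struct, its succeeds flag (a stand-in for a real call that may fail) and the journal of performed actions are all made up for this example. The sketch also does not handle crashes between steps, which is exactly the part durable execution would take care of.

```rust
/// One saga step. `succeeds` is a stand-in for a real call that may fail.
struct Step {
    name: &'static str,
    succeeds: bool,
}

/// Runs all steps in order. If one fails, the steps completed so far are
/// compensated in reverse order. Every action is appended to `journal`.
fn run_saga(steps: &[Step], journal: &mut Vec<String>) -> Result<(), String> {
    let mut completed: Vec<&Step> = Vec::new();
    for step in steps {
        if step.succeeds {
            journal.push(format!("do {}", step.name));
            completed.push(step);
        } else {
            // Revert what was already done, newest first.
            for done in completed.iter().rev() {
                journal.push(format!("undo {}", done.name));
            }
            return Err(format!("{} failed", step.name));
        }
    }
    Ok(())
}

fn main() {
    let steps = [
        Step { name: "reserve seat", succeeds: true },
        Step { name: "charge card", succeeds: true },
        Step { name: "book hotel", succeeds: false },
    ];
    let mut journal = Vec::new();
    let outcome = run_saga(&steps, &mut journal);
    println!("outcome: {outcome:?}");
    println!("journal: {journal:?}");
}
```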

There are many other patterns that can be built on top of durable execution. In the end, durable execution is just code, so almost anything can be expressed with it.

Wondering if it would be a good solution for your use case? Join the Flawless Discord and tell us more about it!

Ready to try it out?

Flawless is a single binary that you run as a server and send your workflows to. If you would like to try Flawless, check out the installation instructions.