What if Your Code Could Survive Any Crash? (It Can)

How Durable Execution Eliminates Infrastructure Code Through Journaling


TL;DR

In Part 1, we identified the problem: 60-80% of microservice code is infrastructure. Retries, state machines, idempotency, distributed locks.

This post introduces durable execution: the runtime journals every step of your code. Crash mid-execution? Replay from the journal. No duplicate work, no lost progress, no infrastructure code.

The formal definition (from Dominik Tornow): "Durable Executions are interruption-agnostic definitions of functions that result in interruption-tolerant executions of functions." Your code ignores crashes; the runtime compensates.

Framework used: Restate (open source, Rust-based runtime with TypeScript/Java/Kotlin/Python/Go/Rust SDKs). Available self-hosted or as Restate Cloud.

Companion code: All examples are from restate-oms-demo, a complete Order Management System implementation.


The Promise That Changes Everything

"Write normal code, get fault tolerance."

Every durable execution framework makes this promise. But what does "normal code" actually mean? And what is "fault tolerance" in this context?

The insight is deceptively simple: every interruption, whether a voluntary sleep or an involuntary crash, should be handled the same way. Suspend on one process, resume on another. Your code shouldn't care which happened.

This is what separates durable execution from retry libraries and circuit breakers. Those tools help you handle failures. Durable execution makes failures irrelevant.


Why Distributed Tasks Are Hard

I've spent eight years building backend systems that coordinate work across services. Order processing, payment workflows, container operations at port terminals. Every time, I end up writing the same infrastructure code to handle the same problems.

Let me break down what makes distributed task execution genuinely difficult.

The Failure Problem

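When you call an external service, three things can happen:

  1. The call succeeds and you get a response.

  2. The call fails and you get an error.

  3. The call times out and you get nothing back.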

Case 3 is the killer. You don't know if the payment went through. You don't know if the inventory was reserved. Now you need to decide: retry (and risk duplicate charges) or skip (and risk lost transactions).

So you write retry logic with exponential backoff. Then idempotency checks. Then dead letter queues. Then monitoring to track all of this.

The State Problem

Where do you store progress? Let's say your workflow has 5 steps:

fun processOrder(order: Order) {
    val payment = chargePayment(order)           // Step 1
    val inventory = reserveInventory(order)       // Step 2
    val shipping = scheduleShipping(order)        // Step 3
    updateOrderStatus(order, "CONFIRMED")         // Step 4
    sendConfirmationEmail(order)                  // Step 5
}

If you crash after step 2, how do you resume? You need to persist state somewhere. But now you have two writes: the business action AND the state update. This is the dual-write problem. Either can fail independently, leaving your system inconsistent.

You could use a database transaction, but your external calls (payment, inventory) aren't part of that transaction. You could use an outbox pattern, but now you're managing change data capture and message ordering.

The Concurrency Problem

Two requests arrive for the same account at the same time. Both read balance = 100. Both add 50. Both write 150. The balance should be 200. You just lost 50.

This is the lost update problem: without proper synchronization, concurrent operations on the same entity silently overwrite each other and corrupt data.
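The interleaving is easy to reproduce deterministically, without threads. A minimal sketch, assuming a hypothetical in-memory `balances` map standing in for the database (the account ID and function names are illustrative):

```kotlin
// Hypothetical in-memory "database": account ID -> balance.
val balances = mutableMapOf("acct-1" to 100)

// Unsafe interleaving: both requests read before either one writes.
fun lostUpdate(): Int {
    balances["acct-1"] = 100
    val readA = balances.getValue("acct-1")   // request A reads 100
    val readB = balances.getValue("acct-1")   // request B reads 100
    balances["acct-1"] = readA + 50           // A writes 150
    balances["acct-1"] = readB + 50           // B overwrites with 150
    return balances.getValue("acct-1")
}

// Serialized: each request finishes its read-modify-write before the next starts.
fun serializedUpdate(): Int {
    balances["acct-1"] = 100
    repeat(2) { balances["acct-1"] = balances.getValue("acct-1") + 50 }
    return balances.getValue("acct-1")
}
```

The only difference between the two functions is ordering. That is why the fix is always some form of serialization per entity, whether you get it from a lock or from the runtime.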

Traditional solution: Distributed locks (Redis SETNX + expiration). But what if the process holding the lock dies? What if it takes longer than the lock timeout? Now you're tuning timeouts and handling lock contention.

Restate solution: Virtual Objects with built-in single-writer semantics. No locks needed.

The Infrastructure Problem

To solve all of this "properly," you end up with:

Five systems to manage, each with its own failure modes, scaling challenges, and operational overhead.

The Debugging Problem

Something went wrong in production. Which step failed? What was the state at that point? What retries happened? You grep through logs across multiple services, correlate trace IDs, and piece together what happened.


The Root Cause: Invisible Progress

All these problems share a root cause: progress is invisible.

The customer was charged. The inventory was reserved. But your process doesn't know that. It was all in memory, and memory is gone.

This is why we build all that infrastructure. We're trying to make progress observable and recoverable.

But what if the runtime handled this for us?


The Four Characteristics of Durable Execution

Before diving into solutions, let's establish what durable execution actually provides. These four characteristics apply universally across all durable execution platforms:

1. Virtualized Execution

Execution spans processes and machines. If a process crashes during step 3 of 5, execution resumes in a new process with all variables restored to their pre-crash values. The execution is virtual. It's not bound to a single physical process.

2. Not Limited by Time

Because execution survives crashes, applications can run for as long as needed. Milliseconds, days, or years. A payment workflow completes in seconds. A mortgage approval workflow runs for months, pausing while waiting for documents and approvals.

3. Automatic State Preservation

You don't need defensive database writes to guard against crashes. All variables (including local variables) are durable. They have the same values after a crash as before.

4. Hardware Agnostic

Unlike hardware fault tolerance (redundant CPUs, exotic machines), durable execution builds reliability into software. It works on VMs, containers, serverless, and across clouds without specialized infrastructure.

These aren't implementation details. They're the definition of what durable execution provides. Any system claiming to offer durable execution should deliver all four.


The Solution: Journal Everything

Durable execution makes progress visible by recording every step to a persistent log the moment it completes.

Suppose the process crashes after Step 3 of a five-step workflow. What happens:

  1. Steps 1-3 completed and were journaled before the crash

  2. On restart, the runtime reads the journal

  3. Completed steps return cached results (no re-execution)

  4. Execution resumes at Step 4

Result: No duplicate charges. No lost reservations. No infrastructure code.
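To make the replay mechanics concrete, here is a minimal sketch of the idea, not Restate's implementation. `Journal`, `SimulatedCrash`, and the step names are all hypothetical, and the journal lives in memory where a real runtime would persist it to a replicated log:

```kotlin
// Minimal in-memory journal: step name -> recorded result.
// A real runtime would persist this to a replicated log; this sketch does not.
class Journal {
    private val entries = LinkedHashMap<String, Any?>()
    var executions = 0          // how many step bodies actually ran
        private set

    // Return the recorded result if the step already completed; otherwise run and record it.
    @Suppress("UNCHECKED_CAST")
    fun <T> step(name: String, block: () -> T): T {
        if (entries.containsKey(name)) return entries[name] as T
        val result = block()
        executions++
        entries[name] = result  // "journaled" the moment the step completes
        return result
    }
}

class SimulatedCrash : RuntimeException("process died")

// Five-step handler; crashAt simulates the process dying just before that step runs.
fun processOrder(journal: Journal, crashAt: Int? = null): String {
    val steps = listOf("charge", "reserve", "ship", "confirm", "email")
    steps.forEachIndexed { i, name ->
        if (crashAt == i + 1) throw SimulatedCrash()
        journal.step(name) { "$name-done" }
    }
    return "CONFIRMED"
}
```

Call `processOrder(journal, crashAt = 4)`, catch the crash, then call `processOrder(journal)` again with the same journal. The second run re-enters the handler from the top, but only steps 4 and 5 execute their bodies; steps 1-3 return their recorded results.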

Visual: What Changed

What disappeared: 80% of infrastructure code. What remains: Pure business logic.

What This Gives You

Traditional Approach | Durable Execution
Manual retry logic with backoff | Automatic retries with exponential backoff
Idempotency keys + deduplication logic | Automatic (same step = cached result)
Distributed locks for concurrency | Single-writer semantics per entity
Outbox pattern for dual-write | State + journal updated atomically
Multiple infrastructure systems | One runtime
Log correlation for debugging | Execution history built-in

The Key Insight

From the Restate architecture documentation on write path and step lifecycle:

"As the service executes, it streams back a step result event. The processor appends this step journal entry to the log. The moment this append is replicated to quorum defines 'the step happened.' From then on, the step will be recovered on retries and won't be re-executed."

Each step is persisted immediately when it completes. Not batched, not at handler end. This is what makes replay deterministic and safe.

Why "Crash-Proof" Doesn't Mean "Crashes Can't Happen"

To be precise: describing durable execution as crash-proof doesn't mean crashes are prevented. It means crashes have no consequence.

Analogy: A waterproof watch doesn't stop you from falling into a pool. It means the fall doesn't matter.

This distinction matters when explaining durable execution to skeptics. You're not claiming to prevent failures. You're claiming failures become operationally irrelevant.


What the Code Looks Like

Here's the mental shift. Instead of thinking about infrastructure, you write business logic. The framework handles the rest.

Traditional Approach (What We've All Written)

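import kotlin.math.pow
import kotlin.time.Duration.Companion.seconds
import kotlinx.coroutines.delay

class OrderProcessor(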
    private val paymentService: PaymentService,
    private val inventoryService: InventoryService,
    private val orderRepository: OrderRepository,
    private val lockService: RedisLockService,
    private val outbox: OutboxService
) {
    suspend fun processOrder(orderId: String, request: OrderRequest): Order {
        // Acquire distributed lock
        val lock = lockService.acquire("order:$orderId", timeout = 30.seconds)
            ?: throw ConcurrentModificationException("Order locked")

        // Declared before try so the catch block can compensate
        // (PaymentResult and Reservation are placeholder type names)
        var payment: PaymentResult? = null
        var inventory: Reservation? = null

        try {
            // Check idempotency
            val existing = orderRepository.findByIdempotencyKey(request.idempotencyKey)
            if (existing != null) return existing

            // Retry with backoff
            payment = retryWithBackoff(maxAttempts = 5) {
                paymentService.charge(request.payment)
            }

            inventory = retryWithBackoff(maxAttempts = 5) {
                inventoryService.reserve(request.items)
            }

            // Dual-write: update DB + publish event
            val order = Order(orderId, status = CONFIRMED)
            orderRepository.transaction {
                orderRepository.save(order)
                outbox.save(OrderConfirmedEvent(orderId))
            }
            return order

        } catch (e: Exception) {
            // Compensation logic
            payment?.let { paymentService.refund(it.paymentId) }
            inventory?.let { inventoryService.release(it.reservationId) }
            throw e
        } finally {
            lockService.release(lock)
        }
    }

    private suspend fun <T> retryWithBackoff(
        maxAttempts: Int,
        block: suspend () -> T
    ): T {
        var lastError: Exception? = null
        repeat(maxAttempts) { attempt ->
            try {
                return block()
            } catch (e: Exception) {
                lastError = e
                delay((2.0.pow(attempt) * 100).toLong())
            }
        }
        throw lastError!!
    }
}

This is 50+ lines before you've written any business logic.

Durable Execution Approach

Now let's see the same workflow with Restate. Watch what disappears:

  • ❌ No lock acquisition/release

  • ❌ No idempotency key management

  • ❌ No retry logic with backoff

  • ❌ No outbox pattern

  • ❌ No exception handling for each external call

Just business logic:

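import dev.restate.sdk.annotation.Handler
import dev.restate.sdk.annotation.Service
import dev.restate.sdk.kotlin.Context
import dev.restate.sdk.kotlin.runBlock

@Service
class OrderProcessor(
    private val paymentService: PaymentService,
    private val inventoryService: InventoryService
) {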

    @Handler
    suspend fun processOrder(ctx: Context, request: OrderRequest): Order {
        // Each ctx.runBlock() result is journaled immediately
        // Replay skips completed steps, returns cached results

        val payment = ctx.runBlock("charge-payment") {
            paymentService.charge(request.payment)
        }

        val inventory = ctx.runBlock("reserve-inventory") {
            inventoryService.reserve(request.items)
        }

        // Deterministic UUID generation (same on replay)
        val orderId = ctx.random().nextUUID().toString()

        return Order(
            orderId = orderId,
            paymentId = payment.paymentId,
            reservationId = inventory.reservationId,
            status = OrderStatus.CONFIRMED
        )
    }
}

That's it. The runtime handles:

  • Retries: Failed steps retry with exponential backoff

  • Idempotency: Same invocation = same result (no duplicate payments)

  • Persistence: Each ctx.runBlock() result is journaled immediately

  • Recovery: On crash, replay from journal, skip completed steps

  • Determinism: ctx.random() returns the same value on replay
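Why does randomness need special treatment? If the handler re-runs from the top on replay, a plain random UUID would differ on every attempt. One plausible way to get replay-stable values is to seed the generator from something that stays fixed for the invocation. A minimal sketch, assuming a hypothetical `deterministicUuid` helper and using the invocation ID's hash as the seed (this illustrates the idea, not the SDK's internals):

```kotlin
import java.util.UUID
import kotlin.random.Random

// Hypothetical helper: derive a replay-stable UUID from an invocation ID.
// The same invocationId always yields the same value, so a replayed handler
// sees the same order ID it generated before the crash.
// Note: the result is not an RFC 4122 versioned UUID; this is only a sketch.
fun deterministicUuid(invocationId: String): UUID {
    val rng = Random(invocationId.hashCode())
    return UUID(rng.nextLong(), rng.nextLong())
}
```

Calling `deterministicUuid("inv-1")` twice returns equal values, while a different invocation ID gives a different one.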


Virtual Objects: The Concurrency Solution We Promised

Remember the concurrency problem? Two requests modifying the same order? Traditional solution: distributed locks. Restate's solution: built-in single-writer semantics.

import dev.restate.sdk.annotation.Handler
import dev.restate.sdk.annotation.Shared
import dev.restate.sdk.annotation.VirtualObject
import dev.restate.sdk.kotlin.ObjectContext
import dev.restate.sdk.kotlin.SharedObjectContext
import dev.restate.sdk.kotlin.stateKey

@VirtualObject
class OrderAggregate {
    companion object {
        // State is scoped to this Virtual Object key (e.g., order-123)
        private val STATE = stateKey<Order>("state")
    }

    @Handler  // Exclusive: one writer per order ID at a time
    suspend fun handle(ctx: ObjectContext, command: OrderCommand): Order {
        // ctx.key() returns the Virtual Object key (e.g., "order-123")
        val state = ctx.get(STATE) ?: Order.empty(ctx.key())

        // Pure business logic (no side effects): decide which events the command produces
        val events = decider.decide(command, state)
        val newState = events.fold(state, decider::evolve)

        // Persist atomically with journal
        ctx.set(STATE, newState)
        return newState
    }

    @Shared  // Concurrent reads allowed
    suspend fun get(ctx: SharedObjectContext): Order? = ctx.get(STATE)
}

How it works:

  • Each Virtual Object has a key (e.g., order-123)

  • Requests for the same key are queued and processed sequentially

  • Different keys process in parallel

  • No distributed locks needed (Restate guarantees single-writer semantics)

Key insight: Same key = sequential. Different keys = parallel. No locks, no contention tuning.
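The rule can be sketched in-process with one lock object per key. This is only a stand-in: Restate queues requests per key across machines rather than locking inside one JVM, and `KeyedCounters` and `runWriters` are hypothetical names.

```kotlin
import java.util.concurrent.ConcurrentHashMap
import kotlin.concurrent.thread

// Sketch of single-writer-per-key: one lock object per key, so handlers for the
// same key run one at a time while different keys proceed in parallel.
// In-process only; a real runtime enforces this across machines.
class KeyedCounters {
    private val locks = ConcurrentHashMap<String, Any>()
    private val state = ConcurrentHashMap<String, Int>()

    fun add(key: String, amount: Int) {
        val lock = locks.computeIfAbsent(key) { Any() }
        synchronized(lock) {
            state[key] = (state[key] ?: 0) + amount   // read-modify-write, exclusive per key
        }
    }

    fun get(key: String): Int = state[key] ?: 0
}

// Two concurrent writers on the same key plus one on a different key.
fun runWriters(): KeyedCounters {
    val counters = KeyedCounters()
    val workers = listOf("order-123", "order-123", "order-456").map { key ->
        thread { repeat(1_000) { counters.add(key, 1) } }
    }
    workers.forEach { it.join() }
    return counters
}
```

The two writers on order-123 never interleave inside the read-modify-write, so no increment is lost, and order-456 is never blocked by either of them.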

From the Virtual Objects documentation:

"Only one handler with write access can run at a time per object key to prevent concurrent/lost writes or race conditions."


The Architecture: How Restate Works

Restate is a server (not a library) that sits in front of your services, similar to a reverse proxy or message broker. This is a key architectural choice. Your services remain stateless while Restate handles all the durability.

Why Build From First Principles?

The Restate team built their own distributed log (Bifrost) rather than using Kafka or an existing database. Why?

They needed a specific combination of properties not found in existing systems:

  • Single-roundtrip latency with quorum replication

  • Active-active deployments with flexible quorums

  • Segmented log that can be dynamically reconfigured

The design draws from Virtual Consensus and LogDevice, using strong consensus in the control plane and relaxed requirements on the data plane for efficiency.

Three Stateful Components

1. Metadata Store (Raft-based)

The internal source of truth for node membership, partition assignments, and cluster configuration. Uses Raft consensus for correctness.

2. Bifrost (Durable Log)

The primary durability layer. Each partition has a single sequencer/leader that orders events and replicates them to peer replicas on other Restate nodes. A write is committed when a quorum of replicas acknowledges the append.

Restate uses a segmented virtual log: the active segment receives appends; reconfiguration seals the active segment and atomically publishes a new segment as the head. This segmentation enables clean and fast leadership changes, placement updates, and other reconfiguration without copying data.
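The seal-and-append idea can be sketched in a few lines. All names here (`Segment`, `VirtualLog`, the placement strings) are illustrative and not Bifrost's actual types, and replication is left out entirely:

```kotlin
// Sketch of a segmented virtual log: appends go to the active segment;
// reconfiguration seals it and starts a new one without copying entries.
class Segment(val baseOffset: Long, val placement: String) {
    val entries = mutableListOf<String>()
    var isSealed = false
}

class VirtualLog(initialPlacement: String) {
    private val segments = mutableListOf(Segment(0L, initialPlacement))

    val segmentCount: Int get() = segments.size
    val activePlacement: String get() = segments.last().placement

    // Append to the active (last) segment and return the entry's global offset.
    fun append(entry: String): Long {
        val active = segments.last()
        check(!active.isSealed) { "active segment is sealed" }
        active.entries += entry
        return active.baseOffset + active.entries.size - 1
    }

    // Seal the active segment and publish a new head segment.
    fun reconfigure(newPlacement: String) {
        val old = segments.last()
        old.isSealed = true
        segments += Segment(old.baseOffset + old.entries.size, newPlacement)
    }

    // Read by global offset, whichever segment holds it.
    fun read(offset: Long): String {
        val seg = segments.last { it.baseOffset <= offset }
        return seg.entries[(offset - seg.baseOffset).toInt()]
    }
}
```

Offsets stay contiguous across a reconfiguration, and old entries remain readable where they were written. Nothing moves; only the head changes.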

All events (invocations, journal entries, state updates, durable promises) go to the log first.

Think of Bifrost like a database's transaction WAL. Or like Kafka, but with partition processors that tail the log and maintain materialized state locally in RocksDB for fast reads.

3. Partition Processor (RocksDB)

Materializes state for fast reads. Continuously tails the log and acts on events. Can always be rebuilt from log + snapshots.

High Availability

For production, Restate supports clustered deployment with Raft consensus for the metadata store and quorum-based replication for Bifrost.

Like other Raft-based systems, the cluster requires a majority of nodes to be available:

  • A 3-node cluster tolerates 1 node failure

  • A 5-node cluster tolerates 2 node failures
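The numbers follow from plain majority arithmetic. A small sketch (the function names are mine):

```kotlin
// Majority quorum arithmetic for an n-node cluster.
fun quorumSize(nodes: Int): Int = nodes / 2 + 1

// Nodes that can fail while a majority is still reachable.
fun toleratedFailures(nodes: Int): Int = nodes - quorumSize(nodes)
```

Note that a 4-node cluster needs a quorum of 3 and so still tolerates only 1 failure, which is why odd cluster sizes are the usual choice.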

Key architectural decisions that enable this:

  • Log-first architecture: Making an append log durable across regions is simpler than making a full database work across regions. Databases have complex access patterns; logs just append.

  • Push model for invocations: Unlike systems where workers pull tasks from queues, Restate pushes invocations to services. Fewer pieces to coordinate = faster failover.

  • Tight integration between log and processor: Processors use the log for leader election, avoiding coordination overhead.

For details, see the architecture reference.


Production-Ready Features

Here's where Restate differentiates itself from basic workflow engines.

Snapshots and Log Trimming

A durable log that grows forever is a liability. Restate solves this:

From the snapshot documentation:

"Processors create periodic snapshots of RocksDB and upload them to S3. On restart or takeover, a fresh partition processor can download the latest snapshot and replays the log suffix since the snapshot's sequence number."

What this means:

  • Bounded storage: Log entries are trimmed after snapshots

  • Fast recovery: New nodes download snapshot + replay only recent entries

  • No unbounded growth: Log size stays under control
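Here is how snapshot, trim, and recovery fit together, as a minimal sketch. The `Partition` class and its counter-map state are hypothetical, and the snapshot is kept in memory where Restate would upload it to object storage:

```kotlin
// Sketch: partition state is a counter map materialized from a log of (key, delta) events.
// A snapshot captures state at a log sequence number (LSN); the covered log prefix is trimmed.
data class Event(val lsn: Long, val key: String, val delta: Int)
data class Snapshot(val lsn: Long, val state: Map<String, Int>)

class Partition {
    private val log = ArrayDeque<Event>()
    private var nextLsn = 0L
    var snapshot = Snapshot(-1L, emptyMap())
        private set

    val logSize: Int get() = log.size

    fun append(key: String, delta: Int) { log.addLast(Event(nextLsn++, key, delta)) }

    // Materialize current state, store it as a snapshot, and trim the entries it covers.
    fun snapshotAndTrim() {
        snapshot = Snapshot(nextLsn - 1, recover())
        while (log.isNotEmpty() && log.first().lsn <= snapshot.lsn) log.removeFirst()
    }

    // Recovery = latest snapshot + replay of the log suffix after the snapshot's LSN.
    fun recover(): Map<String, Int> {
        val state = snapshot.state.toMutableMap()
        for (e in log) if (e.lsn > snapshot.lsn) state[e.key] = (state[e.key] ?: 0) + e.delta
        return state
    }
}
```

After `snapshotAndTrim()`, the log holds only entries newer than the snapshot, and `recover()` still reconstructs the full state.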

Configuration:

[admin]
log-trim-check-interval = "1h"  # How often to check for trimming

FaaS Suspension: Don't Pay for Wait Time

If you're running on AWS Lambda or Cloud Run, Restate can suspend your handler while it waits for external events.

import dev.restate.sdk.annotation.Shared
import dev.restate.sdk.annotation.Workflow
import dev.restate.sdk.kotlin.WorkflowContext
import dev.restate.sdk.kotlin.SharedWorkflowContext
import dev.restate.sdk.kotlin.durablePromiseKey
import dev.restate.sdk.kotlin.runBlock

@Workflow
class ApprovalWorkflow {

    companion object {
        private val APPROVAL = durablePromiseKey<ApprovalDecision>("approval")
    }

    @Workflow  // The run handler executes once per workflow ID
    suspend fun run(ctx: WorkflowContext, request: ApprovalRequest): ApprovalResult {
        // Step 1: Submit for approval (immediate)
        ctx.runBlock("submit") { submitForApproval(request) }

        // Step 2: Wait for human approval (could be hours or days)
        // On Lambda: Handler SUSPENDS here. No charges while waiting.
        val decision = ctx.promise(APPROVAL).awaitable().await()

        // Step 3: Process decision (resumes when approval arrives)
        return ctx.runBlock("process") { processDecision(decision) }
    }

    @Shared  // Can be called while run() is waiting on the promise
    suspend fun approve(ctx: SharedWorkflowContext, decision: ApprovalDecision) {
        ctx.promiseHandle(APPROVAL).resolve(decision)
    }
}

From the FaaS suspension documentation:

"Restate automatically suspends workflows when they are waiting for events or timers, and resumes them when the event occurs or the timer expires. This means you can run long-running workflows on function-as-a-service platforms without paying for the wait time."

The handler suspends, Restate holds the state, and Lambda only runs when there's actual work to do.
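Suspension is replay in disguise: the handler exits when it has nothing to do and is invoked again from the top once the promise resolves. A minimal sketch of that control flow, using hypothetical types (`Outcome`, `WorkflowState`) rather than the Restate SDK:

```kotlin
// Sketch of suspend/resume via replay. Names are illustrative, not the Restate SDK.
sealed interface Outcome
object Suspended : Outcome
data class Completed(val result: String) : Outcome

class WorkflowState {
    val journal = mutableMapOf<String, String>()  // completed steps
    var approval: String? = null                  // durable promise, unresolved while null
    var invocations = 0                           // how many times handler code actually ran
}

fun runApproval(state: WorkflowState): Outcome {
    state.invocations++
    // Step 1: journaled, so its body is skipped on every later invocation.
    state.journal.getOrPut("submit") { "submitted" }
    // Step 2: nothing to do until the promise resolves, so return instead of blocking.
    val decision = state.approval ?: return Suspended
    // Step 3: runs only after the promise has been resolved.
    return Completed(state.journal.getOrPut("process") { "processed:$decision" })
}

// Resolving the promise is what triggers the next invocation.
fun approve(state: WorkflowState, decision: String): Outcome {
    state.approval = decision
    return runApproval(state)
}
```

Between the first `runApproval` and the `approve` call, no handler code is running anywhere. All that exists is the persisted state, which is the property that makes the wait free on a pay-per-invocation platform.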

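Cost savings: For a 24-hour approval workflow, you pay for roughly 2 seconds of Lambda execution rather than for compute that sits idle across all 86,400 seconds of the wait. (A single Lambda invocation is capped at 15 minutes, so without suspension you could not hold the wait inside one invocation at all.)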

SQL-Powered Observability

Debugging distributed systems usually means correlating logs across services. Restate gives you SQL queries over your execution state, powered by Apache DataFusion for live query execution against the running system.

This isn't export-to-database observability. This is querying the runtime's internal state directly. Restate exposes your invocations, journals, and state through SQL tables you can query in real-time.

Example: Find all invocations stuck in retry backoff

SELECT id, target, last_failure, next_retry_at
FROM sys_invocation
WHERE status = 'backing-off';

This query runs against the live Restate runtime. No log aggregation. No external database. The execution state is queryable.

Additional queries:

-- Find all running invocations for a service
SELECT id, status, created_at, retry_count
FROM sys_invocation
WHERE target_service_name = 'OrderProcessor'
AND status = 'running';

-- Inspect the journal of a specific invocation
SELECT index, entry_type, name, completed
FROM sys_journal
WHERE id = 'inv_1gdJBtdVEcM942bjcDmb1c1k';

-- Check what's in the inbox for a Virtual Object
SELECT * FROM sys_inbox
WHERE target_service_key = 'order-123';

Available tables (full schema in SQL introspection docs):

  • sys_invocation: All invocations with status, timing, retry counts

  • sys_journal: Every journal entry for every invocation

  • sys_inbox: Pending invocations queued for Virtual Objects

  • sys_keyed_service_status: Virtual Object status

  • state: Application state stored in Virtual Objects

The Restate UI (bundled with the server at port 9070) provides:

  • Complete execution timeline for every step

  • Visual journal inspection

  • Service and deployment management

  • Interactive SQL console

Plus OpenTelemetry export to Jaeger, Datadog, or Langfuse for traces in your existing observability stack.

Immutable Deployments and Safe Versioning

When workflows run for hours or days, you can't just deploy new code and hope for the best. Restate handles this with immutable deployments:

# Deploy v1
restate deployments register http://order-service-v1/

# Later: Deploy v2 (new endpoint)
restate deployments register http://order-service-v2/

What happens:

  • New requests automatically route to v2

  • Existing in-flight requests continue on v1

  • v1 stays running until all its invocations complete

  • You can safely remove v1 once it's drained
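The routing rule behind this is small enough to sketch. `DeploymentRouter` is a hypothetical class for illustration, not part of Restate:

```kotlin
// Sketch of version pinning: new invocations go to the latest deployment,
// in-flight invocations stay on the deployment they started on.
class DeploymentRouter {
    private val deployments = mutableListOf<String>()
    private val pinned = mutableMapOf<String, String>()   // invocation ID -> deployment

    fun register(endpoint: String) { deployments += endpoint }

    // A new invocation is pinned to whichever deployment is latest right now.
    fun start(invocationId: String): String =
        deployments.last().also { pinned[invocationId] = it }

    // Retries and resumptions go back to the pinned deployment.
    fun resume(invocationId: String): String = pinned.getValue(invocationId)

    fun complete(invocationId: String) { pinned.remove(invocationId) }

    // A deployment is safe to remove once nothing is pinned to it.
    fun isDrained(endpoint: String): Boolean = endpoint !in pinned.values
}
```

Pinning matters because replay assumes the code that produced the journal is the code that reads it. Routing a half-finished invocation to v2 could replay a journal against steps that no longer line up.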

From the versioning documentation:

"Restate automatically routes new requests to the latest deployment. Existing requests continue on their original deployment until completion."


What Disappears

Let me be concrete about what you stop writing.

Retry Logic

Before: 30+ lines of retry with exponential backoff, circuit breakers, timeout handling

After:

val result = ctx.runBlock("external-call") {
    externalService.call()
}
// Retries with exponential backoff are automatic

Distributed Locks

Before: Redis SETNX, lock timeouts, renewal, cleanup

After: Virtual Object with @Handler annotation. Single-writer per key is guaranteed.

Idempotency

Before: Generate idempotency key, check cache, store result, handle TTL

After: The journal handles it. Same invocation ID = cached result.

Dual-Write / Outbox Pattern

Before: Database transaction + outbox table + CDC + message relay

After: State and journal are updated atomically by the runtime.


Event Sourcing: Where Do Business Events Go?

Important clarification: Restate's journal is not your business event store. The journal is an execution log designed for deterministic replay. It can be trimmed after snapshots.

For permanent business events, you have two options:

Option 1: Hybrid (Restate + Kafka)

@VirtualObject
class OrderAggregate {
    @Handler
    suspend fun create(ctx: ObjectContext, cmd: CreateOrder): String {
        val events = OrderDecider.decide(cmd, currentState).getOrElse { ... }

        // Business events go to Kafka (permanent event store)
        events.forEach { event ->
            ctx.runBlock("publish-${event::class.simpleName}") {
                eventPublisher.publish("order-events", event)
            }
        }

        // Evolve local state
        val newState = events.fold(currentState, OrderEvolver::evolve)
        ctx.set(STATE, newState)

        return "Created"
    }
}

Best for:

  • Systems with external event consumers

  • Compliance requiring permanent audit trail

  • Event-driven architectures with multiple downstream services

Option 2: Pure Restate

@VirtualObject
class OrderAggregate {
    companion object {
        // Events stored in Restate state (permanent, not journal)
        private val EVENTS = stateKey<List<OrderEvent>>("events")
    }

    @Handler
    suspend fun create(ctx: ObjectContext, cmd: CreateOrder): String {
        val currentEvents = ctx.get(EVENTS) ?: emptyList()

        // Decide on new events (pure function)
        val newEvents = OrderDecider.decide(cmd, currentEvents).getOrElse { ... }

        // Append to event log (stored in Restate state, NOT journal)
        ctx.set(EVENTS, currentEvents + newEvents)

        return "Created"
    }

    @Shared
    suspend fun get(ctx: SharedObjectContext): Order? {
        val events = ctx.get(EVENTS) ?: emptyList()
        // Rebuild from events (pure function)
        return OrderEvolver.rehydrate(events)
    }
}

Best for:

  • Self-contained services without external event consumers

  • Simpler infrastructure footprint

  • Event sourcing within an aggregate boundary

State is permanent: Virtual Object state is stored indefinitely and queryable via SELECT * FROM state WHERE service_name = 'OrderAggregate'.
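Both options lean on the same pure core: an evolve function folded over events. A self-contained sketch, with illustrative event and state types that are not taken from the companion repo:

```kotlin
// Sketch of the evolve/rehydrate half of the decider pattern.
sealed interface OrderEvent
data class OrderCreated(val orderId: String) : OrderEvent
data class ItemAdded(val sku: String, val qty: Int) : OrderEvent

data class OrderView(val orderId: String? = null, val items: Map<String, Int> = emptyMap())

// evolve: pure function (state, event) -> new state.
fun evolve(state: OrderView, event: OrderEvent): OrderView = when (event) {
    is OrderCreated -> state.copy(orderId = event.orderId)
    is ItemAdded -> {
        val newQty = (state.items[event.sku] ?: 0) + event.qty
        state.copy(items = state.items + (event.sku to newQty))
    }
}

// rehydrate: rebuild the current state by folding over the stored events.
fun rehydrate(events: List<OrderEvent>): OrderView =
    events.fold(OrderView()) { state, event -> evolve(state, event) }
```

Because `evolve` has no side effects, rebuilding state from the stored event list gives the same answer every time, whether it runs in a handler, a test, or a projection.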


Why I Chose Restate

When I started evaluating durable execution frameworks, I had a specific problem: stateful entities with high concurrency. Shopping carts being updated simultaneously. Inventory reservations racing against each other. Order processing with complex state transitions.

I'd been solving this with Redis distributed locks, PostgreSQL for state, and Kafka for events. It worked. But I was maintaining five different systems, each with its own failure modes.

What I Needed

Stateful entities with single-writer semantics: No more distributed locks. No more SETNX timeouts. No more lock contention tuning.

Simple operations: I'm on a small team. I can't run a multi-service orchestrator with dedicated DBs and worker pools. One binary, one Docker container.

FaaS-friendly: Our workloads are spiky. I wanted the option to deploy on Lambda and not pay for idle time during lulls.

Event-driven integration: We're already using Kafka. Whatever I chose needed to play nicely with our existing event-driven architecture.

What Restate Delivered

Virtual Objects eliminated distributed lock hell: Single-writer per key is guaranteed by the runtime. No Redis. No lock timeouts. No contention. Just queue the requests, process them sequentially per entity.

// Before: 30 lines of lock acquisition, timeout handling, cleanup
val lock = redisLock.acquire("order:$orderId")
try { /* business logic */ } finally { lock.release() }

// After: Runtime guarantees single-writer
@VirtualObject
class OrderAggregate {
    @Handler
    suspend fun handle(ctx: ObjectContext, cmd: OrderCommand): Order {
        // Just write business logic. Concurrency is handled.
    }
}

One binary vs. multi-service cluster: docker run restatedev/restate and I'm done. No separate database to manage. No worker deployment. No service mesh. For a small team, this operational simplicity matters.

FaaS suspension saved compute costs: When a workflow waits for approval (which could take hours), the handler suspends. Lambda only runs when there's actual work. We saw an 80% reduction in compute costs for long-running approval workflows.

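Kafka integration was straightforward: Business events go to Kafka via ctx.runBlock(). Restate journals the result of the publish operation, so a publish that completed is never repeated on replay. If the service crashes after the publish but before its result is journaled, the step runs again, so that narrow window still calls for an idempotent producer or consumer-side deduplication.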

What I Didn't Evaluate

Full disclosure: I didn't run Temporal, Step Functions, or Camunda in production.

I read their docs. I watched talks. But my decision came down to this: Virtual Objects solved my specific problem (stateful entities + concurrency) with the simplest operational footprint.

If you're evaluating durable execution, your mileage will vary. Temporal has more production deployments at scale. Step Functions has zero ops overhead. But if your pain point is stateful entities with concurrency constraints, Restate's approach is worth a serious look.

The Real Question

Don't choose based on feature checklists. Ask yourself:

  • Do you have stateful entities? (carts, sessions, accounts, inventory)

  • Are they updated concurrently? (multiple requests hitting the same entity)

  • Do you want to avoid distributed locks? (Redis SETNX, Zookeeper, etc.)

If yes to all three, that's where Virtual Objects shine. Everything else (retries, idempotency, persistence) comes along for the ride.

The Broader Landscape

There's a convergence happening between durable execution and stream processing. Restate's Bifrost, Temporal's event history, Kafka-based solutions. They all center on the same principle: the log is the source of truth.

We're seeing the emergence of a new infrastructure category. Not workflow engines. Not stream processors. Durable execution platforms that combine both.


When to Use Durable Execution

Good Fit

  • Multi-step workflows: Order processing, payment flows, onboarding sequences

  • Stateful entities: Shopping carts, user sessions, aggregate roots

  • Coordination: Distributed locks, semaphores, rate limiting per entity

  • Long-running processes: Approval workflows, scheduled tasks

  • Event processing: Stateful stream processing, aggregations

  • AI Agents: Multi-step LLM workflows with tool calls and human oversight

When a Traditional Database is Better

Restate is not a general-purpose database. Use PostgreSQL/MySQL when you need:

  • Complex queries: Multi-table joins, full-text search, analytical queries

  • Shared reference data: Product catalogs, configuration, user profiles accessed by many services

  • Long-term audit storage: Regulatory compliance requiring 7+ years of retention

The Mental Model

Restate: Where your business logic runs with built-in reliability

Database: Where your queryable business data lives for the long term

They're complementary. In production, you'll use both.


Common Pitfalls (And How to Avoid Them)

1. Don't Confuse Journal with Event Store

The mistake: Storing business events in ctx.runBlock() results, thinking they're permanent.

The reality: Journal entries are trimmed after snapshots. They're for execution replay, not business events.

The fix: Business events go in Virtual Object state or external event store (Kafka).

2. Don't Mix Blocking and Async IO

The mistake: Using async/await inside ctx.runBlock().

The reality: Inside run blocks, use blocking calls. Restate handles the async execution.

The fix:

// ❌ WRONG
ctx.runBlock("call") {
    suspendingApiCall()  // Will break determinism
}

// ✅ CORRECT
ctx.runBlock("call") {
    blockingApiCall()  // Or .await() inside the block
}

3. Don't Skip Retention Settings

The mistake: Assuming a finished workflow's result and state stay available indefinitely.

The reality: Default retention is 24 hours after completion.

The fix: Configure retention settings appropriately for your use case.

4. Don't Use Virtual Objects for Everything

The mistake: Using Virtual Objects for high read-throughput scenarios.

The reality: Virtual Objects excel at writes with concurrency control. High-volume reads should use projections/read models.

The fix:

  • Writes: Virtual Object (single-writer, consistent)

  • Reads: Projection in PostgreSQL (queryable, scalable)

5. Don't Ignore Hot Keys

The mistake: Putting all traffic through one Virtual Object key.

The reality: Each key processes sequentially. One hot key = bottleneck.

The fix: Design keys for distribution. Use composite keys if needed: region-${orderId}.
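When a single entity is genuinely hot, one option is to split it across several keys and combine the parts on read. A minimal sketch, assuming a hypothetical `shardKey` helper and a shard count you would tune yourself. It only suits commutative updates such as counters, because no single key sees every write:

```kotlin
// Sketch: spread a hot entity over several Virtual Object keys ("shards").
// shardKey is a hypothetical helper; the shard count is a tuning assumption.
// The same requestId always maps to the same shard, so retries stay on one key.
fun shardKey(entity: String, requestId: String, shards: Int): String =
    "$entity#${Math.floorMod(requestId.hashCode(), shards)}"
```

The trade-off is on the read side: getting the total means querying every shard key and summing the results.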

6. The Determinism Tax: AI Agents and LLM Calls

Special Note for AI Agent Builders: LLM calls are inherently non-deterministic. The same prompt can produce different responses across retries. When building AI agents with Restate, you must wrap all LLM calls in ctx.runBlock() to ensure the agent's state remains consistent during replays.

The mistake: Calling LLMs directly without journaling the response.

// ❌ WRONG - Non-deterministic on replay!
val response = openAI.chat(prompt)  // Different response on replay
agent.processResponse(response)     // Inconsistent state

The reality: If your agent crashes mid-execution and replays, the LLM will generate a different response. Your agent's decision tree diverges. State becomes inconsistent.

The fix: Journal the LLM response immediately.

// ✅ CORRECT - Response journaled, deterministic replay
val response = ctx.runBlock("llm-call") {
    openAI.chat(prompt)  // Executes once, result cached
}
agent.processResponse(response)  // Always uses the same response

On replay, ctx.runBlock() returns the cached LLM response. Your agent follows the exact same execution path. The state evolution is deterministic, even though the LLM itself is not.

This applies to all non-deterministic operations in AI agents:

  • LLM completions (GPT, Claude, Llama)

  • Tool calls with external APIs

  • Vector database similarity searches

  • Random sampling or temperature-based generation

Wrap them in ctx.runBlock(). Let Restate handle the determinism tax.
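To see why journaling makes replay deterministic, here is a toy model in plain Kotlin. It is not the Restate SDK: `ToyJournal`, `FakeLlm`, and `agentStep` are hypothetical names, and the journal is just an in-memory map. The sketch only shows the mechanism: the first execution records the result under a step name, and a replay returns the recorded value instead of calling the model again.

```kotlin
// Toy journal (assumption for this sketch, not Restate's implementation).
class ToyJournal {
    private val entries = mutableMapOf<String, Any?>()

    // Runs `block` the first time a step name is seen and records the result.
    // Later calls with the same name return the recorded value without re-running.
    @Suppress("UNCHECKED_CAST")
    fun <T> run(name: String, block: () -> T): T {
        if (name in entries) return entries[name] as T
        val result = block()
        entries[name] = result
        return result
    }
}

// Stand-in for a non-deterministic LLM client: every call produces a different answer.
class FakeLlm {
    var calls = 0
        private set

    fun chat(prompt: String): String {
        calls++
        return "answer #$calls to '$prompt'"
    }
}

// The "handler": everything after the journaled step depends only on the recorded response.
fun agentStep(journal: ToyJournal, llm: FakeLlm, prompt: String): String {
    val response = journal.run("llm-call") { llm.chat(prompt) }
    return response.uppercase()
}

fun main() {
    val journal = ToyJournal()
    val llm = FakeLlm()
    val first = agentStep(journal, llm, "plan the route")
    val replayed = agentStep(journal, llm, "plan the route") // simulated replay after a crash
    println(first == replayed) // true
    println(llm.calls)         // 1
}
```

A real runtime persists the journal outside the process, which is what lets the replay happen on a different machine.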


Adopting Restate: The Incremental Path

You don't need to rewrite everything. Start small:

Phase 1: Extract One Pain Point

Identify your most retry-heavy, state-complex workflow. Port it to Restate while keeping everything else unchanged.

// Before: Standalone service with tons of retry logic
// After: Restate handler (let the framework handle it)
@Service
class PaymentProcessor {
    @Handler
    suspend fun processPayment(ctx: Context, payment: Payment): Receipt {
        val validated = ctx.runBlock("validate") { validate(payment) }
        val charged = ctx.runBlock("charge") { stripe.charge(validated) }
        val receipt = ctx.runBlock("receipt") { generateReceipt(charged) }
        return receipt
    }
}

Phase 2: Add Stateful Entities

Move high-contention entities (cart, inventory) to Virtual Objects. Remove distributed locks.

Before: Redis locks + PostgreSQL state

After: Virtual Object (state + concurrency in Restate)

Phase 3: Event-Driven Integration

Connect to existing Kafka infrastructure. Restate becomes one component in your event-driven architecture.

You don't need to go all-in. Restate plays nicely with existing systems. Run it alongside your current architecture.


Try It Yourself

Quick Start (30 seconds)

# Start Restate (single command, zero dependencies)
docker run --rm -p 8080:8080 -p 9070:9070 docker.io/restatedev/restate:latest

# Open the UI
open http://localhost:9070

That's it. No database to configure. No message broker to set up. No Zookeeper ensemble. One binary.

The Simplest Possible Service

Here's a complete Restate service in Kotlin (the "Hello World" of durable execution):

import dev.restate.sdk.annotation.Handler
import dev.restate.sdk.annotation.Service
import dev.restate.sdk.kotlin.*  // Context, plus the SDK's Kotlin extension functions

@Service
class Greeter {
    @Handler
    suspend fun greet(ctx: Context, name: String): String {
        // This result is journaled. If we crash and replay,
        // we get the same greeting without re-executing.
        val greeting = ctx.runBlock("build-greeting") {
            "Hello, $name! The time is ${System.currentTimeMillis()}"
        }
        return greeting
    }
}

On replay after a crash, ctx.runBlock returns the cached result. The greeting (including the original timestamp) is identical. No duplicate side effects.

Complete Working Example

For a production-ready implementation with Event Sourcing, CQRS, and Virtual Objects:

📦 restate-oms-demo (Order Management System with):

  • Pure functional domain logic (Decider pattern)

  • Virtual Objects for aggregates

  • Spring Boot 4 + Restate 2.4.1 integration

  • Complete docker-compose setup

git clone https://github.com/ravitejalanka/restate-oms-demo.git
cd restate-oms-demo
docker-compose up -d
./gradlew :order-handlers:bootRun

Part 4 of this series walks through the implementation in detail.


What's Next

In Part 3, we'll explore why this code is so simple.

The secret is pure functions. Notice how the Restate code has no try/catch? No error handling? That's not carelessness. It's Railway-Oriented Programming with the Decider pattern.

We'll show you:

  • Why separating decide/evolve makes workflows trivial to test

  • How Arrow-KT's Either type eliminates exception-based error handling

  • Why pure domain logic means you can test business rules in milliseconds (no infrastructure needed)
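As a small preview of the decide/evolve split, here is a minimal sketch in plain Kotlin. The cart domain, the type names, and the quantity limit are all assumptions made for illustration; they are not taken from restate-oms-demo, and the sketch uses plain lists rather than Arrow's Either. `decide` turns a command and the current state into events, and `evolve` folds an event into the state. Both are pure functions.

```kotlin
// Hypothetical cart domain for illustration only.
sealed interface CartCommand
data class AddItem(val sku: String, val qty: Int) : CartCommand

sealed interface CartEvent
data class ItemAdded(val sku: String, val qty: Int) : CartEvent
data class ItemRejected(val sku: String, val reason: String) : CartEvent

data class Cart(val items: Map<String, Int> = emptyMap())

const val MAX_UNITS = 10 // assumed business rule

// decide: command + current state -> events. No I/O, no mutation.
fun decide(command: CartCommand, state: Cart): List<CartEvent> = when (command) {
    is AddItem -> when {
        command.qty <= 0 ->
            listOf(ItemRejected(command.sku, "quantity must be positive"))
        state.items.values.sum() + command.qty > MAX_UNITS ->
            listOf(ItemRejected(command.sku, "cart limit exceeded"))
        else ->
            listOf(ItemAdded(command.sku, command.qty))
    }
}

// evolve: state + event -> next state. Also pure.
fun evolve(state: Cart, event: CartEvent): Cart = when (event) {
    is ItemAdded -> {
        val newQty = (state.items[event.sku] ?: 0) + event.qty
        state.copy(items = state.items + (event.sku to newQty))
    }
    is ItemRejected -> state
}

fun main() {
    val events = decide(AddItem("sku-1", 2), Cart())
    val cart = events.fold(Cart(), ::evolve)
    println(cart) // Cart(items={sku-1=2})
}
```

Because neither function touches a database or the network, a test is just a function call and an equality check.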


Series Navigation

| Part | Title | Status |
|------|-------|--------|
| 1 | The Glue Code Tax | ✅ Complete |
| 2 | Durable Execution | 📍 You are here |
| 3 | The Functional Foundation | Next |
| 4 | OMS Demo: Complete Implementation | |
| 5 | Building Reliable AI Agents | |

The Bigger Picture

We're at an inflection point in distributed systems. For decades, we've been building the same infrastructure over and over: retry logic, idempotency, distributed locks, state machines. Every team, every service, every company.

Durable execution changes the equation. Not by making these problems easier to solve, but by making them disappear into the runtime.

The market is early. Kai Waehner notes it's "on the upward slope of the hype cycle, with immense potential for growth." But the fundamentals are solid. The log-based architecture isn't new (it's how databases have worked for decades). The innovation is exposing it as a programming model.

If you're building distributed systems today, durable execution deserves a serious look. Not as a workflow engine. As the foundation for how you build reliable software.


About the Author

Ravi Lanka is a Senior Backend Engineer building production distributed systems that handle global container logistics. He specializes in event-driven architectures, CQRS/Event Sourcing, functional programming (Kotlin/Arrow-KT), and durable execution frameworks.

🔗 Connect: GitHub | LinkedIn

📦 Code: restate-oms-demo


Building something with durable execution? Hit a wall migrating from traditional workflows? I'd love to hear about it. Drop a comment or reach out on LinkedIn.
