I Deleted 500 Lines of Infrastructure Code. Nothing Broke.

The Glue Code Tax: Why 60% of Your Microservices Code is Infrastructure


TL;DR

Most microservice codebases are 60-80% infrastructure, 20-40% business logic. We call this the "glue code tax"—and it's killing engineering velocity. This series shows you how to eliminate it.


The Audit That Changed Everything

Last month, I audited a production order service. The results surprised even me:

| Category | Lines of Code |
|---|---|
| Business logic | 100 |
| Retry patterns | 80 |
| State management | 120 |
| Idempotency handling | 100 |
| Observability | 100 |
| Total infrastructure | 500+ |

5x more infrastructure than business logic.

This isn't an outlier. It's the norm across our industry.

I call it the glue code tax—the infrastructure you write to make business logic production-ready. And after building distributed systems at Maersk and other enterprises, I've seen it everywhere.

Let me show you exactly how it accumulates.


The Promise vs Reality

Microservices promised us freedom:

✅ Deploy independently
✅ Choose your own stack
✅ Scale teams autonomously
✅ Move faster

What we actually got:

❌ Distributed state synchronization
❌ Message queue choreography
❌ Retry logic in every service
❌ 3 AM production incidents

Why the gap?

Building distributed systems means implementing distributed systems primitives in your application code. Every service needs:

  • Retry logic (exponential backoff, circuit breakers)

  • State management (transactions, locks, consistency)

  • Idempotency (deduplication, caching)

  • Message delivery (producers, consumers, correlation)

  • Observability (logging, metrics, tracing)

This is infrastructure. And you're writing it over and over.


A Simple Service Goes Wrong

Consider an order processing service with these requirements:

  1. Validate the order

  2. Reserve inventory

  3. Process payment

  4. Notify warehouse

  5. Send confirmation email

The Naive Implementation

// Clean business logic with immutable data
data class Order(
    val id: OrderId,
    val customerId: CustomerId,
    val items: List<Item>,
    val total: Money
)

class OrderService {
    fun processOrder(order: Order): OrderResult {
        val validated = validateOrder(order)
        val reservation = inventoryService.reserve(validated.items)
        val payment = paymentService.charge(validated.customerId, validated.total)
        warehouseService.createShipment(order, reservation)
        emailService.sendConfirmation(order.customerId, order)
        return OrderResult.success(order.id)
    }
}

This looks clean—good separation of concerns, immutable data classes, clear flow.

But it's not production-ready.


The Reliability Gap

There's a massive gap between writing business logic and making it production-ready.

What Can Go Wrong?

The Nightmare Scenario:

Payment succeeds → Warehouse notification fails

Result:
• Inventory: Reserved ✓
• Payment: Charged ✓  
• Warehouse: Not notified ✗

Customer charged, order never ships!

Every external call is a failure point. In distributed systems, failures are the norm, not the exception.


The Infrastructure Accumulates

Attempt 1: Add Retries

"Easy, just retry failed operations!"

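import kotlinx.coroutines.CancellationException
import kotlinx.coroutines.delay
import kotlin.math.pow

suspend fun reserveInventory(items: List<Item>): Reservation {
    val maxAttempts = 5
    for (attempt in 1..maxAttempts) {
        try {
            return inventoryService.reserve(items)
        } catch (e: CancellationException) {
            throw e // never retry coroutine cancellation
        } catch (e: Exception) {
            if (attempt == maxAttempts) throw e
            // Exponential backoff: 200 ms, 400 ms, 800 ms, 1600 ms
            delay((2.0.pow(attempt) * 100).toLong())
        }
    }
    error("unreachable: the loop either returns or throws")
}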

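New problem: wrap paymentService.charge() in the same loop and the customer can be charged up to 5 times! If a charge succeeds but its response times out, every retry calls paymentService.charge() again.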
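To make the duplicate-charge failure concrete, here is a minimal sketch. `FlakyPaymentGateway` and `chargeWithNaiveRetry` are hypothetical names invented for illustration, not part of the order service above. The fake gateway applies the charge and then loses the response, which is the case a blind retry cannot distinguish from a real failure.

```kotlin
import java.net.SocketTimeoutException

/** Hypothetical gateway: the charge goes through, but the response never arrives. */
class FlakyPaymentGateway {
    var chargesApplied = 0
        private set

    fun charge(amountCents: Long): String {
        chargesApplied++ // the money has already moved on the provider side
        throw SocketTimeoutException("no response for charge of $amountCents cents")
    }
}

/** Same shape as a blind retry loop, with the backoff delay left out for brevity. */
fun chargeWithNaiveRetry(
    gateway: FlakyPaymentGateway,
    amountCents: Long,
    maxAttempts: Int = 5
): String? {
    repeat(maxAttempts) {
        try {
            return gateway.charge(amountCents)
        } catch (e: SocketTimeoutException) {
            // The caller cannot tell whether the charge landed, so it tries again.
        }
    }
    return null // every attempt timed out
}
```

In this sketch the caller ends up with no receipt while the gateway has recorded one charge per attempt. That mismatch is what idempotency keys are meant to prevent.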

Attempt 2: Add Idempotency

"We need idempotency keys!"

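// Assumes kotlinx.serialization for JSON; Payment must be annotated @Serializable
import kotlinx.serialization.decodeFromString
import kotlinx.serialization.encodeToString
import kotlinx.serialization.json.Json

suspend fun chargePayment(order: Order): Payment {
    val idempotencyKey = "payment:${order.id}"

    // Return the stored result if this order was already charged
    val cached = redis.get(idempotencyKey)
    if (cached != null) return Json.decodeFromString<Payment>(cached)

    val payment = paymentService.charge(
        order.customerId, order.total,
        idempotencyKey = idempotencyKey
    )

    // Keep the result for 1 hour (3600 seconds)
    redis.setex(idempotencyKey, 3600, Json.encodeToString(payment))
    return payment
}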

We just added: Redis dependency, key generation, cache management, serialization, TTL decisions.

Attempt 3: Handle Crashes

"What if the process crashes between payment and warehouse notification?"

suspend fun processOrder(order: Order): OrderResult {
    // Load the persisted progress for this order, or start a new record
    val state = database.transaction {
        OrderState.findById(order.id) ?: OrderState.create(order.id)
    }

    // Run each step only if an earlier attempt did not already complete it
    // (assumes state.update mutates the record in place)
    if (state.step < Step.INVENTORY_RESERVED) {
        val reservation = reserveInventory(order.items)
        database.transaction {
            state.update(step = Step.INVENTORY_RESERVED, reservationId = reservation.id)
        }
    }

    if (state.step < Step.PAYMENT_PROCESSED) {
        val payment = chargePayment(order)
        database.transaction {
            state.update(step = Step.PAYMENT_PROCESSED, paymentId = payment.id)
        }
    }
    // Continue for each step...

    return OrderResult.success(order.id)
}

We just added: Database dependency, state machine, transaction management, step tracking.


The Full Picture

Here's what our "simple" order service now requires: 100 lines of business logic and 500+ lines of infrastructure.

This is the glue code tax.


Where the Complexity Comes From

The glue code implements distributed systems primitives:

1. Retry Logic

Every external call needs exponential backoff, maximum attempts, timeout handling, and circuit breaker state.
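As a rough illustration of how much code those four requirements imply, here is a minimal sketch of a capped exponential backoff, a retry wrapper, and a small circuit breaker. All names (`backoffMillis`, `retryWithBackoff`, `CircuitBreaker`) and default values are assumptions made for this example. It uses a blocking `sleep` parameter instead of coroutines so it runs on the standard library alone, and it leaves out timeouts and thread safety.

```kotlin
/** Delay before retry number `attempt` (1-based): base * 2^(attempt - 1), capped. */
fun backoffMillis(attempt: Int, baseMillis: Long = 100, capMillis: Long = 5_000): Long {
    require(attempt >= 1) { "attempt is 1-based" }
    val shift = (attempt - 1).coerceAtMost(20) // keep the shift small to avoid overflow
    return (baseMillis shl shift).coerceAtMost(capMillis)
}

/** Runs `block` up to `maxAttempts` times, sleeping between failed attempts. */
fun <T> retryWithBackoff(
    maxAttempts: Int = 5,
    sleep: (Long) -> Unit = { Thread.sleep(it) },
    block: (attempt: Int) -> T
): T {
    require(maxAttempts >= 1) { "need at least one attempt" }
    var lastError: Exception? = null
    for (attempt in 1..maxAttempts) {
        try {
            return block(attempt)
        } catch (e: Exception) {
            lastError = e
            if (attempt < maxAttempts) sleep(backoffMillis(attempt))
        }
    }
    throw lastError!! // set on every failed attempt, and maxAttempts >= 1
}

/** Rejects calls for `openMillis` after `failureThreshold` consecutive failures. */
class CircuitBreaker(
    private val failureThreshold: Int,
    private val openMillis: Long,
    private val clock: () -> Long = { System.currentTimeMillis() }
) {
    private var consecutiveFailures = 0
    private var openedAt: Long? = null

    fun <T> call(block: () -> T): T {
        val opened = openedAt
        if (opened != null && clock() - opened < openMillis) {
            throw IllegalStateException("circuit open, call rejected")
        }
        try {
            val result = block()
            consecutiveFailures = 0 // a success closes the circuit again
            openedAt = null
            return result
        } catch (e: Exception) {
            consecutiveFailures++
            if (consecutiveFailures >= failureThreshold) openedAt = clock()
            throw e
        }
    }
}
```

Even this stripped-down version is around 50 lines, and it lives next to the business logic rather than inside it.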

2. State Management

Distributed state requires persistence (survive crashes), transactions (atomic updates), locks or versioning (prevent races), and state machines (track progress).
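The "track progress" part can be sketched as a small resumable state machine. This is an illustration under stated assumptions, not the service's actual code: `StepJournal` is a hypothetical in-memory stand-in for a database table, and the step names past payment are guesses based on the requirements listed earlier.

```kotlin
enum class Step { STARTED, INVENTORY_RESERVED, PAYMENT_PROCESSED, WAREHOUSE_NOTIFIED, EMAIL_SENT }

/** Hypothetical stand-in for a persisted table mapping order ID to last completed step. */
class StepJournal {
    private val lastCompleted = mutableMapOf<String, Step>()

    fun load(orderId: String): Step = lastCompleted[orderId] ?: Step.STARTED

    fun save(orderId: String, step: Step) {
        lastCompleted[orderId] = step
    }
}

/**
 * Performs every step that has not been recorded yet, in declaration order.
 * If `perform` throws, the journal still holds the last completed step,
 * so a later call resumes from there instead of starting over.
 */
fun runOrder(orderId: String, journal: StepJournal, perform: (Step) -> Unit) {
    for (step in Step.values()) {
        if (step == Step.STARTED || journal.load(orderId) >= step) continue
        perform(step)
        journal.save(orderId, step)
    }
}
```

The sketch still has a gap between `perform(step)` and `journal.save(...)`: a crash in that window repeats the step on resume, which is why state tracking alone is not enough and each step also needs to be idempotent.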

3. Idempotency

Preventing duplicate operations requires unique keys, result caching, TTL decisions, and cache invalidation.
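A minimal sketch of those four pieces, assuming an in-memory map in place of Redis. `IdempotencyStore` is a hypothetical class written for this example. The operation runs at most once per key while the entry is live, and an expired entry is replaced on the next call.

```kotlin
import java.util.concurrent.ConcurrentHashMap

/** Hypothetical in-memory store; a real service would back this with Redis or a database. */
class IdempotencyStore<T>(
    private val ttlMillis: Long,
    private val clock: () -> Long = { System.currentTimeMillis() }
) {
    private data class Entry<V>(val value: V, val expiresAt: Long)

    private val entries = ConcurrentHashMap<String, Entry<T>>()

    /** Returns the stored result for `key`, or runs `operation` and stores its result. */
    fun execute(key: String, operation: () -> T): T {
        val now = clock()
        // compute() is atomic per key, so two callers cannot both run the operation.
        val entry = entries.compute(key) { _, existing ->
            if (existing != null && existing.expiresAt > now) existing
            else Entry(operation(), now + ttlMillis)
        }
        return entry!!.value
    }
}
```

Running the operation inside `compute` keeps the check and the write atomic within one process. Across several service instances the same guarantee has to come from the shared store, which is where the extra complexity comes from.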

4. Race Conditions

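Two instances handling the same order at the same time can both pass a check before either one writes its result. Preventing this requires distributed locks (complex, prone to deadlocks), optimistic locking (version numbers, retry loops), or database-level locks (SELECT FOR UPDATE).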

Every solution adds complexity.
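Of the three options, optimistic locking is the easiest to sketch in a few lines. The example below is an assumption-laden illustration: `VersionedState`, `OrderStateStore`, and `advance` are hypothetical names, and a `ConcurrentHashMap` stands in for a database row with a version column.

```kotlin
import java.util.concurrent.ConcurrentHashMap

data class VersionedState(val step: String, val version: Long)

/** Hypothetical store; in a database this would be UPDATE ... WHERE version = ?. */
class OrderStateStore {
    private val rows = ConcurrentHashMap<String, VersionedState>()

    fun read(orderId: String): VersionedState =
        rows.computeIfAbsent(orderId) { VersionedState(step = "NEW", version = 0) }

    /** Compare-and-set: succeeds only if nobody has written since `expected` was read. */
    fun tryUpdate(orderId: String, expected: VersionedState, newStep: String): Boolean =
        rows.replace(orderId, expected, VersionedState(newStep, expected.version + 1))
}

/** The retry loop that optimistic locking pushes onto every caller. */
fun advance(store: OrderStateStore, orderId: String, newStep: String, maxRetries: Int = 3): Boolean {
    repeat(maxRetries) {
        val current = store.read(orderId)
        if (current.step == newStep) return true // another writer already got there
        if (store.tryUpdate(orderId, current, newStep)) return true
    }
    return false // gave up after repeated conflicts
}
```

A writer holding a stale copy is rejected rather than silently overwriting newer state, but the price is yet another retry loop in application code.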


The Real Cost

The glue code tax isn't just about lines of code. It affects everything.

Development Velocity

| Activity | Simple Service | Distributed Service | Difference |
|---|---|---|---|
| Feature Development | 2 days | 5-7 days | +150% to +250% |
| Testing | 1 day | 3 days | +200% |
| Production Issues | 2/month | 10-15/month | +400% to +650% |

Where does the time go?

  • 60% writing glue code

  • 40% testing distributed scenarios

  • Debugging race conditions, timeouts, partial failures

The 3 AM Debugging Session

A real incident from my experience:

3:17 AM - Alert: "Order processing failing - 95% error rate"
3:20 AM - Check metrics: All services healthy, no CPU/memory spikes
3:25 AM - Check Kafka: Consumer lag building up (50,000 messages)
3:30 AM - Check logs: "Timeout connecting to Redis" every 3rd request
3:40 AM - SSH into Redis: Connection count 10,000/10,000 (MAX!)
3:45 AM - Root cause: Connection leak in error path of retry logic
3:50 AM - Deploy fix, restart services
4:30 AM - Back to normal

The pattern: Most production issues are in glue code, not business logic.

The connection leak was hidden in 80 lines of retry logic, not in the 20 lines of order validation.

Onboarding New Engineers

| Milestone | Traditional Backend | Distributed Services |
|---|---|---|
| First PR merged | 3 days | 2 weeks |
| Comfortable with codebase | 2 weeks | 6 weeks |
| Can design features | 1 month | 3 months |

The difference: In distributed systems, engineers must understand retry patterns, state machines, message queue semantics, cache strategies, circuit breakers, and distributed tracing.

That is about twice as much to learn as the business domain alone.


The Pattern

We started with a simple order service. To make it production-ready, we added:

| Requirement | Infrastructure Added |
|---|---|
| Services fail | Retry logic |
| Processes crash | State management |
| Retries duplicate | Idempotency |
| Synchronous is slow | Message queues |
| Race conditions | Distributed locks |
| Debugging is hard | Observability |

Every reliability feature adds infrastructure code.


The Core Problem

Why do we write the same infrastructure in every service?

Think about it:

  • Retry logic: Same exponential backoff pattern everywhere

  • State management: Same state machine pattern everywhere

  • Idempotency: Same caching pattern everywhere

  • Message handling: Same consumer loop everywhere

We're reinventing the wheel, badly, in every service.


The Question

What if infrastructure could handle:

  • Retries automatically?

  • State management automatically?

  • Idempotency automatically?

  • Message delivery automatically?

What if the runtime provided these primitives instead of forcing you to implement them in application code?

In Part 2, we'll explore exactly that: Durable Execution, an architecture that moves infrastructure from your code to the runtime.


Key Takeaways

  1. The glue code tax is real: 60-80% of microservices code is infrastructure

  2. It compounds: More infrastructure = more bugs = more 3 AM incidents

  3. It slows everything: Development, onboarding, debugging

  4. There's a pattern: We implement the same primitives everywhere

  5. There's a solution: Move primitives to the runtime (next post)


Series Navigation

| Part | Title | Status |
|---|---|---|
| 1 | The Glue Code Tax | 📍 You are here |
| 2 | Durable Execution: Moving Infrastructure to the Runtime | Next → |
| 3 | The Functional Foundation: Why Durable Execution Works | |
| 4 | OMS Demo: From 6 Services to 2 | |
| 5 | Building Reliable AI Agents | |

About the Author

Ravi Lanka is a Senior Backend Engineer building production distributed systems at Maersk. He specializes in event-driven architectures, functional programming (Kotlin/Arrow-KT), and durable execution frameworks.

🔗 Connect: GitHub | LinkedIn


Have you calculated your glue code tax? I'd love to hear your numbers.
Drop a comment below or DM me on LinkedIn.
