The Glue Code Tax: Why 60% of Microservices Code is Infrastructure

TL;DR

Most microservice codebases are 60-80% infrastructure, 20-40% business logic. We call this the "glue code tax"—and it's killing engineering velocity. This series shows you how to eliminate it.

The Audit That Changed Everything

Last month, I audited a production order service. The results surprised even me:

Category	Lines of Code
Business logic	100
Retry patterns	80
State management	120
Idempotency handling	100
Observability	100
Total infrastructure (including glue not itemized above)	500+

At least 5x as much infrastructure as business logic.
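To make the ratio concrete, here is a minimal sketch that turns two line counts into a "tax" figure. The `Audit` type and its property names are hypothetical, and 500 is used as a lower bound for the "500+" in the table.

```kotlin
// Hypothetical helper: how much of a codebase is infrastructure.
data class Audit(val businessLines: Int, val infrastructureLines: Int) {
    init {
        require(businessLines >= 0 && infrastructureLines >= 0) { "line counts must be non-negative" }
        require(businessLines + infrastructureLines > 0) { "audit must contain at least one line" }
    }

    // Infrastructure lines per line of business logic.
    val ratio: Double
        get() = infrastructureLines.toDouble() / businessLines

    // Infrastructure as a percentage of all code in the service.
    val taxPercent: Double
        get() = 100.0 * infrastructureLines / (businessLines + infrastructureLines)
}
```

With the audit numbers, `Audit(100, 500).ratio` is 5.0 and `taxPercent` is about 83, which puts this service at or slightly above the top of the 60-80% range.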

This isn't an outlier. It's the norm across our industry.

I call it the glue code tax—the infrastructure you write to make business logic production-ready. And after building distributed systems at Maersk and other enterprises, I've seen it everywhere.

Let me show you exactly how it accumulates.

The Promise vs Reality

Microservices promised us freedom:

✅ Deploy independently
✅ Choose your own stack
✅ Scale teams autonomously
✅ Move faster

What we actually got:

❌ Distributed state synchronization
❌ Message queue choreography
❌ Retry logic in every service
❌ 3 AM production incidents

Why the gap?

Building distributed systems means implementing distributed systems primitives in your application code. Every service needs:

Retry logic (exponential backoff, circuit breakers)
State management (transactions, locks, consistency)
Idempotency (deduplication, caching)
Message delivery (producers, consumers, correlation)
Observability (logging, metrics, tracing)

This is infrastructure. And you're writing it over and over.

A Simple Service Goes Wrong

Consider an order processing service with these requirements:

Validate the order
Reserve inventory
Process payment
Notify warehouse
Send confirmation email

The Naive Implementation

// Clean business logic with immutable data
data class Order(
    val id: OrderId,
    val customerId: CustomerId,
    val items: List<Item>,
    val total: Money
)

class OrderService {
    fun processOrder(order: Order): OrderResult {
        val validated = validateOrder(order)
        val reservation = inventoryService.reserve(validated.items)
        val payment = paymentService.charge(validated.customerId, validated.total)
        warehouseService.createShipment(order, reservation)
        emailService.sendConfirmation(order.customerId, order)
        return OrderResult.success(order.id)
    }
}

This looks clean—good separation of concerns, immutable data classes, clear flow.

But it's not production-ready.

The Reliability Gap

There's a massive gap between writing business logic and making it production-ready.

What Can Go Wrong?

The Nightmare Scenario:

Payment succeeds → Warehouse notification fails

Result:
• Inventory: Reserved ✓
• Payment: Charged ✓  
• Warehouse: Not notified ✗

Customer charged, order never ships!

Every external call is a failure point. In distributed systems, failures are the norm, not the exception.

The Infrastructure Accumulates

Attempt 1: Add Retries

"Easy, just retry failed operations!"

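import kotlin.math.pow
import kotlinx.coroutines.CancellationException
import kotlinx.coroutines.delay

private const val MAX_ATTEMPTS = 5

suspend fun reserveInventory(items: List<Item>): Reservation {
    var attempts = 0
    while (true) {
        try {
            return inventoryService.reserve(items)
        } catch (e: CancellationException) {
            throw e // never swallow coroutine cancellation
        } catch (e: Exception) {
            attempts++
            // Give up and surface the last error once the budget is spent
            if (attempts >= MAX_ATTEMPTS) throw e
            // Exponential backoff: 200 ms, 400 ms, 800 ms, 1600 ms
            delay((2.0.pow(attempts) * 100).toLong())
        }
    }
}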

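New problem: wrap paymentService.charge() in the same loop and a payment can be charged up to 5 times. If a charge succeeds but the response times out, every retry calls paymentService.charge() again.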
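The duplicate charge is easy to reproduce in isolation. The sketch below assumes a hypothetical `FlakyPaymentService` whose charge goes through while the response is lost, and a `chargeWithRetry` helper that mirrors the naive loop above (backoff omitted for brevity). Neither name comes from a real payment SDK.

```kotlin
// Hypothetical payment provider: the money moves, but the first
// `lostResponses` calls look like timeouts to the caller.
class FlakyPaymentService(private val lostResponses: Int) {
    var chargesApplied = 0
        private set
    var totalChargedCents = 0L
        private set

    fun charge(amountCents: Long): String {
        chargesApplied++
        totalChargedCents += amountCents // the charge is applied on every call
        if (chargesApplied <= lostResponses) {
            throw java.io.IOException("timed out waiting for response")
        }
        return "payment-$chargesApplied"
    }
}

// Naive retry: it cannot tell "never charged" apart from "charged, response lost".
fun chargeWithRetry(service: FlakyPaymentService, amountCents: Long, maxAttempts: Int = 5): String {
    var lastError: Exception? = null
    repeat(maxAttempts) {
        try {
            return service.charge(amountCents)
        } catch (e: Exception) {
            lastError = e
        }
    }
    throw lastError ?: IllegalStateException("maxAttempts must be positive")
}
```

With `FlakyPaymentService(lostResponses = 2)`, one call to `chargeWithRetry(service, 4_999)` returns successfully, yet `service.chargesApplied` is 3: the customer has paid three times for one order.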

Attempt 2: Add Idempotency

"We need idempotency keys!"

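import kotlinx.serialization.decodeFromString
import kotlinx.serialization.encodeToString
import kotlinx.serialization.json.Json

// Assumes Payment is annotated @Serializable and `redis` is a Jedis-style client
suspend fun chargePayment(order: Order): Payment {
    val idempotencyKey = "payment:${order.id}"

    // Replay the stored result if this order was already charged
    val cached = redis.get(idempotencyKey)
    if (cached != null) return Json.decodeFromString<Payment>(cached)

    val payment = paymentService.charge(
        order.customerId, order.total,
        idempotencyKey = idempotencyKey
    )

    // Keep the result for one hour (TTL is in seconds)
    redis.setex(idempotencyKey, 3600, Json.encodeToString(payment))
    return payment
}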

We just added: Redis dependency, key generation, cache management, serialization, TTL decisions.

Attempt 3: Handle Crashes

"What if the process crashes between payment and warehouse notification?"

suspend fun processOrder(order: Order): OrderResult {
    // Load the persisted progress for this order, or start a new record
    val state = database.transaction {
        OrderState.findById(order.id) ?: OrderState.create(order.id)
    }

    // Each step runs only if it has not been recorded as done
    if (state.step < Step.INVENTORY_RESERVED) {
        val reservation = reserveInventory(order.items)
        database.transaction {
            state.update(step = Step.INVENTORY_RESERVED, reservationId = reservation.id)
        }
    }

    if (state.step < Step.PAYMENT_PROCESSED) {
        val payment = chargePayment(order)
        database.transaction {
            state.update(step = Step.PAYMENT_PROCESSED, paymentId = payment.id)
        }
    }
    // Continue for each step...

    return OrderResult.success(order.id)
}

We just added: Database dependency, state machine, transaction management, step tracking.

The Full Picture

Our "simple" order service now needs retry patterns, state management, idempotency handling, and observability wrapped around its business logic.

The ratio: 100 lines of business logic, 500+ lines of infrastructure.

This is the glue code tax.

Where the Complexity Comes From

The glue code implements distributed systems primitives:

1. Retry Logic

Every external call needs exponential backoff, maximum attempts, timeout handling, and circuit breaker state.
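As a rough illustration of how much policy hides in that sentence, here is a minimal sketch of a reusable retry helper. `RetryPolicy` and `retrying` are hypothetical names, the defaults are illustrative, and the helper wraps blocking calls rather than suspend functions. Timeouts and circuit breaker state are left out, so a real implementation would be longer still.

```kotlin
// Hypothetical policy type; names and defaults are illustrative.
data class RetryPolicy(
    val maxAttempts: Int = 5,
    val baseDelayMs: Long = 100,
    val maxDelayMs: Long = 5_000
) {
    // Delay before the given retry (1-based): base * 2^retry, capped at maxDelayMs.
    fun delayFor(retry: Int): Long = minOf(maxDelayMs, baseDelayMs shl retry)
}

fun <T> retrying(
    policy: RetryPolicy = RetryPolicy(),
    sleep: (Long) -> Unit = { Thread.sleep(it) }, // injectable so tests do not wait
    block: () -> T
): T {
    require(policy.maxAttempts >= 1) { "maxAttempts must be at least 1" }
    var failures = 0
    while (true) {
        try {
            return block()
        } catch (e: Exception) {
            failures++
            // Out of attempts: surface the last error to the caller
            if (failures >= policy.maxAttempts) throw e
            sleep(policy.delayFor(failures))
        }
    }
}
```

Injecting `sleep` is a deliberate choice: it keeps the backoff schedule testable without waiting on a real clock.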

2. State Management

Distributed state requires persistence (survive crashes), transactions (atomic updates), locks or versioning (prevent races), and state machines (track progress).
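The step-tracking part can be sketched as a small resumable workflow. Everything here is hypothetical: `InMemoryStateStore` stands in for a database table updated in a transaction, and `OrderWorkflow` takes the side effect for each step as a plain function.

```kotlin
enum class Step { STARTED, INVENTORY_RESERVED, PAYMENT_PROCESSED, WAREHOUSE_NOTIFIED }

// Hypothetical store; a real one would persist to a database.
class InMemoryStateStore {
    private val steps = mutableMapOf<String, Step>()
    fun load(orderId: String): Step = steps[orderId] ?: Step.STARTED
    fun save(orderId: String, step: Step) { steps[orderId] = step }
}

class OrderWorkflow(
    private val store: InMemoryStateStore,
    // Maps each step to the side effect that completes it.
    private val actions: Map<Step, (String) -> Unit>
) {
    // Runs every step after the last persisted one, saving progress after each.
    fun run(orderId: String) {
        val done = store.load(orderId)
        for (step in Step.values()) {
            if (step <= done) continue // already recorded, skip on resume
            actions.getValue(step)(orderId)
            store.save(orderId, step)
        }
    }
}
```

If a step throws, the store still holds the last completed step, so calling `run` again resumes from there. Note the remaining gap: a crash after an action but before `save` repeats that action on resume, which is why state tracking alone does not remove the need for idempotency.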

3. Idempotency

Preventing duplicate operations requires unique keys, result caching, TTL decisions, and cache invalidation.
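A minimal in-memory sketch of those pieces, assuming a hypothetical `IdempotencyStore` in place of Redis. The clock is injected so expiry can be exercised without waiting, and the whole call is synchronized for simplicity. That is a shortcut a shared, multi-instance cache cannot take.

```kotlin
// Hypothetical in-memory stand-in for a shared cache such as Redis.
class IdempotencyStore<T : Any>(
    private val ttlMs: Long,
    private val now: () -> Long = System::currentTimeMillis
) {
    private data class Entry<V>(val value: V, val expiresAt: Long)

    private val entries = mutableMapOf<String, Entry<T>>()

    // Runs `operation` at most once per key within the TTL and replays the stored result.
    @Synchronized
    fun execute(key: String, operation: () -> T): T {
        val cached = entries[key]
        if (cached != null && cached.expiresAt > now()) return cached.value
        val result = operation()
        entries[key] = Entry(result, now() + ttlMs)
        return result
    }
}
```

The TTL decision shows up immediately: once an entry expires, the same key runs the operation again, so the TTL has to outlive the longest retry window.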

4. Race Conditions

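When two workers handle the same order at the same time, their state updates can interleave and overwrite each other. Preventing that requires distributed locks (complex and deadlock-prone), optimistic locking (version numbers, retry loops), or database-level locks (SELECT ... FOR UPDATE).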
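Of the three options, optimistic locking is the easiest to show in a few lines. This is a sketch under assumptions: `OptimisticStore` is a hypothetical single-record store in which an `AtomicReference` plays the role of an `UPDATE ... WHERE version = ?` statement.

```kotlin
import java.util.concurrent.atomic.AtomicReference

// A value plus the version it was read at, like a row with a version column.
data class Versioned<T>(val value: T, val version: Long)

// Hypothetical single-record store standing in for a versioned database row.
class OptimisticStore<T>(initial: T) {
    private val ref = AtomicReference(Versioned(initial, 0L))

    fun read(): Versioned<T> = ref.get()

    // Succeeds only if nobody has written since `expected` was read.
    // `expected` must be the instance returned by read(); the check is by reference.
    fun compareAndWrite(expected: Versioned<T>, newValue: T): Boolean =
        ref.compareAndSet(expected, Versioned(newValue, expected.version + 1))

    // Read-modify-write with a bounded retry loop.
    fun update(maxRetries: Int = 10, transform: (T) -> T): Versioned<T> {
        repeat(maxRetries) {
            val current = ref.get()
            val next = Versioned(transform(current.value), current.version + 1)
            if (ref.compareAndSet(current, next)) return next
        }
        throw IllegalStateException("gave up after $maxRetries conflicting writes")
    }
}
```

A writer holding a stale read gets `false` back from `compareAndWrite` instead of silently overwriting newer data. The price is the retry loop, which is one more piece of glue code.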

Every solution adds complexity.

The Real Cost

The glue code tax isn't just about lines of code. It affects everything.

Development Velocity

Activity	Simple Service	Distributed Service	Difference
Feature Development	2 days	5-7 days	+150-250%
Testing	1 day	3 days	+200%
Production Issues	2/month	10-15/month	+400-650%

Where does the time go?

About 60% on writing glue code
About 40% on testing distributed scenarios
On top of both: debugging race conditions, timeouts, and partial failures

The 3 AM Debugging Session

A real incident from my experience:

3:17 AM - Alert: "Order processing failing - 95% error rate"
3:20 AM - Check metrics: All services healthy, no CPU/memory spikes
3:25 AM - Check Kafka: Consumer lag building up (50,000 messages)
3:30 AM - Check logs: "Timeout connecting to Redis" every 3rd request
3:40 AM - SSH into Redis: Connection count 10,000/10,000 (MAX!)
3:45 AM - Root cause: Connection leak in error path of retry logic
3:50 AM - Deploy fix, restart services
4:30 AM - Back to normal

The pattern: Most production issues are in glue code, not business logic.

The connection leak was hidden in 80 lines of retry logic, not in the 20 lines of order validation.
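The incident's code is not reproduced here, but the general shape of an error-path leak can be sketched. `ConnectionPool`, `Connection`, `leakyCall`, and `safeCall` are all hypothetical names for illustration, not the API of any real Redis client.

```kotlin
import java.io.Closeable

// Hypothetical fixed-size pool; real clients expose their own pooling APIs.
class ConnectionPool(private val capacity: Int) {
    var inUse = 0
        private set

    fun acquire(): Connection {
        check(inUse < capacity) { "pool exhausted: $inUse/$capacity" }
        inUse++
        return Connection(this)
    }

    internal fun release() { inUse-- }
}

class Connection(private val pool: ConnectionPool) : Closeable {
    private var closed = false

    override fun close() {
        if (!closed) {
            closed = true
            pool.release()
        }
    }
}

// Leaky: the connection is returned only on the success path.
fun <T> leakyCall(pool: ConnectionPool, block: (Connection) -> T): T {
    val conn = pool.acquire()
    val result = block(conn) // an exception here skips close()
    conn.close()
    return result
}

// Safe: `use` closes the connection whether or not the block throws.
fun <T> safeCall(pool: ConnectionPool, block: (Connection) -> T): T =
    pool.acquire().use(block)
```

Put `leakyCall` inside a retry loop and every failed attempt strands one connection, so a burst of timeouts drains the pool. With `safeCall`, failed attempts return their connections and `inUse` falls back to zero.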

Onboarding New Engineers

Milestone	Traditional Backend	Distributed Services
First PR merged	3 days	2 weeks
Comfortable with codebase	2 weeks	6 weeks
Can design features	1 month	3 months

The difference: In distributed systems, engineers must understand retry patterns, state machines, message queue semantics, cache strategies, circuit breakers, and distributed tracing.

That is roughly twice as much to learn as the business domain alone.

The Pattern

We started with a simple order service. To make it production-ready, we added:

Problem	Infrastructure Added
Services fail	Retry logic
Processes crash	State management
Retries duplicate work	Idempotency
Synchronous calls are slow	Message queues
Concurrent updates race	Distributed locks
Debugging is hard	Observability

Every reliability feature adds infrastructure code.

The Core Problem

Why do we write the same infrastructure in every service?

Think about it:

Retry logic: Same exponential backoff pattern everywhere
State management: Same state machine pattern everywhere
Idempotency: Same caching pattern everywhere
Message handling: Same consumer loop everywhere

We're reinventing the wheel, badly, in every service.

The Question

What if infrastructure could handle:

Retries automatically?
State management automatically?
Idempotency automatically?
Message delivery automatically?

What if the runtime provided these primitives instead of forcing you to implement them in application code?

In Part 2, we'll explore exactly that: Durable Execution, an architecture that moves infrastructure from your code to the runtime.

Key Takeaways

The glue code tax is real: 60-80% of microservices code is infrastructure
It compounds: More infrastructure = more bugs = more 3 AM incidents
It slows everything: Development, onboarding, debugging
There's a pattern: We implement the same primitives everywhere
There's a solution: Move primitives to the runtime (next post)

Series Navigation

Part	Title	Status
1	The Glue Code Tax	📍 You are here
2	Durable Execution: Moving Infrastructure to the Runtime	Next →
3	The Functional Foundation: Why Durable Execution Works
4	OMS Demo: From 6 Services to 2
5	Building Reliable AI Agents

About the Author

Ravi Lanka is a Senior Backend Engineer building production distributed systems at Maersk. He specializes in event-driven architectures, functional programming (Kotlin/Arrow-KT), and durable execution frameworks.

🔗 Connect: GitHub | LinkedIn

Have you calculated your glue code tax? I'd love to hear your numbers.
Drop a comment below or DM me on LinkedIn.
