I Deleted 500 Lines of Infrastructure Code. Nothing Broke.
The Glue Code Tax: Why 60% of Your Microservices Code is Infrastructure

TL;DR
Most microservice codebases are 60-80% infrastructure, 20-40% business logic. We call this the "glue code tax"—and it's killing engineering velocity. This series shows you how to eliminate it.
The Audit That Changed Everything
Last month, I audited a production order service. The results surprised even me:
| Category | Lines of Code |
|---|---|
| Business logic | 100 |
| Retry patterns | 80 |
| State management | 120 |
| Idempotency handling | 100 |
| Observability | 100 |
| Total infrastructure | 500+ |
5x more infrastructure than business logic.
This isn't an outlier. It's the norm across our industry.
I call it the glue code tax—the infrastructure you write to make business logic production-ready. And after building distributed systems at Maersk and other enterprises, I've seen it everywhere.
Let me show you exactly how it accumulates.
The Promise vs Reality
Microservices promised us freedom:
✅ Deploy independently
✅ Choose your own stack
✅ Scale teams autonomously
✅ Move faster
What we actually got:
❌ Distributed state synchronization
❌ Message queue choreography
❌ Retry logic in every service
❌ 3 AM production incidents
Why the gap?
Building distributed systems means implementing distributed systems primitives in your application code. Every service needs:
• Retry logic (exponential backoff, circuit breakers)
• State management (transactions, locks, consistency)
• Idempotency (deduplication, caching)
• Message delivery (producers, consumers, correlation)
• Observability (logging, metrics, tracing)
This is infrastructure. And you're writing it over and over.
A Simple Service Goes Wrong
Consider an order processing service with these requirements:
1. Validate the order
2. Reserve inventory
3. Process payment
4. Notify warehouse
5. Send confirmation email
The Naive Implementation
// Clean business logic with immutable data
data class Order(
    val id: OrderId,
    val customerId: CustomerId,
    val items: List<Item>,
    val total: Money
)

class OrderService {
    fun processOrder(order: Order): OrderResult {
        val validated = validateOrder(order)
        val reservation = inventoryService.reserve(validated.items)
        val payment = paymentService.charge(validated.customerId, validated.total)
        warehouseService.createShipment(order, reservation)
        emailService.sendConfirmation(order.customerId, order)
        return OrderResult.success(order.id)
    }
}
This looks clean—good separation of concerns, immutable data classes, clear flow.
But it's not production-ready.
The Reliability Gap
There's a massive gap between writing business logic and making it production-ready.

What Can Go Wrong?
The Nightmare Scenario:
Payment succeeds → Warehouse notification fails
Result:
• Inventory: Reserved ✓
• Payment: Charged ✓
• Warehouse: Not notified ✗
Customer charged, order never ships!
Every external call is a failure point. In distributed systems, failures are the norm, not the exception.
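The scenario above can be reproduced with in-memory stand-ins. This is a minimal sketch, not the audited service: `FakeServices`, its flags, and `processOrderNaively` are hypothetical names invented for the illustration.

```kotlin
// Hypothetical in-memory stand-ins for the external services (assumptions, not real APIs).
class FakeServices(private val warehouseIsDown: Boolean) {
    var inventoryReserved = false
        private set
    var paymentCharged = false
        private set
    var warehouseNotified = false
        private set

    fun reserveInventory() { inventoryReserved = true }
    fun chargePayment() { paymentCharged = true }
    fun notifyWarehouse() {
        if (warehouseIsDown) throw IllegalStateException("warehouse unavailable")
        warehouseNotified = true
    }
}

// Same shape as the naive flow: sequential calls with no recovery.
fun processOrderNaively(services: FakeServices): Boolean =
    try {
        services.reserveInventory()
        services.chargePayment()
        services.notifyWarehouse()
        true
    } catch (e: IllegalStateException) {
        false // the earlier side effects already happened and are not undone
    }
```

Calling `processOrderNaively(FakeServices(warehouseIsDown = true))` returns `false`, yet the inventory and payment flags stay set: the customer is charged and nothing ships.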
The Infrastructure Accumulates
Attempt 1: Add Retries
"Easy, just retry failed operations!"
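import kotlin.math.pow
import kotlinx.coroutines.CancellationException
import kotlinx.coroutines.delay

suspend fun reserveInventory(items: List<Item>): Reservation {
    val maxAttempts = 5
    var attempts = 0
    while (true) {
        try {
            return inventoryService.reserve(items)
        } catch (e: CancellationException) {
            throw e // never retry coroutine cancellation
        } catch (e: Exception) {
            attempts++
            if (attempts >= maxAttempts) throw e
            // Exponential backoff: 200 ms, 400 ms, 800 ms, 1600 ms
            delay((2.0.pow(attempts) * 100).toLong())
        }
    }
}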
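New problem: wrap paymentService.charge() in the same loop and a customer can be charged up to 5 times. If a charge succeeds but the response times out, each retry calls paymentService.charge() again.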
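A small simulation makes the duplicate charge visible. It is a sketch under stated assumptions: `FlakyPaymentGateway` is a hypothetical gateway that applies each charge but loses the first few responses, and `retryBlindly` is a simplified retry loop without backoff.

```kotlin
import java.io.IOException

// Hypothetical gateway: every call applies the charge, but the first
// `lostResponses` replies never reach the caller (simulated timeout).
class FlakyPaymentGateway(private var lostResponses: Int) {
    var chargesApplied = 0
        private set
    var totalChargedCents = 0L
        private set

    fun charge(amountCents: Long): String {
        chargesApplied++                 // the side effect happens first
        totalChargedCents += amountCents
        if (lostResponses > 0) {
            lostResponses--
            throw IOException("timed out waiting for response")
        }
        return "payment-$chargesApplied"
    }
}

// Plain retry loop, kept short for the demonstration (no backoff).
fun <T> retryBlindly(maxAttempts: Int, call: () -> T): T {
    var lastError: IOException? = null
    repeat(maxAttempts) {
        try {
            return call()
        } catch (e: IOException) {
            lastError = e
        }
    }
    throw lastError ?: IllegalArgumentException("maxAttempts must be positive")
}
```

With `FlakyPaymentGateway(lostResponses = 2)`, one call to `retryBlindly(5) { gateway.charge(1_000) }` succeeds from the caller's point of view, but the gateway has applied three charges.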
Attempt 2: Add Idempotency
"We need idempotency keys!"
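import kotlinx.serialization.decodeFromString
import kotlinx.serialization.encodeToString
import kotlinx.serialization.json.Json

suspend fun chargePayment(order: Order): Payment {
    val idempotencyKey = "payment:${order.id}"
    // Reuse the stored result if this order was already charged (assumes Payment is @Serializable)
    val cached = redis.get(idempotencyKey)
    if (cached != null) return Json.decodeFromString<Payment>(cached)
    val payment = paymentService.charge(
        order.customerId, order.total,
        idempotencyKey = idempotencyKey
    )
    // Keep the result for one hour
    redis.setex(idempotencyKey, 3600, Json.encodeToString(payment))
    return payment
}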
We just added: Redis dependency, key generation, cache management, serialization, TTL decisions.
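The same check-then-store pattern can be sketched without Redis to show what the TTL decision implies. `IdempotencyCache` and its injectable `now` clock are hypothetical; a real Redis instance expires keys on the server side.

```kotlin
// In-memory stand-in for the Redis cache (hypothetical helper, not a library API).
class IdempotencyCache<V>(
    private val ttlMillis: Long,
    private val now: () -> Long = System::currentTimeMillis,
) {
    private data class Entry<V>(val value: V, val expiresAt: Long)

    private val entries = HashMap<String, Entry<V>>()

    // Returns the cached result for `key`, or runs `operation` once and caches it.
    // Not thread-safe: two concurrent callers can both miss and both run `operation`.
    fun getOrRun(key: String, operation: () -> V): V {
        val cached = entries[key]
        if (cached != null && cached.expiresAt > now()) return cached.value
        val result = operation()
        entries[key] = Entry(result, now() + ttlMillis)
        return result
    }
}
```

Within the TTL, a second `getOrRun("payment:42") { ... }` returns the stored result without running the operation. Once the TTL has elapsed, the same key runs the operation again, so the TTL has to outlive the longest plausible retry window.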
Attempt 3: Handle Crashes
"What if the process crashes between payment and warehouse notification?"
suspend fun processOrder(order: Order): OrderResult {
    // Load the persisted progress for this order, or start fresh
    val state = database.transaction {
        OrderState.findById(order.id) ?: OrderState.create(order.id)
    }

    // Steps that completed before a crash are skipped
    if (state.step < Step.INVENTORY_RESERVED) {
        val reservation = reserveInventory(order.items)
        database.transaction {
            state.update(step = Step.INVENTORY_RESERVED, reservationId = reservation.id)
        }
    }

    if (state.step < Step.PAYMENT_PROCESSED) {
        val payment = chargePayment(order)
        database.transaction {
            state.update(step = Step.PAYMENT_PROCESSED, paymentId = payment.id)
        }
    }

    // Continue for each step...
    return OrderResult.success(order.id)
}
We just added: Database dependency, state machine, transaction management, step tracking.
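The resume behaviour can be checked in isolation with an in-memory store. This is a sketch only: `InMemoryStateStore`, `processResumably`, and the `crashAfterInventory` flag are hypothetical, and the side effects are reduced to log entries.

```kotlin
enum class Step { STARTED, INVENTORY_RESERVED, PAYMENT_PROCESSED }

// Hypothetical stand-in for the order state table: order id -> last completed step.
class InMemoryStateStore {
    private val steps = HashMap<String, Step>()
    fun load(orderId: String): Step = steps[orderId] ?: Step.STARTED
    fun save(orderId: String, step: Step) { steps[orderId] = step }
}

// Each step runs only if its checkpoint has not been recorded yet.
// `log` records which side effects ran; `crashAfterInventory` simulates a crash.
fun processResumably(
    orderId: String,
    store: InMemoryStateStore,
    log: MutableList<String>,
    crashAfterInventory: Boolean = false,
) {
    if (store.load(orderId) < Step.INVENTORY_RESERVED) {
        log += "reserve"
        store.save(orderId, Step.INVENTORY_RESERVED)
    }
    if (crashAfterInventory) throw IllegalStateException("simulated crash")
    if (store.load(orderId) < Step.PAYMENT_PROCESSED) {
        log += "charge"
        store.save(orderId, Step.PAYMENT_PROCESSED)
    }
}
```

Running it once with `crashAfterInventory = true` and then again without the flag leaves the log at `reserve, charge`: the reservation is not repeated. A crash between a side effect and its `save` would still repeat that step, which is why checkpoints do not remove the need for idempotency.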
The Full Picture
Our "simple" order service now carries retry loops, idempotency keys, and persisted step state around the original five calls.
The ratio: 100 lines of business logic, 500+ lines of infrastructure.
This is the glue code tax.
Where the Complexity Comes From
The glue code implements distributed systems primitives:
1. Retry Logic
Every external call needs exponential backoff, maximum attempts, timeout handling, and circuit breaker state.
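Of these pieces, circuit breaker state is the least obvious, so here is a minimal sketch of one. `CircuitBreaker` is a hypothetical class written for this post, not a library API, and it is single-threaded for brevity.

```kotlin
// Opens after `failureThreshold` consecutive failures and rejects calls until
// `openMillis` have passed. After that, calls go through again; another failure
// re-opens the circuit, a success closes it.
class CircuitBreaker(
    private val failureThreshold: Int,
    private val openMillis: Long,
    private val now: () -> Long = System::currentTimeMillis,
) {
    private var consecutiveFailures = 0
    private var openedAt: Long? = null

    fun <T> call(operation: () -> T): T {
        val opened = openedAt
        if (opened != null && now() - opened < openMillis) {
            throw IllegalStateException("circuit open, call rejected")
        }
        try {
            val result = operation()
            consecutiveFailures = 0
            openedAt = null
            return result
        } catch (e: Exception) {
            consecutiveFailures++
            if (consecutiveFailures >= failureThreshold) openedAt = now()
            throw e
        }
    }
}
```

While the circuit is open, the wrapped operation is not invoked at all, which is what protects a struggling downstream service from retry storms.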
2. State Management
Distributed state requires persistence (survive crashes), transactions (atomic updates), locks or versioning (prevent races), and state machines (track progress).
3. Idempotency
Preventing duplicate operations requires unique keys, result caching, TTL decisions, and cache invalidation.
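4. Race Conditions
Two instances handling the same order at the same time can both read the same state and both act on it.
Preventing this requires distributed locks (complex, deadlock-prone), optimistic locking (version numbers, retry loops), or database-level locks (SELECT FOR UPDATE).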
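Optimistic locking is the easiest of the three to sketch without a database. The following assumes a hypothetical `VersionedCell` that plays the role of a row with a version column; `tryWrite` corresponds to an update that only succeeds while the version is unchanged.

```kotlin
import java.util.concurrent.atomic.AtomicReference

// A value plus the version it was read at (hypothetical model of a versioned row).
data class Versioned<T>(val value: T, val version: Long)

// The write succeeds only if nobody else has written since `expected` was read.
// compareAndSet checks reference identity, so `expected` must come from read().
class VersionedCell<T>(initial: T) {
    private val ref = AtomicReference(Versioned(initial, 0L))

    fun read(): Versioned<T> = ref.get()

    fun tryWrite(expected: Versioned<T>, newValue: T): Boolean =
        ref.compareAndSet(expected, Versioned(newValue, expected.version + 1))
}

// Retry loop: re-read and re-apply `transform` until the write wins.
fun <T> updateOptimistically(cell: VersionedCell<T>, transform: (T) -> T): T {
    while (true) {
        val current = cell.read()
        val next = transform(current.value)
        if (cell.tryWrite(current, next)) return next
    }
}
```

A writer holding a stale read gets `false` from `tryWrite` and has to re-read, which is exactly the retry loop that ends up in application code.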
Every solution adds complexity.
The Real Cost
The glue code tax isn't just about lines of code. It affects everything.
Development Velocity
| Activity | Simple Service | Distributed Service | Difference |
|---|---|---|---|
| Feature Development | 2 days | 5-7 days | +150-250% |
| Testing | 1 day | 3 days | +200% |
| Production Issues | 2/month | 10-15/month | +400-650% |
Where does the time go?
• 60% writing glue code
• 40% testing distributed scenarios
• Debugging race conditions, timeouts, partial failures
The 3 AM Debugging Session
A real incident from my experience:
3:17 AM - Alert: "Order processing failing - 95% error rate"
3:20 AM - Check metrics: All services healthy, no CPU/memory spikes
3:25 AM - Check Kafka: Consumer lag building up (50,000 messages)
3:30 AM - Check logs: "Timeout connecting to Redis" every 3rd request
3:40 AM - SSH into Redis: Connection count 10,000/10,000 (MAX!)
3:45 AM - Root cause: Connection leak in error path of retry logic
3:50 AM - Deploy fix, restart services
4:30 AM - Back to normal
The pattern: Most production issues are in glue code, not business logic.
The connection leak was hidden in 80 lines of retry logic, not in the 20 lines of order validation.
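The incident's code is not reproduced here, so the following only illustrates how this class of leak commonly arises. `CountingPool`, `callLeaky`, and `callSafely` are hypothetical names; the pool just counts checked-out connections.

```kotlin
// Hypothetical connection pool that only counts checked-out connections.
class CountingPool(private val capacity: Int) {
    var inUse = 0
        private set

    fun acquire() {
        check(inUse < capacity) { "pool exhausted: $inUse/$capacity" }
        inUse++
    }

    fun release() { inUse-- }
}

// Leaky: when `operation` throws, release() is skipped, so every failed
// attempt keeps a connection checked out.
fun <T> callLeaky(pool: CountingPool, operation: () -> T): T {
    pool.acquire()
    val result = operation()
    pool.release()
    return result
}

// Fixed: finally runs on both the success path and the error path.
fun <T> callSafely(pool: CountingPool, operation: () -> T): T {
    pool.acquire()
    try {
        return operation()
    } finally {
        pool.release()
    }
}
```

Put `callLeaky` inside a retry loop and each failed attempt consumes one more connection, so a burst of errors exhausts the pool even though the happy path looks correct in review.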
Onboarding New Engineers
| Milestone | Traditional Backend | Distributed Services |
|---|---|---|
| First PR merged | 3 days | 2 weeks |
| Comfortable with codebase | 2 weeks | 6 weeks |
| Can design features | 1 month | 3 months |
The difference: In distributed systems, engineers must understand retry patterns, state machines, message queue semantics, cache strategies, circuit breakers, and distributed tracing.
All of that comes on top of the business domain itself, roughly doubling what a new engineer has to learn.
The Pattern
We started with a simple order service. To make it production-ready, we added:
| Requirement | Infrastructure Added |
|---|---|
| Services fail | Retry logic |
| Processes crash | State management |
| Retries duplicate | Idempotency |
| Synchronous is slow | Message queues |
| Race conditions | Distributed locks |
| Debugging is hard | Observability |
Every reliability feature adds infrastructure code.
The Core Problem
Why do we write the same infrastructure in every service?
Think about it:
• Retry logic: Same exponential backoff pattern everywhere
• State management: Same state machine pattern everywhere
• Idempotency: Same caching pattern everywhere
• Message handling: Same consumer loop everywhere
We're reinventing the wheel, badly, in every service.
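The repetition is easy to see once the backoff pattern is written a single time as a higher-order function. This is a blocking sketch under assumptions: `withRetry` and `backoffMillis` are hypothetical helpers, and the 100 ms base and 5 s cap are illustrative values, not recommendations.

```kotlin
// Delay before retry number `attempt` (1-based): base * 2^(attempt - 1), capped.
// The shift is clamped so large attempt numbers cannot overflow.
fun backoffMillis(attempt: Int, baseMillis: Long = 100, capMillis: Long = 5_000): Long =
    minOf(capMillis, baseMillis shl (attempt - 1).coerceAtMost(20))

// Generic retry wrapper. `sleep` is injectable so the schedule can be
// tested without actually waiting.
fun <T> withRetry(
    maxAttempts: Int,
    sleep: (Long) -> Unit = { Thread.sleep(it) },
    operation: () -> T,
): T {
    require(maxAttempts >= 1) { "maxAttempts must be at least 1" }
    var attempt = 1
    while (true) {
        try {
            return operation()
        } catch (e: Exception) {
            if (attempt == maxAttempts) throw e
            sleep(backoffMillis(attempt))
            attempt++
        }
    }
}
```

A call site then shrinks to `withRetry(maxAttempts = 5) { inventoryService.reserve(items) }`. The helper removes the duplication but not the responsibility: the application still owns the retry policy, and the wrapped operation still has to be idempotent.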
The Question
What if infrastructure could handle:
• Retries automatically?
• State management automatically?
• Idempotency automatically?
• Message delivery automatically?
What if the runtime provided these primitives instead of forcing you to implement them in application code?
In Part 2, we'll explore exactly that: Durable Execution, an architecture that moves infrastructure from your code to the runtime.
Key Takeaways
• The glue code tax is real: 60-80% of microservices code is infrastructure
• It compounds: More infrastructure = more bugs = more 3 AM incidents
• It slows everything: Development, onboarding, debugging
• There's a pattern: We implement the same primitives everywhere
• There's a solution: Move primitives to the runtime (next post)
Series Navigation
| Part | Title | Status |
|---|---|---|
| 1 | The Glue Code Tax | 📍 You are here |
| 2 | Durable Execution: Moving Infrastructure to the Runtime | Next → |
| 3 | The Functional Foundation: Why Durable Execution Works | |
| 4 | OMS Demo: From 6 Services to 2 | |
| 5 | Building Reliable AI Agents | |
About the Author
Ravi Lanka is a Senior Backend Engineer building production distributed systems at Maersk. He specializes in event-driven architectures, functional programming (Kotlin/Arrow-KT), and durable execution frameworks.
Have you calculated your glue code tax? I'd love to hear your numbers.
Drop a comment below or DM me on LinkedIn.

