Welcome, my name is Joe Badaczewski

# post

Reliable inter-service communication in distributed systems

2025-05-23 • From message queues to graceful degradation - patterns for communication that survives failure.

A service oriented architecture (SOA) often suffers from over-engineering and complexity. But as an organization and domain grow, SOAs provide a host of benefits: horizontal scalability (both in server hardware and developer contributions), isolated test areas, explicit security boundaries, more flexibility in deployment, just to name a few.

This post focuses on a specific area of SOAs: building durable communication between services that is resilient to failure. Eliminating failures in a distributed system is practically impossible due to the inherent complexity and interdependencies; the most resilient systems can identify failing behavior and mitigate the customer impact automatically. I have some experience in building and maintaining SOAs in production and would like to discuss my findings on what it takes to build a durable messaging mechanism in distributed systems.

There is a great chapter in "Building Microservices" by Sam Newman that covers messaging patterns for microservices and distributed systems. The author covers several subjects, technologies, and paradigms but I am going to distill the importance of this chapter to this philosophy: communication between microservices must promote decoupling, durability, and resiliency. If we agree on this philosophy as a best practice, I recommend RabbitMQ as an excellent choice for microservice communication for most cases because:

It features forced acknowledgement. When we enable forced acknowledgement for message queue, messages are only removed when a consumer explicitly communicates to the messaging service that the message has been received. This feature increases the reliability of inter-service communication. For example, If a consumer crashes while processing an event, the message returns to the queue rather than being lost.
It features durable messages. Durability in software engineering is defined as minimizing data loss on application failure. RabbitMQ achieves durable messaging by persisting messages even if the RabbitMQ service crashes.
It promotes decoupling. While not exclusively a feature of RabbitMQ, using a mature message queue promotes decoupling in general. Also, when the messaging service has no knowledge of producers, consumers, or even the message body, we can build our inter-service communication against an interface rather than the concrete implementations of each service.

I have enjoyed my time with RabbitMQ because of its maturity, feature-completeness, and reliability. But there are many other options for microservice communication that follow a different model (rather than RabbitMQ's asynchronous producer/consumer model). There are other details about message queues that I could cover like features specific to RabbitMQ (exchanges and bindings) or features that are part of the generic message queue pattern (dead letter queues, retries, synchronous vs asynchronous), but instead I want to move on to ensuring that each part of the messaging chain of communication can fail and the system will be able to self heal and achieve eventual consistency.

It's not enough to construct a bullet proof architecture of decoupled services communicating via a message queue. What if the message queue goes down? While modern, cloud-provided queues can achieve insane levels of availability, we still want to make our system as resilient as possible by accounting for the message service itself going down.

If the messaging service (RabbitMQ instance) ever fails/crashes/experiences long periods of downtime, our services will gracefully degrade by using the outbox pattern. The outbox pattern involves building an outbox for each service that seeks to produce messages. Instead of putting a message directly into the RabbitMQ queue, we instead write a row to a persistent data source that will be processed in the future. At a set interval, a background service will collect all the outgoing messages and try to deliver them to the message service. This type of abstraction allows the application to continue operating while the messaging operations are failing.

There are plenty of details to continue discussing on the topic of building interconnected services that share events and data. In future posts I will cover more of my favorite challenges when building distributed systems:

Enterprise grade observability. Tracking the lifecycle of messages across distributed systems using the saga pattern
Strongly typed messaging SDKs. When the interface of services change, we can use semantically versioned SDKs built from the source code to ensure breaking changes are transparent to the end user and downstream microservices
Enrichment services and the ownership of cross-cutting concerns. Any distributed system is going to have some degree of denormalized data and shared data. We can explore the different strategies and ownership patterns of cross cutting data.

Stay tuned!