Tags: messaging
This is the fourth article in a series about designing messaging architectures.
Resiliency Groups
To understand what we mean by “resiliency”, or its cousin “reliability”, we need to look at their alter ego: “failure”. Failure can originate from several sources:
- Application code can throw unhandled exceptions, crash, run out of memory or other resources, or get into a deadlock or infinite loop and stop responding to input. This is the most likely source of failure, because application code is the least likely to be shared with, and therefore hardened by, other systems.
- System code, such as the messaging middleware, can die or run out of memory (e.g. caused by slow subscribers). This should happen less often than failures in application code, if only because other systems will probably also be using the same middleware and will therefore find and report bugs whose fixes can be applied to other deployments.
- Message queues can overflow, for example when messages arrive faster than a subscriber can consume them.
- Network transmission can fail temporarily, causing intermittent message loss.
- Hardware can also fail: anything from individual servers and network switches up to entire data centers.
For the purposes of this article, we will look at how to mitigate the failure of an individual Service Component instance. A key insight is that the strategies required to mitigate each of the failure modes listed above must be designed in from day one. Wherever possible, the implementation of messaging resiliency should be moved out of application code and into the “Bus” layer or the middleware itself, so that a common set of strategies can be applied across all Components.
Components in a SOA architecture are defined by the Service(s) that each type provides. If we assume that each Service is only provided by Component instances of the same type (i.e. running the same application code), then a resiliency strategy for those instances needs to be considered. Scalability also needs to be factored into the design: how many instances of the same Component are required in the local / global system to fulfil the performance requirements for the Service?
One approach is to define, by configuration (see above), one or more Resiliency Groups for each Component type. A Resiliency Group ensures that exactly one instance of the Component within the Group, known as the “Primary” (P), will receive and process a given message sent to the Service on the Logical Message Bus. Backup instances (see below) are denoted as Secondary (S) and, optionally, Disaster Recovery (D) instances that run on dedicated DR-ready infrastructure.
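To make the shape of such a configuration concrete, here is a minimal sketch in Python; the class names, the “pricing-service” Component and its instance ids are purely illustrative assumptions, not tied to any particular middleware.

```python
# A sketch of what a Resiliency Group configuration might look like.
# All names here are hypothetical and only illustrate the P / S / D roles.
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    PRIMARY = "P"
    SECONDARY = "S"
    DISASTER_RECOVERY = "D"


@dataclass
class Member:
    instance_id: str   # unique id of a Component instance
    role: Role         # initial role within the Group


@dataclass
class ResiliencyGroup:
    component_type: str      # the Service-providing Component this Group protects
    members: list[Member]    # exactly one member should hold the Primary role at runtime


# Example: a PSD group for a hypothetical "pricing-service" Component.
pricing_group = ResiliencyGroup(
    component_type="pricing-service",
    members=[
        Member("pricing-ldn-01", Role.PRIMARY),
        Member("pricing-ldn-02", Role.SECONDARY),
        Member("pricing-dr-01", Role.DISASTER_RECOVERY),
    ],
)
```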
Scalability is provided by splitting the message workload across multiple Resiliency Groups (which may, for example, be located in different WAN regions). The most efficient way to achieve this is likely to be topic subscription filtering, based on information in the topic name itself, so that horizontal scaling is mediated by the messaging infrastructure and the Logical Message Bus rather than by application code.
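As an illustration, the sketch below assumes a hypothetical topic naming scheme of `<service>.<region>.<subject>` and shows each Resiliency Group subscribing with a filter on the region segment; the naming scheme and wildcard syntax are assumptions for this example, not any specific product's API.

```python
# Routing by topic name: each Resiliency Group subscribes with a filter, so the
# messaging layer, not application code, decides which Group handles a message.
GROUP_SUBSCRIPTIONS = {
    "emea-group": "pricing.emea.*",
    "apac-group": "pricing.apac.*",
    "amrs-group": "pricing.amrs.*",
}


def matches(topic: str, pattern: str) -> bool:
    """Very simple wildcard match: '*' matches any single topic segment."""
    topic_parts = topic.split(".")
    pattern_parts = pattern.split(".")
    if len(topic_parts) != len(pattern_parts):
        return False
    return all(p == "*" or p == t for t, p in zip(topic_parts, pattern_parts))


def groups_for(topic: str) -> list[str]:
    """Return the Resiliency Groups whose subscription filter matches this topic."""
    return [g for g, pat in GROUP_SUBSCRIPTIONS.items() if matches(topic, pat)]


assert groups_for("pricing.emea.eurusd") == ["emea-group"]
```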
Primary instances within the Group receive messages from the Logical Message Bus. If there is more than one Primary, the messaging layer can route each message to a particular Primary instance using an appropriate routing policy (least-recently-used, round-robin, etc.).
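A round-robin version of such a routing policy might look something like this (the instance names are again illustrative; a real messaging layer may offer LRU or load-based policies instead):

```python
# Round-robin selection across the active Primary instances in a Group.
import itertools

primaries = ["pricing-ldn-01", "pricing-ldn-02", "pricing-ldn-03"]
next_primary = itertools.cycle(primaries)


def route(message: dict) -> str:
    """Pick the Primary instance that should process this message."""
    return next(next_primary)
```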
Secondary and DR instances can run in hot standby (receiving messages and processing them to update internal state, but not generating any output or effect), warm standby (initialized and running, but not receiving messages), or cold standby (not running unless required).
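The difference between the standby modes can be sketched in the message handler; this is an illustrative outline only, with `publish_result` standing in for whatever output mechanism the Component uses.

```python
# How the standby modes might differ inside an instance's message loop.
from enum import Enum


class StandbyMode(Enum):
    HOT = "hot"    # subscribed and processing, but output suppressed
    WARM = "warm"  # initialized and running, not subscribed
    COLD = "cold"  # not running until promoted


def publish_result(state: dict) -> None:
    ...  # placeholder for publishing output onto the Logical Message Bus


def on_message(is_primary: bool, mode: StandbyMode, message: dict, state: dict) -> None:
    if is_primary:
        state.update(message)   # apply the message
        publish_result(state)   # and produce output / effects
    elif mode is StandbyMode.HOT:
        state.update(message)   # keep internal state current,
        # but generate no output, so a takeover needs no state resync.
    # WARM and COLD instances never reach this handler: warm instances are not
    # subscribed to the message stream, and cold instances are not running at all.
```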
Primary and Secondary / DR instances can then be composed into Resiliency Groups, with the Group behaviour mediated by the software layer that forms the Logical Message Bus on top of the messaging infrastructure. This layer would use heartbeating or a similar mechanism to observe Component failures and ensure that the appropriate role change occurs to a backup instance in the Group:
- PPD (or PPPPD!). Two or more Primary instances, with a load-balancing function to distribute the message load across the active Primaries. The DR instance only becomes active and starts processing messages if no Primary instances are left running in the Group.
- PSD. One Primary instance processes all messages; the S and D instances run in warm standby. If the Primary fails, the Secondary detects this and takes over the Primary role; if the Secondary fails, the DR instance takes over. Should the failed instance restart successfully, it rejoins the Group as a new Secondary (a minimal failover sketch follows this list).
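Here is a minimal sketch of how the Bus layer might drive the PSD failover using heartbeats; the timeout, instance names and promotion order are all assumptions made for illustration, not a prescribed mechanism.

```python
# Heartbeat-driven failover in a PSD group. The Bus layer is assumed to deliver
# periodic heartbeats from every instance in the Group.
import time

HEARTBEAT_TIMEOUT = 5.0  # seconds without a heartbeat before declaring failure

# Promotion order: Secondary takes over from Primary, DR from Secondary.
promotion_order = ["pricing-ldn-01", "pricing-ldn-02", "pricing-dr-01"]
last_heartbeat = {inst: time.monotonic() for inst in promotion_order}


def on_heartbeat(instance_id: str) -> None:
    last_heartbeat[instance_id] = time.monotonic()


def current_primary() -> str:
    """First instance in the promotion order that is still heartbeating."""
    now = time.monotonic()
    for inst in promotion_order:
        if now - last_heartbeat[inst] < HEARTBEAT_TIMEOUT:
            return inst
    raise RuntimeError("no live instance in the Resiliency Group")


def rejoin(instance_id: str) -> None:
    """A restarted instance rejoins as a backup at the back of the order
    (a fuller implementation might slot it ahead of the DR instance)."""
    promotion_order.remove(instance_id)
    promotion_order.append(instance_id)
    on_heartbeat(instance_id)
```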
The role handover used in PSD depends on whether the Service(s) provided by the Component are stateless. If they are, the new Primary instance can go straight ahead and process the next message. If message processing is stateful, the new instance must first synchronize its internal state (e.g. from a persistent store or another Service) before it starts processing the message stream.
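For the stateful case, the takeover step might look roughly like this; `load_snapshot` and `subscribe_as_primary` are hypothetical placeholders for the persistent store and the Logical Message Bus, not real APIs.

```python
# Takeover for a stateful Service: synchronize state before consuming messages.
def load_snapshot() -> dict:
    return {}  # placeholder: read last persisted state from a store or another Service


def subscribe_as_primary(state: dict) -> None:
    pass  # placeholder: attach to the Logical Message Bus and start processing


def take_over_primary(stateless: bool) -> None:
    state = {} if stateless else load_snapshot()  # stateful instances must resync first
    subscribe_as_primary(state)                   # only then join the live message stream
```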