Messaging #4 - Resiliency Groups
20 Aug 2011 14:54
Tags: messaging
This is the fourth article in a series about designing messaging architectures.
Resiliency Groups
To understand what we mean by “resiliency” or its cousin “reliability”, we need to look at their alter ego: “failure”. We can consider failure as originating from:
- Application code can throw unhandled exceptions, crash, run out of memory or other resources, or get into a deadlock or infinite loop and therefore stop responding to input. This is the most likely source of failure, as application code is the least likely to be shared across other systems and is therefore the least proven.
- System code, such as the messaging middleware, can die, or even run out of memory (e.g. caused by slow subscribers). This should happen less often than failure in application code, if only because other systems will probably also be using the same middleware, so bugs found in one deployment can be identified and fixed across all of them.
- Message queues can overflow.
- Network transmission can fail temporarily, causing intermittent message loss.
- Hardware can also fail, from individual servers and network switches up to entire data centers.
For the purposes of this article, we will look at how to mitigate the failure of an individual Service Component instance. A key insight is that the strategies required to mitigate each of the failure modes listed above must be designed in from day 1. Wherever possible, the implementation of messaging resiliency should be moved out of application code and into the “Bus” layer or the middleware system, so that a common set of strategies can be applied across all Components.
Components in an SOA are defined by the Service(s) that each type provides. If we assume that a Service is only provided by Component instances of the same type (i.e. with the same application code), the resiliency strategy for those instances needs to be considered. Additionally, scalability needs to be factored into the design: how many instances of the same Component are required in the local / global system to fulfil the performance requirements for the Service?
One approach is to define, by configuration (see the Topology By Configuration article), one or more Resiliency Groups for each Component type. A Resiliency Group will ensure that exactly one instance of the Component within the Group, known as the “Primary” (P), will receive and process a given message sent to the Service on the Logical Message Bus. Backup instances (see below) are denoted Secondary (S), and there may also be Disaster Recovery (D) instances that run on dedicated DR-ready infrastructure.
Scalability is provided by splitting the message workload across multiple Resiliency Groups (e.g. groups located in different WAN regions). The most efficient way to achieve this is likely to be topic subscription filtering, based on information in the topic name itself, so that the horizontal scaling is mediated by the messaging infrastructure and Logical Message Bus rather than by application code.
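To make this concrete, here is a minimal sketch of the idea in Java. The group names and the currency-prefix split are purely illustrative assumptions; in a real system the filters would be evaluated by the messaging layer itself, not by application code.

```java
import java.util.List;
import java.util.function.Predicate;

// Sketch: each Resiliency Group owns a disjoint slice of the topic space,
// expressed here as a simple prefix filter on the topic name.
public final class TopicPartitioning {

    record ResiliencyGroup(String name, Predicate<String> filter) {
        boolean accepts(String topic) { return filter.test(topic); }
    }

    public static void main(String[] args) {
        List<ResiliencyGroup> groups = List.of(
            new ResiliencyGroup("FXSpot-EMEA", t -> t.startsWith("FX.Spot.EUR")),
            new ResiliencyGroup("FXSpot-APAC", t -> t.startsWith("FX.Spot.JPY")));

        String topic = "FX.Spot.EURUSD";
        // Because the slices are disjoint, exactly one group's filter matches,
        // and so exactly one Primary instance will process the message.
        groups.stream()
              .filter(g -> g.accepts(topic))
              .forEach(g -> System.out.println(g.name() + " handles " + topic));
    }
}
```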
Primary instances within the Group will receive messages from the Logical Message Bus (if there is more than one Primary, the messaging layer can route to different Primary instances based on appropriate routing logic: LRU, round-robin, etc.).
Secondary & DR Instances can run in hot standby (receiving messages, processing them to update internal state, but not generating any output or effect), warm standby (initialized and running, but not receiving messages), or cold standby (instance not running unless required).
Primary and Secondary / DR instances can then be composed into Resiliency Groups, with the Group behaviour mediated by the software layer forming the Logical Message Bus on top of the messaging infrastructure. This layer would use heartbeating or similar to observe component failures and ensure that the appropriate role change to a backup instance in the Group occurs:
- PPD (or PPPPD!). Two or more Primary instances, with a load-balancing function to distribute the message load across the active Primary instances. The DR instance would only become active and start processing messages if no Primary instances were running in the Group.
- PSD. One Primary instance processes all messages; the S and D instances run in warm standby. If the Primary instance should fail, the Secondary instance will detect this and take over the Primary role. If the Secondary instance should also fail, the DR instance would take over. If the previously failed instance should restart successfully, it would rejoin as a new Secondary instance.
The role handover used in PSD depends on whether the Service(s) provided by the Component are stateless; if stateless, the new Primary instance can go straight ahead and process the next message. If message processing is stateful, the new instance must synchronize its internal state (e.g. from a persistent store or another Service) before it begins processing the message stream.
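As a rough illustration, the following Java sketch shows how a Secondary or DR instance might detect a silent Primary via a heartbeat timeout and promote itself. The roles, the timeout value, and the promotion order are all assumptions for illustration; a real implementation would live in the Logical Message Bus layer and would need proper arbitration between candidates.

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of PSD role handover driven by heartbeat timeouts.
public final class PsdFailover {

    enum Role { PRIMARY, SECONDARY, DR }

    static final Duration HEARTBEAT_TIMEOUT = Duration.ofSeconds(3); // assumed value

    private Role role;
    private Instant lastPrimaryHeartbeat = Instant.now();

    PsdFailover(Role initialRole) { this.role = initialRole; }

    // Called whenever a heartbeat from the current Primary is observed.
    void onPrimaryHeartbeat() { lastPrimaryHeartbeat = Instant.now(); }

    // Called periodically; promotes this instance if the Primary has gone silent.
    void checkPrimaryLiveness() {
        boolean primaryDead = Duration.between(lastPrimaryHeartbeat, Instant.now())
                                      .compareTo(HEARTBEAT_TIMEOUT) > 0;
        if (primaryDead && (role == Role.SECONDARY || role == Role.DR)) {
            // DR should take over only if no Secondary promoted itself first;
            // that arbitration is elided in this sketch.
            synchronizeState();
            role = Role.PRIMARY;
        }
    }

    // Stateful services must resynchronize internal state (e.g. from a
    // persistent store or another Service) before going live.
    private void synchronizeState() { }

    public static void main(String[] args) throws InterruptedException {
        PsdFailover secondary = new PsdFailover(Role.SECONDARY);
        Thread.sleep(HEARTBEAT_TIMEOUT.toMillis() + 100); // no heartbeats arrive
        secondary.checkPrimaryLiveness();
        System.out.println("role is now " + secondary.role); // PRIMARY
    }
}
```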
——
Messaging #3 - Logical Message Bus
07 Aug 2011 22:19
Tags: messaging
This is the third in a series of messaging-related articles.
Build a Logical Message Bus
The messaging infrastructure supports the creation of a messaging topology and provides messaging-level services (guaranteed delivery, messaging resilience, WAN bridging, etc.). However, there is still a significant functional gap between that layer and the application domain-level code on which developers and the business wish to focus most of their attention.
A shared common layer can be built that abstracts the messaging infrastructure from Service Component application code, and provides further services:
- Topology configuration (e.g. converting message topology configuration into behaviour and state).
- Topic building (e.g. using a Builder Pattern implementation)
- Component resiliency (warm standby, round-robin)
- Inter-component heartbeating
- Test injection / mocking for unit testing
- Message performance monitoring
- Presents a messaging API to application code that abstracts the implementation of the messaging layer, allowing heterogeneous messaging implementations to be used to extract maximum performance (a sketch of such an API appears at the end of this post).
- Implements the messaging-level protocol(s) used for message creation and bundling (the actual wire byte protocol would be determined by the messaging infrastructure itself).
- Provides a common point of control for messaging functionality across all components.
The combination of these services effectively creates a logical message bus (although the underlying messaging topology may be one or more of many different types and implementations, and not formally recognized as a “bus” at all).
This shared code becomes a critical part of meeting the System’s performance objectives, and, as such, requires the oversight of senior technical development personnel in its design and development.
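As an illustration of the kind of API such a layer might present, here is a minimal Java sketch. All names are hypothetical; the point is only that middleware types and raw topic strings never leak into application code, so both the topic structure and the underlying transport can change without touching business logic.

```java
import java.util.function.BiConsumer;

// Sketch of a Logical Message Bus API as seen by application code.
public interface LogicalMessageBus {

    // Topics are built via a Builder so their structure can change
    // without touching application code (see Topic building above).
    interface TopicBuilder {
        TopicBuilder rootDomain(String domain);       // e.g. "FX"
        TopicBuilder businessPartition(String part);  // e.g. "Spot.EURUSD"
        Topic build();
    }

    // Opaque to application code; the wire representation is owned
    // by the Bus layer and the messaging infrastructure beneath it.
    interface Topic { }

    TopicBuilder topic();

    void publish(Topic topic, byte[] payload);

    // The handler receives the opaque topic plus payload; the transport
    // (multicast, broker, etc.) is entirely hidden behind this layer.
    void subscribe(Topic topic, BiConsumer<Topic, byte[]> handler);
}
```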
——
Messaging #2 - Topology By Configuration
20 Jul 2011 09:59
Tags: messaging
This is the second in a series of blog posts covering key insights gained whilst working on the design of low-latency, high-volume distributed trading systems, focusing on the messaging middleware layer.
Topology By Configuration
SOA systems are composed of multiple endpoint-to-endpoint connections, bridged by a variety of intermediary brokers, gateways, bridges, etc. For example, 29West uses WAN Gateways to bridge local multicast network traffic across data centres via the more reliable TCP transport. Other examples include any broker-orientated messaging topology (e.g. ActiveMQ). The Open Source community is also starting to address this space with products such as RabbitMQ and ZeroMQ.
The messaging topology should be defined by configuration as far as possible. The configuration, controlling mechanisms such as topic subscriptions, lists of publication topics, and topics to bridge across Gateways, should be deployed as an overlay, managed as a separate artifact from the application software (a sketch follows at the end of this post). This provides several benefits:
- By centralizing the implementation of the messaging topology into a deployable artifact in its own right, it supports programme-level management of the topic space itself (see the Service Topic Name Management article).
- Changes can be made rapidly in development and testing environments.
- Production messaging topology is put under same version control mechanisms (with rollback etc) as the application code itself.
- Supports the development of management, monitoring and visualization tools.
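As a minimal sketch of what such an overlay might look like, the following Java fragment loads a hypothetical properties-format overlay and turns it into a list of subscriptions. The property names and values are illustrative assumptions; in production the overlay would be loaded from a versioned artifact, not embedded in code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.List;
import java.util.Properties;

// Sketch: a topology overlay, deployed and versioned separately from
// the application binary, driving subscriptions at startup.
public final class TopologyOverlay {

    public static void main(String[] args) throws IOException {
        String overlay = """
            subscribe.topics=FX.Spot.EURUSD,FX.Spot.GBPUSD
            publish.topics=Exception.PricingEngine
            gateway.bridge.topics=FX.Spot.*
            """;

        Properties config = new Properties();
        config.load(new StringReader(overlay));

        List<String> subscriptions =
            List.of(config.getProperty("subscribe.topics").split(","));

        // The Logical Message Bus layer would turn this configuration
        // into live subscriptions, publications, and gateway bridges.
        subscriptions.forEach(t -> System.out.println("subscribe: " + t));
    }
}
```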
——
Messaging #1 - Service Topic Name Management
18 Jul 2011 09:35
Tags: messaging
RSBA Technology Ltd has provided consulting services to Tier 1 and 2 Investment Banks since 2009. Our focus has been on analyzing the requirements and project-managing the delivery of complex, multi-site, real-time, low-latency, high-capacity trading and workflow management systems for eFX businesses within the Banks’ Global Markets areas. In the course of this work, and in many years of experience prior to forming the Company, we have gained valuable insights and knowledge of how to effectively design, develop and deploy such systems in a highly competitive landscape.
This series of blog postings highlights some of these key insights related to the design of distributed trade workflow systems, in the context of low-latency, high-volume processing environments.
Service Topic Name Management
In a Pub/Sub-based SOA (Service-Orientated Architecture), different Service Components subscribe to subsets of the data published within the system. Messages of specific content types are associated with specific Topics, which are used to determine which messages are processed by different Component types and instances (e.g. “FX.Spot.EURUSD” or “Exception.FIXGateway”).
To ensure the long-term expandability of the architecture, the topic namespace itself must be tightly controlled, both technically and managerially. Although an SOA supports low coupling between Components and their respective development teams, it is essential that all Components unambiguously define the topics on which they publish messages and register their interest in receiving messages. Failure to do so from day 1 leads to the propagation of hacks and workarounds that materially impact the ability of the system to scale with business demand.
A managed topic schema should be defined and published right at the start of development of the SOA system, or retrospectively implemented in an existing system. It can be hierarchical, based on domain delimiters, with a “root” schema defining the overall structure and the standards for the sub-domain topic spaces developed by different teams below it.
Effort must be put into modelling all the likely messaging paths and use cases between Service Components, and the results used to draw up a Topic Schedule. This should be maintained by a Design Authority as an important lifecycle artifact, and all development teams made responsible for complying with the topic schema standards.
The topic identifiers, however structured, should not be directly exposed to the application layer. This allows:
- Changes to the topic structure, and thereby the routing of messages, that are transparent to applications and business logic.
- Inclusion of additional topic identifier sections required by the higher-level “Bus” functionality, such as component resiliency.
A possible message topic schema could incorporate the following elements (a sketch assembling them into a topic string follows the list):
- Root domain (e.g. “FX”, “IR”, “Control”, “Exception”)
- Publishing Component Identification (component type and instance information; should provide enough information to trace the message originator to a specific process, including its physical location).
- Business Partition (Logical description of message contents and / or channel)
- Additional Parameters (additional flags and labels related to message routing, used for message subscription / forwarding purposes).
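To illustrate, here is a minimal Java sketch that assembles a topic string from these elements. The delimiter, element order, and example values are assumptions; the real schema would be owned by the Design Authority, and application code would only ever reach it via the Bus layer’s Builder (see Messaging #3).

```java
// Sketch: assembling a topic from the schema elements listed above.
public final class TopicSchema {

    static String topic(String rootDomain, String componentId,
                        String businessPartition, String... params) {
        StringBuilder sb = new StringBuilder();
        sb.append(rootDomain)                      // e.g. "FX"
          .append('.').append(componentId)         // e.g. "PricingEngine.LDN01"
          .append('.').append(businessPartition);  // e.g. "Spot.EURUSD"
        for (String p : params) {                  // optional routing flags
            sb.append('.').append(p);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Application code would never see this string directly.
        System.out.println(topic("FX", "PricingEngine.LDN01", "Spot.EURUSD"));
    }
}
```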