No Estimates

07 Mar 2015 17:24


I recently became aware of the #NoEstimates movement on Twitter, thanks to this post on Slashdot. It goes back to this blog post by Woody Zuill. The hashtag is perhaps deliberately provocative, but it has provided a useful forum for debate. The good news is that there is a well-documented but relatively unknown process that is essentially the same as the one he describes: evolutionary project management. Anyone who has worked with me over the past decade should recognise some of this.

As always on Slashdot, the comments thread is far more informative and entertaining than the original article, although those of a nervous disposition should be warned about the forthright language. Here’s my own short summary of the collected wisdom of the developer crowd. Please let me know your own take on this!

The "Double It / Convert to degrees Fahrenheit" Approach

So, what do you do when a manager insists they need an estimate, even after you have spent 15 minutes trying to explain to them the process of engineering discovery? Several of the posters disclosed their personal algorithms for estimate inflation, like this (double it), this (triple it), and this (convert to Fahrenheit).

Although this can work in the short term to get a manager off your back, so you can get back to doing productive work as soon as possible, it’s going to come back to bite you. The systematic problems here are the lack of trust and the lack of organisational learning: the next estimate goes through the same algorithm, and history repeats.

The "Function Point Analysis / Story points / T-Shirt Sizing" Approach

Maybe pulling a number out of the air can be improved upon. Maybe we can substitute some kind of proxy for a seemingly 100% accurate person-days estimate? Back in the day, Function Point Analysis was a popular management technique, which always seemed to me a fairly meaningless measure of bullet points in a requirements specification somewhere.

More recently, the agile movement has adopted this in a different form as story points and T-shirt sizing (S, M, L, XL, XXL etc). Both attempt to help prioritize work based on the collective perceived complexity of a task, using an arbitrary scale, perhaps anchored to a well-known ‘M’-sized task. The idea is that by measuring the number of story points delivered by a stable team over a number of iterations, you can find the team’s average rate of delivery (‘velocity’) and use that for future estimation.

This technique has a couple of major weaknesses, however. It is impossible to compare velocity between teams, because each team privately forms its own consensus on what each point number or T-shirt size means, although this probably won’t stop managers from doing so. It is also very difficult to avoid slipping back into the trap of equating story points with effort-hours or days. The key here is that it is a measure of complexity: the more complex something is, the greater the likely variation in required effort, or, in other words, the greater the +/- uncertainty that should be attached to any effort estimate.
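As a rough illustration of how velocity feeds a forecast, here is a minimal sketch in Python. The function name and the sample numbers are my own assumptions rather than anything from the Slashdot thread; the point is that even a simple forecast is better reported as a range than as a single number.

```python
from statistics import mean

def forecast_iterations(completed_points, backlog_points):
    """Return (optimistic, expected, pessimistic) iterations needed to clear a backlog.

    completed_points: story points delivered in each past iteration by one stable team.
    """
    if not completed_points or min(completed_points) <= 0:
        raise ValueError("need past iterations with positive velocity")
    best = max(completed_points)    # fastest iteration seen so far
    avg = mean(completed_points)    # the team's 'velocity'
    worst = min(completed_points)   # slowest iteration seen so far
    return (backlog_points / best, backlog_points / avg, backlog_points / worst)

history = [18, 25, 20, 17]  # hypothetical points delivered per iteration
low, expected, high = forecast_iterations(history, backlog_points=100)
print(f"{low:.1f} to {high:.1f} iterations, around {expected:.1f} on average")
```

The numbers only mean something for the team that produced the history, which is exactly why comparing velocity across teams does not work.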

The "That's too long, you need to give a more realistic deadline" Approach

As if estimating itself were not sufficiently difficult, the real problems start when the numbers are then used, manipulated and abused by others in an organization, usually when they are communicated up the hierarchy and turned into a deadline.

The really valuable information related to the original estimate (the uncertainty, assumptions, caveats etc) gets lost, if it was ever documented at all, and the “4 weeks” estimate gets turned into a today-plus-four-calendar-weeks deadline. People assume you will work 100% on the task and nothing else, or believe that a bit of pressure makes people work harder.

In any organisation that has more people than can comfortably stand around a single coffee machine, there is a need to formalise the co-ordination of tasks between teams (sales, marketing, support) so that the overall delivery is met. Estimates are used, but become meaningless once their context has been lost, because they are not updated to reflect the new reality as work progresses. They become a tool to pass the blame once the deadline inevitably sails by.

How can this situation be improved? When have you seen management hierarchies pay real attention to the risks and uncertainties attached to estimates, and then track them over time?

The "Give up" Approach

If you are very lucky, and work for one of the founders of Stack Overflow and Trello, you really can adopt a #NoEstimates approach. Joel wrote an influential blog post back in 2007 on Evidence-Based Scheduling, which is well worth a read. According to one poster on the Slashdot comment thread, Joel’s company uses a single estimate of ‘6-8 weeks’ for everything. Much smaller things, just do them, much bigger things, don’t do them until you can break them down into smaller things.

And finally, perhaps a way forward

I am sure most people reading this can recognise examples of all these approaches. It seems to me that there are a few fundamental themes that run through them all:

  • Lack of understanding about what estimates actually are: they are not single numbers but big blobby things with uncertainties, assumptions, and constraints.
  • Similarly, a lack of understanding of how to combine estimates (of smaller tasks into more complex ones, tasks between different teams)
  • The only estimates used for scheduling and decision-making are measures of effort & time. I only found a single post out of 250+ in the comment thread that suggested estimating business value as well as level of effort might be a good idea.
  • Human psychology cannot be ignored. We all want to be Scotty Miracle Workers, line up someone else to blame, and believe one-word requirement changes can have no impact on a schedule.
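To make the first two points concrete, here is a minimal sketch of an estimate as a "blobby thing" that carries its uncertainty and assumptions with it, and of one way to combine such estimates. The `Estimate` class and the sample figures are hypothetical, and the combination rule assumes the errors on each task are independent and roughly symmetric, which is often optimistic for software work.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Estimate:
    expected: float   # expected effort in days
    std_dev: float    # one standard deviation of uncertainty, in days
    assumptions: list = field(default_factory=list)

def combine(estimates):
    """Combine estimates for sequential tasks, assuming independent errors.

    Expected values add directly; uncertainties add in quadrature
    (the variances add, not the standard deviations).
    """
    total = sum(e.expected for e in estimates)
    spread = math.sqrt(sum(e.std_dev ** 2 for e in estimates))
    notes = [a for e in estimates for a in e.assumptions]
    return Estimate(total, spread, notes)

tasks = [
    Estimate(5, 3, ["API spec is stable"]),
    Estimate(10, 4, ["test environment is available"]),
]
whole = combine(tasks)
print(f"{whole.expected} +/- {whole.std_dev} days, assuming: {whole.assumptions}")
```

Note that the assumptions travel with the combined number, so they are harder to lose when the estimate is passed up the hierarchy.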

I would suggest we need techniques that help us:

  • Focus on understanding what we know, and, more importantly, what we don’t know yet, in sufficient detail to make good decisions
  • Balance the estimation of required effort with an equivalent analysis of the benefits we hope to gain in doing the work.
  • Increase transparency and objectivity of setting deadlines and milestones
  • Recognize constraints and the wider environment in which decisions are made, both internal (people, process) and external (customers, suppliers, regulatory)

Tom Gilb, based on his work at IBM in the 1960s, devised a set of techniques and tools specifically engineered to address these needs, which he calls “Competitive Engineering” (“CE” for short). More recently, his son, Kai Gilb, extended them into an “agile-with-brains” methodology they called “Evolutionary Project Management”, or “Evo” for short. I have been successfully applying these ideas to my own professional work since I first encountered them at Citigroup in 2005.

Although CE and Evo provide a comprehensive suite of requirements engineering and project management techniques, their core is built around a process of estimation:

  • Estimation and quantification of resources and key stakeholder goals at the same time
  • Estimation of the impact (positive or negative!) of alternative possible designs or strategies on the achievement of the goals and resources
  • Estimation of the uncertainties, risks, and credibility of all of these estimates.

The process assists all stakeholders, from management through to developers, to attempt to quantify the needed benefits and value (“how much ‘usability’ do our customers actually need?”). The resulting numeric estimates are just a useful side-benefit of this technique: the real value is in making these attempts in the first place. Better than any other method I have seen or used, it forces you to evaluate what is really known about the requirements, and makes it very clear where there are major holes or unknowns before you can proceed. This can work at any level: from corporate board level to individual agile sprints.

Get in touch!

Please contact me if you want to discuss any of this in more detail. I have previously written a review of the Competitive Engineering book and a detailed case study from one of my past projects at a large investment bank in which these techniques were used.


What Is A Business Analyst?

10 Nov 2014 14:52

I was recently asked to define the role of a business analyst. People generally get what a developer ("writes code"), tester ("tests code") and project manager ("GANTT charts") do. Isn't that sufficient to deliver systems on time and on budget? Maybe, but what about those tricky things called "requirements"?

I wrote a short paper which identifies key tasks and success factors related to the ill-defined project role of a “business analyst”. It is based on my own professional work experience as a business analyst over the past 20 years, across a variety of organisations (software firms, multi-national investment banks) and domain areas within the investment banking industry (buy-side, sell-side, front / middle / back office).

In conclusion, requirements elicitation and management should be considered a separate engineering discipline, alongside the more well-known roles. The business analyst role needs specific skills and training to be effective. The most effective practitioners recognise that the core tasks of requirement elicitation and capture are actually rooted in core engineering disciplines and principles of quantification, decomposition (of complex things into simpler, atomic statements) and

Please download and read the full What Is A Business Analyst? paper.

Please get in touch to discuss further, or leave comments below.


Introduction to NetKernel

27 May 2014 14:36
Tags: netkernel

In March 2014, I met the founders of 1060Research at a training course given by Tom Gilb held at the British Computer Society in London. Over the past 12 years, Peter Rodgers and Tony Butterfield have developed the concept of something they named Resource-Orientated Computing ("RoC"), which attempts to bring the scalability and zero-coupling economy of the World Wide Web into the domain of individual applications. The NetKernel platform is their concrete, production-hardened implementation of the RoC principles. In a short 40-minute presentation, I saw enough to be immediately intrigued by this new approach.

In a nutshell, RoC and NetKernel focus on the information (or "resources") present in a software system as the primary architectural and compositional entity, rather than the physical code. In fact, everything becomes a resource: not only the application state, but also the code itself and all of the configuration needed to wire up the application. To the kernel, everything is normalised to the same few core ideas of requests, endpoints, representations and resolution. Once you have understood this, everything changes compared with the static object-orientated programming model.


Over the next few weeks, I read through much of the extensive documentation, installed the NetKernel server, watched some YouTube videos and walked through some of the available tutorials.

I then decided the best way to really gain a working understanding of RoC and NetKernel was to implement a real-world application based on some of my recent work experience. I also decided to keep a detailed diary of my progress through the implementation, recording lessons learnt and issues encountered along the way. I therefore started work on the NK-PKS prototype project, which I open-sourced on my GitHub page. The "PKS Diary" documentation is embedded in the main PKS NetKernel module, and can be viewed in the NetKernel Documentation portal once the NK-PKS modules have been installed in a NetKernel instance.


A Competitive Engineering Case Study

27 Mar 2014 17:32
Tags: eve gilb planguage

A Competitive Engineering Case Study: Price Sentinel

I have just uploaded a new PDF document to the Files page.


This paper describes a single case study of a project management approach known as Requirements Engineering or Competitive Engineering (“CE”), developed by Tom and Kai Gilb. It demonstrates the benefits of re-casting stated business requirements as quantified stakeholder value objectives, focusing all project effort on maximizing the improvement of those value objectives for minimal cost by delivering real measurable improvements to those stakeholders early and often.


Competitive Engineering on Prezi

08 Feb 2014 14:54
Tags: evo gilb

Competitive Engineering Presentation on Prezi

A few years ago, I created a presentation on the core principles of Competitive Engineering on Prezi. I have embedded it into this blog post.


Tom Gilb's Competitive Engineering

03 Sep 2011 09:46
Tags: agile evo gilb


Back in 2004, I was employed by a large investment bank in their FX e-commerce IT department as a business analyst. The wider IT organisation used a complex waterfall-based project methodology that required use of an intranet application to manage and report progress. It superficially appeared to lock down development risk by imposing a structured, complex series of templates, procedures and sign-offs at all stages through initiation, analysis & design, development, testing and deployment. It required estimation and tracking of future resource spend (almost exclusively person-hours and delivery dates). However, its main failings were that it almost totally lacked the ability to track delivery of actual value improvements to a project's stakeholders, and the ability to react to changes in requirements and priority over the project's duration. The toolset generated lots of charts and stats that provided the illusion of risk control, but actually provided very little help to the analysts, developers and testers doing the work at the coal face.

Recognising that process improvements were needed at the level that really mattered, they looked for outside help to inject new ideas. That led to my first introduction to Tom Gilb, and the set of ideas, concepts and tools that he has collectively called "Competitive Engineering". During a 5-day course, Tom and his son Kai led us through the concepts of quantified performance objectives, impact estimation, specification quality control and evolutionary project management. These concepts seemed to crystallise and reinforce the vaguer "common-sense" methods I had picked up in my career to date, particularly going back to my first job working on real-time command and control systems in the defense industry.

Inspired by these ideas, I proceeded to successfully apply many of the ideas in specifications and projects thereafter. Since leaving the bank in 2007, I have continued to introduce and apply competitive engineering methods in many other organisations and projects, continually learning and trying to improve. The ideas documented in the Specification Cookbook are largely inspired by Gilb's work.

Competitive Engineering - The Book

In 2005, Gilb published "Competitive Engineering", subtitled "A Handbook For Systems Engineering, Requirements Engineering, And Software Engineering Using Planguage". The book provides a comprehensive description of how to apply Gilb's concepts to any software project. The focus is on practical guidance, tools and advice so that you can pick out individual ideas and apply them to your own working environment; it is more of an SDK than a formalised project management methodology. There is a lot of information packed into its pages; the emphasis is on useful content rather than unnecessary verbiage. Each chapter is presented with the same 10 subsections, making it straightforward to look up the information you want. The book also includes a large Glossary of concept definitions which is published with a fair-use license.

Chapter 1 - Planguage Concepts and Control

The book dives straight in with an overview of the fundamentals of competitive engineering, called the "Planguage" framework, based on the plan-do-study-act process cycle. It introduces the key concepts and defined terms (e.g. "rules", "definition", "requirement") in a table of basic Planguage parameters and concepts. It provides a powerful list of "12 Tough Questions" to help in controlling any kind of project risk, and to find out what people really know and don't know. Examples include:

  • Numbers - Why isn't the improvement quantified?
  • Doubt - Are you sure? If not, why not?
  • Evidence - How do you know it works that way? Did it 'ever'?
  • Proof - How can we be sure the plan is working, during the project, early?

A further treatment of this list is given in this paper (pdf).

Chapter 2 - Introduction to Requirements 'Why?'

This chapter provides an overview of system attributes (function, performance, workload capacity, resource, design) and the different types of requirement:

  • Vision - a leadership statement of the goals ("We will be #1 in XYZ in 3 years")
  • Function - what the system does
  • Performance - how well the system has to perform the Functions
  • Resource - targets and constraints on any "cost" associated with building or operating the system
  • Design Constraint - explicit unmoveable restriction regarding design choices
  • Condition Constraint - other conditions that must be fulfilled or not (e.g. legal, regulatory).

The chapter then proposes a series of Rules and examples governing the specification of requirements.

Chapter 3 - Functions 'What Systems Do'

Describes in much more detail how to think about what a system does (its "Functions"), emphasising the importance of separation of Function "What" from Design "How" and Performance "How well". For example in trading applications, Functions are what a trader would have to do to manage orders if they just had to use pen and paper (just like the good old days!). The chapter sets out principles of function specifications and several worked examples.
Note that most specs would stop here, if they got this far at all, and then dive into design ... and we are only on Chapter 3!

Chapter 4 - Performance 'How?'

One of the most interesting chapters of the book (worth the entrance fee on its own imho), this focuses on one of the key ideas behind Planguage: how to specify "how good" a given performance attribute should be, by specifying quantifiable scales of measure. Attempting to formally answer this question is central to the techniques that follow later in the book, such as impact estimation tables and Evo project management. Think of all the "ilities" that crop up in normal specs: "need better usability", "maintainability", "flexibility" ... yes, but what does that actually mean? How do those statements help to decide the designs we need to implement? This chapter introduces principles, rules and examples to help put numbers and target levels ("past", "goal", "stretch", "fail") on these attributes, forcing stakeholders to think about the level of quality improvement they require for a given set of resources being allocated to the project (money, effort etc).
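To give a flavour of what a quantified performance requirement might look like, here is a minimal Python sketch. It is my own illustration, not Planguage notation from the book: the class, its field layout and the usability figures are assumptions, with only the level names ("past", "fail", "goal", "stretch") taken from the chapter.

```python
from dataclasses import dataclass

@dataclass
class PerformanceRequirement:
    name: str
    scale: str      # the scale of measure that makes the attribute quantifiable
    past: float     # baseline: the level achieved today or historically
    fail: float     # a result no better than this counts as a failure
    goal: float     # the level the stakeholders are committing resources to reach
    stretch: float  # a more ambitious level, not committed

    def progress(self, measured: float) -> float:
        """Percentage of the way from the past level to the goal level."""
        return 100.0 * (measured - self.past) / (self.goal - self.past)

# Hypothetical example: lower is better on this scale.
usability = PerformanceRequirement(
    name="Usability",
    scale="minutes for a new user to enter their first order",
    past=30, fail=20, goal=10, stretch=5,
)
print(f"{usability.progress(15):.0f}% of the way to the goal")
```

Once "better usability" is written down like this, a measured result of 15 minutes can be discussed as progress towards an agreed target rather than as a matter of opinion.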

Chapter 5 - Scales of Measure 'How to Quantify'

Putting numbers to seemingly amorphous concepts such as "usability" and "flexibility" is hard, and so this chapter follows on from the introduction in Chapter 4 to present concrete analysis techniques and ideas that help you come up with useful scales of measure, which can then be used to guide the project deliverables during the development lifecycle. Even better, Tom Gilb has made this chapter freely available on his website!

Chapter 6 - Resources, Budgets and Costs 'Costs of Solutions'

This chapter covers the analysis of resources, resource requirements ("budgets") and costs. It emphasises the benefit of having quantified performance *and* resource requirements in maintaining control over costs (using tools like Evo and Impact Estimation tables). The approach is to design to cost: select designs that fit within the committed budgets, then use the Evo method to ensure the maximum "benefit" is delivered for the resources spent, improving cost estimates by learning during early, frequent benefit deliveries.

Chapter 7 - Design ideas and Engineering 'How to Solve the "Requirements Problem" '

This chapter argues for Design Engineering: the systematic evaluation of designs against all relevant stakeholder function, performance and resource requirements. To make the initial point it includes a version of the classic swing-design cartoon! As does the rest of the book, it provides a lot of practical ideas, such as suggestions on how much information to add to a Design Specification at the Idea / evaluation stage vs the detailed design carried out in Evo steps (see Figure 7.3 in the chapter), and a process by which to identify and evaluate design ideas. The chapter finishes with an interesting treatment of evaluating "priority", and why using Planguage concepts can provide much higher levels of objective control over priority setting.

Chapter 8 - Specification Quality Control 'How to know how well you specified'

This chapter almost stands on its own as a method for controlling the quality of any typical project output (any document or artifact), but it focuses on specification quality. SQC is a method of engineering process control through sampling measurement of specification quality. The chapter presents "full" and "lite" variants of the inspection process. If you recognise the classic "cost curve of bug fixing", then SQC immediately justifies its upfront costs by systematically reducing defects even before coding has started! By insisting that Rules for writing specs are published, and then inspecting only a sample of the full document (e.g. 1 page, at 300 non-commentary words per page) against those Rules, you can efficiently estimate the number of Defects across the whole document. This chapter provides a lot of practical guidance on this process, including templates for inspection forms. Tom Gilb has previously published a book dedicated to software inspections.
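The arithmetic behind the sampling idea can be sketched in a few lines of Python. This is a plain linear extrapolation of my own, assuming the sampled pages are representative of the whole document; it is not the book's full procedure, and the figures are hypothetical.

```python
def estimate_total_defects(defects_in_sample, pages_sampled, total_pages):
    """Extrapolate the defect density found in a sample to the whole document."""
    if pages_sampled <= 0:
        raise ValueError("must sample at least one page")
    density = defects_in_sample / pages_sampled  # defects per page in the sample
    return density * total_pages

# Hypothetical: 12 Rule violations found on 1 sampled page of a 40-page spec.
estimated = estimate_total_defects(defects_in_sample=12, pages_sampled=1, total_pages=40)
print(f"Estimated defects in the whole document: {estimated:.0f}")
```

Even a crude number like this is usually enough to decide whether a spec is fit to be handed on to design and coding.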

Chapter 9 - Impact Estimation 'How to understand strategies'

This chapter presents one of the key risk control tools in the Planguage competitive engineering method: the Impact Estimation (or "IE") table. An IE table simply has a list of design ideas along the top, and a list of stakeholder performance objectives and resource costs down the side. At each intersection of design idea and objective, you write the estimated improvement towards the goal, or the estimated spend of resource, in quantitative terms. Summing these improvements / costs over all the objectives gives a feel for how "valuable" each design idea is; summing the same estimates across all design ideas shows whether you have thought of sufficient solutions to achieve the goals. By dividing the sum of the improvement benefits by the sum of the related resource spend, you get a rough benefit/cost ratio for each design idea, giving key insights into the cost vs benefit of multiple competing design ideas against your key stakeholder objectives and resource budgets. Writing this in a table enables a more objective, evidence-based approach to decision-making, one that crucially accounts for both the costs (the usual focus of IT project management!) and the benefits to the stakeholders, which are often missed completely or poorly understood.
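The sums described above are simple enough to sketch in Python. The design names, objectives and percentages below are invented for illustration; impacts are expressed as estimated percentage progress towards each goal, and costs as a percentage of each budget.

```python
# Hypothetical IE table: one entry per design idea.
impacts = {  # estimated % improvement towards each performance goal
    "Design A": {"Usability": 50, "Throughput": 10},
    "Design B": {"Usability": 20, "Throughput": 60},
}
costs = {    # estimated % of each resource budget consumed
    "Design A": {"Budget": 20, "Effort": 10},
    "Design B": {"Budget": 40, "Effort": 40},
}

def benefit_cost(design):
    """Sum one column of the table and return (benefit, cost, ratio)."""
    benefit = sum(impacts[design].values())
    cost = sum(costs[design].values())
    return benefit, cost, benefit / cost

def coverage(objective):
    """Sum one row of the table: do all the designs together reach the goal?"""
    return sum(column[objective] for column in impacts.values())

for design in impacts:
    benefit, cost, ratio = benefit_cost(design)
    print(f"{design}: benefit {benefit}, cost {cost}, ratio {ratio:.1f}")
for objective in ("Usability", "Throughput"):
    print(f"{objective}: {coverage(objective)}% of goal covered by all designs")
```

In this made-up case Design B delivers more benefit in total, but Design A delivers twice as much per unit of resource, and neither objective is yet fully covered, which suggests more design ideas are needed.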

Chapter 10 - Evolutionary Project Management 'How to Manage Project Benefits and Costs'

Finally, Chapter 10 presents a summary of the Evo project management methodology, which ties together the concepts in the rest of the book. Based on the well-known Plan-Do-Study-Act cycle, Evo demands:

  • Early delivery of results to stakeholders
  • Frequent delivery of results to stakeholders
  • Small increments (or 'steps')
  • Useful-to-stakeholder steps
  • Sequencing of steps according to degree of stakeholder benefit, preferably the most profitable first.

Once you have your function requirements and quantified performance objectives, and design ideas evaluated using IE tables, you embark on a series of short Evo steps, each of which rapidly goes through design, development, testing and implementation to get early feedback from the stakeholders themselves. The next Evo step is then adjusted (maybe you re-do the step, or select a different step now that you have learnt more by trying to deliver the current one). During each step, you are actually *encouraged* to go back and re-write any of the previous analysis and design specs as further stakeholder feedback is received.


The last third of the book is a comprehensive formal glossary and set of definitions for the Planguage concepts; I think of it as a domain-specific language for business analysis and project management. It is published under a separate permissive license which encourages the wider use of the ideas it contains. A live version is also available on Gilb's site.


The proof is in the pudding; I have used Evo (albeit in disguise sometimes) on two large, high-risk projects in front-office investment banking businesses, and several smaller tasks. On the largest critical project, the original business functions & performance objective requirements document, which included no design, remained essentially unchanged over the 14 months the project took to deliver, but the detailed designs (of the GUI, business logic, performance characteristics) changed many, many times, guided by lessons learnt and feedback gained from a succession of early deliveries to real users. In the end, the new system, responsible for tens of billions of USD of notional risk, successfully went live over one weekend for 800 users worldwide, and was seen as a big success by the sponsoring stakeholders.

I would recommend this book to any business analyst or project manager working to deliver complex IT systems. Further material is available on Gilb's website.


ZeroMQ Sockets

27 Aug 2011 16:20
Tags: zeromq

ZeroMQ provides a scalability layer between application endpoints. It implements several different message topologies, provided by a series of Socket Types, each with its own well-defined function and behaviour. This article summarises the behaviour and function of each Socket Type, and the most typical Socket combinations, valid in ZeroMQ 2.1.x.

ZeroMQ Socket Types

Socket | Topology | Send/Receive | Incoming Routing / Msg Action | Outgoing Routing / Msg Action | HWM Action
REQ | Request-Reply | Send, Receive, Send, … | Last peer / Removes empty part | Load balanced / Prepends empty part, queue unsent msgs |
REP | Request-Reply | Receive, Send, Receive, … | Fair-queued / Retains message parts up to 1st empty part | Last peer / Prepends retained parts, queue unsent msgs |
DEALER | Request-Reply | Unrestricted | Fair-queued / No changes made | Load-balanced / No changes made, queue unsent msgs |
ROUTER | Request-Reply | Unrestricted | Fair-queued / Prepends reply address to message | Addressed peer / Strips 1st message part, drops message if identity not known or peer not available |
PUB | Pub-Sub | Send only | n/a | Fan out / No changes made, drops message if no subscribers |
SUB | Pub-Sub | Receive only | Fair-queued, optional filtering / No changes made | n/a | Drop
PUSH | Pipeline | Send only | n/a | Load balanced / No changes made, queue unsent msgs |
PULL | Pipeline | Receive only | Fair-queued / No changes made | n/a | n/a
PAIR | Exclusive Pair | Unrestricted | n/a | Queue unsent msgs | Block


  • Fair-queued means the socket gets input from all sockets connected to it in turn.
  • Load-balanced means the socket sends messages to all sockets connected to it in turn.
  • Last peer means the socket sends the message to the last socket that sent it a message
  • Fan-out means message is sent to all connected peers simultaneously
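The routing terms defined above can be modelled in a few lines of plain Python. This is only a toy model of the semantics, with function names of my own choosing; it does not use or represent the ZeroMQ API.

```python
from collections import deque
from itertools import cycle

def load_balance(messages, peers):
    """Outgoing: send each message to the next connected peer in turn."""
    targets = cycle(peers)
    return [(next(targets), msg) for msg in messages]

def fair_queue(peer_queues):
    """Incoming: take one message from each connected peer in turn until all are drained."""
    queues = [deque(q) for q in peer_queues]
    received = []
    while any(queues):          # an empty deque is falsy
        for q in queues:
            if q:
                received.append(q.popleft())
    return received

def fan_out(message, peers):
    """Outgoing: deliver a copy of the message to every connected peer."""
    return [(peer, message) for peer in peers]

print(load_balance(["m1", "m2", "m3"], ["A", "B"]))
print(fair_queue([["a1", "a2", "a3"], ["b1"]]))
print(fan_out("tick", ["A", "B"]))
```

Fair-queuing is what stops one busy peer from starving the others: in the second call, peer B's single message is received before peer A's backlog is drained.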

ZeroMQ Socket Combinations

ZeroMQ only supports the following Socket combinations in a connect-bind pair:

Combination | Examples from the ZeroMQ ZGuide
PUB and SUB | Uni-directional message broadcast from Publisher to Subscriber(s); Publish-Subscribe Proxy Server; Slow Subscribers (Suicidal Snail Pattern); High-speed Subscribers (Black Box Pattern); Reliable Pub-Sub shared key-value system (Clone Pattern)
REQ and REP | Sequential message request-reply between two Endpoints (e.g. client to server)
REQ and ROUTER | Frontend broker connection to multiple clients; Req-rep broker connection to multiple workers using custom routing (LRU)
DEALER and REP | Simple synchronous (one request, one reply) broker connection to one or more separate worker processes using load-balancing DEALER socket
DEALER and ROUTER | Custom 1..n request routing from a single client to multiple servers; Asynchronous n..1 request routing to a single server
DEALER and DEALER | Backend asynchronous (one request, many replies) broker connection to multiple workers
ROUTER and ROUTER | Asynchronous brokerless reliability (Freelance Pattern)
PUSH and PULL | Task ventilator connected to a pool of workers that forward results to a task sink; Parallel Pipeline with kill signalling; Pipeline streamer device
PAIR and PAIR | Signalling between Threads
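The list of supported combinations is easy to encode as a lookup, which can be handy as a sanity check in configuration code. The sketch below simply transcribes the table above; the `is_supported` helper is hypothetical and is not part of ZeroMQ.

```python
# Supported connect-bind pairs, transcribed from the table above (ZeroMQ 2.1.x).
SUPPORTED_PAIRS = {
    frozenset(pair)
    for pair in [
        ("PUB", "SUB"),
        ("REQ", "REP"),
        ("REQ", "ROUTER"),
        ("DEALER", "REP"),
        ("DEALER", "ROUTER"),
        ("DEALER", "DEALER"),
        ("ROUTER", "ROUTER"),
        ("PUSH", "PULL"),
        ("PAIR", "PAIR"),
    ]
}

def is_supported(socket_a, socket_b):
    """Return True if the two socket types appear together in the table; order is ignored."""
    return frozenset((socket_a.upper(), socket_b.upper())) in SUPPORTED_PAIRS

print(is_supported("req", "rep"))    # listed in the table
print(is_supported("PUSH", "SUB"))   # not listed in the table
```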


Messaging #5 - Duplicate Messages

27 Aug 2011 11:07

This is the 5th in a series of articles about SOA (Service-Orientated Architecture) messaging systems.

Duplicate Messages

Every Component should expect to receive duplicate messages from the message system at any time. Duplicates can arise from:

  • A technical failure in a publishing Component, or a delivery failure and attempted recovery event by the messaging infrastructure itself.
  • An external (with respect to the system) message source (e.g. an external FIX endpoint at a customer), outside the control of the receiving System.

No messaging architecture can guarantee duplicate-free message delivery under all circumstances, particularly the kinds of architectures required for low-latency, high-volume distributed processing. Duplicate message handling must be considered as an integral part of a Service Component’s design.

Duplicate messages can be either:

  • Explicitly flagged as possible duplicate messages. For example, 29West sets a flag on a message if it has been delivered by the persistence store (which acts as a proxy to the original publisher). Another potential example is during a message replay during a FIX session re-connection event, where previously sent messages are re-sent by the FIX client.
  • Unknown duplicates. The message initially appears to be a unique, new event. Receipt and subsequent detection of such messages usually indicates a major failure in the messaging layer somewhere.

Functional and technical analysis should define the strategies for detecting and handling confirmed and suspected duplicates. Care should be taken on balancing the requirements for low-latency processing (e.g. time to update a cached currency position) with the required accuracy and integrity of the position data itself. Implementation strategies may include:

  • Use of UUIDs or other unambiguous data matching techniques
  • Heuristic pattern matching of message data with existing records
  • Optimistic detection, whereby unflagged messages are initially assumed to be unique, thus ensuring the fastest possible processing latency, followed by subsequent duplicate detection that may result in the effect of the message being unwound in the (assumed rare) cases that a duplicate has been received.
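As a minimal sketch of how the UUID and optimistic strategies might fit together, consider a cached currency position. The `PositionCache` class, its method names and the message shape are all hypothetical; a real Component would also need persistence and a bound on how long message ids are remembered.

```python
import uuid

class PositionCache:
    """Applies messages optimistically, then unwinds any duplicates found later."""

    def __init__(self):
        self.position = 0
        self.seen_ids = set()   # ids of messages confirmed as applied exactly once
        self.unchecked = []     # (msg_id, amount) applied but not yet verified

    def on_message(self, msg_id, amount, flagged_duplicate=False):
        if flagged_duplicate:
            # Explicitly flagged: pay for the lookup before touching the position.
            if msg_id in self.seen_ids:
                return
            self.seen_ids.add(msg_id)
            self.position += amount
            return
        # Unflagged: assume it is unique and apply at once for lowest latency.
        self.position += amount
        self.unchecked.append((msg_id, amount))

    def reconcile(self):
        """Off the fast path: detect duplicates and unwind their effect."""
        unwound = 0
        for msg_id, amount in self.unchecked:
            if msg_id in self.seen_ids:
                self.position -= amount   # unwind the duplicate
                unwound += 1
            else:
                self.seen_ids.add(msg_id)
        self.unchecked.clear()
        return unwound

cache = PositionCache()
trade_id = uuid.uuid4()
cache.on_message(trade_id, 100)
cache.on_message(trade_id, 100)   # unknown duplicate, not flagged
print(cache.position)             # temporarily overstated
print(cache.reconcile(), cache.position)
```

The trade-off is visible in the output: the position is briefly wrong after the duplicate arrives, in exchange for never doing a lookup on the low-latency path.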


Messaging #4 - Resiliency Groups

20 Aug 2011 14:54
Tags: messaging

This is the fourth article in a series about designing messaging architectures.

Resiliency Groups

To understand what we mean by “resiliency” or its cousin “reliability”, we need to look at their alter ego: “failure”. We can consider failure originating from:

  • Application code can throw unhandled exceptions, crash, run out of memory or other resources, or get into a deadlock or infinite loop and therefore stop responding to input. This is the most likely source of failure, as the application code is less likely to be shared across other systems.
  • System code, such as the messaging middleware, can die, or even run out of memory (e.g. caused by slow subscribers). This should happen less often than in application code, if only because other systems will probably be using the same middleware, so bugs are more likely to be found and fixed, and the fixes can be applied to other deployments.
  • Message queues can overflow
  • Network transmission can fail temporarily, causing intermittent message loss.
  • Hardware can also fail: servers, network switches to entire data centers.

For the purposes of this article, we will look at how to mitigate the failure of an individual Service Component instance. A key insight is that the strategies required to mitigate each of the failure modes listed above must be considered in the design from day 1. Wherever possible, the implementation of messaging resiliency design should be moved away from application code into the “Bus” layer or the middleware system, so that a common set of strategies can be applied across all Components.

Components in a SOA architecture are defined by the Service(s) that each type provides. If we assume that the Services are only provided by Component instances of the same type (i.e. with the same application code), the resiliency strategy for those instances needs to be considered. Additionally, scalability needs to be factored into the design: how many instances of the same Component are required in the local / global system to fulfil the performance requirements for the Service?

One approach is to define, by configuration (see above), one or more Resiliency Groups for each Component type. A Resiliency Group ensures that exactly one instance of the Component within the Group, known as the “Primary” (P), will receive and process a given message sent to the Service on the Logical Message Bus. Backup instances (see below) are denoted Secondary (S), optionally with Disaster Recovery (D) instances that run on dedicated DR-ready infrastructure.

Scalability is provided by splitting the message workload across multiple Resiliency Groups (which may, for example, be located in different WAN regions). The most efficient way to achieve this is likely to be topic subscription filtering, based on information in the topic name itself, so that the horizontal scaling is mediated by the messaging infrastructure and Logical Message Bus rather than by application code.
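
To make the idea of topic-based partitioning concrete, here is a small Python sketch. The topic naming scheme (service.region.instrument), the group names and the wildcard syntax are all assumptions made for illustration; real middleware products each have their own subscription filter syntax.

```python
from fnmatch import fnmatchcase

# Hypothetical configuration: each Resiliency Group subscribes to one topic pattern.
GROUP_SUBSCRIPTIONS = {
    "pricing-emea": "pricing.EMEA.*",
    "pricing-apac": "pricing.APAC.*",
}


def groups_for_topic(topic, subscriptions=GROUP_SUBSCRIPTIONS):
    """Return the Resiliency Groups whose subscription pattern matches the topic."""
    return [group for group, pattern in subscriptions.items()
            if fnmatchcase(topic, pattern)]


emea_groups = groups_for_topic("pricing.EMEA.EURUSD")
unrouted = groups_for_topic("pricing.AMER.USDCAD")
```

Because the region is encoded in the topic name, adding a new region is a configuration change to the subscription table rather than a change to application code.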

Primary Instances within the Group will receive messages from the Logical Message Bus. If there is more than one Primary, the messaging layer can route messages to different Primary instances using an appropriate routing logic (LRU, round-robin etc).
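
A round-robin router across the active Primary instances might look something like the following Python sketch. The RoundRobinRouter class and the instance names are hypothetical; in practice this logic would normally live in the messaging layer rather than be hand-written.

```python
class RoundRobinRouter:
    """Distributes messages across the active Primary instances in turn."""

    def __init__(self, primaries):
        self._primaries = list(primaries)
        self._next = 0

    def route(self, message):
        """Return the Primary instance that should process this message."""
        if not self._primaries:
            raise RuntimeError("no active Primary instances in the Group")
        target = self._primaries[self._next % len(self._primaries)]
        self._next += 1
        return target

    def remove(self, instance):
        """Drop a failed Primary so that it no longer receives messages."""
        self._primaries.remove(instance)


router = RoundRobinRouter(["P1", "P2"])
targets = [router.route(f"msg-{n}") for n in range(4)]
router.remove("P1")
after_failure = router.route("msg-4")
```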

Secondary & DR Instances can run in hot standby (receiving messages, processing them to update internal state, but not generating any output or effect), warm standby (initialized and running, but not receiving messages), or cold standby (instance not running unless required).

Primary and Secondary / DR instances can then be composed into Resiliency Groups, with the Group behaviour mediated by the software layer forming the Logical Message Bus on top of the messaging infrastructure. This layer would use heartbeating or similar to observe component failures and ensure the appropriate role change occurs to a backup instance in the Group:

  • PPD (or PPPPD!). Two or more Primary instances, with a load balancing function to distribute the message load across the active Primary instances. The DR instance would only become active and start processing messages if no Primary instances were running in the Group.
  • PSD. One Primary instance processes all messages; the S and D instances run in warm standby. If the Primary instance should fail, the Secondary instance will detect this and take over the Primary role. If the Secondary instance should also fail, the DR instance would take over. If the previously failed instance should restart successfully, it would join as a new Secondary instance.

The role handover used in PSD depends on whether the Service(s) provided by the Component are stateless or not. If they are stateless, the new Primary instance can go straight ahead and process the next message. If message processing is stateful, the new instance must synchronize its internal state (e.g. from a persistent store or other Service) before it starts processing the message stream.
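
The PSD handover can be sketched in Python as follows. This is only an illustration under stated assumptions: the Instance and ResiliencyGroup classes, the heartbeat timeout and the sync_state callback are hypothetical names, and timestamps are passed in explicitly so that the example is deterministic.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    role: str              # "P", "S", "D" or "failed"
    last_heartbeat: float  # time of the most recent heartbeat


class ResiliencyGroup:
    def __init__(self, instances, timeout, stateful, sync_state=lambda inst: None):
        self.instances = {inst.name: inst for inst in instances}
        self.timeout = timeout        # allowed gap between heartbeats
        self.stateful = stateful
        self.sync_state = sync_state  # assumed hook, e.g. reload from a persistent store

    def heartbeat(self, name, now):
        self.instances[name].last_heartbeat = now

    def rejoin(self, name, now):
        """A restarted instance joins as a new Secondary."""
        inst = self.instances[name]
        inst.role = "S"
        inst.last_heartbeat = now

    def _alive(self, inst, now):
        return now - inst.last_heartbeat <= self.timeout

    def check(self, now):
        """Return the name of the current Primary, promoting a backup if needed."""
        current = next((i for i in self.instances.values() if i.role == "P"), None)
        if current is not None and self._alive(current, now):
            return current.name
        if current is not None:
            current.role = "failed"
        for role in ("S", "D"):  # the Secondary is preferred over the DR instance
            for inst in self.instances.values():
                if inst.role == role and self._alive(inst, now):
                    if self.stateful:
                        self.sync_state(inst)  # synchronize before taking messages
                    inst.role = "P"
                    return inst.name
        return None


synced = []
group = ResiliencyGroup(
    [Instance("a", "P", 0.0), Instance("b", "S", 0.0), Instance("c", "D", 0.0)],
    timeout=3.0, stateful=True, sync_state=lambda inst: synced.append(inst.name))
group.heartbeat("b", 5.0)
group.heartbeat("c", 5.0)
new_primary = group.check(5.0)   # "a" has missed its heartbeats
```

For a stateless Service the same code applies with stateful=False, in which case the promoted instance starts processing immediately.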


Messaging #3 - Logical Data Bus

07 Aug 2011 22:19
Tags: messaging

This is the third in the series of messaging-related articles.

Build a Logical Message Bus

The messaging infrastructure supports the creation of a messaging topology and provides messaging-level services (guaranteed delivery, messaging resilience, WAN bridging etc). However, there is still a significant functional gap between that and the application domain-level code that developers and the business wish to focus on.

A shared common layer can be built that abstracts the messaging infrastructure from Service Component application code, and provides further services:

  • Topology configuration (e.g. converting message topology configuration into behaviour and state).
  • Topic building (e.g. using a Builder Pattern implementation)
  • Component resiliency (warm standby, round-robin)
  • Inter-component heartbeating
  • Test injection / mocking for unit testing
  • Message performance monitoring
  • Presents a messaging API to application code that abstracts the implementation of the messaging layer, allowing heterogeneous messaging architectures to be used to extract maximum performance.
  • Implements the messaging-level protocol(s) used for message creation and bundling (the actual wire byte protocol would be determined by the messaging infrastructure itself).
  • Provides a common point of control for messaging functionality across all components.

The combination of these services effectively creates a logical message bus (although the underlying messaging topology may be one or more of many different types and implementations, and not formally recognized as a “bus” at all).
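
A very small Python sketch of such a layer is shown below, covering only three of the services listed: the messaging API abstraction, topic building and test injection. All names here (Transport, InMemoryTransport, TopicBuilder, MessageBus) are hypothetical, and the dot-separated topic format is an assumption.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class Transport(ABC):
    """What the bus needs from any underlying messaging infrastructure."""

    @abstractmethod
    def publish(self, topic, payload): ...

    @abstractmethod
    def subscribe(self, topic, handler): ...


class InMemoryTransport(Transport):
    """Test double that can be injected in place of real middleware."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def publish(self, topic, payload):
        for handler in list(self._handlers[topic]):
            handler(topic, payload)

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)


class TopicBuilder:
    """Builder Pattern for topic names, so components never concatenate strings."""

    def __init__(self):
        self._parts = []

    def service(self, name):
        self._parts.append(name)
        return self

    def region(self, name):
        self._parts.append(name)
        return self

    def build(self):
        return ".".join(self._parts)


class MessageBus:
    """The API seen by application code; the transport stays hidden."""

    def __init__(self, transport):
        self._transport = transport
        self.sent = 0  # simple counter as a hook for performance monitoring

    def send(self, topic, text):
        self._transport.publish(topic, text.encode("utf-8"))
        self.sent += 1

    def on(self, topic, handler):
        self._transport.subscribe(
            topic, lambda _topic, payload: handler(payload.decode("utf-8")))


bus = MessageBus(InMemoryTransport())
topic = TopicBuilder().service("pricing").region("EMEA").build()
received = []
bus.on(topic, received.append)
bus.send(topic, "EURUSD 1.08")
```

Swapping InMemoryTransport for an adapter over a real messaging product would leave the application code above unchanged, which is the point of the abstraction.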

This shared code becomes a critical part of meeting the System’s performance objectives and, as such, requires the oversight of senior technical development personnel in its design and development.

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License