Tom Gilb's Competitive Engineering

03 Sep 2011 09:46
Tags: agile evo gilb

Background

Back in 2004, I was employed by a large investment bank in their FX e-commerce IT department as a business analyst. The wider IT organisation used a complex waterfall-based project methodology that required use of an intranet application to manage and report progress. It superficially appeared to lock down development risk by imposing a structured, complex series of templates, procedures and sign-offs at all stages through initiation, analysis & design, development, testing and deployment. It required estimation and tracking of future resource spend (almost exclusively person-hours and delivery dates). However, its main failings were that it almost totally lacked the ability to track delivery of actual value improvements to a project's stakeholders, and the ability to react to changes in requirements and priority over the project's duration. The toolset generated lots of charts and stats that provided the illusion of risk control, but actually provided very little help to the analysts, developers and testers doing the work at the coal face.

Recognising that process improvements were needed at the level that really mattered, the bank looked for outside help to inject new ideas. That led to my first introduction to Tom Gilb, and the set of ideas, concepts and tools that he has collectively called "Competitive Engineering". During a 5-day course, Tom and his son Kai led us through the concepts of quantified performance objectives, impact estimation, specification quality control and evolutionary project management. These concepts seemed to crystallise and reinforce the vaguer "common-sense" methods I had picked up in my career to date, particularly going back to my first job working on real-time command and control systems in the defence industry.

Inspired by these ideas, I went on to apply many of them successfully in specifications and projects thereafter. Since leaving the bank in 2007, I have continued to introduce and apply competitive engineering methods in many other organisations and projects, continually learning and trying to improve. The ideas documented in the Specification Cookbook are largely inspired by Gilb's work.

Competitive Engineering - The Book

In 2005, Gilb published "Competitive Engineering", subtitled "A Handbook For Systems Engineering, Requirements Engineering, And Software Engineering Using Planguage". The book provides a comprehensive description of how to apply Gilb's concepts to any software project. The focus is on practical guidance, tools and advice, so that you can pick out individual ideas and apply them to your own working environment; it is more of an SDK than a formalised project management methodology. There is a lot of information packed into its pages; the emphasis is on useful content rather than unnecessary verbiage. Each chapter is presented with the same 10 subsections, making it straightforward to look up the information you want. The book also includes a large Glossary of concept definitions, which is published with a fair-use license.

Chapter 1 - Planguage Concepts and Control

The book dives straight in with an overview of the fundamentals of competitive engineering, called the "Planguage" framework, based on the plan-do-study-act process cycle. It introduces the key concepts and defined terms (e.g. "rules", "definition", "requirement") in a table of basic Planguage parameters and concepts. It also provides a powerful list of "12 Tough Questions" to help control any kind of project risk by finding out what people really know, and don't know. Examples include:

  • Numbers - Why isn't the improvement quantified?
  • Doubt - Are you sure? If not, why not?
  • Evidence - How do you know it works that way? Did it 'ever'?
  • Proof - How can we be sure the plan is working, during the project, early?

A further treatment of this list is given in this paper (pdf): www.gilb.com/tiki-download_file.php?fileId=24

Chapter 2 - Introduction to Requirements 'Why?'

This chapter provides an overview of system attributes (function, performance, workload capacity, resource, design) and the different types of requirement:

  • Vision - a leadership statement of the goals ("We will be #1 in XYZ in 3 years")
  • Function - what the system does
  • Performance - how well the system has to perform the Functions
  • Resource - targets and constraints on any "cost" associated with building or operating the system
  • Design Constraint - explicit unmoveable restriction regarding design choices
  • Condition Constraint - other conditions that must be fulfilled or not (e.g. legal, regulatory).

The chapter then proposes a series of Rules, with examples, governing the specification of requirements.

Chapter 3 - Functions 'What Systems Do'

Describes in much more detail how to think about what a system does (its "Functions"), emphasising the importance of separation of Function "What" from Design "How" and Performance "How well". For example in trading applications, Functions are what a trader would have to do to manage orders if they just had to use pen and paper (just like the good old days!). The chapter sets out principles of function specifications and several worked examples.
Note that most specs would stop here, if at all, and then dive into design .. and we are only on chapter 3!

Chapter 4 - Performance 'How?'

One of the most interesting chapters of the book (worth the entrance fee on its own imho), this focuses on one of the key ideas behind Planguage: how to specify "how good" a given performance attribute should be, by specifying quantifiable scales of measure. Attempting to answer this question formally is central to the other techniques, impact estimation tables and Evo project management, that follow later in the book. Think of all the "ilities" that crop up in normal specs: "need better usability", "maintainability", "flexibility" …. yes, but what does that actually mean? How do those statements help to decide the designs we need to implement? This chapter introduces principles, rules and examples to help put numbers and target levels ("past", "goal", "stretch", "fail") on such attributes, forcing stakeholders to think about the level of quality improvement they require for a given set of resources being allocated to the project (money, effort etc).
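To make the idea concrete, here is a minimal Python sketch of a quantified performance requirement. Only the level names ("past", "goal", "stretch", "fail") come from the text above; the `PerformanceRequirement` class, the usability scale and all of the numbers are invented for illustration and are not Gilb's notation.

```python
from dataclasses import dataclass


@dataclass
class PerformanceRequirement:
    """A quantified performance attribute (illustrative, not Planguage syntax)."""
    name: str
    scale: str      # what is measured, and in which units
    past: float     # level achieved by the current system
    fail: float     # anything worse than this is unacceptable
    goal: float     # level the stakeholders are paying for
    stretch: float  # level worth aiming at if resources allow

    def assess(self, measured: float) -> str:
        """Classify a measurement, assuming lower values are better."""
        if measured <= self.stretch:
            return "stretch met"
        if measured <= self.goal:
            return "goal met"
        if measured <= self.fail:
            return "acceptable, goal not met"
        return "fail"


# Hypothetical requirement: all figures are made up for the example.
usability = PerformanceRequirement(
    name="Usability.OrderEntry",
    scale="seconds for a trained trader to enter one order",
    past=45.0, fail=30.0, goal=15.0, stretch=10.0,
)
print(usability.assess(12.0))
```

Even a toy like this shows why "need better usability" is not enough: until a scale and target levels are written down, there is nothing to measure a delivery against.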

Chapter 5 - Scales of Measure 'How to Quantify'

Putting numbers to such seemingly amorphous concepts as "usability" and "flexibility" is hard, so this chapter follows on from the introduction in Chapter 4 to present concrete analysis techniques and ideas that help you come up with useful scales of measure, which can then be used to guide the project deliverables during the development lifecycle. Even better, Tom Gilb has made this chapter freely available on his website!

Chapter 6 - Resources, Budgets and Costs 'Costs of Solutions'

This chapter covers the analysis of resources, resource requirements ("budgets") and costs. It emphasises the benefit of having quantified performance *and* resource requirements in maintaining control over costs (using tools like Evo and Impact Estimation tables). The approach is to design to cost: select designs that fit within the committed budgets, then use the Evo method to ensure the maximum "benefit" is delivered for the resources spent, improving cost estimates by learning from early, frequent benefit deliveries.

Chapter 7 - Design ideas and Engineering 'How to Solve the "Requirements Problem" '

This chapter argues for Design Engineering: the systematic evaluation of designs against all relevant stakeholder function, performance and resource requirements. To make the initial point it includes a version of the classic swing-design cartoon! As does the rest of the book, it provides a lot of practical ideas, such as suggestions on how much information to add to a Design Specification at the Idea / evaluation stage vs the detailed design carried out in Evo steps (see Figure 7.3 in the chapter), and a process by which to identify and evaluate design ideas. The chapter finishes with an interesting treatment of evaluating "priority", and why using Planguage concepts can provide much higher levels of objective control over priority setting.

Chapter 8 - Specification Quality Control 'How to know how well you specified'

This chapter almost stands on its own as a method for controlling the quality of any typical output of a project (any document or artifact), but it is focussed on specification quality. SQC is a method of engineering process control through sampling measurement of specification quality. It presents "full" and "lite" variants of the inspection process. If you recognise the classic "cost curve of bug fixing", then SQC immediately justifies its upfront costs by systematically reducing defects even before coding has started! By insisting that Rules for writing specs are published, and then inspecting only a sample of the full document (e.g. 1 page of 300 non-commentary words per page) against those Rules, you can efficiently estimate the number of Defects across the whole document. This chapter provides a lot of practical guidance on this process, including templates for inspection forms. Tom Gilb has previously published a book dedicated to software inspections.
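The sampling arithmetic can be sketched in a few lines of Python. This is a plain linear extrapolation from the sampled pages to the whole document; the function name and the figures are my own, and the full process in the book may apply further adjustments that are not modelled here.

```python
def estimate_defects(defects_in_sample: int, pages_sampled: float,
                     total_pages: float) -> float:
    """Extrapolate the defect count of a whole document from a sample.

    Assumes the sampled pages are representative of the rest.
    """
    if pages_sampled <= 0:
        raise ValueError("at least part of a page must be sampled")
    defects_per_page = defects_in_sample / pages_sampled
    return defects_per_page * total_pages


# Hypothetical inspection: 12 rule violations found on 1 sampled page
# of a 40-page specification.
print(estimate_defects(12, 1, 40))
```

The point of the exercise is the order of magnitude: a number like this, obtained from an hour's inspection of a single page, is usually enough to decide whether the spec is fit to release to developers.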

Chapter 9 - Impact Estimation 'How to understand strategies'

This chapter presents one of the key risk control tools in the Planguage competitive engineering method: the Impact Estimation (or "IE") table. An IE table simply has a list of design ideas along the top, and a list of stakeholder performance objectives and resource costs down the side. At each intersection of design idea and objective, you write the estimated improvement towards the goal, or the estimated spend of resource, in quantitative terms. Summing these improvements and costs across all the objectives gives a feel for how "valuable" a design idea is; summing the same estimates across all design ideas shows whether you have thought of sufficient solutions to achieve the goals. By dividing the sum of the improvement benefits by the sum of the related resource spend, you get a rough benefit/cost ratio for each design idea, giving key insights into the cost vs benefit of multiple competing design ideas against your key stakeholder objectives and resource budgets. Writing this in a table enables a more objective, evidence-based approach to decision-making that crucially accounts for both the costs (the usual focus of IT project management!) and the benefits to the stakeholders, which are often missed completely or poorly understood.
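A tiny worked example may help. The sketch below holds an IE table as a Python dictionary and computes the two sums and the ratio described above. The design ideas, objectives and percentages are all invented; a real table would also record the evidence and uncertainty behind each estimate.

```python
# Estimated impact of each design idea, in percent. For objectives this is
# the share of the gap to the goal that the idea closes; for resources it
# is the share of the budget that the idea consumes. All figures invented.
IMPACTS = {
    "Cache prices locally":  {"Latency": 60, "Usability": 0,  "Dev effort": 20},
    "Redesign order ticket": {"Latency": 10, "Usability": 70, "Dev effort": 50},
}
OBJECTIVES = ("Latency", "Usability")
RESOURCES = ("Dev effort",)


def benefit_to_cost(design):
    """Sum of estimated improvements divided by sum of estimated costs."""
    row = IMPACTS[design]
    benefit = sum(row[name] for name in OBJECTIVES)
    cost = sum(row[name] for name in RESOURCES)
    return benefit / cost


def total_impact(attribute):
    """Sum one objective or resource across all design ideas."""
    return sum(row[attribute] for row in IMPACTS.values())


for design in IMPACTS:
    print(f"{design}: benefit/cost = {benefit_to_cost(design):.1f}")
for objective in OBJECTIVES:
    print(f"{objective}: {total_impact(objective)}% of the goal covered")
```

In this made-up case neither objective reaches 100%, which is exactly the kind of signal the table is meant to give: more design ideas are needed before the goals can be met.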

Chapter 10 - Evolutionary Project Management 'How to Manage Project Benefits and Costs'

Finally, Chapter 10 presents a summary of the Evo project management methodology, which ties together the concepts in the rest of the book. Based on the well-known Plan-Do-Study-Act cycle, Evo demands:

  • Early delivery of results to stakeholders
  • Frequent delivery of results to stakeholders
  • Small increments (or 'steps')
  • Useful-to-stakeholder steps
  • Sequencing of steps according to degree of stakeholder benefit, preferably the most profitable first.

Once you have your function requirements and quantified performance objectives, and design ideas evaluated using IE tables, you embark on a series of short Evo steps, each of which rapidly goes through design, development, testing and implementation to get early feedback from the stakeholders themselves. The next Evo step is then adjusted (maybe re-do the step, or select another step now that you have learnt more by trying to deliver the current one). During each step, you are actually *encouraged* to go back and re-write any of the previous analysis and design specs as further stakeholder feedback is received.

Glossary

The last 1/3 of the book itself includes a comprehensive formal glossary and definitions set for the Planguage concepts; I think of it as a domain-specific language for business analysis and project management. It is published under a separate permissive license which encourages the wider use of its contained ideas. A live version is also available on Gilb's site.

Conclusion

The proof is in the pudding; I have used Evo (albeit in disguise sometimes) on two large, high-risk projects in front-office investment banking businesses, and on several smaller tasks. On the largest critical project, the original business functions & performance objective requirements document, which included no design, remained essentially unchanged over the 14 months the project took to deliver, but the detailed designs (of the GUI, business logic, performance characteristics) changed many many times, guided by lessons learnt and feedback gained by delivering a succession of early deliveries to real users. In the end, the new system, responsible for tens of billions of USD of notional risk, went live successfully over one weekend for 800 users worldwide, and was seen as a big success by the sponsoring stakeholders.

I would recommend this book to any business analyst or project manager working to deliver complex IT systems. Further material is available on Gilb's website at http://www.gilb.com.


——

ZeroMQ Sockets

27 Aug 2011 16:20
Tags: zeromq

ZeroMQ provides a scalability layer between application endpoints. It implements several different message topologies, provided by a series of Socket Types, each with its own well-defined function and behaviour. This article summarises the behaviour and function of each Socket Type, and the most typical Socket combinations, valid in ZeroMQ 2.1.x.

ZeroMQ Socket Types

| Socket | Topology | Send/Receive | Incoming Routing / Msg Action | Outgoing Routing / Msg Action | HWM Action |
| --- | --- | --- | --- | --- | --- |
| REQ | Request-Reply | Send, Receive, Send, … | Last peer / Removes empty part | Load-balanced / Prepends empty part, queue unsent msgs | Block |
| REP | Request-Reply | Receive, Send, Receive, … | Fair-queued / Retains message parts up to 1st empty part | Last peer / Prepends retained parts, queue unsent msgs | Drop |
| DEALER | Request-Reply | Unrestricted | Fair-queued / No changes made | Load-balanced / No changes made, queue unsent msgs | Block |
| ROUTER | Request-Reply | Unrestricted | Fair-queued / Prepends reply address to message | Addressed peer / Strips 1st message part, drops message if identity not known or peer not available | Drop |
| PUB | Pub-Sub | Send only | n/a | Fan-out / No changes made, drops message if no subscribers | Drop |
| SUB | Pub-Sub | Receive only | Fair-queued, optional filtering / No changes made | n/a | Drop |
| PUSH | Pipeline | Send only | n/a | Load-balanced / No changes made, queue unsent msgs | Block |
| PULL | Pipeline | Receive only | Fair-queued / No changes made | n/a | n/a |
| PAIR | Exclusive Pair | Unrestricted | n/a | Queue unsent msgs | Block |

Where:

  • Fair-queued means the socket gets input from all sockets connected to it in turn.
  • Load-balanced means the socket sends messages to all sockets connected to it in turn.
  • Last peer means the socket sends the message to the last socket that sent it a message.
  • Fan-out means each message is sent to all connected peers simultaneously.
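The routing terms defined above can be illustrated with a toy model in plain Python. To be clear, this is not ZeroMQ code and uses none of its API; the function names and peer labels are invented, and real sockets do all of this internally.

```python
from collections import deque
from itertools import cycle


def load_balance(messages, peers):
    """Send each outgoing message to the next connected peer in turn."""
    sent = {peer: [] for peer in peers}
    for message, peer in zip(messages, cycle(peers)):
        sent[peer].append(message)
    return sent


def fair_queue(inbound):
    """Take one message from each peer's queue in turn until all are empty."""
    queues = [deque(pending) for pending in inbound.values()]
    received = []
    while any(queues):  # an empty deque is falsy
        for queue in queues:
            if queue:
                received.append(queue.popleft())
    return received


def fan_out(message, peers):
    """Deliver a copy of one message to every connected peer."""
    return {peer: message for peer in peers}


print(load_balance(["m1", "m2", "m3"], ["A", "B"]))
print(fair_queue({"A": ["a1", "a2"], "B": ["b1"]}))
print(fan_out("tick", ["A", "B"]))
```

Fair-queuing and load-balancing are mirror images of each other: one interleaves input from many peers, the other spreads output across them.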

ZeroMQ Socket Combinations

ZeroMQ only supports the following Socket combinations in a connect-bind pair:

| Combination | Examples from the ZeroMQ ZGuide |
| --- | --- |
| PUB and SUB | Uni-directional message broadcast from Publisher to Subscriber(s); Publish-Subscribe Proxy Server; Slow Subscribers (Suicidal Snail Pattern); High-speed Subscribers (Black Box Pattern); Reliable Pub-Sub shared key-value system (Clone Pattern) |
| REQ and REP | Sequential message request-reply between two Endpoints (e.g. client to server) |
| REQ and ROUTER | Frontend broker connection to multiple clients; Req-rep broker connection to multiple workers using custom routing (LRU) |
| DEALER and REP | Simple synchronous (one request, one reply) broker connection to one or more separate worker processes using load-balancing DEALER socket |
| DEALER and ROUTER | Custom 1..n request routing from a single client to multiple servers; Asynchronous n..1 request routing to a single server |
| DEALER and DEALER | Backend asynchronous (one request, many replies) broker connection to multiple workers |
| ROUTER and ROUTER | Asynchronous brokerless reliability (Freelance Pattern) |
| PUSH and PULL | Task ventilator connected to a pool of workers that forward results to a task sink; Parallel Pipeline with kill signalling; Pipeline streamer device |
| PAIR and PAIR | Signalling between Threads |

Further Reading

  • ZGuide: http://zguide.zeromq.org/
  • Concepts: http://www.250bpm.com/concepts
  • API: http://api.zeromq.org/


——

Messaging #5 - Duplicate Messages

27 Aug 2011 11:07
Tags:

This is the 5th in a series of articles about SOA (Service-Orientated Architecture) messaging systems.

Duplicate Messages

Every Component should expect to receive duplicate messages from the message system at any time. Duplicates can arise from:

  • A technical failure in a publishing Component, or a delivery failure and attempted recovery event by the messaging infrastructure itself.
  • An external (with respect to the system) message source (e.g. an external FIX endpoint at a customer), outside the control of the receiving System.

No messaging architecture can guarantee duplicate-free message delivery under all circumstances, particularly the kinds of architectures required for low-latency high-volume distributed processing. Duplicate message handling must be considered as an integral part of a Service Component’s design.

Duplicate messages can be either:

  • Explicitly flagged as possible duplicate messages. For example, 29West sets a flag on a message if it has been delivered by the persistence store (which acts as a proxy to the original publisher). Another potential example is a message replay during a FIX session re-connection event, where previously sent messages are re-sent by the FIX client.
  • Unknown duplicates. The message initially appears to be a unique, new event. Receipt and subsequent detection of such messages usually indicates a major failure in the messaging layer somewhere.

Functional and technical analysis should define the strategies for detecting and handling confirmed and suspected duplicates. Care should be taken to balance the requirements for low-latency processing (e.g. time to update a cached currency position) against the required accuracy and integrity of the position data itself. Implementation strategies may include:

  • Use of UUIDs or other unambiguous data matching techniques
  • Heuristic pattern matching of message data with existing records
  • Optimistic detection, whereby unflagged messages are initially assumed to be unique, thus ensuring the fastest-possible processing latency, followed by subsequent duplicate detection that may result in the effect of the message being unwound in the (assumed rare) cases that a duplicate has been received.
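As an illustration of the first and last strategies combined, the following Python sketch applies unflagged messages immediately and checks their UUIDs later, unwinding any that turn out to be duplicates. The `PositionCache` class and its method names are hypothetical, and a real component would also need to bound the memory used by the set of seen identifiers.

```python
import uuid


class PositionCache:
    """Cached position with optimistic duplicate handling (illustrative)."""

    def __init__(self):
        self.position = 0.0
        self.seen = set()     # ids confirmed as applied exactly once
        self.unchecked = []   # (msg_id, amount) applied on the fast path

    def on_message(self, msg_id, amount, flagged=False):
        """Apply a message; only flagged messages pay for an up-front check."""
        if flagged:
            self.reconcile()
            if msg_id in self.seen:
                return False  # confirmed duplicate, ignored
        self.position += amount  # fast path: assume the message is unique
        self.unchecked.append((msg_id, amount))
        return True

    def reconcile(self):
        """Deferred detection: unwind optimistic updates that were duplicates."""
        unwound = 0
        for msg_id, amount in self.unchecked:
            if msg_id in self.seen:
                self.position -= amount
                unwound += 1
            else:
                self.seen.add(msg_id)
        self.unchecked.clear()
        return unwound


cache = PositionCache()
trade_id = str(uuid.uuid4())
cache.on_message(trade_id, 100.0)
cache.on_message(trade_id, 100.0)  # unknown duplicate, applied optimistically
print(cache.position, cache.reconcile(), cache.position)
```

Between the duplicate arriving and `reconcile()` running, the cached position is wrong. Whether that window is acceptable is precisely the latency versus integrity trade-off that the analysis has to settle.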

——

Messaging #4 - Resiliency Groups

20 Aug 2011 14:54
Tags: messaging

This is the fourth article in a series about designing messaging architectures.

Resiliency Groups

To understand what we mean by “resiliency” or its cousin “reliability”, we need to look at their alter ego: “failure”. We can consider failure originating from:

  • Application code can throw unhandled exceptions, crash, run out of memory or other resources, or get into a deadlock or infinite loop and therefore stop responding to input. This is the most likely source of failure, as the application code is less likely to be shared across other systems.
  • System code, such as the messaging middleware, can die, or even run out of memory (e.g. caused by slow subscribers). This should happen less often than in application code, if only because other systems will probably also be using the same middleware, and will therefore find bugs whose fixes can be applied to other deployments.
  • Message queues can overflow.
  • Network transmission can fail temporarily, causing intermittent message loss.
  • Hardware can also fail: from servers and network switches to entire data centers.

For the purposes of this article, we will look at how to mitigate the failure of an individual Service Component instance. A key insight is that the strategies required to mitigate each of the different failure modes listed above must be included in the design from day 1. Wherever possible, the implementation of messaging resiliency design should be moved away from application code into the “Bus” layer or the middleware system, so that a common set of strategies can be applied across all Components.

Components in a SOA architecture are defined by the Service(s) that each type provides. If we assume that the Services are only provided by Component instances of the same type (i.e. with the same application code), the resiliency strategy for those instances needs to be considered. Additionally, scalability needs to be factored into the design: how many instances of the same Component are required in the local / global system to fulfil the performance requirements for the Service?

One approach is to define, by configuration (see Messaging #2 - Topology By Configuration), one or more Resiliency Groups for each Component type. A Resiliency Group will ensure that exactly one instance of the Component within the Group, known as the “Primary” (P), will receive and process a given message sent to the Service on the Logical Message Bus. Backup instances (see below) are denoted as Secondary (S), and there may also be Disaster Recovery (D) instances that run on dedicated DR-ready infrastructure.

Scalability is provided by splitting the message workload across multiple Resiliency Groups (which may, for example, be located in different WAN regions). The most efficient way to achieve this is likely to be topic subscription filtering, based on information in the topic name itself, enabling the horizontal scaling to be mediated by the messaging infrastructure and Logical Message Bus, rather than in application code.

Primary Instances within the Group will receive messages from the Logical Message Bus (if there is more than one Primary, the messaging layer can route to different Primary instances based on an appropriate routing logic: LRU, Round-Robin etc).

Secondary & DR Instances can run in hot standby (receiving messages, processing them to update internal state, but not generating any output or effect), warm standby (initialized and running, but not receiving messages), or cold standby (instance not running unless required).

Primary and Secondary / DR instances can then be composed into Resiliency Groups; with the Group behaviour mediated by the software layer forming the Logical Message Bus on top of the messaging infrastructure. It would use heartbeating or similar to observe component failures and ensure the appropriate role change occurs to a backup instance in the Group:

  • PPD (or PPPPD !). Two or more Primary instances, with a load balancing function to distribute the message load across the active Primary instances. The DR Instance would only become active and start processing messages if no Primary instances were running in the Group.
  • PSD. One Primary instance processes all messages; the S and D instances are running in warm standby. If the Primary instance should fail, the Secondary instance will detect this and take over the Primary role. If the Secondary instance should fail, the DR instance would take over. If the previously failed instance should restart successfully, it would join as a new Secondary instance.

The role handover used in PSD depends on whether the Service(s) provided by the Component are stateless or not; if stateless, then the new Primary instance can go straight ahead and process the next message. If message processing is stateful, the new instance must synchronize its internal state (e.g. from a persistent store or other Service) before commencing to process the message stream.
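The PSD role changes can be sketched as a small Python model. Everything here is illustrative: the `ResiliencyGroup` class and the instance names are invented, heartbeat detection is reduced to a method call, and the state synchronization needed for stateful Services is left out entirely.

```python
class ResiliencyGroup:
    """Toy PSD group: exactly one Primary, with ordered backup instances."""

    def __init__(self, instances):
        # Instances are listed in takeover order, e.g. Primary, Secondary, DR.
        self.order = list(instances)
        self.alive = set(instances)

    @property
    def primary(self):
        """The first live instance in takeover order, or None if all are down."""
        for name in self.order:
            if name in self.alive:
                return name
        return None

    def missed_heartbeat(self, name):
        """Called by the Bus layer when an instance stops heartbeating."""
        self.alive.discard(name)

    def restarted(self, name):
        """A recovered instance rejoins as Secondary, behind the current Primary."""
        self.order.remove(name)
        current = self.primary
        position = self.order.index(current) + 1 if current else 0
        self.order.insert(position, name)
        self.alive.add(name)


group = ResiliencyGroup(["pricing-1", "pricing-2", "pricing-dr"])
group.missed_heartbeat("pricing-1")  # Secondary takes over
print(group.primary)
group.restarted("pricing-1")         # old Primary returns as Secondary
print(group.order)
```

Keeping this logic in the Bus layer, rather than in each Component, is what allows one set of failover rules to be tested once and applied everywhere.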


——

Messaging #3 - Logical Data Bus

07 Aug 2011 22:19
Tags: messaging

This is the 3rd in the series of messaging-related articles.

Build a Logical Message Bus

The messaging infrastructure supports the creation of a messaging topology and provides messaging-level services (guaranteed delivery, messaging resilience, WAN bridging etc). However, there is still a significant functional gap between that and the application domain-level code that developers and the business wish to spend most of their attention on.

A shared common layer can be built that abstracts the messaging infrastructure from Service Component application code, and provides further services:

  • Topology configuration (e.g. converting message topology configuration into behaviour and state).
  • Topic building (e.g. using a Builder Pattern implementation)
  • Component resiliency (warm standby, round-robin)
  • Inter-component heartbeating
  • Test injection / mocking for unit testing
  • Message performance monitoring
  • Presents a messaging API to application code that abstracts the implementation of the messaging layer, allowing heterogeneous messaging architectures to be utilized to extract maximum performance.
  • Implements the messaging-level protocol(s) used for message creation and bundling (the actual wire byte protocol would be determined by the messaging infrastructure itself).
  • Provides a common point of control for messaging functionality across all components.

The combination of these services effectively creates a logical message bus (although the underlying messaging topology may be one or more of many different types and implementations, and not formally recognized as a “bus” at all).
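A minimal Python sketch of such a layer is shown below. The `Transport` protocol, `InMemoryTransport` and `MessageBus` are hypothetical names of my own; the in-memory transport stands in for real middleware, which is also how test injection could work.

```python
from typing import Callable, Dict, List, Protocol

Handler = Callable[[str, bytes], None]


class Transport(Protocol):
    """What the bus needs from any underlying messaging product."""
    def publish(self, topic: str, payload: bytes) -> None: ...
    def subscribe(self, topic: str, handler: Handler) -> None: ...


class InMemoryTransport:
    """Stand-in transport for unit tests; delivers synchronously in-process."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = {}

    def publish(self, topic: str, payload: bytes) -> None:
        for handler in self._handlers.get(topic, []):
            handler(topic, payload)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers.setdefault(topic, []).append(handler)


class MessageBus:
    """The only messaging API that application code sees."""

    def __init__(self, transport: Transport) -> None:
        self._transport = transport
        self.published = 0  # simple monitoring counter

    def publish(self, topic: str, payload: bytes) -> None:
        self.published += 1
        self._transport.publish(topic, payload)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._transport.subscribe(topic, handler)


bus = MessageBus(InMemoryTransport())
received = []
bus.subscribe("FX.Spot.EURUSD", lambda topic, payload: received.append((topic, payload)))
bus.publish("FX.Spot.EURUSD", b"1.4321")
print(received, bus.published)
```

Because application code depends only on `MessageBus`, the transport can be swapped for a different product, or several products, without touching business logic.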

This shared code becomes a critical part of meeting the System’s performance objectives, and, as such, requires the oversight of senior technical development personnel in its design and development.


——

Messaging #2 - Topology By Configuration

20 Jul 2011 09:59
Tags: messaging

This is the second in a series of blog posts covering key insights gained whilst working on design of low-latency high-volume distributed trading systems, focusing on the messaging middleware layer.

Topology By Configuration

SOA systems are defined by multiple endpoint-to-endpoint connections, bridged by a variety of intermediary brokers, gateways, bridges etc. For example, 29West uses WAN Gateways to bridge local multicast network traffic across data centres via more reliable TCP transport. Other examples include any broker-orientated messaging topology (e.g. ActiveMQ etc). The Open Source community is also starting to address this space with products such as RabbitMQ and ZeroMQ.

The messaging topology should be defined by configuration as far as possible. The configuration, controlling mechanisms such as topic subscriptions, list of publication topics, and topics to bridge across Gateways, should be deployed as an overlay; managed as a separate artifact to the application software. This provides several benefits:

  • By centralizing implementation of messaging topology into a deployable artifact in its own right, it supports the programme-level management of the topic space itself (see Messaging #1 - Service Topic Name Management).
  • Changes can be made rapidly in development and testing environments.
  • Production messaging topology is put under the same version control mechanisms (with rollback etc) as the application code itself.
  • Supports development of management, monitoring and visualization tools.
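As a sketch of what such an overlay might look like, the Python example below loads a topology from JSON and runs one simple check over it. The file layout, component names and the `unpublished_subscriptions` helper are all assumptions made for illustration; topics are matched exactly, with no wildcard support.

```python
import json

# Hypothetical topology overlay, deployed separately from application code.
TOPOLOGY_JSON = """
{
  "components": {
    "PricingEngine":  {"publishes": ["FX.Spot.EURUSD"], "subscribes": []},
    "PositionKeeper": {"publishes": [],
                       "subscribes": ["FX.Spot.EURUSD", "FX.Spot.GBPUSD"]}
  },
  "bridged_topics": ["FX.Spot.EURUSD"]
}
"""


def unpublished_subscriptions(topology):
    """Return subscribed topics that no component in the topology publishes."""
    components = topology["components"].values()
    published = {topic for c in components for topic in c["publishes"]}
    subscribed = {topic for c in components for topic in c["subscribes"]}
    return sorted(subscribed - published)


topology = json.loads(TOPOLOGY_JSON)
print(unpublished_subscriptions(topology))
```

A check like this is only possible because the topology lives in one artifact; when subscriptions are scattered through application code, nobody can answer the question without reading every Component.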

——

Messaging #1 - Service Topic Name Management

18 Jul 2011 09:35
Tags: messaging

RSBA Technology Ltd has provided consulting services to Tier 1 and 2 Investment Banks since 2009. Our focus has been on analyzing the requirements and project managing delivery of complex, multi-site real-time low-latency, high-capacity trading and workflow management systems for eFX businesses within the Bank’s Global Markets areas. In the course of this work, and in many years of experience prior to forming the Company, we have gained valuable insights and knowledge on how to effectively design, develop and deploy such systems in a highly competitive landscape.

This series of blog postings highlights some of these key insights related to the design of distributed trade workflow systems, in the context of low-latency, high-volume processing environments.

Service Topic Name Management

In a Pub/Sub-based SOA (Service-Orientated Architecture), different Service Components will be subscribing to subsets of data published within the system. Messages of specific content types are associated with specific Topics. Topics (e.g. “FX.Spot.EURUSD” or “Exception.FIXGateway”) are used to determine which messages are processed by different Component types and instances.

To ensure the long-term expandability of the architecture, the topic namespace itself must be tightly controlled both technically and managerially. Although a SOA architecture supports low coupling between Components and their respective development teams, it is essential that all current Components unambiguously define the topics of published messages, and register interest in receiving messages. Failure to do so from day 1 leads to propagation of hacks and workarounds that materially impact the ability of the system to scale according to business demand.

A managed topic schema should be defined and published right at the start of the development of the SOA system; or should be retrospectively implemented in an existing system. This can be hierarchical, based on domain delimiters, with a “root” schema defining the overall structure and standards for developing sub-domain topic spaces by different teams below that.

Effort must be put in to model all the likely messaging paths and use cases between Service Components, and the results used to draw up a Topic Schedule. This should be maintained by a Design Authority as an important lifecycle artifact, and all development teams should be made responsible for complying with the topic schema standards.

The topic identifiers, however structured, should not be directly exposed to the application layer. This allows:

  1. Changes to the topic structure, and thereby the routing of messages, that are transparent to applications and business logic.
  2. Inclusion of additional topic identifier sections required by the higher-level “Bus” functionality, such as component resiliency.

A possible message topic schema could incorporate the following elements within it:

  1. Root domain (e.g. “FX”, “IR”, “Control”, “Exception”)
  2. Publishing Component Identification (component type and instance information, should provide enough info to trace message originator to a specific process, such as physical location).
  3. Business Partition (Logical description of message contents and / or channel)
  4. Additional Parameters (additional flags and labels related to message routing, used for message subscription / forwarding purposes).
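One way to keep topic identifiers away from application code is a small builder, sketched here in Python. The `TopicBuilder` class, the "." separator, the element order and the example values are all assumptions for illustration; the real schema would be whatever the Design Authority publishes.

```python
class TopicBuilder:
    """Builds topic strings so application code never formats them by hand."""

    SEPARATOR = "."

    def __init__(self, root_domain):
        self._root = self._check(root_domain)
        self._component = None
        self._partition = []
        self._parameters = []

    def _check(self, element):
        """Reject empty elements and any that would break the hierarchy."""
        if not element or self.SEPARATOR in element:
            raise ValueError(f"invalid topic element: {element!r}")
        return element

    def component(self, component_type, instance):
        """Identify the publishing component type and instance."""
        self._component = self._check(f"{component_type}-{instance}")
        return self

    def partition(self, *parts):
        """Describe the business content or channel of the message."""
        self._partition = [self._check(part) for part in parts]
        return self

    def parameter(self, value):
        """Add a routing flag or label used by the Bus layer."""
        self._parameters.append(self._check(value))
        return self

    def build(self):
        if self._component is None:
            raise ValueError("publishing component is required")
        elements = [self._root, self._component,
                    *self._partition, *self._parameters]
        return self.SEPARATOR.join(elements)


topic = (TopicBuilder("FX")
         .component("PricingEngine", "LDN01")
         .partition("Spot", "EURUSD")
         .parameter("Primary")
         .build())
print(topic)
```

If the schema later changes, for instance to add a resiliency section, only the builder needs updating; callers keep describing messages in business terms.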

——

Welcome To RSBA Technology Ltd

10 May 2011 19:26
Tags: blog

Welcome to our new website, powered by wikidot. Having all the content in one place, editable online, should ensure updates are more frequent than previously!

The Specification Cookbook still has pride of place in its own section; we are currently transferring and refreshing the existing material across from the previous host.


Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License