The Elements of Distributed Architecture

Notes after watching the Pluralsight course on this subject.

http://pluralsight.com/training/Courses/TableOfContents/eda

In general, I’ve put my italicised comments at the start of each section, and then continued with information from the course itself.

Introduction

This is intended to be a big picture, technology agnostic view of distributed systems architectures. The high level summary is very interesting, although I think that as the course got into the detail of each area it became less relevant to distributed architectures specifically, tying itself to quite specific situations which didn’t necessarily relate to distributed architectures at all.

In distributed architecture the capabilities revolve around information, communication, presentation, processing, failure management and protection. Constraints revolve around capacity, latency, affinity, intermittent failure, thieves and idiots.

Architecture is both a matter of opinion and engineering. When designing buildings there is a wide range of appearances that can be implemented, depending upon the customer’s brief and the architect’s vision – no two factories or hotels need look the same. However, there are also engineering constraints that have to be considered – structural load, material strength, physical location and so forth.

Course Notes

Capabilities Overview

  • Information
    • Events (information that you find out about)
    • Storage (information that you are aware of)
  • Communication
    • Synchronous
    • Asynchronous
  • Presentation
    • Read only consumption
    • Read/Write interaction
  • Processing
    • Silicon time (server only)
    • Human time (dependent upon people – workflows, UI etc.)
  • Failure Management
    • Isolate (pessimistic, don’t let a failure occur)
    • Compensate (optimistic, if a failure occurs take compensatory action)
  • Protection
    • Security (guard against unauthorised actions)
    • Safety (guard against unintended actions)

Constraints

Capacity

How much can be held, transferred or processed in a period of time. This could be storage capacity, network bandwidth or available processing power.

Latency

The time taken to accomplish a task – software friction, as it were. This covers the time taken to transmit, store or compute, and is often down to physical limitations such as bus width or the speed of transmission across cables.

Affinity

The degree to which a particular set of information is bound to a scope or a location. It is probably the biggest inhibitor of scaling out – whether that is affinity to an individual’s mobile phone sitting in a pocket, or a need for specialised data storage.

Intermittent Failure

Any condition which interrupts or affects the planned flow. It could be permanent failure, intermittent failure, programming errors, physical equipment failure or exhausted capacity, for instance.

Thieves and idiots

Security is required to protect against thieves who attempt to gain unauthorised access. Safety is required to protect against idiots – people with legitimate access but who do silly things.

Information

The main body of the course started in a very useful place. In any distributed system there will be some information which is stored as known current state and is of interest to the system as a whole (e.g. product information, shift information, orders, van trips), and other information which is published between different parts of the system as events, typically reflecting a need to change state based upon action elsewhere (e.g. order picked, order cancelled, product delta). This doesn’t say anything specifically about requesting data, but it is assumed that where state exists there will be requests to gather information about it.

Events

Events originate from an event source. They are invariant, in that they are representations of facts, statements about past activity. They are also ephemeral – they are acted upon, reacted to or transformed into state (or ignored).

Events are not necessarily commands. Autonomous modules may choose how they want to respond to an event.

It is important to note that the value of events typically varies with their age – they are most valuable just after they have been produced, but that value decays with time (within seconds for some stock ticker feeds, over longer periods for other events). Eventually the value of events might rise again as they start to become interesting for historical archives – although it could be argued that this is only the case once the information in an event has been transformed into state somewhere.
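As a minimal sketch of the “events are invariant facts” idea (the event type and its fields are illustrative, not from the course), in Python:

    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass(frozen=True)  # frozen: an event is an invariant statement about the past
    class OrderPicked:
        order_id: str
        picker: str
        occurred_at: datetime = field(
            default_factory=lambda: datetime.now(timezone.utc)
        )

    def age_seconds(event: OrderPicked) -> float:
        """How old the event is; consumers can use this to judge how far its value has decayed."""
        return (datetime.now(timezone.utc) - event.occurred_at).total_seconds()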

State

This is information which you are already aware of – you know it exists and how to get it. It includes the possibility of ‘not present’ state, such as when you ask for the records associated with a customer and find out that there are no associated records.

State can be considered along a number of additional dimensions:

Private/Personal/Shared (the degree to which it is shared can make information harder to gather and makes updates more complex).

Fresh/Stale/Historic (how much staleness can you tolerate in a system? How old can data become before it is useless? For example, someone’s shipping address might change, but compensating mechanisms – post office forwarding – mean that we could cope with a stale postal address; we couldn’t cope with the same level of staleness for a grocery delivery address, because the compensating mechanism doesn’t help us there)

Independent/Dependent/Related

Small/Large/Huge (The bigger the state is, the harder it is to deal with. Huge datasets can cause us problems with storage, querying and transfer of data. Sharding is a technique to help deal with this – see the sketch after this list).

Owned/Foreign/Related
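As a sketch of the sharding idea mentioned above (the shard count and key are illustrative assumptions): hash a key to pick a partition deterministically, so a huge dataset can be split across stores while reads and writes still agree on where each record lives.

    import hashlib

    SHARD_COUNT = 8  # illustrative; real systems size this to data volume

    def shard_for(customer_id: str) -> int:
        """Map a key to a shard deterministically, so reads and writes agree."""
        digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % SHARD_COUNT

    # e.g. shard_for("customer-42") always yields the same shard index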

Communication

The course didn’t really consider any situations where synchronous communication was valuable in a distributed architecture. From what I gather it makes most sense within a module, where components might be tightly coupled together, but between services asynchronous communication is preferred as it is non-blocking and better supports load balancing and load levelling.

Synchronous communication

Like a telephone conversation, both parties must be present. A request is made and the requester takes no action until a response is received. Two of the disadvantageous side effects of this are that the client can be blocked while long running operations take place on a server, and modules end up being tightly coupled together.
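A sketch of the blocking pattern using a plain HTTP request from Python’s standard library (the URL is a placeholder):

    import urllib.request

    def fetch_order_sync(order_id: str) -> bytes:
        # The caller is blocked here until the server responds (or times out);
        # a slow server stalls this whole thread.
        url = f"https://example.invalid/orders/{order_id}"  # placeholder URL
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.read()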

Asynchronous communication

This can be used to stop blocking communication. A request is made, but the client can then take other actions while waiting for the response. When the response arrives the client can get on with processing it.

Async communications can also be used to give us temporal decoupling, load balancing and (with an appropriate broker, such as a message queue) load levelling.
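A minimal sketch of that temporal decoupling, with an in-process queue standing in for a real message broker: the producer returns immediately, and the consumer drains the queue at its own pace (which is what gives us load levelling).

    import queue
    import threading

    work = queue.Queue()  # stands in for a message broker / queue service

    def producer() -> None:
        for order_id in ("o-1", "o-2", "o-3"):
            work.put(order_id)  # returns immediately: no waiting on the consumer

    def consumer() -> None:
        while True:
            order_id = work.get()  # the consumer drains at its own pace
            print(f"processing {order_id}")
            work.task_done()

    threading.Thread(target=consumer, daemon=True).start()
    producer()
    work.join()  # wait until the queued work has been processed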

Presentation

I don’t think the course made a compelling argument about the role of a distributed architecture in affecting presentation systems – its treatment of consuming and interacting with computer systems was pretty generic. I’d argue that more could be made here of the role of the UI in composing together information gathered asynchronously from various distributed services, and of the implications of capacity, latency and affinity for providing a consistent and responsive UI in a distributed architecture.
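As a sketch of that composition idea (the services and calls here are stand-ins, not anything from the course): fan out to several services concurrently and compose whatever comes back, so the page costs the latency of the slowest call rather than the sum of all of them.

    import asyncio

    async def fetch(service: str) -> str:
        await asyncio.sleep(0.1)  # stand-in for a remote call with latency
        return f"data from {service}"

    async def compose_page() -> dict:
        # Fan out to several services at once; total latency is the slowest
        # call, not the sum of all of them.
        results = await asyncio.gather(
            fetch("orders"), fetch("products"), fetch("shifts")
        )
        return dict(zip(("orders", "products", "shifts"), results))

    print(asyncio.run(compose_page()))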

Consuming information

Strictly speaking the consumption of information isn’t purely read-only: small pieces of information are sent to request the information which is to be consumed, and logging of those requests may happen as a side effect.

Interacting with information

Interaction may include a combination of analysis, filtering and dispatching of information inbound, and then, after processing, a combination of collecting, composing and rendering data outbound. The line between presentation logic and business logic is increasingly blurred by the desire to provide greater interactivity to end users.

Processing

Processing which involves people takes place in human time. The time to complete the process may range from seconds (when someone is typing on a keyboard) to hours or days (for long running workflows). Typically when working on human time there is a long idle time and a very short processing time.

Processes which don’t involve people at all can normally be completed in silicon time, working in microseconds (or as fast as the computer can turn the information around). Typically there is very little idle time and a large proportion of processing time.

There are three foundational patterns for processing: stack based, queue based and event based. Stack based processing is commonly found within an application; it expects immediate action, and when seen in a distributed system it often takes the form of RPC (Remote Procedure Calls). Queue based processing is often found between applications; it expects eventual action, and message based queue systems are common in distributed architectures. Event based processing asks modules to be reactive to events generated in other parts of the system. The sender doesn’t necessarily know who gets the message – it just publishes event messages, and one or more subscribers may or may not respond to them.
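A minimal sketch of the event based pattern (topic names and handlers are illustrative): the publisher just emits, and zero or more subscribers choose how to react.

    from collections import defaultdict
    from typing import Callable

    subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(topic: str, handler: Callable[[dict], None]) -> None:
        subscribers[topic].append(handler)

    def publish(topic: str, event: dict) -> None:
        # The publisher just emits; zero, one or many handlers may respond.
        for handler in subscribers[topic]:
            handler(event)

    subscribe("order.picked", lambda e: print("billing saw", e))
    subscribe("order.picked", lambda e: print("shipping saw", e))
    publish("order.picked", {"order_id": "o-1"})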

Failure Management

There are two main ways of handling failure management described in this course. I think that more could have been made of the costs and benefits of each strategy in a specifically distributed system. Which considerations particularly come to bear in a distributed system? Perhaps the most important issue brought up is the set of restrictions imposed by the CAP theorem, which will affect any distributed architecture. That in itself suggests that perhaps isolation is more useful within a service, while compensation is more useful between services?

Isolate

Attempt to make it impossible for a failure to happen. It takes a pessimistic view that failures are likely and must be prevented. An example of this is using transactions to ensure that a group of database updates either all happen or none happen.
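A sketch of the isolate approach using sqlite3 from Python’s standard library: the two updates commit together or not at all, so the failure is prevented rather than repaired.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")

    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'a'")
            conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'b'")
    except sqlite3.Error:
        pass  # neither update is visible: the failure was prevented, not repaired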

Compensate

Optimistically assume that most of the time things are OK and that if a problem occurs it will be trapped and compensated for. An example of this is some high performance systems where the overhead of wrapping everything in transactions would slow processing down too much – so a more complicated rollback process is in place for exceptional conditions, recognising that those are rare enough that the added complication of rollback is more than worth it for the performance gains in other parts of the system.
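A sketch of the compensate approach (the steps and their compensations are hypothetical): run each step optimistically, and if one fails, apply the compensating actions for the steps that already succeeded, in reverse order.

    def run_with_compensation(steps):
        """Each step is a (do, undo) pair; undo is the compensating action."""
        done = []
        try:
            for do, undo in steps:
                do()
                done.append(undo)
        except Exception:
            for undo in reversed(done):  # unwind only what actually happened
                undo()
            raise

    # Hypothetical steps: reserve stock, charge card; cancel/refund compensate them.
    run_with_compensation([
        (lambda: print("reserve stock"), lambda: print("cancel reservation")),
        (lambda: print("charge card"),   lambda: print("refund card")),
    ])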

[As a general point, I don’t think that this section was covered very well in the course material beyond this point]

Types of failure

A failure might be permanent or intermittent. It might arise because of problems with infrastructure, program defects or incorrect data.

If there is a permanent failure the main options are to raise an alert and wait for the cavalry to arrive or take some large scale mitigation (fail over to a different node?). For intermittent failures the recovery options may include retry, compensate by taking a different action or alert and abandon recovery.

Retry isn’t always possible, as some operations can only be attempted once. There might also be time constraints that mean that there isn’t enough time to retry, or the input might have been destroyed in the process of the initial attempt. There may also be considerations about whether or not it is safe to retry. What if the first attempt produced a partial result? Are you sure it really failed? Is the action idempotent?

Note that many actions which appear to be idempotent still have side effects – perhaps logging, perhaps something else, which will occur again when the action is retried.
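A sketch combining bounded retry with an idempotency key (names are illustrative): the key lets the receiver discard duplicates in case the first attempt really did succeed, and the loop abandons recovery after a fixed number of attempts.

    import time

    processed: set[str] = set()  # receiver-side record of idempotency keys

    def handle(key: str, action) -> None:
        if key in processed:  # a duplicate: the first attempt did succeed
            return
        action()
        processed.add(key)

    def retry(key: str, action, attempts: int = 3, delay: float = 0.5) -> None:
        for attempt in range(attempts):
            try:
                handle(key, action)
                return
            except Exception:
                if attempt == attempts - 1:
                    raise          # alert and abandon recovery
                time.sleep(delay)  # note: side effects such as logging may repeat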

Ideally we would want all our operations to be ACID (Atomic, Consistent, Isolated and Durable). Unfortunately for distributed systems CAP theorem tells us that it is impossible for a distributed system to simultaneously provide Consistency (all nodes see the same data at the same time), Availability (a guarantee that every request receives a response about whether it was successful or failed) and Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system). A distributed system can satisfy any two of these guarantees, but not all three. (A good article on this issue can be found here: http://www.julianbrowne.com/article/viewer/brewers-cap-theorem )

The idea of accepting eventual consistency in distributed systems has been captured by the rather forced acronym BASE – the opposite end of the scale from ACID. BASE stands for Basically Available, Soft-state (caches), Eventually consistent. ACID is relatively straightforward in small, single-server systems, but as you scale out you need to use reliable messaging, queues and replicated flows of data to provide eventual consistency (even though you cannot guarantee that everything is exactly the same at a single point in time).

Protection

Our systems need to provide both security and safety. The training session dealt largely with generic security and safety issues rather than those which are closely associated with distributed architectures.

Security

Security means preventing unauthorised access to systems – establishing proof of identity, authorising the use of particular resources, and preventing eavesdropping on or tampering with messages.

In terms of distributed systems this concerns the availability of identity services and possibly authorisation services. The first enables secure establishment of someone’s identity (providing them with a token to prove that identity) and the second enables secure confirmation of someone’s authorisation to access certain resources.
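As a minimal sketch of the token idea (the key, claim and format are illustrative assumptions – real systems use an established token standard): the identity service signs a claim, and any service holding the key can verify the proof.

    import base64
    import hashlib
    import hmac

    SECRET = b"shared-secret"  # illustrative; held by the identity service

    def issue_token(user: str) -> str:
        """Identity service: prove who the user is by signing their name."""
        sig = hmac.new(SECRET, user.encode(), hashlib.sha256).digest()
        return user + "." + base64.urlsafe_b64encode(sig).decode()

    def verify_token(token: str) -> str | None:
        """Any service holding the key can check the proof of identity."""
        user, _, sig = token.rpartition(".")
        expected = hmac.new(SECRET, user.encode(), hashlib.sha256).digest()
        ok = hmac.compare_digest(base64.urlsafe_b64encode(expected).decode(), sig)
        return user if ok else None

    token = issue_token("alice")
    assert verify_token(token) == "alice"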

Safety

You want to be able to guard against unintended actions – where the right person who has authorisation to perform certain actions has accidentally done the wrong thing. Mitigation can be provided for some applications by having undo functions to allow someone to recover from an accidental action themselves, and by having audit trails to be able to identify when unintended actions have taken place so that remedial action can be taken.
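A sketch of the two mitigations named above (all names are illustrative): every action is recorded in an audit trail, and an undo stack lets the user reverse an accidental action themselves.

    audit_trail: list[str] = []  # who did what, for later investigation
    undo_stack: list = []        # compensating actions for self-service recovery

    def perform(user: str, description: str, do, undo) -> None:
        do()
        audit_trail.append(f"{user}: {description}")  # record the action taken
        undo_stack.append(undo)

    def undo_last() -> None:
        if undo_stack:
            undo_stack.pop()()  # let the user recover from their own mistake

    perform("alice", "deleted report", lambda: print("deleted"),
            lambda: print("restored"))
    undo_last()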
