I was fortunate to be able to attend QCon 2014 in London this year, and I took quite comprehensive notes from the key sessions which I attended. I’ve collected here the notes on the general principle of migrating to microservices. Anything which appears [in square brackets] is my own editorial comment. I intend to write something later to bring together a consistent view for myself on how to move a monolithic application into services and microservices in the medium to long term.

Enterprise REST – a case study[1]

Brandon Byars (Thoughtworks)

Using REST at large scale in a telecoms company.

The eight fallacies of distributed computing

  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology doesn’t change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.

We are moving from plain old XML, RPC style to URIs, then to HTTP and heading towards HATEOAS.

URIs are about resources; HTTP is a standard interface. HATEOAS forces clients to use your hyperlinks.

What he has observed is that most arguments occur at the upper couple of levels: PUT vs PATCH, parameterisation etc.

Another observation is that those areas are not where the mistakes tend to happen. Most errors are in versioning, deployment, testing, service granularity and the deployment of URIs.

Concentrating here on enterprise APIs, it is a story about a billing system.

He worked for three years at a telecoms company that wrote a billing system in the eighties. It enabled them to strategically grow, and was useful for a long time. The consequence was that their billing system had become very complex though. Static analysis of their code was worrying. Typically a complexity of more than 10 is a problem; theirs was 1000…

So this is a story about a legacy rewrite. They agreed on doing web-based integration. It was before microservices were being considered, so the services they were creating were quite large. They had to support a customer-facing web UI and some customer support UI.

They broke it into a dozen services.

They wanted choreography rather than orchestration. The latter is big SOA: it simplifies the architecture diagram, but not the actual architecture! The former allows business value to emerge from the interactions of the services. It is harder to diagram but provides better decoupling between services.

They chose to define logical environments for isolation.

They had a problem. They were adopting the infrastructure definition of environments, which was different to the developer view of environments. Devs worked on local machines and integrated on a shared integration vm, but using logically isolated environments overlaid on the infrastructure environments.

Ports and database names were used to provide logical isolation boundaries.

Coordinated deployments

Developers updated their own project development environments as and when they wanted.

This was one of their biggest successes. It predates some of the tools like Chef and Puppet which now allow that kind of thing to be done in a more structured way.

Version as a last resort

Semantic versioning is sensible; it is common to support multiple versions in prod, so supporting old versions is important. It adds complexity though, especially in monitoring and troubleshooting. So delay the need to version.

Spelling mistakes and inconsistent capitalisation are a common but silly cause of breaking changes. If you treat your end points as user stories it is easier to catch things like this early.

Postel’s law

“Be liberal in what you accept and conservative in what you send”.

We too often have deserialisers which throw an exception if you get an order with four fields and your object is expecting three. Additional fields you don’t expect shouldn’t be breaking changes (but too often they are). Move away from strict schema validation.
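
A minimal Python sketch of the tolerant approach, assuming a hypothetical `Order` type with three fields. The field names and the `parse_order` helper are illustrative only and are not from the talk. Fields the consumer does not recognise are dropped rather than rejected:

```python
import json
from dataclasses import dataclass, fields


@dataclass
class Order:
    # Hypothetical three-field order; the names are illustrative only.
    order_id: str
    customer_id: str
    amount: float


def parse_order(payload: str) -> Order:
    """Build an Order from JSON, ignoring any fields we do not know about."""
    data = json.loads(payload)
    known = {f.name for f in fields(Order)}
    # Keep only the fields this consumer understands. Extra fields are
    # dropped instead of raising, so a producer adding a field does not
    # break this consumer.
    return Order(**{k: v for k, v in data.items() if k in known})


# The producer has added a fourth field ("currency") that we ignore.
order = parse_order(
    '{"order_id": "A1", "customer_id": "C9", "amount": 12.5, "currency": "GBP"}'
)
```

A missing required field still raises a `TypeError` in this sketch; being liberal about extra input does not mean accepting incomplete input.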

Separate functional testing from integration testing

The best balance was to have isolation tests which ruthlessly stub every dependency over the wire. They started with handcrafted stubs; now there are more record-and-replay stub tools out there: Moco, VCR, Betamax, stubby4j etc.

Contract tests

The last known good versions of all other services were deployed alongside new dev work. If the tests all passed, the combination of all those items was saved as an artefact: a set of services which work together. [Note: this flies in the face of the Netflix microservice deployment process.]

Use bounded contexts to control complexity

The product team had a very good idea of what products were.

Event propagation in REST is very much in discussion. They used atom feeds, but he isn’t entirely sure that was best.

This article is online at http://martinfowler.com/articles/enterpriseREST.html

Migrating to microservices[2]

Adrian Cockcroft

Left Netflix in January. Wants to cover things more broadly. He has some slides on SlideShare from a cloud conference which are relevant from an exec view.

What did I learn from Netflix?

Started working on APIs, moved to microservices in the cloud. Their main plan is speed of delivery to the marketplace. It learns very quickly.

  • Speed wins in the marketplace
  • Remove friction from prod development
  • High trust, low process
  • Freedom and responsibility culture
  • Don’t do your own undifferentiated heavy lifting (i.e. just put it all on AWS)
  • Simple patterns automated by tooling (no central architecture planning function). This helps the architecture to evolve, and the best parts to then be identified and spread further
  • Microservices for speed of development and availability

Typical reactions to Netflix talks

In 2009 people said that they were crazy and making stuff up
In 2010 people said it won’t work
In 2011 people said it only works for unicorns like Netflix
In 2012 people said we’d like to do that, but can’t
In 2013 people said we’re on our way, using Netflix OSS code

Demands on IT have exploded. You can only survive by being good at it.

Colonel Boyd USAF said observe, orient, decide, act.

So how fast can you act?

Doing IaaS? Cuts out infrastructure from the process of product development.

PaaS cuts out even more – it can get a product manager talking to a dev about one feature that is quickly released (pushed every day to production!). As the cost, size and risk of change are reduced, the rate of change increases.

Observe = Innovation: land grab, customer pain points, measure customers, competitive moves.
Orient = Big Data: analysis, model hypotheses.
Decide = Culture: plan response, share plans, JFDI
Act = Cloud: incremental features, automatic deploy, launch AB testing.

How do I get there?

Old school companies need to read “The Phoenix Project”[3] and move on from there.

“Lean Enterprise” by Jez Humble is about continuous deployment for speed in big organisations. No time for handoff between teams. Run what you wrote – root access and pager duty. Genuine ownership of your code where it counts. High trust for fast local action. Freedom and responsibility for developers.

Open source ecosystems

  • The most advanced scalable and stable code today is OSS.
  • No procurement cycle; fix and extend it yourself.
  • GitHub is your company’s online resume.
  • Give up control to get ubiquity – Apache license
  • Extensible platforms create ecosystems.

Cloud native for high availability

  • Business logic isolation in stateless microservices
  • Immutable code with instant rollback
  • Auto scaled capacity and deployment updates
  • Distributed across availability zones and regions
  • Lots of de-normalised single function NoSQL data stores
  • NetflixOSS at netflix.github.com and techblog.netflix.com

Cloud native benchmarking

Netflix were planning to run full active-active across east coast and west coast. In order to do this they took the most write-intensive workload they could. In twenty minutes they got 192 TB of SSD provisioned across 96 machines in two locations.

An API proxy Zuul defines the API they want to use externally, and routes to various back ends.

Then there are Eureka and Edda, which are respectively their service registry and a black box recorder for the state of the whole system at any time.

They then have teams, each concerned with a microservice. One might have a Karyon-based server talking via Staash to Cassandra. Another might use MySQL or S3. Each team works independently; it decouples the problems with breaking the build.

Separate concerns using microservices

  • Inverse Conway’s law – teams own service groups.
  • One “verb” per single function microservice
  • Size doesn’t matter
  • One developer independently produces a microservice.
  • Each microservice has its own build, avoids trunk conflicts.
  • Stateless business logic (simplifies roll forward/back)
  • Stateful cached data access layer
  • Versioning

Deployment architecture

  • Versioning
    • leave multiple old microservice versions running
    • fast introduction vs slow retirement
    • code pushes only add to the system. Eventually retire old services which have no traffic.
  • Client libraries
    • even if you start with a protocol, a client side driver is the end state.
    • best strategy is to own your own client libraries from the start (e.g. MongoDB did this, which really helped with their uptake)
  • Multithreading and non-blocking calls
    • reactive model RxJava using Observable to hide threading
    • migrated from Tomcat to Netty to get non-blocking I/O speedup
  • Enterprise Service Bus / Messaging
    • message buses are CP with big problems getting to AP (CAP theorem)
    • use for send and forget over high latency links

Microservice APIs

  • API patterns
    • RPC, REST. Self-describing overhead, public vs in-house.
    • XPath and JSONPath add some flexibility but are not as useful in-house.
  • Scripted API end points – dynamic client RPC pattern
    • See Daniel Jacobson’s talks at slideshare/netflix
    • March 3rd 2014 techblog.netflix.com post by Sangeeta Narayanan
  • Service discovery
    • build time Ivy, Gradle and Artifactory
    • run time Zookeeper for CP, Eureka for AP
  • Understanding existing code boundaries
    • how do you break up your giant code base?
    • buy a bigger printer and wallpaper a room.

Microservice datastores

  • Book: Refactoring Databases
    • Schemaspy to examine data structure
    • Denormalisation into one data source per table or materialised view.
  • CAP – Consistent or Available when Partitioned
    • Look at Jepsen models for common systems aphyr.com/tags/jepsen
    • AP as default for distributed system unless downtime is explicitly OK
  • Circuit breakers see http://fluxcapacitor.com for code examples.
    • NetflixOSS, Hystrix, Turbine, Latency Monkey, Ribbon/Karyon
    • Also look at Finagle/Zipkin from twitter and Metrics, Graphite
    • Speed of development vs scale driving resilience
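
To make the circuit breaker idea in the list above concrete, here is a minimal Python sketch. It is not Hystrix and does not use its API; the class name, parameters and thresholds are all assumptions made for illustration. After a run of failures the breaker opens and fails fast, then allows a single trial call once a reset period has passed:

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    # Illustrative defaults; real values depend on the service being called.
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast without touching the struggling dependency.
                raise CircuitOpenError("circuit open, failing fast")
            # Reset period has passed: let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open on too many failures, or re-open if the trial call failed.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The injectable `clock` is only there so the behaviour can be exercised without waiting for real time to pass.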

How do we get to microservices simplest and soonest?

Try the carrot, stick and shiny objects

  • “This new feature will be ready faster as a microservice”
  • “This new feature you want will only be implemented in the new microservice based system”
  • “Why don’t you concentrate on some other part of the system while we get the transition done?”

(as it happens, the last one of those was the approach they used at Netflix)

Moving to Microservices – shadow traffic backend redirection

  • First attempt to send traffic to cloud based microservice
    • Used real traffic stream to validate cloud backend
    • Uncovered lots of process and tools issues
    • Uncovered service latency issues
  • They modified the monolithic datacentre code path
    • Returns Genre/movie list for a customer
    • Asynchronously duplicated request to cloud
    • Started with send-and-forget mode, ignore response
  • Dynamic consistent traffic percentage
    • If (customerid % 100 < threshold) shadow_call()[4]
    • They set the threshold so they could dial up or down the amount of traffic going to the cloud.
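
The percentage check above can be sketched in Python as follows. The function name is hypothetical; only the `customerid % 100 < threshold` test comes from the talk. Because the decision depends only on the customer id, a given customer is consistently in or out of the shadowed group for a given threshold:

```python
def should_shadow(customer_id: int, threshold: int) -> bool:
    """Return True if this customer's request should also be sent to the cloud.

    threshold is a percentage from 0 (no shadow traffic) to 100 (all traffic).
    """
    return customer_id % 100 < threshold


# With a threshold of 25, a quarter of evenly spread customer ids are shadowed.
shadowed = sum(should_shadow(cid, 25) for cid in range(1000))
```

Raising the threshold only ever adds customers to the shadowed group, so dialling traffic up does not reshuffle who is already being shadowed.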

Production is kept immutable. While monolithic updates can break everything at once, a microservice deployment adds a new microservice (with no impact) and then routes test traffic to it. They have version-aware routing and eventual retirement of older services.

Scott Adams “How to fail at almost everything and still win big”

Automatic canary red/black deployment

This process is in use for tens of large fleet microservices in active development.

  1. Developer checks in code then gets email notifications of progress
  2. Jenkins build launches in test account and starts tests
  3. If tests pass, launch ‘canary’ signature analysis in production
    1. Start one instance of the old code per zone
    2. Start one instance of the new code per zone
    3. Ramp up traffic and analyse metrics on all six
  4. If canary signature looks good, replace current production
    1. Scale canary build up to full capacity
    2. Send all the traffic to the new code
    3. Wait until after peak traffic time then remove old code instances
  5. Security team tools notice the new build via Edda query
    1. Automatic penetration test scan initiated.

When code is checked in during the afternoon in California and passes the canary test suite, it is first deployed to night time Europe, then after peak it is canaried and deployed to East Coast US on the next day, and then West Coast US after peak on the West Coast.

Monitoring the microservices is important.

They use AppDynamics to instrument the JVM to capture everything including traffic flows. They insert a tag for every HTTP request with a header annotation GUID, and visualise the overall flow or the business transaction flow.

Boundary.com and Lyatiss CloudWeaver are used to instrument the packet flows across the network. Captures the zone and region config from cloud APIs and tags, allows them to correlate, aggregate and visualise the traffic flows.

Scaling continuous delivery models

Monolithic – Etsy, Facebook

  • Etsy – 8 devs per train
  • Ops team run the monolith
  • Queue for the next train
  • Coordination chat session
  • Need to learn deploy process
  • Update in place
  • Few concurrent versions
  • 50 monolithic updates/day
  • Roll forward only
  • “done” is released to Ops
Microservices – Netflix, Gilt

  • Everyone has their own build
  • Dev runs their own microservice
  • No waiting, no meetings
  • API call to update prod timeline
  • Automated hands-off deploy
  • Non-destructive updates
  • Unlimited concurrent versions
  • 100s of independent updates
  • Roll-back in seconds
  • “done” is retired from prod


Adrian’s Blog: http://perfcap.blogspot.com

Dismantling the monolith

Brian McCallister, Groupon

Groupon started with a giant monolithic Rails app, between 100k and 2000k lines of code.

Mobile component of transactions is now over 50% of traffic and growing.

They started by adding APIs onto their monolith but it was still horrible. They even had a different code base for their international platform.

This gave them a huge crisis in the business. They tried a front end rewrite over six months but had to roll it back; it was a disaster.

They couldn’t develop things fast enough for the business. They wanted to build features worldwide, the mobile and web lacked feature parity and they couldn’t change the look and feel. But the big rewrite didn’t work, so how could they move forward in 2012?

They looked at their monolith and tried to identify what modules composed it. They then wanted to start breaking each module into a separate service. They picked a two-page flow (subscription) and set up a separate route to that particular part of the application. For a language, the two guys who did this picked Node.js because it had some momentum, and it had npm which does packages correctly.

They implemented this tiny app (200 lines CoffeeScript) which called their existing API and moved it live. Within two hours they had a major site outage for that page.

It turns out that their infrastructure was optimised for Rails, but wasn’t set up for supporting Node.js. They had to introduce an additional routing layer to make it work.

So they went on to their next module. Subscription flow had been very low risk, didn’t have to manage templates etc. They wanted a different team to take on the next thing, something bigger, so they chose the browse page.

They thought it would be an easy change, but it wasn’t. The change to the deployment model needed a change to culture too, which they didn’t do. A two week estimate turned into six months with lots of tension and pain. Plus a realisation that they would have to adjust a lot of things about their culture.

They then decided that they wanted to spin out another dozen teams, so first had to think through their culture changes and make a framework which is easier to use. One of their policies is that documentation was required to answer every question.

They figured out how they wanted to handle layout. They have about twenty different kinds of layout. One model they tried was having template pages which contain the components (but that lacked flexibility). The next model was like AngularJS compositing, but that gives a slow user experience at first. Eventually they just went to shared layouts from a shared layouts service, which includes login status, country and other stuff to create a dynamic Mustache template.

Then they decided they needed to finish it. 150 developers in two months, no production work other than bug fixes during that time. This was so they could do AB testing between the two.

Latency halved across the board. They can now plot traffic by which application is running.

Of course, they still had their other code bases to handle.

Then what to do with their API. They couldn’t just break the API. But new international services start to take responsibility for the API going forwards, with the routing layer deciding where to send the requests to.

Modular development of a large e-commerce site

Oliver Wegner and Stefan Tilkov, Otto.de

An architecture’s quality is inversely proportional to the number of bottlenecks limiting its evolution, development and operation.

Conway’s law is well known[5]. But the inverse is also true – an architecture can limit what can be done within the organisation. Choosing a particular architecture can be a means of optimising for a desired organisational structure.

Rebuilding Otto.de

It was founded as a catalogue store back in 1949. For the last fifteen years e-commerce has been rapidly increasing, and has now overtaken the original business – it is now 80% of all turnover.

Business stakeholders are demanding more and more, but they couldn’t do it on their 2001 platform, so they decided to rebuild. They looked at buying in a product like ATG, but decided that would just be swapping one monolith for the next one. So they had a think – what were their goals?


  • Test driven (including AB testing support)
  • Data driven (decisions based on data, not feelings)
  • Personalised
  • Features

Non-functional goals

  • Simpler
  • Reliable
  • Fast
  • Realtime
  • Scalable
  • Time to market (one release per month was too slow)

They couldn’t change their backend systems which managed products, customer, orders.

They decided to use open source for their core technologies, so as not to be dependent on one vendor.

They made one prototype to define the technology stack

Project organisation with autonomous teams.

Scrum as agile development method

Their technical system architecture started off looking like a standard layered model. But it looked like any other single system. So how could they divide it into services rather than data layers? They decided that it should reflect the system blocks, then have rules about how to connect those blocks. Within each domain service, the language and data storage choices are really internal to the systems. They have ended up with a combination of C#, JRuby, Scala, Groovy and Clojure communicating with a combination of RDBMS, NoSQL document and key/value stores.

The customer journey has separate elements, and different business units had interest in and responsibility for each step (discover, search, assess, order, checkout). They divided the system architecture up vertically along these same lines (search, product, order, user, after-sales, etc.).

Macro architecture

  • RESTful
  • Shared nothing
  • Vertical design
  • Data governance

Micro architecture

  • Buy when non-core
  • Common technologies

Their Product Owner is a virtual entity, decisions made jointly by project lead, business lead and technical lead. (A very interesting idea)

When you have teams for each of these areas, how do you deal with frontend integration? The customers want a consistent experience.

Atom feeds are used for loose integration between systems (e.g. caching product changes). This is better than straight REST calls, which tie things together in an unwanted way.

Good client side integration levels are links, and replaced links (embedded something on the page).

Interestingly their basket page is an entirely different application, although it looks identical to the customer. They have an asset team to concentrate on providing consistent assets to all apps. Danger that they become a bottleneck though… Might there be a better way of doing it? Having a central versioned storage perhaps?

First approach to AB testing was a solution with a centralised framework which every team has to include in their repository. But they don’t want that code sharing. So they have a dedicated separate vertical system for testing. Independent from other systems.

Ideally, for cross cutting concerns they like to introduce new vertical services to cover them.

2 years, >100 people, on budget, on quality, ready four months early for the MVP.

Lessons learned

Independent, autonomous systems for maximum decoupling. Allowed development to scale.

Strict macro architecture rules that everyone knows.

Minimise cross functional concerns. Avoid centralising things; it is more work but much better in the long run.

Prefer “pull” to “push” sharing

Address cross functional concerns

Minimise the need for coordination between teams.

Be skeptical of “easy” solutions

Teams with their own decisions. Trust them.

Lessons learned in scaling Twitter

Brian Degenhardt

He works in the platform team that writes the base libraries. There are lessons learned in scaling from the original architecture to what they have today.

Engineering is the scalpel which we use to subdivide a problem so that it can be made in pieces. Westminster Abbey was built from individual stones. Manageable pieces.

Three lessons

  1. Incrementally implemented SOA
  2. Separate semantics from execution
  3. Use statistics to monitor behaviour

Incrementally implemented SOA

Originally it was a monolithic Rails application talking to MySQL. It allowed a small team to iterate quickly so it was useful at first.

It was difficult to scale the security model
Any change deploys to all servers (and lots of leaky abstractions)
Poor concurrency and runtime performance, as all servers have to have all the code. Ruby was single threaded etc
Leaky abstractions and tight coupling made it difficult to separate stuff.

First they split storage into separate services. Tweets, users, timelines and social graph.

Then they divided it up into routing; presentation (which includes web, API and monorail); and logic (which includes tweets, users and timelines).

A visualisation of the services is vast; individually easy to work with, complex as a whole

How is a tweet sent?

Write tweet
Goes to write API
Fan out and deliver to all followers
Redis for each timeline, so it is stored there for each person
Timeline service pulls your stuff off your Redis timeline.
Then each tweet has to be hydrated from tweets and user service.

The tweet is also sent to the ingester, which sends stuff to Earlybird, the search system. When you search, Blender does a parallel search across all Earlybird instances.

The tweet also goes to the firehose, http push and mobile push.

Timeline: 240 million active users, 300k queries per second. 5000 tweets per second. Gets tweets in between 1 and 4 milliseconds.

Twitter Server is open source: a base library for config, admin, logging, service life cycle and metrics. Written in Scala. Finagle is the underlying component, an RPC mechanism for the JVM. It does service discovery, load balancing, retrying, thread/connection pooling, stats collection and distributed tracing.

Separate semantics from execution

Your service as a function

Takes a request, returns a response. The response is a Future[T]; it is their abstraction for concurrency. It is composable, and it can be pending, completed or failed. It is concurrent and easy to reason about.

Get user id, get tweet ids from timeline, get tweets, get images from tweets as needed.

Userid = future(23)

Then you can compose all of the actions into a single future for the whole operation, and the parallelisable sections run in parallel.

They separated what they want to do from how the threads execute it. This separates the reasoning about what you want to do from how to make it as efficient as possible.
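
As a rough analogue of this style, here is a Python `asyncio` sketch rather than Finagle's Scala `Future[T]`. The three lookup functions are hypothetical stand-ins for remote calls and return canned data. The code states what depends on what, and leaves the scheduling of the independent tweet lookups to the runtime:

```python
import asyncio


async def get_user_id(name: str) -> int:
    return 23  # stand-in for a remote call to a user service


async def get_tweet_ids(user_id: int) -> list[int]:
    return [101, 102, 103]  # stand-in for a call to a timeline service


async def get_tweet(tweet_id: int) -> str:
    return f"tweet {tweet_id}"  # stand-in for a call to a tweet service


async def timeline(name: str) -> list[str]:
    # Dependent steps: each one needs the result of the previous one.
    user_id = await get_user_id(name)
    tweet_ids = await get_tweet_ids(user_id)
    # Independent steps: these lookups may run concurrently, and gather
    # returns the results in the same order as the inputs.
    return await asyncio.gather(*(get_tweet(t) for t in tweet_ids))


result = asyncio.run(timeline("alice"))
```

Nothing in `timeline` mentions threads or pools, which is the separation the talk describes: the composition expresses the semantics, and the runtime decides how to execute it.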

So their stack looks like this
Your service,

Each service follows this same stack


Use statistics to monitor behaviour

Breaking it up means they have to use stats.

The amount of traffic means that the aggregate is more important than individual requests. More vertical components means more measurement

Horizontal SOA means more measurement.

Failures are ok

300,000 rps
99.99% success, so
30 failures per second, so failure rate is more useful than actual failures.

Now, there might be a 10% increase in rate for median and p90, but long tail jumps 300%.
Tail effects are cumulative, so any p99 or p999 requests are important to manage.
This is why request level concurrency is important.
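
The arithmetic above, and one way to see why tail effects accumulate, can be sketched in Python. The function names are hypothetical, and the fan-out calculation assumes that backend calls are independent, which is a simplification of mine and not something claimed in the talk:

```python
def failures_per_second(requests_per_second: float, success_rate: float) -> float:
    """Expected failed requests per second at a given success rate."""
    return requests_per_second * (1.0 - success_rate)


def chance_of_slow_call(fan_out: int, percentile: float = 0.99) -> float:
    """Probability that at least one of fan_out independent backend calls
    is slower than the given latency percentile."""
    return 1.0 - percentile ** fan_out


per_second = failures_per_second(300_000, 0.9999)  # about 30
one_hundred_calls = chance_of_slow_call(100)       # roughly 0.63
```

Under that independence assumption, a request that touches 100 backends has about a 63% chance of hitting at least one p99-slow call, which is why the long tail matters more than the median once requests fan out.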

Measuring the vertical stack

They wrote a tool called viz which measures the vertical stack, graphs it, and has queries and alerts in it, which makes dashboards possible. Every team has their own dashboard.

Measuring the horizontal

Zipkin (which is awesome). An OSS distributed tracer. It is always on in the Twitter infrastructure and maps 1 in 1000 live tweets. It makes it possible to see the pipelining as it is happening.

It is modelled after Google’s “Dapper” paper. Google didn’t open source theirs, so Twitter wrote their own and open sourced it. Distributed tracing is very useful!

Their statistics stuff is also open sourced in Twitter Commons.


What do they do about performance testing?

He is lukewarm about load testing. Instead they have a carefully staged rollout – canarying, so it goes to some servers and gets a portion of the traffic to see whether the canary dies on live traffic.

The Oscars caused them a bit of a problem when everyone came to search at the same time; plus, a lot of users who hadn’t used it in years logged on, so they had to refresh lots of cache.

The biggest tweet velocity is from Japan when Castle in the Sky is on – at the end everyone watching traditionally tweets the spell of destruction, and they had 185,000 tweets of that in one second.

With monorail, they had to have change freezes around big times because they couldn’t be sure what the impact would be. Now they have smaller services they don’t have that same problem.

Versioning – they just always maintain backwards compatibility, always adding new services.

How would they change if they were having financial transactions rather than tweets? Probably drop features and provide more deliberate accuracy.

How Netflix leverages multiple regions to increase availability

Ruslan Meshenberg
Director, platform engineering.

They have over 44 million subscribers worldwide.

This talk is really about failure in its most dramatic state. How do you mitigate it when it happens?

Small scale, slow change, everything pretty much works. Once you start moving at larger scale or faster change you will have problems with hardware failures or software problems respectively.

Top problems generate bad PR (active-active, game day practising)
Customer service calls (better tools and practices)
Metrics impact (better data tagging)

Does an instance fail? Yes. Could be bad deployments, hardware failure, latent issues. Their Chaos Monkey tests these things.

Can a whole data centre fail? It can happen, for routing or DC-specific issues. They test this with Chaos Gorilla.

Can a whole region (with several data centres) fail? Yes, most likely because of a region-wide config issue. They test this with Chaos Kong.

Everything fails eventually

So you decide how you deal with it
Time to detect, time to recover.

Highly agile and resilient service on top of ephemeral services.


Changes in one region shouldn’t affect others.


Christmas 2012 they had a long and painful outage. They were just in US east at that time.

The postmortem stated that data was deleted by a maintenance process from production. None of their services could receive any traffic.

This led to Project Isthmus. Plugging the leak, it is a tunnel between US East 1 and US West 2. Users are geolocated between the two regions. If there is a regional failure, they override the geo routing and route to the region that is still up.
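
The routing override can be pictured with a small Python sketch. This is a guess at the shape of the decision only, not Netflix's implementation; the `route` function and the way health is represented are assumptions, and the region names are the AWS identifiers for the two regions mentioned:

```python
# The two regions named in the talk, using their AWS identifiers.
REGIONS = ("us-east-1", "us-west-2")


def route(home_region: str, healthy: set[str]) -> str:
    """Pick the region that should serve a user.

    home_region is where geolocation would normally send the user;
    healthy is the set of regions currently considered up (hypothetical input).
    """
    if home_region in healthy:
        return home_region
    # Regional failure: override geo routing and use any region that is up.
    for region in REGIONS:
        if region in healthy:
            return region
    raise RuntimeError("no healthy region available")


served_by = route("us-east-1", healthy={"us-west-2"})
```

In normal operation the function simply returns the geolocated region, so the override only changes behaviour during a regional failure.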

Zuul is a smart routing fabric invented in house that sits behind the Elastic Load Balancers, and is a powerful filter routing handler. It is open sourced.

DNS can become a single point of failure, so they created “Denominator”, which allows a choice of DNS provider.

This supported ELB. They didn’t want to build one-offs for each service; how could they come up with a single solution? This led to the active-active project.

Active active

This provides full regional redundancy.

Can’t just deploy to both regions and be done. There are some nightly batch jobs which are not on user path and don’t need it. Secondly, replicating the state is important.

Routing users to the services has been discussed

Data replication? They have embraced Cassandra, which has a tuneable consistency model. Eventual consistency != hopeful consistency! They benchmarked global Cassandra and had no data loss with 1 million writes and reads within 500ms.

Propagating EVCache layers was more difficult. They need single-digit millisecond response times for some situations. They came up with a complicated method of doing this. They have now announced Dynomite, which keeps the native protocol from memcached and works better.

Config isolation

Archaius is the region isolated config tool. They don’t want to make any config global now, default to regional. All devs have live access and can do global config if needed, but it isn’t default.

They have automated canaries and continuous deployment.

Asgard is the tool they made OSS to handle deployments. It sets up a new instance and directs some traffic across, while the old cluster is still there until you definitely need to take it out.

They have a platform deployment app they are planning to open source to allow deployments to all areas to happen with less interaction.

Monitoring and alerting has per-region metrics and global aggregation. It could be seen as a logging service which also allows you to watch movies!

They use Route 53 CNAMEs. (Look up).

For failback it isn’t enough just to reroute. You need to ensure that data is repaired and cold caches get refreshed, and autoscaling means that you don’t bring traffic back too quickly.

They validate that the whole thing works by using their chaos primates against live systems. Ensure there are no cross-regional dependencies. They kill their data tier too, not just their stateless services.

Open Source


Ice is their open source tool which takes the Amazon pricing report and gives powerful visualisations of the costs.

Eureka is a service registry/discovery tool. One is deployed per zone. It is highly available.

Edda keeps a snapshot of every environment in history, so you can see how it evolved over time and query it.

Archaius is for configuration management.

Ribbon is a library for internal request routing.
Karyon is the server side partner of Ribbon.

Hystrix circuit breaker. Fail fast, recover fast.

Turbine dashboard to work with Hystrix

Simian army (runs in working hours so devs can fix)

It is all Apache licensed.
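The "fail fast, recover fast" idea behind a circuit breaker such as Hystrix can be sketched in a few lines. This is a minimal illustration of the general pattern, not Hystrix's API; the class name, thresholds and injectable clock are all assumptions of mine.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock              # injectable so tests need not sleep
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast: do not touch the struggling dependency at all.
                raise CircuitOpenError("failing fast")
            # Cool-down has passed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Recover fast: one success closes the circuit again.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping each remote call in `breaker.call(...)` means a dead dependency costs callers an immediate exception rather than a timeout.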

Q how do they do capacity planning? They don’t. Devs have freedom and responsibility to use what they need if they think they need it.

Q when a new version of a service corrupts data, what do you do? If you find the corruption quickly, you fix quickly. If it isn’t detected in time you have to fix forward and cleanse the data. They are quite paranoid about their backup policies.

Q if they use DNS for failover, do they have a short TTL? Yes, they have it down to about 10 mins. Sadly some devices don’t respect TTL rules.

Manoeuvrable Web Architecture[6]

Michael Nygard (author of “Release it!”)

Agile development works best at the micro scale. It won’t create macro scale agility though.

The term “agility” comes from John Boyd, the fighter pilot from Vietnam who was considered ham fisted but was a brilliant theorist. He could get on someone’s six within 40 seconds on a regular basis. His later work is better known than his earlier work.

He was very good at introspection, and he wrote down how to dogfight: what made for a successful combat manoeuvre. Basically it came down to rapid transfer of kinetic to potential energy and vice versa. This was his Energy-Manoeuvrability theory (EM), and fast transients in manoeuvres were key to success.

He decided to calculate EM values for the contemporary US and USSR aircraft on computers, and discovered that the US aircraft inventory was inferior in almost every regard. This didn’t win him many friends, and he was assigned to the Pentagon to weigh him down with paperwork, but the “fighter mafia” there embraced him and fought for aircraft they wanted to create which matched his theories, like the F-16. They resisted built-in ladders, because they knew that small changes add up. It had a very high thrust to weight ratio so that it could accelerate quickly. It also had wings designed to be high drag, so that it could shed momentum very quickly. It was designed with EM theory in mind.

He later moved on to think about Manoeuvre Warfare, saying that the most important thing is to control the tempo of the engagement. Unlike the popular idea that warfare was about destroying the opponent’s ability to wage war, he focussed on making it impossible for the enemy to bring things to bear. One of the things that he observed was that even something as simple as breaking camp and moving to a new location was dramatically affected by the experience of the units: a six to one ratio in the time taken.

You also want to take initiative. Take the actions that everyone else has to react to.

Observe, Orient, Decide, Act.

We want to be able to learn from these things. A manoeuvrable web architecture will allow us to:

  • Control tempo of engagement
  • Take initiative.
  • Send ambiguous signals so that competitors don’t see where you are going.

So how do we do this?

We can’t do it just by declaring it to be so.

Tempo is the result, an emergent property of your maneuverability. It bursts to make these changes!

The typical IT architecture is awful, a disaster for tempo. There are a few themes which emerge from them.

  • plurality (it is ok to have many ways of using a service, and allow Darwinian evolution to pick the winner)
  • break monoliths
  • use URIs with abandon (trying to get just one perfect API can be a false goal)
  • augment upstream
  • contextualise downstream

These are not patterns, but some are still open to debate.

UIs need to be abstracted more

There was a company working in 100+ countries which needed to have separate UIs, which all invoked country-specific services and all knew about each other and what they required. One way to address that problem is to get the UI to ask the back end what it needs to know rather than knowing about domain constructs. Generic UI plus semantic HTML plus unobtrusive JavaScript is better. Adding in SSR or CSR with a CMS is better.

A component and glue model is good. Scripts addressable by URL and dynamic deployment of scripts. Every modified script gets a new URL so that old stuff doesn’t break users of old script (they might be relying on a ‘bug’ in the old implementation).

Immutable values are a good policy at the large scale as well as the small scale (as done in Clojure). This is needed because it is impossible to enforce a global time across a large enterprise.

We need to separate value semantics and reference semantics. Values don’t change; references have atomic changes with no observable intermediate states.

Example: perpetual string. Stores strings forever.
URL is SHA-256 hash of the string.
Use for scripts, legal text. Edit the string, get a new URL.
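
A minimal sketch of the perpetual string idea, assuming an in-memory dictionary in place of a real service and a made-up base URL. The only part taken from the talk is that the address is the SHA-256 hash of the content.

```python
import hashlib


class PerpetualStringStore:
    """In-memory stand-in for a write-once string service."""

    def __init__(self, base_url="https://strings.example.com/"):
        self.base_url = base_url  # hypothetical host, for illustration only
        self._values = {}

    def put(self, text: str) -> str:
        """Store the text and return its URL, derived from its content."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        self._values.setdefault(digest, text)  # never overwrites an entry
        return self.base_url + digest

    def get(self, url: str) -> str:
        """Fetch the text that a URL was minted for."""
        return self._values[url.rsplit("/", 1)[-1]]


store = PerpetualStringStore()
v1 = store.put("Terms and conditions, version 1")
v2 = store.put("Terms and conditions, version 2")
print(v1 != v2)          # editing the string produced a new URL
print(store.get(v1))     # the old URL still returns the old text
```

Because a URL can only ever mean one string, anything that linked to the old legal text or script keeps getting exactly what it linked to.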


What else could we make into a value? What about a shopping cart?

A cart is a number
Add: function from cart, item, qty to cart.
Remove: function from cart, item to cart.

There is a universe of potential shopping carts, and everyone with the exact same items has the exact same cart. Add an item and you go to a new cart.
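
One way to picture a cart as a value is sketched below. The talk describes a cart as a number; here a SHA-256 digest of the contents stands in for that number, which is my own assumption, as are the function names.

```python
import hashlib
import json


def cart_id(items: dict) -> str:
    """Derive the cart's identity purely from its contents."""
    canonical = json.dumps(sorted(items.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def add(cart: dict, item: str, qty: int) -> dict:
    """Function from (cart, item, qty) to a new cart; the input is untouched."""
    new_cart = dict(cart)
    new_cart[item] = new_cart.get(item, 0) + qty
    return new_cart


def remove(cart: dict, item: str) -> dict:
    """Function from (cart, item) to a new cart without that item."""
    return {k: v for k, v in cart.items() if k != item}


empty = {}
alice = add(add(empty, "book", 1), "pen", 2)
bob = add(add(empty, "pen", 2), "book", 1)
print(cart_id(alice) == cart_id(bob))  # same items, same cart
```

Nothing in the cart says who owns it; a customer would simply hold a reference to whichever cart value they are currently at.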

This gives a better separation of concerns, as the traditional method couples the cart to a single owner. [Personally I don’t think this makes as much sense as treating a shopping cart as an order that just hasn’t been submitted yet, and as such is rightly coupled to the customer]

Generalised minimalism

Feature: send email to customer ahead of credit card expiration.

Completed solution had user table, warning table, card table and a daily job that wakes up, scans cards, creates warnings, sends emails and checks for bounces.

The solution works. But it isn’t very composable. You cannot reuse functions.

A better design has several small services called:
At: at date time, call this URL.
Template: accept body and params to format text.
Lead time: generate series of date times
Mailer: send email to address, track bounces

On the surface it appears more complex, but it is all small simple components. You can’t see the feature up front, but it emerges from interaction of these simple parts.
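
To show how the feature might emerge from those four parts, here is a sketch with each service reduced to a small in-process function or class. The signatures, the callback URL and the sample data are all hypothetical; real versions would be separate HTTP services.

```python
from datetime import datetime, timedelta
from string import Template


def lead_times(deadline, days_before):
    """Lead time: generate a series of date times ahead of a deadline."""
    return [deadline - timedelta(days=d) for d in sorted(days_before, reverse=True)]


def render(body, params):
    """Template: accept body and params, return formatted text."""
    return Template(body).substitute(params)


class At:
    """At: remember 'at this date time, call this URL'."""

    def __init__(self):
        self.schedule = []

    def register(self, when, url):
        self.schedule.append((when, url))

    def due(self, now):
        return [url for when, url in self.schedule if when <= now]


class Mailer:
    """Mailer: send email to an address (this stand-in only records it)."""

    def __init__(self):
        self.outbox = []

    def send(self, address, text):
        self.outbox.append((address, text))


# The card expiry warning is just glue between the generic parts.
expiry = datetime(2014, 6, 30)
at = At()
for when in lead_times(expiry, [30, 7, 1]):
    at.register(when, "/mailer/send?customer=42")  # hypothetical callback URL

mailer = Mailer()
text = render("Dear $name, your card expires on $date.",
              {"name": "Ann", "date": expiry.date()})
for url in at.due(datetime(2014, 6, 1)):
    mailer.send("ann@example.com", text)
print(mailer.outbox)
```

None of the four parts knows anything about credit cards, so each can be reused for a different feature.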

Do identifiers better

Too many identifiers are too context aware.

Ideally we would like an unlimited number of catalogues. Every service issues identifiers, no restrictions on use.

Policy proxy

The client can only access their own catalogue, the proxy checks that it is using the correct id. This could then be used in front of many different services.
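
A minimal sketch of such a proxy, assuming the backend is any callable that takes a catalogue id and a path, and that grants are held in a simple dictionary. Both are illustrative choices of mine rather than anything described in the talk.

```python
class Forbidden(Exception):
    """Raised when a client asks for a catalogue it has not been granted."""


class PolicyProxy:
    def __init__(self, backend, grants):
        self.backend = backend  # any callable taking (catalogue_id, path)
        self.grants = grants    # client name -> set of catalogue ids it may use

    def handle(self, client, catalogue_id, path):
        # The policy check lives here, so the backend stays free of
        # any knowledge about who owns which identifier.
        if catalogue_id not in self.grants.get(client, set()):
            raise Forbidden(f"{client} may not use catalogue {catalogue_id}")
        return self.backend(catalogue_id, path)


def catalogue_service(catalogue_id, path):
    """Stand-in backend that serves any catalogue to anyone who asks."""
    return f"catalogue {catalogue_id}: {path}"


proxy = PolicyProxy(catalogue_service, {"acme": {"c-17"}})
print(proxy.handle("acme", "c-17", "/items"))
```

Because the proxy only needs the grants table and a callable, the same class could sit in front of other services that issue their own identifiers.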

Faceted identity brings together all the different identities. The user has links to ids issued by other services. This allows you to have different access paths to get to the same thing. The relationships are all externalised.

Explicit context
URLs, state machines, reply-to-query
Implicit context
Bare identifiers, state names, assumed channel

Other interesting ideas (less tried and trusted)

Allowing use without permission, but you can cut off someone who is abusing your system.

Half duplex testing
Set up mock, set up call, make call, assert, verify mock was called: this is wrong.

Separate the mock verification from the call assert.
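
My reading of that separation, sketched with Python's unittest.mock: one test looks only at what the caller asks its collaborator for, the other looks only at how the caller handles a canned reply. The function under test and its collaborator are invented for the example.

```python
from unittest.mock import Mock


def price_with_tax(rates_client, country, net):
    """Hypothetical caller under test: looks up a tax rate and applies it."""
    rate = rates_client.rate_for(country)
    return round(net * (1 + rate), 2)


def test_outbound_request():
    # Half one: only check what the caller sends to its collaborator.
    client = Mock()
    client.rate_for.return_value = 0.0
    price_with_tax(client, "GB", 10.0)
    client.rate_for.assert_called_once_with("GB")


def test_inbound_response():
    # Half two: only check how the caller handles a canned reply.
    client = Mock()
    client.rate_for.return_value = 0.2
    assert price_with_tax(client, "GB", 10.0) == 12.0
```

Each test can then fail for exactly one reason: the request was wrong, or the response was mishandled.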

Ideally you want every group in the company to be able to work autonomously, and so we want to create systems to make this easy.


[1] http://qconlondon.com/dl/qcon-london-2014/slides/BrandonByars_EnterpriseIntegrationUsingRESTACaseStudy.pdf

[2] http://qconlondon.com/dl/qcon-london-2014/slides/AdrianCockcroft_MigratingToMicroservices.pdf

[3] http://www.amazon.co.uk/The-Phoenix-Project-Helping-Business-ebook/dp/B00AZRBLHO

[4] Interesting note – this wouldn’t give them the percentage they would expect if their customer ids are numeric because of the law of large numbers. I imagine they used something more accurate in real life

[5] “Organisations which design systems are constrained to produce systems which are copies of the communication structures of these organisations” – M E Conway.

[6] http://gotocon.com/dl/goto-berlin-2013/slides/MichaelT.Nygard_ManeuverableWebArchitecture.pdf (qcon ones are protected at the moment for some reason, these are identical)