Unimpressed by Apple 2016

That must have been the least inspiring launch of an iPhone ever. Since the iPhone 4 the even years have seen a change in form (always with more and/or better screen) and the odd years have typically introduced better internals (improved camera, processor, touch sensor).

Yesterday’s event seemed like an odd year event.

Form Factor?

Essentially unchanged. There seem to be minor changes to the casing, but it has the same size and resolution.

  • 4: introduced retina screen
  • 5: introduced longer screen (extra row of icons)
  • 6: introduced bigger screen (and giant plus screen)
  • 7: introduced… no change

General improvements?

  • Improved speed
  • Improved battery life
  • Improved graphics
  • Improved camera
  • Home button improvement
  • Slightly more water resistant

I hate to say it, but the 6s gave us pretty much that and more – all those improvements (including fingerprint recognition on the home button becoming lightning fast), plus the whole 3D Touch thing, which opened up whole new vistas of activity. The 7 is incrementally better in all those areas, but why wouldn’t it be?


I see nothing in the 7 which makes a purchase even slightly appealing over the 6s I have right now.

I’m not even sure that it is a compelling upgrade from the 6, to be honest. Might an upgrader just be better off going for a cheaper 6s?

Demos of a Mario platform game and a beat ’em up, both of which appeared to have gameplay from a decade or more ago, were uninspiring, as was a Pokémon app for the Watch arriving after the crest of that wave has broken.

So what is going on?

I’m no Apple analyst, but I wonder whether they are holding back major changes until 2017, their 10th anniversary?

I hear that iPhone sales haven’t been stellar this year (I’ve no idea how true that is), but perhaps a speed-bump release this year means there are more people ready for a big step up in capability this time next year? Three years’ worth of customers ready for the next big change in form factor or capability in a phone? I could imagine that as a plan, because if they did something special for the tenth anniversary they would want sales to be knocked out of the park.

Other hardware news

It was also disappointing to see no other exciting novel hardware news. Last year we had the iPad Pro and Apple Pencil. The year before that we had the Apple Watch. Are they not working on anything aspirational right now?

It doesn’t necessarily have to be a whole new category, but surely something where at the end of the presentation we are saying “I can’t wait to get my hands on that!”

Have they just lost the art of the showstopper? Has losing the Steve Jobs “one more thing” reveal turned into having nothing much to reveal at all? That would be a sad day in Apple’s history if it has arrived.

QCon 2015 – Architectures at scale


I took the following notes from a series of lectures about large scale architectures while at QCon 2015. I’m reproducing them here for easy reference and in case anyone else also finds them helpful or interesting.

Scaling Uber’s Realtime Market Platform

Matt Ranney, Uber. This is the first time they have talked about their architecture in public. Their supply and demand picture is very dynamic, since their partners can do as much or as little as they want. The demand from passengers who want rides is also very dynamic. He will use the terms supply and demand but doesn’t lose sight of the fact they are people. The mobile phone is the interface for their dispatch system. He shows an animation of London routes on New Year’s Eve, their busiest time.

Partners and riders drive it all, and talk to dispatch over mobile internet data. Below that are maps/ETA, services and post trip processing. Below that are databases and money. They use Node.js, Python, Java and Go. Lots of technology decisions were made in order to allow them to move quickly, all basically in service of mobile phones. Dispatch is almost entirely written in Node.js, but they now plan to move to io.js. One of the things that they have found is that the enthusiasm of their developers enables lots of work to be done. Don’t underestimate the benefit of enthusiastic developers who enjoy working with a technology.

The maps service handles maps, routes, route times etc. The business services are written mostly in Python. They have many databases. The oldest stuff is Postgres; they also have Redis, MySQL, their own db and Riak. The post trip processing pipeline has to do a bunch of stuff (in Python) – collect ratings, send emails, arrange payment etc.

The dispatch system has just been totally rewritten. Even though Joel Spolsky’s cautionary tale about rewrites[1] is well known, there were lots of problems with the existing system. In particular, the assumption of one driver and one rider was baked deep into everything. That stopped their ability to go to other markets like boxes or food, or to have multiple riders sharing. They also sharded their data by city, which wasn’t sustainable, and had many points of failure that could bring the whole thing down. So even bearing in mind that cautionary tale, rewriting from scratch was still the right thing to do.

The new dispatch system

They generalised the idea of supply and demand, with services whose state machines kept track of everything about them. Supply contains all kinds of attributes – is there a child seat, the number of seats available, wheelchair carriage, car share preferences. A service called disco (“dispatch optimisation”) matches things up. Their old system only dealt with available cars, but this allows a wider view. They added geotag by supply, geotag by demand and routing/ETA services. The disco service does a first pass to get nearby candidates, then checks with the routing service to see if there are river boundaries or similar things which rule out some candidates. Some places, like airports, have queuing protocols. The geospatial index is cool. It has to be super scalable, with 100 million writes per second and many more reads. The old one only tracked dispatchable supply, so a global index was easy; the new world needs all supply in every state, plus their projected route into the future. So they use the S2 library from Google[2], which really helps with spherical geometry. Each lat/long maps to a cell id, which represents a box on the surface of the earth. An int64 can represent every cm on earth! They use level 12, which is about a km on a side. There is an interesting property – you can quickly get the coverage for a shape: a 1km radius around our current location is covered by five cells. The cell id is used as a shard key, and it can scale out by adding new nodes. Sqmap.com is good for exploring this.
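The cell-id-as-shard-key idea can be sketched with a toy flat grid. This is not Google’s S2 (which uses a space-filling curve over a sphere); the cell size and shard count are invented for illustration:

```python
# Toy illustration: map a lat/long to a fixed-size grid cell id and use
# that id as a shard key, so nearby points land on the same shard.
# NOT the real S2 library - just a flat equirectangular grid.

CELL_DEG = 0.01  # roughly 1 km at the equator, akin to an S2 level-12 cell

def cell_id(lat: float, lon: float) -> int:
    """Pack the grid coordinates of (lat, lon) into a single integer id."""
    row = int((lat + 90.0) / CELL_DEG)
    col = int((lon + 180.0) / CELL_DEG)
    return row * 36000 + col  # 36000 columns of 0.01 degrees around the globe

def shard_for(lat: float, lon: float, num_shards: int) -> int:
    """Use the cell id as the shard key."""
    return cell_id(lat, lon) % num_shards

# Two points a few metres apart share a cell, hence a shard.
a = shard_for(51.5074, -0.1278, 16)   # central London
b = shard_for(51.5075, -0.1279, 16)
```

Adding nodes then only means remapping cell ids to shards, not reindexing individual vehicles.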


They wanted to reduce wasted time and resources, help drivers to be productive, reduce extra driving, and provide the lowest overall ETAs. While normally the demand side searches currently available supply (which works pretty well), they decided to also allow it to search drivers who are on a trip but might be nearer and available within a better timescale. This gives a better result for the demand, and also means less unpaid drive time for another driver. It is like the travelling salesman problem at an interesting scale in the real world.

So how do they get scalability?

There are more ways to do this than theirs; their way might seem crazy but it works for them. Keeping all state in the database is slow and expensive, so they decided to write stateful services. So how does that get scaled? All processes are treated the same; they have written some software to provide application-layer sharding to their services in a fault tolerant and scalable manner. This is called ringpop[3] and is open source on GitHub. It is optimised for uptime rather than correctness. It uses the SWIM gossip protocol to maintain a consistent view across nodes (see the paper by Abhinandan Das et al for more details[4]). This makes it easy to scale the services by adding machines.

They also created a network multiplexing and framing protocol for RPC called TChannel[5], which they have also open sourced. This was designed for high performance across all languages (especially JavaScript and Python – they wanted to get Redis-level performance from Node.js). They wanted high performance forwarding support and proper pipelining (where every client is also a server), checksums and tracing baked into every request, and a clean migration path away from JSON and HTTP to Thrift and TChannel.
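The application-layer sharding could look roughly like the consistent hash ring below – a hypothetical sketch, not ringpop’s actual implementation (which adds SWIM membership, tuning and fault tolerance on top):

```python
# A minimal consistent hash ring: every process hashes keys and node
# names the same way, so they all agree on which node owns a key
# without consulting a central registry.
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, replicas=100):
        self._ring = []  # sorted list of (point, node)
        for node in nodes:
            for i in range(replicas):  # virtual nodes smooth the distribution
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        """Walk clockwise from the key's hash point to the first node."""
        points = [p for p, _ in self._ring]
        i = bisect.bisect(points, self._hash(key)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("driver:12345")  # every process computes the same owner
```

When a node joins or leaves, only the keys adjacent to it on the ring move, which is what makes “scale by adding machines” cheap.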


Everybody wants availability but some want more than others. Banks don’t seem to care so much about their service availability and are quite happy with planned downtime, since their customers rarely move elsewhere. Uber, on the other hand, find that if they are down the riders (and their money) go straight to another ride company. Availability matters a lot to Uber. So everything is retryable. It must be made retryable, even if that is really hard to do; they spent a lot of time figuring out how to ensure everything is idempotent. Everything is killable (chaos monkey style). Failure is common, and they exercise it often. They don’t have graceful shutdowns, only crash-only shutdowns. They also want small pieces to minimise the cost of individual things dying.
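One common way to make an operation safely retryable – sketched here with an invented PaymentService; the talk didn’t describe Uber’s actual mechanism – is an idempotency key recorded alongside the result:

```python
# The client attaches an idempotency key to each request; the server
# remembers completed keys and replays the stored result on a retry,
# so the side effect is applied exactly once.
import uuid

class PaymentService:
    def __init__(self):
        self._seen = {}   # idempotency key -> previous result
        self.charged = 0

    def charge(self, amount: int, idempotency_key: str):
        if idempotency_key in self._seen:       # a retry: replay the result
            return self._seen[idempotency_key]
        self.charged += amount                  # apply the effect exactly once
        result = {"status": "ok", "amount": amount}
        self._seen[idempotency_key] = result
        return result

svc = PaymentService()
key = str(uuid.uuid4())
svc.charge(500, key)
svc.charge(500, key)   # client timed out and retried - safe
```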

Cultural changes

No pairs of things like a primary/backup database, because pairs don’t recover well when one member is randomly killed. “Kill everything” means they can kill database nodes too, which changes some of their database choices: killing Redis is expensive, killing ringpop is fine. Originally they had services talking to each other via independent load balancers. But what happens when a load balancer dies? They decided that the clients need some intelligence about how to route around problems, so they ended up making a service discovery system out of ringpop. This is just getting into production and will be open sourced soon.

Horrible problems? Latency

Overall latency is greater than or equal to the latency of the slowest component. If p99 is 1000ms, then using 1 component you know that 1% of requests take at least 1000ms. But if a request touches 100 such components, then 63% of requests would see a 1000ms latency. Note: he didn’t go into this in more detail, but I’m unsure about his maths here – it would be true if the 100 services were ‘AND’ed together, but in his service are they not ‘OR’ed together? A good way to solve this latency is “backup requests with cross service cancellation”[6]. Service A sends a request to service B(1), telling it that the request is also being sent to service B(2). After 5ms, Service A sends the same request to service B(2), which says that it was also sent to B(1). Whichever Service B responds first then sends a cancel to the other server.
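The backup-request pattern can be sketched with asyncio; the replica names, simulated delays and the 5ms hedge are stand-ins, and the in-process `cancel()` stands in for the cross-service cancellation message:

```python
# Send to one replica; if it hasn't answered within a short hedge delay,
# send the same request to a second replica. The first response wins and
# the loser is cancelled, which bounds the tail latency.
import asyncio

async def call_replica(name: str, delay: float) -> str:
    await asyncio.sleep(delay)   # simulated service latency
    return name

async def hedged_request(hedge_after: float = 0.005) -> str:
    fast = asyncio.create_task(call_replica("B1", 0.05))   # slow tonight
    done, _ = await asyncio.wait({fast}, timeout=hedge_after)
    if done:
        return fast.result()                               # no hedge needed
    backup = asyncio.create_task(call_replica("B2", 0.001))
    done, pending = await asyncio.wait({fast, backup},
                                       return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()            # "cross service cancellation"
    return done.pop().result()

winner = asyncio.run(hedged_request())
```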

Horrible problems? Datacentres failure

Datacentre failure is harder to prepare for and happens less frequently. Failover of data is easy enough to arrange, but how do you handle all the in-process trips – the ones currently on the road? The failover datacentre just doesn’t have the data, as you can’t be replicating all the data all the time. Their solution is to use their driver partners’ phones. The server periodically creates an encrypted digest of information and sends it back to the phone. If failover happens, the phone finds itself talking to a new server, which doesn’t recognise the current location of the phone and requests the encrypted digest so that it can bring itself up to date.
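A rough sketch of that digest round trip. The use of an HMAC over a JSON blob is my assumption for illustration – the talk only said the digest is encrypted, not how:

```python
# The server periodically hands the phone a signed digest of the
# in-progress trip; a failover server that doesn't know the trip can
# verify the digest and rebuild the state from it.
import hashlib, hmac, json

SECRET = b"shared-server-key"   # hypothetical key shared between servers

def make_digest(trip_state: dict) -> dict:
    blob = json.dumps(trip_state, sort_keys=True)
    sig = hmac.new(SECRET, blob.encode(), hashlib.sha256).hexdigest()
    return {"blob": blob, "sig": sig}   # stored on the phone

def restore(digest: dict) -> dict:
    expected = hmac.new(SECRET, digest["blob"].encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, digest["sig"]):
        raise ValueError("digest tampered with or from an unknown server")
    return json.loads(digest["blob"])   # failover server rebuilds the trip

trip = {"trip_id": "t-1", "rider": "r-9", "lat": 51.5, "lon": -0.12}
recovered = restore(make_digest(trip))
```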

Microservices Are Too (Conceptually) Big

Philip Wills, the Guardian, @philwills. Five years ago they relaunched the Guardian site as a shiny new monolith – which shows that everything that starts off as new and shiny becomes legacy over time! Over the last five years the Guardian has broken down their monolith into microservices. Microservices are too big when we conflate solving two different kinds of problems:

  • independent products
  • single responsibility applications

Why micro services?

The Guardian is owned only by the Scott Trust, which is there to safeguard their journalistic freedom and liberal values. That has enabled their investment in technology. Last week they deployed forty applications to production. They produce the website, the CMS, and the reporting services.

What is a microservice? Martin Fowler’s article[7] makes it clear that it isn’t a monolith, and not an ESB (since it should rely on dumb pipes), but the article is a little woolly on details. We do want to deliver business innovation though. One of the things which is valuable is independent teams, masters of their own destiny.

In 2008 they had built a nice chunky monolith with Thoughtworks. A short while after that they ran a great hack day, but they could never get any of their innovative ideas into production – the organisational complexity was too high. They never got into a Gantt chart of doom, but they did have a full time release manager for their fortnightly releases. They wanted to limit the scope of failures, so they made changes to their applications: micro apps. They put placeholders into the monolith so that it would get things from somewhere else, and could break some things out of the monolith this way.

Then the big WikiLeaks story happened. They had arranged an online Q&A with Julian Assange, but they didn’t have a real solution for it – so they used their online commenting system, and it was a disaster. The commenting use case was that almost everyone just looked at the first page of comments, but now everyone asked for all the comments. It didn’t fall over completely, but it went really slowly, all the monolith threads got tied up and the whole site ground to a halt. As a result they decided that they needed to allow things to fail independently; otherwise people fear change. If you are going to push the boundaries you have to be aware that some innovations are not going to work. That also means that you need to be able to kill things: a clean software death which leaves nothing behind. Their first commenting system used “Pluck”, and moving that out of the monolith was a pain. Several times they tried to get rid of everything and kept having to go back to kill more off. It was very difficult to reason about whether it could be cleanly extracted.

Independent Products

Keeping teams in line with their products requires a nice, stable, well defined interface. JSON is a really poorly defined data interface, and they are looking at moving to Thrift. They like the strong typing of Scala – why not have that in their messages too? These should be deployable independently, and moves forward should always be backwards compatible, although keeping stable interfaces is hard. For an independent product you must own your own data store: no integration between teams on the basis of a database. This allows for independence but doesn’t speak to their other goals.

Single responsibility applications

Following the meme others have used this week, microservices should be “small enough to fit in your head”. More usefully, they decide that a well partitioned service has one key metric which tells you whether it is doing the job it should do. Services must be isolated from each other and must not impact each other’s performance, so they run everything in AWS, and most operational problems are solved by either turning it off and on again or throwing more hardware at it. How do they actually structure things?

  • Website
  • apps
  • content api
  • composer
  • asset management
  • workflow
  • analytics

There is a distinction between how some of these work. Wherever they can break synchronous dependencies they do, and they keep dependency chains as short as possible. Composer to content API uses idempotent messages going over a service bus for async, so it can’t affect the public site. As a specific example, their sporting match report page is served by a number of distinct apps composed within the page: one is the core article, one is stats, one is the comment system. There is no cross talk between systems, to avoid cascading failures or one problem affecting another, and they pull the additional data down through JavaScript too – much less impact if something goes wrong in one area.

Their analytics system is called Oban. Their dashboard gets info from Elasticsearch, which gets info from their other services. In terms of releases, the dashboard gets modified daily, the loader gets changed much less frequently (say once a month). They are already looking at changing some aspects of these, moving to Amazon’s Kinesis. The loader’s key metric is “unprocessed messages” and it auto scales based on the length of the queue. The website’s key metric is response time to users.

Microservices are not a silver bullet and don’t solve all problems, but we need to think about the problems that we want to solve, and he certainly prefers to see microservices. Regarding cross cutting services, they try to use Amazon’s implementations, and avoid shared libraries except as a last resort.
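Autoscaling on a single key metric like “unprocessed messages” can be sketched as a simple target function; the thresholds here are invented, as the Guardian’s actual policy wasn’t described:

```python
# Drive the desired instance count from the queue length alone:
# target enough instances that each handles ~per_instance messages,
# clamped between a floor and a ceiling.
def desired_instances(unprocessed: int, per_instance: int = 1000,
                      min_n: int = 1, max_n: int = 20) -> int:
    target = max(min_n, -(-unprocessed // per_instance))  # ceiling division
    return min(max_n, target)

# Queue builds up -> scale out; queue drains -> fall back to the floor.
```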

Rebuilding Atlas – Advertising At Scale At Facebook

Jason McHugh, Facebook. Facebook purchased Atlas from Microsoft, and the contract included Microsoft continuing to run it on their infrastructure for a couple of years. Atlas was a fourteen year old company.

Ad serving tech

The vision of ads at Facebook is that they don’t have to suck – can they make them a positive experience? It is a huge industry with a vast budget, and digital advertising is fast growing. Among the other advertising channels only television isn’t falling, although people now spend more time on mobile and online than with TV.

third party ad serving

Advertisers think of the people hosting their ads as publishers. They use an independent third party to aggregate numbers from many publishers (this is where Atlas comes in). The third party also manages campaigns, creative concepts, and action tags for what the customer is doing. Advertisers get snippets they give to publishers, and action tags in their own site to understand conversions. When a request for an ad is made, the ad server does an identity resolution, deduces some demographics and then puts together the advertiser’s creative content. Traffic patterns, probabilistic matching and machine learning are used to guess who people are.

The challenges

Understanding the systems was complex; none of the devs had any experience in this business area. They also had to get to grips with the existing technology stack, architecture, the data model and databases (19 separate ones). One db had 345 tables with nearly 4000 columns; they couldn’t even answer some of the WTH questions. It was deployed on several thousand machines across many data centres. It was a huge, mature product. Which subset did they want to implement?

  • third party ad serving, definitely
  • many other things

Lift and shift is a common approach to acquisitions – an evolutionary approach. However, they didn’t want to do this. The hardware was old and owned by Microsoft, and the system was built on closed Microsoft technologies, counter to their overall approach at Facebook.


They took nothing from the original. This is their new high level architecture. He then looked at the physical architecture of ad delivery and logging. User traffic goes to DNS (anycast) to see which cluster the request should be sent to; Cartographer pushes new maps to DNS as it determines changes in the best routing right now. Then a cluster switch receives a request and uses a hash to route to a layer 4 load balancer, which uses consistent hashing[8]. This then routes to Proxygen, a layer 7 load balancer, which sends the request on to the Atlas delivery engine.

As an ad is delivered to the customer there is also a real time pipeline for processing everything which has happened: Atmonitor (fraud detection of nonhuman activities), Limiter (was it legal from a billing perspective), report preprocessing (messages sharded by strongest identity, second strongest identity and placement, stored in a Presto db), Aggregator (roll up sums), invoicing etc.

“Scribe” is a hugely important component: a high throughput message queue, highly scaled. It is not lossless but its guarantees are excellent. It decouples producers, is persistent for n days, and supports sharded consumption with checkpoint streams at a fixed periodicity. Message queues can be costly. Repeatable re-execution is an important ability; the need to be able to find the repeated messages or larger units of work amongst billions is tricky.
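The checkpointed, sharded consumption that makes re-execution repeatable can be sketched like this – a toy model, not Scribe’s actual design:

```python
# A consumer reads one shard of an ordered message stream and durably
# records a checkpoint offset at a fixed periodicity, so after a crash
# it re-executes from the last checkpoint rather than from the start.
class ShardConsumer:
    def __init__(self, stream, checkpoint_every=3):
        self.stream = stream          # the shard's ordered message list
        self.checkpoint = 0           # offset durably recorded so far
        self.processed = []
        self._every = checkpoint_every

    def run(self, upto=None):
        """Resume from the last checkpoint; checkpoint every N messages."""
        end = len(self.stream) if upto is None else upto
        for offset in range(self.checkpoint, end):
            self.processed.append(self.stream[offset])
            if (offset + 1) % self._every == 0:
                self.checkpoint = offset + 1   # pretend this write is durable

messages = [f"m{i}" for i in range(10)]
c = ShardConsumer(messages)
c.run(upto=7)            # "crash" after 7 messages; checkpoint sits at 6
c.processed = c.processed[:c.checkpoint]  # work since the checkpoint is lost
c.run()                  # re-execution repeats from m6, not from m0
```

This is why downstream processing has to tolerate repeated messages – everything between the checkpoint and the crash is replayed.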

Lessons Learned

The only mistake which they admitted to was wasting effort minimising the code in their www tier without considering what other teams were doing in that domain. In the last two years there have been huge improvements there, and it would have been better if they had built for what was coming rather than what was there at the time.

Service Architectures At Scale: Lessons From Google And Ebay

Randy Shoup. He aims to give an impression of what these architectures look like in the large and feel like in the small.

Architecture Evolution

They all end up looking somewhat similar. eBay started as a monolithic Perl app written over a weekend, then became a C++ app with 3 million lines of code, then a Java monolith and now microservices. Twitter was a monolithic Rails app which changed to a Rails plus Scala monolith, then to microservices. Amazon was a monolithic C++ app which changed to Java/Scala and then to microservices.

Ecosystem of services

Hundreds to thousands of independent services; not tiers but ecosystems with layers of services, a graph of relationships. They are evolution rather than intelligent design. There has never been a top down view at Google; it is much more like variation and natural selection. Services justify their existence through usage. Architecture without an architect: there is no such role and no central authority. Most decisions are made locally rather than globally, and the appearance of clean layering is an emergent property. eBay, by contrast, had an architecture review board which had to pass everything, usually far too late to be valuable. Randy worked on cloud datastore in App Engine at Google. This was built upon a tower of prior services, each of which was added when something new was needed at a higher level of abstraction. Standardisation can happen without central control:

  • standardised communication
    • network protocols (stubby or rest)
    • data formats (protobuf or json)
    • interface schema
  • standardised infrastructure
    • source control
    • config management
    • cluster management

Standardisation is encouraged via libraries, support in underlying services, code reviews and searchable code. There is one logical code base at Google that anyone can search, to find out whether any of the 10,000 engineers are already working on something. The easiest way to encourage best practices is with actual code: make it really easy to do the right thing. There is independence of services internally – no standardisation around programming languages, for instance: four or more languages, many frameworks etc. They standardise the arcs of the graph, not the nodes. Services are normally built for one use case and then generalised for others; pragmatism wins. E.g. Google File System, Bigtable, Megastore, Google App Engine, Gmail. They deprecate old services: if one is a failure or not used any more, they repurpose the technology and redeploy the people – e.g. Google Wave was cancelled, and core services go through multiple generations.

Building a service

A good service has:

  • single purpose
  • simple, well defined interface
  • modular and independent
  • isolated persistence(!)

The goals of a service owner are to meet the needs of their clients in functionality, quality, performance, stability, reliability and constant improvement over time – all at minimum cost and effort to develop and operate. They have end to end ownership of the service through to retirement: you build it, you run it. They have the autonomy and authority to choose tech, methodology etc. You are focussed primarily upward to the clients of your service and downward to the services you depend upon, which gives a bounded cognitive load. Small, nimble teams build these services, typically 3-5 people – teams that can be fed by two large pizzas. Think about service to service relationships as vendor-customer relationships: friendly and cooperative but structured, and clear about ownership. The customer can choose to use the service or not. SLAs are provided – promises of service levels which can be trusted. Charging and cost allocation turned out to be important to prevent misuse of services. There was one consumer of their service which was using far too high a proportion of their service time. They kept asking the consumer to optimise their usage, but it was never a high priority for them – until they started getting charged for usage, at which point they quickly introduced an optimisation which cut usage to 10% of the previous figure.

  • charge customers for usage of the service
  • aligns economic incentives of customer and provider
  • motivates both sides to optimise for efficiency
  • pre/post allocation at Google

Maintaining service quality benefits from the following approaches:

  • small incremental changes
    • easy to reason about and understand
    • risk of code change is nonlinear in size of change
    • (-) initial memcache service submission
  • solid development practices
    • code reviews before submission
    • automated tests for everything
  • Google build and test system
    • uses production cluster manager (tests are run at a lower priority though!)
    • runs millions of tests in parallel every day
  • backward and forward compatibility of interfaces
    • never break client code
    • often multiple interface versions
    • sometimes multiple deployments
  • explicit deprecation policy

Operating a service

Services at scale are highly exposed to variability in performance. Predictability trumps average performance: low average latency with inconsistent performance is not low latency. The long tail latencies are much more important. The memcache service had periodic hiccups in performance – one in a million, difficult to detect and diagnose; the cause was slab memory allocation. Service reliability:

  • highly exposed to failure
    • sharks and backhoes are the big problems killing cables
    • operator oops (10x most likely)

Resilience in depth. Redundancy, load balancing, flow control. Rapid rollback for oops. Incremental deployment

  • Canary systems
  • staged rollouts
  • rapid rollback

eBay’s feature flags have been rediscovered many times: they separate code deployment from feature deployment. You can never have too much monitoring!
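Separating code deployment from feature deployment can be sketched with a flag check; the in-memory dict flag store here is invented for illustration – in practice it would be configuration pushed at runtime:

```python
# Code for the new path ships dark; flipping the flag "deploys" the
# feature (and flipping it back is the rapid rollback) with no release.
FLAGS = {"new-checkout": False}   # deployed, but dark

def checkout(cart):
    if FLAGS.get("new-checkout"):
        return f"new checkout for {len(cart)} items"
    return f"old checkout for {len(cart)} items"

before = checkout(["shoes"])
FLAGS["new-checkout"] = True      # feature deployment, no code deployment
after = checkout(["shoes"])
```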

Service anti patterns

The mega service.

  • services that do too much
  • too difficult to reason about
  • scary to change
  • lots of upstream and downstream dependencies

Shared persistence

  • breaks encapsulation
  • encourages backdoor interface violations
  • unhealthy and near invisible coupling of services
  • the initial eBay SOA efforts were like this.


Building A Modern Microservices Architecture At Gilt: The Essentials

Yoni Goldberg, Gilt. Gilt is a flash sales company, with limited-time discounted sales resulting in huge spikes at noon each day. There are about 1000 staff, 150 devs. It is a classic startup story: they started fast with Ruby on Rails and a Postgres db. The moment of truth was when they first added Louboutin shoes to the site in 2009. Even with all the extra precautions they had planned, the site still failed dramatically. They needed thousands of Ruby processes, Postgres was overloaded, and routing between Ruby processes was a pain. Thousands of controllers, 200k lines of code, lots of contributors, no ownership. Three things changed:

  1. Started the transition to JVM
  2. Microservice era started
  3. Dedicated data stores for services

Their first ten services were

  • core services (checkout, identity)
  • supporting services (preferences)
  • front end services

These solved 90% of their scaling problems, but not the developers’ pain points. They began the transition to Scala and Play, and now have about 300 services in production.

Current challenges

  • deployments and testing
  • Dev and integration environments
  • service discoverability
  • who owns this service?
  • monitoring

Building the Halo 4 Services with Orleans

Caitie McCaffrey, @Caitie. She worked on Halo 4 services for three years, and talked about architectural issues. Halo 1 was networked only peer to peer, with no services. Halo 2 had some Xbox services for achievements etc., and Halo 3 had more. For Halo 4 they wanted to take the original engine and old services based on physical hardware and build for a more sustainable future. The main services were:

  • Presence (where you are)
  • Statistics (all about your player)
  • Title files (static files pushed to players, games, matchmaking)
  • Cheat detection (analyse data streams, auto ban people, auto mute jerks)
  • User generated content (making maps and game content – a big part of the long tail)

They knew that they had to approach concurrent user numbers of up to 12 million+, with 2 million sold on day 1. They ended up with 11.6 million players online, 1.5 billion games and 270 million hours, with no major downtime.

Architectural Problems

You get a huge spike of users on day 1, and spikes at Christmas or at promotions – the opposite of gradually ramping up. This is why they decided to go with the cloud: worker roles, blob and table storage, and service bus. It had to be always available – the game engine expected 100% availability – with low latency and high concurrency. Typically people start with a stateless front end, stateless middle tier and storage. Then they add a storage cache, but that adds concurrency issues. Other options? They wanted data locality, and the data is highly partitionable by user, so that would be good to do. Hadoop jobs were too slow. So they looked to the actor model[9] paper from 1973[10] for thinking about concurrency. An actor can:

  • send a message
  • create a new actor
  • change internal state
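Those three capabilities can be sketched as a minimal single-threaded actor system – illustrative only; real runtimes like Erlang or Orleans schedule actors concurrently and distribute them across machines:

```python
# An actor processes one message at a time from its mailbox, can change
# its internal state, and can send messages to other actors by name.
from collections import deque

class Actor:
    def __init__(self, system, name):
        self.system, self.name = system, name
        self.mailbox = deque()
        self.state = {}

    def send(self, target: str, message):
        self.system.actors[target].mailbox.append(message)

    def receive(self, message):
        raise NotImplementedError

class CounterActor(Actor):
    def receive(self, message):
        if message == "inc":
            self.state["n"] = self.state.get("n", 0) + 1  # change state

class System:
    def __init__(self):
        self.actors = {}

    def spawn(self, cls, name):                 # "create a new actor"
        self.actors[name] = cls(self, name)
        return self.actors[name]

    def run(self):
        """Deliver mailbox messages one at a time until all are drained."""
        busy = True
        while busy:
            busy = False
            for actor in self.actors.values():
                if actor.mailbox:
                    actor.receive(actor.mailbox.popleft())
                    busy = True

system = System()
counter = system.spawn(CounterActor, "counter")
counter.send("counter", "inc")
counter.send("counter", "inc")
system.run()
```

Because each actor handles one message at a time, the state mutation inside `receive` needs no locks – which is exactly the concurrency property the talk is after.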

This gives stateful services with no concurrency issues. Erlang and others did this, but they wanted to do something in .NET. The Microsoft Research team were building something called Orleans[11]. An Orleans virtual actor always exists: it can never be created or deleted. The runtime manages:

  • perpetual existence
  • automatic instantiation by runtime
  • location transparency
  • automatic scale out

It uses the terminology of “grains” within a “cluster”. The runtime does:

  • messaging
  • hosting
    • connection
    • serialisation
  • execution
  • within the grain you don’t have to worry about concurrency either

Messaging guarantees

  • at least once is out of the box
  • best for app to decide what it should do if it doesn’t get an acknowledgement back.

CAP theorem

Orleans is AP. You can always talk to a grain. You might possibly end up with two instances of the same actor running at the same time.

Programming model

  • .NET framework: C#, F#. The plan is to be .NET Core compatible.
  • actor interfaces
  • promises: they all return promises.
  • actor references: strongly typed, so that it can work out what goes where.
  • turns: grains are single threaded and process messages in a “turn”. You can mark a grain as reentrant, but it still always runs a single thread.
  • persistence: it doesn’t persist any grain state by default. This is hard to solve in the general case, so developers choose for themselves what they want to do.

Reliability is managed automatically by the runtime. If a machine dies, the runtime just brings the grains up on a different box automatically; same if a single grain dies – it just gets reallocated the next time it is needed. They started working with this academic team in summer 2011 to get it up and running; a two way partnership. Ultimately a stateless front end talks to an Orleans silo, which talks to persistent storage.

The Halo 4 statistics service (which she built) works like this. The player grain holds everything about you as a player; it comes up when you go online and is garbage collected later. Game grains hold everything for a game. The Xbox posts stats to the game grain, which writes aggregate data to blob storage, and at the end of the game sends info to each player grain, which writes its state to table storage. The player operations are idempotent and could be replayed as necessary in case of message failure.

Performance and scalability

Not bare metal code, but it can run at 90-95% CPU utilisation stably. They have run load tests at this level over 25 boxes for days. It also scaled linearly with the number of servers. This was in a report published in March this year.

Programmer productivity and performance

They scaled their team from six to twenty. The devs picked it up quickly and were readily available for hire. Distributed systems now are a bit like compilers in the sixties: something that works and is scalable is now possible. Orleans is an early iteration of a tool that really enables us to do that. Easy and performant. Orleans is open source on GitHub[12].


(What was the failure detection method used? – question from Ali) They used a configurable timeout, set low – a couple of seconds – so it could fail fast and a new grain got hydrated.

(What was the latency for serialising the messages for reads?) Typically actors were doing just one hop. It could use any storage (expects Azure at the moment). They knew their read patterns, so they knew fairly well how things were going to be used. They used Service Bus for durable transmission of messages.

(How easy was it to onboard devs to this actor model? Are the actors parent/children, or is finer control needed?) It was really easy to onboard people, as long as they understood async. There is no management of actors at all by the apps.

[1] http://www.joelonsoftware.com/articles/fog0000000069.html
[2] https://code.google.com/p/s2-geometry-library/
[3] https://github.com/uber/ringpop
[4] http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf
[5] https://github.com/uber/tchannel
[6] https://www.google.co.uk/webhp?sourceid=chrome-instant&rlz=1C1CHFX_en-gbGB557GB557&ion=1&espv=2&ie=UTF-8#q=backup%20requests%20cross%20service%20cancellation
[7] http://martinfowler.com/articles/microservices.html
[8] http://en.wikipedia.org/wiki/Consistent_hashing
[9] http://en.wikipedia.org/wiki/Actor_model
[10] http://ijcai.org/Past%20Proceedings/IJCAI-73/PDF/027B.pdf
[11] http://research.microsoft.com/en-us/projects/orleans/
[12] https://github.com/dotnet/orleans

Migrating to Microservices – QCon London 2014

I was fortunate to be able to attend QCon 2014 in London this year, and I took quite comprehensive notes from the key sessions which I attended. I’ve collected here the notes on the general principle of migrating to microservices. Anything which appears [in square brackets] is my own editorial comment. I intend to write something later to bring together a consistent view for myself on how to move a monolithic application into services and microservices in the medium to long term.

Enterprise REST – a case study[1]

Brandon Byars (Thoughtworks)

Using REST at large scale in a telecoms company.

The eight fallacies of distributed computing

  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology doesn’t change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.

We are moving from plain-old-XML RPC style to URIs, then to HTTP, and heading towards HATEOAS.

URIs are about resources; HTTP is a standard interface. HATEOAS is forcing people to use your hyperlinks.
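To make the HATEOAS point concrete, here is a hypothetical hypermedia-style response (field names and URIs invented for illustration): the representation carries links naming the valid next transitions, so clients follow links rather than constructing URIs themselves.

```python
# Hypothetical order resource with hypermedia links.
order = {
    "id": 42,
    "status": "open",
    "_links": {
        "self":   {"href": "/orders/42"},
        "cancel": {"href": "/orders/42/cancel"},  # only offered while open
        "pay":    {"href": "/orders/42/payment"},
    },
}

# A client looks up links by relation name instead of hard-coding paths.
assert order["_links"]["pay"]["href"] == "/orders/42/payment"
```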

What he has observed is that most arguments occur at the upper couple of levels: PUT vs PATCH, parameterisation etc.

Another observation is that those areas are not where the mistakes tend to happen. Most errors are in the versioning, deployment, testing, service granularity and design of URIs.

Concentrating here on enterprise APIs, this is a story about a billing system.

He worked for three years at a telecoms company that wrote a billing system in the eighties. It enabled them to grow strategically, and was useful for a long time. The consequence, though, was that their billing system had become very complex. Static analysis of their code was worrying: typically a cyclomatic complexity of more than 10 is a problem; theirs was 1000…

So this is a story about a legacy rewrite. They agreed on doing web-based integration. It was before microservices were being considered, so the services they were creating were quite large. They had to support a customer-facing web UI and some customer support UIs.

They broke it into a dozen services.

They wanted choreography rather than orchestration. The latter is big SOA; it simplifies the architecture diagram – but not the actual architecture! The former allows business value to emerge from the interactions of the services. It is harder to diagram but provides better decoupling between services.

They chose to define logical environments for isolation.

They had a problem: the infrastructure definition of environments was different from the developer view of environments. Devs worked on local machines and integrated on a shared integration VM, using logically isolated environments overlaid on the infrastructure environments.

Ports and database names were used to provide logical isolation boundaries.

Coordinated deployments

Developers updated their own project development environments as and when they wanted.

This was one of their biggest successes. It predates some of the tools like chef and puppet which now allow that kind of thing to be done in a more structured way.

Version as a last resort

Semantic versioning is sensible; it is common to support multiple versions in prod, so supporting old versions is important. It adds complexity though, especially in monitoring and troubleshooting. So delay the need to version.

Spelling mistakes and inconsistent capitalisation are a common but silly cause of breaking changes. If you treat your endpoints as user stories it is easier to catch things like this early and in time.

Postel’s law

“Be liberal in what you accept and conservative in what you send”.

We too often have deserialisers which throw an exception if an order arrives with four fields when the object expects three. Additional fields you don’t expect shouldn’t be breaking changes (but too often they are). Move away from strict schema validation.
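A tolerant reader in the spirit of Postel’s law might look like this sketch (Python; field names invented): it extracts only the fields it needs and silently ignores unknown extras, so a producer adding a new field doesn’t break the consumer.

```python
# Tolerant reader: validate what you need, ignore what you don't know.
import json

REQUIRED = ("id", "customer", "amount")

def read_order(payload):
    data = json.loads(payload)
    missing = [f for f in REQUIRED if f not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # an unexpected extra field (e.g. a new "currency") is simply ignored
    return {f: data[f] for f in REQUIRED}

order = read_order('{"id": 1, "customer": "c9", "amount": 5, "currency": "GBP"}')
assert order == {"id": 1, "customer": "c9", "amount": 5}
```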

Separate functional testing from integration testing

The best balance was to have isolation tests which run over the wire against ruthlessly stubbed services. They started with handcrafted stubs; now there are record-and-replay stubs out there: Moco, VCR, Betamax, stubby4j etc.

Contract tests

The last known good versions of all other services were deployed alongside new dev work. If they all pass, the combination of all those items were saved as an artefact, a set of services which work together. [note: this flies in the face of the Netflix microservice deployment process]

Use bounded contexts to control complexity

The product team had a very good idea of what products were.

Event propagation in REST is very much in discussion. They used atom feeds, but he isn’t entirely sure that was best.

This article is online at http://martinfowler.com/articles/enterpriseREST.html

Migrating to microservices[2]

Adrian Cockcroft

He left Netflix in January and wants to cover things more broadly. He has some slides on Slideshare from a cloud conference which are relevant from an exec view.

What did I learn from Netflix?

He started working on APIs, then moved to microservices in the cloud. Netflix’s main focus is speed of delivery to the marketplace; it learns very quickly.

  • Speed wins in the marketplace
  • Remove friction from prod development
  • High trust, low process
  • Freedom and responsibility culture
  • Don’t do your own undifferentiated heavy lifting (i.e. just put it all on AWS)
  • Simple patterns automated by tooling (no central architecture planning function). This helps arch to evolve, best parts to then be identified and spread further
  • Microservices for speed of development and availability

Typical reactions to Netflix talks

In 2009 people said that they were crazy and making stuff up
In 2010 people said it won’t work
In 2011 people said it only works for unicorns like Netflix
In 2012 people said we’d like to do that, but can’t
In 2013 people said we’re on our way, using Netflix OSS code

Demands on IT have exploded. You can only survive by being good at it.

Colonel John Boyd (USAF) described the OODA loop: observe, orient, decide, act.

So how fast can you act?

Doing IaaS cuts infrastructure out of the process of product development.

PaaS cuts out even more – the product manager can talk to a dev about one feature that is quickly released (pushed every day to production!). As the cost, size and risk of change is reduced, the rate of change increases.

Observe = Innovation: land grab, customer pain points, measure customers, competitive moves.
Orient = Big Data: analysis, model hypotheses.
Decide = Culture: plan response, share plans, JFDI
Act = Cloud: incremental features, automatic deploy, launch AB testing.

How do I get there?

Old school companies need to read “The Phoenix Project”[3] and move on from there.

Lean Enterprise by Jez Humble is about continuous deployment for speed in big organisations. No time for handoff between teams. Run what you wrote – root access and pager duty. Genuine ownership of your code where it counts. High trust for fast local action. Freedom and responsibility for developers.

Open source ecosystems

  • The most advanced scalable and stable code today is OSS.
  • No procurement cycle; fix and extend it yourself.
  • GitHub is your company’s online resume.
  • Give up control to get ubiquity – Apache license
  • Extensible platforms create ecosystems.

Cloud native for high availability

  • Business logic isolation in stateless microservices
  • Immutable code with instant rollback
  • Auto scaled capacity and deployment updates
  • Distributed across availability zones and regions
  • Lots of de-normalised single function NoSQL data stores
  • NetflixOSS at netflix.github.com and techblog.netflix.com

Cloud native benchmarking

Netflix were planning to run fully active-active across east coast and west coast. To test this they took the most write-intensive workload they could. In twenty minutes they got 192 TB of SSD provisioned across 96 machines in two locations.

An API proxy, Zuul, defines the API they want to use externally and routes to various back ends.

Then there are Eureka, their service registry, and Edda, a black box recorder for the state of the whole system at any time.

Then there are the teams, each concerned with a microservice. One might have a Karyon-based server talking via Staash to Cassandra. Another might use MySQL or S3. Each team works independently; it decouples the problems with breaking the build.

Separate concerns using micro services

  • Inverse Conway’s law – teams own service groups.
  • One “verb” per single function microservice
  • Size doesn’t matter
  • One developer independently produces a microservice.
  • Each microservice has its own build, avoids trunk conflicts.
  • Stateless business logic (simplifies roll forward/back)
  • Stateful cached data access layer
  • Versioning

Deployment architecture

  • Versioning
    • leave multiple old microservice versions running
    • fast introduction vs slow retirement
    • code pushes only add to the system. Eventually retire old services which have no traffic.
  • Client libraries
    • even if you start with a protocol, a client side driver is the end state.
    • best strategy is to own your own client libraries from the start (eg mongo did this, which really helped with their uptake)
  • Multithreading and non-blocking calls
    • reactive model RxJava using Observable to hide threading
    • migrated from Tomcat to Netty to get non-blocking I/O speedup
  • Enterprise Service Bus / Messaging
    • message buses are CP with big problems getting to AP (CAP theorem)
    • use for send and forget over high latency links

Microservice APIs

  • API patterns
    • RPC, REST. Self-describing overhead, public vs in-house.
    • XPath, JSONPath add some flexibility but are not as useful in house.
  • Scripted API end points – dynamic client RPC pattern
    • See Daniel Jacobson’s talks at slideshare.net/netflix
    • March 3rd 2014 techblog.netflix.com post by Sangeeta Narayanan
  • Service discovery
    • build time Ivy, Gradle and Artifactory
    • run time Zookeeper for CP, Eureka for AP
  • Understanding existing code boundaries
    • how do you break up your giant code base?
    • buy a bigger printer and wallpaper a room.

Microservice datastores

  • Book: Refactoring Databases
    • Schemaspy to examine data structure
    • Denormalisation into one data source per table or materialised view.
  • CAP – Consistent or Available when Partitioned
    • Look at Jepsen models for common systems aphyr.com/tags/jepsen
    • AP as default for distributed system unless downtime is explicitly OK
  • Circuit breakers see http://fluxcapacitor.com for code examples.
    • NetflixOSS, Hystrix, Turbine, Latency Monkey, Ribbon/Karyon
    • Also look at Finagle/Zipkin from twitter and Metrics, Graphite
    • Speed of development vs scale driving resilience
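The circuit-breaker pattern mentioned above can be sketched in a few lines. This is the idea behind tools like Hystrix, not Hystrix’s actual API; the threshold and names are invented: after N consecutive failures the circuit opens and subsequent calls fail fast to a fallback instead of hammering a sick backend.

```python
# Minimal circuit-breaker sketch (illustrative, not the Hystrix API).
class CircuitBreaker:
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.threshold

    def call(self, fn, fallback):
        if self.open:
            return fallback()          # fail fast, don't touch the backend
        try:
            result = fn()
            self.failures = 0          # success resets the counter
            return result
        except Exception:
            self.failures += 1
            return fallback()

cb = CircuitBreaker(threshold=2)

def flaky():
    raise RuntimeError("backend down")

for _ in range(2):
    cb.call(flaky, lambda: "fallback")
assert cb.open                                        # circuit is now open
assert cb.call(flaky, lambda: "fallback") == "fallback"
```

Real implementations add a half-open state that periodically lets a trial request through to see whether the backend has recovered.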

How do we get to microservices simplest and soonest?

Try the carrot, stick and shiny objects

  • “This new feature will be ready faster as a microservice”
  • “This new feature you want will only be implemented in the new microservice based system”
  • “Why don’t you concentrate on some other part of the system while we get the transition done?”

(as it happens, the last one of those was the approach they used at Netflix)

Moving to Microservices – shadow traffic backend redirection

  • First attempt to send traffic to cloud based microservice
    • Used real traffic stream to validate cloud backend
    • Uncovered lots of process and tools issues
    • Uncovered service latency issues
  • They modified the monolithic datacentre code path
    • Returns Genre/movie list for a customer
    • Asynchronously duplicated request to cloud
    • Started with send-and-forget mode, ignore response
  • Dynamic consistent traffic percentage
    • If (customerid % 100 < threshold) shadow_call()[4]
    • They set the threshold so they could dial up or down the amount of traffic going to the cloud.
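The dial above has a useful property worth spelling out: because the same `customerid % 100` test is used every time, the set of shadowed customers is stable, and raising the threshold only adds customers without dropping any. A quick sketch (customer ids invented):

```python
# Consistent percentage dial: deterministic per customer, monotonic as
# the threshold is raised.
def should_shadow(customer_id, threshold):
    return customer_id % 100 < threshold

shadow_at_10 = {c for c in range(1000) if should_shadow(c, 10)}
shadow_at_25 = {c for c in range(1000) if should_shadow(c, 25)}

assert len(shadow_at_10) == 100        # ~10% of 1000 customers
assert shadow_at_10 <= shadow_at_25    # dialling up keeps existing customers
```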

Production is kept immutable. While monolithic updates can break everything at once, a microservice deployment adds a new microservice (with no impact) and then routes test traffic to it. They have version-aware routing and eventual retirement of older services.

Scott Adams “How to fail at almost everything and still win big”

Automatic canary red/black deployment

This process is in use for tens of large fleet microservices in active development.

  1. Developer checks in code then gets email notifications of progress
  2. Jenkins build launches in test account and starts tests
  3. If tests pass, launch ‘canary’ signature analysis in production
    1. Start one instance of the old code per zone
    2. Start one instance of the new code per zone
    3. Ramp up traffic and analyse metrics on all six
  4. If canary signature looks good, replace current production
    1. Scale canary build up to full capacity
    2. Send all the traffic to the new code
    3. Wait until after peak traffic time then remove old code instances
  5. Security team tools notice the new build via Edda query
    1. Automatic penetration test scan initiated.
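A toy version of the canary signature analysis in step 3 might compare the canary’s metrics against the baseline instances and decide whether to promote. The metrics and tolerance ratios here are invented for illustration, not Netflix’s actual criteria:

```python
# Toy canary decision: promote only if the canary's error count and p99
# latency stay within tolerances of the baseline build.
def canary_ok(baseline, canary, max_error_ratio=1.2, max_latency_ratio=1.3):
    return (canary["errors"] <= baseline["errors"] * max_error_ratio
            and canary["p99_ms"] <= baseline["p99_ms"] * max_latency_ratio)

baseline = {"errors": 10, "p99_ms": 120}
assert canary_ok(baseline, {"errors": 11, "p99_ms": 130})       # promote
assert not canary_ok(baseline, {"errors": 40, "p99_ms": 130})   # roll back
```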

When code is checked in during the afternoon in California and passes the canary test suite, it is first deployed to night time Europe, then after peak it is canaried and deployed to East Coast US on the next day, and then West Coast US after peak on the West Coast.

Monitoring the microservices is important.

They use AppDynamics to instrument the JVM to capture everything, including traffic flows. They insert a tag for every HTTP request with a header annotation GUID, and visualise the overall flow or the business transaction flow.

Boundary.com and Lyatiss CloudWeaver are used to instrument the packet flows across the network. Captures the zone and region config from cloud APIs and tags, allows them to correlate, aggregate and visualise the traffic flows.

Scaling continuous delivery models

Monolithic – Etsy, Facebook

  • Etsy – 8 devs per train
  • Ops team run the monolith
  • Queue for the next train
  • Coordination chat session
  • Need to learn deploy process
  • Update in place
  • Few concurrent versions
  • 50 monolithic updates/day
  • Roll forward only
  • “done” is released to Ops

Microservices – Netflix, Gilt

  • Everyone has their own build
  • Dev runs their own microservice
  • No waiting, no meetings
  • API call to update prod timeline
  • Automated hands-off deploy
  • Non-destructive updates
  • Unlimited concurrent versions
  • 100s of independent updates
  • Roll-back in seconds
  • “done” is retired from prod


Adrian’s Blog: http://perfcap.blogspot.com

Dismantling the monolith

Brian Mcallister, Groupon

Groupon started with a giant monolithic Rails app – somewhere between 100k and 2 million lines of code.

Mobile component of transactions is now over 50% of traffic and growing.

They started by adding APIs onto their monolith but it was still horrible. They even had a different code base for their international platform.

This gave them a huge crisis in the business. They tried a front-end rewrite over six months but had to roll it back; it was a disaster.

They couldn’t develop things fast enough for the business. They wanted to build features worldwide, mobile and web lacked feature parity, and they couldn’t change the look and feel. But the big rewrite didn’t work, so how could they move forward in 2012?

They looked at their monolith and tried to identify what modules composed it. They then wanted to start breaking each module into a separate service. They picked a two-page flow (subscription) and set up a separate route to that particular part of the application. For a language, the two guys who did this picked node.js because it had some momentum, and it had NPM, which does packages correctly.

They implemented this tiny app (200 lines of CoffeeScript), which called their existing API, and moved it live. Within two hours they had a major site outage for that page.

It turns out that their infrastructure was optimised for rails, but wasn’t set up for supporting node.js. They had to introduce an additional routing layer to make it work.

So they went on to their next module. Subscription flow had been very low risk, didn’t have to manage templates etc. They wanted a different team to take on the next thing, something bigger, so they chose the browse page.

They thought it would be an easy change, but it wasn’t. The change to the deployment model needed a change to culture too, which they didn’t do. A two week estimate turned into six months with lots of tension and pain. Plus a realisation that they would have to adjust a lot of things about their culture.

They then decided that they wanted to spin out another dozen teams, so first had to think through their culture changes and make a framework which is easier to use. One of their policies was that every question had to be answerable from documentation.

They figured out how they wanted to handle layout. They have about twenty different kinds of layout. One model they tried was having template pages which contain the components (but that lacked flexibility). The next model was like angular.js compositing, but that gives a slow user experience at first. Eventually they just went to shared layouts from a shared layouts service, which includes login status, country and other stuff to create a dynamic Mustache template.

Then they decided they needed to finish it: 150 developers for two months, with no production work other than bug fixes during that time. This was so they could do AB testing between the two.

Latency halved across the board. They can now plot traffic by which application is running.

Of course, they still had their other code bases to handle.

Then, what to do with their API? They couldn’t just break it. New international services started to take responsibility for the API going forwards, with the routing layer deciding where to send each request.

Modular development of a large e-commerce site

Oliver Wegner and Stefan Tikov, Otto.de

An architecture’s quality is inversely proportional to the number of bottlenecks limiting its evolution, development and operation.

Conway’s law is well known[5]. But the inverse is true – that an architecture can limit what can be done within the organisation. Choosing a particular architecture can be a means of optimising for a desired organisational structure.

Rebuilding Otto.de

It was founded as a catalogue store back in 1949. For the last fifteen years e-commerce has been rapidly increasing, and has now overtaken the original business – it is now 80% of all turnover.

Business stakeholders were demanding more and more, but they couldn’t deliver it on their 2001 platform, so they decided to rebuild. They looked at buying in a product like ATG, but decided that would just be swapping one monolith for the next. So they had a think – what were their goals?


Functional goals

  • Test driven (including AB testing support)
  • Data driven (decisions based on data, not feelings)
  • Personalised
  • Features

Non-functional goals

  • Simpler
  • Reliable
  • Fast
  • Realtime
  • Scalable
  • Time to market (one release per month was too slow)

They couldn’t change their backend systems, which managed products, customers and orders.

They decided to use open source for their core technologies, so as not to be dependent on one vendor.

They made one prototype to define the technology stack

Project organisation with autonomous teams.

Scrum as agile development method

Their technical system architecture started off looking like a standard layered model. But it looked like any other single system. So how could they divide it into services rather than data layers? They decided that it should reflect the system blocks, then have rules about how to connect those blocks. Within each domain service, the language and data storage choices are really internal to the systems. They have ended up with a combination of C#, JRuby, Scala, Groovy and Clojure communicating with a combination of RDBMS, NoSQL document and key/value stores.

The customer journey has separate elements, and different business units had interest in and responsibility for each step (discover, search, assess, order, checkout). They divided the system architecture up vertically along these same lines (search, product, order, user, after-sales, etc.).

Macro architecture

  • RESTful
  • Shared nothing
  • Vertical design
  • Data governance

Micro architecture

  • Buy when non-core
  • Common technologies

Their Product Owner is a virtual entity, decisions made jointly by project lead, business lead and technical lead. (A very interesting idea)

When you have teams for each of these areas, how do you deal with frontend integration? The customers want a consistent experience.

Atom feeds are used for loose integration between systems (e.g. caching product changes). Better than straight REST calls, which tie things together in an unwanted way.
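The pull-based integration described above can be sketched as a consumer polling an Atom feed of changes and processing only entries it hasn’t seen. The feed XML here is a hand-made example, not Otto’s actual format:

```python
# Sketch: poll an Atom feed and pick out entries not yet processed.
import xml.etree.ElementTree as ET

FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><id>urn:product:1</id><title>price changed</title></entry>
  <entry><id>urn:product:2</id><title>new product</title></entry>
</feed>"""

NS = {"a": "http://www.w3.org/2005/Atom"}
seen = {"urn:product:1"}          # already handled on a previous poll

root = ET.fromstring(FEED)
new = [e.find("a:id", NS).text for e in root.findall("a:entry", NS)
       if e.find("a:id", NS).text not in seen]
assert new == ["urn:product:2"]
```

The key property is that the producer never needs to know who its consumers are; consumers pull at their own pace.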

Good client-side integration levels are links, and replaced links (embedding something on the page).

Interestingly, their basket page is an entirely different application, although it looks identical to the customer. They have an asset team to concentrate on providing consistent assets to all apps. There is a danger that they become a bottleneck though… Might there be a better way of doing it? Having central versioned storage perhaps?

The first approach to AB testing was a centralised framework which every team had to include in their repository. But they didn’t want that code sharing, so they have a dedicated separate vertical system for testing, independent from other systems.

Ideally, for cross cutting concerns they like to introduce new vertical services to cover them.

Two years, >100 people, on budget, on quality, ready four months early for the MVP.

Lessons learned

Independent, autonomous systems for maximum decoupling. This allowed development to scale.

Strict macro architecture rules that everyone knows.

Minimise cross functional concerns. Avoid centralising things; it is more work but much better in the long run.

Prefer “pull” to “push” sharing

Address cross functional concerns

Minimise the need for coordination between teams.

Be skeptical of “easy” solutions

Teams with their own decisions. Trust them.

Lessons learned in scaling twitter

Brian Degenhardt

He works in the platform team that writes the base libraries. There are lessons learned in scaling from the original architecture to what they have today.

Engineering is the scalpel which we use to subdivide a problem so that it can be made in pieces. Westminster Abbey was built from individual stones. Manageable pieces.

Three lessons

  1. Incrementally implement SOA
  2. Separate semantics from execution
  3. Use statistics to monitor behaviour

Incrementally implemented SOA

Originally it was a monolithic Rails application talking to MySQL. It allowed a small team to iterate quickly, so it was useful at first.

  • It was difficult to scale the security model.
  • Any change deployed to all servers (and there were lots of leaky abstractions).
  • Poor concurrency and runtime performance, as all servers had to have all the code, and Ruby was single threaded.
  • Leaky abstractions and tight coupling made it difficult to separate stuff.

First they split storage into separate services. Tweets, users, timelines and social graph.

Then they divided it up into routing; presentation, which includes web, API and monorail; and logic, which includes tweets, users and timelines.

A visualisation of the services is vast; individually they are easy to work with, but complex as a whole.

How is a tweet sent?

  • Write the tweet.
  • It goes to the write API.
  • Fan out and deliver to all followers.
  • There is a Redis list for each timeline, so the tweet is stored there for each person.
  • The timeline service pulls your stuff off your Redis timeline.
  • Then each tweet has to be hydrated from the tweets and user services.
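The fan-out-on-write step above can be sketched in a few lines (Python; the follower graph and bounded-list size are invented for illustration): the tweet id is pushed onto each follower’s timeline at write time, so a timeline read is a cheap list fetch.

```python
# Fan-out on write: deliver a tweet id to every follower's timeline.
from collections import defaultdict, deque

followers = {"alice": ["bob", "carol"]}
timelines = defaultdict(lambda: deque(maxlen=800))  # bounded, like a Redis list

def post_tweet(author, tweet_id):
    for follower in followers.get(author, []):
        timelines[follower].appendleft(tweet_id)    # newest first

post_tweet("alice", "t1")
post_tweet("alice", "t2")
assert list(timelines["bob"]) == ["t2", "t1"]
assert list(timelines["carol"]) == ["t2", "t1"]
```

This trades write amplification (one write per follower) for fast reads, which suits Twitter’s read-heavy workload.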

The tweet is also sent to the ingester, which sends it on to Earlybird, the search system. When you search, Blender does a parallel search across all Earlybird instances.

The tweet also goes to the firehose, HTTP push and mobile push.

Timeline: 240 million active users, 300k queries per second, 5000 tweets per second. It gets tweets in between 1 and 4 milliseconds.

Twitter-server is open source: a base library for config, admin, logging, and the life cycle of services and metrics. It is written in Scala. Finagle is the underlying component, an RPC mechanism for the JVM. It does service discovery, load balancing, retrying, thread/connection pooling, stats collection and distributed tracing.

Separate semantics from execution

Your service as a function

It takes a request and returns a response. The response is a Future[T], their abstraction for concurrency. It is composable; it can be pending, completed or failed. It is concurrent and easy to reason about.

Get user id, get tweet ids from the timeline, get tweets, get images from tweets as needed.

userId = Future(23)

Then you compose the actions to get the whole future computation, and the parallelisable sections run in parallel.

They divided up what they want to do from how the threads execute it. This separates reasoning about what you want to do from making it as efficient as possible.
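An analogous composition in Python’s asyncio (a sketch of the idea, not Finagle’s API; the service calls are stubbed and all names invented): the pipeline is written as what to do, and the runtime decides how to run the independent fetches concurrently.

```python
# Compose "what" declaratively; let the scheduler handle "how".
import asyncio

async def get_timeline(user_id):
    return ["t1", "t2"]            # stub: tweet ids for this user

async def get_tweet(tweet_id):
    return {"id": tweet_id, "text": f"tweet {tweet_id}"}   # stub hydration

async def home_timeline(user_id):
    tweet_ids = await get_timeline(user_id)
    # hydrate every tweet concurrently; gather expresses the parallelism
    return await asyncio.gather(*(get_tweet(t) for t in tweet_ids))

tweets = asyncio.run(home_timeline(23))
assert [t["id"] for t in tweets] == ["t1", "t2"]
```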

So their stack looks like this: your service sits on top of the common libraries, and each service follows this same stack.


Use statistics to monitor behaviour

Breaking it up means they have to use statistics.

The amount of traffic means that the aggregate is more important than individual requests. More vertical components means more measurement, and horizontal SOA means more measurement points.

Failures are OK. At 300,000 requests per second with 99.99% success, there are still 30 failures every second, so failure rate is more useful than counting individual failures.

Now, there might be a 10% increase in rate for the median and p90, but the long tail jumps 300%. Tail effects are cumulative, so p99 and p999 requests are important to manage. This is why request-level concurrency is important.
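The cumulative tail effect can be made concrete with a little arithmetic: if each of N sequential service calls is slow 1% of the time, the chance a request hits at least one slow call grows quickly with N.

```python
# Probability a request hits at least one slow downstream call.
def p_slow_request(p_slow_call, n_calls):
    return 1 - (1 - p_slow_call) ** n_calls

assert round(p_slow_request(0.01, 1), 3) == 0.01
assert round(p_slow_request(0.01, 50), 2) == 0.39   # ~39% with 50 calls
```

So a fan-out of dozens of services turns a rare per-call p99 event into a common per-request one.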

Measuring the vertical stack

They wrote a tool called Viz which measures the vertical stack, graphs it, and has queries and alerts in it, making dashboards possible. Every team has their own dashboard.

Measuring the horizontal

Zipkin (which is awesome) is an OSS distributed tracer. It is always on in the Twitter infrastructure and traces 1 in 1000 live tweets. It makes it possible to see the pipelining as it is happening.

It is modelled after Google’s “Dapper” paper. Google didn’t open source theirs, so Twitter wrote their own and open sourced it. Distributed tracing is very useful!

Their statistics tooling is also open sourced in Twitter Commons.


What do they do about performance testing?

He is lukewarm about load testing. Instead they have a carefully staged rollout – canarying – so new code goes to some servers and gets a portion of the traffic, to see whether the canary dies on live traffic.

The Oscars caused them a bit of a problem when everyone came to search at the same time; on top of that, a lot of users who hadn’t used it in years logged on, so they had to refresh lots of caches.

The biggest tweet velocity is from Japan when Castle in the Sky is on – at the end, everyone watching traditionally tweets the spell of destruction, and they had 185,000 tweets of that in one second.

With the monorail, they had to have change freezes around big events because they couldn’t be sure what the impact would be. Now they have smaller services, they don’t have that same problem.

Versioning – they just always maintain backwards compatibility, always adding new services.

How would they change if they were having financial transactions rather than tweets? Probably drop features and provide more deliberate accuracy.

How Netflix leverages multiple regions to increase availability

Ruslan Meshenberg
Director, platform engineering.

They have over 44 million subscribers worldwide.

This talk is really about failure in its most dramatic state: how do you mitigate it when it happens?

Small scale, slow change, everything pretty much works. Once you start moving at larger scale or faster change you will have problems with hardware failures or software problems respectively.

  • Top problems generate bad PR (mitigated by active-active and game day practising)
  • Customer service calls (better tools and practices)
  • Metrics impact (better data tagging)

Does an instance fail? Yes. It could be bad deployments, hardware failure or latent issues. Their Chaos Monkey tests these things.

Can a whole data centre fail? It can happen, for routing or data-centre-specific issues. They test this with Chaos Gorilla.

Does a whole region fail (with several data centres)? Most likely through a region-wide config issue. They test this with Chaos Kong.

Everything fails eventually

So you decide how you deal with it: time to detect, time to recover.

Highly agile and resilient service on top of ephemeral services.


Changes in one region shouldn’t affect others.


Christmas 2012 they had a long and painful outage. They were just in US east at that time.

The postmortem stated that data was deleted by a maintenance process from production. None of their services could receive any traffic.

This led to Project Isthmus – plugging the leak – a tunnel between US East 1 and US West 2. Users are geolocated between the two regions. If there is a regional failure, they override the geo routing and route to the up region.

Zuul is a smart routing fabric invented in house that sits behind the elastic load balancers, and is a powerful filter routing handler. It is open sourced.

DNS can become a single point of failure, so they created “Denominator”, which allows choice of DNS provider.

This supported ELB. They didn’t want to build one-offs for each service, so how could they come up with a single solution? This led to the active-active project.

Active active

This provides full regional redundancy.

They can’t just deploy to both regions and be done. Some nightly batch jobs are not on the user path and don’t need it. Secondly, replicating the state is important.

Routing users to the services has been discussed

Data replication? They have embraced Cassandra, which has a tuneable consistency model. Eventual consistency != hopeful consistency! They benchmarked global Cassandra and had no data loss with 1 million writes and reads within 500ms.

Propagating EVCache layers was more difficult. They need single digit millisecond response times for some situations. They came up with a complicated method of doing this. They have now announced Dynomite, which keeps the native memcached protocol and works better.

Config isolation

Archaius is the region-isolated config tool. They don’t want to make any config global now; they default to regional. All devs have live access and can do global config if needed, but it isn’t the default.

They have automated canaries and continuous deployment.

Asgard is the tool they made OSS to handle deployments. It sets up a new cluster and directs some traffic across; the old cluster is still there until you definitely want to take it out.

They have a platform deployment app they are planning to open source to allow deployments to all areas to happen with less interaction.

Monitoring and alerting has per-region metrics with global aggregation. It could be seen as a logging service which also allows you to watch movies!

They use Route 53 CNAMEs. (Look up.)

For fallback it isn’t enough just to reroute. You need to ensure that data is repaired, that cold caches get refreshed, and that auto scaling means you don’t bring traffic back too quickly.

They validate that the whole thing works by using their chaos primates against live systems, ensuring there are no cross-regional dependencies. They kill their data tier too, not just their stateless services.

Open Source


Ice is their open source tool which takes the Amazon pricing report and gives powerful visualisations of the costs.

Eureka is a service registry/discovery tool. One is deployed per zone. It is highly available.

Edda keeps a snapshot of every environment through history, so you can see how it evolved over time and query it.

Archaius is for configuration management.

Ribbon library for internal request routing
Karyon is the server side partner of ribbon

Hystrix circuit breaker. Fail fast, recover fast.

Turbine dashboard to work with Hystrix

Simian Army (runs in working hours so devs can fix things)

It is all apache licensed.

Q: how do they do capacity planning? They don’t. Devs have the freedom and responsibility to provision what they need if they think they need it.

Q: when a new version of a service corrupts data, what do you do? If you find the corruption quickly, you fix it quickly. If it isn’t detected in time you have to fix forward and cleanse the data. They are quite paranoid about their backup policies.

Q: if they use DNS for failover, do they have a short TTL? Yes, they have it down to about 10 mins. Sadly some devices don’t respect TTL rules.

Manoeuvrable Web Architecture[6]

Michael Nygard (author of “Release it!”)

Agile development works best at the micro scale. It won’t create macro-scale agility though.

The term “agility” comes from John Boyd, the fighter pilot who was considered ham-fisted but was a brilliant theorist. He could get on someone’s six within 40 seconds on a regular basis. His later work is better known than his earlier work.

He was very good at introspection, and he wrote down how to dogfight – what made for a successful combat manoeuvre. Basically it came down to rapid transfer of kinetic to potential energy and vice versa. This was his Energy-Maneuverability (E-M) theory, and fast transience in manoeuvres was key to success.

He decided to calculate E-M values for the contemporary US and USSR aircraft on computers, and discovered that the US aircraft inventory was inferior in almost every regard. This didn’t win him many friends, and he was assigned to the Pentagon to weigh him down with paperwork, but the “fighter mafia” there embraced him and fought for aircraft they wanted to create which matched his theories – like the F-16. They resisted built-in ladders, because they knew that small changes add up. It had a very high thrust to weight ratio so that it could accelerate quickly. It also had wings designed to be high drag, so that it could shed momentum very quickly. It was designed with E-M theory in mind.

He later moved on to think about Manoeuvre Warfare, saying that the most important thing is to control the tempo of the engagement. Unlike the popular idea that warfare is about destroying the opponent’s ability to wage war, he focussed on making it impossible for the enemy to bring things to bear. One of the things he observed was that even something as simple as breaking camp and moving to a new location was dramatically affected by the experience of the units – a six to one ratio in the time taken.

You also want to take initiative. Take the actions that everyone else has to react to.

Observe, Orient, Decide, Act.

We want to be able to learn from these things. A manoeuvrable web architecture will allow us to:

  • Control tempo of engagement
  • Take initiative.
  • Send ambiguous signals so that competitors don’t see where you are going.

So how do we do this?

We can’t do it just by declaring it to be so.

Tempo is the result, an emergent property of your manoeuvrability. You work in bursts to make these changes!

The typical IT architecture is awful, a disaster for tempo. There are a few themes which emerge in response.

  • plurality (it is ok to have many ways of using a service, and allow Darwinian evolution to pick the winner)
  • break monoliths
  • use URIs with abandon (trying to get just one perfect API can be a false goal)
  • augment upstream
  • contextualise downstream

These are not formal patterns, and some are still open to debate.

UIs need to be abstracted more

There was a company working in 100+ countries which needed separate UIs, all of which invoked country-specific services and all of which knew about each other and what they required. One way to address that problem is to get the UI to ask the back end what it needs to know rather than knowing about domain constructs. Generic UI plus semantic HTML plus unobtrusive JavaScript is better. Adding in SSR or CSR with a CMS is better still.

A component and glue model is good. Scripts addressable by URL and dynamic deployment of scripts. Every modified script gets a new URL so that old stuff doesn’t break users of old script (they might be relying on a ‘bug’ in the old implementation).

Immutable values are a good policy at the large scale as well as the small scale (as done in Clojure). This is needed because it is impossible to enforce a global time across a large enterprise.

We need to separate value semantics and reference semantics. Values don’t change; references have atomic changes with no observable intermediate states.

Example: a perpetual string service, which stores strings forever. The URL is the SHA-256 hash of the string. Use it for scripts and legal text. Edit the string, get a new URL.
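The core of the idea fits in a few lines. Here is a toy in-memory sketch in Python (my own illustration – the names and storage are invented, not from the talk):

```python
import hashlib

class PerpetualStringStore:
    """Content-addressed store: the URL path is the SHA-256 of the string."""
    def __init__(self):
        self._store = {}

    def put(self, text: str) -> str:
        # Identical strings always hash to the same URL, so storage is idempotent.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        self._store[digest] = text
        return f"/strings/{digest}"

    def get(self, url: str) -> str:
        return self._store[url.rsplit("/", 1)[-1]]

store = PerpetualStringStore()
url_v1 = store.put("You agree to the terms.")
url_v2 = store.put("You agree to the updated terms.")
# Editing the string yields a new URL; the old URL keeps serving the old text.
```

Because the URL is derived from the content, nothing behind an existing URL can ever change out from under a caller.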


What else could we make into a value? What about a shopping cart?

A cart is a number
Add: function from cart, item, qty to cart.
Remove: function from cart, item to cart.

There is a universe of potential shopping carts, and everyone with the exact same items has the exact same cart. Add an item and you go to a new cart.

This gives a better separation of concerns, as the traditional method couples the cart to a single owner. [Personally I don’t think this makes as much sense as treating a shopping cart as an order that just hasn’t been submitted yet, and as such is rightly coupled to the customer]
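To make the idea concrete, here is a small Python sketch of a cart-as-value (my own illustration, not code from the talk): carts are immutable, and add/remove are pure functions from one cart value to another.

```python
from collections import Counter

def empty_cart():
    return frozenset()

def add(cart, item, qty=1):
    """Return a new cart value; the old cart is untouched."""
    counts = Counter(dict(cart))
    counts[item] += qty
    return frozenset(counts.items())

def remove(cart, item):
    """Return a new cart value without the given item."""
    counts = Counter(dict(cart))
    del counts[item]
    return frozenset(counts.items())

a = add(add(empty_cart(), "apple"), "brie")
b = add(add(empty_cart(), "brie"), "apple")
# Everyone with the exact same items has the exact same cart value.
assert a == b
```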

Generalised minimalism

Feature: send email to customer ahead of credit card expiration.

Completed solution had user table, warning table, card table and a daily job that wakes up, scans cards, creates warnings, sends emails and checks for bounces.

The solution works. But it isn’t very composable. You cannot reuse functions.

A better design has several small services:

  • At: at a given date time, call this URL.
  • Template: accept body and params to format text.
  • Lead time: generate a series of date times.
  • Mailer: send email to an address, track bounces.

On the surface it appears more complex, but it is all small simple components. You can’t see the feature up front, but it emerges from interaction of these simple parts.
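As an illustration, the “lead time” piece really is just a pure function. A quick Python sketch (the three-warning policy is a made-up example, not from the talk):

```python
from datetime import date, timedelta

def lead_times(event_date, days_before):
    """Generate the series of dates at which warnings should fire."""
    return [event_date - timedelta(days=d) for d in sorted(days_before, reverse=True)]

# A card expiring 2014-06-30, warned 30, 14 and 3 days ahead (an invented policy).
warnings = lead_times(date(2014, 6, 30), [30, 14, 3])
# Each resulting date could then be handed to the 'at' service along with a
# URL that invokes the template and mailer pieces.
```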

Do identifiers better

Too many identifiers are too context aware.

Ideally we would like an unlimited number of catalogues. Every service issues identifiers, no restrictions on use.

Policy proxy

The client can only access their own catalogue, the proxy checks that it is using the correct id. This could then be used in front of many different services.
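A toy sketch of the proxy’s check in Python (the identifiers and ownership data are invented for illustration):

```python
# Hypothetical ownership data the proxy would consult.
OWNERS = {"catalogue-17": "client-a", "catalogue-99": "client-b"}

def authorise(client_id, catalogue_id):
    """The policy proxy's core check: a client may only reach its own catalogue."""
    return OWNERS.get(catalogue_id) == client_id

assert authorise("client-a", "catalogue-17")
assert not authorise("client-a", "catalogue-99")
```

Because the check lives in the proxy rather than in each service, the same policy can sit in front of many different services.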

Faceted identity brings together all the different identities. The user has links to ids issued by other services. This allows you to have different access paths to get to the same thing. The relationships are all externalised.

Explicit context: URLs, state machines, reply-to-query.
Implicit context: bare identifiers, state names, an assumed channel.

Other interesting ideas (less tried and trusted)

Allowing use without permission, but you can cut off someone who is abusing your system.

Half duplex testing
“Set up mock, set up call, make call, assert, verify mock was called” *is wrong*.

Separate the mock verification from the call assert.
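Here is how that separation looks in Python with the standard library’s unittest.mock (my interpretation of the point – the mailer collaborator is hypothetical):

```python
from unittest.mock import Mock

def notify_and_total(order, mailer):
    """Send a thank-you email and return the order total."""
    mailer.send(order["email"], "Thanks!")
    return sum(order["prices"])

order = {"email": "a@example.com", "prices": [2, 3]}

# Half one: assert on the call's own result, with a throwaway mock.
assert notify_and_total(order, Mock()) == 5

# Half two, separately: verify the collaboration with the mock.
mailer = Mock()
notify_and_total(order, mailer)
mailer.send.assert_called_once_with("a@example.com", "Thanks!")
```

Keeping the two halves apart means a failure tells you whether the *result* or the *collaboration* broke.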

Ideally you want every group in the company to be able to work autonomously, and so we want to create systems to make this easy.


[1] http://qconlondon.com/dl/qcon-london-2014/slides/BrandonByars_EnterpriseIntegrationUsingRESTACaseStudy.pdf

[2] http://qconlondon.com/dl/qcon-london-2014/slides/AdrianCockcroft_MigratingToMicroservices.pdf

[3] http://www.amazon.co.uk/The-Phoenix-Project-Helping-Business-ebook/dp/B00AZRBLHO

[4] Interesting note – this wouldn’t give them the percentage they would expect if their customer ids are numeric because of the law of large numbers. I imagine they used something more accurate in real life

[5] “Organisations which design systems are constrained to produce systems which are copies of the communication structures of these organisations” – M E Conway.

[6] http://gotocon.com/dl/goto-berlin-2013/slides/MichaelT.Nygard_ManeuverableWebArchitecture.pdf (qcon ones are protected at the moment for some reason, these are identical)


Things Apple is doing wrong in iOS



In my opinion, iOS 7 introduced some major usability problems, and some glitches which I find both annoying and surprising, considering the care which Apple used to take over the UX (user experience) of their products. This page might grow as I discover more annoyances, and be amended as they fix things to some extent.

Phone photographs far less usable

So, in previous incarnations when I get a phone call from someone my whole phone lights up with the lovely picture of the person calling me. I can tell at a glance, instantly, who is calling. For some reason the muppets at Apple decided that it would be best to blur that photograph out underneath one of the overlays which they were so pleased about, so that there is no indication of who is calling. Minor fix: in 7.1 they put a tiny round icon of the person in the corner of the phone. This improves the functionality from 0 out of 10 to 2 out of 10. It is still pretty useless though, as you’ve got to peer at the phone to see which tiny round icon is displayed.

Which is best when receiving a call, eh? I know which I prefer to see.

iphone-call-recieve-old iphone-call-recieve-new

Siri is inconsistent when voice dialling

When I’m using headphones and I tell Siri to “phone my wife at home”, sometimes Siri will say “calling <my wife’s name> at home”. Sometimes it will just say “ringing”. Why this inconsistency? To be honest, this problem has been around as long as Siri has. Prior to 7.1 Siri was much less reliable and it was especially annoying to ask Siri to phone someone, hear just “ringing”, and then be connected to the wrong person. At least when you said “Phone Robert White” and Siri replied “calling Apple Support” you could hang up the phone quickly. When it responds “ringing” it is always a bit of a gamble whether or not it will connect the right person.

Awful address book

The iOS7 contacts list is a horrible nightmare. When in view mode for a particular contact the information stretches out below the ‘fold’ of the screen because labels are placed above the data fields. The photograph of the contact is shrunk down to a small circle, allowing less useful visual information to be displayed. It isn’t obvious what bits I can touch to phone, email or check someone’s address in maps. Luckily I still have Peeps on my iPhone which allows me to see the contact information in the old format and the usability and readability of the information is markedly better. The relevant information pops out at me, rather than disappearing in a sea of white.

iOS7 contact


Previous iOS contact


Forgets what is playing when I make a phone call

This is an odd one which appeared in either iOS 7 or iOS 7.1, but which (happily) appears to have been resolved by iOS 7.1.1. When walking home from the station listening to music or a podcast on earphones, I decide to call my wife so that she can meet me with the puppy. I pause the music, voice dial my wife via Siri (see above!) and then when I hang up the phone I want to start the music playing again. But iOS has forgotten that it was playing anything! Using any of the combinations to start playing again reveals nothing, and I have to unlock the device, manually navigate to the app and start it once again. This is the case whether I’m using the music player or the podcast app (which is horrible enough that I’m going to include it here soon too). Fixed since 7.1.1 (so far at least): since I applied the latest upgrade I can unpause music or podcasts which I was listening to before I made a phone call. Hurrah. At least I can be thankful for small mercies.

Podcast Modal Dialog

How difficult should it be to write a usable podcast app? It was OK when it was part of the music player, but as a separate app it has so many problems. The most egregious one is this though… I don’t want podcasts to suck up my limited mobile data bandwidth, I only want it to pick things up when I’m connected to WiFi, so I use Settings to turn Use Mobile Data off. No problem so far, right? Wrong. If I ever open up the podcasts app when I’m not on WiFi (for instance when I’m walking to the station and want to listen to something), it ALWAYS pops up a system modal dialog to tell me that it can’t download anything unless I go to Settings and turn on mobile data for it. You stupid application, I know that I can’t do that. You might put that message up if (and only if) I’m not connected to WiFi, you see that there is an episode I don’t have and you attempt to download it. At that point it might be useful information. But popping up a system modal dialog (push the home button to do something else? No sir! Attempting to use Siri for something? Not a chance) seems to me to be the height of rudeness, and points to a certain incompetence in the QA of this application by the Apple team. How was something this obvious not picked up during testing? More rants to come. Please feel free to email me if you agree/disagree!

An interesting email idea


So, at work today there was some discussion about email validation by regex, because a third party provider couldn’t accept one of our customers’ email addresses because it contained a Spanish ñ character.

Too many email address regexes assume that only ASCII characters will be present in emails, and thus fall foul of more recent innovations which allow UTF-8 characters.
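To see the problem, take a typical ASCII-only pattern (a simplified example of my own, not the third party’s actual regex):

```python
import re

# A common ASCII-only email pattern of the kind many validators use (simplified).
ASCII_EMAIL = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

assert ASCII_EMAIL.match("nino@example.com")
assert not ASCII_EMAIL.match("niño@example.com")  # the ñ falls outside [A-Za-z0-9._%+-]
```

The second address is perfectly deliverable these days; it is only the character class that rejects it.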

Mind you, even those that are just checking for ascii characters arguably fall short of decent validation, as Phil Haack makes clear in his old blog entry here http://haacked.com/archive/2007/08/21/i-knew-how-to-validate-an-email-address-until-i.aspx

Who knew that there were so many potential valid (but effectively unusable) variants on email addresses?

My attention was drawn to the note about Gmail addresses though – that if you are someone@gmail.com you can provide an email address in the form someone+mytag@gmail.com and it will be successfully routed to you – but with the advantage that if you get an email from someone unexpected you can find out from the +mytag which you set who it was who allowed your email address to be passed on.
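Parsing the tag back out on the receiving side is trivial – a quick Python sketch (the helper name is mine):

```python
def split_plus_tag(address):
    """Split a plus-addressed email into (base address, tag)."""
    local, _, domain = address.partition("@")
    base, _, tag = local.partition("+")
    return f"{base}@{domain}", tag or None

assert split_plus_tag("someone+mytag@gmail.com") == ("someone@gmail.com", "mytag")
assert split_plus_tag("someone@gmail.com") == ("someone@gmail.com", None)
```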

When is transparent compression not transparent?



So, I had an interesting experience today – I was knocking up a basic MVC4 site using the default template and Entity Framework code-first because I wanted to do a quick test on validators.

However, the site failed, with a message I’d not seen before.


The file “c:\Program Files\Microsoft SQL Server\MSSQL10.SQLEXPRESS\MSSQL\DATA\BlogContext-20130903103345.mdf” is compressed but does not reside in a read-only database or filegroup. The file must be decompressed.

So, it turns out that I was using NTFS file compression on the Program Files directory because the transparent compression it has provided over the years has been very worthwhile. The C: drive on my company machine has limited space, and some applications (I’m looking at you Microsoft) insist on installing a proportion of themselves on the C: drive no matter where you tell them to install.

SQL Express puts its data files under Program Files (why?) and it turns out that the transparent file compression isn’t so transparent – SQL doesn’t like it.

I uncompressed that SQLExpress directory and things worked just like normal.

The moral of the story is that you can’t always trust software which says “trust me, it’s magic” without knowing what is actually going on behind the scenes, even when doing toy proof of concept stuff!



Dancing – why did nobody tell me about this earlier?

So, a non-programming post for a change.

I love dancing. Especially, the Argentine Tango. I have been having lessons in public classes at the wonderful Mirrors Dance (http://www.mirrorsdance.com) for a couple of years now, and I would commend learning to dance to anyone, of any age.

I guess I’ve come to dancing relatively late on. My interest was probably kindled over the last few years through family watching of ‘Strictly Come Dancing’, the television show where celebrities team up with professional dance partners and train up through a knock-out competition until the grand final where one couple end up as champions.

Even then I didn’t seriously consider going to dancing lessons. That first step would be a big one for a bloke to take. However, in Autumn 2010 I started a new job at Tesco Dotcom. To my surprise they were running a “Tesco does Strictly” competition to raise money for their charity of the year – ten members of staff each paired with a professional dancer for a single night of competition.

Autumn 2011 they ran the event again and I applied but didn’t get in. But I decided that if I’d got the guts to try for that competition, I could go along to a local introductory class. I’d just had an email about Mirrors Dance running an introductory Argentine Tango class and that had always been my favourite dance to watch – all fire and passion and drama – so I signed up. It seemed almost like eight weeks of learning how to stand and walk (which was surprisingly difficult), but I enjoyed it and started going to the public classes in 2012.

My co-ordination isn’t great, and I’m a bit overweight which throws out my balance and my posture, but I figured that it was worth persevering and the teacher, Trudi Youngs, was an excellent motivator.

Anyway, come Autumn 2012 Tesco decided to run its charity event again, and I applied again – and got accepted! Gulp! Unfortunately I’d only had one lesson before my assigned professional had to pull out. It seemed like that was the end, but I got in touch with Trudi and wondered whether she would be prepared to take on the commitment of training me for the competition. Happily she agreed, and between Christmas and New Year we trained in snow and sunshine in church halls across Hertfordshire; an Argentine Tango set to the music ‘Roxanne’ from Moulin Rouge.

Alex and Trudi Argentine Tango

Alex and Trudi Argentine Tango (photo by John Hardwick)

We danced on the evening of January 25th at the Emirates Stadium. To my everlasting amazement we came joint top of the judges vote (39 out of 40!) and top in the audience vote. We won the glitterball (and along the way about £28,000 was raised for Cancer Research by the event).

Since then I’ve kept on dancing at Mirrors Dance. I’ve made some great friends there, I’ve done a couple of dancing exams and hope to continue doing them as long as I’m able – I’d love to see how far I can take it.

So, in the spirit of this blog, what have I learned from this experience?

  • I’ve learned that it is possible to do more than you think you can.
  • I’ve learned that it is good to occasionally step outside your comfort zone.
  • I’ve learned that it is good to engage in physical challenges, and step outside the mental world (especially if your work revolves almost entirely around the mental world!)

So I want to give a shout out to Trudi Youngs and Mirrors Dance school – if you’re in their vicinity I can’t recommend them highly enough. If you are not lucky enough to be close to Hitchin, Hertfordshire, then look up a dance school near where you are.

You’ve got nothing to lose and everything to gain. Keep Dancing!

The benefits of Rich Snippets to improve search listings




Rich Snippets are ways in which sites and pages can provide structured information that enables Google to provide more relevant search information to users. It doesn’t improve search ranking, but is designed to give users a sense of what’s on the page and why it is relevant to their query when they see the listing.

The idea is that a restaurant might show average review and price range, a recipe page might show a photo, the total preparation time and review rating, a musical album page might list songs with links for previews.

It transpires that this is relatively straightforward to add to information on a site, and gives quite a lot of benefit quite quickly.

What is the point of Microdata?

When people visit our web pages they easily understand the underlying meaning of the page through context and other mechanisms. Search engines find it much harder to understand the meaning of web pages. Microdata is a set of attributes or classes which can be added to web pages to give additional, specific, data which can help search engines and other applications to understand the content and display it in a useful way.

Getting Started

The first step is to pick a mark-up format. Google allows Microdata (its recommended format), and also supports Microformats and RDFa.

Microdata is part of HTML5 [1] and adds attributes to tags to assign brief, defined, descriptive names to provide semantic information. It is designed to be simpler than the other options. Microformats[2] are an open standard for assigning semantic information by adding classes. RDFa[3] (Resource Description Framework in attributes) is a W3C recommendation for a set of markup attributes to augment the visual information on the Web with machine-readable hints.

Google does supply a testing tool[4] which you can use to test your markup and identify any errors.

Google supports rich snippets for these content types

  • Reviews
  • People
  • Events
  • Products
  • Recipes
  • Music
  • Business and Organisations
  • Video

Schema.org is a collaboration by Google, Microsoft and Yahoo to provide a wider range of detailed, specific Microdata schemas.

How does it work?


Microdata introduces five simple global attributes (available for any element to use) which give context for machines about your data. These five new attributes are: itemtype, itemscope, itemprop, itemid and itemref.

itemscope and itemtype

An itemscope attribute is added to a div in order to identify that everything in that div is going to be a particular class of information. Then an itemtype attribute is added to specify the precise type of information, the schema being used. e.g.

<div itemscope itemtype="http://schema.org/Movie">
  <div>Director: James Cameron (born August 16, 1954)</div>
  <div>Science fiction</div>
  <a href="../movies/avatar-theatrical-trailer.html">Trailer</a>
</div>



Once you have labelled the itemtype, it is possible to use itemprop attributes to give additional information about the kind of data which is represented by the other elements of the HTML. So if we continue working on our Avatar fragment we have:

<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Avatar</h1>
  <div>Director: <span itemprop="director">James Cameron</span> (born August 16, 1954)</div>
  <div itemprop="genre">Science fiction</div>
  <a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a>
</div>


So search engines now know that this information is about a movie, that the heading is the name of the movie, that the director is James Cameron, that the genre is science fiction and that there is a link to a trailer. That was all information which was easy for a person to infer from the displayed data, but which was impossible for an automated system to know without help. It is possible for itemscopes to be nested, so that an itemtype of ‘Person’ may have an address which has an itemtype of ‘Address’ within it.
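Just to illustrate what a machine can pull out of that markup, here is a deliberately naive itemprop extractor using Python’s standard html.parser (my own toy, far simpler than a real microdata parser):

```python
from html.parser import HTMLParser

class ItempropExtractor(HTMLParser):
    """Naive extractor: records the text inside any element carrying itemprop."""
    def __init__(self):
        super().__init__()
        self.props = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemprop" in attrs:
            self._current = attrs["itemprop"]

    def handle_data(self, data):
        if self._current:
            self.props[self._current] = data.strip()
            self._current = None

parser = ItempropExtractor()
parser.feed('<div itemscope itemtype="http://schema.org/Movie">'
            '<h1 itemprop="name">Avatar</h1>'
            '<div itemprop="genre">Science fiction</div></div>')
assert parser.props == {"name": "Avatar", "genre": "Science fiction"}
```

Without the itemprop attributes, all the parser would see is undifferentiated text.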

Data type disambiguation

There are some data types for which it may be useful to clarify the data. For example, dates and times can benefit from being marked up with the <time datetime=""> tag, where the ISO format is used to unambiguously represent the date. e.g.

<time datetime="2011-04-01">04/01/11</time>
<time datetime="2011-05-08T19:30">May 8, 7:30pm</time>
<time itemprop="cookTime" datetime="PT1H30M">1 1/2 hrs</time>

Other examples are support for enumerations of a limited set of possibilities (e.g. informing the search engine that the stock availability notice is one of four possible variations of stock availability) and meta tags to provide a place for making information available which would otherwise be invisible (e.g. duration of a flash video, number of stars in a rating graphic). It is strongly suggested that the meta approach be used sparingly. All the search engine providers prefer to see markup related to user-visible content.

The markup is valid HTML 5.


Microformats

This is the oldest way of providing additional semantic information to web pages. It relies mostly upon the class attribute, and sometimes the id, title, rel or rev attributes.

An advantage of this approach is that it has minimal effect on the validity of websites when classes are used for the information. On the other hand, there may be a temptation for developers to use semantic classes for css styling, and end up confusing one with the other. In addition, the use of title attributes has caused problems with accessibility in the past.


RDFa

RDFa is a highly extensible but arguably more complex solution for providing semantic metadata. URLs are used to identify almost all things, being applied to a property attribute. In the example below a namespace is declared to simplify the rest of the markup in the code fragment. Although flexible enough to work with any number of entity definitions, it appears that a perceived lack of consistent, agreed definitions has tended to hold it back. It is also more complex, and that has slowed adoption.

<div xmlns:v="http://rdf.data-vocabulary.org/#" typeof="v:Person">
   My name is <span property="v:name">Bob Smith</span>, 
   but people call me <span property="v:nickname">Smithy</span>.
   Here is my homepage: 
   <a href="http://www.example.com" rel="v:url">www.example.com</a>.
   I live in 
   <span rel="v:address">
      <span typeof="v:Address">
         <span property="v:locality">Albuquerque</span>, 
         <span property="v:region">NM</span>
      </span>
   </span>
   and work as an <span property="v:title">engineer</span>
   at <span property="v:affiliation">ACME Corp</span>.
</div>


Where might it be used?

Taking Tesco.com as an example:

Wine reviews

This fragment of a Google search page for “tesco isla negra merlot reserva” is interesting – the top result is the main wine page on the wine website. The fourth entry is from the reviews.tesco.com site – the same reviews which appear on the product page. It is possible to click through to the product page and then see the shopping basket etc. However, it might be useful if the star rating and reviews appeared on the main page’s search result (as that is the one we would particularly like people to click through on).

Recipes on Realfood

The Realfood recipes have ratings – there could be a thumbnail, ratings and cooking time showing up in the Google search listings. Which of those results is someone most likely to click on?


Realfood Recipe markup using Microformat

The following additional classes and spans added to the HTML on the page allow the image, rating and time to cook to appear in the search results, producing this:

<div class="hrecipe">
 <h1 class="fn">Pecan pie recipe</h1>
  <li class="hreview-aggregate">
       <p id="recipeStarRating" class="rating">
  <span>5 stars</span><span class="ratingtooltip">Rating: 5 stars</span>
  <span class="count">(3 ratings)</span>
 <img src="http://realfood.tesco.com/media/images/pecan-pie-hero-3f2d0a4e-e1d9-4bd0-9e42-b375c0b83389-0-472x310.jpg" alt="pecan pie hero" width="472" height="310" class="photo" />
    <span>Cost per serving: 41p</span>
    <span class="duration"><span class="value-title" title="PT0H35M"></span>Takes: 25 mins to prepare and 35 mins to cook</span>
    <span>Serves: 8</span>
   <p>Preheat the oven... </p>

Wine Markup using Microdata

Wine is an interesting case, as the wine reviews site (reviews.tesco.com) does have the appropriate markup but the main wine site (www.tesco.com/Wine) doesn’t, even though it contains the same review information. The side effect is that the reviews site appears more attractive and informational in Google listings, even though we want people to visit the higher ranked main page. The second example has a weaker text summary because the reviews subdomain doesn’t have any detailed product information.

e.g. on reviews.tesco.com we find the reviews surrounded by:

<span itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating"> …

While on www.tesco.com/Wine there is no use of MicroData

<div> …

The following code fragment highlights where MicroData is currently used on that wine reviews site (this is not provided as a good example – the HTML illegally nests div elements within span elements, for instance).

<span itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">

<div id="BVRRSReviewsSummaryID">

<div id="BVRRSOverallRatingContainerID">

<div id="BVRRSReviewsSummaryTitleID">Overall rating:</div>

<div id="BVRRSReviewsSummaryRatingImageID">

<img src="http://tescowines.ugc.bazaarvoice.com/1071-en_gb/4_4/5/ratingLarge.gif" width="115" height="25" alt="4.371 / 5" />

<div id="BVRRSReviewsSummaryRatingTotalsID">

<span id="BVRRSReviewsSummaryOutOfID">

<div itemprop="ratingValue">4.4</div>

<div itemprop="bestRating">5</div>

<span id="BVRRSReviewsSummaryCountID">(<span itemprop="reviewCount">62</span> reviews) <span>62</span>







I think it would be a relatively small change to implement one of these options in various Tesco sites (whether recipes on Real Food, reviews on Wine, product information on GHS or something else) and it would bring value to their web proposition and give them more appealing search results quickly.

The question then would be which format to use.

MicroData has the advantage of being both part of the HTML 5 specification and supported by all the major search engines, so this should probably be the strategic choice. The only concern would be whether the fact that it is invalid HTML for prior versions of HTML is significant. It affects validation, not rendering.




Code Contracts in .Net



I’d heard a little about Code Contracts, introduced in .Net 4.0, but until today I hadn’t really looked into them at all. It turns out there is a nice video on YouTube by David McCarter here: http://www.youtube.com/watch?v=xVhQ9yfo54I

In a nutshell, for the few people who have not come across them yet: they live in System.Diagnostics.Contracts and enable you to set preconditions (what does it expect), postconditions (what does it promise) and invariants (what does it maintain).

We know that data is evil. Well, everyone else’s data is evil.

As such, we end up validating parameters to ensure that they are not null, not empty and so forth – an element of hygiene which sometimes gets overlooked.

If we are already so used to putting in these guard conditions, what do Code Contracts give us? On my cursory look at them, there are some clear benefits.

1. Static checking

You can enable static checking of code contracts so that every time the code is built those contracts will be evaluated. Is there some rarely used code which is carelessly passing in a null reference? Rather than failing at run time in obscure circumstances, my compiler can tell me about it.

Yes Please.

2. Checking happens outside the method

When we have guard conditions inside our method, that is the place where checking happens and where errors get reported. We might then have to investigate the stack trace to find out where the actual source of the error is.

Code contracts are located within the method, but preconditions are checked outside the method when anything attempts to call it. The contracts become part of the method and determine whether it can be called or not. That means that we get an error reported much closer to the actual source of the problem.
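Code Contracts are specific to .NET, but the general idea – a precondition attached to the method yet enforced at the call boundary, so failures are reported at the caller – can be sketched in Python with a decorator (my own illustration, not how the .NET contract rewriter actually works):

```python
import functools

def requires(predicate, message):
    """Attach a precondition that is checked every time the function is called."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not predicate(*args, **kwargs):
                # The failure surfaces at the call site, not deep inside the method body.
                raise ValueError(f"Precondition violated calling {func.__name__}: {message}")
            return func(*args, **kwargs)
        return wrapper
    return decorator

@requires(lambda name: name is not None and name.strip() != "", "name must be non-empty")
def greet(name):
    return f"Hello, {name}!"
```

Calling `greet("Ada")` succeeds, while `greet("")` fails at the boundary with a message naming the violated contract, before the method body ever runs.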


The following examples are mostly taken directly from the video.

Examples – preconditions

Contract.Requires(ex != null);

This will raise an exception if the parameter ex is null.

Contract.Requires<ArgumentNullException>(ex != null, "My custom error message");

This will raise an ArgumentNullException if the parameter ex is null, and include a custom error message.

Examples – postconditions


Contract.Ensures(!string.IsNullOrWhiteSpace(Contract.Result<string>()));

This checks that the string returned by a method is neither null nor whitespace. You can’t use the variable that would be returned from the method, but you can use Contract.Result and give it the type of the return value, which allows this postcondition to be evaluated.

Examples – maintaining state

Contract.EnsuresOnThrow<MyBusinessException>(Contract.OldValue(this.name) == this.name);

Here I want to check that if a MyBusinessException was thrown, a particular value (this.name in this instance) hasn’t changed. Contract.OldValue allows you to reference the original value of a variable for the purposes of this check. If my method didn’t clean up nicely as a result of throwing the exception, then this contract would be violated. Modifying my code to ensure that I handle this situation properly means that this contract can be met.

Unanswered Questions

Something that I want to look into, but haven’t seen any details about yet, is whether there is a performance overhead to using Code Contracts. Under the covers it works by rewriting the IL – there isn’t any magic; ultimately it all has to be code. This is merely a clever and succinct way of enabling it without requiring a lot of complicated coding every time.

It looks to me as though this would be a boon in writing code which is more robust, reducing maintenance costs and the probability of live errors. But is it suitable for all environments, or perhaps not for computationally intensive ones? Time and further investigation will tell.


Other References

The main source for information is on the Microsoft site here: http://research.microsoft.com/en-us/projects/contracts/

There are additional videos on Microsoft’s Channel9 site here: http://channel9.msdn.com/Search/?Term=code%20contracts

Transactional REST



I’ve just finished watching an interesting set of slides about implementing transactions over HTTP using REST and the Try-Confirm/Cancel pattern.

The video and slides are available on the InfoQ site here http://www.infoq.com/presentations/Transactions-HTTP-REST and an easier-to-read version of the slides alone is available here http://pautasso.info/talks/2012/soa-cloud-rest-tcc/rest-tcc.html#/title (note: you use the spacebar to progress through the slides).

Transactions for the REST of us

They start out by going over the typical issues – that HTTP is stateless and so doesn’t really support transactions where you may want to back out changes in multiple locations safely and consistently.

Their proposal has to be interoperable (no changes/extensions to HTTP), maintain loose coupling (so that REST services remain unaware that they are participating in a transaction) and simple enough that it is worth adopting.

The proposed Try-Confirm/Cancel pattern relies upon an initial state, a reserved state and a final state.

Try puts something into a reserved state.

Cancel (or Timeout) moves from the reserved state back to the initial state.

Confirming a reserved state moves to the final state.

From a programmatic point of view

  1. Try – inserts into a reservation db
  2. Confirm – updates reservation, set status = confirmed
  3. Cancel – invalidate reservation
  4. Timeout – invalidate reservation
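As a sketch of what a participant service’s side of these four operations might look like (the names and structure here are my own illustration, not taken from the talk):

```python
import time
import uuid

# Possible states in the Try-Confirm/Cancel pattern.
INITIAL, RESERVED, CONFIRMED = "initial", "reserved", "confirmed"

class ReservationService:
    """Toy participant: resources move initial -> reserved -> confirmed."""

    def __init__(self, timeout_seconds=300):
        self.timeout_seconds = timeout_seconds
        self.reservations = {}  # reservation id -> (state, created_at)

    def try_reserve(self):
        """Try: move a resource into the reserved state and hand back its id."""
        rid = str(uuid.uuid4())
        self.reservations[rid] = (RESERVED, time.time())
        return rid

    def confirm(self, rid):
        """Confirm: reserved -> confirmed; fails if unknown, cancelled or expired."""
        state, created = self.reservations.get(rid, (INITIAL, 0))
        if state != RESERVED or time.time() - created > self.timeout_seconds:
            return False
        self.reservations[rid] = (CONFIRMED, created)
        return True

    def cancel(self, rid):
        """Cancel (or timeout): reserved -> back to the initial state."""
        state, _ = self.reservations.get(rid, (INITIAL, 0))
        if state == RESERVED:
            del self.reservations[rid]
```

In a real REST implementation each of these would be an HTTP resource operation rather than a method call, but the state machine is the same.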

A number of service calls can then be wrapped into a single transaction by using a Transaction Coordinator service to handle the confirmations. So, for example, an event booking might reserve a seat from one service, pay via another service, and then call the transaction coordinator, which attempts to confirm the payment and the reservation.

If there is a failure before confirmation, both services time out and return to their original state. If there is a failure after confirmation, both services have been confirmed and are in their final state.

If there is a failure between reservation and confirmation, or between confirmation steps, then the transaction coordinator can either allow unconfirmed states to time out or (perhaps better behaved) cancel those outstanding reservations and cancel any confirmations it made before the failure occurred.
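The coordinator’s confirm-or-cancel logic can be sketched like this (again my own illustration; it assumes each participant exposes confirm and cancel operations and that cancel is idempotent, so it is safe to send cancel to every participant on failure):

```python
def coordinate(participants):
    """Try to confirm every participant in turn; on any failure, cancel all.

    Each participant is a (confirm, cancel) pair of callables; confirm
    returns True on success. Returns True only if every confirmation
    succeeded.
    """
    for confirm, _ in participants:
        if not confirm():
            # Roll back: unconfirmed reservations and any confirmations made
            # before the failure are all cancelled (cancel assumed idempotent).
            for _, cancel in participants:
                cancel()
            return False
    return True
```

In the REST version described in the talk, confirm and cancel would be HTTP calls against each participant’s reservation resource rather than local callables.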

One particular advantage of this approach over a “workflow plus compensation” approach is that here the ‘undo’ functionality is pushed onto participant services rather than being a responsibility of an overall workflow. Thus services can timeout or cancel reserved states and error handling can be decoupled from the explicit workflow. This increased autonomy of participating services helps us keep our services decoupled and avoid the strong coupling that seems to ‘leak’ into our systems so easily.

This obviously isn’t a panacea – there will be some circumstances where it fits better than others. It fits particularly well with reservation-style business models (booking, or purchasing from limited stock).