I took the following notes from a series of lectures about large scale architectures while at QCon 2015. I’m reproducing them here for easy reference and in case anyone else also finds them helpful or interesting.
Matt Ranney, Uber
This was the first time they have talked about their architecture in public. Their supply and demand picture is very dynamic, since their partners can do as much or as little as they want. The demand from passengers who want rides is also very dynamic. He uses the terms supply and demand but doesn’t lose sight of the fact that they are people. The mobile phone is the interface for their dispatch system. He showed an animation of London routes on New Year’s Eve, their busiest time.
Partners and riders drive it all. They talk to dispatch over mobile internet data. Below that are maps/ETA, services and post trip processing. Below that are databases and money. They use node, Python, Java and Go. Lots of technology decisions were made in order to allow them to move quickly. All of it is basically in service of mobile phones. Dispatch is almost entirely written in node.js, but they now plan to move to io.js. One of the things they have found is that the enthusiasm of their developers enables lots of work to be done. Don’t underestimate the benefit of enthusiastic developers who enjoy working with a technology. The maps service handles maps, routes, route times etc. The business services are written mostly in Python. They have many databases. The oldest stuff is Postgres. They also have redis, MySQL, their own db and riak. The post trip processing pipeline has to do a bunch of stuff (in Python) – collect ratings, send emails, arrange payment etc.
The dispatch system has just been totally rewritten. Even though Joel Spolsky’s cautionary tale is well known, there were lots of problems with the existing system. In particular, the assumption of one driver and one rider was baked deep into everything. That stopped them from going into other markets like boxes or food, or having multiple riders sharing. They also sharded their data by city, which wasn’t sustainable, and they had many points of failure that could bring the whole thing down.
So even bearing in mind that cautionary tale, rewriting from scratch was still the right thing to do.
The new dispatch system
They generalised the idea of supply and demand, with services whose state machines keep track of everything about each. Supply carries all kinds of attributes – is there a child seat, number of seats available, wheelchair carriage, car share preferences. A service called disco (it stands for “dispatch optimisation”) matches things up. Their old system only dealt with available cars, but this allows a wider view. They added geotag by supply, geotag by demand and routing/eta services. The disco service does a first pass to get nearby candidates, then checks with the routing service to see if there are river boundaries or similar things which rule out some candidates. Some places like airports have queuing protocols.
The geospatial index is cool. It has to be super scalable, with 100 million writes per second and many more reads. The old one only tracked dispatchable supply, so a global index was easy. The new world needs all supply in every state, plus their projected routes into the future. So they use the “S2” library from Google, which really helps with spherical geometry. Each lat/long maps to a cell ID, which represents a box on the surface of the earth. An int64 can represent every cm on earth! They use level 12, which is about a km on a side. An interesting property is that you can quickly get the coverage for a shape – a 1km radius around the current location is covered by five cells. The cell ID is used as a shard key, so the index can scale out by adding new nodes. Sqmap.com is good for exploring this.
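A minimal sketch of how a cell ID can double as a shard key. It does not use S2: `cell_id` below is a flat lat/long grid standing in for an S2 cell at a fixed level, and the cell size, shard count and all names are assumptions made for illustration, not anything shown in the talk.

```python
import math

CELL_DEG = 0.01                      # assumed cell size in degrees (about 1km of latitude)
COLS = round(360 / CELL_DEG)         # number of grid columns around the globe
NUM_SHARDS = 8                       # assumed number of index nodes


def cell_id(lat, lng):
    """Map a position to an integer grid cell (a stand-in for an S2 cell ID)."""
    row = math.floor((lat + 90.0) / CELL_DEG)
    col = math.floor((lng + 180.0) / CELL_DEG)
    return row * COLS + col


def shard_for(cell):
    """The cell ID is the shard key: all data for one cell lives on one node."""
    return cell % NUM_SHARDS


def covering(lat, lng):
    """Cells to search around a point: its own cell plus the eight neighbours."""
    return [cell_id(lat + dr * CELL_DEG, lng + dc * CELL_DEG)
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)]


class SupplyIndex:
    def __init__(self):
        self.shards = [{} for _ in range(NUM_SHARDS)]   # shard -> {cell: {supply_id}}
        self.last_cell = {}                             # supply_id -> current cell

    def update(self, supply_id, lat, lng):
        """Record a new position, removing the supply from its previous cell."""
        old = self.last_cell.get(supply_id)
        if old is not None:
            self.shards[shard_for(old)][old].discard(supply_id)
        cell = cell_id(lat, lng)
        self.shards[shard_for(cell)].setdefault(cell, set()).add(supply_id)
        self.last_cell[supply_id] = cell

    def nearby(self, lat, lng):
        """First-pass candidate search: only the covering cells are consulted."""
        found = set()
        for cell in covering(lat, lng):
            found |= self.shards[shard_for(cell)].get(cell, set())
        return sorted(found)
```

A second pass, such as the routing check described above, would then filter the candidates that `nearby` returns.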
They wanted to reduce wasted time and resources, help them to be productive, reduce extra driving, and provide the lowest overall ETAs. Normally demand searches the currently available supply (which works pretty well), but they decided to also let it search drivers who are on a trip, who might be nearer and available within a better timescale. This gives a better result for the demand, and also means less unpaid drive time for a different driver. It is like the travelling salesman problem at an interesting scale in the real world.
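To make the idea concrete, here is a small sketch of choosing between an idle car and a car that is still on a trip. The `Supply` type, its fields and the numbers are hypothetical; the real matching is certainly richer than a single minimum.

```python
from dataclasses import dataclass


@dataclass
class Supply:
    name: str
    minutes_to_pickup: float          # drive time from where the car becomes free
    minutes_until_free: float = 0.0   # 0 for an idle car, remaining trip time otherwise

    @property
    def eta(self):
        """Overall ETA for the rider: wait for the car to be free, then drive over."""
        return self.minutes_until_free + self.minutes_to_pickup


def best_match(candidates):
    """Pick the lowest overall ETA, whether or not the driver is currently on a trip."""
    return min(candidates, key=lambda s: s.eta)
```

With an idle car 8 minutes away and a car finishing a trip in 2 minutes just 3 minutes from the rider, `best_match` picks the car that is still on its trip.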
So how do they get scalability?
Everybody wants availability but some want more than others. Banks don’t seem to care so much about their service availability and are quite happy with planned downtime since their customers rarely move elsewhere. Uber, on the other hand, finds that if they are down the riders (and their money) go straight to another ride company. Availability matters a lot to Uber. So everything is retryable. It must be made retryable, even if it is really hard to do. They spent a lot of time figuring out how to ensure everything is idempotent. Everything is killable (chaos monkey style). Failure is common, and they exercise it often. They don’t have graceful shutdowns, only crash-only shutdowns. They also want small pieces to minimise the cost of individual things dying.
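A minimal sketch of why retryable has to go hand in hand with idempotent. The `PaymentService` class, the request ID scheme and the dropped replies are all hypothetical; the point is only that a replayed request with the same ID has no extra effect.

```python
class PaymentService:
    """Hypothetical service that remembers request IDs so replays are harmless."""

    def __init__(self, replies_to_drop=2):
        self.charged = {}                    # request_id -> amount
        self.replies_to_drop = replies_to_drop

    def charge(self, request_id, amount):
        # The charge is applied at most once per request ID.
        self.charged.setdefault(request_id, amount)
        # Simulate a reply that is lost after the work has been done.
        if self.replies_to_drop > 0:
            self.replies_to_drop -= 1
            raise ConnectionError("reply lost after the work was done")
        return self.charged[request_id]


def call_with_retries(operation, attempts=5):
    """Blindly retry; safe only because the operation is idempotent."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except ConnectionError as err:
            last_error = err
    raise last_error
```

Here the caller retries three times in total, but the rider is only charged once.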
No pairs of things like databases, because pairs don’t recover well from being randomly killed. Kill everything means they can kill database nodes too, which changes some of their database choices. Killing redis is expensive, killing ringpop is fine. Originally they had services talking to each other via independent load balancers. But what happens when a load balancer dies? They decided that the clients need to have some intelligence about how to route around problems. They ended up making a service discovery system out of ringpop. This is just getting into production, and they will open source it soon.
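As an illustration of clients routing around problems, here is a generic consistent hash ring in which the caller skips nodes it believes are down. This is not ringpop's API: the `Ring` class, its methods and the node names are assumptions, and real membership would come from a gossip protocol rather than a local `mark_down` call.

```python
import bisect
import hashlib


def _hash(key):
    """Stable 64-bit hash of a string."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class Ring:
    def __init__(self, nodes, replicas=50):
        # Each node is placed at several points on the ring to even out load.
        self.points = sorted((_hash(f"{n}#{i}"), n)
                             for n in nodes for i in range(replicas))
        self.keys = [p for p, _ in self.points]
        self.down = set()

    def mark_down(self, node):
        self.down.add(node)

    def lookup(self, key):
        """Walk clockwise from the key's position to the first node not known to be down."""
        start = bisect.bisect(self.keys, _hash(key))
        for step in range(len(self.points)):
            node = self.points[(start + step) % len(self.points)][1]
            if node not in self.down:
                return node
        raise RuntimeError("no live nodes")
```

When one node is marked down, only the keys it owned move elsewhere; every other key keeps its owner.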
Horrible problems? Latency
Overall latency is greater than or equal to the latency of the slowest component. If p99 is 1000ms, then with 1 component you know that at least 1% of requests have a 1000ms latency. But if a request touches 100 such components, then 63% of requests would see a 1000ms latency (1 − 0.99^100 ≈ 0.63). Note: He didn’t go into this in more detail, but I’m concerned with his maths here – it would be true if the 100 services were ‘AND’ed together – but in his service are they not ‘OR’ed together? A good way to solve this latency is “backup requests with cross service cancellation”. Service A sends a request to service B(1), telling it that the request will also go to service B(2). After 5ms Service A sends the same request to service B(2), telling it that it was also sent to B(1). Whichever Service B responds first then sends a cancel to the other server.
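A rough sketch of a backup request using asyncio. It simplifies the scheme described above: the caller cancels the slower call itself, whereas in the talk the two B servers cancel each other. `call_replica` is a hypothetical stand-in for a real RPC, and the delays are made up.

```python
import asyncio


async def call_replica(name, latency_s):
    """Hypothetical stand-in for an RPC to one replica of service B."""
    await asyncio.sleep(latency_s)
    return name


async def backup_request(primary, backup, backup_delay_s):
    """Call primary; if it has not replied within backup_delay_s, also call backup.

    Returns the first reply and cancels whichever call is still outstanding.
    `primary` and `backup` are zero-argument functions returning coroutines.
    """
    first = asyncio.ensure_future(primary())
    done, _ = await asyncio.wait({first}, timeout=backup_delay_s)
    if done:
        return first.result()            # fast path: no backup was needed

    second = asyncio.ensure_future(backup())
    done, pending = await asyncio.wait({first, second},
                                       return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()                    # stands in for the server-to-server cancel
    await asyncio.gather(*pending, return_exceptions=True)
    return next(iter(done)).result()
```

The backup delay trades extra load for tail latency: a short delay sends more duplicate requests, a long one leaves slow requests waiting.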
Horrible problems? Datacentre failure
Harder to prepare for, happens less frequently. Failover of data is easy enough to arrange, but how do you handle all the in process trips? The ones which are currently on the road? The failover datacentre just doesn’t have the data, as you can’t be replicating all the data all the time. Their solution is to use their driving partner phones. The server periodically creates an encrypted digest of information and sends it back to the phone. If failover happens the phone finds itself talking to a new server, which doesn’t recognise the current location of the phone and requests the encrypted digest so that it can bring itself up to date.
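A sketch of the digest round trip, with one loud caveat: Python's standard library has no cipher, so this version only signs the trip state with an HMAC so that a new datacentre can trust what the phone hands back. It does not encrypt it, as the real system is described as doing. The key, field names and functions are all hypothetical.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"shared-between-datacentres"   # hypothetical key known to every datacentre


def make_digest(trip_state):
    """Server side: pack the trip state into an opaque blob for the phone to hold."""
    payload = json.dumps(trip_state, sort_keys=True).encode()
    tag = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return base64.b64encode(tag + payload).decode()


def restore_from_digest(blob):
    """Failover side: verify the blob came from a server, then recover the state."""
    raw = base64.b64decode(blob)
    tag, payload = raw[:32], raw[32:]     # a SHA-256 tag is 32 bytes long
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("digest failed verification")
    return json.loads(payload)
```

The phone never needs to understand the blob; it only stores the latest one and returns it when a server asks.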
Philip Wills, the Guardian @philwills
Five years ago they relaunched the Guardian site as a new shiny monolith, which shows that everything that starts off as new and shiny becomes legacy over time! Over the last five years the Guardian has broken down their monolith into microservices. Microservices are too big when we conflate solving two different kinds of problems:
- independent products
- single responsibility applications
Why microservices?
The Guardian is owned only by the Scott Trust, which is there to safeguard their journalistic freedom and liberal values. That has enabled their investment into it. Last week they deployed forty applications to production. They produce the website, the CMS, and the reporting services. What is a microservice? Martin Fowler’s article makes it clear that it isn’t a monolith. Not an ESB (since it should rely on dumb pipes). But the article is a little woolly on details. We do want to deliver business innovation though. One of the things which is valuable is independent teams. Masters of their own destiny.
In 2008 they had built a nice chunky monolith with Thoughtworks. A short while after that they ran a great hack day, but they could never get any of its innovative ideas into production. The organisational complexity was too high. They never got into a Gantt chart of doom, but they did have a full time release manager for their fortnightly releases. They wanted to limit the scope of failures. They made changes to their applications. Micro apps. They put placeholders into the monolith so that it would get things from somewhere else. They could break some things out of the monolith this way.
Then the big WikiLeaks story happened. They had arranged an online Q&A with Julian Assange, but they didn’t have a real solution for it, so they used their online commenting solution to do it, and it was a disaster. The normal commenting use case was that almost everyone just looked at the first page of comments, but now everyone asked for all the comments. It didn’t fall over completely, but it went really slowly, all the monolith threads got tied up and the whole site ground to a halt. As a result of this they decided that they needed to allow things to fail independently. Otherwise people fear change. If you are going to push the boundaries you have to be aware that some innovations are not going to work. That also means that you need to be able to kill things.
Clean software death which leaves nothing behind. Their first commenting system was using “pluck”, and moving that out of the monolith was a pain. Several times they tried to get rid of everything and kept having to go back to kill more off. It was very difficult to reason about whether it can be cleanly extracted.
Keeping teams in line with their products. A nice stable well defined interface. JSON is a really poorly defined data interface, and they are looking at moving away to Thrift. They like the strong typing of Scala, why not have that in their messages too? These should be deployable independently too. Moving forwards should always be backwards compatible, although keeping interfaces stable is hard. For an independent product you must own your own data store. No integration between teams on the basis of database. This allows for independence but doesn’t speak to their other goals.
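To illustrate what typed, backwards compatible messages buy you over bare JSON, here is a small sketch using a dataclass. It is not Thrift, and the `CommentPosted` message and its fields are invented for the example. New fields get defaults so old senders still work, and unknown fields from newer senders are ignored.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CommentPosted:
    """Hypothetical message type; the field names are made up for illustration."""
    comment_id: int
    author: str
    body: str
    parent_id: int = 0     # added in a later version, so it needs a default


def decode(raw):
    """Build a typed message from a dict, tolerating older and newer senders."""
    known = {f.name: f.type for f in fields(CommentPosted)}
    # Drop fields this version does not know about (sent by a newer producer).
    msg = CommentPosted(**{k: v for k, v in raw.items() if k in known})
    # Reject values of the wrong type instead of passing them downstream.
    for name, expected in known.items():
        if not isinstance(getattr(msg, name), expected):
            raise TypeError(f"{name} should be {expected.__name__}")
    return msg
```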
Single responsibility applications
Following the meme others have used this week, microservices should be “Small enough to fit in your head”. More usefully, they decided that a well partitioned service has one key metric which tells you whether it is doing the job it should do. These must be isolated from each other and must not impact each other’s performance, so they run everything in AWS, and most operational problems are solved by either turning it off and on again, or throwing more hardware at it. How do they actually structure things?
- content api
- asset management
Jason McHugh, Facebook
Facebook purchased Atlas from Microsoft, and the contract included Microsoft continuing to run it on their infrastructure for a couple of years. Atlas was a fourteen year old company.
Ad serving tech
The vision of ads at Facebook is that they don’t have to suck. Can they make it a positive experience? It is a huge industry with a vast budget. Digital advertising is fast growing. Among the other forms of advertising, only television isn’t falling, although people are now spending more time on mobile and online than with tv.
Third party ad serving
Advertisers think of the people hosting their ads as publishers. They use an independent third party to aggregate numbers from many publishers (this is where atlas comes in). The third party also manages campaigns, creative concepts, action tags for what the customer is doing. Advertisers get snippets they give to publishers and action tags in their own site to understand conversions. When a request for an ad is made the ad server does an identity resolution, deduces some demographics and then puts together the advertiser creative content. Traffic patterns and probabilistic matching and machine learning are used to guess who people are.
Understanding the systems was complex, none of the devs had any experience in this business area. They also had to get to grips with the existing technology stack, architecture, the data model and databases (19 separate ones). One db had 345 tables with nearly 4000 columns. Couldn’t even answer some of the WTH questions. It was deployed on several thousand machines across many data centres. It was a huge mature product. Which subset did they want to implement?
- third party ad server definitely
- many other things
Lift and shift is a common acquisitions approach. An evolutionary approach. However, they didn’t want to do this. The hardware was old and owned by Microsoft, built on technologies closed to Microsoft. This was counter to their overall approach at facebook.
They took nothing from the original. He showed their new high level architecture, then looked at the physical architecture of ad delivery and logging. User traffic goes to DNS (anycast) to see which cluster the request should be sent to. Cartographer pushes new maps to DNS as it determines changes in the best routing right now. Then a cluster switch receives the request and uses a hash to route it to a layer 4 load balancer, which uses consistent hashing. This then routes to proxygen, which is a layer 7 load balancer. This sends the request to the Atlas delivery engine.
As an ad is delivered to the customer there is also a real time pipeline for processing everything which has happened: Atmonitor (fraud detection of nonhuman activities), Limiter (was it legal from a billing perspective), report preprocessing (messages sharded by strongest identity, second strongest identity and placement, stored in Presto db), Aggregator (roll up sums), invoicing etc.
“Scribe” is a hugely important component. A high throughput message queue, highly scaled. Not lossless but guarantees are excellent. Decouples producers. Persistent for n days. Sharded consumption, checkpoint streams with fixed periodicity. Message queues can be costly. Repeatable re-execution is an important ability. Finding the repeated messages or larger units of work amongst billions is tricky.
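A toy sketch of checkpointed consumption with repeatable re-execution, for a single shard. This is not Scribe's API: `Log`, `Consumer` and the checkpoint interval are assumptions. The idea shown is that the running totals and the offset are saved together, so replaying from the last checkpoint after a crash never double counts.

```python
class Log:
    """Hypothetical in-memory stand-in for one shard of a persistent message log."""

    def __init__(self, messages=()):
        self.messages = list(messages)

    def read_from(self, offset):
        """Return (offset, message) pairs starting at the given offset."""
        return list(enumerate(self.messages))[offset:]


class Consumer:
    def __init__(self, log, checkpoint_every=3):
        self.log = log
        self.checkpoint_every = checkpoint_every
        self.checkpoint = 0          # offset of the next message to process, as last saved
        self.saved_totals = {}       # aggregate state saved alongside the checkpoint

    def run(self, crash_after=None):
        """Process from the last checkpoint; optionally simulate a crash part way."""
        totals = dict(self.saved_totals)
        processed = 0
        for offset, (key, amount) in self.log.read_from(self.checkpoint):
            if crash_after is not None and processed == crash_after:
                return False         # work since the last checkpoint is thrown away
            totals[key] = totals.get(key, 0) + amount
            processed += 1
            if (offset + 1) % self.checkpoint_every == 0:
                # Save the offset and the state together, at a fixed period.
                self.checkpoint, self.saved_totals = offset + 1, dict(totals)
        self.checkpoint, self.saved_totals = len(self.log.messages), dict(totals)
        return True
```

After a simulated crash, calling `run()` again re-executes the messages since the last checkpoint and arrives at the same totals as an uninterrupted run.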
The only mistake which they admitted to was wasting effort by minimising the code in their www tier without considering what other teams were doing in that domain. In the last two years there have been huge improvements there, and it would have been better if they had built for what was coming rather than what was there right now.
Randy Shoup
Aims to give an impression of what these architectures look like in the large and feel like in the small.
They all end up looking somewhat similar. eBay started as a monolithic Perl app written over a weekend, then became a C app with 3 million lines of code, then a Java monolith and now microservices. Twitter was a monolithic Rails app which changed to a Rails plus Scala monolith, then to microservices. Amazon was a monolithic C++ app which changed to Java/Scala and then to microservices.
Ecosystem of services
Hundreds to thousands of independent services. Not tiers but ecosystems with layers of services. A graph of relationships. They are the product of evolution rather than intelligent design. There has never been a top down view at Google. It is much more like variation and natural selection. Services justify their existence through usage. Architecture without an architect. There is no such role and no central authority. Most decisions are made locally rather than globally. The appearance of clean layering is an emergent property. eBay by contrast had an architecture review board which had to pass everything, usually far too late to be valuable. Randy worked on cloud data store in app engine at Google. This was built upon a tower of prior services, each of which was added when something new was needed at a higher level of abstraction. Standardisation can happen without central control.
- standardised communication
  - network protocols (stubby or rest)
  - data formats (protobuf or json)
  - interface schema
- standardised infrastructure
  - source control
  - config management
  - cluster management
Standardisation is encouraged via libraries, support in underlying services, code reviews and searchable code. There is one logical code base at Google that anyone can search to find out whether any of the 10,000 engineers are already working on the same thing. The easiest way to encourage best practices is with actual code. Make it really easy to do the right thing. There is independence of services internally. No standardisation around programming languages, for instance. Four+ languages, many frameworks etc. They standardise the arcs of the graph, not the nodes. Services are normally built for one use case and then generalised for other use cases. Pragmatism wins. E.g. Google file system, bigtable, megastore, Google app engine, gmail. Deprecating old services: if a service is a failure or not used any more, repurpose the technology and redeploy the people. E.g. Google Wave was cancelled, and core services have had multiple generations.
Building a service
A good service has:
- single purpose
- simple, well defined interface
- modular and independent
- isolated persistence(!)
The goals of a service owner are to meet the needs of their clients in functionality, quality, performance, stability, reliability and constant improvement over time. All at minimum cost and effort to develop and operate. They have end to end ownership of the service through to retirement. You build it, you run it. They have the autonomy and authority to choose tech, methodology etc. You are focussed primarily upward to the clients of your service and downward to the services you depend upon. This gives a bounded cognitive load. Small, nimble teams build these services. Typically 3-5 people. Teams that can be fed by two large pizzas.
Service to service relationships: think about them as vendor-customer relationships. Friendly and cooperative but structured. Clear about ownership. The customer can choose to use the service or not. SLAs are provided – promises of service levels which can be trusted. Charging and cost allocation turned out to be important to prevent misuse of services. There was one consumer of their service which was using far too high a proportion of their service time. They kept asking the consumer to optimise their usage, but it was never a high priority for them – until they started getting charged for usage, at which point they quickly introduced an optimisation which cut usage to 10% of the previous figure.
- charge customers for usage of the service
- aligns economic incentives of customer and provider
- motivates both sides to optimise for efficiency
- pre/post allocation at Google
Maintaining service quality benefits from the following approaches:
- small incremental changes
  - easy to reason about and understand
  - risk of code change is nonlinear in size of change
  - (-) initial memcache service submission
- solid development practices
  - code reviews before submission
  - automated tests for everything
- Google build and test system
  - uses production cluster manager (tests are run at a lower priority though!)
  - runs millions of tests in parallel every day
- backward and forward compatibility of interfaces
  - never break client code
  - often multiple interface versions
  - sometimes multiple deployments
- explicit deprecation policy
Operating a service
Services at scale are highly exposed to variability in performance. Predictability trumps average performance. Low latency with inconsistent performance does not equal low latency. Long tail latencies are much more important. The memcache service had periodic hiccups in performance, about one in a million, which were difficult to detect and diagnose. The cause was slab memory allocation. Service reliability.
- highly exposed to failure
- sharks and backhoes are the big problems killing cables
- operator oops (10x more likely)
Resilience in depth. Redundancy, load balancing, flow control. Rapid rollback for oops. Incremental deployment
- Canary systems
- staged rollouts
- rapid rollback
eBay feature flags have been rediscovered many times. Separating code deployment from feature deployment. You can never have too much monitoring!
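A minimal sketch of a feature flag with a staged rollout, showing how code can be deployed everywhere while the feature is only switched on for a percentage of users. The flag name, the `FLAGS` table and the bucketing scheme are assumptions for illustration, not eBay's or Google's implementation.

```python
import hashlib

FLAGS = {"new-checkout": 10}   # hypothetical flag name -> % of users who see it


def bucket(flag, user_id):
    """Deterministically place a user in one of 100 buckets for this flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(flag, user_id, flags=FLAGS):
    """A user sees the feature if their bucket falls below the rollout percentage."""
    return bucket(flag, user_id) < flags.get(flag, 0)
```

Raising the percentage only ever adds users, and setting it back to 0 is the rapid rollback: no redeploy is needed either way.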
Service anti patterns
The mega service.
- services that do too much
- too difficult to reason about
- scary to change
- lots of upstream and downstream dependencies
- breaks encapsulation
- encourages backdoor interface violations
- unhealthy and near invisible coupling of services
- the initial eBay SOA efforts were like this.
Yoni Goldberg, Gilt
Gilt is a flash sales company, with limited, discounted sales, resulting in huge spikes at noon each day. There are about 1000 staff, 150 devs. Classic startup story. Started fast with Ruby on Rails with a Postgres db. The moment of truth was when they first added Louboutin shoes to the site in 2009. Even with all the extra precautions they had planned, the site still failed dramatically. They needed thousands of Ruby processes, Postgres was overloaded, and routing between Ruby processes was a pain. Thousands of controllers, 200k loc, lots of contributors, no ownership. Three things changed:
- Started the transition to JVM
- Microservice era started
- Dedicated data stores for services
Their first ten services were
- core services (checkout, identity)
- supporting services (preferences)
- front end services
They solved 90% of their scaling problems, but not the developers’ pain points. They began the transition to Scala and Play. They now have about 300 services in production.
- deployments and testing
- Dev and integration environments
- service discoverability
- who owns this service?
Caitie McCaffrey, @Caitie
She worked on Halo 4 services for three years, and will talk about architectural issues.
- Halo 1 was only networked peer to peer, with no services.
- Halo 2 had some Xbox services for achievements etc.
- Halo 3 etc had more.
- For Halo 4 they wanted to take the original engine and old services based on physical hardware and build for a more sustainable future.
Main services
- Presence (where you are)
- Statistics (all about your player)
- Title files (static files pushed to players, games, matchmaking)
- Cheat detection (analysed data streams and auto ban people, auto mute jerks)
- User generated content (making maps and game content. Big part of the long tail)
They knew that they had to approach concurrent user numbers of up to 12 million+, with 2 million sold on day 1. They had 11.6 million players online, 1.5 billion games and 270 million hours. No major downtime.
You get a huge spike of users on day 1, and spikes at Christmas or at promotions – the opposite of ramping up. This is why they decided to go with cloud. Worker roles, blob, table storage and service bus. It had to be always available. The game engine expected 100% availability. They needed low latency and high concurrency. Typically people start with a stateless front end, stateless middle tier and storage. Then they added a storage cache, but that added concurrency issues. Other options? They wanted to have data locality. The data is highly partitionable by user, so that would be good to do. Hadoop jobs were too slow. So they looked at the actor model paper from 1973 for thinking about concurrency. An actor can
- send a message
- create a new actor
- change internal state
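The list above can be sketched as a tiny actor built from a thread and a queue. This is a generic illustration, not Orleans or Erlang: `CounterActor` and its message names are made up, and it shows only sending messages and changing internal state (it never creates another actor).

```python
import queue
import threading


class CounterActor:
    """Hypothetical actor: private state, a mailbox, one message handled at a time."""

    def __init__(self):
        self._count = 0                       # touched only by the actor's own thread
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message, reply_to=None):
        """The only way to interact with the actor is to put a message in its mailbox."""
        self._mailbox.put((message, reply_to))

    def _run(self):
        while True:
            message, reply_to = self._mailbox.get()
            if message == "stop":
                break
            if message == "increment":
                self._count += 1              # change internal state
            elif message == "get" and reply_to is not None:
                reply_to.put(self._count)     # send a message back
```

Many threads can call `send` at once without any locks in the actor's own logic, because the messages are processed strictly one after another.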
This gives stateful services with no concurrency issue. Erlang and others did this, but they wanted to do something in .Net. The Microsoft Research team were doing something called Orleans. An Orleans virtual actor always exists. It can never be created or deleted. The runtime manages:
- perpetual existence
- automatic instantiation by runtime
- location transparency
- automatic scale out
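A single-process sketch of the "always exists" idea: callers ask a runtime for a grain by identity and it is activated on demand. This is not the Orleans API, which is .NET; `Runtime`, `PlayerGrain` and their methods are hypothetical, and location transparency and scale out are not modelled at all.

```python
class PlayerGrain:
    """Hypothetical grain holding in-memory state for one player."""

    def __init__(self, player_id):
        self.player_id = player_id
        self.games_played = 0

    def record_game(self):
        self.games_played += 1
        return self.games_played


class Runtime:
    def __init__(self, grain_class):
        self._grain_class = grain_class
        self._active = {}            # identity -> currently activated grain
        self.activations = 0

    def grain(self, key):
        """Callers never create or delete grains; they just ask for one by identity."""
        if key not in self._active:
            self._active[key] = self._grain_class(key)   # automatic instantiation
            self.activations += 1
        return self._active[key]

    def deactivate_idle(self):
        """Drop activations to free memory; the grains still 'exist' for callers."""
        self._active.clear()
```

Note that in this sketch a reactivated grain starts with fresh state, so anything that must survive deactivation has to be persisted by the grain itself.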
It uses the terminology “grains” within a “cluster”. Runtime does
- within the grain you don’t have to worry about concurrency either
- at least once is out of the box
- best for app to decide what it should do if it doesn’t get an acknowledgement back.
Orleans is AP. You can always talk to a grain. You might possibly end up with two instances of the same actor running at the same time.
- dot net framework. C#, F#. Plan is to be dot net core compatible.
- actor interfaces
- promises. They all return promises.
- actor references. Strongly typed so that it can work out what goes where.
- turns. They are single threaded and process messages in a “turn”. You can mark a grain as reentrant, but it still always runs on a single thread.
- persistence. Doesn’t persist any grain state by default. Hard to solve for the general case, developers choose for themselves what they want to do.
Reliability is managed automatically by the runtime. If a machine dies, the runtime just brings the grains up on a different box automatically. The same applies if a single grain dies: it just gets reallocated the next time it is needed. They started working with this academic team in summer 2011 to get it up and running; a two way partnership. Ultimately a stateless front end talks to an Orleans silo which talks to persistent storage.
Halo 4 statistics service (she built this)
- Player grain. Everything about you as the player. Comes up when you go online, then gets garbage collected.
- Game grains. Everything for a game. Xbox posts stats to the game, this writes aggregate data to blob storage, and at end of game sends info to each player grain, which writes its state to table storage.
The player operations are idempotent and could be replayed as necessary in case of message failure.
Performance and scalability
Not bare metal code. But it can run at 90-95% CPU utilisation stably. They have run load tests at this level over 25 boxes for days. It also scaled linearly with number of servers. This was in a report published in March this year.
Programmer productivity and performance
They scaled their team from six to twenty. The devs picked it up quickly and were readily available for hire. Distributed systems now is a bit like compilers in the sixties. Something that works and is scalable is now possible. Orleans is an early iteration of a tool that really enables us to do that. Easy and performant. Orleans is open source on GitHub.
(What was the failure detection method used – question from Ali) They used a configurable timeout, set low – a couple of seconds – so it could fail fast and a new one got hydrated.
(What was latency for serialising the messages for reads) Typically actors were doing just one hop. It could use any storage (expects in azure at the moment). They knew their read patterns so they knew fairly well how things were going to be used. They used service bus for durable transmission of messages.
(How easy was it to onboard devs to this actor model? Are the actors parent/children or finer control needed) It was really easy to onboard people, as long as they understood async. There is no management of actors at all by the apps.
http://www.joelonsoftware.com/articles/fog0000000069.html
https://code.google.com/p/s2-geometry-library/
https://github.com/uber/ringpop
http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf
https://github.com/uber/tchannel
https://www.google.co.uk/webhp?sourceid=chrome-instant&rlz=1C1CHFX_en-gbGB557GB557&ion=1&espv=2&ie=UTF-8#q=backup%20requests%20cross%20service%20cancellation
http://martinfowler.com/articles/microservices.html
http://en.wikipedia.org/wiki/Consistent_hashing
http://en.wikipedia.org/wiki/Actor_model
http://ijcai.org/Past%20Proceedings/IJCAI-73/PDF/027B.pdf
http://research.microsoft.com/en-us/projects/orleans/
https://github.com/dotnet/orleans