Chris Rimondi: Top Five SRE Architecture Principles

When working with software developers, our site reliability engineering (SRE) team kept running into the same themes in discussions of scalability issues, regardless of which service was under consideration. We therefore looked for a way to communicate our expectations for application architecture in a cloud environment concisely. The result is a distilled list of just five principles that we refer back to when consulting on new projects or evaluating technical specifications.

Stateless - The state of a service should be determined by a shared database and should not depend on data local to the application. Storage should be treated as a service in itself; the antiquated view of storage as a device should be avoided.
Scale linearly - An application should run as a single process on as small a footprint as possible. This enables the SRE team to scale services linearly and in granular increments. Code logic should not require a specific number of instances; it should be able to scale up or down as load changes. Discrete functionality is preferred, so that there is a single, obvious metric to scale on.
Minimal configuration - Services should require little to no configuration. We have found configuration management to be a ripe source of human error, so services should ship with sane defaults and infer as much as possible at startup from environment variables or service discovery. Thread and memory footprints should configure themselves automatically to make full use of the resources on an instance. A side benefit is that the fewer configuration options a service exposes, the fewer permutations need to be tested.
Robust communication - A great quote from Release It! is, "Integration points are the number-one killer of systems. Every single one of those feeds presents a stability risk." We therefore ask that all integration points of a service be enumerated and have a proper harness for torture-testing data input and output. Communication should be asynchronous whenever possible (and it almost always is). Adequate controls should exist at integration points, including circuit breakers, time-outs, bulkheads and protocol handshaking.
Application visibility - At a minimum, all inbound and outbound transactions should have telemetry that provides visibility into the number and size of transactions, their type and the time each transaction took to execute. Services should know whether they are functioning correctly and make this health status available to the SRE team through some type of API (usually REST). Logging is also a critical component of application visibility. It should be obvious from reading the log files whether the service is working.
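
To make the "Minimal configuration" principle concrete, here is a minimal Python sketch of a service that ships with defaults and only consults the environment for overrides. The names ServiceConfig and load_config, the variable names PORT, WORKER_THREADS and DATABASE_URL, the default values, and the "two threads per CPU" rule are all assumptions made for illustration; they are not taken from any particular service.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceConfig:
    # Hypothetical settings, chosen only for this illustration.
    port: int
    worker_threads: int
    db_url: str


def load_config(env=None, cpu_count=None):
    """Start from sane defaults and override only what the environment sets."""
    env = os.environ if env is None else env
    # os.cpu_count() may return None, so fall back to a single CPU.
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return ServiceConfig(
        port=int(env.get("PORT", "8080")),
        # The default thread count follows the instance size (assumed 2 per CPU)
        # instead of being hand-tuned for each deployment.
        worker_threads=int(env.get("WORKER_THREADS", str(cpus * 2))),
        db_url=env.get("DATABASE_URL", "postgresql://localhost:5432/app"),
    )
```

With an empty environment the service still starts, and every option that is not set is one less permutation to test.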
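
As an illustration of one of the controls named under "Robust communication", the following is a rough Python sketch of a circuit breaker. It is a simplified take on the pattern and not the implementation from Release It!; the class names, the max_failures and reset_after parameters and their default values are assumptions. The wrapped function is expected to enforce its own time-out.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # failures tolerated before opening
        self.reset_after = reset_after    # cool-down in seconds before a retry
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of piling more load on a sick dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down has elapsed: allow one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each integration point in its own breaker, for example breaker.call(fetch_profile, user_id), where fetch_profile is a hypothetical client function, and treat CircuitOpenError as a signal to degrade gracefully.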

There were three sources that we relied heavily on for inspiration. I want to explicitly call these out so we can give credit where credit is due and encourage people to look up these fantastic resources:

The Art of Scalability (second edition has just been released)
The 12 Factor App
Release It!

Finally, a word on the importance of principles. Modern systems are complex. Distributed cloud systems are particularly complex because of the number of integration points. The allure of intuitively understanding systems with thousands of nodes is toxic. When designing these distributed systems we prefer the pattern/anti-pattern approach, in which principles take precedence over attempting to enumerate every possible failure scenario. You will hit failure edge cases you never thought of. Relying on principles instead of our own ability to exhaustively understand a system is therefore a practical dose of humility.

Chris Rimondi

Saturday, February 6, 2016
