May 1, 2011
A thread on NANOG about the recent Amazon cloud outage pointed me to this excellent post by John Ciancutti on Netflix’s approach to managing cloud services.
The most profound lesson I draw from Ciancutti’s post is #3 – “The best way to avoid failure is to fail constantly.”
As someone who works on large, complex, highly interdependent systems, I consider this absolutely essential: anything that can fail will, and it will do so in the most unexpected way. The best systems I’ve seen (from a resilience point of view) have had a great deal of engineering work put into separating functions, so that while a single event can be disruptive, recovery is both possible and quick.
A case in point: a significant website (>$50K / minute) had a 10-minute outage caused by routine maintenance. After some sleuthing, it turned out that a small performance-analysis widget on the site had a hidden dependency on a single-homed server in the “non-critical” portion of the server farm, and the maintenance had rendered that server inaccessible.
Had that website followed the Netflix “chaos monkey” approach, it would have discovered this dependency and either made the widget more robust or scrapped it entirely. As it was, a good business case can be made that the data gathered by the widget was worth far less than the $100K the outage cost.
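To make the idea concrete, here is a toy sketch of how deliberately failing “non-critical” components can surface a hidden dependency like the one above. It is not Netflix’s actual tooling. The component names, the dependency map, and the helper functions are all hypothetical, and a real test would take down real hosts instead of consulting a dictionary.

```python
import random

# Hypothetical dependency map: each component lists what it needs to function.
DEPENDENCIES = {
    "homepage": ["web-frontend", "perf-widget"],
    "perf-widget": ["stats-server"],  # the hidden, single-homed dependency
    "web-frontend": [],
    "stats-server": [],
    "batch-reports": [],
}


def is_up(component, down, deps=DEPENDENCIES):
    """A component is up only if it, and everything it depends on, is up."""
    if component in down:
        return False
    return all(is_up(dep, down, deps) for dep in deps[component])


def find_hidden_dependencies(critical, non_critical, deps=DEPENDENCIES):
    """Fail each supposedly non-critical component in turn and return
    the ones whose loss takes the critical service down with it."""
    return [c for c in non_critical if not is_up(critical, {c}, deps)]


def chaos_round(critical, non_critical, rng, deps=DEPENDENCIES):
    """Fail one randomly chosen component and report whether the
    critical service survived."""
    victim = rng.choice(non_critical)
    return victim, is_up(critical, {victim}, deps)


if __name__ == "__main__":
    suspects = ["stats-server", "batch-reports"]
    print(find_hidden_dependencies("homepage", suspects))
    print(chaos_round("homepage", suspects, random.Random()))
```

Run routinely, a check like this flags the stats server as something the homepage quietly relies on, long before a maintenance window finds out the hard way.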