The Great Chaos Monkey!

Apr 25, 2011
Working with the Chaos Monkey

Late last year, the Netflix Tech Blog wrote about five lessons they learned moving to Amazon Web Services. AWS is, of course, the preeminent provider of so-called “cloud computing”, so this can essentially be read as key advice for any website considering a move to the cloud. And it’s great advice, too. Here’s the one bit that struck me as most essential:

We’ve sometimes referred to the Netflix software architecture in AWS as our Rambo Architecture. Each system has to be able to succeed, no matter what, even all on its own. We’re designing each distributed system to expect and tolerate failure from other systems on which it depends.

If our recommendations system is down, we degrade the quality of our responses to our customers, but we still respond. We’ll show popular titles instead of personalized picks. If our search system is intolerably slow, streaming should still work perfectly fine.

One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.

Which, let’s face it, seems like insane advice at first glance. I’m not sure many companies even understand why this would be a good idea, much less have the guts to attempt it. Raise your hand if where you work, someone deployed a daemon or service that randomly kills servers and processes in your server farm.

Now raise your other hand if that person is still employed by your company.

Who in their right mind would willingly choose to work with a Chaos Monkey?

Sometimes you don’t get a choice; the Chaos Monkey chooses you. At Stack Exchange, we struggled for months with a bizarre problem. Every few days, one of the servers in the Oregon web farm would simply stop responding to all external network requests. No reason, no rationale, and no recovery except for a slow, excruciating shutdown sequence requiring the server to bluescreen before it would reboot.

We spent months — literally months — chasing this problem down. We walked the list of everything we could think of to solve it, and then some:

swapping network ports
replacing network cables
a different switch
multiple versions of the network driver
tweaking OS and driver level network settings
simplifying our network configuration and removing TProxy for more traditional X-FORWARDED-FOR
switching virtualization providers
changing our TCP/IP host model
getting Kernel hotfixes and applying them
involving high-level vendor support teams
some other stuff that I’ve now forgotten because I blacked out from the pain

At one point in this saga our team almost came to blows because we were so frustrated. (Well, as close to “blows” as a remote team can get over Skype, but you know what I mean.) Can you blame us? Every few days, one of our servers — no telling which one — would randomly wink off the network. The Chaos Monkey strikes again!

Even in our time of greatest frustration, I realized that there was a positive side to all this:

Where we had one server performing an essential function, we switched to two.
If we didn’t have a sensible fallback for something, we created one.
We removed dependencies all over the place, paring down to the absolute minimum we required to run.
We implemented workarounds to stay running at all times, even when services we previously considered essential were suddenly no longer available.
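
To make the fallback idea concrete, here is a minimal sketch in Python. It is not Netflix's Chaos Monkey or anything we ran at Stack Exchange; every name in it (`RecommendationService`, `chaos_monkey`, `homepage_titles`, the placeholder titles) is hypothetical. It only illustrates the pattern described above: something kills a dependency at random, and the caller still responds, just with a degraded answer.

```python
import random


class ServiceDown(Exception):
    """Raised when a simulated dependency is unavailable."""


# Hypothetical fallback content shown when personalization is unavailable.
POPULAR_TITLES = ["Popular Title A", "Popular Title B", "Popular Title C"]


class RecommendationService:
    """Hypothetical stand-in for a personalized recommendations backend."""

    def __init__(self):
        self.alive = True

    def picks_for(self, user_id):
        if not self.alive:
            raise ServiceDown("recommendations unavailable")
        return [f"Personalized pick {n} for {user_id}" for n in (1, 2, 3)]


def chaos_monkey(services, rng, kill_probability=0.5):
    """Randomly mark services as dead and return the ones that were killed."""
    killed = []
    for service in services:
        if rng.random() < kill_probability:
            service.alive = False
            killed.append(service)
    return killed


def homepage_titles(user_id, recommendations):
    """Always respond: degrade to popular titles if recommendations fail."""
    try:
        return recommendations.picks_for(user_id)
    except ServiceDown:
        return POPULAR_TITLES


if __name__ == "__main__":
    service = RecommendationService()
    # kill_probability=1.0 forces the outage so the fallback path is visible.
    chaos_monkey([service], random.Random(), kill_probability=1.0)
    print(homepage_titles("user-42", service))
```

The point of the design is that the caller never needs to know whether the monkey has visited; the worst case is a less personal page, not an error page.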

Every week that went by, we made our system a tiny bit more redundant, because we had to. Despite the ongoing pain, it became clear that Chaos Monkey was actually doing us a big favor by forcing us to become extremely resilient. Not tomorrow, not someday, not at some indeterminate “we’ll get to it eventually” point in the future, but right now where it hurts.

Now, none of this is new news; our problem is long since solved, and the Netflix Tech Blog article I’m referring to was posted last year. I’ve been meaning to write about it, but I’ve been a little busy. Maybe the timing is prophetic; AWS had a huge multi-day outage last week, which took several major websites down, along with a constellation of smaller sites.

Notably absent from that list of affected AWS sites? Netflix.

When you work with the Chaos Monkey, you quickly learn that everything happens for a reason. Except for those things which happen completely randomly. And that’s why, even though it sounds crazy, the best way to avoid failure is to fail constantly.

Guest Post by Jeff Atwood

A Train Wreck Indeed!

How do you f’up the pay-per-view business? You don’t. No need to – it has been one train wreck since 1984. (Full disclosure: In 1984, I started a nationwide satellite-delivered ‘A’ title movie service called ‘The People’s Choice’ alongside Jeff Reiss’s ‘Request Television’ and Scott Kurnit’s ‘Viewers Choice’.) When I was in this business, Bill Mechanic (ex-CEO, Chairman of Fox, Disney, green lit ‘Titanic’) and Barry Diller were at Paramount, Jamie Kellner was at Orion Pictures (he went on to run ‘The WB Network’), Hal Richardson (President at Paramount) was at Disney/Dreamworks, Eric Frankel (President for 26 yrs) and Stanley Solson along with Eddie Blier were at Warner Bros. (close to the reign of Steve Ross, whom I knew well from high school days), Mike Medavoy was at Tri-Star, Ned Nalle at Universal and Andy Kaplan at Sony. Almost all of these people are still around and running their own ships BECAUSE back then, they had some foresight and moxie. They DID agree to let PPV at least try to get off the ground by granting PPV rights to a few nascent, early entrants in the business. At that time, there were only a few addressable homes to see the films.

Since the inception of PPV on the cable landscape, it’s always been a ‘promise’ business at best. Nothing really ever took off or was unbelievably successful (and I am referring to MOVIES, not the WWF, Boxing or the Adult business). Many a business and consulting firm was built around it, hardware made for it, ordering systems invented and manufactured and in the end, most went out of business. Most cable operators didn’t even understand it or what it was supposed to be, what ‘tier’ to put it on and how to promote it. Most felt it would cannibalize their existing cash cow, PAY TV.

It never cannibalized anything because it never got off the ground. No one could agree on a movie PPV ‘window’ (the timing of when a PPV movie should be allowed to be seen and ordered on PPV). Many a conference, discussion group, speech and convention session was had – all futile. Nothing was ever decided. VCRs were blamed as the culprit, then it was the movie studios, then it was theater owners, then it was Pay TV and the ‘exclusivity’ wars of the 90’s. Then the Internet crept upon us all and that was the new Darth Vader. You can’t release a film on PPV too early because it could be copied easily and, even easier, distributed by means of the internet all over the planet (meaning no more duplicating and bicycling cassettes, as if my friends ever did this en masse to begin with). Now, using the Internet, movies would be all over the place, everywhere. Everyone would have a copy. Well? Do we ALL have copies of Avatar? Tootsie? As Good As It Gets? Dirty Harry? A good industry has got to know its limitations! And this one never did!

Now, theater owners are afraid of the 60 day release window. Pahleeese! Just read a few of the articles below.

http://engt.co/lWRaty

http://bit.ly/kXEgRU

http://lat.ms/kiRwwc

Theater owners and the Hollywood creative community are livid about Premium VOD, which they perceive as paving the road to cannibalizing theatrical attendance which would in turn harm a movie’s overall economics, creating a dangerous downward spiral. In addition, there’s concern that if consumers switch to watching movies on the small screen then the creative license implicit in a big screen emphasis will get squeezed. While their concerns MAY be justified, the good news for them is that Premium VOD will be lucky to achieve even minimal success.

Why? The cost is one – $30.00 for a poor film, or a film that has not done well at the theater or is released directly to DVD (or what was once called DVD), is insane. Sorry, justification by babysitter fees and popcorn costs doesn’t cut it. These are niche films. Avatar and other BIG films will never see the light of day through this window. But ‘Cloudy with a Chance of Meatballs’ will (and has already, sort of). Example – first film up is “Just Go With It” starring Adam Sandler and Jennifer Aniston. Ho-hum. Good cast and a flop at the box office for the most part. I’d be pissed if I paid $30.00 for this AND CAN’T EVEN KEEP A COPY IN A DIGITAL LOCKER TO SEE WHEN I WANTED AGAIN? WTF? And frankly that could be one of the keys to making this viable. Give me the ability to KEEP it as if I bought the DVD (keep it in a ‘cloud’ locker) and I might buy a few films – that would help at least justify the cost.

And, as Will Richmond from VideoNuze so aptly points out, “Studios seem to believe that making movies available sooner in the home will attract demand. But the problem is that there are already so many choices for watching movies in the home – pay-TV, Netflix, iTunes, Amazon, Vudu, etc. etc., that it will be very hard to break through the noise, solely with a “sooner” positioning, which is more than offset by a ridiculously high price point. Consumers are savvier than ever; they’ll quickly realize that they can get the same movie for $4-5, a sixth to a seventh the price of Premium VOD, just by waiting a couple more months for it to appear on pay-TV or online VOD.”

So, theater owners who vow to ‘go to war’ are wasting their time and efforts. I guarantee them that the movie studios and cable operators and satellite delivery services will win the war for them. Somehow, these guys think that consumers are not too smart. When are they going to wake up and smell the coffee? When are they going to realize that all of us don’t rush to ‘steal’ digital copies of films, for any number of reasons (e.g., they are 700megs of data AT LEAST, cumbersome to store, less than perfect copies that at times lack subtitles and extras)? They are not MP3’s! Music and movies may both have a digital base as a common denominator but ultimately I’ll listen to Hotel California many more times than I can watch Avatar in my lifetime. And the pirates don’t make a bit of difference except barely on the streets of lower Manhattan or Tokyo where poorly made copies sell for $5.00 until those vendors get caught that day. And they only sell about 30 movies at that point – no MASS market that would ruin a $250m box office in the theaters or in any ancillary market I know of.

Theater owners should rejoice that soon this whole business will be in Netflix’s (or some other digital distributor’s) capable hands and not the studios’. (Apologies to those friends of mine at the studios now – it’s not your fault, it’s just the ‘economics’ to blame and perhaps a few at the top thinking we are still in the DVD/VCR age.) Make the business consumer friendly – give us a copy of what we buy and allow us to watch it whenever we want for the money that we spent. After all, I can do this with new music released, why not new movies released?

Amazon’s EC2 ‘cloud’ outage is just a minor bump in a major right road.

By now you’ve heard about Amazon’s EC2 (Elastic Compute Cloud) cloud service failure, or perhaps felt it. If you use Foursquare or Quora, or read Reddit (among other services or websites), you no doubt felt the impact.

The trouble began on 4.21 at 1:48am PDT. Quora even had a fun ‘down’ message: “We’d point fingers, but we wouldn’t be where we are today without EC2.”

Lew Moorman, chief strategy officer of Rackspace, said it best: “It was the computing equivalent of an airplane crash. It is a major episode with widespread damage.” But airline travel, he noted, “is still safer than traveling in a car”, which is analogous to cloud computing being safer than data centers run by individual companies.

The fact remains, the cloud model is rapidly gaining popularity as a way for companies to outsource computing chores to avoid the costs and headaches of running their own data centers — simply tap in, over the Web, to computer processing and storage without owning the machines or operating software.

Consumers don’t realize that there are a host of sites that base a majority of their ‘up-time’ on cloud services, including Hotmail and Netflix to name just a few. Netflix was not affected by the recent outage because Netflix has taken full advantage of Amazon Web Services’ redundant cloud architecture (which is NOT inexpensive).

Industry analysts said the troubles would prompt many companies to reconsider relying on remote computers beyond their control. And while discussions surrounding that might happen in the next several weeks, in the long term cloud computing will continue and thrive and evolve into what most industry experts and others already know it to be – a necessary and valued component of doing any kind of business or having any sort of web presence on the Internet. The truth is, every day many more companies around the globe experience ‘outages’ that take their services and sometimes their web sites down for hours. Added all together, they add up to far more lost time, money and engineering resources than Amazon’s interruption last week.

This round, the companies that were hit hardest by the Amazon interruption were start-ups who are focused on moving fast in pursuit of growth, and who are less likely to pay for extensive backup and recovery services or secondary redundancy in another data center (or Amazon’s redundant cloud architecture).

One of the things that most people are not aware of is that Amazon’s SLA (service level agreement) is one of the weakest among competing public cloud compute services, even though its uptime is actually very good. Most providers offer 99.99% or better, with many offering 100%, evaluated monthly, with service credit capping at 100% of that monthly bill. Amazon offers 99.95%, evaluated yearly, capping at 10% of that bill, and requires that at least two availability zones within a region be unavailable. Therefore, companies MUST take this into consideration when choosing a vendor, as it relates to what they do on the internet. Taking a secondary, back-up approach can close some of those holes, but it can get mighty expensive. Amazon’s EC2 pricing overall reflects this type of SLA, and ‘human’ support is not included; adding it can give a 10% to 20% uplift to the price, and the service is geared primarily toward the very technically knowledgeable. Amazon is a cloud IaaS-focused (infrastructure-as-a-service) vendor with a very pure vision of highly automated, inexpensive, commodity infrastructure, bought without any commitment to a contract. Amazon is a thought leader; it is extraordinarily innovative, exceptionally agile and very responsive to the market.
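
To put those SLA percentages in perspective, here is a rough back-of-the-envelope calculation in Python. It is only a sketch: the 365-day year, the 30-day month, the $1,000 bill and the function names are my own assumptions for illustration, not terms from any provider's actual agreement.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_pct, period_days):
    """Minutes of downtime a provider can have in a period and still meet its SLA."""
    return (1 - availability_pct / 100) * period_days * MINUTES_PER_DAY


def max_credit(bill, cap_pct):
    """Largest service credit available when credits are capped at cap_pct of the bill."""
    return bill * cap_pct / 100


if __name__ == "__main__":
    # 99.95% evaluated over an assumed 365-day year.
    yearly = allowed_downtime_minutes(99.95, 365)
    # 99.99% evaluated over an assumed 30-day month.
    monthly = allowed_downtime_minutes(99.99, 30)
    print(round(yearly, 1))   # about 262.8 minutes, i.e. roughly 4.4 hours
    print(round(monthly, 1))  # about 4.3 minutes

    # Hypothetical $1,000 bill: credit capped at 10% versus 100%.
    print(max_credit(1000, 10), max_credit(1000, 100))
```

Under these assumptions, a yearly evaluation leaves room for a single outage of several hours without a breach, while a 99.99% monthly SLA is breached after a few minutes.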

That being said, the recent Verizon acquisition of Terremark should put most Tier 1 vendors on their toes, including Amazon. Terremark offers colocation, managed hosting (including utility hosting on its Infinistructure platform), developer-centric public cloud IaaS (vCloud Express) and enterprise-class cloud IaaS (Enterprise Cloud). It is a close VMware partner (VMware is one of its investors), and is generally first to market with VMware-based solutions. It is a certified vCloud Datacenter provider. Some of Terremark’s perceived weak spots can and should now be addressed by the merger between the two service offerings, in particular the added personnel to better deliver on customer service and satisfaction (‘stretched thin’ has been the complaint). Now that it has a substantially bigger war chest from its parent Verizon and Verizon’s exceptional network worldwide (remember UUNET), it can take on and adapt more bleeding edge technologies, which it has done in the past but has not been able to do most recently.

Combinations like this will likely increase in this space over time as other vendors realize that two can be better than one. The devil is always in the details and the trick here is for company cultures to be merged efficiently with a clear and concise plan laid out for both sets of employees. The last thing you need is for internal employees to wonder who is going to be replying to the same RFP (request for proposal) to any particular vendor moving forward. Strong, well thought out details by upper management should avoid these pitfalls for the most part; however, it can be pretty tricky to implement.

Long story short – I’d still bet heavily on the long-term success of this business. It’s a smart, cost efficient and labor efficient business model needed for most start-ups, mid-size and Enterprise clients. The days of sending your IT guys into a cage to update the company’s software with numerous discs and software patches, hoping that it doesn’t disrupt the company’s servers, should be long gone.

So What is a QR Code Really All About?

More and more I am finding and seeing those funny boxy black and white Rorschach-like inkblots pasted on magazines, products and in the streets while walking. So what are these and why are they turning up more often now? Why do we see them and not those bar codes we are used to seeing instead?

Let’s take this in order. First, the U.S. is way behind the rest of the planet (as usual) in using these. Europe and especially Asia have been plastering these all over the place for several years now – we are just catching up.

QR codes have been around since 1994. QR stands for ‘quick response’, and in short a QR code provides a quick link from the physical world to the web. In a way, QR codes are to messaging what Twitter is to SMS texts – a 2nd gen form of information download to you and me. Basically, QR codes deliver and hold MUCH more information than a text message or bar code can. QR codes can hold up to 7,000 digits (both vertical and horizontal). A price barcode can hold only 20 digits (and only in one direction). When a QR code is ‘translated’ or read, it acts as a hyperlink to a website, a Google map, a YouTube video, etc. Google is actually incorporating these into its maps for local businesses http://bit.ly/dY7xFU .

QR codes consolidate what you want to say down into a small graphic which can be read by anyone with a QR reader (typically installed on your mobile phone). Yes, there’s an app for that (http://bit.ly/gsL7q8 ). Once loaded, you launch the app, point your phone’s camera at the QR graphic and the app reads the code and launches the web site that delivers further info, coupons, addresses, etc. QR codes can be used as business cards (or with business cards), on posters, billboards, food products, art and even tattoos and fashion (http://p8tch.com/).

If you want, Kaywa has a QR generator on the web that you can try to make your own QR code patch. It’s pretty easy and quick. http://qrcode.kaywa.com/.

Netflix vs. YouTube and TV on the Net.

Time to delve back into the world of video. Oh, and don’t forget to watch SharkTank on ABC this Friday at 8pm/7pm 🙂

It has taken some time but Netflix and Youtube have each taken their position in the video entertainment world and I get the feeling that Youtube is not too happy about it.

On Youtube you can maybe change the world. On Youtube you can be discovered and help discover the next Justin Bieber. On Youtube, if one of your videos goes viral, you can make tens of thousands of dollars, and if you can replicate the feat of popularity, you can make hundreds of thousands of dollars annually. Those are real commission dollars.

But wait, there is more good from Youtube. Anyone around the world can get Youtube to subsidize the cost of hosting their family/wedding/team/business/class/personal videos. Hopefully perpetually. These are unique, honorable, impactful and expensive roles that Youtube has chosen to undertake.

But if you want to veg out and watch a TV show or movie, the vast majority of people just turn on the TV. About 11mm people turn on Netflix.

The lines of division between Youtube, Netflix and traditional TV have become crystal clear.

Traditional TV is where you get entertainment in real time. Live major sports, the latest movies on VOD, original episodes of your favorite TV shows, all in the highest, no-buffering quality available to your TV. Plus they have smartly opened the door to TV Everywhere and in-home tablet streaming so that there is a pay once, watch anywhere opportunity for their content.

Netflix is where you get streaming access to a growing library of thousands of TV shows and movies, and soon, a smattering of original content as well. Netflix has done an extraordinary job of being available easily on any and every device known to the internet. 11mm or so users (those streaming, not all Netflix users) have happily paid Netflix $7.99 per month for this service and it shows no signs of slowing down.

Youtube is the counter-balance to Netflix and Traditional TV. Youtube is where you know 99pct of what is on the site is pure junk that has no relevance to you. It’s like walking through the bargain bin at Walmart hoping to find something that might interest you, knowing the price is right. Youtube is Community Access Television for the world.

Remember back in the day when Cable had A and B sides of the set top box? You got all the good channels on the A side, and all the community access stuff was on the B side? Youtube is the aggregation of every B side of every cable system in the world. That is not a knock on Youtube. It just ain’t what it ain’t.

The B side of cable was community driven. The B side of cable was an open door for anyone with access to a video camera. The cable company would let you schedule shows and put them on their schedule. Like Youtube, back in the day, there were shows that would break out and create mainstream opportunities.

I can’t help but include this paragraph from the history of Public Access TV in Manhattan:

“Public access has a fundamental PR problem, which one producer summed up with this rhetorical question: “If anybody can do it, who would want to?” I don’t think there is any particular personality type that is drawn to public access; as with anything, it attracts good, bad, and ugly. But these people (each of whom I met by chance through the help of someone else I interviewed) have some things in common. All are creative, and all seem to have a thick skin and a high threshold for frustration. None were paid for their shows. Most actually shelled out their own money for studio time. Three admitted to suffering career setbacks later as a result of appearing on public access. They approached their work in television with a level of intensity and passion that only exists in the realm of avocations and came away with uniquely philosophical perspectives on the nature of television.”

The same thing could easily be said about Youtube producers today. And that is a business problem and social opportunity for Youtube. They have become Community Access for the Internet. That is a brilliant opportunity if you are trying to change the world or create huge communities. That is a huge challenge if you are trying to maximize earnings per share for your parent corporation. People won’t pay a subscription fee for any of it and most of it will never pay for itself with advertising because most of it will never be seen. It is the B side of the content world.

Which is exactly why I believe Youtube is channeling 1998 and gearing up to do quite a bit of live streaming. They don’t like being the third entertainment option. They don’t like being the “b or c side of content”. They are hoping live streaming can change the standings.

Offering everyone in the world the opportunity to stream whatever they want, live to the rest of the world, could actually change the world. But it won’t change the content stratification challenge Youtube is facing now. It won’t change how people see Youtube relative to traditional TV and Netflix.

The reality is that both cable/telco/sat distributors on your TV and Netflix are moving faster in terms of the introduction of technology (TV Everywhere/Remote DVR/iPad and multi-device support) and the introduction of new and original high value content than Youtube. I think Youtube is hoping that live streaming will change that. It will be interesting to see if it does.

Personally, I’m not optimistic. But hey Youtube, call me. I’ve been there, done that and I can help you out.

Guest post by Mark Cuban. 4.12.2011 via blogmaverick.com