Coradiant

Archive for August, 2007

How Microsoft broke Skype by accident


Monday, August 20th, 2007 Posted by: Alistair Croll

Skype broke.

This should serve as a lesson to us all. Sometimes the old ways are the best, and we ignore them at our peril.

The folks at Skype said:

On Thursday, 16th August 2007, the Skype peer-to-peer network became unstable and suffered a critical disruption. The disruption was triggered by a massive restart of our users’ computers across the globe within a very short timeframe as they re-booted after receiving a routine set of patches through Windows Update.

Yep, that’s right. Microsoft sent out a patch, and it brought down Skype.

TCP is a great example of a simple, elegant implementation. Yes, TCP is bursting at the seams: it doesn’t support enough ports; it’s a jack-of-all-trades transport that isn’t particularly efficient; it requires a lot of computation; and it’s redundant when paired with a lot of encryption and compression systems. Companies like Netli (acquired by Akamai) built businesses on the inefficiency of TCP. Making TCP efficient is a major factor in how Application Front End products (like Citrix’s NetScaler) speed up sites and reduce the load on servers.

But TCP is elegant. One of the things it does best is recover from problems. Wikipedia tells us:

“Modern implementations of TCP contain four intertwined algorithms: Slow-start, congestion avoidance, fast retransmit, and fast recovery (RFC2581).”

Ethernet does this well, too. When a collision occurs, senders keep transmitting long enough to make sure everyone heard the collision, then back off for a random length of time. From Wikipedia, again:

“This can be likened to what happens at a dinner party, where all the guests talk to each other through a common medium (the air). Before speaking, each guest politely waits for the current speaker to finish. If two guests start speaking at the same time, both stop and wait for short, random periods of time (in Ethernet, this time is generally measured in microseconds). The hope is that by each choosing a random period of time, both guests will not choose the same time to try to speak again, thus avoiding another collision. Exponentially increasing back-off times (determined using the truncated binary exponential backoff algorithm) are used when there is more than one failed attempt to transmit.”

Think about that for a second. The guys who built these protocols realized that congestion would happen, and built models for dealing with unpredictable situations by backing off a random time, and for detecting congestion and avoiding it. And this was back in the day when there were only a few nodes on the Internet. Yet they function reasonably well even today.
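To make the back-off idea concrete, here’s a rough Python sketch of truncated binary exponential backoff as the Wikipedia passage describes it. Treat it as an illustration under stated assumptions, not a reference implementation: the function names are invented for this example, the 51.2-microsecond slot time is the figure for classic 10 Mbit/s Ethernet, and the cap of 10 doublings is the commonly cited limit.

```python
import random

SLOT_TIME_US = 51.2  # assumed slot time for 10 Mbit/s Ethernet, in microseconds
MAX_EXPONENT = 10    # the "truncated" part: the random range stops doubling here


def backoff_slots(collisions, rng=random):
    """Pick a random wait, in slot times, after the given number of collisions.

    The range of possible waits doubles with each successive collision,
    up to the cap set by MAX_EXPONENT.
    """
    exponent = min(collisions, MAX_EXPONENT)
    return rng.randint(0, 2 ** exponent - 1)  # randint includes both endpoints


def backoff_delay_us(collisions, rng=random):
    """Convert the random slot count into a delay in microseconds."""
    return backoff_slots(collisions, rng) * SLOT_TIME_US


if __name__ == "__main__":
    # Each run prints different delays; that randomness is the whole point.
    for attempt in range(1, 6):
        print(attempt, backoff_delay_us(attempt))
```

Because every sender draws its own random wait from a growing range, two senders that collided once are less and less likely to collide again on each retry.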

So why didn’t Skype work properly? Without getting into too many details, the folks at Skype explained:

Normally Skype’s peer-to-peer network has an inbuilt ability to self-heal, however, this event revealed a previously unseen software bug within the network resource allocation algorithm which prevented the self-healing function from working quickly.

There are two important lessons to be learned here:

  • First, it’s critical to look at traffic volumes. Many of the people who buy our UPM equipment used to rely on synthetic testing to monitor their sites. Often, they couldn’t answer simple questions like, “how many users do you have on your site today?” Their marketing department might know, through web analytics tags, how many sessions were active; but there was no way to stitch together traffic levels and performance.
  • And second, the Skype incident is a great example of how complex systems can fail in unexpected ways, and how everything on the Internet is intertwingled. Microsoft’s practice of updating and automatically rebooting huge numbers of computers independent of owner control creates tremendous traffic spikes, and the same is true of other web-connected services such as antivirus updates and desktop plug-ins. But the impact of these spikes isn’t tracked or understood.

Understanding the relationship between load and performance is critical for anyone running a production web application. Applications will break; and without the right information at your disposal, you won’t be able to detect problems or fix them effectively.
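As a rough illustration of stitching traffic levels and performance together, the Python sketch below groups requests into one-minute buckets and reports the request count next to the mean latency for each bucket. The input format (timestamp in seconds, latency in milliseconds) and the function name are assumptions made for this example; real monitoring data will look different.

```python
from collections import defaultdict
from statistics import mean


def load_vs_latency(requests, bucket_seconds=60):
    """Group (timestamp_s, latency_ms) pairs into fixed-size time buckets.

    Returns a list of (bucket_start, request_count, mean_latency_ms),
    sorted by bucket start time.
    """
    buckets = defaultdict(list)
    for timestamp, latency in requests:
        start = int(timestamp // bucket_seconds) * bucket_seconds
        buckets[start].append(latency)
    return [(start, len(vals), mean(vals)) for start, vals in sorted(buckets.items())]


if __name__ == "__main__":
    # Made-up sample data: the busiest minute is also the slowest one.
    sample = [(0, 200), (10, 200), (65, 400), (70, 600), (80, 500), (130, 300)]
    for row in load_vs_latency(sample):
        print(row)
```

With load and latency side by side, a question like “did we slow down because we got busy?” becomes something you can check rather than guess at.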

With billions of nodes on the Internet and millions of changes a day to production systems, Sod’s Law (a variant of Murphy’s Law) is definitely true: “Anything that can go wrong, will.” But it’s also possible to invoke Hanlon’s razor, a corollary to Murphy, that says, “Never assume malice when stupidity will suffice.”

Why movies teach us bad things about IT tools


Monday, August 6th, 2007 Posted by: Alistair Croll

I watched the Bourne trilogy this weekend.

I have to confess that I love the series. One of the things I most admire about it is that the hero actually thinks. I mean, in the first film, he grabs a radio off an opponent, rips a floor map off a wall, and uses that to evade capture and get out of the building. Sure, the films have some crazy car chases (which, by the way, result in a lot of accidents — how unusual!) And there are flight scenes and explosions. But they’re always reasonable.

It’s sad that I’m so impressed by someone acting wisely and normally. As I thought more about it, it occurred to me that Hollywood fills films with convenience. They do this so much that cleverness and pragmatism are refreshing. We’re so used to the MacGuffin that when there isn’t one, we’re actually pleasantly surprised.

Peter’s Evil Overlord List is a great, and growing, list of silly conceits from movies. It does a better job than I can of making my point. Some examples:

  1. My Legions of Terror will have helmets with clear plexiglass visors, not face-concealing ones.
  2. My ventilation ducts will be too small to crawl through.
  3. Shooting is not too good for my enemies.
  4. One of my advisors will be an average five-year-old child. Any flaws in my plan that he is able to spot will be corrected before implementation.
  5. No matter how well it would perform, I will never construct any sort of machinery which is completely indestructible except for one small and virtually inaccessible vulnerable spot.
  6. I will never build only one of anything important. All important systems will have redundant control panels and power supplies.
  7. For the same reason I will always carry at least two fully loaded weapons at all times.
  8. Once my power is secure, I will destroy all those pesky time-travel devices.
  9. If I have massive computer systems, I will take at least as many precautions as a small business and include things such as virus-scans and firewalls.
  10. No matter how many shorts we have in the system, my guards will be instructed to treat every surveillance camera malfunction as a full-scale emergency.

So what does this have to do with IT? Well, often demos are so convenient they lull buyers into a false sense of security. We want to accept the convenient explanations, because they make things simple.

I remember movies from the seventies in which the bad guy locked Our Hero in the sauna, hoping he’d steam to death.

Oh, come on. How many saunas have doors that lock from the outside?

For that matter, how many data centers have big “self destruct” buttons, clearly marked? How many security guards have nametags without photos on them? How many times is the back door to the secret lair conveniently ajar? None of these things happen in the real world; but they happen in movies, and we accept them. Software demos do the same thing. We see a demo, and it looks fine. We want to believe it can save us. We’re willing to accept the coincidences. Salvation is real and imminent.

But reality is a lot more bleak. The tools are seldom as straightforward as they were in the demo. In our field — user performance management for online applications — there are plenty of examples of how things in the real world aren’t nearly as convenient.

Here’s my list of ten differences between the demo and the real world for web monitoring technologies.

1. There’s always a security problem

Whenever you try to deploy new software, there are always security issues. Applications require ports for communication, and have to be tested by the security department. Capturing user data means compliance and oversight; depending on your industry, you may have to store it for seven years. And physical devices may be subject to attacks or may run an unsupported operating system. Good, secure tools that work out of the box without annoying your security officer are worth their weight in gold.

2. URIs aren’t sensible

Sites don’t always have easy-to-read names. Sure, Wikipedia might have http://en.wikipedia.org/wiki/Evil_Overlord_List as a URL that’s pretty easy to parse. But more often than not, it’ll be something like http://www.ifaw.org/ifaw/general/default.aspx?splash&oid=17767 (which, by the way, is the home page for the International Fund for Animal Welfare — but you wouldn’t know it from the URL.) Assume that for something to be useful, it has to be flexible enough to accommodate the quirks of your site’s structure.
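One way to picture that flexibility is a small table of rules that maps ugly URLs onto readable page names. The Python sketch below is only an illustration: the rule format, the `PAGE_RULES` table, and the `page_name` function are all hypothetical, and the two rules simply cover the two URLs mentioned above.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Hypothetical rules: a regex for the path, query parameters that must be
# present with a given value, and a readable name for the page.
PAGE_RULES = [
    {"path": r"^/ifaw/general/default\.aspx$", "query": {"oid": "17767"},
     "name": "IFAW home page"},
    {"path": r"^/wiki/(?P<title>[^/]+)$", "query": {},
     "name": "Wiki article: {title}"},
]


def page_name(url, rules=PAGE_RULES):
    """Return a readable name for a URL, falling back to its raw path."""
    parts = urlsplit(url)
    # keep_blank_values keeps valueless parameters such as "splash".
    query = parse_qs(parts.query, keep_blank_values=True)
    for rule in rules:
        match = re.search(rule["path"], parts.path)
        if not match:
            continue
        if all(value in query.get(key, []) for key, value in rule["query"].items()):
            return rule["name"].format(**match.groupdict())
    return parts.path or "/"


if __name__ == "__main__":
    print(page_name("http://www.ifaw.org/ifaw/general/default.aspx?splash&oid=17767"))
    print(page_name("http://en.wikipedia.org/wiki/Evil_Overlord_List"))
```

The point of the design is that the naming lives in data, not code, so it can bend to whatever structure a particular site happens to have.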

3. The things you’re testing change

Nothing is static. We have customers whose websites’ code changes daily. For them, a simple test isn’t really relevant; it’s useful for a day. If a key function is a constantly moving target, make sure your tools can stick to that target like glue. Otherwise, when something breaks you’ll be looking at yesterday’s data. Ask yourself whether a tool can adapt quickly to changes in the site.

4. All functions aren’t equal

The typical website has dozens of functions, from login to reporting to search to account management. We don’t expect all of them to take the same time. Logging in should be relatively quick; but generating a detailed report could take a while. And we’re okay with that. Unfortunately, performance measurement isn’t. Most web performance tools have a “one size fits all” approach to thresholding. This means that you’re either flooded with false alarms (which you’ll turn off) or missing important ones. Does the monitoring technology recognize the context of a function and a user, and automatically adjust to different functions?
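The alternative to “one size fits all” can be as simple as a threshold per function. The sketch below shows the idea in Python; the function names and the numbers are invented for illustration, and a real tool would presumably learn or tune these limits rather than hard-code them.

```python
DEFAULT_THRESHOLD_S = 2.0  # assumed fallback for functions without their own limit

# Hypothetical per-function limits, in seconds.
THRESHOLDS_S = {
    "login": 1.0,    # should be quick
    "search": 2.0,
    "report": 30.0,  # a detailed report is allowed to take a while
}


def is_too_slow(function, elapsed_s, thresholds=THRESHOLDS_S):
    """Return True if a request took longer than the limit for its function."""
    return elapsed_s > thresholds.get(function, DEFAULT_THRESHOLD_S)


if __name__ == "__main__":
    # The same three-second response is a problem for login but not for a report.
    print(is_too_slow("login", 3.0))
    print(is_too_slow("report", 3.0))
```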

5. Every site breaks in its own special ways

I used to have a bounty for broken sites. Over the years, people have sent me hundreds of screenshots of applications breaking in new and unexpected ways. (To this day, one of my favourites is http://www.starwars.com/welcome/404.html.) Some sites try to hide their errors behind polite apologies. Others give detailed error information on the page. Some errors don’t even produce data: a premature server reset or excessive TCP retransmissions, for example, happen outside the realm of HTTP, but they’re still a problem. What if your site breaks in ways that aren’t in the demo you’re seeing?
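Errors hidden behind polite apologies are a good example of why status codes alone aren’t enough. Here’s a minimal Python sketch that also looks at the page content. The phrases in `APOLOGY_PATTERNS` are assumptions chosen for illustration; every site apologizes in its own words, so a real list would have to be built per application. Note that this only covers errors visible in HTTP, not the network-level failures mentioned above.

```python
import re

# Hypothetical phrases a site might use to apologize instead of returning
# a proper error code.
APOLOGY_PATTERNS = [
    r"we['’]re sorry",
    r"temporarily unavailable",
    r"an error (?:has )?occurred",
]
_APOLOGY_RE = re.compile("|".join(APOLOGY_PATTERNS), re.IGNORECASE)


def classify_response(status, body):
    """Classify an HTTP response by status code first, then by page content."""
    if status >= 500:
        return "server error"
    if status >= 400:
        return "client error"
    if _APOLOGY_RE.search(body):
        return "soft error"  # looks fine to HTTP, broken to the user
    return "ok"


if __name__ == "__main__":
    print(classify_response(200, "We're sorry, something went wrong."))
```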

6. No matter what reports you’ve got, you don’t have the right one

You can never tell what you’re going to need to look at. Sure, it might be useful to see which server is busiest, which browser is slowest, or which page has the most errors. But sooner or later you’re going to get a “complicated” question: “Are Firefox browsers from China who search by zipcode generating more errors?” (Seriously, one of our customers needed to know this.) If the tool can only slice data in predefined ways, you’re going to be stuck guessing. How flexibly can you focus the analysis of the tool on specific segments of traffic? Can you drill into it?
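To show what slicing data in ways that weren’t predefined might look like, here’s a small Python sketch that computes an error rate for any combination of attributes. The record fields (`browser`, `country`, `search_type`, `status`) and the rule that a status of 400 or higher counts as an error are assumptions made for this example.

```python
def error_rate(hits, **criteria):
    """Error rate among hits that match every key=value pair in criteria.

    Returns None when no hits match, so an empty segment isn't
    mistaken for a healthy one.
    """
    segment = [h for h in hits if all(h.get(k) == v for k, v in criteria.items())]
    if not segment:
        return None
    errors = sum(1 for h in segment if h["status"] >= 400)
    return errors / len(segment)


if __name__ == "__main__":
    # Made-up records standing in for captured traffic.
    hits = [
        {"browser": "Firefox", "country": "CN", "search_type": "zipcode", "status": 500},
        {"browser": "Firefox", "country": "CN", "search_type": "zipcode", "status": 200},
        {"browser": "Firefox", "country": "US", "search_type": "zipcode", "status": 200},
        {"browser": "IE", "country": "CN", "search_type": "keyword", "status": 200},
    ]
    print(error_rate(hits, browser="Firefox", country="CN", search_type="zipcode"))
    print(error_rate(hits))  # baseline across all traffic
```

Comparing the segment’s rate with the baseline is what answers the “more errors?” part of the question.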

7. The installation of agents always has issues

The software agent is the IT equivalent of a dentist saying, “trust me, this won’t hurt a bit.” Agents need management and updating. They have to transmit data, and present points of attack. They’re silent when the servers they run on are broken. They generate network traffic. And they’re sandboxed, trapped within the environment on which they run.

Sure, agent-based monitoring is a necessary evil. But it should be used judiciously, and you need to deploy agents with a recognition that things won’t be as rosy as they sound. You’ll have to lobby for their deployment. You’re going to jump through hoops to get them communicating with your management systems. When you’re looking at a demo that has complete visibility, spend a lot of time on the organizational cost of that visibility.

8. Editing tags has hidden costs and limited visibility

The web alternative to agents is tags. These embedded pieces of JavaScript provide some monitoring by asking the browser to report on performance and errors. JavaScript and tagging are a big headache. For marketing departments, tagging is an invaluable tool, but Gartner claims that maintaining tags and scripts is the biggest downside to web analytics.

Using tags for monitoring sounds easy in principle. In practice, however, it’s fraught with peril. JavaScript collection assumes that the page loaded properly (otherwise, how did you get the JavaScript?). It also assumes that the client will run the script (which isn’t the case for many phones, for non-HTML content, and for users with privacy settings turned on). And the client is sandboxed: for security reasons, the JavaScript on the client doesn’t have access to the networking stack or facts about the network. What’s worse, the act of including JavaScript can often slow down the page load time. Consider the organizational cost and the amount of technical information you’ll get when things go wrong.

9. Users don’t follow simple paths

Most e-commerce sites like to think they have simple transactions. Users put things in a cart, check out, pay for their goods, and confirm the shipping address. The reality is, users don’t follow prescribed routes. They meander around the site, going backwards and forwards, opening new tabs, changing their minds. For IT operations, what matters more is the health of key steps in a process, and which users encountered problems at those steps. Don’t assume users will do what you expect.
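Here’s a rough Python sketch of tracking the health of key steps without assuming any particular route through the site. The step names, the session format, and the `step_health` function are all hypothetical; the only idea being illustrated is that each step is judged on its own, in whatever order users happen to reach it.

```python
from collections import Counter

KEY_STEPS = ["cart", "checkout", "payment", "confirm"]  # hypothetical step names


def step_health(sessions, key_steps=KEY_STEPS):
    """Summarize key steps across sessions, ignoring the order of visits.

    sessions maps a user to a list of (page, ok) visits. For each key step,
    report how many users reached it and which users hit a problem there.
    """
    reached = Counter()
    failed = {step: set() for step in key_steps}
    for user, visits in sessions.items():
        seen = set()
        for page, ok in visits:
            if page not in failed:
                continue  # not a key step; users wander, and that's fine
            seen.add(page)
            if not ok:
                failed[page].add(user)
        reached.update(seen)  # count each user once per step
    return {
        step: {"reached": reached[step], "failed_users": sorted(failed[step])}
        for step in key_steps
    }


if __name__ == "__main__":
    sessions = {
        "alice": [("cart", True), ("search", True), ("cart", True),
                  ("checkout", True), ("payment", False), ("payment", True),
                  ("confirm", True)],
        "bob": [("checkout", True), ("cart", True), ("help", True)],
    }
    for step, summary in step_health(sessions).items():
        print(step, summary)
```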

10. It’s always expensive to run things

Study after study has shown that the real cost of IT is operational. Eric Dean, CIO of United Airlines, told Forbes that for every dollar he spends on a package, he must spend $5 to $7 more on consulting to make it work. Network Appliance estimates that for every dollar of storage, users spend $5 to $7 to manage it (though their tools claim to get that down to $2 to $3, partly due to their appliance focus). And the Seybold Group estimates that with even standard packaged software, for every dollar spent on software a company spends $5 on consulting, systems integration, and custom programming. So when you’re seeing an IT offering, ask yourself: How much will this cost to run? Will it take care of itself?

Back to the real world

Demos often feature nice, simple sites where users are well behaved, installation is assumed, reports show the right data, and security’s not an issue. That’s the IT sales equivalent of the hero defusing the bomb with two seconds left, then finding an escape pod. It’d be nice, but it’s no way to run a business.

Next time you’re evaluating IT tools, think of the cheap tricks that movies pull to conveniently move the plot along. Then think about how much of what you’re seeing is conveniently tweaked for an ideal story.

We used to run websites, so when we started making tools for web operators, we vowed never to make things that looked better in the demo. In fact, we don’t have demo boxes. We have production units that prospective customers buy. They nearly never come back. We don’t really believe in demos: If the product is going to be useful, you should be using it from day one.

In short: If you can’t get results from it the day you plug it in, it’s probably not going to get used once you sign the check.

I’m going to finish this off with a joke, even though you’ve probably heard it and I may have already given away the punchline.

A software salesperson is killed trying to save a schoolbus full of orphans. St. Peter says, “I’m a little unsure what to do. On the one hand, you gave your life so others could live. On the other hand, you sold software that promised far more than it could actually deliver in the real world. So I don’t know whether you go to heaven or hell.”

The salesperson replies, “Well, what’s the difference between the two?”

St. Peter answers, “I’m willing to let you visit both places briefly, if it will help your decision.”

First, St. Peter sends the salesperson to hell. And it’s beautiful! Sunny, clear, with attractive people enjoying delicious food, frolicking in the ocean.

“This is great!” says the salesperson. “If this is hell, I really want to see heaven!”

St. Peter snaps his fingers and they’re in heaven. It’s high above fluffy clouds, with angels singing and playing soft, Enya-like music.

The salesperson thinks for a minute, then says, “I guess I’ll take hell.”

Two weeks later, St. Peter decided to see how his charge was doing. When he got there, he found the poor salesperson in chains, hair singed off, screaming as he was tormented by fireball-tossing imps and succubi.

“How’s it working out?” he asked.

The salesperson sobbed, “this is nothing like the hell I visited two weeks ago! What happened?”

“Oh, I’m sorry,” said St. Peter. “That was the demo.”

I guess the moral of the story is, there’s no substitute for seeing the real thing.

Don’t underestimate the importance of products that do what they say they do, well, the day you get them.