Have we solved the availability problem?
Tuesday, October 10th, 2006 Posted by: Alistair Croll
A few years ago, before the days of TrueSight, Coradiant used to be in the managed services business. This gave us lots of hands-on experience and a good deal of opportunity to see vendors from the customer’s perspective. We try to put that background into our products — lots of debates around technical issues are resolved with a firm, “would we have bought it if it did that?”
But not everything’s the same.
One of the things that’s changed a lot since then is the emphasis on availability and performance. Back then, nobody cared about performance; they were worried mainly about uptime. We hadn’t yet figured out how to make the world reliable.
- People spent money on big, reliable servers. Then Google showed us how to make millions of cheap PCs run well.
- Websites used to have dozens of redundant networks to overcome peering congestion. Then Internap raised the bar, and peering got better.
- Load-balancers argued over which algorithms to use to detect outages. Then they got good at it, adopted best practices, and got in the middle of connections where they could see problems first-hand.
- Browsers broke. Then the innovation turned to standardization and even two-year-old clients could handle JavaScript properly.
- Monolithic servers were expected to do everything. Then the three-tiered model abstracted presentation, processing, and storage. (I actually found an old study I wrote in 1999 on “the emerging 3-tier model” of computing. Things weren’t always this way!)
- But most importantly, we stopped thinking about device availability and started thinking about system availability. Web operators don’t care about the failure of a single server these days (although it’s a great way to get a free lunch from a supplier). They watch the overall system.
With this change has come a major rethinking of priorities. Now that systems are highly available—despite their notoriously unreliable components—people have turned their attention to performance.
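To put a rough number on how unreliable parts can add up to a reliable whole, here is a minimal sketch. It assumes that failures are independent and that the system stays up as long as at least one replica is up; real outages are often correlated, so treat the result as an optimistic upper bound. The function name and the 90% figure are illustrative only.

```python
def system_availability(component_availability: float, replicas: int) -> float:
    """Availability of a pool that is up while at least one replica is up.

    Assumption (not from the article): replicas fail independently.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("component_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The pool is down only when every replica is down at the same time.
    probability_all_down = (1.0 - component_availability) ** replicas
    return 1.0 - probability_all_down


if __name__ == "__main__":
    # Hypothetical cheap servers that are each up only 90% of the time.
    for n in (1, 2, 3, 5):
        print(n, round(system_availability(0.90, n), 5))
```

Under these assumptions, three servers that are each up only 90% of the time give a pool that is up about 99.9% of the time.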
I’m not talking about the old “eight second” rule of product purchases. Frankly, if Kayak or Overstock or Lendingtree or Brassring is a little slow today, I’ll wait: I know they’re good for it. I have accounts there. The cost of switching is high.
What I am talking about is the impact of performance on everything from call center volumes, to lost productivity, to spontaneous coffee breaks, to adoption failures. As more and more companies focus on performance, they unearth skeletons in their closets.
One of our customers described a situation where performance was unbelievably slow for users at one office in Asia Pacific. On closer analysis, the users there were connecting to a U.S. proxy, completely bypassing the one in their branch office. The Internet was available to them—but unbelievably slow. And by switching them to the local proxy, the company avoided upgrades of around $60,000 a month.
Back in our days as an MSP, we used to build reports of availability. We used to alert and alarm on it. But today, when I talk to customers, few of them are concerned with availability and uptime. They want the forensics on a failure, to be sure (to wave at the aforementioned vendor while ordering the Baked Alaska). But most of the interest we get is in performance and traffic-level analysis.
In particular, we get questions about second-order analysis. “Don’t just tell me how slow it was,” said one customer last month, “tell me how many users were dissatisfied by the performance.” Similarly, people ask me, “Can you show me whether this level of latency is normal for this level of load?” Answering these questions is far more meaningful. Customer support and capacity planning can use the answers to really tackle some fundamental issues they face. The answers involve quite a lot of computation, and they require that we measure performance, availability, and traffic levels.
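As a rough illustration of what those two questions look like as computation, here is a minimal sketch. It is not how any particular product works: the function names, the 8-second dissatisfaction threshold, the load bucket size, and the 1.5x tolerance are all assumptions chosen for the example.

```python
from collections import defaultdict
from statistics import median


def dissatisfied_users(requests, threshold_s):
    """Return the set of users who saw at least one request slower than threshold_s.

    requests: iterable of (user_id, latency_seconds) pairs (hypothetical format).
    """
    return {user for user, latency in requests if latency > threshold_s}


def latency_baseline(history, bucket_size):
    """Median latency per load bucket, from (load, latency_seconds) samples."""
    buckets = defaultdict(list)
    for load, latency in history:
        buckets[int(load // bucket_size)].append(latency)
    return {bucket: median(values) for bucket, values in buckets.items()}


def is_normal(load, latency, baseline, bucket_size, tolerance=1.5):
    """True if latency is within tolerance x the median seen at similar load.

    Returns None when there is no history for that load level.
    """
    bucket = int(load // bucket_size)
    if bucket not in baseline:
        return None
    return latency <= tolerance * baseline[bucket]


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    requests = [("a", 0.5), ("b", 9.0), ("b", 1.0), ("c", 8.5)]
    print(len(dissatisfied_users(requests, threshold_s=8.0)), "dissatisfied users")

    history = [(120, 1.0), (150, 1.2), (180, 1.4), (320, 3.0)]
    baseline = latency_baseline(history, bucket_size=100)
    print(is_normal(160, 2.5, baseline, bucket_size=100))
```

Counting distinct users rather than slow requests is the point of the first question: one user with a hundred slow pages is one unhappy person, not a hundred. The second function compares a measurement only against history taken at a similar load, which is why traffic levels have to be measured alongside latency.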
I think we may have solved the availability problem. But the performance problems are just beginning.
