
Coradiant

Blog

Interop New York

October 25th, 2007 Posted by: Alistair Croll

I’m starting day two of the Data Center Summit at Interop, back-to-back with user group events in New York and Boston this week. The first day was a very interesting set of topics:

  • Steve Shah of Risingedge presented a session on the “State of the cage.” This fascinating presentation looked at the evolution from mainframes to clustered computers, and from local procedure calls to intra-data-center delays.
  • A panel of folks including Michael Baum (CEO of Splunk), James Sayles (CCO of Ecora) and Michael Weider (CTO of IBM’s recently-acquired Watchfire) discussed issues of compliance and privacy in data centers.
  • John Carton, formerly at Accenture and now the Senior Director of Web Services at Nature’s Bounty, presented a model for thinking about disaster recovery in data centers.

There was lots to learn from the sessions. Steve pointed out that with the adoption of service-oriented architectures, back-end procedure calls that used to take microseconds now take milliseconds, and that virtualization will make this worse since developers don’t know whether an often-reached machine is local or remote. Michael observed that compliance, which tells people to keep everything, conflicts with privacy, which says that more data makes a breach more risky. And John suggested that companies need to evaluate not only availability, but also how much data they can afford to lose, when setting recovery policies for data centers.

Today’s tracks look at the range of data center models (from on-demand to full colocation with a content delivery network); the issues of power, cooling, and efficiency in greening modern data centers; and automating changes within data center environments.

A quick sidenote: Our San Diego headquarters was a busy place this week, with the wildfires consuming a huge swath of the Southwest. While many of our employees were evacuated, nobody was hurt; and with offices in Boston and Montreal, Canada, we were able to handle order shipments and provide continuous support to our customers worldwide despite the crisis. Thanks to everyone involved in keeping things running.

Green code

October 18th, 2007 Posted by: Alistair Croll

I recently wrote a blog for GigaOm’s Earth2Tech site on “Green Code.” The idea is that the quality of code matters. Two coders, writing code for the same application, can have a tremendous difference in efficiency. And that can translate to big differences in power consumption and resource costs — particularly in a virtualized or on-demand environment.

Over here on the Coradiant blog, I can speculate a bit more specifically about what this means. One of the interesting things you can do with user experience is to measure the total processing involved in a page or a user visit.

Because much of the delay on the Internet comes from network performance, two applications with significantly different host efficiency might seem as fast as one another to an end user, so you can’t really measure this just by trying two sites.

But the precision of Real User Monitoring technologies makes even millisecond differences in host processing time clear. And while web operators usually look at average (or percentile) host time, one of the more unusual ways to measure host time is to sum it. This effectively shows you the “total thinking done” for a user’s session.

This can be the start of some pretty fascinating math. Once you know host time per session, you can see how many host-seconds your infrastructure devotes to a visitor. This can show you things like whether a certain class of users is consuming more than its fair share of “heavy” searches.
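As a rough sketch of the "sum it" idea, assuming each monitored request is available as a (session ID, host time in seconds) pair. The record layout and the sample values here are hypothetical, not an actual export format:

```python
from collections import defaultdict

# Hypothetical per-request records: (session_id, host_time_seconds).
hits = [
    ("s1", 0.120),
    ("s1", 0.080),
    ("s2", 1.500),
    ("s1", 0.050),
    ("s2", 2.250),
]


def host_seconds_per_session(records):
    """Sum host processing time for each session ("total thinking done")."""
    totals = defaultdict(float)
    for session_id, host_time in records:
        totals[session_id] += host_time
    return dict(totals)


totals = host_seconds_per_session(hits)
# The session that consumed the most host time overall.
heaviest = max(totals, key=totals.get)
```

Sorting sessions by this total, rather than by average host time, is what surfaces the visitors who consume a disproportionate share of processing.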

(Incidentally, on the Coradiant.com site, this often reveals blog spammers from China posting comments about their various vitamins, and more questionable offerings.)

But you can also tie this host time back to IT costs.

I’m teaching a course on data center growth as part of Interop’s Data Center Summit in New York next week (more on this in a later post). In preparing for that session, I spent a lot of time looking at the cost models behind on-demand hosting, managed servers, colocation, and global CDNs. And it made me realize there are good ways to model IT costs that vary widely according to each business.

Let’s look at combining these two metrics (host time and IT costs) to better understand the business impact of IT.

If you have a good model for IT costs (such as colocation, power, cooling, and storage) and you divide your monthly IT costs by the sum of host time for the month, you know your IT-cost-per-host-second. You don’t want to include bandwidth costs, which aren’t related to the host time.

If you then multiply host-seconds for each user session by that IT cost, you can calculate how much each user session costs you.
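The arithmetic is simple enough to sketch. The dollar and host-second figures below are made up purely for illustration, and the function names are not from any product:

```python
def cost_per_host_second(monthly_it_cost, total_host_seconds):
    """Monthly IT cost (excluding bandwidth) divided by total host time."""
    if total_host_seconds <= 0:
        raise ValueError("total_host_seconds must be positive")
    return monthly_it_cost / total_host_seconds


def session_cost(session_host_seconds, rate):
    """Cost of one user session at a given IT-cost-per-host-second."""
    return session_host_seconds * rate


# Illustrative numbers only.
monthly_it_cost = 50_000.0        # colocation, power, cooling, storage
total_host_seconds = 2_000_000.0  # sum of host time for the month

rate = cost_per_host_second(monthly_it_cost, total_host_seconds)
cost = session_cost(3.75, rate)   # a session that used 3.75 host-seconds
```

With these assumed inputs the rate works out to 2.5 cents per host-second, so the 3.75-second session costs a little over 9 cents.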

This is an excellent basis for evaluating change across releases. It will reflect increased costs in hosting (such as the introduction of an application accelerator), reductions in delay (such as a drop in host time from the AFE’s application acceleration functions reducing the load on servers), and even changes in pages per session.

You can actually report average IT cost per user session.

As a result, you’ll now know the actual impact of that deployment: Did the reduction in IT-cost-per-host-second outweigh the investment in the AFE? How many weeks did it take to pay the cost back? Is the additional site navigation costing us more?
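The payback question can be sketched the same way. This assumes you already have a weekly IT cost before and after the deployment; all figures are hypothetical:

```python
import math


def payback_weeks(investment, weekly_cost_before, weekly_cost_after):
    """Whole weeks needed for weekly savings to cover a one-time investment.

    Returns None when the change produced no saving, so it never pays back.
    """
    weekly_saving = weekly_cost_before - weekly_cost_after
    if weekly_saving <= 0:
        return None
    return math.ceil(investment / weekly_saving)


# Illustrative figures: a 30,000 investment that trims weekly IT cost
# from 12,500 to 10,000.
weeks = payback_weeks(30_000, 12_500, 10_000)
```

Here the weekly saving is 2,500, so the investment is recovered in 12 weeks.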

Of course, there are many other benefits to reducing host time, from user satisfaction to increased capacity to reduced SLA refunds. But this idea of IT-cost-per-host-second is a nice, concrete way to think about what code changes or other modifications to your operations do to your business.

Now back to the fascinating sessions at Web 2.0.

Web2Summit, Day One

October 17th, 2007 Posted by: Alistair Croll

I’m in San Francisco for three days of Web discussions. The Web2 series is always interesting, and offers a good look at what might happen in the future. I just attended a presentation on eBay and open services. The presenter compared eBay’s decision to open its APIs to developers to AT&T’s decision to allow third-party devices to connect to the telephone network (the presentation will be available at http://innovation.ebay.com/). In both cases, the openness led to tremendous advantage.

  • For US phone users, billions of dollars in revenue — from answering machines to faxes to modems — were added to the economy. One could even argue that the addition of all these components paved the way for today’s Internet. Imagine if we had to get the telco-approved home router and what that would do to stifle innovation.
  • For eBay, the market was already building a number of tools for modifying and optimizing both the buying and selling process. At the time, this was achieved through screen-scraping: Pulling down pages and extracting data from their HTML. Not only was this inefficient and error-prone, but every time eBay changed its site, this broke the applications.

Some eye-opening mash-ups — including the combination of Craig’s List housing properties and Google Maps — prompted the folks at eBay to open and document the interfaces to their user base. The results were impressive: Over 55 percent of listings on eBay are submitted by their APIs rather than the traditional eBay web application. That’s billions of dollars in transactions over non-human-web interactions.

The idea of openness is one we spend a lot of time working on at Coradiant. We have a wide range of APIs, from legacy protocols such as SNMP (used by practically every Enterprise Management Software package) to more cutting-edge interfaces like real-time streams of user traffic that can be visualized in interesting ways through browsers or desktop applications.

Our openness has been a deciding factor in many of our customers’ decision to buy TrueSight. Of course, the main focus of our Real User Monitoring appliances is their own interfaces, which operators use to troubleshoot and optimize web apps. But a secondary use is delivering real user data to other destinations. Our ability to get to the individual user sessions and objects, and then to step back and aggregate huge amounts of traffic in ways that make them interesting to the business, is a cornerstone of what we do.

We’re big believers that if we’re open, our customers will surprise us with new things. So far, they haven’t disappointed us.

Now for Jonathan Zittrain, an Oxford Law professor, with the provocatively named Web 2.NO, which I’m choosing over the alternate “Print 2.0” (or, as Jonathan’s labelling it, “how do I fix my printer driver?”)

Shots from the show at http://www.flickr.com/photos/tags/web20summit/

The high cost of switching

September 17th, 2007 Posted by: Alistair Croll

Back in the dot-com heyday, everyone was terrified of customer churn.

The churn occurred, the theory went, when someone couldn’t buy a book from Amazon.com and switched over to Barnes and Noble (or vice-versa.) This led to shopping cart abandonment and costly acquisition of new customers.

The reality is a bit different. My Amazon account has all kinds of personalized features — from billing and payment information, to a wish list, to recommendations. I’m unlikely to switch unless they really upset me. I’ve changed providers for some things in the past, such as really bad airline procedures, abject failure, or the inability to provide a product or service. Sure, I went looking for a pair of shoes online and tried three or four places before finding them. But for relationship-based selling, where I return time after time, switching doesn’t happen much.

Or rather, switching happens a lot, but people don’t measure it right. Don’t get me wrong: Churn is a big problem. It’s just that traditional thinking about churn won’t work any more.

An often-quoted study conducted in the late nineties by Booz Allen & Hamilton compared the relative costs of a transaction in a bank:

  • Internet: $0.01
  • ATM: $0.27
  • Automated call center: $0.44
  • Call center personnel: $0.85
  • Branch: $1.07

If I need to complete a transaction, and my bank’s website isn’t working, I’ll go to the branch. I hate that, and so, apparently, do the bank’s accountants. The first modern switching cost is channel switching: When I use an inefficient channel, I cost the company money and I get irritated.
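To put a number on channel switching, here is a small sketch using the per-transaction figures quoted above, held in whole cents to avoid floating-point rounding. The function name and the transaction count are illustrative assumptions:

```python
# Per-transaction costs in cents, from the study figures quoted above.
COST_CENTS = {
    "internet": 1,
    "atm": 27,
    "automated_call_center": 44,
    "call_center_personnel": 85,
    "branch": 107,
}


def channel_switch_cost_cents(intended, actual, transactions=1):
    """Extra cost, in cents, when transactions move to a different channel."""
    return (COST_CENTS[actual] - COST_CENTS[intended]) * transactions


# A hypothetical outage that pushes 1,000 web transactions into branches.
extra = channel_switch_cost_cents("internet", "branch", transactions=1000)
```

At these rates, every thousand transactions that shift from the website to a branch add $1,060 in cost.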

A second switch occurs when I stop being productive and engaged. I recently presented at the Application Continuity Conference in San Jose, and looked at “user continuity.”

The gist of this is that, when performance degrades to horrible levels, it’s pretty clear to all involved that the application may as well be down. But what’s less clear is the cost of users switching their level of engagement. The second modern switching cost is engagement switching.

In 1968, Robert B. Miller published a study entitled “Response time in Man-Computer Conversational Transactions.” He looked at how the human brain behaves when the system it is using responds with different levels of delay.

Miller identified three main threshold levels of human attention:

  • 100 ms or less and the person feels that response is instantaneous
  • 1 second or less and the person feels they are “freely interacting” and can enter what Mihaly Csikszentmihalyi called a “Flow State” in which concentration and productivity climb while errors drop.
  • 10 seconds or less and the person feels they are “attention focused,” meaning they are consistently engaged with the task at hand.
  • For interactions that take more than 10 seconds, humans become distracted and will try to multitask, and productivity will drop dramatically.
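These thresholds map naturally onto a small classifier. The sketch below simply encodes the bands listed above; the function name and the label for the slowest band are my own shorthand:

```python
def attention_band(response_seconds):
    """Map a response time to the attention bands described above."""
    if response_seconds <= 0.1:
        return "instantaneous"
    if response_seconds <= 1.0:
        return "freely interacting"
    if response_seconds <= 10.0:
        return "attention focused"
    return "distracted"


bands = [attention_band(t) for t in (0.05, 0.8, 4.0, 12.0)]
```

Counting how many page views fall into each band is one way to see engagement switching in measured response times.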

What I like best about this study is that it predates the Internet; it’s about how we’re wired. We’d scan the grasses to look for the sabre-toothed tiger for about 10 seconds before returning to the task at hand. And according to an MIT thesis, signals travel about 90 meters per second along a sheathed neuron, so we pretty much treat things that happen in under a millisecond as “right now.”[1]

So when we look at switching costs, on many sites we’re not worried about visitor churn in the traditional sense. What’s a lot more relevant is the cost of channel switching and engagement switching that can drive up the cost of serving a customer or the disengagement of the user’s attention and productivity.

[1] Interestingly, our clock speed is between 500 microseconds and 4 milliseconds, or 250-2,000 Hz, so those Pentium chips are catching up on us.

How Microsoft broke Skype by accident

August 20th, 2007 Posted by: Alistair Croll

Skype broke.

This should serve as a lesson to us all. Sometimes the old ways are the best, and we ignore them at our peril.

The folks at Skype said:

On Thursday, 16th August 2007, the Skype peer-to-peer network became unstable and suffered a critical disruption. The disruption was triggered by a massive restart of our users’ computers across the globe within a very short timeframe as they re-booted after receiving a routine set of patches through Windows Update.

Yep, that’s right. Microsoft sent out a patch, and it brought down Skype.

TCP is a great example of a simple, elegant implementation. Admittedly, TCP is bursting at the seams: it doesn’t support enough ports; it’s a jack-of-all-trades transport that isn’t particularly efficient; it requires a lot of computation; and it’s redundant in a lot of encryption and compression systems. Companies like Netli (acquired by Akamai) built businesses on the inefficiency of TCP. Making TCP efficient is a major factor in how Application Front End products (like Citrix’s NetScaler) speed up sites and reduce the load on servers.

But TCP is elegant. One of the things it does best is recover from problems. Wikipedia tells us:

“Modern implementations of TCP contain four intertwined algorithms: Slow-start, congestion avoidance, fast retransmit, and fast recovery (RFC2581).”

Ethernet does this well, too. When a collision occurs, senders keep transmitting long enough to make sure every station has detected it, then back off for a random length of time. From Wikipedia, again:

“This can be likened to what happens at a dinner party, where all the guests talk to each other through a common medium (the air). Before speaking, each guest politely waits for the current speaker to finish. If two guests start speaking at the same time, both stop and wait for short, random periods of time (in Ethernet, this time is generally measured in microseconds). The hope is that by each choosing a random period of time, both guests will not choose the same time to try to speak again, thus avoiding another collision. Exponentially increasing back-off times (determined using the truncated binary exponential backoff algorithm) are used when there is more than one failed attempt to transmit.”

Think about that for a second. The guys who built these protocols realized that congestion would happen, and built models for dealing with unpredictable situations by backing off a random time, and for detecting congestion and avoiding it. And this was back in the day when there were only a few nodes on the Internet. Yet they function reasonably well even today.
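The back-off rule from the quoted passage fits in a few lines. This is a minimal sketch of truncated binary exponential backoff using the classic 10 Mbit/s Ethernet constants (51.2 microsecond slot time, exponent capped at 10, give up after 16 attempts); it illustrates the idea and is not driver code:

```python
import random

SLOT_TIME_US = 51.2   # 512 bit times at 10 Mbit/s
MAX_EXPONENT = 10     # the "truncated" part: the window stops growing here
MAX_ATTEMPTS = 16     # after this many failed attempts the frame is dropped


def backoff_slots(collisions, rng=random):
    """Pick a random wait, in slot times, after the nth collision in a row."""
    if collisions < 1:
        raise ValueError("collisions must be at least 1")
    if collisions >= MAX_ATTEMPTS:
        raise RuntimeError("too many collisions; frame dropped")
    k = min(collisions, MAX_EXPONENT)
    # Uniform choice from 0 .. 2**k - 1, so the window doubles each time.
    return rng.randrange(2 ** k)


def backoff_delay_us(collisions, rng=random):
    """The same wait expressed in microseconds."""
    return backoff_slots(collisions, rng) * SLOT_TIME_US
```

Because each sender draws independently from a window that doubles after every collision, repeated collisions between the same senders quickly become unlikely. A fleet of clients reconnecting at once after a mass reboot benefits from the same kind of randomized retry.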

So why didn’t Skype work properly? Without getting into too many details, the folks at Skype explained:

Normally Skype’s peer-to-peer network has an inbuilt ability to self-heal, however, this event revealed a previously unseen software bug within the network resource allocation algorithm which prevented the self-healing function from working quickly.

There are two important lessons to be learned here:

  • First, it’s critical to look at traffic volumes. Many of the people who buy our UPM equipment used to rely on synthetic testing to monitor their sites. Often, they couldn’t answer simple questions like, “how many users do you have on your site today?” Their marketing department might know, through web analytics tags, how many sessions were active; but there was no way to stitch together traffic levels and performance.
  • And second, the Skype incident is a great example of how complex systems can fail in unexpected ways, and how everything on the Internet is intertwingled. Microsoft’s practice of updating and automatically rebooting billions of computers independent of owner control creates tremendous traffic spikes — and this is true of web-connected services such as antivirus updates and desktop plug-ins. But the impact of these spikes isn’t tracked or understood.

Understanding the relationship between load and performance is critical for anyone running a production web application. Applications will break; and without the right information at your disposal, you won’t be able to detect problems or fix them effectively.

With billions of nodes on the Internet and millions of changes a day to production systems, Sod’s Law (a variant of Murphy’s law) is definitely true: “Anything that can go wrong, will.” But it’s also possible to invoke Hanlon’s razor, a corollary to Murphy, that says, “Never assume malice when stupidity will suffice.”

Why movies teach us bad things about IT tools

August 6th, 2007 Posted by: Alistair Croll

I watched the Bourne trilogy this weekend.

I have to confess that I love the series. One of the things I most admire about it is that the hero actually thinks. I mean, in the first film, he grabs a radio off an opponent, rips a floor map off a wall, and uses that to evade capture and get out of the building. Sure, the films have some crazy car chases (which, by the way, result in a lot of accidents — how unusual!) And there are flight scenes and explosions. But they’re always reasonable.

It’s sad that I’m so impressed by someone acting wisely and normally. As I thought more about it, it occurred to me that Hollywood fills films with convenience. They do this so much that cleverness and pragmatism are refreshing. We’re so used to the MacGuffin that when there isn’t one, we’re actually pleasantly surprised.

Peter’s Evil Overlord List is a great, and growing, list of silly conceits from movies. It does a better job than I can of making my point. Some examples:

  1. My Legions of Terror will have helmets with clear plexiglass visors, not face-concealing ones.
  2. My ventilation ducts will be too small to crawl through.
  3. Shooting is not too good for my enemies.
  4. One of my advisors will be an average five-year-old child. Any flaws in my plan that he is able to spot will be corrected before implementation.
  5. No matter how well it would perform, I will never construct any sort of machinery which is completely indestructible except for one small and virtually inaccessible vulnerable spot.
  6. I will never build only one of anything important. All important systems will have redundant control panels and power supplies.
  7. For the same reason I will always carry at least two fully loaded weapons at all times.
  8. Once my power is secure, I will destroy all those pesky time-travel devices.
  9. If I have massive computer systems, I will take at least as many precautions as a small business and include things such as virus-scans and firewalls.
  10. No matter how many shorts we have in the system, my guards will be instructed to treat every surveillance camera malfunction as a full-scale emergency.

So what does this have to do with IT? Well, often demos are so convenient they lull buyers into a false sense of security. We want to accept the convenient explanations, because they make things simple.

I remember movies from the seventies in which the bad guy locked Our Hero in the sauna, hoping he’d steam to death.

Oh, come on. How many saunas have doors that lock from the outside?

For that matter, how many data centers have big “self destruct” buttons, clearly marked? How many security guards have nametags without photos on them? How many times is the back door to the secret lair conveniently ajar? None of these things happen in the real world; but they happen in movies, and we accept them. Software demos do the same thing. We see a demo, and it looks fine. We want to believe it can save us. We’re willing to accept the coincidences. Salvation is real and imminent.

But reality is a lot more bleak. The tools are seldom as straightforward as they were in the demo. In our field — user performance management for online applications — there are plenty of examples of how things in the real world aren’t nearly as convenient.

Here’s my list of ten differences between the demo and the real world for web monitoring technologies.

1. There’s always a security problem

Whenever you try to deploy new software, there are always security issues. Applications require ports for communication, and have to be tested by the security department. Capturing user data means compliance and oversight; depending on your industry, you may have to store it for seven years. And physical devices may be subject to attacks or may run an unsupported operating system. Good, secure tools that work out of the box without annoying your security officer are worth their weight in gold.

2. URIs aren’t sensible

Sites don’t always have easy-to-read names. Sure, Wikipedia might have http://en.wikipedia.org/wiki/Evil_Overlord_List as a URL that’s pretty easy to parse. But more often than not, it’ll be something like http://www.ifaw.org/ifaw/general/default.aspx?splash&oid=17767 (which, by the way, is the home page for the International Fund for Animal Welfare — but you wouldn’t know it from the URL.) Assume that for something to be useful, it has to be flexible enough to accommodate the quirks of your site’s structure.
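As an illustration of the flexibility this requires, here is a sketch that turns both of the URLs above into readable labels using Python’s standard urllib.parse. The PAGE_NAMES lookup table and the page_label function are hypothetical; a real site would need its own mapping rules:

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical lookup from (path, oid) to a human-readable page name.
PAGE_NAMES = {
    ("/ifaw/general/default.aspx", "17767"): "IFAW home page",
}


def page_label(url):
    """Return a readable label for a URL, using the lookup table if possible."""
    parts = urlsplit(url)
    # keep_blank_values retains bare flags such as "splash".
    query = parse_qs(parts.query, keep_blank_values=True)
    oid = query.get("oid", [None])[0]
    if (parts.path, oid) in PAGE_NAMES:
        return PAGE_NAMES[(parts.path, oid)]
    # Fall back to the last path segment, which works for readable URLs.
    segment = parts.path.rstrip("/").rsplit("/", 1)[-1]
    return segment.replace("_", " ") or parts.netloc


readable = page_label("http://en.wikipedia.org/wiki/Evil_Overlord_List")
opaque = page_label(
    "http://www.ifaw.org/ifaw/general/default.aspx?splash&oid=17767"
)
```

The readable URL labels itself, while the opaque one only becomes meaningful through a site-specific rule, which is exactly the kind of quirk a tool has to accommodate.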

3. The things you’re testing change

Nothing is static. We have customers whose websites’ code changes daily. For them, a simple test isn’t really relevant; it’s useful for a day. If a key function is a constantly moving target, make sure your tools can stick to that target like glue. Otherwise, when something breaks you’ll be looking at yesterday’s data. Ask yourself whether a tool can adapt quickly to changes in the site.

4. All functions aren’t equal

The typical website has dozens of functions, from login to reporting to search to account management. We don’t expect all of them to take the same time. Logging in should be relatively quick; but generating a detailed report could take a while. And we’re okay with that. Unfortunately, performance measurement isn’t. Most web performance tools have a “one size fits all” approach to thresholding. This means that you’re either flooded with false alarms (which you’ll turn off) or missing important ones. Does the monitoring technology recognize the context of a function and a user, and automatically adjust to different functions?
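A minimal sketch of per-function thresholding, as opposed to one global limit. The function names and the limits in seconds are invented for illustration:

```python
# Hypothetical per-function limits, in seconds.
THRESHOLDS_SECONDS = {
    "login": 1.0,
    "search": 3.0,
    "report": 30.0,
}
DEFAULT_THRESHOLD_SECONDS = 5.0


def is_slow(function_name, response_seconds):
    """Judge a response against the limit for its own function."""
    limit = THRESHOLDS_SECONDS.get(function_name, DEFAULT_THRESHOLD_SECONDS)
    return response_seconds > limit


slow_login = is_slow("login", 4.0)    # True: far too slow for a login
slow_report = is_slow("report", 4.0)  # False: perfectly fine for a report
```

The same four-second response raises an alarm for login and none for a report, which is the distinction a single global threshold cannot make.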

5. Every site breaks in its own special ways

I used to have a bounty for broken sites. Over the years, people have sent me hundreds of screenshots of applications breaking in new and unexpected ways. (To this day, one of my favourites is http://www.starwars.com/welcome/404.html.) Some sites try to hide their errors behind polite apologies. Others give detailed error information on the page. Some errors don’t even produce data: A premature server reset or excessive TCP retransmissions, for example, happen outside the realm of HTTP; but they’re still a problem. What if your site breaks in ways that aren’t in the demo you’re seeing?

6. No matter what reports you’ve got, you don’t have the right one

You can never tell what you’re going to need to look at. Sure, it might be useful to see which server is busiest, which browser is slowest, or which page has the most errors. But sooner or later you’re going to get a “complicated” question: “Are Firefox browsers from China who search by ZIP code generating more errors?” (Seriously, one of our customers needed to know this.) If the tool can only slice data in predefined ways, you’re going to be stuck guessing. How flexibly can you focus the analysis of the tool on specific segments of traffic? Can you drill into it?
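To make the “complicated question” concrete, here is a sketch of ad hoc segmentation over per-request records. The field names and sample records are hypothetical and do not reflect any real data schema:

```python
# Hypothetical per-request records.
hits = [
    {"browser": "Firefox", "country": "CN", "search_type": "zipcode", "error": True},
    {"browser": "Firefox", "country": "CN", "search_type": "zipcode", "error": False},
    {"browser": "Firefox", "country": "US", "search_type": "zipcode", "error": False},
    {"browser": "IE", "country": "CN", "search_type": "zipcode", "error": False},
    {"browser": "Firefox", "country": "CN", "search_type": "keyword", "error": False},
]


def error_rate(records, **criteria):
    """Error rate for the records matching every given field=value pair."""
    matching = [
        r for r in records
        if all(r.get(field) == value for field, value in criteria.items())
    ]
    if not matching:
        return None  # no traffic in this segment
    return sum(r["error"] for r in matching) / len(matching)


segment_rate = error_rate(
    hits, browser="Firefox", country="CN", search_type="zipcode"
)
overall_rate = error_rate(hits)
```

Because the criteria are arbitrary keyword arguments, any combination of fields can be compared against the overall rate without a predefined report.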

7. The installation of agents always has issues

The software agent is the IT equivalent of a dentist saying, “trust me, this won’t hurt a bit.” Agents need management and updating. They have to transmit data, and present points of attack. They’re silent when the servers they run on are broken. They generate network traffic. And they’re sandboxed, trapped within the environment on which they run.

Sure, agent-based monitoring is a necessary evil. But it should be used judiciously, and you need to deploy agents with a recognition that things won’t be as rosy as they sound. You’ll have to lobby for their deployment. You’re going to jump through hoops to get them communicating with your management systems. When you’re looking at a demo that has complete visibility, spend a lot of time on the organizational cost of that visibility.

8. Editing tags has hidden costs and limited visibility

The web alternative to agents is tags. These embedded snippets of JavaScript provide some monitoring by asking the browser to report on performance and errors. JavaScript and tagging are a big headache. For marketing departments, tagging is an invaluable tool, but Gartner claims that maintaining tags and scripts is the biggest downside to web analytics.

Using tags for monitoring sounds easy in principle. In practice, however, it’s fraught with peril. JavaScript collection makes the assumption that the page loaded properly (otherwise, how did you get the JavaScript?). It also assumes that the client will run the script (which isn’t the case for many phones, for non-HTML content, and for users with privacy settings turned on). And the client is sandboxed: For security reasons, the JavaScript on the client doesn’t have access to the networking stack or facts about the network. What’s worse, the act of including JavaScript can often slow down the page load time. Consider the organizational cost and the amount of technical information you’ll get when things go wrong.

9. Users don’t follow simple paths

Most e-commerce sites like to think they have simple transactions. Users put things in a cart, check out, pay for their goods, and confirm the shipping address. The reality is, users don’t follow prescribed routes. They meander around the site, going backwards and forwards, opening new tabs, changing their minds. For IT operations, what matters more is the health of key steps in a process, and which users encountered problems at those steps. Don’t assume users will do what you expect.

10. It’s always expensive to run things

Many studies have shown that the real cost of IT is operational. Eric Dean, CIO of United Airlines, told Forbes that for every dollar he spends on a package, he must spend $5 to $7 more on consulting to make it work. Network Appliance estimates that for every dollar of storage, users spend $5 to $7 to manage it (though their tools claim to get that down to $2 to $3, partly due to their appliance focus). And the Seybold Group estimates that with even standard packaged software, for every dollar spent on software a company spends $5 on consulting, systems integration, and custom programming. So when you’re seeing an IT offering, ask yourself: How much will this cost to run? Will it take care of itself?

Back to the real world

Demos often feature nice, simple sites where users are well behaved, installation is assumed, reports show the right data, and security’s not an issue. That’s the IT sales equivalent of the hero defusing the bomb with two seconds left, then finding an escape pod. It’d be nice, but it’s no way to run a business.

Next time you’re evaluating IT tools, think of the cheap tricks that movies pull to conveniently move the plot along. Then think about how much of what you’re seeing is conveniently tweaked for an ideal story.

We used to run websites, so when we started making tools for web operators, we vowed never to make things that looked better in the demo. In fact, we don’t have demo boxes. We have production units that prospective customers buy. They nearly never come back. We don’t really believe in demos: If the product is going to be useful, you should be using it from day one.

In short: If you can’t get results from it the day you plug it in, it’s probably not going to get used once you sign the check.

I’m going to finish this off with a joke, even though you’ve probably heard it and I may have already given away the punchline.

A software salesperson is killed trying to save a schoolbus full of orphans. St. Peter says, “I’m a little unsure what to do. On the one hand, you gave your life so others could live. On the other hand, you sold software that promised far more than it could actually deliver in the real world. So I don’t know whether you go to heaven or hell.”

The salesperson replies, “well, what’s the difference between the two?”

St. Peter answers, “I’m willing to let you visit both places briefly, if it will help your decision.”

First, St. Peter sends the salesperson to hell. And it’s beautiful! Sunny, clear, with attractive people enjoying delicious food, frolicking in the ocean.

“This is great!” says the salesperson. “If this is hell, I really want to see heaven!”

St. Peter snaps his fingers and they’re in heaven. It’s high above fluffy clouds, with angels singing and playing soft, Enya-like music.

The salesperson thinks for a minute, then says, “I guess I’ll take hell.”

Two weeks later, St. Peter decided to see how his charge was doing. When he got there, he found the poor salesperson in chains, hair singed off, screaming as he was tormented by fireball-tossing imps and succubi.

“How’s it working out?” he asked.

The salesperson sobbed, “this is nothing like the hell I visited two weeks ago! What happened?”

“Oh, I’m sorry,” said St. Peter. “That was the demo.”

I guess the moral of the story is, there’s no substitute for seeing the real thing.

Don’t underestimate the importance of products that do what they say they do, well, the day you get them.

IT Executives Speak – 4 ways to get visibility into web performance

July 27th, 2007 Posted by: Alistair Croll

Web applications are typically a complex mix of infrastructure, platforms, services and content. Unlike the mainframe-based applications or LAN client-server applications of the past, no single, simple set of metrics has existed to measure the performance of online applications.

I had an opportunity this week to discuss this issue with four web operations experts at a Coradiant round-table discussion series. Each participant runs very large sites, and each is a clear leader in their respective industries. These four IT executives represented very different industries: A pharmaceutical Software-as-a-Service (SaaS) provider, the highest volume online financial services organization, a regional healthcare provider, and the leading travel industry search site.

Ultimately, the goal is to overcome the visibility gap between IT executives and their web applications. And what’s clear is that watching user experience provides these organizations with a single tool set to measure online performance, find problems and govern IT effectiveness at the senior executive level of the organization. We’ve been working on technology to watch user experience since 2000, and it’s great to see this approach taking hold.

Each of these IT executives uses elements from four major groups of tools, but it’s clear that our User Performance Management (UPM) now gives IT a single, comprehensive capability to perform their jobs well. Because of this — and because UPM focuses on the user- and business-level perspective — it is quickly becoming the prevalent view of web operations health throughout their organizations.

There are four main ways to look at online performance:

  • Platform Management: Monitoring the health of hardware platforms and infrastructure that applications run on
  • Synthetic Testing: Repeated tests of certain user-initiated web processes using automated test scripts
  • Web Analytics: Collecting information to analyze the source of visits, site navigation and buying trends
  • User Performance Management: Monitoring actual end user traffic to identify errors and problems and to measure delivered performance

Web analytics and User Performance Management are based on data from user transactions, while synthetic testing and platform management use tests and platform metrics. Web analytics and synthetic tests are often the domain of marketing, while User Performance Management and platform management are more likely to be used by operational teams.

Let’s look at each in detail.

Platform management

The inherent complexity of modern applications means that there is less and less relationship between actual user experience and the health of platforms. A server can be unavailable—but load-balancers may hide the problems so that users aren’t affected. Similarly, a network can be forwarding packets perfectly—but users are getting content errors.

Server logs provide huge chunks of information that is difficult to correlate to actual user problems. Finding problems based on these logs often takes days of complex effort on the part of experts.

Platform monitoring tools are increasingly employed for diagnostic purposes and forensic root cause investigations. Platform monitoring is a necessary, but not sufficient, part of a web monitoring strategy.

Watching platforms tells you whether components functioned; but User Performance Management tells you whether the whole system delivered and what the resulting user experience was.

Web analytics

Web Analytics measures the effectiveness of online campaigns and web conversions. Marketing organizations use web analytics to optimize the process of converting viewers into buyers.

Modern web analytics has evolved into a powerful set of page-tagging and navigational analysis tools integrated into content management systems. Because it is not economical for web operators to capture and store the tremendous amount of information that collection can generate, analytics is most often delivered through a hosted service.

Web analytics shows purchases, lead sources, and search effectiveness, but doesn’t show whether poor performance affects conversions. Failed transactions often go unnoticed.

Analytics shows who did buy; but User Performance Management shows whether they could.

Synthetic testing

Synthetic testing gives an estimate of application performance. It uses scripts to provide a rough baseline for comparing performance over time and against competitors. Synthetic tests also act as a “sanity check” when a user complains of performance from a particular region.

Because of its repeatability, synthetic testing has been particularly appealing to marketing organizations. Unfortunately, it has also lulled companies into a false sense of security. Modern, dynamic web applications generate unique pages for every visitor, and many of them are one-time transactions. Most organizations answer confidently when someone asks, “is your site working?” But they are far less certain when asked, “is your site broken?” because someone, somewhere, may be having an error on a page that isn’t being tested.

Synthetic testing shows whether a site is working; but User Performance Management shows whether it is broken.

User Performance Management

Watching real users allows web operators to detect every error as it happens. More importantly, capturing users’ sessions means that there is a record of the failure. This makes it dramatically easier to reproduce, diagnose, and resolve the problem. User Performance Management also means organizations know what each user’s experience was like, as well as how the site really performs under production conditions.

This addresses some of the biggest challenges the panelists had:

  • Assuring and effectively reporting service levels
  • Finding and fixing problems quickly
  • Knowing the effect of changes on their end users
  • Real-time visibility into online user experience
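As a rough illustration of the first and last points, here is a minimal Python sketch that turns per-hit records of real user traffic into a per-customer service level summary. The record fields (`customer`, `status`, `seconds`), the sample data, and the rule that 5xx responses count as failures are all assumptions made for this example; they do not describe how any particular product stores or reports its data.

```python
from collections import defaultdict


def service_levels(hits, latency_target_s):
    """Summarize per-customer availability and on-target share from per-hit records."""
    totals = defaultdict(lambda: {"hits": 0, "ok": 0, "on_target": 0})
    for hit in hits:
        row = totals[hit["customer"]]
        row["hits"] += 1
        if hit["status"] < 500:  # assumption: 5xx responses count as failures
            row["ok"] += 1
            if hit["seconds"] <= latency_target_s:
                row["on_target"] += 1
    return {
        customer: {
            "availability": row["ok"] / row["hits"],
            "on_target": row["on_target"] / row["hits"],
        }
        for customer, row in totals.items()
    }


# Hypothetical hits observed from real users (not test scripts).
hits = [
    {"customer": "acme", "status": 200, "seconds": 0.8},
    {"customer": "acme", "status": 200, "seconds": 3.5},
    {"customer": "acme", "status": 500, "seconds": 0.2},
    {"customer": "acme", "status": 200, "seconds": 1.1},
    {"customer": "globex", "status": 200, "seconds": 0.4},
    {"customer": "globex", "status": 200, "seconds": 0.6},
]

report = service_levels(hits, latency_target_s=2.0)
for customer, levels in sorted(report.items()):
    print(customer, f"{levels['availability']:.0%} available,",
          f"{levels['on_target']:.0%} within target")
```

Because every hit is counted, a single failed request shows up in the numbers even if no test script ever visits that page.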

Four essential elements

All four monitoring technologies are employed to run web applications. But it is clear that User Performance Management gives web operators the ability to deliver fast, error-free, available applications and communicate IT effectiveness to the rest of the business.

Web2Expo Day 3

April 18th, 2007 Posted by: Alistair Croll

We’re on to the third day of Web 2 Expo, and it’s been a bit hectic. Good hectic, but busy nevertheless.
I got to the show early on Sunday, before things had opened up. Registration was underway.

Registration before the rush

Our ad in the program looked decent; nice and simple.

Our Web 2.0 ad

I attended a session on web performance put on by Yahoo that was informative and interesting. Anyone who talks about the impact of cookies on performance is my kind of speaker.

The impact of cookies

Even on a Sunday morning, for a topic this dry, the room was packed.

Web2 performance session

I went to check out our booth location; not much going on as it was Sunday and the show floor didn’t open until Monday afternoon.

Booth starting setup

The location was perfect - right along the main corridor, and you couldn’t miss us from the Google booth.

Coradiant booth from Google

That evening, a bunch of Coradiant employees and customers attended Ignite. This was a series of 20-slide, 5-minute presentations on all things Geeky. Very entertaining; one of the speakers was Justin from justin.tv — a guy who’s basically broadcasting his life, 24/7, as the ultimate in reality TV. We need to organize an Ignite in Montreal at the next Democamp.

Here’s a shot of Justin (on stage) and James Ward (from Adobe) looking at Justin’s website at the same time. I guess we could use this to measure latency, since on the site it looks like he’s just about to walk up the stairs to the stage when in real life he’s already there.

Justin TV

The next morning, I moderated a session with folks from Crescendo, Microsoft Windows Live, Amazon Web Services, and MySQL. The panelists were:

Hooman Beheshti, VP of Technology, Crescendo Networks
Mike Culver, Amazon Web Services
James Hamilton, Architect, Microsoft Windows Live
Zack Urlocker, Executive Vice President, MySQL AB

Decent conversation to a big audience.

Web2 Operations session attendance

(and if you look closely you can see a bunch of our customers in the room too.)

By then, the booth assembly was well underway. I’d cunningly timed my session to avoid all the real work.

Booth half set up

Once the floor opened, we were besieged.

Booth with activity

The new Web.I product really blew people away. It’s an amazing blend of reporting, visualization, dashboards, and data mining capabilities. One of the things we were showing was a Gapminder-like animation of site traffic showing sites and pages according to their health. It’s fascinating to see how quickly a human can grasp something once it’s displayed intuitively.

It seems that the entire Web2 world has built sites without much thought to performance and user experience. Ironically, one of the main reasons for AJAX and Rich Internet Applications is to improve the user’s experience, yet they make that experience harder than ever to manage or guarantee.

It also occurs to me (and maybe this is the topic for another post) that Real User Monitoring is the Long Tail approach. While synthetic testing watches the thin wedge of popular sites, most of a site’s hits aren’t to the pages that are tested. That wasn’t true a few years ago, but it’s certainly the case today. As a result, watching user traffic yields far better coverage for the far larger portion of the site that’s unwatched.
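To put a number on the long-tail idea, here is a minimal Python sketch. The page names and hit counts are invented for illustration, and `tested_pages` stands in for whatever a set of synthetic scripts happens to exercise; none of it comes from a real site.

```python
def synthetic_coverage(hits_by_page, tested_pages):
    """Return the fraction of all hits that land on synthetically tested pages."""
    total = sum(hits_by_page.values())
    if total == 0:
        return 0.0
    covered = sum(
        count for page, count in hits_by_page.items() if page in tested_pages
    )
    return covered / total


# Hypothetical traffic: a few popular pages plus a long tail of rarely hit pages.
hits_by_page = {"/home": 300, "/search": 200, "/checkout": 100}
hits_by_page.update({f"/item/{i}": 2 for i in range(700)})  # long tail: 1,400 hits

tested_pages = {"/home", "/search", "/checkout"}  # what the test scripts exercise

coverage = synthetic_coverage(hits_by_page, tested_pages)
print(f"Synthetic tests cover {coverage:.0%} of hits")  # prints 30% for this data
```

In this made-up example the three tested pages are individually the busiest, yet 70% of hits go to pages no script ever touches.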

As if right on cue, one of our customers built a mash-up with Yahoo Maps and TrueSight user data to visualize activity to their site. Apparently it’s quickly become a favourite site for guests, prospects, and even interview candidates. Great work!

The West Coast user group is on Thursday and Friday, too. Wow. So much stuff going on, and so many people starting to plug our User Performance Management technology into their existing business processes.

Web 2.0 Expo 2007, Day 0

April 15th, 2007 Posted by: Alistair Croll

We have a lot happening this week. We’re launching an entirely new product line — Web.I — at Web 2.0 Expo in San Francisco. We also have a user group event on Thursday and Friday for our West Coast users. I’m running a session at the show on next-generation data centers as part of the Web Operations tracks.

Dozens of Coradiant folks are swarming the city later today, but I got here early after participating in NetQoS’s user summit in Austin last week. The feedback from this partner’s event was great, and we’re working on even closer integration between TrueSight and their NetQoS Performance Center (NPC).

But for now, the show floor is quiet and the registration halls are gradually filling up with Sunday’s tracks. It’s a testament to how web-centric this city is that the attendance for Sunday events is already decent. Like a giant “lunch-and-learn” for the development community.

In a Wired magazine interview, Tim O’Reilly claimed that they were expecting between 7,000 and 10,000 attendees. Hopefully lots of them care about running the next-generation sites they’ve built!

Where should I use User Experience?

April 3rd, 2007 Posted by: Alistair Croll

User experience has many applications. We’ve seen people adopt it pretty aggressively for incident management and service level management. But we’re also working with customers and third-party partners on a number of other applications.

User performance data joins test-based and device-based monitoring as the three fundamental building blocks of web performance management. And just as testing is used everywhere from capacity planning to reachability monitoring to penetration testing, so real user monitoring is finding a wide range of applications.

One of the reasons for this is its relevance to groups outside of IT. Business information such as the value of a transaction or the name of a subscriber is part of the data that’s collected, so it’s much more than just performance information. It’s a real-time feed of user activity that gives the business insight into its online interactions.

I put together the circle diagram below to illustrate some of the ways that user experience is being employed.

The User Experience Management circle

Starting with the fundamentals — good, accurate, detailed per-hit and aggregate data collected from not only web pages but also Rich Internet Applications — user experience applies to all of these areas:

  • User Analytics, in concert with a web analytics tool to look at conversion and search engine sources. For some web applications, user experience is the only way to collect transaction information since the site isn’t publicly deployed.
  • QA and testing, both at the start of the test cycle (recording a user session for later use in a load-testing application) and at the end (watching code as it goes into production to see if QA missed any issues).
  • Helpdesk, for problem diagnosis and user assistance.
  • Billing, for generating usage reports by subscriber or customer and assessing bills for excessive use.
  • Dispute resolution, using facts instead of anecdotes to see what really happened and resolve an issue fairly.
  • Incident management, in which problems are detected as soon as a user experiences them — before the phone rings — and resolved using the forensic data that was recorded from the web session.
  • Service Level Management, generating performance and availability reports by customer, geography, or branch office.
  • Baselining, watching a particular function, server, or site to get an idea of what “normal” is in order to set thresholds or measure long-term growth.
  • Capacity planning, in which the relationship between traffic (load) and latency (performance) is calculated over time to see how much a site can handle before becoming unacceptably slow.
  • Compliance, keeping a record of transactions for long periods of time in order to comply with industry law or regulations or to protect the company from risk.
  • Fraud detection, in which user traffic is analyzed to look for patterns of anomalies or inappropriate use — from hack attempts to site harvesting to sharing of account logins.
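The baselining item above can be sketched in a few lines of Python. This is only one possible approach, under stated assumptions: the latency samples are invented, the nearest-rank percentile is a choice made for this example, and the `headroom` factor is a hypothetical tuning knob, not something prescribed by any product.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def baseline_threshold(samples, pct=95, headroom=1.5):
    """Alert threshold: the chosen percentile of normal traffic times a headroom factor."""
    return percentile(samples, pct) * headroom


# Hypothetical page latencies (seconds) observed during a "normal" period.
normal_latencies = [0.4, 0.5, 0.5, 0.6, 0.6, 0.7, 0.8, 0.9, 1.2, 2.0]

threshold = baseline_threshold(normal_latencies)
print(f"Alert when latency exceeds {threshold:.1f}s")
```

Recomputing the baseline over successive periods also gives a simple view of long-term growth: if the 95th percentile drifts upward month over month, the site is slowing down even if no single alert fires.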

Our customers are building many of these themselves, using third-party and open-source tools alongside our equipment. We’re also partnering with a number of companies to test and document proven integrations. Our new VP of Business Development, Ali Hedayati, has his hands full with all of these relationships and others.

Whatever the final result, there’s no doubt that user experience is a ripe field for innovation, and that it’s transforming many parts of an organization far beyond simple incident detection.