Coradiant

Archive for the 'Performance theory' Category

Your site’s performance is important to Google


Wednesday, May 5th, 2010 Posted by: Jonathan Ginter

Poor performance can now degrade your business in an even more real and meaningful way.  Recent changes by Google will allow site performance to affect whether traffic is driven to your site.  This places a new urgency on the ability to accurately measure performance in terms of the end user.

“Time is ticking out” by flickr user Mao Lini. Used under Creative Commons license.

Recently, the Mashable blog reported on a decision by Google to add performance as another deciding factor in how they rank their search results.  This marks a significant milestone in which performance will affect your site’s ability to attract visitors.  Essentially, the faster sites will receive more attention.  By taking this action, Google hopes to shine a spotlight on performance and drive better overall development practices.  I have no doubt that this is a direct result of Steve Souders’ move to Google from Yahoo, where he had been busy leading the YUI best practices effort that produced YSlow.  Since that time, Google has become much more of an advocate for web performance.  Here’s a direct quote from their blog that sums up their current perspective:

“We encourage you to start looking at your site’s speed … not only to improve your ranking in search engines, but also to improve everyone’s experience on the Internet.”

Essentially, Google has now made performance an important factor in driving business to your site.  At the same time, performance is one of the hardest things to accurately and properly measure for web-based traffic.  Google suggests a number of very good tools, most of which run inside your browser.  These tools look at performance in terms of the end user, which is now a recognized best practice.  However, they only measure the performance of your site while you are actively browsing it.  They won’t tell you anything about how your site performs for others or how it performs during the other 99% of the time when you are not personally measuring it with your browser.  Synthetic testing suffers from the same drawbacks.

More importantly, these approaches miss the elusive problems that affect specific people or that occur at specific times.  Nor will they find the problems that only occur under specific circumstances.  The fact is, these elusive issues are the most common.  They are the ones that plague every web site administrator because they are hard to find and nearly impossible to reproduce.  They eat up days of investigation time and are the biggest destroyer of public confidence in your site.  To find these problems, you need to be watching ALL traffic on your site every second of every day.  Moreover, you must do so while maintaining the end user perspective.

“Houston we have a problem...” by flickr user Mihael Mafy. Used under Creative Commons license.

There are very few solutions out there that can do this effectively.  Consequently, we are justifiably proud of our TrueSight product line and its ability to tackle this very difficult problem.  Once installed in your data center behind your firewall, TrueSight is able to monitor all traffic from your load balancer all the way back to your database and 3rd party tiers – from web request right down to code, SOA and SQL calls.  We auto-discover new applications and web servers as they are deployed without any need for additional configuration.  Moreover, our monitoring solution continues to perform its duties, even as you gradually virtualize your infrastructure or integrate back-end cloud services.

In fact, using a unique ground-breaking integration, we can provide full visibility of any deployments using Akamai’s Application Delivery Assurance services, like acceleration or caching.  Our unique relationship with Akamai has also recently led to the first co-developed solution that is being offered directly from Akamai for managing their services.

Furthermore, our technology is capable of providing accurate overall performance measurements for mashups or hybrid solutions.

If you have blind spots in your performance monitoring, you will not be protected from the negative consequences of this new trend.  If you lack insight on your end users’ actual experience on your site, you should seriously start planning to acquire it.

Another Traffic Jam – End-User Experience is Key


Thursday, August 20th, 2009 Posted by: Hon Wong

Another traffic jam this morning. As I inched forward, I brooded over the similarity between highway design engineering and the design of Web applications.  It may look good on paper, but when the real world intervenes, all bets are off.

In the world of the Web it’s the same way. Sure, the testing and pre-deployment steps are critical, but as Web applications become more complex, they are not enough. It’s important to thoroughly test your Web applications.  But it’s even more important to “test” your Web applications after deployment to make sure that they continue to run correctly.

In the pre-deployment world there is a lot to manage: make sure that all the application logic and algorithms are sound, have users run through the UI, hunt and kill a load of bugs, load-test the application, perform integration testing to ferret out conflicts with other applications, make sure all the system components required to run the application are there and that you’re running the right version in the targeted infrastructure, and so on and so forth. It’s a daunting task. But at least then you have full control.

But after it goes live, you’ve lost some of that control. That’s the unpredictable nature of the Web.  You don’t have much say in how networks perform, or which devices may be running your application on the user side.  And Web users are fickle, with hard-to-predict usage patterns. Turn on your perfect application and soon you’ll get a call from your IT operations people passing on a customer complaint relating to some totally unexpected thing.  Real users always do the unexpected. You can’t always predict every scenario. Similarly, no amount of testing using traditional IT tools can possibly find all the hidden problems that can pop out in a live deployment situation.  Throw in the complexity of the Web application infrastructure, and you’ll be spending a lot of time looking for hidden problems that might have nothing to do with your code.

Knowing the end-user experience is key. Organizations that run web applications have discovered that the end-user experience is the key metric for success. By understanding what is being delivered in real-time, you can prevent dissatisfaction and application abandonment.  If you serve your users well, the application is perceived as successful. If you fail to deliver the service users expect, the application is deemed a failure.

OK – so we can’t really test our highways and make dynamic changes after they’re built – yet, but you can certainly do that with Web applications.

The key is to provide the right tools. Web Application Performance Management solves the dilemma of gaining real-time visibility into end-user performance, and providing the actionable information you need to make decisions and changes to keep everything running at the speed limit.

Blind Spots in Web Application Performance Monitoring


Thursday, August 6th, 2009 Posted by: Jonathan Ginter

Contrary to popular belief, the brain is not a Personal Video Recorder, recording everything submitted by your various senses.  That would be too much data for any brain to handle.  Instead, it sifts through sensory input looking for relevant data points that it can trust and throws everything else away.  The important words in that last sentence are “relevant” and “trust”.

If a data point is not relevant, then it is considered to be a distraction.  There are well-known studies on Inattentional Blindness and Change Blindness which demonstrate that even large-scale events can be filtered out by the brain if they are considered irrelevant to the task at hand.  Similarly, if the data point cannot be trusted, the brain tosses it out as well (whether your senses can be trusted has been a heated debate in philosophy for centuries, but I digress).  Trust and relevance are crucial to the brain’s ability to eliminate useless noise and derive good results.

These same principles apply to monitoring your web applications.  Instead of monitoring the universe, you should be reducing your data flood to those points that are relevant.  Moreover, you should only be using the most trusted tools and methodologies to draw conclusions.

For web applications, the most relevant data is the data that directly describes or explains your user’s experience and places it in context.  In order to identify that data, you must be able to draw a direct line from your user’s experience to those data points.  If you cannot do that, you are probably chasing your tail and wasting a lot of valuable resources. It is important to realize that a lot of tools cannot draw a direct line from user experience to monitoring data without leaving a few gaps and logical leaps of faith.

As an example, operations teams love to know whether a database is down.  Although this is valuable data, is it relevant?  If users experienced worse performance around the same time, does that mean that fixing the database will solve the performance problem?  In fact, in a well-architected environment, the loss of a web server, app server or database should have little, if any, effect on the end user’s experience due to clustering and load-balancing. A lot of solutions love to use time correlation as a magnificent leap of faith, but it simply makes unreliable conclusions look enticing.

To draw that line between user experience and environmental monitoring, you need a tool that can see the actual users’ experience and is able to relate it directly to problems in your network, application design, deployment, code quality, etc.  Moreover, it must prove itself to be a trusted source of information, returning results quickly and reliably without drowning you in irrelevant data.  In other words, it must be trusted to extract and analyze relevant information and return high-quality results.

Is User Identification Hopelessly Broken?


Wednesday, June 4th, 2008 Posted by: Jonathan Ginter

The Web Analytics industry is in the midst of a debate about how to identify and count Unique Users.  Some people are starting to suggest that we should abandon the idea of Unique Users in favor of counting something easier.  At the heart of that debate is the question of whether we will ever be able to uniquely identify users on the web.
 

Surely I can trust the client IP?

The problems with the client IP have been public knowledge for a long time.  This is an excerpt from a tutorial about Web Analytics, published by Summary.net (a log analysis tool) back in 2002:

“The majority of Internet users connect through dial-up services of some kind. In order to preserve IP numbers (there are a limited number available right now), the dial-up providers will assign each user a number when he connects and then reuse the number when he is done with it. So a dial-up service may have 100 IP numbers that they select from and use to serve 2000 users. This gets even more complicated with caches and proxies that many providers now use to improve performance …”

With the introduction of mega-proxies (like AOL), this problem gets even worse.  Mega-proxies will spray their traffic across multiple gateways.  Since the internet was designed to treat each hit as a stand-alone transaction, this means that every request making up a single page can be routed through a different client IP and port.  So, instead of a one-to-many relationship between the IP and the users, we have a many-to-many relationship.

Entities like corporate firewalls are rendering the client IP extremely weak and unreliable as a user identifier.  Mega-proxies completely destroy its reliability.
 

What about the user agent?

A lot of people choose to set aside the concern about mega-proxies and talk about combining the user agent with the client IP as a differentiator.  The problem with this is that there are a finite number of user agents in the world.  Admittedly, they number in the thousands.  However, these user agents are shared by millions of web users, which means that tons of users are being represented by the same user agents.  In fact, this understates the problem since most people are running the same browsers and plugins on the same basic operating systems, reducing the pool of popular user agents.  Combining IP and user agent will still result in users that are sharing the same combination.
If you are expecting to use this as a means for identity tracking – as in “this is Bob” – then you are going to be disappointed.  Since the user agent contains information about the browser and the OS, it can easily mutate over time as users upgrade their browser, download plugins, install service packs, etc.  Moreover, users are not limited to one browser – I use Firefox but am occasionally forced to use IE on specific sites – or one system.  I surf from my laptop, my wife’s computer and my iPod, so I’m using three different platforms as well as three different browsers.
 

Enter the plugin

At this point, you may be thinking that the user agent will at least improve your odds.  This would be true if it weren’t for plugins.  Plugins within a browser are allowed to request their own resources from the server.  When they do so, they send a user agent and it does not have to be the same one used by the browser.  The Java plugin is a classic example.
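
To see how the IP-plus-user-agent key fails in both directions, here is a minimal sketch over a handful of made-up hits. Every name, address and user agent string below is hypothetical, and the "true user" column exists only to score the result; a real log would not have it.

```python
from collections import defaultdict

# Hypothetical hits: (true_user, client_ip, user_agent).
HITS = [
    ("alice", "10.0.0.1", "Firefox/2.0"),
    ("bob", "10.0.0.1", "Firefox/2.0"),    # same corporate firewall, same browser
    ("carol", "172.16.0.5", "MSIE 7.0"),
    ("carol", "172.16.0.9", "MSIE 7.0"),   # mega-proxy used another gateway
    ("carol", "172.16.0.9", "Java/1.6"),   # plugin request with its own user agent
]


def summarize(hits):
    """Compare real users with the 'users' implied by (ip, user agent) keys."""
    users_by_key = defaultdict(set)
    keys_by_user = defaultdict(set)
    for user, ip, agent in hits:
        key = (ip, agent)
        users_by_key[key].add(user)
        keys_by_user[user].add(key)
    return {
        "true_users": len(keys_by_user),
        "apparent_users": len(users_by_key),
        # keys that hide more than one person
        "shared_keys": sum(1 for u in users_by_key.values() if len(u) > 1),
        # people who show up under more than one key
        "split_users": sum(1 for k in keys_by_user.values() if len(k) > 1),
    }


SUMMARY = summarize(HITS)
```

In this toy data three people turn into four apparent users: two of them collapse into one key, while the third is split across three.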
 

Grab your bootstraps and pull

This problem – as with all others – begins at home.  If you want to track users, do not expect the HTTP protocol to help you.  It was originally designed for anonymous traffic.  Deploy your own tracking IDs that are tailored to your needs.  Most web servers have mastered the art of injecting user awareness into the traffic (via cookies or URL-rewriting).  If you need identity awareness, then you need to take the next step and have your developers build that into your application.
There is no magic bullet.  You need to solve this problem for yourself.
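
As a rough illustration of deploying your own tracking ID, the sketch below issues a random ID in a cookie when a request arrives without one. The cookie name, the one-year lifetime and the function name are assumptions made for this example, not anything a particular web server prescribes. Note that this still identifies a browser, not a person; identity awareness has to come from the application itself.

```python
import uuid
from http.cookies import SimpleCookie

COOKIE_NAME = "visitor_id"            # hypothetical cookie name
ONE_YEAR_SECONDS = 365 * 24 * 3600    # assumed lifetime for the ID


def ensure_visitor_id(cookie_header):
    """Return (visitor_id, set_cookie_value) for one incoming request.

    cookie_header is the raw Cookie request header, or None.
    set_cookie_value is None when the browser already carries an ID.
    """
    jar = SimpleCookie()
    jar.load(cookie_header or "")
    if COOKIE_NAME in jar:
        return jar[COOKIE_NAME].value, None

    # No ID yet: mint one and build the value for a Set-Cookie header.
    visitor_id = uuid.uuid4().hex
    out = SimpleCookie()
    out[COOKIE_NAME] = visitor_id
    out[COOKIE_NAME]["path"] = "/"
    out[COOKIE_NAME]["max-age"] = ONE_YEAR_SECONDS
    return visitor_id, out[COOKIE_NAME].OutputString()
```

The first request from a browser gets a new ID plus a Set-Cookie value; later requests that send the cookie back return the same ID and nothing to set.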

Green code


Thursday, October 18th, 2007 Posted by: Alistair Croll

I recently wrote a blog for GigaOm’s Earth2Tech site on “Green Code.” The idea is that the quality of code matters. Two coders, writing code for the same application, can have a tremendous difference in efficiency. And that can translate to big differences in power consumption and resource costs — particularly in a virtualized or on-demand environment.

Over here on the Coradiant blog, I can speculate a bit more specifically about what this means. One of the interesting things you can do with user experience is to measure the total processing involved in a page or a user visit.

Because much of the delay on the Internet comes from network performance, two applications with significantly different host efficiency might seem as fast as one another to an end user, so you can’t really measure this just by trying two sites.

But the precision of Real User Monitoring technologies makes even millisecond differences in host processing time clear. And while web operators usually look at average (or percentile) host time, one of the more unusual ways to measure host time is to sum it. This effectively shows you the “total thinking done” for a user’s session.

This can be the start of some pretty fascinating math. Once you know host time per session, you can see how many host-seconds your infrastructure devotes to a visitor. This can show you things like whether a certain class of users is consuming more than its fair share of “heavy” searches.
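
Summing host time is simple enough to sketch. The records below are invented, and the record layout (session ID, user class, host time in seconds) is an assumption about what a monitoring export might contain.

```python
from collections import defaultdict

# Hypothetical per-request records: (session_id, user_class, host_seconds).
REQUESTS = [
    ("s1", "browser", 0.120),
    ("s1", "browser", 0.080),
    ("s2", "searcher", 1.500),
    ("s2", "searcher", 2.500),
    ("s3", "browser", 0.300),
]


def host_seconds_by(records, field):
    """Sum host time grouped by the given field index (0=session, 1=class)."""
    totals = defaultdict(float)
    for record in records:
        totals[record[field]] += record[2]
    return dict(totals)


PER_SESSION = host_seconds_by(REQUESTS, 0)  # "total thinking done" per session
PER_CLASS = host_seconds_by(REQUESTS, 1)    # which class of user is the heaviest
```

Here the single "searcher" session accounts for 4.0 of the 4.5 host-seconds, which is the kind of imbalance the per-class total is meant to surface.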

(Incidentally, on the Coradiant.com site, this often reveals blog spammers from China posting comments about their various vitamins, and more questionable offerings.)

But you can also tie this host time back to IT costs.

I’m teaching a course on data center growth as part of Interop’s Data Center Summit in New York next week (more on this in a later post). In preparing for that session, I spent a lot of time looking at the cost models behind on-demand hosting, managed servers, colocation, and global CDNs. And it made me realize there are good ways to model IT costs that vary widely according to each business.

Let’s look at combining these two metrics — host time and IT costs — to better understand the business impact of IT.
If you have a good model for IT costs (such as colocation, power, cooling, and storage) and you divide your monthly IT costs by the sum of host time for the month, you know your IT-cost-per-host-second. You don’t want to include bandwidth costs, which aren’t related to the host time.

If you then multiply host-seconds for each user session by that IT cost, you can calculate how much each user session costs you.
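
The arithmetic of those two steps, as a small sketch. The dollar amount, the monthly host-time total and the per-session figures are made-up numbers for illustration only.

```python
def cost_per_host_second(monthly_it_cost, monthly_host_seconds):
    """Monthly IT cost (bandwidth excluded) divided by total host time."""
    if monthly_host_seconds <= 0:
        raise ValueError("need a positive host-time total")
    return monthly_it_cost / monthly_host_seconds


def session_costs(host_seconds_per_session, rate):
    """Multiply each session's host-seconds by the cost per host-second."""
    return {sid: secs * rate for sid, secs in host_seconds_per_session.items()}


# Hypothetical month: $50,000 of IT cost spread over 2,000,000 host-seconds.
RATE = cost_per_host_second(50_000.0, 2_000_000.0)    # dollars per host-second
COSTS = session_costs({"s1": 0.2, "s2": 4.0}, RATE)   # dollars per session
AVERAGE_COST = sum(COSTS.values()) / len(COSTS)       # average cost per session
```

With these numbers a host-second costs 2.5 cents, so a session that uses 4 host-seconds costs about 10 cents while one that uses 0.2 costs half a cent.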

This is an excellent basis for evaluating change across releases. It will reflect increased costs in hosting (such as the introduction of an application accelerator like an Application Front End, or AFE), reductions in delay (such as a drop in host time when the AFE’s application acceleration functions reduce the load on servers), and even changes in pages per session.

You can actually report average IT cost per user session.

As a result, you’ll now know the actual impact of that deployment: Did the reduction in IT-cost-per-host-second outweigh the investment in the AFE? How many weeks did it take to pay the cost back? Is the additional site navigation costing us more?

Of course, there are many other benefits to reducing host time, from user satisfaction to increased capacity to reduced SLA refunds. But this idea of IT-cost-per-host-second is a nice, concrete way to think about what code changes or other modifications to your operations do to your business.

Now back to the fascinating sessions at Web 2.0.

The high cost of switching


Monday, September 17th, 2007 Posted by: Alistair Croll

Back in the dot-com heyday, everyone was terrified of customer churn.

The churn occurred, the theory went, when someone couldn’t buy a book from Amazon.com and switched over to Barnes and Noble (or vice-versa.) This led to shopping cart abandonment and costly acquisition of new customers.

The reality is a bit different. My Amazon account has all kinds of personalized features — from billing and payment information, to a wish list, to recommendations. I’m unlikely to switch unless they really upset me. I’ve changed providers for some things in the past, such as really bad airline procedures, abject failure, or the inability to provide a product or service. Sure, I went looking for a pair of shoes online and tried three or four places before finding them. But for relationship-based selling, where I return time after time, switching doesn’t happen much.

Or rather, switching happens a lot, but people don’t measure it right. Don’t get me wrong: Churn is a big problem. It’s just that traditional thinking about churn won’t work any more.

An often-quoted study conducted in the late nineties by Booz Allen & Hamilton compared the relative costs of a transaction in a bank:

  • Internet: $0.01
  • ATM: $0.27
  • Automated call center: $0.44
  • Call center personnel: $0.85
  • Branch: $1.07

If I need to complete a transaction, and their website isn’t working, I’ll go to the branch. I hate that, and so, apparently, do their accountants. The first modern switching cost is channel switching: When I use an inefficient channel, I cost the company money and I get irritated.

A second switch occurs when I stop being productive and engaged. I recently presented at the Application Continuity Conference in San Jose, and looked at “user continuity.”

The gist of this is that, when performance degrades to horrible levels, it’s pretty clear to all involved that the application may as well be down. But what’s less clear is the cost of users switching their level of engagement. The second modern switching cost is engagement switching.

In 1968, Robert B. Miller published a study entitled “Response time in Man-Computer Conversational Transactions.” He looked at how the human brain behaves when the system it is using responds with different levels of delay.

Miller identified three main threshold levels of human attention:

  • 100 ms or less and the person feels that response is instantaneous
  • 1 second or less and the person feels they are “freely interacting” and can enter what Mihaly Csikszentmihalyi called a “Flow State” in which concentration and productivity climb while errors drop.
  • 10 seconds or less and the person feels they are “attention focused,” meaning they are consistently engaged with the task at hand.
  • For interactions that take more than 10 seconds, humans become distracted and will try to multitask, and productivity will drop dramatically.
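
If you want to bucket measured response times against these thresholds, a sketch might look like the following. The band labels are shorthand taken from the list above, and the function name is my own.

```python
def attention_band(response_seconds):
    """Map a response time in seconds onto the attention bands listed above."""
    if response_seconds < 0:
        raise ValueError("response time cannot be negative")
    if response_seconds <= 0.1:
        return "instantaneous"
    if response_seconds <= 1.0:
        return "freely interacting"
    if response_seconds <= 10.0:
        return "attention focused"
    return "distracted"


# Hypothetical measurements, in seconds.
BANDS = [attention_band(t) for t in (0.05, 0.8, 4.0, 12.0)]
```

Counting how many page views land in each band over a day gives a quick picture of how much of your traffic is being pushed out of the "freely interacting" range.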

What I like best about this study is that it predates the Internet; it’s about how we’re wired. We’d scan the grasses to look for the sabre-toothed tiger for about 10 seconds before returning to the task at hand. And according to an MIT thesis, signals travel about 90 meters per second along a sheathed neuron, so we pretty much treat things that happen in under a millisecond as “right now.”[1]

So when we look at switching costs, on many sites we’re not worried about visitor churn in the traditional sense. What’s a lot more relevant is the cost of channel switching and engagement switching that can drive up the cost of serving a customer or the disengagement of the user’s attention and productivity.

[1] Interestingly, our clock speed is between 500 milliseconds and 4 seconds, or 250-2,000 mHz, so those Pentium chips are catching up on us.

How Microsoft broke Skype by accident


Monday, August 20th, 2007 Posted by: Alistair Croll

Skype broke.

This should serve as a lesson to us all. Sometimes the old ways are the best, and we ignore them at our peril.

The folks at Skype said:

On Thursday, 16th August 2007, the Skype peer-to-peer network became unstable and suffered a critical disruption. The disruption was triggered by a massive restart of our users’ computers across the globe within a very short timeframe as they re-booted after receiving a routine set of patches through Windows Update.

Yep, that’s right. Microsoft sent out a patch, and it brought down Skype.

TCP is a great example of a simple, elegant implementation. Admittedly, TCP is coming apart at the seams: it doesn’t support enough ports; it’s a jack-of-all-trades transport that isn’t particularly efficient; it requires a lot of computation; and it’s redundant in a lot of encryption and compression systems. Companies like Netli (acquired by Akamai) built businesses on the inefficiency of TCP. Making TCP efficient is a major factor in how Application Front End products (like Citrix’s NetScaler) speed up sites and reduce the load on servers.

But TCP is elegant. One of the things it does best is recover from problems. Wikipedia tells us:

“Modern implementations of TCP contain four intertwined algorithms: Slow-start, congestion avoidance, fast retransmit, and fast recovery (RFC2581).”

Ethernet does this well, too. When a collision occurs, senders keep transmitting long enough to make sure everyone has heard the collision, then back off for a random length of time. From Wikipedia, again:

“This can be likened to what happens at a dinner party, where all the guests talk to each other through a common medium (the air). Before speaking, each guest politely waits for the current speaker to finish. If two guests start speaking at the same time, both stop and wait for short, random periods of time (in Ethernet, this time is generally measured in microseconds). The hope is that by each choosing a random period of time, both guests will not choose the same time to try to speak again, thus avoiding another collision. Exponentially increasing back-off times (determined using the truncated binary exponential backoff algorithm) are used when there is more than one failed attempt to transmit.”
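
The truncated binary exponential backoff mentioned at the end of that quote fits in a few lines. This is a sketch of the classic rule rather than any vendor’s implementation: after the n-th collision a sender waits a random number of slot times between 0 and 2^min(n, 10) - 1, and gives up on the frame after 16 failed attempts. The 51.2 microsecond slot time is the figure for 10 Mbit/s Ethernet; the function names are my own.

```python
import random

SLOT_TIME_US = 51.2   # slot time for 10 Mbit/s Ethernet (512 bit times)
BACKOFF_LIMIT = 10    # the exponent stops growing here: the "truncated" part
ATTEMPT_LIMIT = 16    # the frame is dropped after this many failed attempts


def backoff_slots(collisions, rng=random):
    """Pick how many slot times to wait after the given number of collisions."""
    if collisions < 1:
        raise ValueError("collisions must be at least 1")
    if collisions >= ATTEMPT_LIMIT:
        raise RuntimeError("too many collisions; the frame is dropped")
    exponent = min(collisions, BACKOFF_LIMIT)
    return rng.randrange(2 ** exponent)   # uniform over 0 .. 2**exponent - 1


def backoff_delay_us(collisions, rng=random):
    """The same wait expressed in microseconds."""
    return backoff_slots(collisions, rng) * SLOT_TIME_US
```

After the first collision a sender waits 0 or 1 slots; after the third, anywhere from 0 to 7. The widening random range is what keeps two polite dinner guests from repeatedly talking over each other.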

Think about that for a second. The guys who built these protocols realized that congestion would happen, and built models for dealing with unpredictable situations by backing off a random time, and for detecting congestion and avoiding it. And this was back in the day when there were only a few nodes on the Internet. Yet they function reasonably well even today.

So why didn’t Skype work properly? Without getting into too many details, the folks at Skype explained:

Normally Skype’s peer-to-peer network has an inbuilt ability to self-heal, however, this event revealed a previously unseen software bug within the network resource allocation algorithm which prevented the self-healing function from working quickly.

There are two important lessons to be learned here:

  • First, it’s critical to look at traffic volumes. Many of the people who buy our UPM equipment used to rely on synthetic testing to monitor their sites. Often, they couldn’t answer simple questions like, “how many users do you have on your site today?” Their marketing department might know, through web analytics tags, how many sessions were active; but there was no way to stitch together traffic levels and performance.
  • And second, the Skype incident is a great example of how complex systems can fail in unexpected ways, and how everything on the Internet is intertwingled. Microsoft’s practice of updating and automatically rebooting billions of computers independent of owner control creates tremendous traffic spikes — and this is true of web-connected services such as antivirus updates and desktop plug-ins. But the impact of these spikes isn’t tracked or understood.

Understanding the relationship between load and performance is critical for anyone running a production web application. Applications will break; and without the right information at your disposal, you won’t be able to detect problems or fix them effectively.

With billions of nodes on the Internet and millions of changes a day to production systems, Sod’s Law (a variant of Murphy’s law) is definitely true: “Anything that can go wrong, will.” But it’s also possible to invoke Hanlon’s razor, a corollary to Murphy, that says, “Never assume malice when stupidity will suffice.”

What do we really know?


Wednesday, March 28th, 2007 Posted by: Alistair Croll

Most of our customers have three different tools for determining the health of their web applications. Component monitoring tools look at the elements themselves, such as RAM or queue depth. There’s always a value there (even if it’s “0”). Testing tools generate activity, and look at the results. These are used primarily to load sites and measure capacity, or see if a known function is working. The number of measurements is directly related to the number of tests. And Traffic monitoring — whether it’s packets, weblogs, or, in our case, user sessions — looks at activity generated on a site. This can include real users as well as tests.

What’s interesting is how complementary these three are. Component tools are a lousy way to assess the user experience, but they’re absolutely essential in localizing to root cause. And the other two — tests and traffic measurements — serve very different purposes.

At first blush, synthetic testing and real user monitoring seem competitive. We’ve certainly had customers say, “with you, I don’t need to spend money on testing.” But let’s look a little more closely.

At this point I’m going to get a bit philosophical. I took some fascinating philosophy of language courses in university (from a professor who was freakishly like Stephen Colbert at the time), and much of it revolved around the precision of knowledge. There’s a huge difference between what we know to be true, and what we can’t say we know not to be true.

Try to answer two questions. The first: Is your website working? To answer it, you might try the site. If it works, you answer, “yes!” For you, at that time, in that place, with that page, it worked.

Now answer, “is your website broken?” It would seem at first blush that the answer is a resounding, “no!” After all, you just tried it, and it worked. But do you know that it isn’t broken for someone else, on some other page, from some other place?

Philosopher Bertrand Russell used the example of a teapot, in orbit around Mars, to discuss the latter point. Do you know there isn’t a teapot orbiting Mars? British upbringing to the contrary, I doubt very seriously that there is a teapot flying around the red planet. But I don’t know there isn’t. In fact, I’m certain that I don’t know. In the same way, most web operators don’t know that a user isn’t stuck or a part of their site isn’t broken. And this is one of the crucial differences between testing and watching traffic.

Right now, I’m pretty sure my Blackberry (where I’m typing this) is working. I can see and touch it; and I know how to check if it’s working: press keys and see text appear. I would say I’m certain that the typing function on my Blackberry is working right now. (And I know what broken’s like; I busted my last keyboard a couple of months ago. Might come from blogging on it too often.)

Many websites have a “test page” built by the development team that exercises the application. When requested, it hits a number of back-end functions, writes to and reads from a database, and generally returns a comforting message that all is well. This is analogous to knowing my Blackberry can type. It’s within my control, and I can assess it at regular intervals.

Real users, on the other hand, are like teapots. They definitely exist; I’ve met some in person. They might be doing something unexpected (like orbiting a planet). We won’t know they aren’t unless we watch every possible user in every possible place. This is the function Real User monitoring provides. At the same time, if there are no real users, there are no results to analyze, which is why we need repeatable tests of known functions to detect problems quickly (there are other good reasons, like baselining and measuring reachability, CDNs, and DNS resolution).

Put another way, testing does the same, limited, specific thing over and over to see if it varies or is different somehow (test.asp is unreachable from Kansas.) And real user monitoring watches many different things — each of them unique — and tries to find patterns that indicate a trend or problem (everyone is suddenly getting 10 seconds of delay; all requests for page.jsp get a 500 error.)
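
The "find patterns" half of that comparison can be sketched as a simple aggregation over observed hits. The hits, the minimum sample size and the 90 percent error threshold below are all invented for the example; a real tool would tune them.

```python
from collections import Counter

# Hypothetical observed hits from real users: (url, http_status).
OBSERVED = [
    ("/page.jsp", 500), ("/page.jsp", 500), ("/page.jsp", 500),
    ("/home", 200), ("/home", 200), ("/home", 500),
    ("/cart", 200),
]


def failing_pages(hits, min_hits=3, min_error_share=0.9):
    """Return URLs where nearly every observed request got a server error."""
    totals, errors = Counter(), Counter()
    for url, status in hits:
        totals[url] += 1
        if status >= 500:
            errors[url] += 1
    return sorted(
        url for url, count in totals.items()
        if count >= min_hits and errors[url] / count >= min_error_share
    )


FLAGGED = failing_pages(OBSERVED)
```

Only /page.jsp is flagged here: /home has one stray error among three hits, and /cart has too few hits to say anything, which is exactly the "no real users, no results" gap that repeatable tests fill.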

Unfortunately for web operators, the chances that a user is doing something weird, getting stuck, or having a bad experience are much higher than the likelihood of interplanetary dishware. Which is why (according to a recent Forrester survey) more than 70 percent of problems are still reported by end users.

Who knew web operations could be so existential?

Have we solved the availability problem?


Tuesday, October 10th, 2006 Posted by: Alistair Croll

A few years ago, before the days of TrueSight, Coradiant used to be in the managed services business. This gave us lots of hands-on experience and a good deal of opportunity to see vendors from the customer’s perspective. We try to put that background into our products — lots of debates around technical issues are resolved with a firm, “would we have bought it if it did that?”

But not everything’s the same.

One of the things that’s changed a lot since then is the emphasis on availability and performance. Back then, nobody cared about performance; they were worried mainly about uptime. We hadn’t yet figured out how to make the world reliable.

  • People spent money on big, reliable servers. Then Google showed us how to make millions of cheap PCs run well.
  • Websites used to have dozens of redundant networks to overcome peering congestion. Then Internap raised the bar, and peering got better.
  • Load-balancers argued over which algorithms to use to detect outages. Then they got good at it, adopted best practices, and got in the middle of connections where they could see problems first-hand.
  • Browsers broke. Then the innovation turned to standardization and even 2-year-old clients could handle JavaScript properly.
  • Monolithic servers were expected to do everything. Then the three-tiered model abstracted presentation, processing, and storage. (I actually found an old study I wrote in 1999 on “the emerging 3-tier model” of computing. Things weren’t always this way!)
  • But most importantly, we stopped thinking about device availability and started thinking about system availability. Web operators don’t care about the failure of a single server these days (although it’s a great way to get a free lunch from a supplier.) They watch the overall system.

With this change has come a major rethinking of priorities. Now that systems are highly available—despite their notoriously unreliable components—people have turned their attention to performance.

I’m not talking about the old “eight second” rule of product purchases. Frankly, if Kayak or Overstock or Lendingtree or Brassring is a little slow today, I’ll wait: I know they’re good for it. I have accounts there. The cost of switching is high.

What I am talking about is the impact of performance on everything from call center volumes, to lost productivity, to spontaneous coffee breaks, to adoption failures. As more and more companies focus on performance, they unearth skeletons in their closets.

One of our customers described a situation where performance was unbelievably slow for users at one office in Asia Pacific. On closer analysis, the users there were connecting to a U.S. proxy, completely bypassing the one in their branch office. The Internet was available to them—but unbelievably slow. And by switching them to the local proxy, the company avoided upgrades of around $60,000 a month.

Back in our days as an MSP, we used to build reports of availability. We used to alert and alarm on it. But today, when I talk to customers, few of them are concerned with availability and uptime. They want the forensics on a failure (to wave at the aforementioned vendor while ordering the Baked Alaska), to be sure. But most of the interest we get is in performance and traffic level analysis.

In particular, we get questions about second-order analysis. “Don’t just tell me how slow it was,” said one customer last month, “tell me how many users were dissatisfied by the performance.” Similarly, people ask me “can you show me whether this level of latency is normal for this level of load.” Answering these questions is far more meaningful. Customer support and capacity planning can use the answers to really tackle some fundamental issues they face. The answers involve quite a lot of computation, and they require that we measure performance, availability, and traffic levels.
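To make the two customer questions concrete, here is a rough Python sketch. It assumes page views arrive as (user, seconds) pairs and load samples as (requests per minute, latency) pairs; the 4-second threshold, the bucket size, and the function names are all hypothetical choices for illustration, not how any particular product computes these answers.

```python
from collections import defaultdict

def dissatisfied_users(page_views, threshold_s=4.0):
    """Return the users who saw at least one page slower than threshold_s."""
    return {user for user, seconds in page_views if seconds > threshold_s}

def latency_by_load(samples, bucket_size=100):
    """Average latency per load bucket.

    samples holds (requests_per_minute, latency_seconds) pairs; each load
    value is rounded down to the nearest multiple of bucket_size.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for load, latency in samples:
        bucket = (load // bucket_size) * bucket_size
        totals[bucket][0] += latency
        totals[bucket][1] += 1
    return {bucket: total / count for bucket, (total, count) in totals.items()}

views = [("alice", 1.2), ("bob", 6.0), ("alice", 5.1), ("carol", 0.8)]
unhappy = dissatisfied_users(views)
print(len(unhappy), "of", len({user for user, _ in views}), "users were dissatisfied")

# A new reading can then be compared with the average for its load bucket.
print(latency_by_load([(120, 1.0), (180, 3.0), (250, 4.0)]))
```

Both answers need per-user and per-request measurements as input, which is why performance, availability, and traffic levels all have to be collected together.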

I think we may have solved the availability problem. But the performance problems are just beginning.

What’s Real User Monitoring, anyway?


Monday, August 14th, 2006 Posted by: Alistair Croll

We use the term Real User Monitoring to explain what Coradiant’s technology does. The term sounds a bit nebulous, but it does the job. Of course, there are lots of people who think they do real user monitoring, so I’m going to try to explain how the various approaches differ from each other and from ours.

Synthetic tests

First off are the synthetic testing companies. Their tools—usually sold as recurring monthly services—run scripts at regular intervals from all over the world. These scripts simulate what an ideal user would do: Transactions like checking in, putting something in a cart, or getting an account balance.

Lots of people like synthetic tests because they’re repeatable and predictable. They’re great for baselining; in fact, Chris Loosley of Keynote Systems did a great job explaining this for us at the Webops sessions of Interop Las Vegas.

But they’re not monitoring real users. They’re simulating idealized users from controlled environments. Real users might be miserable while the synthetic tests work just fine.

Synthetic testing

These tools are essential to web operators; but they won’t tell you anything about the volume of traffic to a site, or whether end users are actually getting the performance that the tests report.

Web log analysis

A second way of collecting information on web health is via weblogs. Each time a server gets a hit, it writes down information on that hit in a logfile (usually following a format called ELF, or Extended Log Format.) The logfile tells you a lot about the request: Where it came from, what it requested, and when it occurred. It might even tell you about the timing of the request.
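As a minimal sketch of what reading such a logfile involves, the Python below assumes the W3C-style extended format, in which a "#Fields:" directive names the columns that follow. The sample lines and field choices are made up for illustration, and the sketch splits on whitespace only, so it does not handle quoted values that contain spaces.

```python
def parse_extended_log(lines):
    """Turn extended-format log lines into one dict per hit."""
    fields = []
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directives start with '#'; only '#Fields:' matters here.
            if line.startswith("#Fields:"):
                fields = line[len("#Fields:"):].split()
            continue
        records.append(dict(zip(fields, line.split())))
    return records

# Hypothetical sample log, not taken from a real server.
sample = [
    "#Version: 1.0",
    "#Fields: date time c-ip cs-method cs-uri-stem sc-status time-taken",
    "2006-08-14 09:15:02 192.0.2.10 GET /index.html 200 0.120",
    "2006-08-14 09:15:07 192.0.2.44 GET /page.jsp 500 10.004",
]

for hit in parse_extended_log(sample):
    print(hit["c-ip"], hit["cs-uri-stem"], hit["sc-status"], hit["time-taken"])
```

Each record answers where the request came from, what it asked for, and when; the time-taken column appears only if the server was configured to log it.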

Web logs are monitoring real user activity. On their own, they’re not that useful. But feed them into a web log analysis tool (like Analog, WebTrends, Sane, or Sawmill) and you’ll find out lots of details: What people searched for, where they went on the site, what browser they used, and so on. More commonly, companies use a web analytics firm like CoreMetrics, WebTrends Live, Omniture, Clicktracks, or WebSideStory that collects activity based on JavaScript. Often, activity is displayed in funnels of user activity by step, cross-referenced with search terms.

Web funnel view

Web log analysis doesn’t offer much performance data. It won’t break requests down into the elements of latency, or show network forensics. It’s also aimed mainly at public-facing B2C sites: analytics products are seldom used to explain activity on an intranet or a back-end B2B application.

Sniffers

Technically, sniffing traffic is real user monitoring—after all, real users made all those packets. But even viewing the traffic in a sniffer screen doesn’t tell you much about users. WildPackets, Network General, Niksun and ClearSight are good examples of sniffers I’ve seen, but most people I know use Ethereal, which is free and amazing.

A sniffer screen from Ethereal

Flow monitoring products

Higher up the stack than sniffers are what I call “flow monitors.” These work in a variety of ways, generally by asking other devices about traffic they saw (using RMON or NetFlow). A more open version of NetFlow, called IPFIX, is making this more and more attractive to people.

Flow monitoring across TCP ports

Response time monitoring

Another way to measure application response time is to sniff traffic from span ports and measure the round-trip time of sessions (rather than collecting flow data from network devices.) For example, NetQOS’ SuperAgent measures the end-to-end time between networks and hosts by listening to span ports or taps.

We announced a partnership with NetQOS back in April. Their reporter/analyzer product collects NetFlow and IPFIX data; and their SuperAgent product is a response-time monitoring product that watches the TCP/IP sessions between networks and hosts. It assembles and aggregates these so you can see how much traffic flowed from what network to what port on a server. And it measures performance data—how long the packets took to travel, how long the server thought about them, and so on.

What does what

A flow monitoring product summarizes things at the time of collection (i.e. on the router) so it can’t peer within the flow. Response time monitors can look within the traffic, but are generally protocol-agnostic: They don’t “understand” a web, e-mail, or IM session across individual traffic flows. This means that if a protocol-agnostic monitoring tool sees that there were 50 Kbytes of data between a network and a web host, the operator still doesn’t know whether that was one 50-Kbyte object, or 50 1-Kbyte objects.

As a result, I don’t know if this session was one user, or 10 users behind a NATting firewall. I don’t know how long individual pages took, or how many pages a user requested in a visit, and so on. And I can’t tell things like browser type or search string, or what they entered in a form.
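The ambiguity is easy to show in a small Python sketch. The record layout here, a (source network, destination host, port, bytes) tuple, is an assumption made for illustration and is not the actual NetFlow or IPFIX record format.

```python
def flow_summary(objects):
    """Collapse per-object records into per-flow byte totals.

    This mimics what a protocol-agnostic monitor retains: bytes per
    (source network, destination host, port), with no object boundaries.
    """
    totals = {}
    for src, dst, port, size in objects:
        key = (src, dst, port)
        totals[key] = totals.get(key, 0) + size
    return totals

# Two very different visits (hypothetical data)...
one_big_object = [("10.0.0.0/24", "web1", 80, 50_000)]
fifty_small_objects = [("10.0.0.0/24", "web1", 80, 1_000)] * 50

# ...look identical once only the byte totals are kept.
print(flow_summary(one_big_object))
print(flow_summary(fifty_small_objects))
print(flow_summary(one_big_object) == flow_summary(fifty_small_objects))
```

Telling the two cases apart means keeping the per-object records, which requires a monitor that understands HTTP rather than just counting bytes.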

On the other hand, flow monitors and response time monitors are great for comparing the amount of traffic across all kinds of applications. A sudden increase in Voice over IP (VoIP) traffic might mean that web traffic takes longer to get through; someone running a backup late at night might inadvertently make late-night shoppers miserable. And this kind of activity is completely invisible to a product that’s watching HTTP. So if you’re trying to troubleshoot and measure networks, you need a flow monitor (preferably from our friends at NetQOS; they also have great SNMP monitoring tools to collect device health, and a centralized performance console).

Real User Monitoring products

TrueSight falls into this class. Basically, it’s able to discern individual users and page load times.

So a complete web operations team has a variety of monitoring tools at their disposal:

  • Synthetic testing to detect problems when there’s no activity and set baselines for controlled, known processes.
  • Web analytics to show conversion rates, funnels, search terms, and the like to marketing.
  • Sniffers to capture traces for network engineers.
  • Flow-based monitors to understand the breakdown of traffic across all protocols and how one application impacts the others.
  • Real user monitoring to measure the performance and availability experienced by actual users, diagnose individual incidents, and track the impact of a change.

So when Coradiant partners with NetQOS, it’s a way of giving customers the best of both worlds: Deep web analysis, alongside broad multiprotocol monitoring.

Viewing performance deep and wide