Coradiant

Archive for November, 2008

The Benefits of Immediate Data


Friday, November 21st, 2008 Posted by: Jonathan Ginter

As the world moves on-line for most of its social and business interactions, it becomes more and more important for us to be able to react quickly when the systems that support those interactions exhibit problematic behavior.  Since problematic behavior is not always reflected in the health of your infrastructure, it has to be measured from the end user’s perspective.  In other words, if the end user’s experience degrades in any way, the application has become problematic.

This can present quite a problem on several fronts:

  • Measuring the end user’s experience
  • Being notified quickly that the user’s experience has degraded
  • Discovering that a potential fix has failed to address the problem so that it can be rolled back before too many users are negatively affected

As an example, let’s imagine that the on-line store on one of our web servers suddenly experiences an internal problem, which causes its performance to tank with no outward sign of distress (e.g., no log entries).  Traditional methods of detection and notification will not work for us here.

Moreover, time is of the essence.  In our various field deployments, I have noticed that having 5000 users on your site every hour – on average – is quite common.  In fact, some sites have been known to average about 100k users every hour.  So if it takes you an hour to even notice that you have a problem, you have already upset quite a few users.  You need to be able to react quickly.

Challenge #1

How do we detect the problem?  We need to monitor the end user’s experience and it needs to happen in real-time.  This challenge has become less of a problem as various End User Experience Management (EUEM) tools have emerged to address it, some more successfully than others and each with its own unique feature set.  However, this is not the immediate focus of this article.  So, let’s assume that we already have such a tool in place.

Challenge #2

How quickly can we be notified that the user’s experience has gone south?  That depends upon the immediacy of the data, which in turn depends largely on the tool that’s been chosen.

When an application starts to collapse, there are typically two major symptoms:

  • A drop in performance
  • A drop in volume as users abandon the application

If we typically receive 5000 users per hour in our on-line store, we can assume that 80 or more users are negatively impacted every minute.  Moreover, so far we’re only talking about being notified.  Once that happens, we will still have to analyze and deal with the problem.  All the while, the problem on the site is spreading to more and more users.
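To put rough numbers on that, here is a back-of-the-envelope sketch in Python.  It assumes that visitors arrive evenly across the hour and that every visitor who arrives during the lag is affected; both are simplifications, and the function name is made up for illustration.

```python
def users_affected(users_per_hour: float, lag_minutes: float) -> float:
    """Estimate how many users hit the degraded site before anyone is notified.

    Assumes a uniform arrival rate and that every arriving user is affected.
    """
    return users_per_hour / 60.0 * lag_minutes


# The two traffic levels mentioned in this article, at three notification lags.
for rate in (5_000, 100_000):
    for lag in (1, 5, 60):
        affected = users_affected(rate, lag)
        print(f"{rate:>7} users/hour, {lag:>2} min lag: ~{affected:,.0f} users")
```

The point of the exercise is that the damage scales linearly with the lag, so every minute shaved off the notification time is a minute’s worth of users spared.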

If the problem is only noticeable as a trend, it may be necessary to wait several minutes for enough data to be gathered to reveal that trend.  However, the lag time should be kept to that order of magnitude.  Waiting for an hour or more to be notified should be completely unacceptable.
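As a minimal sketch of what “a few minutes of data” can look like, the Python below compares the average response time over the last few minutes with a longer baseline.  The class name, window sizes and threshold ratio are all assumptions chosen for illustration; they are not taken from any particular EUEM tool.

```python
from collections import deque
from statistics import mean


class TrendDetector:
    """Flags a sustained rise in a per-minute metric against a longer baseline."""

    def __init__(self, baseline_minutes=30, recent_minutes=5, ratio=1.5):
        self.recent_minutes = recent_minutes
        self.ratio = ratio
        # Keep only as many per-minute samples as the two windows need.
        self.samples = deque(maxlen=baseline_minutes + recent_minutes)

    def add(self, value: float) -> bool:
        """Record one per-minute sample; return True if the recent window has degraded."""
        self.samples.append(value)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough history yet
        history = list(self.samples)
        baseline = mean(history[:-self.recent_minutes])
        recent = mean(history[-self.recent_minutes:])
        return recent > baseline * self.ratio
```

Feeding it one average response time per minute means an alert can fire within a handful of minutes of the slowdown starting.  For the other symptom, a drop in volume, the same idea applies with the comparison flipped so that it fires when recent traffic falls well below the baseline.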

Moreover, if the problem can be detected from a single hit on the site – e.g., the application is throwing back pages with error codes embedded in them – then the notification should be almost immediate.  The lag time from seeing a hit on the wire to the time that an alert can be sent about that hit should be within a few minutes, at most.
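A single-hit check of this kind can be very simple.  The following Python sketch inspects one response and decides on the spot whether it warrants an alert.  The error patterns are hypothetical examples; a real deployment would look for whatever its own application embeds in a failing page.

```python
import re

# Hypothetical markers of a broken page; replace with your application's own.
ERROR_PATTERN = re.compile(
    r"(ORA-\d{5}|Internal Server Error|Exception in thread)"
)


def check_hit(status_code: int, body: str):
    """Return a reason string if this single hit warrants an alert, else None."""
    if status_code >= 500:
        return f"HTTP {status_code}"
    match = ERROR_PATTERN.search(body)
    if match:
        # The server answered 200 OK, but the page itself carries an error.
        return f"error text in page: {match.group(0)}"
    return None
```

Because no history is needed, nothing has to be aggregated before the alert goes out.  Any remaining lag comes from the tooling, not from the detection logic.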

Challenge #3

Immediacy of data is also a concern when a fix is being rolled out and we need to validate that the problem has truly been addressed.  Rolling out a fix and waiting for an hour to gather the results is unacceptable in this day and age.  The only organization that should be willing to accept such a lag time is NASA (and at least they have a good reason for it).  If potential fixes cannot be validated within minutes, then users are being treated like piñatas.  Generally speaking, users don’t appreciate that.
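Validating a fix within minutes can be as simple as comparing a short window of post-fix measurements with a known-healthy baseline.  The Python sketch below does exactly that with median response times.  The function name, the 20% tolerance and the three verdicts are assumptions made for illustration.

```python
from statistics import median


def fix_verdict(healthy, after_fix, tolerance=1.2):
    """Compare post-fix response times with a known-healthy baseline.

    Returns "wait" if no post-fix traffic has been seen yet, "keep" if the
    post-fix median is within `tolerance` times the healthy median, and
    "roll back" otherwise.
    """
    if not after_fix:
        return "wait"  # nothing measured since the fix went out
    if median(after_fix) <= median(healthy) * tolerance:
        return "keep"
    return "roll back"
```

With immediate data, the `after_fix` list fills up in the first few minutes after the rollout, so the keep-or-roll-back decision does not have to wait for an hourly report.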

You own that data.  You deserve to have access to it as fast as possible.  Your users will thank you.