Coradiant

Archive for the 'Data centers' Category

Another Traffic Jam – End-User Experience is Key


Thursday, August 20th, 2009 Posted by: Hon Wong

Another traffic jam this morning. As I inched forward, I brooded over the similarity between highway design engineering and the design of Web applications.  It may look good on paper, but when the real world intervenes, all bets are off.

In the world of the Web it’s the same way. Sure, testing and the other pre-deployment steps are critical, but as Web applications become more complex, they are not enough. It’s important to test your Web applications thoroughly.  But it’s even more important to “test” them after deployment to make sure that they continue to run correctly.

In the pre-deployment world there is a lot to manage: make sure that all the application logic and algorithms are sound, have users run through the UI, hunt and kill a load of bugs, load-test the application, perform integration testing to ferret out conflicts with other applications, make sure all the system components required to run the application are there and that you’re running the right version in the targeted infrastructure, and so on. It’s a daunting task. But at least then you have full control.

But after it goes live, you’ve lost some of that control. That’s the unpredictable nature of the Web.  You don’t have much say in how networks perform, or which devices may be running your application on the user side.  And Web users are fickle, with hard-to-predict usage patterns. Turn on your perfect application and soon you’ll get a call from your IT operations people passing on a customer complaint relating to some totally unexpected thing.  Real users always do the unexpected. You can’t predict every scenario. Similarly, no amount of testing using traditional IT tools can possibly find all the hidden problems that can pop out in a live deployment situation.  Throw in the complexity of the Web application infrastructure, and you’ll be spending a lot of time looking for hidden problems that might have nothing to do with your code.

Knowing the end-user experience is key. Organizations that run web applications have discovered that the end-user experience is the key metric for success. By understanding what is being delivered in real time, you can prevent dissatisfaction and application abandonment.  If you serve your users well, the application is perceived as successful. If you fail to deliver the service users expect, the application is deemed a failure.

OK, so we can’t yet test our highways and make dynamic changes after they’re built, but you can certainly do that with Web applications.

The key is to have the right tools. Web Application Performance Management gives you real-time visibility into end-user performance, along with the actionable information you need to make decisions and changes to keep everything running at the speed limit.

Blind Spots in Web Application Performance Monitoring


Thursday, August 6th, 2009 Posted by: Jonathan Ginter

Contrary to popular belief, the brain is not a Personal Video Recorder, recording everything submitted by your various senses.  That would be too much data for any brain to handle.  Instead, it sifts through sensory input looking for relevant data points that it can trust and throws everything else away.  The important words in that last sentence are “relevant” and “trust”.

If a data point is not relevant, then it is considered to be a distraction.  There are well-known studies on Inattentional Blindness and Change Blindness which demonstrate that even large-scale events can be filtered out by the brain if they are considered irrelevant to the task at hand.  Similarly, if the data point cannot be trusted, the brain tosses it out as well (whether your senses can be trusted has been a heated debate in philosophy for centuries, but I digress).  Trust and relevance are crucial to the brain’s ability to eliminate useless noise and derive good results.

These same principles apply to monitoring your web applications.  Instead of monitoring the universe, you should be reducing your data flood to those points that are relevant.  Moreover, you should only be using the most trusted tools and methodologies to draw conclusions.

For web applications, the most relevant data is the data that directly describes or explains your user’s experience and places it in context.  In order to identify that data, you must be able to draw a direct line from your user’s experience to those data points.  If you cannot do that, you are probably chasing your tail and wasting a lot of valuable resources. It is important to realize that a lot of tools cannot draw a direct line from user experience to monitoring data without leaving a few gaps and logical leaps of faith.

As an example, operations teams love to know whether a database is down.  Although this is valuable data, is it relevant?  If users experienced worse performance around the same time, does that mean that fixing the database will solve the performance problem?  In fact, in a well-architected environment, the loss of a web server, app server or database should have little, if any, effect on the end user’s experience due to clustering and load-balancing. A lot of solutions love to use time correlation as a magnificent leap of faith, but it simply makes unreliable conclusions look enticing.

To draw that line between user experience and environmental monitoring, you need a tool that can see the actual users’ experience and is able to relate it directly to problems in your network, application design, deployment, code quality, etc.  Moreover, it must prove itself to be a trusted source of information, returning results quickly and reliably without drowning you in irrelevant data.  In other words, it must be trusted to extract and analyze relevant information and return high-quality results.

Handling the Truth


Wednesday, December 24th, 2008 Posted by: Jonathan Ginter

Coradiant’s TrueSight End-User Experience Management product evokes a number of interesting reactions when we first start monitoring a customer’s Web traffic.  One of the most common reactions is amazement at just how many bugs exist – even in the best Web applications.  One customer speaking at a luncheon described the experience as being similar to turning on the light in your apartment and seeing big ugly cockroaches everywhere – you are appalled, you are embarrassed … and you feel a strong urge to simply turn off the light.

This might seem like a damning statement to make about one’s own environment.  And yet we repeatedly hear how such confrontations with the ugly truth have provided insights that resulted in the correction of long-standing problems, some of which had never even made it onto the radar of Web Operations.  I think you would have to struggle to find a Coradiant customer that did not have a similar story to tell. One customer discovered that 30% of their traffic consisted of cache hits (where the server reports that nothing has changed) or redirects.  By simply tweaking the caching parameters returned by their web servers, they reduced the load on their servers significantly.  Another customer discovered that some pages were taking up to 1.5 minutes to be handled by the server before a response was being sent back to the browser.  Yikes.
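To get a rough feel for the first of those findings on your own site, here is a minimal sketch that computes what share of responses were cache hits (304 Not Modified) or redirects. It assumes you have already pulled the HTTP status codes out of an access log into a list; the function name and the sample data are made up for illustration, and this is not a description of how TrueSight measures it.

```python
from collections import Counter


def cache_hit_and_redirect_share(status_codes):
    """Fraction of responses that were 304 Not Modified or another 3xx redirect."""
    counts = Counter(status_codes)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    not_modified = counts[304]
    redirects = sum(
        n for code, n in counts.items() if 300 <= code < 400 and code != 304
    )
    return (not_modified + redirects) / total


# Hypothetical sample: ten responses taken from a log.
sample = [200, 304, 200, 301, 200, 304, 200, 200, 302, 200]
share = cache_hit_and_redirect_share(sample)
print(f"{share:.0%} of responses were cache hits or redirects")
```

If that number is large, the caching headers your servers return are a reasonable first place to look.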

How many users are hitting your site?  How many errors are being returned?  How slow are the pages?  How reliable is the network?  Some customers are clearly floundering without any real ability to answer these fundamental questions.  Other customers believe they already have a solid handle on such issues.  We have found that almost all of them have a real shock in store.  Some of our most loyal customers are those that firmly believed they knew the truth already.

Often, though, the insight doesn’t have to be that deep to be a revelation.  It never ceases to amaze me how often web sites are thrown over the fence to be supported by a team that hasn’t the first clue about what they have taken on.  We offer a fairly simple feature that reports lists of traffic attributes sorted by popularity – e.g., URLs, hosts, client IP address blocks, geographic regions, cookie keys, etc.  Our customers can define their own fields as well, pulling whatever they would like out of the traffic to do so (e.g., database error codes, product IDs, etc.).  We originally used this feature to help populate configuration fields.  However, the contents of those lists proved to be such a revelation to our customers that we re-categorized the feature under “Reports”.  Using this simple feature, one of our customers noticed that we were seeing internal traffic that we should not have been able to see.  This led him to realize that his routers were improperly configured.
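The idea of a popularity-sorted list is simple enough to sketch in a few lines. The example below assumes each request has already been parsed into a dictionary of attributes; the field names, the `top_values` helper and the sample records are all hypothetical, not the product’s actual data model.

```python
from collections import Counter


def top_values(records, field, n=3):
    """Return the n most common values of one traffic attribute, most popular first."""
    counts = Counter(r[field] for r in records if field in r)
    return counts.most_common(n)


# Hypothetical parsed requests.
records = [
    {"url": "/home", "host": "www.example.com"},
    {"url": "/cart", "host": "www.example.com"},
    {"url": "/home", "host": "shop.example.com"},
    {"url": "/home", "host": "www.example.com"},
]

top_urls = top_values(records, "url")
top_hosts = top_values(records, "host")
print(top_urls)
print(top_hosts)
```

A host or address block that shows up in such a list when it has no business being there is exactly the kind of surprise described above.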

The customer that we had invited to speak at our luncheon finished off his presentation by advising others – somewhat jokingly – to consider carefully whether they were truly ready to handle the truth about their traffic.  Ugly as it may be, facing it can reveal real solutions to real problems.  I highly recommend it.

Analyzing the End-to-End Challenge


Friday, June 20th, 2008 Posted by: Jonathan Ginter

Julie Craig from Enterprise Management Associates published a very interesting article entitled “The End-to-End Challenge”. In this article, she reveals some disturbing statistics, among which were the following (I am paraphrasing here):

  • 43% of application outages are still reported by users
  • 37% of IT professionals lack the tools they need to support their business applications (even though unrelated research reports that IT organizations are using anywhere from 5 to over 25 management tools)
  • 41% of IT organizations prefer to use “expert opinion” to diagnose problems

Although I believe the rate of user-reported issues is much higher, I note that she used the term “outages”, so it is possible that she is only referring to actual downtime and not slow performance or other types of errors. If this is, in fact, a correct interpretation of her meaning, it makes her estimate even more ominous for IT organizations. If Ms. Craig is correct, then an area where IT departments considered themselves to be fairly proficient – the detection of downtime – is proving to be more flawed than previously believed.

However, what caught my eye most were the subsequent assertions. More than a third of IT professionals feel that they are poorly equipped to monitor their web applications even though they are – for the most part – drowning in tools. Ms. Craig goes on to point out that almost half of the IT departments surveyed were relying on their resident experts to figure out what was wrong. I can’t help but feel that this is a direct result of losing faith in the wealth of available tools. When the tools are not doing the job, it is a natural reaction to fall back on the human factor.

So why are all of these tools failing to do the job? Ms. Craig clearly believes that the problem is with end-to-end visibility. However, I disagree for a very simple reason: this fails to address the rate of user-reported outages. Users cannot see the full end-to-end and yet they are more efficient at noticing problems than the IT department. If you want to be as good as your users, you have to be able to see how they are being affected by your applications. You need to see your users’ experience.

And that is what is wrong with most tools out there today. They look at the infrastructure instead of the users. If you can’t see the negative impacts on your users, then all of your other monitoring is rather pointless, since it doesn’t help to support the bottom line of making your users happy.

And let’s be clear. You want to see what is happening to all of your users, not just one or a handful. You have to focus on the forest and not the trees.

It’s nice to see the end-to-end picture, but that is only useful after you have won the war of finding more problems than your users do.

The times, they are a-changing …


Wednesday, May 28th, 2008 Posted by: Jonathan Ginter

I’ve been thinking a lot about a company that we met with recently. Their web site is one of many media that they use to promote their products. Each medium lends its particular talents to the promotion of the products. They drive their traffic across these multiple channels as a means of increasing sales.

Given the immediacy of the web and their other channels, their promotion campaigns are incredibly brief – typically on the order of a few hours. Moreover, they often run more than one campaign per day. They have engaged their customer base extremely well and are using the immediacy of their channels to drive revenue sky-high.

However, each campaign requires new content for their web site. Moreover, that content must be removed when the campaign is over. Think about that for a minute. They are altering the content of their site several times a day, every day. It’s almost an hourly release cycle. As an IT department, what would that rate of change do to you?

Oh look, here comes the tide …

And this trend is not just limited to web content. The release cycle is shrinking on all fronts – infrastructure changes, application updates, etc. As businesses try to tap into the immediacy of the internet, they are going to expect their IT department to be equally nimble, moving swiftly to add servers for increased capacity, deploy new content to support campaigns or apply software upgrades to resolve issues. Where this type of activity used to be allotted several months to ensure quality before deployment, we are now seeing that reduced to weeks or even – in the case of this one company – to a matter of hours.

In the IT world, change = instability and instability leads to support problems and customer calls. As I’ve mentioned before (see Do you know who your users are?), IT departments will typically only catch 3% of the problems before their customers are affected by them. That’s not a great track record. If the rate of change continues to increase, so will the number of problems until the IT department is in danger of drowning completely. Many IT departments we have met with are already in serious trouble. They can’t afford to have things get any worse. What are they expected to do?
 

Perpetual beta

Welcome to the world of the “perpetual beta” where you can leave your obligations at the door.

The need to accommodate an increased rate of change was the main reason that the term “perpetual beta” was coined. It effectively announced to the world that “you should expect this product to have problems, so don’t get too upset or hassle us about it”. Techies love this term because it absolves us of our obligation to provide good service and reliable products, allowing us to focus on being “innovative” instead – as though the two concepts were incompatible. In my opinion, the need for such a term is actually a deplorable testament to our inability to find and fix problems when it comes to web applications.

The term is already being applied to public web sites (with Google leading the charge). However, we are increasingly at risk of applying this term to corporate web sites. Imagine if the site that handled your on-line banking were a “perpetual beta” site. “Whoops, we just lost $3000 of your hard-earned cash. We’re so sorry, but this is a beta. We’ll get right on that, assuming it does not stop us from rolling out our next great feature.”

Not exactly awe-inspiring, is it?
 

And it’s slow, too …

The other main issue that is not discussed very often is performance. Even if you are diligent about finding and resolving crashes and obvious flaws, constant change can also affect performance. And performance is the hardest aspect to monitor effectively. It is usually the last thing that is tested by a QA department. In an accelerated release cycle, it is typically the first activity to be reduced or entirely cut from a schedule.

We have an existing customer whose IT department does not have authority over the deployment of new content, although they are expected to support it once it is rolled out (sadly, this is a fairly typical arrangement). On one particular day, the IT department started getting an enormous number of complaints about the performance of the site. After consulting TrueSight, they noticed that the content had changed. So they called the marketing department. It turns out that the marketing department had rolled out a “new look” for the site that included a lot of high-resolution graphics. Suddenly, delivering the content of the site to the user was like shoving an elephant through a garden hose. The IT department, of course, had never been informed of this since “nothing important was changed, so it shouldn’t make a difference”. Well, that’s comforting. At least we can warm ourselves in that glow while the phones are ringing off the hook.

If you have signed Service Level Agreements with your customers, I’m looking at you. Rather pointedly. You can also include yourself if poor performance tends to cause an increase in support calls (although the SLA victims are worse off, believe me).

All of this to say that any change – any at all – can cause significant problems and cost you real money.
 

Slaying the beta beast

For obvious reasons, corporations cannot – and should not – condone “perpetual beta” status on their web presence. If you cannot declare “perpetual beta” status, what can you do in the face of such a rapid rate of change? How do you take your life back?

You cannot control the QA cycle, so you must work with the hand you are dealt. Given the trend in release cycles, it is increasingly likely – in spite of the best intentions – that you will be expected to support releases that have less and less quality. The onus will be increasingly on you to sniff out and address poor quality before anyone is affected.

To accomplish this, you need the following abilities:

  • To visualize what is happening on your site as it happens, in real time
  • To monitor changes in error rates, traffic levels and performance
  • To provide proof of culpability to other teams and departments

With these basic tools, you can take a less-than-tasty rollout, find its primary flaws and schedule fixes before anyone picks up the phone.
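As a rough illustration of the second ability in that list, the sketch below compares the error rate in a recent window of traffic against a baseline window and flags a jump. It assumes you already have the HTTP status codes for each window; the function names, the doubling factor and the 1% floor are arbitrary choices made for the example, not recommended thresholds.

```python
def error_rate(status_codes):
    """Fraction of responses with a 4xx or 5xx status."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if code >= 400)
    return errors / len(status_codes)


def error_rate_jumped(baseline, current, factor=2.0, floor=0.01):
    """True when the current rate is above the floor and exceeds factor x baseline."""
    base = error_rate(baseline)
    now = error_rate(current)
    return now >= floor and now > base * factor


# Hypothetical windows: before and after a rollout.
before_rollout = [200] * 98 + [500] * 2   # 2% errors
after_rollout = [200] * 90 + [500] * 10   # 10% errors

print(error_rate_jumped(before_rollout, after_rollout))
```

The same comparison works for traffic levels or page times; the point is to notice the change right after a rollout rather than when the phone rings.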

Driving quality

Several of our customers have started to use TrueSight’s abilities in these areas to drive greater quality into the development process. One customer used TrueSight’s alerting capabilities to send an email to the entire development team whenever a customer clicked on a broken link. You can imagine the backlash that ensued at first, as developers demanded that the spamming cease and desist. The IT department stuck to its guns with the simple point that the developers could stop the email themselves by fixing the broken links. And guess what? Miracle of miracles, the links were fixed! More importantly, after the initial storm blew over, the developers were intrigued with the possibility of having more direct feedback from production. They now work more closely with that IT department towards producing better quality.
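For readers who want to try something similar with their own logs, here is a minimal sketch of the idea. Note that it differs from the story above in one respect: instead of one email per click, it builds a single digest that groups the broken links (404 responses) by URL. It only constructs the message and does not send it; the function name, the addresses and the sample hits are hypothetical, and this is not TrueSight’s alerting API.

```python
from collections import Counter
from email.message import EmailMessage


def broken_link_digest(hits, recipient):
    """Build a digest email from (url, status) pairs; return None if nothing is broken."""
    broken = Counter(url for url, status in hits if status == 404)
    if not broken:
        return None
    lines = [f"{count:>4}  {url}" for url, count in broken.most_common()]
    msg = EmailMessage()
    msg["To"] = recipient
    msg["Subject"] = f"{len(broken)} broken link(s) hit by users"
    msg.set_content("\n".join(lines))
    return msg


# Hypothetical hits observed in production.
hits = [("/old-page", 404), ("/home", 200), ("/old-page", 404), ("/promo", 404)]
digest = broken_link_digest(hits, "dev-team@example.com")
print(digest["Subject"])
print(digest.get_content())
```

Whether you send one message per click or one digest per hour is a policy decision; the customer in the story deliberately chose the noisier option.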

Simple exposure of flaws – with hard evidence to support those assertions – can be a powerful tool in changing the dynamics of departmental relationships and molding corporate policy when it comes to quality.

And we desperately need that type of change in this industry.

What makes a “must-have” IT product?


Tuesday, February 26th, 2008 Posted by: Tony Tissot

Patrick Gardella, of Discovery Communications, recently spoke to Network World and said about Coradiant TrueSight, “Basically, it allows us to identify very rapidly what is happening with actual users on the site, and then it helps us debug those things.”

“With our huge online shopping site — and other Web systems that require major user interaction – users have problems. When that happens, we get e-mails saying, ‘Your site is broken’ and not much more. The reason I like Coradiant is that it offers a very simple, easy-to-use appliance that can find out what’s happening with those individual users, as well as how many other people are having those same problems.”

For the full article see: Network World 

Interop New York


Thursday, October 25th, 2007 Posted by: Alistair Croll

I’m starting day two of the Data Center Summit at Interop, back-to-back with user group events in New York and Boston this week. The first day was a very interesting set of topics:

  • Steve Shah of Risingedge presented a session on the “State of the cage.” This fascinating presentation looked at the evolution from mainframes to clustered computers, and from local procedure calls to intra-data-center delays.
  • A panel of folks including Michael Baum (CEO of Splunk), James Sayles (CCO of Ecora) and Michael Weider (CTO of IBM’s recently-acquired Watchfire) discussed issues of compliance and privacy in data centers.
  • John Carton, formerly at Accenture and now the Senior Director of Web Services at Nature’s Bounty, presented a model for thinking about disaster recovery in data centers.

There was lots to learn from the sessions. Steve pointed out that with the adoption of service-oriented architectures, back-end procedure calls that used to take microseconds now take milliseconds, and that virtualization will make this worse since developers don’t know whether an often-reached machine is local or remote. Michael observed that compliance, which tells people to keep everything, conflicts with privacy, which says that more data makes a breach more risky. And John suggested that companies need to evaluate not only availability, but also how much data they can afford to lose, when setting recovery policies for data centers.

Today’s tracks look at the range of data center models (from on-demand to full colocation with a content delivery network); the issues of power, cooling, and efficiency in greening modern data centers; and automating changes within data center environments.

A quick sidenote: Our San Diego headquarters was a busy place this week, with the wildfires consuming a huge swath of the Southwest. While many of our employees were evacuated, nobody was hurt; and with offices in Boston and Montreal, Canada, we were able to handle order shipments and provide continuous support to our customers worldwide despite the crisis. Thanks to everyone involved in keeping things running.