
Coradiant

Archive for June, 2006

We have lots to do


Wednesday, June 21st, 2006 Posted by: Alistair Croll

European electro band Röyksopp have a recent track called Remind Me. The video for the song (available as a Real Media stream from their website, or on YouTube if you prefer embedded Flash) is a great example of visualization done right. It was designed by French media house H5.

Link to Remind Me on YouTube

Humans are great at finding patterns. The problem with most systems is more that they can’t represent the information in ways we can “grok” than that the data isn’t there. TrueSight, for example, sees every error, page, object, and user session. It peers into the HTML or XML body of HTTP messages. And it summarizes things in all sorts of useful ways using reference lists and Watchpoints. But it’s the session browser — with its quick, easy way to understand what a user experienced — that people seem to love the most.

Lots of companies exist solely to make visualization better. I’m constantly stressing about how to make this information more accessible and immediate. It’s a challenge that borrows more from information architecture and graphic design than from performance monitoring or analytics. (To that end I’m in the middle of the excellent Ambient Findability right now.) Here’s a great example of visualization applied to a problem: How to know when to pull a pitcher.

Public-facing websites have raised the bar. Smart Money, Google Finance, even a recently-launched Russian search engine are all driving expectations by users — particularly non-technical executives who want everything rolled up into a simple red/green dashboard.

(And while you’re at it, Eple is an addictive track too.)

The emerging discipline of Web Operations


Wednesday, June 14th, 2006 Posted by: Alistair Croll

In traditional IT environments, silos of domain expertise focus on the atomic elements that perform some kind of business function. These include the client, the WAN/network, the data center, the application cluster, and the back-end systems.

A traditional silo view of application responsibilities

The network and data center tiers are often represented by clouds (particularly when drawn by those whose responsibilities do not include them). This is because, while they are often a mixture of technologies, they are treated as a single logical unit that forwards and processes packets in some way.

A second way of describing the divisions of labor that perform a business function is the “layered” model.

A tiered view of roles similar to the OSI model

While the tiered model borrows from the network topology, the layered model borrows from the OSI model of networking. Physical “facilities” teams, network connectivity teams, application developers, and other groups all play a role in delivering an application–but they’re seldom interconnected.

Of course, neither of these models transcends the tiers or layers from which organizational responsibilities are usually defined. The result is that the e-business group cares little for the impact of the platform, hardware, and application layers, and the networking group seldom worries about end-user performance as long as the packets are flowing.

But this is changing. Companies have dramatically increased the amount of customer-, partner-, and employee-based interaction they do via computer applications. While the web is the leading platform for such interactions, other initiatives—from thin-client terminals to Flash- or AJAX-based applications—are more and more common.

There’s a significant gap in operational tools to manage web applications. While network, platform, and server teams have traditionally focused on operational tasks, their application and e-business peers have been worried about deployment and design. But very few solutions can work across silos or organizational boundaries.

The result is the emerging discipline of Web Operations, which blends the operational tools of server, network, and platform operations with the customer- and business-process emphasis of marketing and e-business.

Web Operations is the intersection of traditional IT operations and e-business tools

These two domains differ significantly.

Operations

  • Measurement of success: Performance and availability of the application or infrastructure
  • Clients identified by: Source IP, region, ASN
  • Unit of measure: Packet, query, hit
  • Performance problems from: CPU overload, insufficient network capacity
  • Availability problems from: Faulty hardware, data corruption

e-business

  • Measurement of success: Conversion versus abandonment
  • Clients identified by: Referring search engine, user account
  • Unit of measure: Session, user
  • Performance problems from: Large page size, improper cache parameters
  • Availability problems from: Bad navigational logic, missing content

It is only with a blend of both domains that we can answer some of the most costly and perplexing problems that web operators face today. Web operators often need to combine data from operations (such as performance and availability) with more user-centric information (such as customer or subscriber groupings).

When I talk with IT teams at many of the e-business companies out there, they all have the same kinds of questions–questions for which there aren’t easy or immediate answers. Here are some of them:

  • What’s the performance and availability of key web functions? I need to provide application performance visibility to executives in my organization, and it takes too much time and effort to create reports from disparate data sources. Existing reports are aimed at technical staff and are not executive-friendly.
  • Which groups are best or worst off? Which groups are above or below an “acceptable” service level? What I mean by “group” will change based on who I am (network, server, platform) and what kind of user model I run (B2B, B2C, Intranet).
  • What’s broken on my site? I configured some tests but they’re stale because the site changes often; I don’t have a lot of time to manage and configure monitoring applications because I’m often in firefighting mode. Broken might be “elements” like states, service providers, servers, or application functions; or it might be user sessions that didn’t achieve some kind of goal.
  • Why aren’t users achieving their goals? When someone doesn’t complete a goal I’d like, is it because they didn’t like my offer? Because they couldn’t understand or use the application properly? Because they got “stuck” due to bad programming on my part? Had a hard error? Or simply because they lost their connection for no related reason?
  • How much traffic can I handle? Based on what I’ve seen in the past, how many users can I support before performance becomes unacceptably slow?
  • How are errors affecting my users? Is my slow performance causing me to lose money? Do users switch to my competition? Or to another, more costly form of interaction like a phone call or an in-person visit?
  • Why is my web application slow? I’m looking at a function or time period where the response time of my application is very slow and I don’t know why. I need to get to the root cause of this so I can fix it.
  • How big a problem is this? Is the complaint or incident I’m investigating affecting everyone or just a single user?
  • Is the problem I’m looking at my fault or someone else’s? My customers sometimes experience slow performance that is caused by issues outside of my own network but they blame me for it. I need to be able to help my customers see the light.
  • How do back-end web or database services affect site performance? I’m using web services extensively as part of my application delivery and I don’t have a good way to see how web service calls affect my overall performance.
  • How should I investigate this alert or incident? I don’t always have a high degree of expertise in house to resolve complex performance issues. I need solutions that will help my level 1 technical staff resolve real issues with a concrete workflow.
  • How did the recent change affect the health of my application? I need to quantify how changes to my application are affecting the overall quality of my service delivery. I can’t do this today without extensive effort.
  • Did I meet my service objectives for a particular subscriber group? Which contracts or agreements am I at risk of violating?
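Several of these questions (which groups are best or worst off, and whether a service objective was met for a subscriber group) come down to the same calculation. Here is a minimal Python sketch of it, assuming each request has already been tagged with a group label and a measured latency. The group names, the 4-second target, and the 90% objective are made up for illustration.

```python
from collections import defaultdict

def service_level_by_group(requests, target_seconds):
    """Return {group: fraction of requests at or under target_seconds}."""
    totals = defaultdict(int)
    within = defaultdict(int)
    for group, latency in requests:
        totals[group] += 1
        if latency <= target_seconds:
            within[group] += 1
    return {group: within[group] / totals[group] for group in totals}

def groups_below(levels, acceptable_fraction):
    """List groups whose service level falls short, worst first."""
    short = [g for g, level in levels.items() if level < acceptable_fraction]
    return sorted(short, key=lambda g: levels[g])

# Hypothetical sample: (group label, page latency in seconds)
sample = [
    ("partner-a", 1.2), ("partner-a", 1.8), ("partner-a", 6.5), ("partner-a", 2.0),
    ("partner-b", 0.9), ("partner-b", 1.1),
    ("dial-up", 7.0), ("dial-up", 9.5),
]

levels = service_level_by_group(sample, target_seconds=4.0)
at_risk = groups_below(levels, acceptable_fraction=0.9)
print(at_risk)
```

What counts as a "group" is just the label attached to each request, so the same code works whether you slice by region, service provider, or customer account.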

There are many important dimensions to consider when trying to understand what a particular web operations team will care about. While there are hundreds of ways to slice up web operations’ needs, I find that these three dimensions matter the most in terms of what kinds of problems a company has and how they’d like to solve those problems.

  • The company’s relationship with its users (B2B, B2C, or intranet)
  • The timeframe in which the team operates (tactical incidents, mid-tier reporting, or long-term planning)
  • The team’s organizational responsibility (network and CDN, server and platform, application development, or user/content owner)

At Interop in May we’ll be running the first WebOps Summit. It will include several companies whose businesses focus on the unique predicaments of WebOps teams, and should be an interesting event.

And Technorati does an interesting job of tracking the buzz for terms like Web Operations…

Web Operations 90-day history

The website is down


Wednesday, June 14th, 2006 Posted by: Alistair Croll

I often hear people say, “the website is down.”  Not only is it usually wrong; it’s also completely uninformative.

Look at the statement a bit. And look at what it doesn’t tell you.

  • The website: What part of the website? The whole site?  Or just the part you were trying to use? Most companies split their websites into applications by domain or URL; but end-users are seldom able to tell you things in that much detail.
  • is down: What do you mean by down? Unreachable? Unbelievably slow? Crashing hard? An apology page? Some images missing? Often this kind of information is key to understanding what happened and how to fix it.
  • (for whom?) The statement says nothing about the scope of the problem. Who was affected?  Was it the same for everyone? If you try to visit coradiant.com from within our offices, you get an “under construction” page.  The reason’s simple: without the “www”, DNS takes users to the domain server.  This would obviously be a key piece of information were it a problem we wanted to diagnose.
  • (for how long?) Also conspicuous in its omission, problem duration is a huge clue. Something that just broke is probably easier to troubleshoot. And something that hasn’t been fixed in a while may be costing us more and more.
  • (repeatably?) People seldom test applications scientifically. Being a geek, I might try several times, with several browsers, before reporting a problem. I’ll check to see if I can surf elsewhere.  I’ll traceroute to the site, and maybe check that I can resolve the DNS properly. Intermittent problems may indicate a load-related issue or one that’s sometimes hidden by load-balancing.

When companies get fed up with hearing “the website is down” and not being able to do anything about it, they often call us. Real User Monitoring means capturing every instance of a website failure, and knowing the things you need in order to fix the problem: What broke, how it broke, for whom, and for how long.
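To make the four questions concrete, here is a small Python sketch that turns a pile of captured failures into "what broke, how, for whom, and for how long." It is an illustration only, not a description of how any particular product works; the `Failure` record, its field names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Failure:
    timestamp: float  # seconds since an arbitrary start
    url: str          # what broke
    kind: str         # how it broke, e.g. "timeout" or "http_500"
    user: str         # for whom

def summarize(failures):
    """Group failures by (url, kind) and record affected users and time span."""
    summary = {}
    for f in failures:
        entry = summary.setdefault(
            (f.url, f.kind),
            {"users": set(), "first": f.timestamp, "last": f.timestamp, "count": 0},
        )
        entry["users"].add(f.user)
        entry["first"] = min(entry["first"], f.timestamp)
        entry["last"] = max(entry["last"], f.timestamp)
        entry["count"] += 1
    return summary

# Hypothetical captured failures
failures = [
    Failure(100, "/checkout", "http_500", "alice"),
    Failure(160, "/checkout", "http_500", "bob"),
    Failure(400, "/checkout", "http_500", "alice"),
    Failure(130, "/search", "timeout", "carol"),
]

report = summarize(failures)
for (url, kind), entry in report.items():
    duration = entry["last"] - entry["first"]
    print(f"{url} {kind}: {entry['count']} failures, "
          f"{len(entry['users'])} users, over {duration:.0f}s")
```

A report like this replaces "the website is down" with something an operator can act on: one URL returning server errors to two users over five minutes, and an unrelated one-off timeout elsewhere.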

Operations vs. Engineering


Wednesday, June 14th, 2006 Posted by: Alistair Croll

Let’s define the two camps.

  • The operations teams are responsible for keeping things running. Their view is that change is the leading cause of problems. A support team member I know has a big sign marked, “What Changed?” on his wall as a constant reminder of this.
  • The engineering teams like change. They’re the agents of change. They get yelled at when they don’t change things fast enough.

So the battle lines are drawn. Operations abhors change in any form. This is true whether they run the infrastructure, the OS, the network, the data center, or a particular application.

At the same time, the engineers want to deploy fixes and enhancements as quickly as possible. They’re limited in their ability to test the changes before they go into production; then they have to throw them “over the wall” into operations.

At this point, lots of things go horribly wrong.

  • The change fails hard, causing operations to reject it.
  • The change doesn’t break, leaving everyone tip-toeing around and seizing on any anecdote that might confirm whatever suspicions they harbor.
  • The change exhibits some strange behavior with whatever monitoring hooks have been put into place. But if the engineers put the instrumentation in, operations insists it’s not valid for a production environment; and if the operations teams are monitoring it, then the engineers say it’s not getting them enough detail.

And of course, the United Nations of marketing sits by, wondering why the conflict is happening and lumping both operations and engineering into the “technical people” bucket.

How do we declare a truce between these two groups?

It starts with consensus before any of the deployments. Very few organizations set expectations before a change, and fewer still agree on metrics, monitoring methods, and thresholds. Any change will have an effect, but if the company is unable to measure that effect, fighting will follow.

Changes will almost certainly have one or more of the following impacts:

  • A change in capacity, where the application can handle more (or fewer) concurrent users.
  • A change in performance, either from the network (bigger content, more objects) or from the application (dynamic pages, back-end queries). This may be a complex effect: Switching to a content delivery network may speed things up for some users and slow down others.
  • A change in operating cost as a result of more or less self-maintenance. A new database might require more administration, or more frequent backups.
  • Additional support costs because the system is harder for end-users to operate or because they need assistance with upgrades.
  • Different availability and reliability because of maintenance window changes, or the Mean Time Between Failure (MTBF) of new components.

Both engineering and operations must realize that change is a constant, but that the impact of changes needs to be verified and tracked. Without expectations set ahead of time, the effects of a change will lead to blame and recrimination. But if a company agrees on what kinds of impact are expected and how to measure them beforehand, we can at least keep the fighting down to regional skirmishes instead of an all-out battle.
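One way to picture "agreeing beforehand" is as a small table of metrics and limits that both teams sign off on, then checked mechanically after the change ships. The Python sketch below assumes each agreed metric is "lower is better" and that the limit is a maximum relative increase. The metric names and numbers are invented for the example.

```python
def check_change(expectations, before, after):
    """Compare before/after measurements against limits agreed ahead of time.

    expectations maps a metric name to the maximum allowed relative increase
    (0.10 means the metric may rise by at most 10%). Baseline values are
    assumed to be non-zero.
    """
    results = {}
    for metric, max_increase in expectations.items():
        change = (after[metric] - before[metric]) / before[metric]
        results[metric] = (change, change <= max_increase)
    return results

# Hypothetical agreement and measurements
expectations = {
    "page_load_seconds": 0.10,
    "error_rate": 0.0,
    "support_tickets_per_day": 0.25,
}
before = {"page_load_seconds": 2.0, "error_rate": 0.02, "support_tickets_per_day": 40}
after = {"page_load_seconds": 2.1, "error_rate": 0.03, "support_tickets_per_day": 44}

results = check_change(expectations, before, after)
for metric, (change, ok) in results.items():
    print(f"{metric}: {change:+.0%} {'within' if ok else 'OUTSIDE'} agreed limit")
```

The arithmetic is trivial. What matters is that the limits were written down before the deployment, so the argument afterwards is about numbers rather than anecdotes.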

A taxonomy of web performance metrics


Wednesday, June 14th, 2006 Posted by: Alistair Croll

I spend a lot of time (more than I probably should) discussing performance with people. And often, disputes are the result of disagreement over terms. Armed with a common vocabulary, adversaries can often find common ground.

There are four main dimensions that I’ve found useful in describing performance metrics. These include whether the metric is symptomatic or diagnostic; first or second order; direct or derived; and network- or server-based. It’s hard to compare different kinds of metrics: A direct metric may disagree with a derived one while they’re both still correct. Here’s a rough outline of what each means.

Symptomatic and diagnostic

A symptomatic metric is one that describes end-user experience. End-to-end page load time or measured availability are symptoms of a problem such as a slow app server or a broken database. On the other hand, metrics like CPU load, queue depth, or RAM usage are diagnostic metrics that may have nothing to do with how users experience the site.

First or second order

A first-order metric can be read straight from the data itself. Measuring latency is an example of this: The delay is right there in the timing information of requests. Availability and traffic volume are two other first-order metrics. By contrast, something like capacity is a second-order metric. To measure it, you need to know two things: load and performance. Capacity, after all, is a statement about how much activity you can handle without violating some kind of service target. User satisfaction is another second-order metric.
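As a rough illustration of why capacity is second-order, the Python sketch below estimates it from paired observations of load and latency. It assumes a simple rule: capacity is the highest observed load level below which latency never exceeded the service target. Real capacity planning is more involved, and the sample numbers are hypothetical.

```python
def estimate_capacity(samples, latency_target):
    """Return the highest observed load up to which latency stays within target.

    samples is an iterable of (concurrent_users, latency_seconds) pairs.
    Load levels are walked in increasing order, stopping at the first level
    whose worst latency exceeds the target. Returns None if even the
    lightest observed load misses the target.
    """
    worst_by_load = {}
    for load, latency in samples:
        worst_by_load[load] = max(latency, worst_by_load.get(load, 0.0))

    capacity = None
    for load in sorted(worst_by_load):
        if worst_by_load[load] > latency_target:
            break
        capacity = load
    return capacity

# Hypothetical observations: (concurrent users, page latency in seconds)
samples = [(100, 0.8), (200, 1.1), (200, 1.4), (300, 2.2), (400, 4.9), (500, 3.0)]
capacity = estimate_capacity(samples, latency_target=2.5)
print(capacity)
```

Neither input is capacity on its own. The number only appears once load and performance are combined with a target.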

Direct or derived

A direct measurement comes from the actual activity — a user’s visit, a web log, a sniffer trace. As the metric gets farther from the actual end-user, it becomes a derived metric. A synthetic test from a particular region to a set of pages is an approximation of what end-users might have experienced.

Network or server

Network performance metrics include round-trip time, out-of-order segments, and retransmissions; higher up the stack they might also include the amount of time that the network contributed to a delay. Server metrics include the aggregate host time, or its component parts (app server, database, and web services). A third element might include SSL time and redirect time; these fall in that fuzzy area of “protocols” that’s neither network nor server.

Using these four dimensions, it’s possible to classify most metrics that operations and engineering teams use to measure the health of an application:

  • Generally speaking, Line of Business audiences will care more about second-order, symptomatic, direct metrics such as actual end-user satisfaction — since they’re the kinds of measurements that trigger phone calls or make users angry.
  • On the other hand, IT teams will be more willing to look at derived metrics (which are less “noisy” because they’re artificial) and diagnostic measurements that get to the root of a problem.
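The four dimensions map naturally onto a small record type. The Python sketch below is one possible encoding. The example classifications and the scoring rule in `likely_audience` are my own rough reading of the two bullets above, not a formal definition, and the metric list is illustrative.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Metric:
    name: str
    symptomatic: bool             # True: symptomatic, False: diagnostic
    order: int                    # 1: first order, 2: second order
    direct: bool                  # True: direct, False: derived
    source: Optional[str] = None  # "network", "server", "protocol", or None

def likely_audience(metric):
    """Rough heuristic (an assumption): two or more of the traits
    symptomatic, direct, second-order suggest a line-of-business audience."""
    score = sum([metric.symptomatic, metric.direct, metric.order == 2])
    return "line of business" if score >= 2 else "IT"

# Illustrative classifications
METRICS = [
    Metric("end-user satisfaction", True, 2, True),
    Metric("CPU load", False, 1, True, "server"),
    Metric("round-trip time", False, 1, True, "network"),
    Metric("synthetic page load time", True, 1, False),
]

audiences = {m.name: likely_audience(m) for m in METRICS}
for name, audience in audiences.items():
    print(f"{name}: {audience}")
```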

Knowing which metrics to use with which audience can make all the difference. If you’re an IT guy, presenting the right data to the line of business makes you a valuable contributor instead of being dismissed as a theory-obsessed numbers freak. And if you’re in marketing, presenting data that includes some forensic detail and useful baselines can make the technical teams listen and act.

Dependency mapping in web applications


Wednesday, June 14th, 2006 Posted by: Alistair Croll

The Configuration Management Database (CMDB) is a map of the components in an IT environment and how they depend on one another. It’s at the core of the Information Technology Infrastructure Library (ITIL), a set of best practices developed by the UK government and broadly adopted in recent years.

Even if you’re not an ITIL shop, you may already be using some of the ITIL directives as a part of the Microsoft Operations Framework (MOF). And various schemas describe specific environments: the Common Information Model (CIM), and related efforts such as the Data Center Markup Language (DCML), attempt to describe elements of an IT environment using a structured format such as XML.

Yikes. Okay, enough acronym soup. CMDB is particularly interesting when applied to a specific environment. In a data center, there are lots of pieces and lots of dependencies. A particular web application might depend on a set of servers, which in turn depend on power supplies and contain CPUs running operating systems.

One of the challenges of managing a web application is answering the question, “what happened?” Many times, a change to one component has far-reaching consequences. Consider a modification to a back-end web service. Lots of other parts of the application are related to that change:

  • The servers on which the service runs
  • The virtual IP addresses (VIPs) that forward traffic to the servers
  • Any TCP ports that provide the service
  • Any web pages that call the back-end service
  • Any transactions that involve that web page in a process
  • Any business processes or user groups that rely on those transactions
  • Contractual obligations affected by those processes or groups
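The chain above is a directed graph, and "what does this change touch?" is a walk over it. Here is a minimal Python sketch, assuming the dependencies have already been recorded somewhere. The component names (`pricing-service`, `server-12`, the pages and contracts) are hypothetical, and the walk is a plain breadth-first search.

```python
from collections import deque

# Edges point from a component to the things that depend on it.
# All names are hypothetical.
DEPENDENTS = {
    "server-12": ["pricing-service"],
    "pricing-service": ["/quote page", "/checkout page"],
    "/quote page": ["request-a-quote transaction"],
    "/checkout page": ["purchase transaction"],
    "request-a-quote transaction": ["online sales process", "partner resellers"],
    "purchase transaction": ["online sales process"],
    "partner resellers": ["reseller SLA"],
}

def impacted_by(changed, dependents):
    """Return everything downstream of `changed`, nearest dependents first."""
    seen = {changed}
    order = []
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in dependents.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order

impact = impacted_by("pricing-service", DEPENDENTS)
for item in impact:
    print(item)
```

Starting from the changed service, the walk reaches the pages, then the transactions, then the business process and user group, and finally the contract. The server it runs on is upstream and correctly left out.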

Unfortunately, most of the dependency mapping tools and schemas have little or no visibility into these dependencies. They tend to tie an application (“E-Mail”) to infrastructure (“server 12”) without breaking the application into services, components, users, or processes.

The “holy grail” of application performance management continues to be an association between the content, application, network, and infrastructure; derived from user or user group experience; measured across performance, availability, and traffic levels.