Why are IT systems so unreliable?

By Tony Lock

A recent report by Freeform Dynamics shows that IT systems fail. What's more, a quick glance at the chart below shows they fail far more frequently than one might expect in today's 'high availability' environments.

Such lack of resilience might be surprising if one just went by the content of press releases from many of the leading IT vendors, especially those in the virtualization markets.

So what's going wrong and who needs to step up and take responsibility? Are IT pros and business analysts getting something wrong or are the IT vendors selling lots of pups?


As we can see from the chart above, all of the components essential for application delivery are prone to failure. Most frequently cited are software failures, followed in second place by network failures or performance degradation, with the failure of physical components trailing in third place.

Despite much public beating of chests, power outages and brownouts have yet to cause application failure as often as any of the other three areas addressed. The inference is that software, hardware and network failures account for the vast majority of system interruptions.
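To see why a chain of failure-prone components matters so much, consider a rough back-of-the-envelope calculation (the availability figures below are hypothetical, chosen only for illustration, and are not drawn from the report). When an application depends on software, network and hardware layers in series, its end-to-end availability is the product of the individual figures, so every extra layer erodes uptime.

```python
# Illustrative only: hypothetical availability figures, not taken from the report.
# For components in series, end-to-end availability is the product of the
# individual availabilities, so each failure-prone layer erodes overall uptime.

HOURS_PER_YEAR = 24 * 365

components = {
    "software": 0.999,   # hypothetical 99.9% availability
    "network":  0.999,
    "hardware": 0.999,
}

end_to_end = 1.0
for availability in components.values():
    end_to_end *= availability

print(f"End-to-end availability: {end_to_end:.4%}")
print(f"Expected downtime per year: {(1 - end_to_end) * HOURS_PER_YEAR:.1f} hours")
# Roughly 99.70% overall, or about 26 hours of downtime a year, even though
# each individual component on its own accounts for only about 8.8 hours.
```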

One thing that is clear is that service disruptions are also triggered by human intervention. The figure below highlights that while hardware, software and networking considerations are important in ensuring service availability, it is essential that operational management processes and practices are also suited to the quality of service desired for each application.
Frankly, if the people and process side of system availability is not addressed, the chances are that systems will fall over, probably time after time.

This all raises the question: in which area should an organization seek to add resilience to its application delivery?

Is it a question of shoring up the software side of things, acquiring better hardware or putting more resilient networking in place?

Naturally enough, in an ideal world all three factors would receive adequate attention long before an application is due to enter live service, allowing its requirements for performance, availability and recovery/protection to be properly met.

Alas, as we know all too well, this is the real world where, as the figure below shows, service availability is rarely considered early enough in projects.

If we consider the area of employing software-based solutions to enhance application availability, a lot of attention, especially from the vendor and channel community, is focused on applying a number of virtualization technologies to the problem. It is fair to say the many flavors of virtualization currently on offer, especially in the x86 server space, are being promoted as the answer to delivering greater service availability.

A closer look at such offerings, essentially any of the operating system/hypervisor virtualization technologies or one of the various application virtualization solutions available, highlights that the simpler solutions mostly deliver faster recovery after failure rather than preventing failure in the first place.
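A minimal sketch makes the distinction concrete. The watchdog below (with a hypothetical host name and service unit, standing in for whatever restart or failover action a given platform exposes) only reacts once the application has already stopped responding; it shortens the outage, but the outage still happens.

```python
# A minimal sketch of restart-on-failure automation. APP_HOST, APP_PORT and
# "myapp.service" are hypothetical placeholders; a real virtualization platform
# would expose its own restart or failover action. The point: this style of
# automation shortens recovery time, it does not prevent the failure itself.
import socket
import subprocess
import time

APP_HOST = "app.example.internal"   # hypothetical application endpoint
APP_PORT = 8080
CHECK_INTERVAL_SECONDS = 10

def is_healthy(host: str, port: int) -> bool:
    """Crude health probe: can we open a TCP connection to the application?"""
    try:
        with socket.create_connection((host, port), timeout=3):
            return True
    except OSError:
        return False

def restart_app() -> None:
    """Placeholder recovery action; here, restarting a hypothetical service unit."""
    subprocess.run(["systemctl", "restart", "myapp.service"], check=False)

def watchdog() -> None:
    while True:
        if not is_healthy(APP_HOST, APP_PORT):
            # By the time this branch runs, users have already seen the failure;
            # all the automation can do is bring the application back quickly.
            restart_app()
        time.sleep(CHECK_INTERVAL_SECONDS)

if __name__ == "__main__":
    watchdog()
```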

A good first step, perhaps, but it is only now that the effective management of virtualized solutions is beginning to offer the high levels of proactive availability that established platforms such as the mainframe manage with ease. It is also worth noting that these virtualization solutions should not overshadow the need to ensure that the application itself is written in a way that enhances availability and can interoperate with new virtualization solutions to raise service availability.
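What "written in a way that enhances availability" means will vary by application, but one common habit is handling transient dependency failures gracefully instead of surfacing the first error to users. The sketch below shows one such pattern, retry with exponential backoff; the function name and URL are hypothetical, used only to illustrate the idea.

```python
# A minimal sketch of one application-level availability habit: retrying a
# transient dependency failure with exponential backoff and jitter.
# fetch_orders() and its URL are hypothetical examples.
import random
import time
import urllib.error
import urllib.request

def fetch_orders(url: str = "http://orders.example.internal/api/orders",
                 attempts: int = 4) -> bytes:
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                return response.read()
        except (urllib.error.URLError, TimeoutError):
            if attempt == attempts - 1:
                raise  # give up only after several tries
            # Back off exponentially, with jitter, so retries do not pile up
            # on a dependency that is already struggling.
            time.sleep((2 ** attempt) + random.random())
```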

As has been mentioned, over the course of the last few years software systems, especially virtualization offerings, have rather stolen the limelight when it comes to adding resilience to applications. However, it is still a fact that hardware platforms designed with application availability in mind can also deliver great value.

These solutions are frequently overlooked because there is now a tremendous volume of marketing claiming that industry-standard systems are good enough. For many applications this is true, but if you need to go the extra step along the road of application resilience, fault-tolerant or fault-resistant server and storage hardware deserves consideration, especially where keeping applications running, rather than simply being able to restart them quickly after failure, is the issue.

A serious question needs to be asked of the vendors: do they really understand what is important for organizations when it comes to application availability? Or do many of them actually believe that 'virtualization' is the marketing storm that solves all problems at once?

At the moment it appears the latter is the case. And beware: cloud computing may soon be proffered as the next answer to application availability, life, the universe and everything.

Tony Lock is program director at Freeform Dynamics. He contributed this article to ZDNet Asia's sister site, Silicon.com.