The Perils Of Problem Management (Root Cause Analysis)

Whether in IT or Business Process Improvement, it is inevitable that serious improvement efforts will embark on some sort of Root Cause Analysis work or program. I am a huge fan of what I call “Real Root Cause Analysis.” The reason I make this name distinction (which I made up myself) is that every so often, managers and employees alike fall into traps when implementing this particular type of process. Whether it is LEAN RCFA (Root Cause Failure Analysis), or ITIL Problem Management, the perils are the same. Here’s a quick list of Perils to avoid when implementing such a process, or even in doing a one-time analysis.

Peril #1: Not getting deep enough in the analysis to determine the actual root cause.
On many occasions I have seen a team or group stop at an ‘easy to identify’ root cause and call it a day. Example: An IT or business department (it’s getting harder to separate the two, by the way) experiences some sort of server failure. In this particular example, let us assume a disk drive failed, which caused the server to be down for some period of time. The drive is replaced, the server is rebooted, and service is restored. Management asks for a Root Cause Analysis to be done, and the responsible parties conclude that the root cause was a failed drive. Their analysis leaves several questions unanswered. In other words, the magic is in the questions, not the answer. They should be asking themselves the following:

How was it that when the drive failed, there was no redundancy?
Is our hardware aging, and should we be looking at the age of these devices?
What was it about the way the system was constructed, that a single drive failure could bring it all down?
Etc.

As you can see in the example above, it’s easy to stop at the hardware failure, and one could make the argument that it was indeed the root cause. However, there are still unanswered questions, and until these are answered, management should not check the box indicating that root cause analysis and, most importantly, PREVENTION has been completed.

Peril #2: Analysis Paralysis: Going too deep.
The opposite of the above scenario is for teams to dive too deep in the analysis, and thus end up with either multiple root causes, or single root causes so big and daunting that they cannot effect change. I have personally witnessed 500+ hours being put into a single root cause analysis, for a single system failure. Granted, the system was complex, as was the organization. However, it is not reasonable for so many man-hours to be spent without any resolution. In this 500-hour example, the team used Fishbone diagrams, 5 Whys analysis, and a myriad of other root cause analysis tools, but the end result was simply a lot of confusing documentation of the issue, with no conclusive root cause identified and, just as importantly, no clear recommendations for management action or investment.

Peril #3: Going after the wrong problems
In any business or IT function, there is no shortage of opportunity in terms of root cause analysis. So how do you know what to tackle? For me, the answer is simple: those items that had the greatest impact. And so then, how do you know what had the biggest impact? Is it a measurement of downtime? Is it the business unit that screams the loudest? Is it the issue that is most politically visible? Most people would agree that the main determination of impact is cost. However, organizations, and specifically IT, have been very poor at calculating cost as a function of downtime. They typically use crude metrics such as number of minutes down, or number of help desk tickets called in, or some measure that was probably useful in 1995.

My experience has been that the “Real Business Impact” of system downtime is rarely understood, and even more rarely actually calculated. The primary reason for this failure is that getting this information is deemed to be too difficult. How does IT calculate Business Opportunity Cost? How would one figure out that Server Outage A costs more than Server Outage B for the same period of time? I submit that this information is NOT difficult to get, and that it is furthermore required, not just nice to have, to run an efficient root cause analysis process. I have studied a particular organization’s inventory of Problem (root cause) investigations and, through the use of some Six Sigma tools, was able to determine that 90% of the business impact for a year’s worth of outages was caused by 5% of the critical systems. This ability to focus on cost is critical to sort through the volume of problems.
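
To make the idea of ranking problems by cost a little more concrete, here is a minimal sketch in Python. It is not the Six Sigma analysis from the organization mentioned above; it is a simple Pareto-style ranking built on assumptions. The system names, outage hours and cost-per-hour figures are all invented for illustration, and it assumes someone in the business has already estimated what an hour of downtime costs for each system.

```python
from collections import defaultdict

# Hypothetical outage log: (system, hours down, estimated business cost per hour).
# Every name and number here is made up for illustration.
OUTAGES = [
    ("order-entry", 4.0, 12000.0),
    ("order-entry", 1.5, 12000.0),
    ("intranet-wiki", 10.0, 200.0),
    ("email", 3.0, 1500.0),
    ("reporting", 6.0, 400.0),
]


def cost_by_system(outages):
    """Sum the estimated cost (hours down * cost per hour) for each system."""
    totals = defaultdict(float)
    for system, hours, cost_per_hour in outages:
        totals[system] += hours * cost_per_hour
    return dict(totals)


def pareto(totals):
    """Return (system, cost, cumulative share of total cost), costliest first."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    rows, running = [], 0.0
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    for system, cost in ranked:
        running += cost
        rows.append((system, cost, running / grand_total))
    return rows


if __name__ == "__main__":
    for system, cost, cumulative in pareto(cost_by_system(OUTAGES)):
        print(f"{system:15s} {cost:>10,.0f} {cumulative:7.1%}")
```

In this made-up data, the intranet wiki has the most hours of downtime but the smallest estimated cost, while the order-entry system accounts for most of the total. A minutes-down metric would rank these two systems the other way around.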

There are more perils. Be sure to follow my blog to learn more next time…

- By Hayyal Ighneim, Manager, IT Consulting
