The Troubleshooting Procedure

Some time ago, my then employer decided to outsource most functions of the department to which I belonged, including my own position. As a result, I was tasked with handing over my responsibilities to the outsourcing company’s employees.

I was not unhappy about this turn of events, as I received a generous redundancy package, and there were plenty of other interesting challenges awaiting me in the IT industry. So I took my handover responsibilities seriously, and attempted to perform them to the best of my ability. I also made friends with several of the outsourcing company’s employees, so I wanted to help them as best I could.

One of my responsibilities was to maintain (almost single-handedly) a customer-facing web site implemented by a large, hideous mess of almost unmaintainable Perl code. The system had been written years earlier by someone learning to develop software on the job. It incorporated many popular anti-patterns, including every mistake that beginner programmers typically make, especially large amounts of copy’n’paste programming, but also much disgusting craziness unique to that system. It even invited its end-users to mount SQL injection attacks on it.

Large sections had not been touched in years, even decades, apart from band-aid fixes piled on top of earlier band-aid fixes. The IT euphemism for this is a “mature system”. There was almost no up-to-date documentation; what little useful documentation did exist had been recently written by me based on time-consuming inspection of the Perl code. There was not even a development, test or certification environment – changes were made on the live system, with the inevitable customer-visible outages.

I frequently received reports of problems from the unfortunate users of the system, describing (sometimes semi-coherently) failures in areas of the system which neither I nor any of my technical colleagues had previously known existed. We certainly did not know how that functionality was implemented, or why it had recently ceased working.

Usually, the only way forward was to investigate the system from first principles, by inspecting code and data and completely disregarding nonexistent, out-of-date or incorrect documentation. This was difficult and time-consuming, but with perseverance and sheer bloody-minded determination, I eventually managed to fix every problem.

During the handover of this system, the outsourcing company’s employees repeatedly asked me to provide a procedure to fix problems in it. They appeared to be particularly procedure-oriented, and not as focussed on seat-of-the-pants flying as I had been forced to become when troubleshooting this system. More importantly, they seemed unable to adjust how heavily they relied on procedure as circumstances changed.

I kept explaining that:

  • There currently existed no such detailed, one-size-fits-all troubleshooting procedure for this system;
    and that
  • If such a procedure could have been created, I would already have created it, quite likely in the form of a Perl script automating the problem resolution, so that they could have read that script or its documentation rather than emailing me;
    but that
  • No such procedure could be created and given to them, because the root causes and the necessary fixes were radically different each time, due to the size and poor quality of the system and the sheer number of distinct failure modes;
    and that
  • The only way out of this mess was refactoring and outright re-design and re-implementation of large parts of the system, requiring a significant investment, which I did not anticipate occurring.

They seemed not to believe me – they seemed to think I was withholding useful information from them, or perhaps they simply refused to believe that the system was quite as much of a mess as it in fact was. So they kept nagging me for documentation of this supposed “troubleshooting procedure”.

So in the end, somewhat in desperation, I wrote a Word document entitled “<name of system> Troubleshooting Procedure”, which contained this text (almost verbatim) and nothing else:

  1. Find out what the system is supposed to do.
  2. Find out what the system is actually doing.
  3. Compare and contrast 1 and 2.
  4. Develop a set of changes to the system which you believe would cause it to cease behaving like 2 and begin instead behaving like 1.
  5. Apply the changes developed in 4.
  6. Verify that the system now behaves like 1 rather than 2.
  7. Ask the users of the system whether they are now satisfied.

To my amazement and amusement but also horror, the outsourcing company read this useless Word document, approved of it, and added it to their archive of handover documentation on the company file server, then stopped nagging me for procedure documentation!

I wonder to this day whether some poor employee of the outsourcing company ever attempted to apply that document. I would laugh or cry (or both) if I went looking for a procedure to solve some critical customer-facing problem, and found only that crappy Word document.

Lessons learnt:

  1. Some people are so procedure-oriented that they cannot operate in the absence of a written procedure – even a useless or obvious one like that above;
  2. Rather than attempting to convince someone that what they ask for cannot be created or is not in fact necessary, it is faster and easier to just do the best that one can, even when one knows that the result is useless. This avoids the potential for argument, and forces the requestor to be specific about what they want; the more specific they are, hopefully the more obvious it will become to all concerned that what they want cannot be supplied.
