“Troubleshooting the Stack” as it Applies to Larger Computer Systems
HPC systems are layered in both hardware and software. A useful visual troubleshooting tool is a layer diagram of your system that lists, for each layer, a couple of commands and logs you can use to check that layer's health. Start at the end-user client and traverse down through the layers toward the individual disks in your storage. Think of this as tracking a single block of data as it moves from the client down to the storage blocks on the disk drives.
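One way to keep such a layer diagram next to the system is as a small data structure that can be printed on demand. The sketch below is only an illustration: the layer names, the commands, and the log path are assumptions chosen for a generic Linux-based stack, not a description of any particular system, and the names `Layer`, `STACK`, and `render_stack` are hypothetical. Replace the entries with whatever applies to your own layers.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Layer:
    """One layer of the stack, with the checks used to judge its health."""
    name: str
    checks: List[str] = field(default_factory=list)  # commands to run by hand
    logs: List[str] = field(default_factory=list)    # logs to read


# Hypothetical stack, ordered from the end-user client down to the disks.
# Commands and log paths are examples only; adjust them for your site.
STACK = [
    Layer("client", ["df -h", "mount"], ["/var/log/messages"]),
    Layer("network", ["ping -c 3 <server>"]),
    Layer("file server", ["uptime", "dmesg"], ["/var/log/messages"]),
    Layer("storage controller"),  # vendor CLI checks would go here
    Layer("disk drives", ["smartctl -H /dev/sdX"]),
]


def render_stack(stack: List[Layer]) -> str:
    """Return a plain-text layer diagram, top layer first."""
    lines = []
    for depth, layer in enumerate(stack):
        lines.append(f"{depth}: {layer.name}")
        for cmd in layer.checks:
            lines.append(f"     check: {cmd}")
        for log in layer.logs:
            lines.append(f"     log:   {log}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_stack(STACK))
```

Keeping the diagram as data rather than as a drawing makes it easy to update when a layer is added or a health check changes.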
Most likely, the first hint of a problem will appear on the surface, at the client layer. However, the problem will rarely be located there. This method involves testing for evidence of the problem one layer at a time, working downward. As you do this, you will reach a layer where the evidence of the problem disappears and everything looks healthy. In fact, every layer below it will also look healthy.
The boundary between the last layer where you see the problem and the first layer where you no longer see it is the most likely location of the root problem, and it is where you should focus your analysis. As you fix this area, you will see a ripple effect of health returning to the upper layers.
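The top-down walk described above can be sketched in a few lines. This is a minimal illustration, not a diagnostic tool: the function name `find_boundary`, the layer names, and the `shows_problem` callback are all hypothetical, and in practice the callback stands in for a person running that layer's health checks and reading its logs.

```python
from typing import Callable, Optional, Sequence, Tuple


def find_boundary(
    layers: Sequence[str],
    shows_problem: Callable[[str], bool],
) -> Tuple[Optional[str], Optional[str]]:
    """Walk the layers from top to bottom.

    Returns (last layer showing the problem, first layer that looks healthy).
    Either element is None if no such layer was found.
    """
    last_symptomatic = None
    for layer in layers:
        if shows_problem(layer):
            last_symptomatic = layer
        else:
            # Evidence of the problem has disappeared: this is the boundary.
            return last_symptomatic, layer
    return last_symptomatic, None


# Simulated observations (assumed for illustration): the top three layers
# show evidence of the problem, the bottom two look healthy.
layers = ["client", "network", "file server", "storage controller", "disk drives"]
symptomatic = {"client", "network", "file server"}

print(find_boundary(layers, lambda name: name in symptomatic))
# ('file server', 'storage controller')
```

In this simulated case the analysis would focus on the file server, the storage controller, and the connection between them.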
About the author: Ed Stack, RAID Inc.’s Sr. Storage Engineer, has 34 years of troubleshooting experience on a wide range of computer and electronic systems, from component-level repair of avionics systems to designing high-performance computing (HPC) systems.