Always make sure your own backyard is in order before pointing the finger at other technical teams

How often have you seen IT professionals point the finger at another technical area when a problem arises? In my experience, it happens far too often. How do you think it will look to your peers or, more to the point, your management, if you point the finger at other groups and it turns out to be your problem? To me it wouldn’t look good!

Recently, at one of my assignments, we had a performance issue on Oracle database servers with SAN-connected devices. At certain times during the day, the commit wait time would increase drastically and cause application connections to time out. The problem most closely resembled a SAN or storage-related issue, and the application teams also concluded that storage was the culprit.

On the infrastructure side of the house, we realized the problem could have been with Oracle, the operating system, device drivers, firmware, or, as the application team suspected, the SAN/storage. Each team did its job and opened support tickets with the vendors (EMC, Red Hat, HP and Oracle). Each vendor provided recommendations for resolving the issue, in addition to validating that its own area of responsibility appeared healthy and was functioning properly. Oracle suggested applying a patch and verifying storage health with the SAN/storage team. Red Hat suggested changing the I/O scheduler and verifying storage health with the SAN/storage team. EMC also suggested adjusting the I/O scheduler, along with setting the maximum queue depth for the HBA and disk devices on the affected host system.

In an attempt to address the performance issue, we applied the Oracle patches, adjusted the Linux I/O scheduler and set the maximum queue depth. While those changes were being implemented, we also worked closely with EMC to collect system logs for further analysis. In the end, the root cause turned out to be a recently identified bug in the EMC VPLEX, and we ended up applying new code to resolve the issue.
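
For anyone who wants to confirm what the I/O scheduler and queue depth are actually set to on a host before and after a change like this, here is a minimal Python sketch. It is not what we ran on the assignment. It assumes the usual Linux sysfs layout, where /sys/block/<device>/queue/scheduler lists the available schedulers with the active one in square brackets, and /sys/block/<device>/device/queue_depth holds the queue depth for SCSI disks. The function names and the sysfs_root parameter are made up for illustration, and multipath or vendor-specific devices may expose these values differently.

```python
from pathlib import Path


def parse_active_scheduler(text):
    """Return the scheduler name shown in [brackets], or None if none is marked."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_block_settings(device, sysfs_root="/sys/block"):
    """Read the active I/O scheduler and queue depth for one block device.

    sysfs_root is a hypothetical parameter added so the function can be
    pointed at a test directory instead of the real /sys/block.
    """
    base = Path(sysfs_root) / device
    settings = {"scheduler": None, "queue_depth": None}

    sched_file = base / "queue" / "scheduler"
    if sched_file.exists():
        settings["scheduler"] = parse_active_scheduler(sched_file.read_text())

    # Assumption: only SCSI-style devices expose device/queue_depth.
    depth_file = base / "device" / "queue_depth"
    if depth_file.exists():
        settings["queue_depth"] = int(depth_file.read_text().strip())

    return settings


def main():
    # Report the settings for every sd* device found on this host.
    for dev in sorted(p.name for p in Path("/sys/block").glob("sd*")):
        print(dev, read_block_settings(dev))


if __name__ == "__main__":
    main()
```

Running this on each affected database server before and after a change gives a simple record of what was actually in effect, which is useful evidence when several teams and vendors are comparing notes.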

This is a perfect example of technical teams working together and doing their due diligence to rule out their own supported technology as the potential cause of the problem, instead of taking the easy route and assigning blame to the other technical teams involved.