Avoiding Nightmare PACS Outages
Preparation is the difference between unexpected PACS downtime and a nightmare, Michael D. Toland told his audience in Seattle on May 17 at the 2008 annual meeting of the Society for Imaging Informatics in Medicine. Toland, who is PACS administrative team manager for the University of Maryland Medical System, Baltimore, presented PACS Worst Case Scenarios: Understanding the Implications of Major Downtimes and Avoiding Them.

It is vital to understand the effect that PACS downtime has on the entire enterprise, not just on the operation of a single department or service. It is particularly important to note the clinical, business, and risk-management implications of PACS downtime, Toland says, especially as PACS use spreads further beyond radiology. The effects of downtime should be limited in advance through the development of policies for communication and for escalation of staff intervention in the event of system failure. Of course, procedures should also be established to reduce downtime in the first place. The development of strong relationships with vendors is a valuable step in this direction.

The Cascade of Failure

Toland recalls that his horror story began when he had been working for just seven months as a PACS administrator at a 250-bed community hospital that performed about 70,000 exams per year. Toland had come to the hospital from an IT background and had no previous experience in a hospital setting. The hospital's IT group, which had been active in the PACS acquisition, was still providing support for the system's hardware. When one of the system's storage-array controllers began generating errors, Toland called the vendor for support and was told that the controller should be replaced. Because the second controller remained in operation, and it was adequate to handle the demands placed on the whole system, there was no disruption in PACS operation.
This allowed the replacement of the failing controller to be scheduled conveniently, with appropriate data backups preceding it. Because the PACS hardware was covered by a service contract, no expense for the replacement was anticipated. In case of unexpected difficulties, Toland scheduled the replacement for a Tuesday, after hours, when PACS use was likely to be lightest. After the controller had been replaced, its working status and correct configuration would be verified.

Unfortunately, Toland's careful planning was to no avail. Without his knowledge, the vendor's support technician arrived at the hospital at 10 am on Friday, having decided not to wait until the following Tuesday to replace the controller. The hospital's IT staff gave the technician access to the data center without informing Toland. When the bad controller was replaced, before the planned backups had been completed, its blank configuration overwrote the configuration that had been in place on the good controller. Because all array configuration was lost, all PACS data disappeared.

Because Toland's scheduled full backup had not yet been made when the controller was replaced, restoration of the PACS data had to be based on the full backup of the previous week plus the daily differential backups. The most recent of those backups had been made nine hours before the system failed, so data from exams performed during that interval had to be restored by resending them from the modalities and the RIS to the PACS. The PACS had to remain entirely down during a from-scratch rebuilding of the storage array, which took eight hours. Data restoration from backup tapes took 27 hours, and 24 hours were required to complete study reconciliation. The outage affected not only the community hospital, but also an outpatient facility, a large medical center, and a regional trauma center, all linked to the PACS.
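The restore logic behind this kind of recovery can be sketched as follows. This is a minimal illustration of a full-plus-differential scheme, not Toland's actual tooling; the function name and timestamps are hypothetical. Because a differential backup holds every change since the last full backup, a restore needs only the most recent full backup and the single newest differential taken after it; anything written after that differential must be resent from the source systems (here, the modalities and the RIS).

```python
from datetime import datetime

def plan_restore(full_backups, differential_backups):
    """Pick the restore set for a full-plus-differential backup scheme.

    Returns the most recent full backup, the newest differential taken
    after it (or None), and the point in time from which data must be
    recovered by resending studies from the source systems.
    """
    last_full = max(full_backups)
    later_diffs = [d for d in differential_backups if d > last_full]
    last_diff = max(later_diffs) if later_diffs else None
    # Anything newer than this timestamp is not on tape at all.
    resend_from = last_diff or last_full
    return last_full, last_diff, resend_from

# Hypothetical timeline mirroring the article: a weekly full backup,
# nightly differentials, and a failure nine hours after the last one.
fulls = [datetime(2008, 5, 9, 22, 0)]
diffs = [datetime(2008, 5, d, 22, 0) for d in range(10, 16)]
full, diff, resend_from = plan_restore(fulls, diffs)
print(full, diff, resend_from)
```

Note the design consequence: only two tape sets are read back (one full, one differential), but the gap between the last differential and the failure, nine hours in Toland's case, always has to be refilled from the modalities and the RIS.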
More than strictly PACS functions were affected by the unscheduled downtime, and by the lack of access to prior studies that lasted until data restoration was complete. Clinical care was affected, with consequences for the facility's risk management, and business operations were compromised by the lack of PACS availability.

Becoming Prepared

Because all hardware is capable of failing, PACS failures are inevitable, Toland says. While the failures themselves are beyond the facility's control, their impact can be controlled with adequate planning and good policies. When unscheduled downtime takes place, rapid response is vital; the problem must be identified quickly in order to reduce its overall effect on the PACS (and, therefore, on the enterprise).

PACS users should be notified promptly when an unplanned event brings down the system, both to allow them to change their routines during the outage and to reduce the number of telephone calls that would otherwise swamp the PACS staff. In addition to informing PACS users, staff must let administrators know what has happened so that they can act to reduce the business impact of the outage. Strong escalation policies should be in place; during system troubleshooting, for example, certain trigger events should produce automatic notification of more experienced personnel, with higher-level staff being contacted as the situation becomes more severe or widespread.

Preparation for downtime (and prevention of unexpected outages) should be conducted from three angles, Toland says. From the business perspective, vendor relationships and service contracts should be addressed so that reliable support will be available when the need arises. From the operations perspective, policies, procedures, and staffing levels should be kept up to date to ensure efficient performance. From the IT perspective, all subjects relating to systems architecture should be dealt with in advance to prevent unpleasant surprises.
Systems and interfaces must be tested frequently to detect problems early; this proactive monitoring, Toland explains, should be based on a thorough awareness of the warning signs that might indicate impending system failure. At the same time, redundant architecture must be used to make the failure of any single piece of hardware invisible to the PACS user. If systems are distributed and enough redundancy has been built in, potential system outages will become component outages instead; ideally, most PACS users will never even know that they took place.
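Redundancy-aware monitoring of the kind Toland recommends can be sketched as a health check that treats a degraded redundant group as a warning sign rather than a non-event. The component names and probe functions below are hypothetical; in practice the probes would wrap real checks such as controller status queries or interface pings. The key point from the story is that a failed member of a redundant pair leaves users unaffected while removing all margin for the next failure, so "degraded" should itself trigger escalation.

```python
def check_redundant_group(name, probes):
    """Probe each member of a redundant group and classify the result.

    The group is 'up' as long as at least one member responds, but any
    lost member is reported as 'degraded' so that it is treated as an
    early warning instead of being silently absorbed by the redundancy.
    """
    healthy = [probe() for probe in probes]
    up = sum(healthy)
    if up == len(probes):
        return (name, "ok")
    if up > 0:
        return (name, "degraded")  # users unaffected; escalate anyway
    return (name, "down")

# Example mirroring the article: one of two storage-array controllers
# is generating errors while the other still carries the full load.
status = check_redundant_group(
    "storage-array controllers",
    [lambda: True, lambda: False],
)
print(status)
```

Run on a schedule against every redundant group, a check like this turns would-be system outages into reported component outages, which is exactly the outcome Toland describes as the goal of redundant architecture.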