Abstract
An unexpected downtime in an electronic health record (EHR) system can endanger patients and result in lost or compromised data. This article examines a downtime of the National Institutes of Health Clinical Center's EHR system. It presents the review methodology, a description of the event, the lessons learned, and the process and policy changes that can improve the facility's response and smooth outage procedures.
Introduction
Electronic Health Record Downtime
The electronic health record (EHR) has many integrated components, from the data center to the end user. To ensure that the system is available to all users, every component must be functioning properly. Any failure along this path can affect performance or accessibility. This makes it important to ask not whether a downtime will occur but when, and how it will affect the organization and patient care.
Any unexpected outage of a clinical software system can have an immediate impact on patient safety. Information may be lost or compromised, and care providers are forced to record patient data manually. During a downtime, for example, orders may be handwritten and transcribed manually into a system, and test results are delivered to the floor by messenger or reported by telephone. An electronic order form provides guidance that helps ensure critical information, such as allergies or medication lists, is not missed; paper processes lack that guidance. Other effects of a system outage include disruptions to the clinical workflow, restricted access to patient information, increased patient waiting times, overtime hours for staff to enter data, and frustration among staff and patients.
It is essential to implement policies, procedures, and processes that allow the IT staff to recover from downtimes while the clinical staff continues to provide care. It is also important to put processes and controls in place to minimize the possibility of a similar outage in the future.
Clinical Research Information System
The National Institutes of Health Clinical Center (NIH/CC) is often referred to as “America’s research hospital.” The NIH/CC is devoted exclusively to clinical research and provides services and care to patients, principal researchers, and research teams who are part of the NIH Intramural Program. The NIH/CC has an EHR called the Clinical Research Information System (CRIS). CRIS is used by approximately 3,200 users and interfaces with most ancillary software (e.g., laboratory information systems, radiology information systems). The Department of Clinical Research Informatics (DCRI) follows a project management methodology based on the fifth edition of the Project Management Institute’s Guide to the Project Management Body of Knowledge.
Methodology
A lessons-learned review is a process that identifies and evaluates experiences in an organization. An after-action review, a change management process, or another continuous quality improvement method can be used to identify key events that could be improved. Documenting and collecting lessons learned after an event supports organizational improvement. Every event can teach something, whether it is large or small, planned or unplanned. The exercise is designed to determine what went well and where improvements could be made. The resulting lessons can inform future planning or become part of standard processes.
The team involved in the event is brought back together afterward to discuss lessons learned from each member's perspective. The facilitator's role is to keep the meeting moving in a positive direction while ensuring that no comments are dismissed. This is a retrospective review, and the emphasis is on learning from the past. The process should avoid finger-pointing or assigning blame. It is important to be fair and honest, regardless of who was involved.
During the meeting, the facilitator may have to rephrase certain comments to make sure that they are well understood, have a positive tone, and state their impact clearly. Years later, lessons-learned documents may be reviewed by staff who were not present at the original event, so including the impact helps define the context.
Original Comment: Since I didn’t hear from Jane, I didn’t know when I needed to be there or what was happening.
Rephrased Comment: Sending regular status updates, including an estimated timeline, to all affected staff during the event keeps everyone aware of the progress and schedule.
Each documented lesson should describe what the lesson is or how it will be used, along with its importance or effect. The prolonged downtime of the NIH/CC's EHR led to many lessons that were applied across multiple processes. Both technical staff and users have noticed the improvements.
The Event
The NIH/CC EHR went down unexpectedly on May 13, 2010. A hardware failure corrupted both the primary and backup databases. The event caused a sudden and unexpected loss of access to clinical information for all patients at the Clinical Center, which could have had a negative impact on patient safety and care.
A conference line was established, and the key resources needed to help were brought in. Several attempts to restart the system were made over the following hours, but without success. A replacement part was ordered to resolve the hardware problem. The system was then restored by using the latest trusted backup, transaction logs, and the log with the corrupted data.
The team also restored the database, with the most recent transaction log, to a different server so that the two copies could be compared to determine whether any data were missing and would need manual reentry. Users were given reports to help them complete the missing data entry.
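The comparison step can be pictured as a set difference over record keys. The following is a minimal sketch under stated assumptions, not the team's actual tooling: it assumes the copy on the separate server is the more complete reference, and the record layout, the `(patient_id, order_id)` key, and the name `find_missing_records` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical record key: (patient_id, order_id). The real schema is not
# described in the article.
RecordKey = Tuple[str, str]


def find_missing_records(
    reference: Dict[RecordKey, dict],
    recovered: Dict[RecordKey, dict],
) -> List[RecordKey]:
    """Return keys present in the reference copy but absent from the recovered system."""
    return sorted(key for key in reference if key not in recovered)


# Illustrative data only.
reference_copy = {
    ("P001", "O-10"): {"item": "CBC"},
    ("P001", "O-11"): {"item": "Chest X-ray"},
    ("P002", "O-12"): {"item": "Metformin 500 mg"},
}
recovered_system = {
    ("P001", "O-10"): {"item": "CBC"},
}

missing = find_missing_records(reference_copy, recovered_system)
for patient_id, order_id in missing:
    # Each line stands in for one row of a reentry report given to users.
    print(f"Re-enter order {order_id} for patient {patient_id}")
```

A report built this way lists only what has to be reentered, which is the kind of aid the reports to users were meant to provide.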
The interface messages from all ancillary systems were then replayed, starting from the moment the system went down. Once all components had been validated, the system was handed over to end users, 33 hours after the failure.
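Replaying interface messages amounts to selecting every queued message sent at or after the failure time and delivering them in chronological order. The sketch below illustrates that idea only; the `InterfaceMessage` type, the `replay_since` function, and the clock times are hypothetical, since the article gives the date of the failure but not the time or the message format.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class InterfaceMessage:
    source: str        # hypothetical ancillary system name, e.g. "lab"
    sent_at: datetime  # when the ancillary system sent the message
    payload: str


def replay_since(
    messages: Iterable[InterfaceMessage],
    failure_time: datetime,
    deliver: Callable[[InterfaceMessage], None],
) -> int:
    """Deliver, in chronological order, every message sent at or after failure_time."""
    pending = sorted(
        (m for m in messages if m.sent_at >= failure_time),
        key=lambda m: m.sent_at,
    )
    for message in pending:
        deliver(message)
    return len(pending)


# Illustrative time of day; only the date comes from the article.
failure = datetime(2010, 5, 13, 9, 0)
queue = [
    InterfaceMessage("lab", datetime(2010, 5, 13, 8, 55), "result A"),
    InterfaceMessage("radiology", datetime(2010, 5, 13, 10, 30), "report B"),
    InterfaceMessage("lab", datetime(2010, 5, 13, 9, 15), "result C"),
]
delivered: List[str] = []
count = replay_since(queue, failure, lambda m: delivered.append(m.payload))
print(count, delivered)
```

Sorting before delivery matters because messages from different ancillary systems may sit in the queue out of order, and later messages can depend on earlier ones.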
During the event, the existing downtime policies were adhered to. The policies covered the documentation and communication for results, orders and vital signs as well as medication administration and other clinical documentation. The re-entry of the data collected during the downtime took more than a week. The departmental leadership and the quality assurance officers met daily to address all concerns about data reentry.
The Environment at the Time of the Event
The NIH/CC has a primary data center and a secondary data center, both with redundant power, cooling, network, and fiber connections. This allows production systems to run on servers at either location. EHR continuity plans were available, but they did not cover the scenario in which both the primary and the secondary (replicated) databases are corrupt.
The continuity plan was regularly reviewed, but a full test had not been performed because of the difficulty of coordinating an interruption with users. Although the actual failover to the remote storage area network (SAN) had never been tested, other elements of the EHR's business continuity plan were exercised regularly as part of normal business operations. For example, restoring databases and creating new environments were common tasks in the release management process and failover services.
Lessons Learned
Technical and organizational processes should be reviewed and updated where necessary. After the downtime, meetings on organizational and technical lessons learned were held to discuss what went well and where improvements could be made. These meetings led to the creation of multiple committees that promote improvements to downtime processes.
Both executive leadership and the user community were involved in these process improvement initiatives. All participants agreed that communication is a critical area for improvement. It must be timely and effective.
A multidisciplinary Downtime Communication, Education, and Training (DCET) committee was formed to address communication, education, and training regarding downtime processes and procedures. The DCET committee met regularly for an entire year to improve communication and educate staff on downtime processes. The committee now meets bi-annually to review the downtime policies and the contents of the toolkits and forms and to determine whether any changes are needed.
Process Improvement from Lessons Learned
Communication
During the downtime, announcements on overhead screens were made in the hospital. E-mails were also sent to the community with updates on the status of the system. The executive leadership set up a command center to provide centralized communication among the technical team, stakeholders, and the executive leadership. Technical and clinical staff made rounds in the patient care areas and ancillary departments, meeting with stakeholders, explaining the downtime procedures, and sharing the current status and plan of action. Minghella notes that it is important for staff to see the leadership team's support when leaders visit the patient care areas.
Downtime Policy Revision
The activities that support unplanned downtimes were reviewed. The policy was updated to include processes and actions that stakeholders identified as useful during a downtime. Subject matter experts updated the downtime communication templates to ensure that key information was captured if it was not already included. These templates address downtimes of various types in a standard format and identify both user and system impacts.
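A standard-format communication template can be represented as a fixed set of fields rendered the same way for every downtime type. The sketch below is purely illustrative: the field names, the wording of the notice, and the `DowntimeNotice` class are assumptions and do not reproduce the NIH/CC templates, of which the article says only that they use a standard format and identify user and system impacts.

```python
from dataclasses import asdict, dataclass
from string import Template

# Hypothetical layout; every field name below is an assumption.
NOTICE_TEMPLATE = Template(
    "EHR DOWNTIME NOTICE: $downtime_type\n"
    "Affected system: $affected_system\n"
    "User impact: $user_impact\n"
    "System impact: $system_impact\n"
    "Estimated resolution: $estimated_resolution\n"
    "Next update: $next_update"
)


@dataclass
class DowntimeNotice:
    downtime_type: str         # e.g. "Unplanned" or "Planned"
    affected_system: str
    user_impact: str
    system_impact: str
    estimated_resolution: str
    next_update: str

    def render(self) -> str:
        """Fill the standard template; raises KeyError if a field is missing."""
        return NOTICE_TEMPLATE.substitute(asdict(self))


notice = DowntimeNotice(
    downtime_type="Unplanned",
    affected_system="CRIS",
    user_impact="Use paper downtime forms for orders and documentation.",
    system_impact="Interfaces from ancillary systems are queued.",
    estimated_resolution="To be determined",
    next_update="In 60 minutes",
)
text = notice.render()
print(text)
```

Making every field required means a notice cannot be sent without stating both the user impact and the system impact.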
Downtime Toolkit
Paper downtime forms were available on the Health Information Management Department website, but they were not grouped in a way that let users find them easily. Because most documentation is completed in the EHR, the forms had not been reviewed or updated in years.
The idea of creating a downtime toolkit was discussed during a DCET meeting. The Clinical Information Management Committee reviewed and revised the paper downtime forms. The downtime toolkit consists of a plastic bin with folders organized by category, including orders, progress notes, and nursing care, with the forms for each category.
Separate toolkits were developed for inpatient units and outpatient clinics, and all patient care areas received them. The Health Information Management Department can replenish forms and maintain the toolkits, but each area is expected to restock its own kit as necessary. Quarterly downtime drills have been scheduled to continue staff education and assess readiness.
Prepare to Handle Incidents
A DCRI incident response group was formed to respond to unplanned system failures. A communication plan was created for the activities to be performed before, during, and after an incident. As pre-incident activities, section supervisors identify the on-call designees for their teams and create downtime contingency plans, disaster recovery plans, and checklists. The documentation is regularly reviewed and stored in a central location.
During an incident, the Systems and Monitoring team notifies the on-call DCRI executive, contacts the appropriate teams, informs the user community and provides an estimated time for resolution by e-mail and overhead page, and opens a conference telephone line. The Project Management Office records the major unplanned outage as a series of tasks within a project plan. The DCRI executive informs the various levels of NIH/CC leadership of the severity of the downtime.
After an incident, the Systems and Monitoring team informs the user community of the system's availability by e-mail and overhead page, a technical contact records the downtime, and the Project Management Office creates an incident report that is reviewed by the Technical Review Board. Finally, the Technical Review Board compiles a root cause analysis and presents it to the Chief Information Officer.
Manage System Availability
System availability of 99.999 percent is desirable from the perspective of users and executives. The current target is 99.9 percent, which allows approximately nine hours of downtime each year across all EHR components. Executive leadership should review and set the target system availability, including the associated risks and controls. The NIH/CC can accept a target of 99.9 percent because it has no emergency department or labor and delivery department.
The extended downtime at the NIH/CC demonstrated that system availability management is a critical component of all operations and projects. The EHR continuity plan is also reviewed before any planned change to the system. Upgrades and new modules are managed through the project management process, while system updates and changes go through a formal change management process.
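The relationship between an availability target and its annual downtime allowance is simple arithmetic. The short sketch below works it out for the three targets mentioned in this article; the function name is illustrative, and a 365-day year is assumed.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability_percent: float) -> float:
    """Hours of downtime per year permitted by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    hours = annual_downtime_hours(target)
    print(f"{target}% -> {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

A 99.9 percent target allows about 8.76 hours per year, consistent with the figure of approximately nine hours. Each additional nine cuts the allowance by a factor of ten, to roughly 53 minutes at 99.99 percent and roughly 5 minutes at 99.999 percent.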
Backup Systems
Reviewing the continuity plan revealed that additional backup systems were needed for both planned and unplanned disruptions. The Clinical Center required two new environments: a read-only environment and a warm site. The read-only environment would be available during scheduled downtimes and could also be used in the event of an unscheduled outage, depending on its expected duration.
A project team was formed to outline the technical options available for the read-only environment. Each option was evaluated and documented, including a description of its effects, advantages, and disadvantages. In the end, the team decided to use log shipping for the short term and to evaluate mirroring or other technologies in the future. The read-only system and its activation and restore process are validated with each upgrade of the EHR system.
The criteria for activating the read-only system were based on how long it takes to make the system available to users, which is approximately 90 minutes. The system is therefore used for scheduled downtimes longer than two hours and for unscheduled disruptions that could last that long.
The EHR team's decision to visually differentiate the read-only (view-only) system from the production EHR had a major impact on end users. The icon was the same but was rendered in black and white and given a new name. It was configured to become visible to end users only once the tasks required to activate the system had been completed.
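Log shipping keeps a secondary copy current by periodically backing up the primary's transaction log, copying that backup to the secondary, and restoring it there in sequence. The toy in-memory model below illustrates the idea only; it is not SQL Server code, and the `Primary`, `Standby`, and `ship_logs` names are hypothetical.

```python
from typing import Dict, List, Tuple

LogEntry = Tuple[int, str, str]  # (sequence number, key, value)


class Primary:
    """Toy primary database that records every write in a transaction log."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self.log: List[LogEntry] = []

    def write(self, key: str, value: str) -> None:
        self.data[key] = value
        self.log.append((len(self.log) + 1, key, value))

    def backup_log(self, after_seq: int) -> List[LogEntry]:
        """Return log entries with sequence numbers greater than after_seq."""
        return [entry for entry in self.log if entry[0] > after_seq]


class Standby:
    """Toy read-only copy that is updated only by restoring shipped logs."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self.last_applied = 0

    def restore_log(self, entries: List[LogEntry]) -> None:
        for seq, key, value in entries:
            if seq != self.last_applied + 1:
                raise ValueError("log chain broken: entries must be applied in order")
            self.data[key] = value
            self.last_applied = seq


def ship_logs(primary: Primary, standby: Standby) -> int:
    """Back up new log entries on the primary and restore them on the standby."""
    batch = primary.backup_log(standby.last_applied)
    standby.restore_log(batch)
    return len(batch)


primary, standby = Primary(), Standby()
primary.write("order:1", "CBC")
primary.write("order:2", "Chest X-ray")
first_batch = ship_logs(primary, standby)

primary.write("order:3", "Metformin 500 mg")
lagging = standby.data != primary.data  # standby is behind until the next shipment
second_batch = ship_logs(primary, standby)
print(first_batch, lagging, second_batch, standby.data == primary.data)
```

The model shows the main trade-off of the approach: the standby is only as current as the last shipped log, so any writes made after that point are absent until the next shipment.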
The warm site is an entirely redundant system that can be used in the event of an unplanned long interruption. The warm site was created using the same log shipping process. It is available to users in case of an unplanned interruption expected to last more than four hours.
The warm site has no direct connection to the production database and does not share hardware with it. This separation prevents a failure of the primary system from affecting the backup system. The potential for a longer-than-expected downtime and the risk that the site will go offline make it difficult to test the warm site fully. The need for confidence in the system design and recovery procedure must be balanced against these risks; production system uptime remains the primary concern.
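The activation thresholds described in this section can be summarized as a small decision rule. The sketch below encodes only the two-hour and four-hour thresholds stated above; the `choose_environment` function, the enum, and the fallback label for short outages are illustrative assumptions, not NIH/CC policy text.

```python
from enum import Enum


class Environment(Enum):
    NONE = "no backup environment activated"  # assumed label for short outages
    READ_ONLY = "read-only environment"
    WARM_SITE = "warm site"


def choose_environment(planned: bool, expected_hours: float) -> Environment:
    """Pick a backup environment from the expected duration of the downtime."""
    # The warm site is for unplanned interruptions expected to exceed four hours.
    if not planned and expected_hours > 4:
        return Environment.WARM_SITE
    # The read-only system is for downtimes, planned or not, longer than two hours.
    if expected_hours > 2:
        return Environment.READ_ONLY
    return Environment.NONE


for planned, hours in ((True, 1), (True, 3), (False, 3), (False, 6)):
    label = "planned" if planned else "unplanned"
    print(f"{label}, {hours} h -> {choose_environment(planned, hours).value}")
```

The two-hour threshold follows from the roughly 90 minutes needed to activate the read-only system: for shorter outages, production would likely be back before the read-only copy became available.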
Future Technical Enhancements
The goal is to achieve system availability of 99.99 percent in 2016. Microsoft SQL Server Always On is being implemented to reach this goal and will be activated in spring 2016. This technology offers improved failover and redundancy functionality. The architecture has been reviewed, documented, and approved. Before activation, a full failover will be performed and documented.
The new architecture will include the virtualization of all EHR components except the database. Virtualization enhances the ability to quickly restore an environment at a new location.
Conclusion
The NIH/CC took advantage of the extended downtime to evaluate its policies and procedures and to conduct a comprehensive lessons-learned exercise. The NIH/CC reviews and enhances its downtime procedures on an ongoing basis and whenever system functionality changes. Staff are familiar with the downtime procedures and can transition to them smoothly during planned outages. Ng says: “Downtimes are inevitable and, while unplanned downtimes are never desired, they will always be needed.” To achieve a high level of system reliability, it is important to maintain a robust infrastructure and monitoring tools, foster a culture of high availability, review business continuity documentation frequently, and test procedures regularly.
Notes
- Ng, S. “Meet an Informaticist: Downtimes Are . . . Inevitable.” HIMSS News. February 10, 2014. Available at http://www.himss.org/News/NewsDetail.aspx?ItemNumber=28123.
- Nelson, N. C. “Downtime Procedures for a Clinical Information System: A Critical Issue.” Journal of Critical Care 22, no. 1 (2007): 45–50.
- Project Management Institute. A Guide to the Project Management Body of Knowledge (PMBOK® Guide). 5th ed. Philadelphia, PA: Project Management Institute, 2013.
- Savoia, E., F. Agboola, and P. D. Biddinger. “Use of After Action Reports (AARs) to Promote Organizational and Systems Learning in Emergency Preparedness.” International Journal of Environmental Research and Public Health 9, no. 8 (2012): 2949–63.
- Ibid.
- Aiello, B., and L. Sachs. Configuration Management Best Practices: Practical Methods That Work in the Real World. Upper Saddle River, NJ: Addison-Wesley, 2011, p. 166.
- Fahrenholz, C. G., L. J. Smith, K. Tucker, and D. Warner. “Plan B: A Practical Approach to Downtime Planning in Medical Practices.” Journal of AHIMA 80, no. 11 (2009): 34–38.
- Minghella, L. “Be Prepared: Lessons from an Extended Outage of a Hospital’s EHR System.” Healthcare Informatics. August 30, 2013. Available at http://www.healthcare-informatics.com/article/be-prepared-lessons-extended-outage-hospital-s-ehr-system.
- Kilbridge, P. “Computer Crash—Lessons from a System Failure.” New England Journal of Medicine 348, no. 10 (2003): 881–82. Available at http://ehealthecon.hsinetwork.com/NEJM_downtime_2003-03-06.pdf.
- Nelson, N. C. “Downtime Procedures for a Clinical Information System: A Critical Issue.”
- Fahrenholz, C. G., L. J. Smith, K. Tucker, and D. Warner. “Plan B: A Practical Approach to Downtime Planning in Medical Practices.”
- Nelson, N. C. “Downtime Procedures for a Clinical Information System: A Critical Issue.”
- Fahrenholz, C. G., L. J. Smith, K. Tucker, and D. Warner. “Plan B: A Practical Approach to Downtime Planning in Medical Practices.”
- Minghella, L. “Be Prepared: Lessons from an Extended Outage of a Hospital’s EHR System.”
- Ng, S. “Meet an Informaticist: Downtimes Are . . . Inevitable,” p. 4.