Network Outage in East Data Centre

Incident Report for Arts Management Systems

Postmortem

We’ve received the Reason for Outage (RFO) from our east data centre provider. A power failure at the data centre caused the network outage on May 19, 2022.

The following is the content from the RFO:


Root Cause Analysis

The Master Electrician and Senior Facility Technicians determined that, due to a physical misconfiguration, a failing power supply had caused a cascading power failure across multiple electrical circuits. An automated transfer switch (ATS) intended to mitigate power failures across multiple circuits had been incorrectly connected to the electrical circuit for a cabinet that contained the secondary network devices providing redundancy. When the power supply failed in the core network switch in the first cabinet, it caused the breaker on the “A” circuit to trip. The ATS then switched this load to the alternate “B” circuit, which in turn caused the breaker on the “B” circuit to trip.
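
The cascade described above can be illustrated with a small model. This is only a sketch and not part of the provider's RFO: the class and function names are hypothetical, and it assumes a simplified rule in which a faulty power supply trips the breaker of any circuit it is connected to.

```python
from dataclasses import dataclass


@dataclass
class Circuit:
    name: str
    tripped: bool = False


@dataclass
class Load:
    name: str
    feeds: list               # circuits this load can draw from, in order
    faulty: bool = False      # models the failing power supply
    behind_ats: bool = False  # an ATS moves the load to its next feed

    def powered(self) -> bool:
        # A load has power if at least one reachable circuit is still live.
        return any(not c.tripped for c in self.feeds)


def simulate(with_ats: bool) -> dict:
    """Return which loads still have power after the supply fails."""
    a, b = Circuit("A"), Circuit("B")
    # Assumption: with the ATS in place the faulty supply can reach A and B;
    # without it, the supply is connected to circuit A only.
    core_feeds = [a, b] if with_ats else [a]
    loads = [
        Load("core switch PSU", core_feeds, faulty=True, behind_ats=with_ats),
        Load("secondary network devices", [b]),
    ]
    for load in loads:
        if not load.faulty:
            continue
        for circuit in load.feeds:
            circuit.tripped = True   # simplified rule: the fault trips the breaker
            if not load.behind_ats:
                break                # no ATS: the fault stays on one circuit
    return {load.name: load.powered() for load in loads}


if __name__ == "__main__":
    print("with ATS:   ", simulate(with_ats=True))
    print("without ATS:", simulate(with_ats=False))
```

In this model the ATS carries the fault from circuit A to circuit B, so the secondary network devices lose power as well. Without the ATS, only circuit A trips and the secondary devices stay up.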


Corrective Measures and Mitigation

The Master Electrician and Senior Facility resources immediately identified the incorrect electrical circuit management during the troubleshooting process and rerouted the equipment connections to individual dedicated circuits. Subsequently, it was determined that the ATS devices are not a necessary component of this design, as the switches already contain redundant power supplies; the introduction of the ATS simply added complexity and resulted in the condition that led to the tripped breakers. As a result, the ATS devices will be removed in an upcoming maintenance window. The impact of this change is expected to be minimal, as all connected devices have redundant power supplies. Further details will be available in forthcoming change communications.
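
The reasoning that redundant power supplies on dedicated circuits make the ATS unnecessary can be checked with a short sketch. It is illustrative only: the function name and the PSU and circuit labels are hypothetical, and it assumes that a failed supply trips only the breaker of the circuit it is plugged into.

```python
def switch_stays_up(psu_circuits: dict, failed_psu: str) -> bool:
    """Return True if the switch keeps at least one working, powered PSU.

    psu_circuits maps each PSU label to the circuit it is plugged into.
    Assumption: the failed PSU trips the breaker of its own circuit only.
    """
    tripped = {psu_circuits[failed_psu]}
    return any(
        circuit not in tripped
        for psu, circuit in psu_circuits.items()
        if psu != failed_psu
    )


if __name__ == "__main__":
    # Redundant supplies on individual dedicated circuits, as after the reroute.
    print(switch_stays_up({"psu1": "A", "psu2": "B"}, failed_psu="psu1"))
    # Both supplies ending up on the same circuit removes the redundancy.
    print(switch_stays_up({"psu1": "A", "psu2": "A"}, failed_psu="psu1"))
```

Under these assumptions, a switch whose supplies sit on separate circuits survives the loss of one supply without any transfer switch, which is the basis for removing the ATS devices.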

To further mitigate such scenarios, we will be assessing the relocation of the core network equipment to separate meet-me rooms (MMRs). Details will be communicated once available.

Although we believe this incident to be an edge case, we have initiated an audit of our critical infrastructure to confirm proper configuration across multiple facilities. We expect to complete the audit over the next three months.

Posted Jun 06, 2022 - 16:49 MDT

Resolved

The data centre provider has confirmed that the issue is fully resolved. A postmortem will be attached to this incident once we receive the Reason for Outage from our data centre provider.

Posted May 19, 2022 - 17:33 MDT

Update

We've been informed that three data centres in the Montréal area, including ours, were affected. Our provider will prepare a detailed Reason for Outage (RFO) once the incident has been fully resolved. We will post the RFO here once it's made available to us.

Posted May 19, 2022 - 16:32 MDT

Monitoring

The network connection has been restored. We are continuing to monitor the situation.

Posted May 19, 2022 - 16:18 MDT

Update

We are continuing to monitor the issue. The network remains inaccessible.

Posted May 19, 2022 - 16:16 MDT

Identified

We are continuing to monitor the issue, while our data centre provider works on their network connection.

Posted May 19, 2022 - 15:58 MDT

Update

We've received the following notice from the data centre provider regarding the network outage:

"Please be advised that we are currently experiencing a network outage at our location.
Our senior resources are working to resolve this as soon as possible.
We will provide an update as soon as possible."

Posted May 19, 2022 - 15:47 MDT

Update

We are continuing to investigate this issue.

Posted May 19, 2022 - 15:45 MDT

Investigating

We are currently experiencing a network outage at the east data centre. We're working with the data centre provider to determine the source of the issue.

Posted May 19, 2022 - 15:31 MDT

This incident affected: Web Sales (East) and TM Desktop Database Access (East).