  #1  
2005-06-11, 13:28 PM
Mitch
ServerBeach Forum Admin
Join Date: 2003 Jul
Posts: 78
Post Mortem of Virginia Power outage

Dear ServerBeach customers,

We spent much of Thursday night and Friday on the phone with our engineers in Virginia. While we cannot yet give a 100% certified post-mortem on what transpired Thursday, we wanted to communicate what we can so far.

At approximately 2:15 CDT on June 9, 2005, we experienced loss of commercial power to our Virginia datacenter. Several other buildings in the area also experienced power loss. All customer servers and network equipment immediately failed over to UPS power. The air handlers, which are typically not supported by the UPS, remained off and the datacenter began to rapidly heat up to over 100 degrees Fahrenheit. In order to prevent hardware failure, ServerBeach technicians began shutting down all servers.

Our three backup generators started but never engaged to power the UPS. As a result, the UPS batteries drained and went offline about 30 minutes later, causing a total power loss and network outage.

The commercial power was out for approximately 4 hours. During this time our engineers attempted to engage generator power, but due to a faulty automatic transfer switch (ATS), that was not possible. Power to all servers and network equipment was restored once commercial power came back online, and we began the task of cleaning up, checking servers, replacing hardware, and queuing servers for OS reloads where applicable.

A crew of 4 ServerBeach technicians from San Antonio, armed with spare parts and caffeine, was flown to Virginia Thursday afternoon for a 10:30 pm arrival. This was a precautionary move since we wanted to make sure we had plenty of staff on hand. Total technical staff on hand rose to over 10, with 6 crash carts ready.

At approximately 10:45 pm CDT commercial power failed again. The UPS batteries had charged, but the air handlers shut off and generator power was still unavailable due to the faulty ATS. Again, ServerBeach technicians began to power down servers to prevent heat damage, but commercial power was restored shortly thereafter. Only a few rows of servers and some critical network equipment were shut down at this point, and they were powered on as soon as the air handlers lowered the temperature to a safe level. We never lost network connectivity in this second power outage.

The San Antonio and Virginia technicians worked throughout the night getting servers online, with about 90% of customer servers online by 2:00 AM CDT (June 10).

Due to the unreliable nature of the ATS and storms in the area, an engineer remains on site 24x7 to monitor power and to perform a manual power transfer if necessary. This means we should not lose the air handlers to a power loss in the near future.

On Friday afternoon at approximately 4:30 pm CDT, our engineer performed a successful manual bypass to generator power under the supervision of additional Critical Power engineers from a reputable local engineering contractor. Air handlers came on in about 5 minutes and there was no loss of power or network.

The manufacturer of the ATS has been contacted and will be on-site Monday morning to troubleshoot and repair/replace the faulty ATS. Until then, our engineers are on site 24x7 in case we need a manual transfer again.

The exact problem is with the logic inside the ATS. As we learn more Monday or Tuesday, we will post additional information and plans for resolution.

Once again, on behalf of the entire ServerBeach staff both in Texas and Virginia, we apologize for the outage and we thank you for your patience as we pursue final resolution.

If you have additional questions, please don't hesitate to contact us.


Sincerely,

Robert Miggins
COO ServerBeach
and the rest of the ServerBeach team (QT, knightfoo, etc.)
  #2  
2005-06-20, 16:23 PM
Mitch
Final resolution of power outage

Excellent news to report today! Last Friday, the ATS technicians resolved the problems with the faulty switch. As it turns out, last week’s problems originated from the sync relay, which was replaced last week; the PLC circuit board, however, was not bad after all. When the relay failed, it caused the software logic to, in essence, get stuck in transition mode, and the logic would not reset itself.

The technician reset the logic and performed several tests of the system. Each time the system performed 100% as it was designed. Therefore the system has now been returned to normal and is in full automatic operation.

In addition to resolving this, we are taking the following steps to improve system operation and reliability:

• Our technicians have been trained on how to reset the logic. Should that situation occur again, a procedure is now in place to reset the logic.
• The ATS manufacturer will be providing copies of all 7 PLC software programs to us.
• One spare PLC circuit board will be kept on site in the event of a failure.
• Our engineering firm will have 3 technicians trained by the ATS manufacturer on their software and thus will have the local ability to diagnose and restore software issues with any of the PLCs.
• Our engineering firm will be submitting a proposal to install a power outage simulation switch for the purpose of conducting monthly power outage simulations. Thus the switchgear and its components will be exercised monthly to ensure the system is working. If it should fail, a manual transition procedure will be in place to transfer the power to generator.

Once again, we appreciate your patience while we pursued a final resolution. If you have any questions whatsoever, please let us know either by writing back or by submitting a ticket.

Thank you for choosing ServerBeach,

Robert Miggins
COO, ServerBeach
All times are GMT -5.