ScoNet | IT and Cloud Hosting Services

Summary of Obacks Service Event in Central Data Center

Issue Summary

On Thursday, 6/1/2023, at approximately 1 PM CST, our team was alerted to servers going offline at our central data center location. Our data center is designed with fully redundant hardware, so under normal circumstances the failure of a single piece of hardware causes no service disruption. In this instance, however, we discovered a complete loss of connectivity across all redundant hardware, which took the entire data center environment offline.

Service Impact

Multiple services are hosted at this data center location, including hosted servers, our hosted Exchange email platform, oDrive, and some of our voice over IP (VoIP) servers. The outage affected all hosted servers that run primarily from this data center, as well as Exchange email and oDrive. Although our VoIP service is hosted in the same physical data center, it runs on a completely separate network and was mostly unaffected. There were, however, two brief losses of VoIP connectivity caused by troubleshooting steps our team had to perform to resolve the outage. The impact to VoIP was minimal.

Full Issue Description

Our team was working on a third-party monitoring system that ties into our data center, intended to provide additional proactive monitoring from one of our hardware support vendors. The vendor provided our team with a set of commands to add to our core switch network so that this monitoring could function. We now believe there was a miscommunication with the vendor about how our system was configured, and the configuration we were given was not the correct one for our system. Entering the configuration caused all of our redundant networking hardware to reboot simultaneously, which should never occur.

The simultaneous reboot of all core hardware was a catastrophic event: it cut off all internet connectivity and disconnected the virtual compute layer from the storage layer, which essentially freezes every server running in the environment. After a failure of this kind, services generally do not restore automatically once the hardware has rebooted, so our team had to troubleshoot further. Our compute environment is connected to our storage systems through multiple redundant 40GbE links. We discovered that two of the transceiver modules on these links had failed simultaneously. The links are redundant, and a single link coming back online would have allowed us to begin restoring service, but with both links down we were unable to do so. We immediately contacted our support vendor to dispatch replacement parts, which are covered under the quickest available 4-hour replacement window. Once the parts arrived, they were installed and our back-end connectivity was restored.

Because the compute and storage layers were disconnected for so long, virtual servers continued to "run" in memory, but without the ability to read from or write to storage they were essentially frozen. Our team then began manually powering down every virtual server in the environment and powering it back on. This is a laborious process, as there are hundreds of servers and each one has to be handled individually. Due to the nature of the failure, some servers had trouble booting and required additional work to bring them fully back online. Our team worked diligently on this process for hours until all servers were verified online by approximately 11 PM CST. The issue was then handed off to our after-hours team, who manually went into each individual server, tested client applications, and verified file and database integrity to ensure no further issues would arise at the start of business today, Friday, 6/2/2023.

Event Communication

We understand that this was a critical event with a large-scale impact on our clients' business needs. Because the failure was so unexpected and unprecedented, our team needed time to gain a full understanding of what had occurred. This delayed notifications to clients through our status system and limited our ability to provide a better ETA on restoration of services.

Future Planning

An outage of this scale was not an event we had fully planned for, so we will be revising multiple procedures and making additional internal changes to prevent a recurrence. Our change management policy includes a list of pre-approved changes. In this instance, the change was pre-approved and the configuration came from a hardware support vendor, but the configuration provided was incorrect. Going forward, we will reduce our list of pre-approved changes to a minimum, and our change management workflow will require additional levels of review and approval before any of these changes are implemented. While much work can be performed on our data center environment while it is live, without any service interruption, we will also require scheduled maintenance windows for additional types of changes. Lastly, we will evaluate the spare hardware we keep on hand and purchase additional spares, so that even in the event of multiple simultaneous hardware failures we can provide immediate replacements without waiting for a part to be dispatched.

If you have any further questions, please feel free to reach out to our team and we will be more than happy to discuss this matter further with you. We strive to keep our systems online at all times, and fully understand any outage, small or large, can affect your business. We will continue to work to provide the best level of service you have come to expect from ScoNet.

Thank you,

Kyle Maulden
Chief Technical Officer

~06/02/2023


GET IN TOUCH

CONTACT US

Phone: 713.993.7660

Email: sales@sconet.com

Location: 5090 Richmond Ave, #111, Houston, TX 77056


2024© ScoNet Inc.
