Florida’s Weather and Your IT Infrastructure
Summer in South Florida means one of two things: sunshine or inclement weather. Flip a coin. Many consider South Florida the lightning capital of the world for good reason. If that’s not bad enough, hurricane season runs from June through November, making summer and fall worrisome for many organizations whose employees and workloads depend on electricity and Internet access.
Natural catastrophes can cause the loss of operating assets, data and productivity, and can even put companies out of business if the damage is severe enough or if recovery is prolonged.
Our weather patterns are changing, and global warming is driving weather extremes. On July 19, 2022, a heat wave in England knocked out a major data center’s cooling system. According to SiliconAngle.com:
“Temperatures in eastern England hit 40.3 degrees Celsius (104.5 F), the highest ever recorded in the country, causing cooling systems in data centers to struggle and some cases fail from the heat. As a precaution, both Google and Oracle shut down certain services to prevent damage.
The cooling failure “caused a partial failure of capacity in that zone, leading to VM terminations and a loss of machines for a small set of our customers,” Google said.
Oracle had a nearly identical issue with the cooling system in its UK South (London) data center. According to a status update from Oracle Cloud, the cooling issues resulted in service disruption with networking, Cloud Infrastructure Block Volumes, Infrastructure Compute, Infrastructure Object Storage and Integration.”
So, what can organizations do to avoid downtime?
While capital asset losses (machinery, real estate, etc.) are beyond the scope of this article, we will cover those that are within our field of expertise: information technology.
When planning for contingencies, take into account that each event has multiple possible outcomes. It’s important to sit down and analyze what can happen and how each scenario can affect your business and its different levels of operation.
Enumerate each possible scenario, the losses it may cause, how much it will cost to mitigate, and its likelihood of occurrence.
Take all these factors into consideration and discuss them with your financial staff to decide what is feasible. Break scenarios down into components so that you can decide whether you want to mitigate an entire worst-case scenario or just specific parts of it. This way you can apply different mitigations based on risk. Remember, everybody wants complete protection, but unless you have carte blanche, there are compromises to be made.
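One way to bring these factors to the table is a simple risk register. The sketch below is only an illustration: the `Scenario` class, its field names and every dollar figure and probability are hypothetical placeholders, not data from a real business. It multiplies likelihood by loss to get an expected annual loss, then compares the loss a mitigation might avoid against what the mitigation costs.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    annual_probability: float  # estimated chance of the event in a given year (0 to 1)
    loss_per_event: float      # estimated loss in dollars if the event happens
    mitigation_cost: float     # estimated yearly cost of the mitigation in dollars
    loss_reduction: float      # fraction of the loss the mitigation should avoid (0 to 1)

    def expected_annual_loss(self) -> float:
        # Likelihood multiplied by impact.
        return self.annual_probability * self.loss_per_event

    def net_benefit(self) -> float:
        # Expected loss avoided per year minus the yearly mitigation cost.
        return self.expected_annual_loss() * self.loss_reduction - self.mitigation_cost


# Hypothetical estimates for illustration only; replace with your own numbers.
scenarios = [
    Scenario("Internet outage", 0.50, 20_000, 1_200, 0.90),
    Scenario("Power outage", 0.25, 40_000, 6_000, 0.80),
    Scenario("Server failure", 0.10, 60_000, 9_000, 0.75),
    Scenario("Fire", 0.01, 500_000, 2_400, 0.50),
]

# Rank mitigations from most to least worthwhile.
ranked = sorted(scenarios, key=lambda s: s.net_benefit(), reverse=True)
for s in ranked:
    print(f"{s.name:16} expected loss ${s.expected_annual_loss():>9,.0f}"
          f"  net benefit ${s.net_benefit():>9,.0f}")
```

A negative net benefit does not automatically rule a mitigation out. It simply flags a scenario where the decision rests on risk tolerance or regulation rather than on expected cost alone.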
Let’s say you have a manufacturing business with an in-house mail server, a file server and an ERP database, and you want to mitigate downtime. Potential weather-related events are: Internet outages, power outages, server failures and fire.
Internet outages may simply require adding a secondary ISP, power outages a generator.
Mitigating against fires may require an off-site backup, and mitigating against server failure may entail clustering, mirroring or cloud PaaS.
After analysis, you may decide that the few outages you have had over the past 5 years do not merit further mitigation. Others may see things differently, because organizations have different budgets and levels of risk tolerance.
You should also consider what services you can and cannot be without. Do you need all of your services during an outage or just communication services? Your ERP may not do you any good if you can’t manufacture or ship product during an outage. How long can your different services tolerate being down?
Another thing to consider is the recovery time objective (RTO), which is the maximum amount of time a service can be down after an event before it must be restored. Recovering from backup can take days or even weeks. If you cannot wait that long, you should consider server mirroring, clustering or PaaS (Platform as a Service) to mitigate hardware and infrastructure failure.
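As a rough illustration of how an RTO narrows your choices, the sketch below filters recovery options by their estimated recovery time. The function name `options_meeting_rto`, the hour estimates and the per-service targets are all assumptions made up for this example; actual recovery times depend on your data volume, hardware and provider.

```python
def options_meeting_rto(rto_hours, recovery_estimates):
    """Return the recovery options whose estimated recovery time fits the RTO."""
    return [name for name, hours in recovery_estimates.items() if hours <= rto_hours]


# Hypothetical recovery-time estimates in hours; measure your own in a test restore.
recovery_estimates = {
    "restore from off-site backup": 72,
    "server mirroring": 4,
    "clustering": 1,
    "cloud PaaS": 2,
}

# Hypothetical RTO targets per service, in hours.
rto_targets = {"e-mail": 4, "file server": 24, "ERP database": 96}

for service, rto in rto_targets.items():
    print(f"{service} (RTO {rto}h): {options_meeting_rto(rto, recovery_estimates)}")
```

In this made-up example, only the ERP database can afford to wait for a restore from backup. The other services would need one of the faster options.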
Some industries are regulated and they must keep their communications channels available during catastrophic events. A paper mill may not need its telephone systems working immediately after a hurricane but a call center, medical facility, relief supplier or insurance company most certainly will.
The crux of the matter is that there is no single right answer. It varies by organization and by the risk acceptance level of the ownership. Sitting down and role-playing a disaster scenario is the best way to discover what you can and cannot do without. It may even teach you a few things you were not aware of, such as how you will contact employees, vendors and customers to provide status alerts.
Each mitigation should be analyzed for options and benefits. Moving e-mail to Office 365 is less costly than a data center, but moving the e-mail server to a data center means you can also co-locate other servers and mitigate additional business processes.
Mitigation follows the rocket formula in the sense that it quickly reaches a point of diminishing returns. Going from 95% uptime to 98% is relatively inexpensive. It may entail just replacing your routers or switches with better-grade equipment (or getting a better IT guy!). Going from 98% to 99%, on the other hand, is harder and more costly, and reaching 99.9% even more so.
Figure 1: Uptime percentages in relation to downtime.
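The relationship between uptime percentage and downtime is simple arithmetic, and the short sketch below works it out for the percentages discussed here. It assumes a 365-day year of around-the-clock operation; the function name `downtime_hours_per_year` is just an illustrative choice.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(uptime_percent):
    """Hours of downtime per year implied by a given uptime percentage."""
    return (100 - uptime_percent) / 100 * HOURS_PER_YEAR


for pct in (95, 98, 99, 99.9):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% uptime allows about {hours:.1f} hours "
          f"({hours / 24:.1f} days) of downtime per year")
```

At 95% uptime that works out to roughly 438 hours of downtime a year, while 99.9% leaves under 9 hours.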
For example, moving to the cloud or a data center may increase your reliability from 98% to 99%, but as the aforementioned outage in England proves, nothing is 100%. Getting past 99% means bigger price tags for increases of only a few basis points.
According to Google’s SLA at the time of this writing, going from 99.5% to 99.9% means having to add load-balancing or multiple-zone instances.
Figure 2: Google’s SLAs
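To see why redundancy across zones raises availability, consider the idealized math sketched below. It assumes each instance fails independently and that service stays up as long as at least one instance is running. This is a textbook simplification and not how Google calculates its SLA figures; the function name `combined_availability` and the 99.5% starting point are used purely for illustration.

```python
def combined_availability(single_availability, copies):
    """Availability of redundant copies when at least one must be up.

    Assumes failures are independent, which real-world events such as a
    regional heat wave or hurricane can violate.
    """
    return 1 - (1 - single_availability) ** copies


single = 0.995  # one instance at 99.5% availability
for copies in (1, 2, 3):
    print(f"{copies} instance(s): {combined_availability(single, copies):.4%}")
```

On paper, a second independent instance shrinks the chance of a total outage dramatically. In practice, shared causes of failure and the load balancer itself keep real-world numbers lower, which is one reason providers commit to more conservative figures.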
In conclusion, planning for contingencies can be complex. My recommendation is for Falcon IT Services to meet with your organization’s stakeholders, enumerate all the possible scenarios that could adversely affect your organization along with their likelihood of occurrence, and then plan contingency solutions based on requirements and budget. With good planning, budget-friendly contingencies can be put in place to avoid extended downtime.