Many financial managers think of IT Disaster Recovery as something that never really happens to them, and even if it did, the business would be back to business as usual inside a day or so. Assets are safeguarded, backups taken (usually), and unless you are forced to do more, having a plan and some good intentions is just as good as having DR equipment sitting around doing nothing, using up the earth's resources at great expense. This philosophy is green too!
There are of course tangible business reasons for disaster recovery: sometimes compliance with customer and supplier requirements, reductions in insurance premiums (if you can get any), and badges of accreditation such as ISO27001 and BS25999 or industry regulations. Still, it all costs money, so doing this as slowly as possible using the least amount of resource and cash (any old equipment will do, won't it?) means we can achieve almost what we want without wasting money. No impact on results, job done!
As a former CFO for a small technology sector PLC in the UK, I admit that these are some of the thoughts I have had when it comes to disaster recovery, and I am not alone (a BT Global Services survey in May 2008 found that 73% of organizations rely on the ad-hoc dedication of their staff rather than their business continuity plans to get them through a disaster). The thing is that unless I wanted to spend more than double my IT budget on an all-singing, all-dancing duplicate site, I may as well be resigned to the fact that if a freak tornado comes down the Thames Valley or a jumbo jet falls out of the sky, then so be it. I'll take my chances with the insurance company.
Up until a couple of years ago I would still have had sympathy with this rather jaded approach to IT DR. If I wanted anything good that would work I could not afford it, and if I made compromises on budget and made do with less, in all probability it would not work (as we could not afford to test it regularly and risk impacting the business). So why spend anything except on writing the Business Continuity plan, wrapped around some best intentions, and still be able to tick the compliance box?
The world has, however, moved on. When we have a 'Disaster' in IT everyone knows about it, as we have become more and more dependent on IT always being there. Most of our systems are now considered critical (up from 36% in 2007 to 56% in 2008 in a Symantec survey, Aug '08). However, IT disasters which affect the end user are not limited to floods or earthquakes, which are quite rare. Most IT disasters are caused by IT itself, namely hardware failure, software glitches, infrastructure issues and human error. They are becoming more frequent, with no less effect on our businesses: the same Symantec survey found that one of the companies had executed their DR plan, at least in part, during the past year.
What I had been ignorant of, however, is how much of a risk to the business an IT disaster can be. Sure, it would have a massive impact for a short while, but that view was based on assurances from my IT team that all would be well. Perhaps I asked the wrong question? The facts may be a little different. For example, suppose we lost a number of our computer systems by way of fire, theft, power surge etc. Under our plan we would need to source some new kit via the suppliers listed. There is every likelihood that most of this would be delivered next day (except if it's Dell) but more probably the day after (because the delivery address has changed). However, key peripherals such as Cisco firewalls and routers or tape drives are often not sitting on the shelves, and in some cases, depending on how far up the enterprise IT food chain you are, you could have a 4-8 week delivery time. This is well worth checking, as is the estimated time for tape restoration on whatever tape drive you plan to use. Remember, tape does not restore instantly and can take many days if you have a lot of data.
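To put rough numbers on that, here is a minimal sketch of the sum I should have asked my IT team to do. Everything in it is an assumption for illustration: the function names are made up, the data volume and tape throughput are invented figures, and it assumes the restore cannot start until the slowest piece of kit has arrived. Only the four week lead time echoes the lower end of the delivery range mentioned above.

```python
def restore_hours(data_gb: float, throughput_mb_per_s: float) -> float:
    """Hours needed to read data_gb back from tape at a sustained rate.

    Uses decimal units (1 GB = 1000 MB). Real restores are usually slower
    than the drive's headline speed, so treat the result as a best case.
    """
    if data_gb < 0 or throughput_mb_per_s <= 0:
        raise ValueError("data_gb must be >= 0 and throughput must be > 0")
    seconds = (data_gb * 1000) / throughput_mb_per_s
    return seconds / 3600


def worst_case_days(lead_time_days: dict, data_gb: float,
                    throughput_mb_per_s: float) -> float:
    """Longest kit lead time plus the tape restore, expressed in days.

    Assumes nothing can be restored until every item has been delivered.
    """
    slowest_delivery = max(lead_time_days.values(), default=0)
    return slowest_delivery + restore_hours(data_gb, throughput_mb_per_s) / 24


# Hypothetical figures: 3,600 GB of data, 50 MB/s sustained from tape.
print(restore_hours(3600, 50))  # 20.0 (hours)

# Hypothetical lead times in days; 28 days is the 4 week case.
lead_times = {"servers": 2, "firewall": 28}
print(round(worst_case_days(lead_times, 3600, 50), 1))  # 28.8 (days)
```

Even with generous assumptions, the restore itself is the small part of the wait. The delivery time of the one item nobody keeps on the shelf dominates.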
Getting kit on site, if a site still exists, is just the beginning. This is when the work really starts, because the brand new kit you have just bought is not the same as the original lost in the fire / flood / theft, so restoring trouble-free from your backup tapes (which will hopefully have arrived complete and uncorrupted) is in all likelihood not going to happen. The thing is that Windows operating systems become attached to a particular machine specification, and unless you have a machine of the same specification you really have to start from scratch. Let's hope all those build docs are up to date with valid license keys, and that the hard copies were not destroyed or, worse, stored on the machines you need to rebuild. At this point there is no known timescale to get back to normal running; it may take a day, a week, or even longer if the systems are interdependent. Remember, tape does not restore instantly. Again, in the same Symantec survey 47% of those with plans reported that it would take a full week to achieve 100% normal operations.
Meanwhile the business has ground to a halt. Customers cannot be dealt with, invoices cannot be raised, salesmen cannot sell and nothing is getting done. The main business focus is to keep customers in the boat who, although initially sympathetic, lose faith pretty quickly if you are not back to normal within a couple of days. Competitors you did not know you had will be beating a path to their door with 'new customers only' deals. The reputation of a business is a very valuable asset that takes time to create but can become worthless overnight. Are you really the best supplier for them? Perhaps their previous inertia has been a bit too cozy.
The acid test to see if this scenario could become reality is to ask your IT team if they would mind testing their DR plans next week with their annual bonuses riding on it. Only then will the caveats and favorable options be added to the 'a day or two' estimate. When shit happens, timescales spiral, and you really need a worst-case estimate to make an objective assessment of the 'risk and reward' balance of provisioning and planning for an IT disaster.
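For the finance-minded, one way to frame that 'risk and reward' balance is a simple expected-loss comparison. The sketch below is my own back-of-envelope model, not a standard from any survey quoted here, and every figure in it (the chance of an incident, the cost of a lost day, the price of the DR service) is a hypothetical placeholder to be replaced with your own worst-case numbers.

```python
def expected_annual_loss(incidents_per_year: float, downtime_days: float,
                         cost_per_day: float) -> float:
    """Expected yearly cost of downtime: likelihood x duration x daily cost."""
    if min(incidents_per_year, downtime_days, cost_per_day) < 0:
        raise ValueError("inputs must not be negative")
    return incidents_per_year * downtime_days * cost_per_day


def dr_pays_for_itself(loss_without_dr: float, loss_with_dr: float,
                       dr_cost_per_year: float) -> bool:
    """True when the loss avoided each year exceeds the yearly DR spend."""
    return (loss_without_dr - loss_with_dr) > dr_cost_per_year


# Hypothetical inputs: a one-in-five chance per year of a serious incident,
# 50,000 lost per day of downtime, a week down without DR, half a day with it.
without_dr = expected_annual_loss(0.2, 7, 50_000)
with_dr = expected_annual_loss(0.2, 0.5, 50_000)

# Hypothetical DR service cost of 30,000 per year.
print(dr_pays_for_itself(without_dr, with_dr, dr_cost_per_year=30_000))
```

The model ignores lost customers and reputation, which the scenario above suggests are the bigger costs, so it understates the case for DR rather than overstating it.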
So what is the answer? Whether you have an expensive DR solution or are just sitting there exposed, technology has moved on. With the introduction of virtual server technology breaking the bond between Windows operating systems and hardware, a business can be protected to a far higher level at a fraction of the previous cost. There are many different solutions, as befits a fledgling industry. Some will be more appropriate than others, but most will be better than the Lucky White Heather arrangement that many businesses have relied upon in the past. One thing is for certain though: astronomically high costs are no longer an excuse for accepting the business risk on the grounds that it will not happen to you.
Ten requirements for a successful virtualised IT Disaster Recovery solution:
- Fast Recovery Time Objectives (RTO) # – you need to be back managing the situation as soon as possible
- Recent Recovery Point Objectives (RPO) * – depending on your business, real time replication sounds good, but if the cause of the disaster is corrupt software or data, it's now in your recovery platform too!
- Short Test Time Objective (TTO) $ – systems change all the time; if the DR solution cannot be tested easily and regularly, do not be surprised if it does not work
- Geographically / Infrastructure independent of the live platform – miles not yards and different utility providers
- Independent of key staff – because if the worst does happen you cannot be sure they will be there
- Easily accessible when invoked – network issues are the most difficult to resolve if not planned in advance, and there is nothing more frustrating than having your machines working but not accessible to the outside world
- Performance – ensure that the DR solution can cope with the load which may be increased following a disaster
- Beware virtual production machines – being so simple to create, they are all too easy to forget when it comes to backing up, documenting and provisioning additional DR protection
- Automate, Automate, Automate – a manual routine is the easiest thing to drop when faced with other short term priorities
- Green – virtual technology makes best use of resources, and economies of scale can be gained through shared infrastructure because, barring an all-out nuclear attack, not everyone has a disaster at the same time
– RTO – the length of time between a system disaster and when the system is operational again
– RPO – the time between the latest backup and the system disaster, representing the nearest historical point in time to which a system can be recovered
– TTO – the time and effort required to test a disaster recovery plan to ensure it is effective
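The RTO and RPO definitions above boil down to subtracting timestamps, which makes them easy to measure after a test and compare against what was promised. The following is a minimal sketch under stated assumptions: the function names, the dates and the 24 hour targets are all hypothetical, chosen only to show the arithmetic.

```python
from datetime import datetime, timedelta


def recovery_point(last_backup: datetime, disaster: datetime) -> timedelta:
    """Work lost: the gap between the latest backup and the disaster."""
    if last_backup > disaster:
        raise ValueError("last backup must not be after the disaster")
    return disaster - last_backup


def recovery_time(disaster: datetime, restored: datetime) -> timedelta:
    """Downtime: the gap between the disaster and systems being operational."""
    if restored < disaster:
        raise ValueError("restore must not be before the disaster")
    return restored - disaster


def meets_objective(actual: timedelta, objective: timedelta) -> bool:
    """True when the measured figure is within the agreed objective."""
    return actual <= objective


# Hypothetical incident: nightly backup at 23:00, failure at 15:00 next day,
# systems back two mornings later.
last_backup = datetime(2008, 8, 1, 23, 0)
disaster = datetime(2008, 8, 2, 15, 0)
restored = datetime(2008, 8, 4, 9, 0)

lost = recovery_point(last_backup, disaster)
down = recovery_time(disaster, restored)
print(lost)  # 16:00:00
print(down)  # 1 day, 18:00:00

# Hypothetical targets of 24 hours for both objectives.
target = timedelta(hours=24)
print(meets_objective(lost, target))  # True
print(meets_objective(down, target))  # False
```

In this made-up case the backup schedule was good enough but the recovery was not, which is exactly the gap a regular test is meant to expose. The time a test takes could be recorded the same way and held against the TTO.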