
Essential business recovery considerations.

Business Continuity Expo And Conference : 10 January, 2008  (Technical Article)
Carl Bradbury, Senior Consultant of Continuity Services at Siemens Insight Consulting, asks if it is really possible to recover after a disaster.
Is disaster recovery truly possible? Is that a question you have asked of yourself or your staff, or one you have been asked yourself? How often is the answer you give or receive an unqualified and resounding yes?

When this question is asked or answered, what exactly are we saying? That very much depends on who is asking or answering. Ask the ICT manager and they may well focus on ICT disasters; ask someone from the business and they may well focus on the destruction of the office. Basically, everyone immediately thinks of the "total loss" scenario that the word disaster conjures up.
Now let me plant an element of doubt in your mind and ask the same question again at the end of this article. Will your answer be as confident as it was at the start? What if we replaced the word disaster with "any service interrupting incident"? Would your answer change?

What are your recovery objectives? This is a question we should all now be familiar with: the recovery time objective (RTO) and the recovery point objective (RPO). But have you considered your recovery priorities, that is, which services or elements of a service must be recovered first? So let's re-ask the question: "can you recover your services in a disaster within the recovery objectives?"

Is the answer still yes? Then consider this: do you start the recovery clock ticking from the point of invocation or from the time the service was interrupted? A business person may well start the clock from the time the service became unusable. The ICT manager or recovery manager will start the clock from the time they are authorised to start the recovery, i.e. from the time the actual recovery activities commence.

Now let's introduce the maximum tolerable outage (MTO). What is the MTO? As far as a business person or user is concerned, it is the RTO. For the ICT manager, however, it is made up of two key components: the invocation lead time (incident reporting, investigation, and the decision making process) and the recovery time.

The business, when defining its recovery requirements, probably stated the maximum time the service may be unavailable at any one time, i.e. the maximum tolerable outage. For argument's sake, let's say the business has stated 24 hours as its recovery time objective / maximum tolerable outage. For the moment, let's disregard the recovery point objective and assume that the ICT recovery strategy deployed will meet both the RTO and RPO.

The ICT manager has performed numerous tests and has documented evidence that they can recover the services within the 24-hour objective.

The ICT post-test reports all state that the service can be recovered and resumed within 24 hours. The actual time from the last test is 23 hours 15 minutes, measured from the point the recovery actually commences. Therefore, when the ICT manager is asked "can you recover our services within 24 hours?", the answer is going to be "yes", the caveat being that their recovery clock starts when a disaster has been declared and the recovery activities have started.

Does this sound like your organisation? If so, let's re-ask the question: "can you recover your services in a disaster within the recovery objectives?" Before you answer, have you considered the "invocation lead time" and the stated capability of the ICT manager? Remember, they can reinstate the services in 23h15m once the recovery has been started.

Typically, many service incidents are reported, investigated, and resolved well within the required time frames and without any significant impact on the business. The majority of incidents are probably localised and affect just a handful of staff or business processes. Indeed, the majority of service desk calls are probably due to user error: "help, I have deleted a file", "I have forgotten my password", or "just how do I encrypt and secure 25 million records so I can send them safely through the post".

Now let's look at each of the "invocation lead time" activities in a little more detail.
For minor incidents and normal business-as-usual reporting, the service desk will be the first point of contact.

Now let's look at what happens when the ICT service has a significant problem. Firstly, the volume of calls will be much greater; the operations staff who monitor the services will certainly be aware of the problem and are likely to notify the service desk of that fact. The incident is likely to be escalated as appropriate.
At this stage the cause will not always be apparent and some investigation will need to be performed.

The key question that the investigation must answer is "how quickly can the service be resumed?" Of course, there are other questions as well. What is the problem? What caused it? What is the impact? But "can it be fixed within my MTO?" is the key question the business will want answered.

For minor incidents this happens seamlessly, without any major dramas, by "forward fixing" (resolving the incident locally at the source location). But what happens when it is your data centre that has been rendered unusable through fire or flood, resulting in a "total loss" that will take several months to repair, so that "forward fixing" is not an option?

For "total loss" it is obvious: the services will not be resumed by "forward fixing" at the primary data centre, so invoking your disaster recovery plan and recovering / resuming services at your secondary data centre becomes your only option.

Now, would you have made this decision within 45 minutes? Remember, your ICT manager needs 23h15m to perform the recovery, so in order to meet the 24-hour MTO you have to make this decision in time to allow them to recover the services.
So let's re-ask the question: "can you recover your services in a disaster within the recovery objectives?" when you have to allow for incident reporting and investigation.
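
The 45-minute figure is simply the MTO minus the tested recovery time. A minimal sketch of that arithmetic follows; the function and constant names are illustrative assumptions, and only the 24-hour and 23h15m figures come from the example above.

```python
from datetime import timedelta

# Figures from the example in this article.
MTO = timedelta(hours=24)                           # maximum tolerable outage
TESTED_RECOVERY = timedelta(hours=23, minutes=15)   # last tested recovery time


def invocation_window(mto: timedelta, recovery_time: timedelta) -> timedelta:
    """Time left for reporting, investigation and the decision to invoke.

    A negative result means the recovery alone already exceeds the MTO.
    """
    return mto - recovery_time


window = invocation_window(MTO, TESTED_RECOVERY)
print(f"Time available before recovery must start: {window}")
```

Anything spent on incident reporting and investigation comes out of this window, which is why the clock the business watches and the clock the ICT manager watches give different answers.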

The decision "to invoke or not to invoke" the disaster recovery plan and restore the services at your secondary / DR site is fairly straightforward for the blindingly obvious incidents that result in the total loss of your primary site. But what happens when the cause of the incident is less catastrophic in terms of damage? Let's assume you have not invested in a "high availability" solution for one of your servers, and this server is critical to your organisation. This is not as uncommon as some of you may think, especially in the SME arena, or indeed in some of the larger, more mature organisations that have overlooked that server in the corner of their data centre which has given years of trouble-free service.

Now let's assume this server fails, the calls have come into the service desk, and the investigation has identified the errant component / server. How long has this taken since the initial service interruption? Remember, the clock is ticking.
It has been established that the system board has failed and needs replacing. What happens next? That will depend very much on the controls your organisation has. Are the various options for recovery discussed and assessed on their merits, or does someone in the ICT department just go off to see if they can "sort it"? If it's the latter, you could find yourself making a manageable incident unmanageable.

The question is "can I repair the server within the required recovery timescales?" Realistically, the answer is probably no. Why? You will have to wait for an engineer, the parts will probably need to be ordered (if they are in the parts catalogue), the part needs to be fitted, and who can say whether that really was the cause of the failure?

You could try to reinstate the operating system, applications, and data on another server. Good idea? Not really: the 'new' server is probably of a different specification, so any backups you have will not have the right drivers etc.
You could reinstall the operating system and application software from CD images and manually reapply any configuration changes, assuming somebody knows what these are. After that, you may be able to restore the data from your backup. How long do you think this will take? Can you achieve this within the recovery objectives? What if you can't do this within the 24 hours? What do you do then? Do you invoke DR?

So let's re-ask the question "can you recover your services following a critical component failure within the recovery objectives?" How confident are you now?
Sometimes it can take longer to recover from a component failure than from a full-blown disaster.

The question is: can you recover your services within the maximum tolerable outage and, if so, how? Would you really invoke your full disaster recovery plan for a component failure? So what is your decision? Remember, the clock is ticking and you have not started your recovery yet.
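
One way to frame that decision is to compare each option's estimated duration with the time still left inside the MTO. The sketch below does this under stated assumptions: the repair and rebuild estimates are invented purely for illustration, and only the 24-hour MTO and the 23h15m DR recovery time come from the example in this article.

```python
from datetime import timedelta


def options_within_mto(mto, elapsed, estimates):
    """Return the names of recovery options that can still finish inside the MTO.

    mto       -- maximum tolerable outage, measured from the initial interruption
    elapsed   -- time already spent on reporting and investigation
    estimates -- mapping of option name to estimated duration
    """
    remaining = mto - elapsed
    return [name for name, duration in estimates.items() if duration <= remaining]


MTO = timedelta(hours=24)

# Hypothetical estimates, except the DR figure taken from the last test.
estimates = {
    "repair failed server": timedelta(hours=36),        # assumed
    "rebuild on another server": timedelta(hours=30),   # assumed
    "invoke DR plan": timedelta(hours=23, minutes=15),  # from the example
}

# Decision reached 30 minutes after the outage began.
viable_early = options_within_mto(MTO, timedelta(minutes=30), estimates)

# Decision reached 2 hours after the outage began.
viable_late = options_within_mto(MTO, timedelta(hours=2), estimates)

print("After 30 minutes:", viable_early)
print("After 2 hours:", viable_late)
```

With these assumed figures, only invoking DR fits if the decision is made within the first 30 minutes, and nothing fits once two hours have passed. Real estimates will differ, but the point stands: the list of viable options shrinks as the investigation runs on.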

The maximum tolerable outage is what the business sees as its recovery time objective. This is measured from the time the initial service outage occurs through to service resumption.

The ICT manager sees the recovery time objective as the amount of time they have to recover the service once the recovery has been authorised and the actual recovery tasks commence.

In our example, where the MTO is 24 hours and the recovery tasks for total loss take 23h15m, the recovery must start within 45 minutes of the initial service outage. Realistically, this is unlikely to happen. So how do we address this discrepancy?

Well, if the MTO really is 24 hours, we need to apportion this between the invocation lead time and the recovery time. So let's assume that the invocation lead time is a maximum of 4 hours. This leaves just 20 hours for the recovery to be performed. In our example this is not possible, as the best time we have achieved is 23h15m. So what can we do?
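
The apportionment can be written down as a small calculation that shows how far the tested recovery overshoots its share of the MTO. This is a sketch only; the variable names are assumptions, and the figures are those of the example above.

```python
from datetime import timedelta

MTO = timedelta(hours=24)                               # maximum tolerable outage
INVOCATION_LEAD_TIME = timedelta(hours=4)               # assumed maximum, as above
BEST_TESTED_RECOVERY = timedelta(hours=23, minutes=15)  # best test result so far

# Whatever is not spent on reporting, investigation and the decision
# is the time left for the recovery itself.
recovery_budget = MTO - INVOCATION_LEAD_TIME

# A positive shortfall is the amount by which the recovery must be shortened.
shortfall = BEST_TESTED_RECOVERY - recovery_budget

print(f"Recovery budget: {recovery_budget}")
print(f"Shortfall: {shortfall}")
```

On these figures the recovery phase has to be shortened by 3 hours 15 minutes, or the invocation lead time cut to 45 minutes, before the 24-hour MTO can be met.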

The answer may well range from implementing a greater level of resilience and removing single points of failure (why have an outage if you don't have to?) through to redesigning your entire recovery strategy so your recovery phase can be completed within the 20 hours allowed.

Finally, let's ask the question one more time: "can you recover your services following an incident, either component failure or total loss, within the required timeframe?"

Siemens Insight Consulting will be exhibiting at the Business Continuity Expo and Conference, held at EXCEL Docklands on 2-3 April 2008, the UK's definitive event for managing risk, resilience and recovery. This event will explore the solutions and best practice to ensure operational continuity and protect a company's interests before, during and after an incident.
