Your backups are running. Every morning the system sends a green email, sometimes a weekly overview. Nobody looks too closely, because it's green. Sooner or later the uncomfortable question comes from outside, from the cyber insurer, the auditor, or the tax inspector: "When was a restore last actually tested?" We build a backup drill with you that can answer this question honestly.
Do you have this situation?
- Your backup software has been reporting green jobs for years. Whether the data is actually recoverable in an emergency has not been systematically tested since the rollout.
- Nobody can say off the top of their head how long a realistic restore of the ERP server would take. The answer swings between "a few hours" and "probably a day", depending on who you ask.
- Microsoft 365 is in place, but there is no third-party backup for Exchange, SharePoint or OneDrive. If someone deletes a mailbox or ransomware encrypts it, the answer is "Microsoft handles that", and that's only half right.
- The cyber insurer has sent a questionnaire asking about “tested restore procedures”. You’ve left the question open because you don’t know how to answer honestly.
- There is no documented RTO (Recovery Time Objective) and no RPO (Recovery Point Objective) — that is, no written agreement on how long an outage may take and how much data loss is tolerated.
Why solve this now instead of postponing
Backup is one of the areas where promise and reality drift furthest apart. A green backup message means that data was written. It does not mean the data is readable, that the right data is in it, and certainly not that a recovery works in acceptable time. Postponing the drill only postpones the problem until the real emergency.
Typical triggers that are now due:
- Ransomware wave in the sector — a competitor was hit, and management asks: "Are we prepared?"
- Cyber insurer asks — the questionnaire demands proof of tested restore procedures; otherwise the premium rises or the insurer declines to write the policy.
- NIS-2 preparation — business continuity and disaster recovery are mandatory topics, if you are directly affected or have to provide proof as a supplier.
- Audit note — the auditor or internal audit has flagged the topic.
- An M365 account was actually compromised — and you noticed that recovery is more complicated than thought.
How it would look at your company
Step 1 — Set RTO and RPO per system
We sit down with you and management and go through the important systems: ERP, file storage, mail, CAD, industry software. For each system we clarify: how long can the system be down at most before the business seriously suffers (RTO)? How much data loss is tolerable: one hour, four hours, one day (RPO)? That is a business decision, not an IT decision. Delivery: a compact table with a written RTO/RPO agreement per system. That puts an end to the "probably a day" guessing.
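Such an RTO/RPO table can also be kept in a form that allows a quick sanity check, for example whether the backup schedule can even meet the agreed RPO. A minimal sketch, with purely illustrative system names and values (the real numbers come out of the workshop, not from this example):

```python
from dataclasses import dataclass

@dataclass
class RecoveryTarget:
    system: str
    rto_hours: float              # agreed maximum tolerable downtime
    rpo_hours: float              # agreed maximum tolerable data loss
    backup_interval_hours: float  # how often backups actually run today

# Illustrative values only -- setting them is a business decision
targets = [
    RecoveryTarget("ERP", rto_hours=8, rpo_hours=4, backup_interval_hours=4),
    RecoveryTarget("File storage", rto_hours=24, rpo_hours=24, backup_interval_hours=24),
    RecoveryTarget("Mail (M365)", rto_hours=4, rpo_hours=1, backup_interval_hours=6),
]

def rpo_violations(targets):
    """A backup that runs less often than the agreed RPO can never meet it."""
    return [t.system for t in targets if t.backup_interval_hours > t.rpo_hours]

print(rpo_violations(targets))  # ['Mail (M365)']: a 6h interval cannot meet a 1h RPO
```

The point of the check is the mismatch it surfaces: an RPO agreed on paper is worthless if the backup schedule makes it arithmetically impossible.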
Step 2 — Define restore scenarios
Together with you we pick three to five realistic scenarios to exercise. Typical mix for a mid-market company:
- Scenario A — restore a single file: someone has deleted a construction drawing, it’s needed in the version from the day before yesterday.
- Scenario B — restore a complete VM or server: the ERP test server is dead and needs to be brought up on new hardware or in Azure from the backup.
- Scenario C — restore an M365 mailbox after compromise: an employee was phished, the mailbox was manipulated by attackers, a state from 24 hours ago is needed.
- Scenario D — restore a SharePoint site or OneDrive folder: a site was accidentally deleted or encrypted.
- Scenario E — ransomware full outage: multiple systems are encrypted, in what order is what restored?
Delivery: a scenario list with a clear success definition per scenario. What has to work at the end of the drill for the test to count as passed?
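The scenario list with its success definitions can be written down as structured data rather than prose, which makes the pass/fail question unambiguous. A sketch mirroring scenarios A through C above (the criterion texts and target durations are examples, not prescriptions):

```python
from dataclasses import dataclass

@dataclass
class RestoreScenario:
    code: str
    description: str
    success_criterion: str      # concrete and checkable, not "we test the restore"
    target_duration_hours: float

scenarios = [
    RestoreScenario("A", "Restore a single file",
                    "Drawing available in the version from two days ago", 1),
    RestoreScenario("B", "Restore the ERP test server as a full VM",
                    "Server boots and the application validates against the database", 8),
    RestoreScenario("C", "Restore an M365 mailbox after compromise",
                    "Mailbox state from 24h ago readable in the test tenant", 4),
]

for s in scenarios:
    print(f"Scenario {s.code}: pass if '{s.success_criterion}' within {s.target_duration_hours}h")
```

Writing criteria this way forces the conversation in Step 2: if a criterion cannot be phrased as a checkable sentence, it is not yet a success definition.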
Step 3 — Conduct the drill in an isolated environment
We set up an isolated test environment with you — a separate Azure subscription, a VM sandbox or a dedicated test area. There we play through the scenarios. In scenario B the ERP test server is really restored from the backup, started, and validated against the database. In scenario C a test mailbox is restored and the content checked. A stopwatch runs during the drill: for each scenario we record how long it actually took, where it got stuck, and what was unclear. Delivery: a drill protocol with real times, real data volumes, real obstacles. No theory.
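The drill protocol boils down to measured times and recorded obstacles per scenario. A minimal sketch of a stopwatch helper that produces such a protocol; the restore step itself is a placeholder, since in reality it drives whatever CLI or API your backup product offers:

```python
import time
from contextlib import contextmanager

protocol = []  # the drill protocol: real times, real obstacles

@contextmanager
def timed_scenario(name):
    """Run one restore scenario under a stopwatch and log the result."""
    start = time.monotonic()
    entry = {"scenario": name, "obstacles": []}
    try:
        yield entry
        entry["passed"] = True
    except Exception as exc:
        entry["passed"] = False
        entry["obstacles"].append(str(exc))
    finally:
        entry["duration_s"] = round(time.monotonic() - start, 1)
        protocol.append(entry)

# Placeholder for the actual restore work -- in a real drill this is hours
# of driving the backup product in the isolated sandbox.
with timed_scenario("B: full VM restore") as entry:
    time.sleep(0.1)
    entry["obstacles"].append("recovery password location was unclear")

print(protocol)
```

Even this toy version captures the three things the protocol needs: what was attempted, how long it really took, and what got in the way.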
Step 4 — Evaluate gaps and write a restore runbook
After the drill you have three kinds of findings: what worked (sometimes surprisingly well), what took longer than hoped, and what didn’t work at all. Together with you we evaluate the gaps. Some are technical gaps (too little bandwidth to the backup target, wrong retention setting, missing indexing), some are organizational gaps (nobody knew where the recovery password is, the responsible person was on holiday). From this a restore runbook emerges: a step-by-step instruction per scenario that a stand-in can also execute in an emergency. Delivery: a runbook of 5 to 15 pages, not a novel.
Step 5 — Set a repetition rhythm
A one-off drill is better than never — but it ages. We agree on a realistic rhythm with you: once a year a full drill, semi-annually a small spot check of individual scenarios. That fits into a mid-market company without blocking operations. Delivery: a calendar rhythm with responsibilities and a mini-template that halves the preparation the next time.
What you should look out for along the way
- Have the drill conducted in an isolated environment, not on production. Whoever proposes “restore test live on the running system” has no plan B if the test goes wrong. A separate sandbox is mandatory.
- Ask about the success criterion per scenario before the drill starts. “We test the restore” is not a criterion. “Mailbox with a state from 24 hours ago is available and readable in the test tenant” is one.
- Caution with backup software that sells its own validation as a restore test. An integrated job verification is good, but it isn’t a full drill. It checks the readability of the backup file, not the functionality of the restored system.
- Clarify before the drill who is informed in which order if the test goes wrong. Even an exercise can trigger an incident — e.g. if the backup license is limited during the restore attempt and no further jobs run.
- Plan the runbook for the person with the least knowledge, not for yourselves. In an emergency the specialist is on holiday. If only the specialist can follow the runbook, it isn't a runbook, it's a note.
What realistically changes afterwards
- You have it in black and white how long a realistic restore per scenario takes — and which data volumes are really recoverable.
- You answer the cyber insurer's question about tested restore procedures with a protocol, not with "we'll get to it soon".
- When the emergency hits, there is a runbook that a stand-in can also execute.
- RTO and RPO are no longer “gut feeling”, but a written agreement between IT and management — with realistic values.
- Gaps in the backup concept (missing M365 backup, too short retention, unfavourable restore targets) are known and prioritized, instead of lying dormant unnoticed.
What you contribute
- A person who knows the backup system administratively and is reachable during the drill.
- Access to the backup console, to the target environment (Azure, local hardware) and to the test data.
- About half to a full working day from the person responsible for IT per drill, plus around 2 hours of stakeholder time for the RTO/RPO workshop with management.
- Willingness to accept honest results — even when the first drill shows that restore time is longer than previously assumed.
Risks and when it doesn’t fit
- If your backup system is outdated and a version upgrade is due anyway — then upgrade first, drill afterwards. There is little point in testing an architecture that will be replaced in three months.
- If you don’t yet have a systematic backup solution at all, but “a few scripts and external hard disks” — then first set up the solution, then the drill. We help with both, but in this order.
- If management expects the drill to be a "thumbs up" exercise in which nothing may go wrong — then clarify beforehand that a drill exists precisely to find gaps. Whoever doesn't want to find gaps shouldn't run a drill.
How the conversation starts
- 30 minutes initial conversation, free of charge, by video or phone.
- What we clarify: currently used backup solution, which systems run under it, most urgent question (insurance, audit, gut feeling), time window for a first drill.
- Optionally useful in advance: a screenshot of the last backup overview, information on whether M365 backup is in place, a rough idea of which systems are business-critical.
Frequently asked questions
Isn’t it enough if our backup software verifies the jobs internally? The internal verification checks whether the backup file is readable — that’s helpful but not sufficient. A real restore drill checks whether the recovered system also functions: whether the database starts, whether the application connects, whether users can sign in. That’s a different class of test.
Do we really need a third-party backup for Microsoft 365? Often yes, but not always. Microsoft secures the infrastructure; what Microsoft does not provide is a long recovery horizon for accidentally deleted or attacker-manipulated data. The default retention for deleted mails is 30 days, and for deleted accounts it is similar. Whoever has to go back further needs their own solution. In the drill that quickly becomes visible.
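The 30-day horizon mentioned above can be turned into a concrete question: does the restore point you need still fall inside the default window? A sketch, assuming the 30-day default (actual values depend on your tenant's configuration):

```python
from datetime import date, timedelta

DEFAULT_RETENTION_DAYS = 30  # Microsoft's default window for deleted mails/accounts

def recoverable_without_third_party(deleted_on: date, today: date) -> bool:
    """True if the deleted item is still inside the default retention window."""
    return today - deleted_on <= timedelta(days=DEFAULT_RETENTION_DAYS)

print(recoverable_without_third_party(date(2024, 1, 1), date(2024, 1, 20)))  # True
print(recoverable_without_third_party(date(2024, 1, 1), date(2024, 3, 1)))   # False: needs own backup
```

A compromise discovered after six weeks is exactly the case where this check comes back False and a third-party backup becomes the only way back.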
How long does a drill take overall? Preparation and scenario definition around one week, the actual drill execution typically one day in the sandbox, evaluation and runbook creation one to two weeks. For the first iteration that’s about 3 to 4 weeks duration, with a manageable time commitment on your side.
What if the drill shows our backup concept has gaps? That’s exactly what the drill is for. Gaps are good news in the first instance — they are known, instead of surprising you in an emergency. Together with you we prioritize which gaps need to be closed first and which are acceptable. Not every gap has to be fixed immediately, but every one should be consciously decided.