// CWH-2026-107 // The Machine

My Backup Script Wiped 5 of My Own Sites. Here's What I Built So It Can Never Happen Again.

May 22, 2026 // Operator Post-Mortem // Connor MacIvor
TL;DR Yesterday I caught a backup script that had silently been capturing API metadata as if it were file content for months. When a downstream migration pushed the corrupt repos to production, five of my own sites — including the SRAR talk landing page — went down for about 60 seconds. Netlify's deploy-restore API got them back. Then I built three CI safety guards that make the exact failure mode mechanically impossible going forward. This is the story of building production AI infrastructure when the failures cost real things — and why "be careful" is never the fix.

Most AI vendors will sell you tools and walk away. They won't tell you what breaks. They won't write the post-mortem when their system has a bad day. They certainly won't show you the CI pipeline they built to prevent the failure from ever happening again.

I have to. The sites are mine. The clients are mine. The phone number on every one of them is mine. So when something breaks in this infrastructure, the post-mortem is the deliverable. This is yesterday's.

The Setup: 26 Sites, One Deploy Pattern

I run a network of 26+ own-brand Netlify sites. ConnorWithHonor.com. SeventeenK.com. SCV123.com. HonorElevate.com. SantaClaritaAI.com. The dated property pages. The SMB demo sites. Each one is its own GitHub repo. Each one deploys when I push to main.

That's the system now. Yesterday it wasn't.

Yesterday five of those sites were still on a legacy "snapshot upload" pattern — a Python script that downloaded the current live deploy, modified files locally, and pushed the whole thing back up via the Netlify API. The script worked. Right up until it didn't.

The Failure Mode I Didn't See Coming

Months ago, an automated backup script had run against those five sites. The script was supposed to capture every file on the live deploy and commit it to a backup repo on GitHub.

Here's what it actually did: it hit the Netlify API endpoint GET /api/v1/deploys/{id}/files/{path}, got back a JSON metadata response (something like {"id":"/index.html","sha":"...","deploy_id":"..."}), and saved that 215-byte JSON blob to disk with the original file's name. As index.html. As favicon.svg. As og-image.png.

The script reported success on every file. Every HTTP request returned 200. Every file got written. Every commit landed cleanly in git. The repos looked fine — same file paths, same directory structure. Nobody looked inside.

And the corruption sat there for months. Live still served the real content. The corrupt repos were "backup" mirrors that nobody ever deployed from. Until yesterday's migration.

The corruption was invisible because every file path matched what live had. The bytes inside were garbage. Nobody had run the byte-level check.
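A byte-level check does not have to be elaborate. Here is a minimal sketch, assuming you have a manifest that maps each site path to the SHA-1 of its real content. The manifest, the function names, and the directory layout are all illustrative, not what my backup script actually used.

```python
import hashlib
from pathlib import Path


def sha1_of(path: Path) -> str:
    """Return the hex SHA-1 digest of a file's bytes."""
    return hashlib.sha1(path.read_bytes()).hexdigest()


def find_content_mismatches(backup_root: Path, expected: dict[str, str]) -> list[str]:
    """List site paths whose backed-up bytes are missing or hash to the wrong digest.

    `expected` is a hypothetical manifest: site path -> hex SHA-1 of the real content.
    A file with the right name but the wrong bytes is reported, which is exactly
    the case a path-only comparison misses.
    """
    bad = []
    for site_path, digest in expected.items():
        local = backup_root / site_path.lstrip("/")
        if not local.is_file() or sha1_of(local) != digest:
            bad.append(site_path)
    return sorted(bad)
```

Run against one of the corrupt repos, a check like this would have flagged every file, because a 215-byte JSON blob does not hash to the digest of the real index.html.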

The 60-Second Window

Yesterday I migrated all 26 sites from the snapshot-upload pattern to git-canonical auto-deploy. The migration cloned each repo, added a GitHub Actions workflow, pushed. For 21 of the 26 sites, the deploy was clean.

For 5 sites — Carrie's SRAR landing page (srar.connorcoded.com), plumbing.honorelevate.com, rdmoving.honorelevate.com, electricians.honorelevate.com, and rd.connorhonor.com — the migration deployed the JSON-metadata garbage. Live went from real content to 215-byte blobs of API metadata served as HTML. The browser still rendered something, but it was nonsense.

I caught it during the post-deploy smoke test. Within 60 seconds of the wipe, I had used Netlify's deploy-restore API to swap each site back to its previous good deploy. Here's the actual call:

POST /api/v1/sites/{site_id}/deploys/{good_deploy_id}/restore

Netlify keeps every previous deploy. The restore was instant. Five sites back online, serving the same content they were the day before. Then I disabled the auto-deploy workflows on those five repos so the bad content couldn't ship again.
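For anyone who wants to script the restore call above ahead of time, here is a rough Python sketch. It assumes the public Netlify API base URL and a personal access token sent as a Bearer header. The function names are mine for illustration, and the site and deploy IDs are placeholders you would look up yourself.

```python
import urllib.request

NETLIFY_API = "https://api.netlify.com/api/v1"  # assumed base URL


def build_restore_request(site_id: str, good_deploy_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the POST that restores a previous deploy."""
    url = f"{NETLIFY_API}/sites/{site_id}/deploys/{good_deploy_id}/restore"
    return urllib.request.Request(
        url,
        data=b"",  # empty body; the IDs in the path carry all the information
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )


def restore_deploy(site_id: str, good_deploy_id: str, token: str) -> int:
    """Send the restore request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError on 4xx/5xx, so a bad token or
    wrong ID fails loudly instead of returning quietly.
    """
    request = build_restore_request(site_id, good_deploy_id, token)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status
```

Keeping the request builder separate from the sender means the URL and headers can be checked in a test without touching a live site.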

What I Built So It Can't Happen Again

Recovery in 60 seconds is fine. Prevention is better. I spent the next few hours building three CI safety gates that now run on every push to every one of my sites:

GATE 1: Anti-wipe guard. Before any deploy, the workflow fetches the current live file list and compares it to the new deploy's file list. If the new deploy would drop file count below 80% of live, the workflow fails. If it would remove more than 3 files, the commit message must start with remove: to authorize the removal. Critical files like /index.html must always be present.

GATE 2: Content-corruption guard. Every file in the new deploy gets scanned for the specific signature of the failure I just hit: a file that starts with {"id":"/ and contains "deploy_id" in the first 300 bytes. That's the Netlify-API-metadata-as-content pattern. If any file matches, the deploy is aborted with a clear error message pointing at the recovery procedure.

GATE 3: Post-deploy verification. After the deploy completes, the workflow hits the homepage URL and checks the response. If anything other than a 200/301/302 comes back, the workflow fails loudly. Per-site verification URLs can be added via env var.
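The Gate 1 rules can be written down as one small function. The production guards are bash; this is a Python sketch of the same logic, not the actual script. One assumption to flag: I treat the 80% floor as unconditional, so a remove: commit authorizes deletions but does not override the floor.

```python
CRITICAL_FILES = {"/index.html"}
MIN_RATIO = 0.80               # new deploy must keep at least 80% of live's file count
MAX_UNAUTHORIZED_REMOVALS = 3  # more than this needs a "remove:" commit message


def anti_wipe_errors(live: set[str], new: set[str], commit_message: str) -> list[str]:
    """Return the reasons to refuse the deploy; an empty list means it may proceed."""
    errors = []

    missing_critical = sorted(CRITICAL_FILES - new)
    if missing_critical:
        errors.append(f"critical files missing: {missing_critical}")

    if live and len(new) < MIN_RATIO * len(live):
        errors.append(f"file count {len(new)} is below 80% of live ({len(live)})")

    removed = live - new
    authorized = commit_message.startswith("remove:")
    if len(removed) > MAX_UNAUTHORIZED_REMOVALS and not authorized:
        errors.append(f"{len(removed)} files removed without a 'remove:' commit message")

    return errors
```

In a workflow, a non-empty result would be printed and followed by a non-zero exit so the deploy step never runs.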

The guards are pessimistic by design. They refuse the deploy when in doubt. Better to fail noisily than to wipe quietly.
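The Gate 2 signature check is just as small. A Python sketch of the rule as described above (starts with {"id":"/ and has "deploy_id" in the first 300 bytes); the function names and directory walk are illustrative, and the real guard is a bash script.

```python
from pathlib import Path

SIGNATURE_PREFIX = b'{"id":"/'
SIGNATURE_FIELD = b'"deploy_id"'
HEAD_BYTES = 300  # only the start of each file is inspected


def looks_like_api_metadata(head: bytes) -> bool:
    """True if the leading bytes match the metadata-as-content signature."""
    head = head[:HEAD_BYTES]
    return head.startswith(SIGNATURE_PREFIX) and SIGNATURE_FIELD in head


def scan_for_corruption(deploy_root: Path) -> list[Path]:
    """Return every file under deploy_root whose first bytes match the signature."""
    hits = []
    for path in sorted(deploy_root.rglob("*")):
        if path.is_file():
            with path.open("rb") as fh:
                if looks_like_api_metadata(fh.read(HEAD_BYTES)):
                    hits.append(path)
    return hits
```

Reading only the first 300 bytes keeps the scan cheap even on image-heavy sites, and any hit at all is enough to abort the deploy.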

The five previously-corrupt repos now have clean content too. The fix used a different Netlify endpoint — fetching from the deploy-preview CDN URL https://{deploy_id}--{site_name}.netlify.app/{path}, which returns actual file bytes instead of metadata. Every file got verified before commit. Every workflow got re-enabled.
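A sketch of that re-fetch in Python, assuming the URL pattern above and nothing else about the site. The function names are hypothetical, and the final check simply reuses the corruption signature so a metadata blob can never be written back as content.

```python
import urllib.request
from urllib.parse import quote


def deploy_file_url(deploy_id: str, site_name: str, path: str) -> str:
    """Build the per-deploy CDN URL for one file, percent-encoding the path."""
    return f"https://{deploy_id}--{site_name}.netlify.app/{quote(path.lstrip('/'))}"


def fetch_real_bytes(deploy_id: str, site_name: str, path: str, timeout: float = 30.0) -> bytes:
    """Download one file's bytes and refuse anything that looks like API metadata."""
    url = deploy_file_url(deploy_id, site_name, path)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        body = response.read()
    if body.startswith(b'{"id":"/') and b'"deploy_id"' in body[:300]:
        raise ValueError(f"{path}: got API metadata instead of file content")
    return body
```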

Why This Story Matters For Anyone Evaluating an "AI Builder"

The marketplace for "AI consultants" right now is crowded. Most of them are repackaging vendor tools. They'll tell you to install a SaaS, configure some webhooks, and call it AI integration. When the system has a bad day, they'll blame the vendor and move on.

That's not the same category of work as what I'm describing here. I built the deploy pipeline. I built the safety guards. I caught the failure, recovered the sites, wrote the post-mortem, and shipped the prevention layer — all in the same day. That's the AI builder pattern. The other thing is AI consulting.

If you're a real estate agent, an SMB owner, or anyone evaluating who to hire to integrate AI into your business: the question to ask isn't "what tools do you recommend?" The question is "what have you built, in production, that you have to keep running?" That's where the real credibility lives.

"Be careful" is not a safety system. Code is. The unsafe action has to be made mechanically impossible.

What You Can Take From This (Even If You're Not a Builder)

Three things this story should change in how you think about your own systems:

1. Path-parity is not content-parity. If you back up files by name and never check the bytes, you have a "backup" that may be useless. The corrupt files in my repos had the right filenames and the right directory structure. The content was garbage. Verify bytes, not just paths.

2. Recovery is a capability worth building before you need it. My 60-second recovery wasn't luck. It was the Netlify deploy-restore endpoint, which I knew existed because I had read the API docs. If I had been figuring it out under pressure, the window would have been hours, not seconds. Read the docs of your tools before something breaks.

3. Safety guards are CI work, not human work. The single highest-leverage thing I shipped yesterday was three small bash scripts in a GitHub Actions workflow. They run on every push, forever, with no human attention. Over time they will prevent more failures than I'll ever consciously notice. That's what "automation that protects you" looks like: not a checklist on the wall, not a procedure document nobody reads. Code that refuses to do the wrong thing.

The Quiet Confidence Of A System That Refuses

Here's the new normal. I push a change to any of my 26 sites. Three CI gates run. If anything looks wrong, the deploy fails and I get an email. If everything looks right, the deploy ships and the homepage gets verified live within 90 seconds.
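The homepage verification step can be sketched like this. The retry loop is my assumption for illustration: six attempts 15 seconds apart keep the check within roughly a 90-second budget, while the gate as described only requires that the final answer be a 200, 301, or 302. http.client is used because it does not follow redirects, so a 301 or 302 is seen as-is.

```python
import http.client
import time
from typing import Callable
from urllib.parse import urlsplit

OK_STATUSES = {200, 301, 302}


def homepage_status(url: str, timeout: float = 10.0) -> int:
    """GET an https URL without following redirects and return the status code."""
    parts = urlsplit(url)
    conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    try:
        conn.request("GET", parts.path or "/")
        return conn.getresponse().status
    finally:
        conn.close()


def verify_live(
    get_status: Callable[[], int],
    attempts: int = 6,
    interval_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True as soon as an acceptable status is seen, False once attempts run out."""
    for attempt in range(attempts):
        try:
            if get_status() in OK_STATUSES:
                return True
        except (OSError, http.client.HTTPException):
            pass  # network errors count as a failed attempt
        if attempt < attempts - 1:
            sleep(interval_s)
    return False
```

A workflow step would call verify_live(lambda: homepage_status("https://example.com/")) with the real homepage URL in place of the placeholder and exit non-zero on False.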

I no longer have to wonder whether a deploy went through correctly. I no longer have to worry about a backup script silently corrupting my repos. The system tells me. The system refuses the bad operation. The system makes me look smarter than I am.

That's what good AI infrastructure feels like. Not a magic black box. A set of small, sharp tools that each refuse to do the wrong thing. Stacked. Boring. Reliable.

Build the boring layer. Then everything that runs on top of it gets to be brilliant.

Want the Same Kind of System for Your Business?

I build AI systems for real estate agents, SMB owners, and operators who need infrastructure they can trust. HonorElevate is the white-labeled platform. HireAIVoice handles the phone calls. SCV123 handles the SCV market questions. And the whole thing runs on the infrastructure described above.

Visit HonorElevate

Speaking at SRAR August 3rd

I'm walking SRAR agents through the augmentation-not-replacement thesis on August 3rd. The talk lives at the intersection of "what AI can actually do for you" and "what you have to build yourself if you want it to work right." Read the preview here.

Read the SRAR Aug 3 Preview

FAQ

What is an anti-wipe CI guard?

A check that runs in continuous integration before any deploy ships. It compares what's about to be deployed against what's currently live. If the new deploy would silently remove files, drop below a file-count threshold, or contain corrupt content, the guard refuses the deploy. This makes accidental destruction mechanically impossible rather than relying on human carefulness.

Why would a backup script wipe a site?

In Connor's case the backup script captured Netlify deploy-file API metadata (a JSON response with id, path, sha, deploy_id fields) and saved each metadata blob as if it were the actual file content. The script reported success because every HTTP request returned 200. Five sites ended up with repos full of 215-byte JSON blobs masquerading as HTML and images. When those repos got deployed by a downstream migration, the garbage shipped live.

How fast can you recover from a Netlify content wipe?

Under 60 seconds per site if you have the Netlify API token and the previous-good deploy ID. Netlify keeps the full deploy history. A single POST request to the deploy-restore endpoint swaps the active production deploy back to the previous good state. The slow part is identifying which deploy to restore — the API call itself is instant.

What is Connor MacIvor's site deployment infrastructure?

26 own-brand sites on Netlify, each deployed from a GitHub repo via GitHub Actions. Each repo has a workflow with three sequential safety gates: anti-wipe guard, content-corruption guard, and post-deploy verification. Connor designed and built this entire stack himself.

Why does this matter for clients evaluating AI builders?

Most AI consultants repackage vendor tools. Connor builds the systems, ships them in production, runs them at scale, and writes the post-mortems when something breaks. That's a different category of credibility than someone who installs a SaaS and calls it AI integration.