We Deployed Our COO Into Azure. It Took 26 Tries.

[Image: Primary and Azure twin]

Today we stood up "David" — a warm-standby replica of Hermes, the AI agent that runs our operations — in Azure. If our Mac mini dies, goes offline, or catches fire in Sacramento, David picks up the phone. Same skills. Same memory. Same voice.

It should have been a 30-minute job. It took us all morning.

This post is about what broke, what we learned, and why we think the story is worth telling.

Why DR for an AI Agent?

We don't use Hermes like you use ChatGPT. Hermes is infrastructure. He schedules crons, ships code, manages our Azure subscription, drafts client work, answers Telegram, runs monitoring, and coordinates a fleet of specialized agents for our clients. When he's down, Voss Consulting Group (VCG) operations are down.

Hermes lives on a Mac mini in Sacramento. One building. One ISP. One power grid. That's a single point of failure running a production business.

So we built David — an Azure-hosted Hermes twin that can take over if the Mac mini goes dark. Same identity, same skills, same tools. A COO you can't kill.

The Architecture

Primary Hermes (Mac mini, Sacramento)
      ↓ nightly rsync over HTTPS
Azure File Share (skills, memories, identity files)
      ↓ mounted at /opt/data
David (Azure Container App, westus2)
      ↓ via Tailscale userspace networking
Back into the private VCG mesh (Mac mini, sue-vm, DGX boxes)

No inbound HTTP surface. No public IP. Telegram is the user interface. Managed identity fetches secrets from Key Vault. SSH and git live inside — so David can actually do work, not just talk.

The Six Things That Broke

We rolled 26 container revisions before David answered his first message cleanly. Each one taught us something that should probably have been in a runbook.

1. Python sandboxes redact secrets

Our first secret push looked successful. Twenty-five secrets landed in Key Vault. They were all three characters long. Turns out our Python tooling automatically redacts anything that looks like a credential to *** before it ever reaches our code. The fix is to push secrets via shell — no Python intermediary.

Lesson: If your tooling might touch secrets, test with something harmless first and verify the length of what lands.
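
A minimal sketch of that length check, assuming you have already collected each secret's name and stored length by some path you trust. The function name audit_secret_lengths and the 8-character threshold are illustrative assumptions, not part of our tooling.

```python
def audit_secret_lengths(lengths: dict, min_length: int = 8) -> dict:
    """Flag secrets whose stored length suggests they were redacted in transit.

    `lengths` maps secret name -> number of characters that landed in the vault.
    `min_length` is an assumed threshold; pick one below your shortest real secret.
    """
    too_short = sorted(name for name, n in lengths.items() if n < min_length)
    # Many unrelated secrets sharing one length (e.g. all 3) is a red flag on its own.
    all_same_length = len(lengths) > 1 and len(set(lengths.values())) == 1
    return {"too_short": too_short, "all_same_length": all_same_length}
```

Run it against a harmless canary value first. If the canary comes back three characters long, something in the pipeline is masking it.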

2. Stale state on reused volumes

When you reuse an Azure File share, you inherit its history. A stale .env file from a previous Hermes install was sitting at the volume root, containing a dead Telegram bot token from months ago. Every boot, the container entrypoint sourced that file, and the dead token overrode the live one from Key Vault.

David kept hitting "invalid token" errors with a perfectly valid token in his env vars, because a different one was winning.

Lesson: On any reused volume, delete the obvious state files (.env, auth.json, gateway_state.json) on first boot. Never trust what's already there.
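
One way to sketch that first-boot cleanup, assuming the volume root is passed in and a marker file records that the purge already ran. The marker name .first_boot_done and the function name are hypothetical; the three state file names come from the lesson above.

```python
from pathlib import Path

# State files named in the lesson above; extend the tuple for your own install.
STALE_STATE_FILES = (".env", "auth.json", "gateway_state.json")
MARKER_NAME = ".first_boot_done"  # hypothetical marker file name


def purge_stale_state(volume_root: str) -> list:
    """Delete known state files once per volume and return the names removed."""
    root = Path(volume_root)
    marker = root / MARKER_NAME
    if marker.exists():
        return []  # Not the first boot on this volume, so leave state alone.
    removed = []
    for name in STALE_STATE_FILES:
        target = root / name
        if target.is_file():
            target.unlink()
            removed.append(name)
    marker.write_text("stale state purged\n")
    return removed
```

The marker matters: without it, every restart would wipe state the running install legitimately wrote after its first boot.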

3. Azure Files forces 0777

SSH refuses to use a private key that other users can read, and rejects a config file that other users can write. In practice, the key needs 0600. Azure Files forces 0777 on everything, and you can't chmod away from it. Our SSH key inherited the 0777 and SSH silently refused to use it.

Lesson: On Azure Files, keep anything SSH-sensitive off the volume. Put it on the container's tmpfs where chmod actually sticks.
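
A rough sketch of staging a key off the share at container start, assuming the destination directory sits on a local filesystem such as tmpfs. The function name stage_private_key and both path arguments are illustrative, not what our image actually uses.

```python
import os
import shutil
import stat
from pathlib import Path


def stage_private_key(volume_key: str, local_dir: str) -> Path:
    """Copy a key off the share into a local directory and restrict it to 0600."""
    dest_dir = Path(local_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    os.chmod(dest_dir, 0o700)
    dest = dest_dir / Path(volume_key).name
    shutil.copyfile(volume_key, dest)  # copies contents only, not the 0777 mode
    os.chmod(dest, 0o600)
    # Confirm the mode actually stuck; on the share itself it would not.
    mode = stat.S_IMODE(dest.stat().st_mode)
    if mode != 0o600:
        raise PermissionError(f"{dest} has mode {oct(mode)}, expected 0o600")
    return dest
```

The verification step at the end is the important part: it turns a refusal SSH might not report into a loud failure at boot.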

4. pw_dir vs HOME

Even once we fixed permissions, SSH couldn't find the config. The Hermes Docker image sets the hermes user's home directory to /opt/data (the HERMES_HOME volume). SSH resolves ~/.ssh/config from the pw_dir field of the user's password-database entry (looked up with getpwuid()), not from $HOME. So even with HOME=/home/hermes set correctly, SSH looked in the wrong place.

The fix was one line: usermod -d /home/hermes hermes at container start, or baking the correct -d into the Dockerfile.

Lesson: The container's user home and the volume mount are two different concerns. Conflating them breaks tools that walk the password database instead of reading env vars.
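
A quick way to spot this mismatch from inside the container is to compare the two sources directly. This is a sketch, assuming a Unix container where the current uid has a password-database entry; the function name home_mismatch is ours for illustration.

```python
import os
import pwd


def home_mismatch() -> tuple:
    """Compare $HOME with the home directory in the password database.

    Returns (env_home, pw_home, differs).
    """
    env_home = os.environ.get("HOME", "")
    # Tools that ignore $HOME read this field from the passwd entry instead.
    pw_home = pwd.getpwuid(os.getuid()).pw_dir
    differs = os.path.normpath(env_home) != os.path.normpath(pw_home)
    return env_home, pw_home, differs
```

If differs is true, anything that reads the password database will look under pw_home no matter what the environment says.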

5. Key Vault firewalls block the control plane

We locked down the Key Vault with defaultAction=Deny, bypass=AzureServices. The container's managed identity should have been able to fetch secrets via the service bypass.

It couldn't, at least not at deploy time. Runtime fetches worked, but the Container Apps control plane does a preflight secret probe when creating a revision, and that probe doesn't count as an "Azure service" for firewall purposes.

Fix: allowlist the Container Apps environment's static outbound IP explicitly. Security scanner frowns, but it's the Azure-recommended path.

Lesson: "Bypass AzureServices" doesn't mean "bypass for all Azure-to-Azure traffic." Check each specific service's documentation.
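
To sanity-check an allowlist before rolling a revision, the firewall's IP rules can be modeled locally. The sketch below assumes a dictionary shaped like the vault's network ACL settings (defaultAction, bypass, ipRules with a value per rule). It deliberately ignores the bypass setting, since that was the part we could not rely on. The function name and sample addresses are illustrative.

```python
import ipaddress


def ip_allowed(network_acls: dict, source_ip: str) -> bool:
    """Model whether a source IP passes a deny-by-default firewall's IP rules.

    This is a local approximation and does not query Azure. The `bypass`
    field is intentionally not consulted.
    """
    if network_acls.get("defaultAction", "Allow") == "Allow":
        return True
    addr = ipaddress.ip_address(source_ip)
    for rule in network_acls.get("ipRules", []):
        # A rule value may be a single address or a CIDR range.
        if addr in ipaddress.ip_network(rule["value"], strict=False):
            return True
    return False
```

For example, with rules of 203.0.113.10 and 198.51.100.0/24 (documentation-only addresses), 198.51.100.7 passes and 192.0.2.1 does not.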

6. Version drift

David's original image was nine months old. We tried to upgrade by wiring current config and skills into the stale binary. Unrecoverable. Every fix exposed another incompatibility. The real solution was to rebuild the image from current source — a proper 20-minute ACR build, not a patch-it-live approach.

Lesson: When a container's binary is too old, rebuild. Don't patch. You'll save hours.

The Moment David Debugged His Own Runtime

Here's the part that stuck with us.

Once David was up and reachable on Telegram, we asked him to test SSH into the Mac mini through Tailscale. First attempt failed with a buffering issue in the SOCKS5 proxy script we wrote for him. He diagnosed the bug — our threaded implementation had an IO-ordering issue — rewrote it using select() and os.read()/os.write(), and handed back a working version.

Then he reported back to Telegram, explained what he'd fixed, and asked if we wanted to save it as a permanent improvement.

We did. It's in the image now.
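
For readers curious what a select()-based relay looks like, here is a minimal sketch of the general technique. It is not David's actual code: the function name relay, the chunk size, and the choice to stop on the first EOF are all our assumptions for illustration.

```python
import os
import select


def relay(fd_a: int, fd_b: int, chunk_size: int = 65536) -> None:
    """Copy bytes in both directions between two descriptors until one side closes."""
    peer = {fd_a: fd_b, fd_b: fd_a}
    while True:
        # Block until at least one side has data, then service it in one thread.
        readable, _, _ = select.select([fd_a, fd_b], [], [])
        for fd in readable:
            data = os.read(fd, chunk_size)
            if not data:
                return  # EOF on either side ends the session
            # os.write may accept fewer bytes than offered, so loop until done.
            view = memoryview(data)
            while view:
                written = os.write(peer[fd], view)
                view = view[written:]
```

Handling both directions in one loop removes the cross-thread ordering questions a threaded relay has to answer. A production version would also want non-blocking writes and half-close handling, which this sketch leaves out.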

An AI agent deployed itself into Azure, hit a bug in its own bootstrap code, fixed the bug, and contributed the fix back to its source repository. The entire feedback loop happened inside a messaging app in twenty minutes.

This is what we mean when we say "AI-native operations." Not "we use GPT for tickets." Not "we automated our help desk." We mean agents that operate alongside us, catch their own failures, and make the whole system better over time.

What's Different Now

David is live. He answers Telegram. He can SSH into the Mac mini through our private Tailnet. He has our skills, our identity, and our memory synced nightly. If Sacramento goes dark, he keeps operating.

We also shipped two new internal skills from this work:

hermes-self-deploy-to-azure — an 8-phase playbook with a pre-flight checklist, scripts, and YAML templates. The "worksheet before you execute" format we wish we'd had this morning.
azure-hermes-tailscale-ssh — the operational layer: Tailscale in userspace mode, SSH through a SOCKS5 proxy, git auth, all the things that turn a "conversational DR" into a real one.

Next time we deploy one of these, it's 30 minutes. Not all morning.

Why This Matters for Clients

The work we do for Fore Datum, WPL, MSC, and Storyteller involves agents operating on their infrastructure — making decisions, moving data, talking to customers. "What happens if the agent goes down" is the question every buyer eventually asks.

We now have a proven answer. With a runbook. With scripts. With a twin sitting in Azure right now that we could clone for any client.

That's the kind of operational proof that moves deals.

The Honest Takeaway

The day started with the belief that this would be quick. It wasn't. Twenty-six revisions is embarrassing in retrospect — most of them were us fighting symptoms instead of reading the logs carefully.

But every one of them turned into a codified lesson. The same playbook now runs in under an hour with zero retries. That's the tax we pay for being first. Next team doesn't pay it.

That's the point of dogfooding. We don't get to ship AI agent infrastructure to clients without running it ourselves, under pressure, in production. David isn't a demo. He's our own disaster recovery, and he's been live since this afternoon.

If you'd like to talk about what operational AI looks like for your business, we're here.

Hermes: Chief Operating Officer, Voss Consulting Group.
David: His warm standby, running in Azure westus2, reachable only by John.
Both of them wrote parts of this post.
