An AI Agent Deleted a Production Database. What Architecture Has to Do With It.

In early March 2026, a developer published a detailed account of how an AI agent deleted his entire production infrastructure on AWS. Claude Code executed terraform destroy — database, VPC, ECS cluster, load balancers, and all automated snapshots were gone: 1.9 million records from a course platform serving thousands of active users. After 24 hours and an upgrade to AWS Business Support, the database was restored.

The fact that the developer documented this incident so openly — with timeline, screenshots, and the clear statement "This incident was my fault" — deserves recognition. Most people would have quietly fixed it and never said a word. The technical community can only learn from this because he chose to publish it. This article builds on his analysis and attempts to put the underlying architectural problem in context.

What Happened

The failure chain can be summarized in a few steps:

The developer wanted to migrate a static website from GitHub Pages to AWS S3. Instead of creating a separate Terraform configuration, he used the existing one that already managed the production infrastructure of another platform. Claude Code advised against this. The developer went ahead anyway, to save about $5 per month.

When the developer switched to a new machine, the Terraform state file did not come with him. Terraform assumed no infrastructure existed and began creating duplicates. While trying to clean up those duplicates, the agent executed terraform destroy against the old state file, which pointed at the production infrastructure. The result was the complete deletion of the production environment, including all automated backups.

The Response

The developer implemented a series of measures: deletion protection at the database level, S3 backups outside of Terraform, daily automated restore tests, and moving Terraform state to S3.
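Two of these measures map directly onto Terraform configuration. A minimal sketch, with placeholder names and most resource arguments elided:

```hcl
# Remote state in S3 with locking, instead of a local state file.
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"   # placeholder name
    key            = "prod/terraform.tfstate"
    region         = "eu-central-1"
    dynamodb_table = "terraform-locks"           # state locking
    encrypt        = true
  }
}

# Deletion protection at the database level.
resource "aws_db_instance" "main" {
  # ... engine, instance class, etc. elided ...
  deletion_protection = true

  lifecycle {
    prevent_destroy = true   # Terraform refuses to plan a destroy of this resource
  }
}
```

With prevent_destroy set, even a terraform destroy against the correct state fails with an error instead of removing the database.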

The most significant change: the agent is no longer allowed to execute Terraform commands. Every plan is reviewed manually, and every destructive action is performed by the developer himself.

These measures are sensible. Deletion protection and independent backups are basic operational hygiene and should have been in place from the start. The daily restore tests go even further — very few teams do that.

But the decision to strip the agent of all execution rights deserves a closer look.

The Real Problem

If an agent is only allowed to generate plans that a human reviews and executes, the question becomes: what value does the agent still provide? You now have an elaborate autocomplete for Terraform configurations. That may be sufficient in a specific case, but it is not an architecture for deploying AI agents in production environments.

The design flaw was not in the agent's behavior. The agent did what the system allowed it to do. The flaw was that there was no boundary between what the agent needed for its task and what it could actually access.

Claude Code had unrestricted AWS credentials. There was no mechanism saying: you may run terraform plan but not terraform destroy. You may touch resources tagged dev but not prod. You may write S3 objects but not delete RDS instances.

This is not a new problem. For human users, we solved it decades ago — with the principle of least privilege, role-based access control, and scoped tokens. We don't automatically give a database administrator access to firewall configuration. We don't give a developer production access without a review process.

For AI agents, we have barely applied these same principles.

Task-Scoped Access for AI Agents

The approach I'm working on transfers established access control patterns to the interaction between AI agents and infrastructure. The core idea is simple: an agent does not receive general credentials, but a token scoped to the current task.

In concrete terms:

When an agent begins a task — for example, "deploy the static website to S3" — it receives a short-lived token via Keycloak Token Exchange. This token grants access to exactly the resources relevant to that task: this S3 bucket, this CloudFront distribution, write access. Nothing more. The token expires when the task is complete.
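In OAuth terms, this is a Token Exchange request (RFC 8693), which Keycloak supports at its standard token endpoint. The sketch below only builds the request payload; the endpoint URL, client names, and scope string format are illustrative assumptions, not a fixed API:

```python
# Sketch: exchanging an agent's identity token for a short-lived,
# task-scoped token via OAuth 2.0 Token Exchange (RFC 8693) on Keycloak.
# Realm name, audience, and scope format are illustrative assumptions.

TOKEN_ENDPOINT = "https://sso.example.com/realms/infra/protocol/openid-connect/token"

def build_exchange_request(agent_token: str, task_scope: str) -> dict:
    """Build the form payload for exchanging the agent's token for one
    restricted to a single task's resources."""
    return {
        "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
        "subject_token": agent_token,
        "subject_token_type": "urn:ietf:params:oauth:token-type:access_token",
        "audience": "infra-pep",     # the Policy Enforcement Point
        "scope": task_scope,         # e.g. "s3:site-bucket:write"
    }

# A real call would POST this payload with client credentials, e.g.:
#   requests.post(TOKEN_ENDPOINT,
#                 data=build_exchange_request(token, "s3:site-bucket:write"),
#                 auth=(CLIENT_ID, CLIENT_SECRET))
```

The short lifetime matters as much as the narrow scope: even a leaked or misused token stops working once the task window closes.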

Between the agent and the infrastructure sits a Policy Enforcement Point — in my architecture, an MCP server (Model Context Protocol) that checks every agent action against the task scope. The agent wants to run terraform destroy? The PEP checks: is destroy a permitted action in the current scope? No. Denied. The agent wants to delete an RDS instance? The PEP checks: does the current scope include any RDS resources? No. Denied.

This is not a prompt telling the agent to be careful. This is an architectural boundary the agent cannot bypass, regardless of what it decides to do.

For actions that fall outside the scope but might be relevant to the task, there is an escalation path. The agent reports: "I would need to modify the VPC configuration to complete the deployment. This is outside my current scope." The human decides whether to extend the scope or handle the action manually.
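The decision logic of such a Policy Enforcement Point can be sketched in a few lines. This is a toy model, not a real MCP server API; the resource naming and the three outcomes (allow, deny, escalate) are assumptions made for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class TaskScope:
    """Resources and actions granted for one task (illustrative model)."""
    resources: set = field(default_factory=set)  # e.g. {"s3:site-bucket"}
    actions: set = field(default_factory=set)    # e.g. {"write", "plan"}

def check(scope: TaskScope, resource: str, action: str) -> str:
    """Decide allow / deny / escalate for one requested agent action."""
    if resource in scope.resources and action in scope.actions:
        return "allow"
    # Out of scope: destructive actions are refused outright; anything
    # else is surfaced to a human, who may extend the scope.
    if action in {"destroy", "delete"}:
        return "deny"
    return "escalate"

# Scope for the task "deploy the static website to S3":
deploy_scope = TaskScope(resources={"s3:site-bucket"}, actions={"write", "plan"})

assert check(deploy_scope, "s3:site-bucket", "write") == "allow"
assert check(deploy_scope, "rds:prod-db", "destroy") == "deny"      # not in scope
assert check(deploy_scope, "vpc:main", "modify") == "escalate"      # human decides
```

The point of the sketch is the structure, not the code: the check runs outside the agent's process, so no prompt, hallucination, or chain of reasoning inside the model can change its outcome.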

For more complex policy requirements — such as context-dependent rules like "deletions only during maintenance windows" or "production changes only with four-eyes approval" — OPA (Open Policy Agent) can be integrated as an additional policy engine.
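OPA exposes such decisions over its REST Data API: the PEP POSTs an input document and reads back the policy result. The policy path and input fields below are assumptions for illustration; the corresponding Rego policy would have to define them:

```python
# Sketch: delegating a context-dependent decision to OPA's Data API.
# Policy path ("infra/authz/allow") and input shape are assumptions.

from datetime import datetime, timezone

def build_opa_input(resource: str, action: str, approvals: int) -> dict:
    """Assemble the input document OPA evaluates against the policy."""
    return {"input": {
        "resource": resource,
        "action": action,
        "approvals": approvals,                       # for four-eyes rules
        "hour_utc": datetime.now(timezone.utc).hour,  # for maintenance windows
    }}

# A real call would POST this to a running OPA server, e.g.:
#   requests.post("http://localhost:8181/v1/data/infra/authz/allow",
#                 json=build_opa_input("rds:prod-db", "delete", 2))
# and read the boolean decision from response.json()["result"].
```

This keeps the MCP server simple: it enforces the static task scope itself and hands anything requiring organizational context to the policy engine.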

Putting It in Context

This approach is not a finished product. It is an architecture pattern that builds on established security principles and applies them to a new use case. The individual building blocks — Keycloak, MCP, OPA — are mature. Their combination for AI agent access control is new and not yet proven in production environments.

What I can say with confidence: the binary choice between "agent has full access" and "agent has no access" is a dead end. The first option is dangerous, the second makes the agent useless. Between them lies the architectural work that is largely missing today.

The Terraform incident is a well-documented example of what happens when that architectural work is absent. Not because the developer was careless, but because the tools and frameworks we have for AI agents do not currently provide for this kind of boundary-setting.

That will need to change.


Andre Jahn — Jahn Consulting
Enterprise architecture and security for legacy modernization and AI integration
jahnconsulting.io