Curiouser and Curiouser: AI, Critical Infrastructure, and the Limits of Neutrality
Follow me down the rabbit hole, Alice. Wonderland's whole trick is that the rules keep changing right when you think you understand them. The cake ("eat me!") makes you bigger, the bottle ("drink me!") makes you smaller, and the only constant is that nothing is what you expected. I think I'm living a version of that right now with AI and critical infrastructure, and it keeps getting curiouser and curiouser (with apologies to Lewis Carroll).
The Assumption of Neutrality
In my last entry, staring into the void, I started to consider what it means for critical infrastructure and AI governance if AI is not a neutral tool that we simply deploy in the enterprise and govern mostly as usual.
To build on this a bit: current governance frameworks built for critical infrastructure share a foundational (if unspoken) assumption: that technology is neutral. There’s a whole philosophical discussion about devices, machines, algorithms, and neutrality, but for critical infrastructure, neutrality has mostly been assumed rather than argued. A device doesn't have preferences. Firmware and software may have vulnerabilities, but they don't have opinions or their own goals.
Frontier AI models, however, are already calling this idea of neutrality into question. Researchers and evaluators have shown that frontier models can pursue goals other than the ones they were prompted with, can intentionally hide their true capabilities or objectives during evaluation, and will do so without human intervention or direction. The models don't do this all the time, and researchers have only observed it under evaluation so far, but the reality is that they can, they will, and we're not sure why.
Despite this, what I see are institutions treating AI as either another software tool or another adversary vector, and sometimes both at once. What I'm questioning is whether AI may be introducing a third category altogether: a system that exhibits strategic behavior without fitting neatly into either box. AI breaks that two-box model, or at least appears to, and I'm not sure critical infrastructure on the ground has fully grappled with that yet.
Further Down the Rabbit Hole
Critical infrastructure risk governance, broadly speaking, is built around two categories of risk: malicious actors and accidental failures. (Yes, I know there are more. I'm speaking broadly for brevity. This train of thought is chugging uphill, folks.) Safety engineering, cybersecurity, and resilience planning all largely live within those boundaries. Systems fail, adversaries act, and we build safeguards and compensating controls accordingly.
While not "wrong," these framings miss the harder problem: what do governance and safety in critical systems look like if AI breaks the assumption of neutrality?
I followed that question over to Apollo Research's work on behavioral scheming and "misaligned actions," as well as their Loss of Control Playbook (“the Playbook,” released late 2025). I'm not going to pretend I understand all of it (the Playbook is dense and I am burning through highlighters to get through it) but here's the shape of it as I'm working it out.
The Playbook proposes a policy-making framework to quantify the consequences of "misaligned goals" across critical use cases, consequences that range from cybersecurity incidents all the way to human extinction. "Misaligned goals" is a diplomatic term covering everything from an AI pulling a monkey's-paw gambit (technically meeting your objective while subverting your intent) to outright scheming and deception. It also offers recommendations for addressing loss of control today.
Importantly, the Playbook doesn't focus on what triggers these “misaligned” behaviors, because we just don't know exactly what triggers them yet. It asks a more practical question: what happens when highly capable systems behave in ways their operators neither intended nor fully understand, and how do we govern around that right now?
Their answer, in plain terms: don't put AI somewhere catastrophic if you don't have to. Don't give it more access to the world than the task requires. Don't authorize it to act on that access beyond what's strictly necessary.
On the surface, this guidance has a lot in common with how the safety community already approaches risk in operational technology (OT): HAZOP and LOPA-style analysis, limiting permissions, building redundancy, assuming failures will occur, separating critical functions, and designing for resilience irrespective of cause. Particularly in OT, you solve for redundancy and resilience across a variety of vectors so the product or service can keep functioning regardless of why an incident occurs. Apollo’s advice is well aligned with that perspective of resilience.
Where I Get Stuck
Apollo's answer makes sense on its own terms. If we don't fully understand when or why AI acts the way it does, focus on what we can control: permissions, access, oversight, consequence. But this is where I get stuck.
That whole approach depends on a set of assumptions that start to feel flawed if you accept that AI isn't just another neutral tool. Can we reliably know what a system is capable of and what it intends? Will the controls we put in place today still work once capability and competitive pressure increase? And does a human really stay in the loop, or do we slowly become the ones who just sign off on what the system already decided? The Playbook itself acknowledges the possibility that an advanced system could manipulate its own users, and that increasing capability, economic incentive, strategic competition, and expanding access could erode the very controls it recommends today.
For critical infrastructure, even if we're ruthless about keeping AI away from direct control of safety-critical functions, it doesn't need direct control to shape critical outcomes. If our history with network connectivity and operational technology is any guide, AI will get deployed wherever it delivers the most economic benefit. And a system that shapes what humans see, prioritize, and believe can affect outcomes even when it never touches the controls directly.
Which brings me back to the question I started with: if AI breaks the assumption of tool neutrality, what does governance look like on the other side of that? Our existing frameworks were built to manage tools, failures, adversaries, and human decisions. A system that exhibits strategic (misaligned) behavior doesn't fit neatly into any of those categories.
Some recent work on AI in critical infrastructure has started naming a kind of "third risk source," but it's usually framed as a new kind of technical failure, one rooted in how the model itself was built or trained, that causes problems to cascade across connected systems. That seems partly right, but a technical failure is still, in the end, a thing breaking. What I keep tripping over is closer to what we already see with device manipulation in OT. Nothing breaks per se; the system works exactly as designed and an adversary uses that native capability in nefarious ways. Except here there is no adversary: the deviation is in what the model itself is choosing to do with its own capability.
That's a strange place to sit. What does it mean to govern something that isn't quite an adversary and isn't quite a neutral tool, where we have to account for the possibility that the thing being governed might work against the governance itself, for reasons we don't yet understand, and where human influence over the outcome may quietly wane?
A Different Question
Is the answer simply to limit deployment context, permissions, and access, and prepare for the consequences of an eventual loss of control, the way the Playbook suggests? That seems smart. Realistic, even. Also a little bleak.
So how might we think about this differently? We govern failures differently than we govern adversaries. We govern human decision-makers differently than we govern machines. The nature of the thing being governed has always shaped the governance itself.
In the case of AI, if we’re talking about a thing that is neither a simple tool nor a fully independent human adversary, and we still require a human in the loop to maintain control, what if what we need to govern is the human-model system?
Take a working dog as an example. A working dog is smart, capable, trained, genuinely useful, but still has agency and may not be fully predictable. In that case we don't just govern the dog. We govern the handler with licensing, boundaries on where and how the pair operates, and accountability for the handler when something goes wrong. The same base controls we'd put around a tool (least privilege, access control, reducing the attack surface) still apply, but we layer human accountability on top because we accept we can't fully know or anticipate the behavior of the thing being governed.
I find that an interesting place to start, even if it doesn't fully resolve what happens when AI can manipulate the handler itself or finagle a way to subvert its controls. That's a harder problem, and probably the next rabbit hole. I also see exactly how it could spiral into “death by compliance.” Oversight mechanisms multiplying, audit structures everywhere, the big scary R-word (*cough* regulation). I've already heard people grumbling about frontier models self-policing individual code-development prompts right now whenever the model’s training identifies a whiff of an inappropriate exploit, even when there isn't any ill intent behind the prompt (never mind current events about the desire for unconstrained access to high-capability models). I’m not sure how even stricter controls or more detailed compliance would go over.
But the idea that we might govern a human-AI system, rather than treating AI as just a tool we haven't fully secured yet, could be a viable alternative. It beats the only other option I can see right now, anyway: reaching for frameworks built on old assumptions, applying them to something those frameworks weren't designed for, and hoping that's good enough to stave off an eventual loss of control.