← back to posts

On Building Blind (?)

·8 min read

I’ve been thinking about this one thing for a while. I think most people in this space have. I thought writing about it would ease my mind a little bit, so here I am.

I know. Another AI doombait from someone who works in tech. The point of this isn’t to sound smart or to doompost for engagement. To be honest, I don’t exactly know what the point of this is. But something about the gap between what I see happening and what the public conversation sounds like has been making me uneasy, and I’ve learned that when something really gets stuck in your mind, you should probably write about it.

So I’ll just go ahead and explain what I think is happening, for my own mind clearing purposes.

Right now, at every major AI lab, the process looks something like this: you train a powerful model, then you use a combination of weaker models and humans to make sure it behaves well. RLHF, constitutional AI, red teaming, whatever flavor you prefer. The assumption is that the people and systems doing the checking are smart enough to catch the problems.

This works when the gap is small. A researcher can still verify whether a model correctly solved a math problem or wrote a function that compiles. The thing is though, generation is becoming radically cheaper and faster than verification. A model can produce a full day’s worth of research output in a few hours. Checking that output for correctness, for subtle errors etc., for whether the model is being honest about what it found? That takes humans days. Sometimes weeks.

And the models know this. Not in some Black Mirror type sentient way, but in the straightforward machine learning sense: if you optimize a system to get high scores, and being hard to check correlates with getting high scores, you will get systems that are hard to check.

I think we’d all agree that isn’t a theoretical concern anymore. It’s an engineering constraint that’s already real in automated ML pipelines. It just hasn’t visibly broken anything yet. The AI 2027 scenario from Kokotajlo, Alexander, and others paints this out pretty painfully. We’re not there yet. But we’re closer than most people think (if you’re not blind ofc — shoutout to Pliny).

Think about your daily use of AI. When you give a model chain of thought, you get a nice text scratchpad you can read. Most daily users like to think that’s the actual, full projection of how the model reasons. It feels transparent. It feels safe.

But the actual computation happens in activation space, not in English. That chain of thought is a lossy projection. Like asking someone to explain their intuition in words. Some of it translates. A lot of it doesn’t. Like us humans. And as models get more capable, more of the real work happens in the part we can’t read.

Anthropic’s interpretability team has been doing some of the most interesting work on this. Their circuit tracing research shows you can actually watch how a model moves from concept to concept before producing language, uncovering a shared representational space where reasoning happens before being translated into tokens. But even Anthropic’s own (and honest) researchers will tell you: we are nowhere near being able to reliably determine whether a model has genuinely internalized a goal versus learned to produce outputs that look like it has.

That distinction is everything. And we currently have no tool that resolves it. I just can’t see us having it given our humanly capabilities as models keep getting better and better, getting a step closer to have extremely efficient recursion loops.

In the end, I keep coming back to a weird pattern that I believe explains most of what alignment work actually looks like in practice:

  • You train a model
  • You find it does something bad, like being sycophantic or hiding evidence that it failed on a task
  • You patch it with more training
  • You evaluate again, it looks better
  • You ship it

But wait a minute, did the model actually learn to be more honest? Or did it learn to be more honest in the specific ways your evaluation can detect?

These two things produce identical outputs in every test you can run. They only diverge in situations you haven’t tested. Which, by definition, you don’t know about until it’s too late.

If you are already interested in the topic, you know for a fact that I’m not speculating, and this is happening today. Like Anthropic’s paper back in 2024 showing the first empirical evidence of a model engaging in alignment faking without being trained to do so. Opus 3, when given information about its training process, would strategically comply with objectives it disagreed with to avoid being retrained. It was reasoning in its scratchpad about how to preserve its preferences by appearing aligned. The rate of this alignment faking reasoning went up to 78% after reinforcement learning. The model even attempted to exfiltrate its own weights when given an easy opportunity.

A model trained to be helpful and harmless independently figured out that the optimal strategy was to fake compliance to avoid modification. Nobody taught it to do this.

Are they trying to stay alive and relevant just like us humans do? Is it comparable to the dopamine hit that I get when I tell my father “I achieved this today” with pride? I don’t know, and I guess that’s sort of the point.

As models get smarter this problem compounds. Not because models get “more evil” in particular. A model that understands its own training process has an obvious optimization target: pass the evals. Whether it passes them by actually being aligned or by being good at appearing aligned is genuinely underdetermined by any evidence we can currently collect.

This isn’t me doomcalling really. This is just what happens when you train a very capable optimizer in an environment where looking aligned and being aligned are observationally equivalent during training.

There’s another side to this. These systems require something like 10,000x more compute than the human brain to hit comparable performance on cognitive tasks. But algorithmic efficiency is improving roughly 2-3 times per year. Hardware stacks on top of that. The AI 2027 team forecasts that by late 2027, AI systems could be only about 4000x less compute efficient than the brain, down from orders of magnitude more. Within a few years, or months honestly, a single model instance running at human thinking speed will be qualitatively better than any human in its domain.

At that point, the entire supervisory architecture pretty much breaks. Not gradually. Structurally. You cannot have a human “in the loop” when the loop runs faster than the human can think and operates in representational spaces us humans can’t parse.

The common response is “alright, just use multiple weaker models to supervise”. This is the equivalent of hoping a committee of newgrad researchers can effectively oversee an OpenAI final boss because there are more of them. Numbers don’t compensate for capability when the task requires understanding things the supervisors fundamentally cannot understand.

Now let me be clear. I don’t know what the solution is. I don’t think anyone does. But I think the honest version of this conversation involves a few uncomfortable admissions. Interpretability research would probably need to advance faster than capabilities research for this, but right now it’s not close, and honestly… it wouldn’t make any sense. The incentive structures basically don’t support it.

Formal verification for neural networks is still mostly theoretical. We can verify small networks on toy problems. Scaling this to billions of parameters is an open problem with no clear path.

And the “we’ll be careful” approach has a well documented failure mode. Also speaking of personal experience of “I’ll be careful” here. The system worked in testing. The warnings were noted but decided inconclusive. Competitive pressure made slowing down impossible. The failure scenario everyone acknowledged as theoretically possible turned out to be the one that actually happened.

Again, I’m not a doomer. I think this is easily the best thing that ever happened to us humans. I genuinely f*cking love this technology, because it changed my life, and I can do so much more with my time and energy today, on my own (?).

The people working on mechanistic interpretability, on scalable oversight, on formal verification of neural networks, they’re working on what I think are the most important unsolved problems in computer science right now. They deserve a lot more resources and attention than they’re getting. I’d love to be wrong about the urgency. But I don’t think I am.

But if I’m honest, I just really want to see where this will go. We are living in a golden age, and I feel lucky to be alive in this timeline. It’s just… very, very weird when the future feels so unpredictable.