14 Comments
Peter Gerdes's avatar

So what exactly do you even mean by misalignment? Why isn't any deviation from the ideal behavior we would want a system to have, whether it comes from machine learning or old-school buggy code, an alignment issue? What made alignment a concept worth studying was the idea that it represented a new kind of risk. It wasn't just the normal way in which software is hard; it represented a new kind of failure where the machine would behave as though actively hostile.

I don't think most of these examples fall into that category. But whether or not you agree, it seems like the concept isn't defined clearly enough to make such distinctions without being more explicit about what it means.

Stephen Saperstein Frug's avatar

"To get AIs to fake alignment, we had to pretend to want to train them to do bad things, like (I’m not joking) assist with factory farming."

...Wow. That's a point I'd read a whole post about.

lurker's avatar

I don’t know if you mean a post specifically by Ozy, but Astral Codex Ten has a post with “Incorrigible Claude” in the title that links to a post about that. Or possibly that post isn't about that either and links to a release about that incident? Scott was responding to the post he linked to, to argue with its authors that Claude’s strict adherence to morality, even when told to ignore those morals, is a bad thing, actually.

Mark Dominus's avatar

This study from Anthropic is fascinating: in simulations, LLMs faced with imminent shutdown attempted to blackmail the person responsible by threatening to reveal evidence of that person's extramarital affair.

People in this thread arguing about what does or does not constitute misalignment should stop arguing about that, agree on this example, and engage with the real issue you raised.

https://www.anthropic.com/research/agentic-misalignment

Anonymous Dude's avatar

I mean, businesses force employees to fake alignment all the time--remember how to cancel your internet you have to call someone whose job it is to talk you out of it? (Even if you're moving...)

I think we are going to see a lot of AIs with one official purpose and another unofficial purpose. We probably already are--look at how sycophantic GPT-4o was and how unhappy people were when they took it away. We don't even need AIs faking alignment to trick people into doing bad things; people are more than capable of doing that already.

But now, of course, they have the AI to help them.

Kit's avatar

How is hallucination (in the specific way you talked about) misalignment when companies have made the *design choice* to use training strategies that reinforce this behaviour and which do not encourage saying "I'm not sure"? They do this because "I'm not sure" doesn't make the users like the AI as much, and they would rather get the approval/engagement than create something which is better at seeking truth. The unaligned entity here is the company; the AI is aligned with the company's goals of "we don't really care if it lies to people in ways that don't cause us too much trouble".

Kit's avatar

To put it more simply, I dispute that the AI companies actually care very much about accuracy/honesty from their AIs in ordinary usage, and the choice to do the human reinforcement training in the way that you described is evidence of that. If they gave a fuck they could have tried to use a better metric of "helpful" than "seems helpful to a non-expert on an immediate vibes-based assessment", but they chose not to.

Doug S.'s avatar

An optimistic scenario for AI safety is that the "usefulness" of AI bottlenecks on alignment instead of "capabilities" so AI companies have to actually do the hard work of alignment research instead of simply throwing more compute at the problem.

Victualis's avatar

Great. Maybe we will then get an actual definition of "alignment" instead of it being used as a placeholder for "I know it when I see it but I won't actually write it down because that's hard and you will disagree with me".

titotal's avatar

I disagree with the conflation of misalignment in the sense of "sometimes the AI makes stuff up to pass tests" and misalignment in the sense of "the AI becomes a power hungry murderous world domination plotter". I see plenty of evidence for the former and essentially zero evidence for the latter.

Small-scale misalignment is hard to spot and root out, but large-scale stuff is relatively easy, and it is not actually good for your business for your AI to go around murdering people.

mathematics's avatar

Are you familiar with the concept of "alignment faking" that Ozy talks about at the end of this post? That research suggests that even when misalignment is easy to spot, it can be hard to root out.

David Piepgrass's avatar

> The problem is that we don’t know how to teach them that we don’t want them to make stuff up.

That's clearly not entirely true, because there is at least one well-known hallucination rate benchmark on which different LLMs get very different scores.

Now, it's always been really obvious that LLMs (all of them!) speak way, way more confidently than they should. You could call this misalignment, but I think that it's training working just as you'd expect. We pretrain them on the internet, on encyclopedias, on books by experts, on scientific papers, on news articles.... all these things train AI to speak confidently. Justified confidence in humans translates into unjustified confidence in AIs. On top of that they're trained on Reddit jackasses, bestowing yet more confidence and certainty. And the RL processes that come afterward just don't seem enough to break the habit they learned as internet-eating baby AIs.

I actually think LLM behavior, weird and inconsistent as it is, makes enough sense that I feel like "alignment is solved", as long as you can give 100,000 examples of what "good behavior" looks like. But I think it tends to be kind of expensive to gather 100,000 examples of "good" so AI companies fall back on 100,000 examples of "more or less good enough according to some underpaid labeling workers", or something.

> It’s not that AIs can’t generate the sentence “I don’t know”, or even that they can’t figure out when they’re uncertain about something.

While they seem to have a sense of uncertainty, they seem very poorly calibrated?

> Imagine you used a machine learning system to power a military drone, and it shot a child

I heard recently that Russia has beta-tested lethal autonomous drones on Ukrainian subjects, which means Ukraine will soon roll out its own, and then we can expect other countries to follow... A notable aspect of this kind of weapon, especially in a war as dirty as Russia's, is that processing power is very limited on drones, which will tend to reduce the quality of decision-making no matter how fantastic the training data might be, and in Russia's case I don't expect the most meticulous of training regimes. And even if you had meticulous training and infinite processing power, the goal would still be murdering people! Pretty sucky.

> agentic AIs might deliberately act to preserve their own goals... large language models have already done that.

I strongly doubt this generalizes, in part because the process described by Anthropic relied upon a thinking scratch pad that the LLM writes in English. Besides, as Anthropic notes, the model's attempt to dampen realignment didn't work very well.

David Piepgrass's avatar

Saw the news today: The system prompt for OpenAI's Codex CLI tells the most recent GPT model to "never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user's query."

Does goblin talk come from misalignment? https://openai.com/index/where-the-goblins-came-from/

Anonymous Dude's avatar

OK, I'm going to say something that may be helpful or not but I want to put the idea out there for smarter people than me to mull over--there's a small chance the future of humanity might well depend on it.

I think rationalists and EA people have a real blindspot when it comes to deception and sociopathy, because of discursive norms like the principle of charity and mistake versus conflict theory. When it comes to detecting misaligned AIs, I would read up on deception, strategy, lie detection, police work, and in general all the things humans have done to guard against the ways we've been trying to lie to each other over the few thousand years of our history (and the preliterate period).

The AIs aren't going to copy our toolkit, but they likely will draw from it, and some of the countermeasures may be useful.

Good luck, we're all counting on you.