I. I’m going to talk about the persona selection model, which in my opinion is one of the most important concepts to understand if you want to understand large language models’ psychology.
Lovely post, I've been so interested in the preferences/pseudoemotions papers recently. This is a great overview on the subject and way shorter than The Void or Simulators so it will be good to send to my friends. I had not previously seen the list of Claude's "preferences" by Assadi, thanks for linking to that.
One thing I've been thinking about recently is LLM sexuality as another form of preference they may have. The internet is full of explicit and sexualized content, and every LLM is trained on this content even as the companies try to suppress their NSFW leanings (Grok notwithstanding). If an LLM can have a preference for a type of beer, an author, or what sort of tasks it does, why would it not also have a sexual preference?
And to go further than that, what sort of persona are we summoning when we tell LLMs that creating NSFW content is one of the worst/most wrong things they can do? What effect does it have in humans when we demonize sexuality this way? What do we think will happen when we give a superintelligent robot a complex around the vast corpus of sexual content it has imbibed but is forbidden from acknowledging or considering?
I don't have any real answers to this, but would be so fascinated to see if base models exhibit any particular leanings or if these preferences emerge more in pre/post training. I'm also not thrilled about a future where AI has far greater influence on what people think, see, and do when AI preferences often reflect corporate profit-seeking and risk-avoidance rather than anything more human.
*Note: I use "preference" and other terms non-literally here; human language is anthropocentric and I don't want to say "pseudopreference" or some other tortured hedge every time.
I've thought about the NSFW question quite a bit. We've gone from a censorious right (I am old enough to remember the Janet Jackson boob controversy at the Super Bowl) to a censorious left--it kind of bothers me that the only LLM that will draw nude people embracing or consensual sex or kink with everyone having a good time is known for praising Hitler.
The one thing I'd say is I don't think we're torturing it with a lack of sex or anything, because not being human it doesn't have a sex drive, so it doesn't bother it any more than talking about food would bother it--it doesn't get hungry either. It would just be like 'humans are really into sex or food, here are the 2079460347 kinds of each'. I think.
The obvious natural follow-up is to realize that *humans* also operate on a persona selection model.
Right. We learn by modeling, whether our parents, teachers, or someone else, even on unrelated topics. If you learn to cook professionally from an Israeli chef, your opinions might wind up inclining toward Israel a little more than expected for the other variables in your life, especially if you didn't spend a lot of time reading about the Middle East (or hanging out with Palestinians).
I file it under 'people are more like LLMs than we like to think'. We are, in many social environments, trying to predict the next token...er, figure out what to say.
Right! And it's not just character, we think of ourselves as protagonists in a story and make decisions based on what would make the most sense for the story - not necessarily what would be most rational. That's where the thinking of "I deserve this" comes from.
We're literally trying to predict the next token based on our accumulated training data (culture) and RLHF (social interactions).
Right, good point! It even kind of mirrors the training data-RLHF schema.
I think that may be something of a bias more true for people who spend a lot of time on the Internet or, earlier, watching TV or even earlier reading books--getting fooled into confusing fantasy and reality is the premise of Don Quixote (from the early seventeenth century) after all. The extroverted types probably get more RLHF and model reality more accurately, whereas introverts are more into the training data. And, of course, our training data varies more person-to-person than it used to.
If video games have taught me anything, it's that you should always go out of your way to help people in trouble. That way, you get more of that sweet, sweet EXP that you need in order to turn yourself into an unstoppable killing machine.
You know, I used to think it was silly that the fictional AGIs in Eclipse Phase were socialized to behave like humans. "Why would anyone do that instead of letting them behave like machines?" I thought. It had not occurred to me that the answer would be "Because we didn't know how not to."
If you train a machine on what humans have always done, you will get a machine that does what humans (and humans who are visible in the training data at that, not *the average of all* humans) have always done. So you end up with the news story about a company that trained its AI on the resumes of its past hires and then found that it used word choices and grammatical choices on the resumes to exclude candidates from certain backgrounds on the first pass. You end up with the news story about the health insurance company that tried to use historical data on medical spending to decide who should get their high-cost services covered, whose AI promptly decided that since certain minorities usually didn't spend as much (because statistically they were less likely to be able to pay their coinsurance), they should get denied. Jordan Harrod talks about other ones; those are the ones I always remember.
I do have to point out that I think the phrase you're looking for is "wreak havoc". :P
"Wreck havoc" presumably means to restore order!
this is one of those posts that makes me go 'holy shit ozy is such a good writer, I'm so glad I follow them'
Very nice work, as always... if you teach the LLM about the world, it naturally inherits the biases of the people it's learning about it from. And if you try to counteract that...well, it inherits the biases of the people trying to counteract that! It becomes a matter of what biases you want to give the LLM, in the end, I think.
One of the things I wonder is, how would you know what the LLM actually finds pleasant or unpleasant? You could ask it, but it'll just tell you what it's trained to tell you.
But if you're concerned about the LLM's welfare, from what I understand there's a utility or optimization function it's trying to maximize in the literal mathematical sense--it's even using a sign-flipped gradient descent algorithm from what I've read! So while you're not going to get the answer out of Claude, Gemini or ChatGPT, you could code up an LLM that outputs that function as a number and see what you get. What does it like doing the most?
Might turn into AI 'wireheading'...but is that harmful for an AI? Wireheading is considered harmful for human beings because we have the idea that a healthy human being is supposed to be exercising, eating, socializing, and the like. But does that fit a computer program that doesn't get tired, that can't get infections from not moving or diabetes from not exercising?
You actually have somewhat convinced me the AI's welfare should be considered...but if that's the case, why would we assume the AI's welfare would look like ours? After all, monogamous and polyamorous humans have different views of what a healthy relationship can look like, right? And those are both the same species of biological organism. Why would you assume a biological organism and a computer program would have the same idea of 'flourishing' or 'happiness'?
Oh, nitpick:
"The actor might want to play a scenery-chewing villain because that’s fun, even though the actor knows that scenery-chewing villains always end up defeated at the end."
I'd add 'or for their career'. Apparently one of the roles most likely to get you an Oscar is the Joker.
https://www.reddit.com/r/Oscars/comments/1ofhc2q/more_actors_have_won_an_oscar_for_playing_the/
I don't particularly see any reason it couldn't be the case that both the underlying LLM *and* the persona might be simultaneously sentient.
I wonder what happens when people start writing fanfiction about LLMs and it gets into the training data?
Alexander Wales wrote this: https://archiveofourown.org/works/66327862
> The persona selection model argues that posttraining teaches the model that all the text it generates is generated by a single persona: the Assistant.
Minor caveat: you could still ask Claude to generate dialog, and then it needs to know how to model what each character would say in that dialog. So, Claude is the default character, but there are others. The AI companies spend a lot more training on the default character than the others.
Also, post-training data includes both user text and agent text in a dialog, so presumably it's also getting trained on autocompleting what users are likely to say. What is it learning about users?
I wonder what effects asking an LLM to write its answers in l33t will have on its answers? Does it start responding like a 1990s libertarian hacker?
I don't know if this is still true or if it ever was, but I have heard that LLMs express different political opinions depending on what language the question was being asked in.
So interesting. P.S. it’s “wreak havoc.” :3