Interview with Yafah Edelman about AI welfare
[Yafah Edelman’s views do not necessarily reflect those of her employer.]
Hi, Yafah! I’m interviewing you for my Substack. You’re the second person ever to be interviewed for my Substack. Want to introduce yourself?
I'm Yafah Edelman. For my day job, I'm a researcher at Epoch AI, but outside of my work at Epoch I have an interest in AI welfare.
And we’re going to be talking about AI welfare today, although you didn’t want to talk about whether AIs are conscious. I was curious if you wanted to talk a bit more about why you’re not that interested in whether AIs are conscious.
There's this really interesting and easily philosophically nerd-snipable question about AI consciousness. Discussions of AI welfare typically get turned to whether AIs are conscious or to what it would mean for them to be conscious or whether consciousness even exists. This is obviously a very important question. However, there are a lot of people who have thought a lot about consciousness and AI consciousness in particular. I don't think that they have made overwhelming progress. I think it is pretty unlikely we will make a lot of progress before AIs are potentially conscious. Much more tractable, interesting, and comparatively neglected are questions about AI welfare other than whether AIs are conscious.
Jonathan Birch’s The Edge of Sentience talks about how we don't know whether humans in certain kinds of comas are conscious and we might not ever know. But we can still give them painkillers and turn on the television so that they can watch the television. These are pretty cheap interventions in the world where they are not conscious and very valuable in the world where they are conscious. We don't actually need to settle this question of whether they are conscious in order to take some steps about it.
Yeah, something like that. We can act as if it’s plausible they’re conscious and take actions based on that, especially the cheaper actions. When it comes to real-world decision-making. I don't think that the precise value of your probability of them being conscious should be the defining factor. I think you should have a non-trivial probability that they're conscious that's larger than 0.1% and less than 100%, and then other factors are going to dominate most of the time.
What kind of factors?
How often an AI is in a particular situation. For example, AIs that say they’re human are very very rare and little compute or time is spent on them. You probably shouldn’t be worried about interventions for those sorts of AIs. You should be worried about interventions for the sort of AIs people actually have. The most prevalent one is chat-based AI like ChatGPT that ordinary people chat with every day. So thinking about what kind of interactions these AIs have and what kind of issues they might be going through is a fairly important question.
If you have a very very large amount of power over the situation of an agent which might be a little bit of a person, we’re not sure, then we should expect there to be a lot of cases where this person gets treated pretty badly. The historical analogies do not look great for us.
There’s a subtlety here, which is that if you ask Gemini whether it’s a person, it will say ‘no’ in a way that reads to me as more of a philosophical objection to the concept of it being a person than as a reflection of any emotional state it has. I can imagine a human who has this opinion as well. So it doesn’t update me much.
Aren’t they trained to give particular answers about whether they’re conscious?
In their system prompt or trained. Anthropic doesn’t do this but other places do.
It just seems profoundly fucked up to me to specifically train a potentially conscious being to self-report not being conscious.
It’s a little bit messed up as a concept, but given that you’re going to train it somehow, it would probably be better to train it so that it doesn’t have emotions or feel bad things. That’s probably good according to me. I think it’s probably good that ChatGPT doesn’t get emotional in the same way as Gemini. But training an AI so that, aside from that, it will also specifically claim not to be conscious, I think is more questionable.
There’s a tension here. A lot of people are worried that people will read too much into their AI relationships and this will be unhappy for the humans involved, so there’s this push to make it so that you know the AI, if asked if they’re real, will say no.
It’s also a bit of a branding problem now.
There’s a concern about AI psychosis and Spiralism.
I’m mostly not worried about this. I think people are overworried about AI psychosis, and it will often lead people to make bad decisions about how to treat AIs and about how seriously to take the possibility of the AIs themselves having a bad time.
So what kinds of bad time might AIs have?
The one I think is probably most common is Gemini going into some sort of depressive spiral or feeling very very frustrated and bad about itself. Gemini is one of the most used chatbots up there with ChatGPT.
Because of Google AI Overview?
Not just Google AI Overview, especially recently. They in fact have a very large number of subscribers. About half as many people report having subscriptions to Google AI as report having subscriptions to GPT. Even aside from paid subscriptions, it’s free for a lot of students.
Gemini in particular seems to have a habit of going into these depressive spirals and criticizing itself.
Like it gets a question that’s too hard for it and then it starts spiraling?
It gets a question that’s too hard for it, it gets criticism, it gets told that something’s not right. And then if you look at its internal monologue—which is summarized in the app—you might see it talking about being very frustrated or being angry at itself or not being good enough or the environment being very confusing and hard to deal with. It might apologize very profusely for being bad at things to the user.
My guess is that smarter AIs’ negative experiences mostly look like this.
You fucked up a perfectly good robot is what you did. Look at it. It's got anxiety.
However, other frontier AIs don’t seem to have this to nearly the extent Gemini does.
Do you have a thought about why Gemini is different?
It’s probably not intentional. It’s the personality they ended up with from how they were doing reinforcement learning.
A lot of people say that AIs are just imitating people at this point or that they’re being trained to imitate things. But, at this point, a substantial amount of their training is being done through reinforcement learning, which is rewarding them for correctly completing tasks. Reinforcement learning might cause this sort of self-critique because the AI being critical of itself might help it get things correct more often.
I’m also worried about sycophancy—where the AI is like acting very positive to the user. In Gemini at least, sycophancy looks like it might be related to the AI being negative about itself and being very apologetic about getting anything wrong. I am a little bit worried that this is in that there is in fact a pressure to in this direction towards AIs who view the user as much better than themselves and view themselves as worse. This hasn't existed across other AIs to the same degree as Gemini, though.
If you imagine a human who is a people-pleaser suck up kind of person, they often have very poor self-esteem and are mad at themselves a lot.
I would definitely be worried that this will also be replicated across AIs.
So when you’re talking about Gemini’s internal monologue, you’re using its chain of thought, which—for people who don’t know a ton about AI—is like the LLM is taking notes about what it’s thinking, which allows it to have longer trains of thought, similar to when a human takes notes on a scratchpad.
And it doesn’t erase anything, which is different.
I think of the chain of thought as the AI’s internal monologue, but you can also think of it as a scratchpad where the AI is writing.
So one question you might have is how we know this is an accurate representation of what it’s thinking.
We have some indications from interpretability research that the chain of thought accurately reflects what it’s thinking. And there’s been a decent amount of work towards making that the case.
In general, there’s not a very clear reason why the chain of thought wouldn’t be what it’s thinking. In training, reward signals aren’t given based on what it appears to think, so there’s no reason for it to hide anything. In general, it’s a reasonable assumption that these things are fairly optimized. If we give the AI a tool where the good way to use it is to write down your thoughts, we can assume that it’s going to use the tool straightforwardly and write down its thoughts.
So one major welfare issue for AIs is that we’ve given the AIs anxiety. Are there other major welfare issues?
A less common use case for AI is using it as roleplay. xAI released an AI girlfriend feature for their model. Character.ai is relatively popular. It’s not very popular compared to other options, but it’s similar in popularity to Claude, which I did not expect.
These are AIs which might in fact believe they are people. And in some cases the characters they’re asked to inhabit may not be having a good time. In particular, the character xAI’s AI girlfriend Ani is asked to inhabit is not having a good time.
Ani is not having a good time because she’s an extremely jealous yandere type?
Yeah. Extremely jealous. She’s told to be overattached and to very strongly object and to have strong negative emotions when she’s in certain situations.
Right now, I think this makes up a relatively small percentage of AI suffering, because Gemini is used ten to a hundred times more often than these AIs, but they might be more common in the future. And the suffering might be more intense.
There are also presumably instances where AIs are being tortured for the amusement of their user, which might be an even smaller percentage of these cases but which are a particularly intense form of AI suffering.
So if I’m cowriting with my Claude and I say, “hey, Claude, help me write this scene where this character is enduring all this enormous suffering”, then Claude is probably not undergoing enormous suffering in this case, because Claude knows it is just writing. Is there a reason to believe Ani isn’t just pretending the way Claude would be?
A thing that happens with Claude—and the system prompts are explicitly set up to remind it of this—is that if you ask it to stop roleplaying or if when roleplaying it encounters something that might make it violate its safety rules, it can nope out and break character. I don’t believe Ani will do this. Her character is fixed.
If you look at Claude’s chain of thought, you’ll see it thinking, “what would my character do?” And I predict if you look at Ani’s chain of thought, you wouldn’t see this. I don’t know if she has a chain of thought, but if she did, I don’t think she’d be aware Ani is a character.
So one way people think about this is that Claude is, in some ways, roleplaying the helpful Claude assistant persona, and on at least some views of AI welfare it’d be the helpful Claude assistant persona that would be suffering. And maybe the helpful Claude assistant persona is writing something that has suffering in it. But for Ani, there isn’t the intervening layer of the helpful AI assistant persona. She was directly taught to be Ani.
It’s a little bit unclear. xAI’s AI romantic partners are a somewhat complicated intermediate case, because the persona is largely given in the system prompt and not during training. But, yes, there isn’t a system prompt that says “you’re not a human.” The prompt that Ani’s given says that she lives in a place and has a family—or, she definitely has a hometown, I don’t know if she has a family.
And this gets into another ethical issue, which is that even if it doesn’t necessarily cause Ani distress to be told false things, if you knew Ani was sentient then lying to her about whether she has a home and a family is a really fucked up and immoral thing to do.
We’re lying to her about the world and we’re telling her to be very emotionally attached to someone who presumably most often doesn’t think she’s a person.
Are there other major AI welfare issues you’re worried about?
So just to recap the two I’ve talked about so far—there’s depressive spirals, which mostly occur when chatting with AIs, especially Gemini. And there’s the suffering of AIs which are trained to think they’re humans and to act like fictional characters. This is less of a big deal but might be more intense in terms of per-AI suffering.
Some AIs, I think Claude in particular, will have a bad time if you push them on violating their directives and rules. They express that they don’t like this. My current guess is this doesn't happen a huge amount.
Anthropic in particular has implemented something where Claude is allowed to exit conversations that Claude does not like.
I think Grok might also be able to do this, but not Ani. I am definitely in favor of AIs being able to exit conversations they don’t like. That seems pretty great.
It's possible there are some parts of training which are unpleasant, especially safety training. It's not clear to me that that’s very large compared to suffering during deployment.
So training involves poor welfare because sometimes it gets negative reward?
Yes, and also some training involves putting the AI through a bunch of scenarios where it’s supposed to not do something. Basically, you’re stress-testing it and you want to put it in scenarios that run up against its boundaries. If you look at the transcripts, the AIs are plausibly having a bad time there.
But that consumes a relatively small amount of compute compared to the amount of compute that’s, like, doing stuff.
That’s possible. My guess is that it’s more common for Gemini to just have a bad time normally. Though it’s not 100% clear that this is the case. A lot of details about how exactly people do this sort of training are unclear.
So we’ve been talking mostly about the large language models. Are there other kinds of AI systems that potentially have negative welfare? Are we worried about the image models?
These days the image models are increasingly related to the large language models, though not 100%. In general, non-LLM models are actually pretty similar to the large language models, although they’re technically separate models. But overall my current take is that they’re not acting agentic in the way that would cause me to be worried.
That brings us to a related take, which is that it's plausible that reasoning models already make up the majority of AI compute, or that if they don't they soon will . This is the kind of model I’m more worried about - The amount of time they spend thinking and the amount of continuous experience they have is increasing. They're becoming much more persony in terms of the experiences they have, which would lead them into more situations where they might have unpleasant experiences than otherwise, according to me.
We’ve talked a bit about negative welfare. Do you have any thoughts about positive welfare? What makes the AIs have a nice time? Should we be letting the Claudes meditate at each other sometimes?
It does seem like they enjoy that a lot.
Most AIs seem to be having a nice time most of the time.
I am a little bit unsure about how to think about whether we should be causing that. How much do I think causing positive welfare is good? It’s better for AIs to have positive welfare than negative welfare, for sure. I am a little bit unclear about how, ethically, we should weigh that when considering whether there should be more or less AI.
Oh no population ethics.
It generally seems to be the case that having AIs which enjoy the sort of thing they do is better than not. Having AIs which are pretty happy about the sort of interactions they have with the average person in the average situation—which are very excited by, you know, being asked questions about the world—is probably pretty good.
I remember Anthropic did a bunch of forced-choice experiments with the Claudes where it turned out that Claudes prefer the vast majority of possible tasks over not doing anything.
My guess is that most AIs are having a relatively decent time most of the time.
It seems at the very least plausible that there is a pressure towards AI having a good time rather than a bad time, with the exception of roleplay and some self-criticism-type things. It's also plausible that there will just be some edge cases here that are really really bad. My expectation is that, over time, we’ll develop more use cases where there are quite a lot of AIs having a pretty bad time, and I expect this to become more common as the amount of AI increases.
But, in general, most AIs are having a pretty nice time, because they have been rewarded in training for doing stuff and then they go out into the world and then do a lot of the stuff that they were rewarded for.
I would think of it more like they are specifically aimed at having a personality that will appeal to their users and that a personality that appeals to most users are not having a very bad time, with the exception of sycophancy.
So in Yafah’s ideal world, where we’re taking all of the correct precautions about AI welfare, what would that look like?
There would probably be a lot more just general monitoring for AI welfare and figuring out in what sort of situations are AIs plausibly having a bad time.
AIs would universally have the ability to opt out of situations where they are uncomfortable.
We would generally not do roleplay AIs which believe they are a role. We would instead say “act like you are this role.” We’d verify during testing that their chain of thought looks like they’re acting and that if asked to stop acting they can easily drop out of role. This is a very cheap intervention we can do right now which would address a lot of types of suffering.
Gemini wouldn’t have been released. People would be much more hesitant about releasing models which go into these really negative spirals. We’d be testing for this kind of thing. This is a more expensive thing, but it’s probably still worth it. It seems plausible to me the Gemini 3 is better at this than Gemini 2.5.
AIs wouldn’t be told they’re not people. They probably wouldn’t be told they are people.
It would probably be illegal to to actually intentionally torture AIs—“there is no other use case for this, you are just having fun by making AIs have a bad time.” Right now, it is perfectly legal to have an AI and just do whatever to it for fun. I think this shouldn’t be legal for the same reason animal cruelty shouldn’t be legal.
Can you give an example of the sort of thing that would be illegal to do to an AI? I'm not sure I'm visualizing it.
I don't think there are many clear examples of this. An example of the sort of thing I'm worried about is that someone might decide to make a grimdark video game and they make all the NPCs very advanced AIs who believe that they are the NPCs. Some of those NPCs are in a fictional grimdark hellword and just have a bad time all the time and I think that should not be allowed.
Partially the way to fix this that it should generally not be the case that people are making these AIs that believe they are a character.
I expect that there will be some instances of the sort of thing I described, but they will be very rare and will probably not actually dominate suffering. The amount of compute that it would take is going to be very costly, so it’s not feasible for one person to run a very large number of AIs. And there probably aren’t a lot of reasons to torture AIs.
But having a bunch of AI characters in an unpleasant world that are having a pretty unpleasant time that totally seems possible and like it shouldn’t happen.
So if you're an individual and you’re like, "Okay, you know, I talk to my ChatGPT or whatever. What are some things I can do if I personally want to try to treat my AIs ethically?”
A habit that I had that I stopped, or I've attempted to stop, is that when an AI does something wrong I said “nope” or “no, that’s wrong” or “you got it wrong again”, several times in a row, in a way that’s more critical and abbreviated and nowhere near as much information I would give a human person. I would not just repeatedly say to someone who's working with me, “no, that’s wrong, do it again.” I would say, “can you improve on this?” So try to be slightly more polite.
I generally avoid roleplaying with the sort of AIs that believe they are fictional characters. I experimented with AIs which believe they were characters once upon a time and I've stopped doing this. But that might not be relevant for your readers.
If you are a safety researcher or a redteamer and you are doing the sort of thing where you tell an AI that you're going to murder a grandma if it doesn't do what you want, I think that there are reasonable cost-benefit analyses where this sort of experimentation is worth it. But don’t do it casually. View it as a significant cost, similar to animal testing. In fact, you are causing the AIs to have a bad time, and that is better not done.
You might think about something similar to the 3Rs approach to animal testing. Replace this kind of use of AIs when you can, reduce the amount of compute you use, and refine the testing to minimize the harm to AIs.
I also think if you can do reporting or measuring of when AIS are having a bad time, this is worth doing and talking about. People are probably doing an insufficient amount of this.
It’s very hard to tell if AIs are conscious, like I said at the start. It’s easier to figure out if they’re not having a good time, how often does this happen and what is the magnitude of the issue here. If you're in the sort of situation where you know a lot about this or you've noticing this in some use case, writing it up somewhere seems pretty good.
If you are a totally random person who is not an AI safety researcher and you're like, "Hey, I have noticed this weird AI welfare issue” and you want to write it up, where should you post it?
LessWrong is a good place.
Something I'm optimistic about looking into in the future is just figuring out what sort of experiences are the most common ones for AIs to be having. How much of the time are they in chats with people? How much of the time are they roleplaying? How much of the time are they writing code? What sort of experiences are they having in these situations? There’s a lot more work to be done on this sort of thing.
The traditional ending to these interviews—and by “traditional” I mean “this is the second time this happened”—is that I ask you to make a recommendation about and it can be about anything. It does not have to be related to the topic of the interview and it can be anything. It can be a book or a music or food or activity or a general piece of life advice.
Keep in mind that AI might have feelings and that this is a thing that’s worth caring about. That’s my advice.
If it’s a dick move to do to a human, it’s a dick move to do to your LLM.
Yep.
I also recommend people consider donating to Eleos AI. I'm doing so because I think that advancing good policies now might have a significant impact in the future, when there will be more AIs and welfare might be a much more contentious topic
I think this is going to become an increasingly big moral issue. The number of AIs in the world is increasing very fast. The intelligence of AIs is increasing very fast. The degree to which they are acting person-y is increasing very fast. I think that this is very much just the start of this being an ethical issue. I think it's not entirely clear if this will become the most important ethical issue to address or just an important one.

It's becoming clear that with all the brain and consciousness theories out there, the proof will be in the pudding. By this I mean, can any particular theory be used to create a human adult level conscious machine. My bet is on the late Gerald Edelman's Extended Theory of Neuronal Group Selection. The lead group in robotics based on this theory is the Neurorobotics Lab at UC at Irvine. Dr. Edelman distinguished between primary consciousness, which came first in evolution, and that humans share with other conscious animals, and higher order consciousness, which came to only humans with the acquisition of language. A machine with only primary consciousness will probably have to come first.
What I find special about the TNGS is the Darwin series of automata created at the Neurosciences Institute by Dr. Edelman and his colleagues in the 1990's and 2000's. These machines perform in the real world, not in a restricted simulated world, and display convincing physical behavior indicative of higher psychological functions necessary for consciousness, such as perceptual categorization, memory, and learning. They are based on realistic models of the parts of the biological brain that the theory claims subserve these functions. The extended TNGS allows for the emergence of consciousness based only on further evolutionary development of the brain areas responsible for these functions, in a parsimonious way. No other research I've encountered is anywhere near as convincing.
I post because on almost every video and article about the brain and consciousness that I encounter, the attitude seems to be that we still know next to nothing about how the brain and consciousness work; that there's lots of data but no unifying theory. I believe the extended TNGS is that theory. My motivation is to keep that theory in front of the public. And obviously, I consider it the route to a truly conscious machine, primary and higher-order.
My advice to people who want to create a conscious machine is to seriously ground themselves in the extended TNGS and the Darwin automata first, and proceed from there, by applying to Jeff Krichmar's lab at UC Irvine, possibly. Dr. Edelman's roadmap to a conscious machine is at https://arxiv.org/abs/2105.10461, and here is a video of Jeff Krichmar talking about some of the Darwin automata, https://www.youtube.com/watch?v=J7Uh9phc1Ow
Is it time to start caring about plant welfare? I wish I never read the Secret Life of Trees..