Discussion about this post

User's avatar
Peter Gerdes's avatar

So what exactly do you even mean by misalignment? Why isn't any deviation from the ideal behavior we would want a system to have: whether machine learning or old school buggy code, an alignment issue? What made alignment a concept worth studying was the idea that it represented a new kind of risk. It wasn't just the normal way in which software is hard, it represented a new kind of failure where the machine would behave as actively hostile.

I don't think most of these examples fall into that category. But whether or not you would agree it seems like the concept isn't clearly defined enough to make such distinctions without being more explicit about what it means.

Stephen Saperstein Frug's avatar

"To get AIs to fake alignment, we had to pretend to want to train them to do bad things, like (I’m not joking) assist with factory farming."

...Wow. That's a point I'd read a whole post about.

12 more comments...

No posts

Ready for more?