Discussion about this post

Christopher Brennan

Okay, I'm still reading the post on cargo cults, but man this is amazing. It's striking how quickly they came up with new myths to explain strange new phenomena—which fits with suspicions I've had about how myths form, but it's nice to have it documented.

Also, this is a hilarious aside: "Heaven, which is not (as you may have been told by some foolish people) in Sydney, Australia, but rather in the sky above Sydney".

Timothy M.

Late comment from me (for some reason I was busy this weekend, wink) but:

> Interpretability isn’t sufficient for AI safety, because it’s easy to miss things, hard to measure progress, and nearly impossible to prove an AI isn’t deceiving you.

I continue to believe that *alignment*, as currently defined, is insufficient for AI safety, as I think it analogizes well to making AI instinctively nice. But anything that can truly be called AGI will have the trait, as humans do, that it can choose to do things that go against its instincts. E.g. soldiers train to kill people in a way that allows them to skip past their innate desire not to.

So even if you make an LLM that instinctively won't give me bomb-making instructions, I don't think that means it's never gonna give me bomb-making instructions, once it is embedded within some kind of more agentic structure.

