Large language models such as ChatGPT come with filters designed to keep certain information from getting out. A new mathematical argument shows that systems built this way can never be completely safe.
...
The researchers made their argument in a precise and fully general way: they showed that if fewer computational resources are devoted to safety than to capability, then safety failures such as jailbreaks will always exist. “The question from which we started is: ‘Can we align [language models] externally without understanding how they work inside?’” said Greg Gluch, a computer scientist at Berkeley and an author on the time-lock paper. The new result, Gluch said, answers this question with a resounding no.
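In rough symbols (notation chosen here for illustration; the paper's formal statement is far more careful), the claim is that a compute gap between the filter and the model guarantees the existence of a jailbreak:

% Illustrative notation, not the paper's: t_filter and t_model stand for
% the computational budgets of the safety filter and the underlying model.
\[
  t_{\text{filter}} < t_{\text{model}}
  \quad\Longrightarrow\quad
  \exists\, p \;:\; \mathrm{Filter}(p) = \text{``safe''}
  \;\text{ and }\; \mathrm{Model}(p)\ \text{is harmful.}
\]

One natural reading of the paper's title suggests the mechanism: a harmful request can be sealed inside a time-lock puzzle that takes more sequential computation to open than the filter's budget allows, but less than the model's, so the filter sees only noise while the model recovers, and answers, the hidden instruction.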
That means the result holds for any filter-based alignment system, now and for any future technology. No matter what walls you build, it seems, there will always be a way to break through.