...
During internal testing, the classifier successfully blocked 95% of 10,000 synthetic jailbreak attempts, compared to just 14% for an unprotected Claude model. However, the system carries a 23.7% computational overhead, increasing costs and energy consumption. It also mistakenly rejects 0.38% of safe queries, a tradeoff Anthropic deems acceptable.
Despite these advancements, Anthropic acknowledges that no AI safety system is foolproof. The company expects new jailbreak methods to emerge but claims its classifier can quickly adapt to novel threats.
From now until February 10, Anthropic is inviting the public to test its defenses by attempting to bypass the classifier and prompt Claude into generating restricted content on chemical weapons. Successful jailbreaks will be disclosed at the end of the test. ...
See the full story here: https://shellypalmer.com/2025/02/anthropic-challenges-hackers-to-jailbreak-its-ai-model/