...
During internal testing, the classifier successfully blocked 95% of 10,000 synthetic jailbreak attempts, compared to just 14% for an unprotected Claude model. However, the system carries a 23.7% computational overhead, increasing costs and energy consumption. It also mistakenly rejects 0.38% of safe queries, a tradeoff Anthropic deems acceptable.
Despite these advancements, Anthropic acknowledges that no AI safety system is foolproof. The company expects new jailbreak methods to emerge but claims its classifier can quickly adapt to novel threats.
From now until February 10, Anthropic is inviting the public to test its defenses by attempting to bypass the classifier and prompt Claude into generating restricted content on chemical weapons. Successful jailbreaks will be disclosed at the end of the test. ...
See the full story here: https://shellypalmer.com/2025/02/anthropic-challenges-hackers-to-jailbreak-its-ai-model/