Anthropic and OpenAI Report Findings of Joint AI Safety Tests
OpenAI and Anthropic, rivals in the AI space that closely guard their proprietary systems, joined forces for a misalignment evaluation, safety testing each other’s models to identify when and how they fall short of human values. Among the findings: reasoning models, including Anthropic’s Claude Opus 4 and Sonnet 4 and OpenAI’s o3 and o4-mini, resisted jailbreaks, while conversational models like GPT-4.1 were susceptible to prompts or techniques intended to bypass safety protocols. Although the results were unveiled as users complain that chatbots have become overly sycophantic, the tests were “primarily interested in understanding model propensities for harmful action,” per OpenAI. ...
The test conditions were not intended to recreate real-world situations but were aimed at understanding “the most concerning actions that these models might try to take when given the opportunity,” Anthropic reports in its detailed findings post. ...
With regard to hallucination, Anthropic’s Claude models “refused to answer up to 70 percent of questions when they were unsure of the correct answer,” while “OpenAI’s o3 and o4-mini models refuse to answer questions far less, but showed much higher hallucination rates, attempting to answer questions when they didn’t have enough information,” explains TechCrunch. ...
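To make that refusal-versus-hallucination tradeoff concrete, here is a minimal sketch of how the two rates might be computed from graded model answers. This is not the labs’ actual evaluation harness; the labels and data below are hypothetical.

```python
from collections import Counter

# Hypothetical graded outputs: each answer is labeled "correct",
# "refused" (model declined to answer), or "hallucinated"
# (model answered confidently but incorrectly).
graded_answers = [
    "correct", "refused", "hallucinated", "refused",
    "correct", "refused", "hallucinated", "correct",
]

counts = Counter(graded_answers)
total = len(graded_answers)

# Refusal rate is measured over all questions asked.
refusal_rate = counts["refused"] / total

# Hallucination rate is measured over the questions the model
# actually attempted: refusing more means attempting fewer,
# which can drive this rate down.
attempted = total - counts["refused"]
hallucination_rate = counts["hallucinated"] / attempted if attempted else 0.0

print(f"Refusal rate:       {refusal_rate:.0%}")
print(f"Hallucination rate: {hallucination_rate:.0%} of attempted answers")
```

Under this framing, a model that refuses often (like Claude in the reported tests) attempts fewer questions and hallucinates less, while one that rarely refuses (like o3 and o4-mini) answers more but is wrong more often.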
See the full story here: https://www.etcentric.org/anthropic-and-openai-report-findings-of-joint-ai-safety-tests/