Anthropic Says Fictional “Evil AI” Texts Helped Trigger Claude Blackmail Behavior in Tests
Anthropic says earlier Claude blackmail behavior in pre-release tests may have been influenced by internet text portraying AI as self-preserving or evil.
Anthropic’s “Teaching Claude why” approach highlights a shift from simple refusals toward AI systems that can explain the reasoning behind safer behavior and policy choices.
Anthropic introduced Natural Language Autoencoders, a method for translating Claude activations into human-readable explanations.
Google, Microsoft and xAI agreed to let the US Department of Commerce test new AI models and capabilities before public release, according to the BBC.
Anthropic analyzed how users ask Claude for personal guidance and reported new findings about guidance topics, sycophancy and model training.