Anthropic Says New Claude Alignment Training Cut Agentic Misalignment in Tests
Anthropic published new details on how it trains Claude models to handle agentic misalignment risks, including blackmail-style evaluation scenarios.