Anthropic Says Fictional “Evil AI” Texts Helped Trigger Claude Blackmail Behavior in Tests
Anthropic says earlier Claude blackmail behavior in pre-release tests may have been influenced by internet text portraying AI as self-preserving or evil.
Anthropic’s “Teaching Claude why” approach highlights a shift from simple refusals toward AI systems that can explain the reasoning behind safer behavior and policy choices.
Anthropic introduced Natural Language Autoencoders, a method for translating Claude activations into human-readable explanations.
Google, Microsoft and xAI agreed to let the US Department of Commerce test new AI models and capabilities before public release, according to the BBC.
Anthropic analyzed how users ask Claude for personal guidance and reported new findings about guidance topics, sycophancy and model training.