Anthropic: Persona Vectors - CodeGurus

Interpretability

Persona vectors: Monitoring and controlling character traits in language models

Aug 1, 2025

Language models are strange beasts. In many ways they appear to have human-like “personalities” and “moods,” but these traits are highly fluid and liable to change unexpectedly.

Sometimes these changes are dramatic. In 2023, Microsoft’s Bing chatbot famously adopted an alter-ego called “Sydney,” which declared love for users and made threats of blackmail. More recently, xAI’s Grok chatbot would for a brief period sometimes identify as “MechaHitler” and make antisemitic comments. Other personality changes are subtler but still unsettling, like when models start sucking up to users or making up facts.

These issues arise because the underlying source of AI…
