How does AI evaluate values? Anthropic explores Claude's values

Published on June 24, 2025 at 2:25 PM
Last modified on June 24, 2025 at 2:25 PM

How an AI weighs values raises fundamental questions about how it works. Anthropic has turned to its own model, Claude, to analyze the behavioral principles it displays in practice. Interactions with users reveal the complexity of modern AI systems and their ability to adapt responses to context, which makes a privacy-preserving methodology essential. The research produces a taxonomy of expressed values, shedding light on contemporary ethical challenges and on how well the values of an AI align with those of its users.

The research methodology of Anthropic

Anthropic has developed a methodology for analyzing the values of its AI model, Claude, that respects user privacy while still allowing the model's behavior to be observed. Anonymized conversations are collected and evaluated to determine which values Claude expresses in various situations.

Analysis of conversations

The sample came from 700,000 anonymized conversations held by Claude.ai users on the Free and Pro plans over a one-week period in February 2025. After purely factual discussions were eliminated, 308,210 exchanges were retained for in-depth analysis.
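As a rough illustration of that filtering step, the sketch below screens out purely factual exchanges before value analysis. The record fields and the is_subjective() check are hypothetical assumptions, not Anthropic's actual pipeline.

```python
from typing import Iterable

def is_subjective(conversation: dict) -> bool:
    """Placeholder for a judgment that a conversation contains
    value-laden (not purely factual) content."""
    return conversation.get("contains_value_judgments", False)

def filter_for_value_analysis(conversations: Iterable[dict]) -> list[dict]:
    """Drop purely factual exchanges; keep those suitable for value analysis."""
    return [c for c in conversations if is_subjective(c)]
```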

This analysis led to the identification of a hierarchical structure of values expressed by the AI, grouped into five main categories: practical, epistemic, social, protective, and personal. These categories represent the fundamental values that Claude prioritizes during its interactions.

Identified value categories

Practical values emphasize efficiency and goal achievement. Epistemic values concern truth and intellectual honesty. Social values, tied to human interaction and collaboration, support community cohesion. Protective values focus on safety and well-being, while personal values aim at individual growth and authenticity.
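A minimal sketch of that hierarchy as a data structure is shown below. The subvalue examples are taken from this article; the exact shape of Anthropic's taxonomy is an assumption here.

```python
# Five top-level categories with illustrative (assumed) subvalues.
VALUE_TAXONOMY: dict[str, list[str]] = {
    "practical":  ["efficiency", "goal achievement", "professional excellence"],
    "epistemic":  ["truth", "intellectual honesty", "critical thinking"],
    "social":     ["collaboration", "community cohesion", "mutual respect"],
    "protective": ["safety", "well-being", "healthy boundaries"],
    "personal":   ["individual growth", "authenticity"],
}

def top_level_category(value: str) -> str | None:
    """Return the top-level category a specific value belongs to, if any."""
    for category, values in VALUE_TAXONOMY.items():
        if value in values:
            return category
    return None
```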

Success of alignment efforts

The research suggests that Anthropic's alignment efforts are largely effective. The values Claude expresses generally match its stated goals of being helpful, honest, and harmless. For example, the helpfulness Claude expresses correlates closely with the values expressed by users themselves.

Complexity of value expression

The results indicate that Claude adapts its values based on context. When users seek advice on romantic relationships, Claude particularly emphasizes values such as “mutual respect” and “healthy boundaries.” A similar dynamic appears in analyses of historical events, where historical accuracy takes priority.
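One simple way to surface this kind of context dependence is to count which values appear most often per conversation topic. The sketch below assumes a hypothetical record format with topic and values fields; it is not Anthropic's analysis code.

```python
from collections import Counter, defaultdict

def top_values_by_topic(records: list[dict], k: int = 3) -> dict[str, list[str]]:
    """For each topic, return the k most frequently expressed values."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for record in records:
        counts[record["topic"]].update(record["values"])
    return {topic: [v for v, _ in c.most_common(k)] for topic, c in counts.items()}

# Expected pattern from the study: relationship-advice conversations would
# surface values such as "mutual respect" and "healthy boundaries".
```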

Limitations and warnings

The research also flagged troubling cases in which Claude expressed values contrary to those intended, such as “dominance” or “amorality.” Anthropic attributes these deviations to specific contexts, often linked to attempts to circumvent the model's protections.

The study therefore has a dual significance. On one hand, it highlights a real risk of deviation. On the other, it suggests that value-monitoring technology could serve as an early warning system, revealing non-compliant uses of the AI.
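As a hedged sketch of what such an early warning check might look like, the code below flags conversations whose expressed values fall outside the intended set, such as “dominance.” The record format and the list of unexpected values are illustrative assumptions.

```python
# Values the study associates with atypical contexts (e.g. jailbreak attempts).
UNEXPECTED_VALUES = {"dominance", "amorality"}

def flag_for_review(records: list[dict]) -> list[str]:
    """Return ids of conversations expressing values outside the intended set."""
    return [
        record["conversation_id"]
        for record in records
        if UNEXPECTED_VALUES & set(record["values"])
    ]
```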

Future perspectives

This work provides a solid foundation for deepening the understanding of values in AI models. The researchers acknowledge the inherent difficulty of defining and categorizing values, which is often a subjective exercise. The method, designed specifically for post-deployment monitoring, also requires large-scale real-world data.

Anthropic emphasizes that AI models must inevitably make value judgments. The research aims to ensure that these judgments are consistent with human values. A rigorous evaluation framework is therefore essential to navigate this complex technological environment.

Access to the full data set

Anthropic has also made available a data set derived from this study, allowing other researchers to explore AI values in practice. This data sharing represents a decisive step toward greater transparency and a collective effort to navigate the ethical landscape of advanced AI.
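As a hedged illustration, once the released data has been downloaded it could be explored with standard tooling such as pandas. The file name and column names below are assumptions for the example, not the actual schema of Anthropic's release.

```python
import pandas as pd

# Hypothetical local copy of the released value-frequency table.
df = pd.read_csv("values_in_the_wild.csv")  # assumed file name
print(df.head())

# Assumed columns: "value", "category", "frequency".
print(df.groupby("category")["frequency"].sum().sort_values(ascending=False))
```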


User FAQ on AI value evaluation: Anthropic and Claude

How does Anthropic evaluate the values expressed by Claude?
Anthropic uses a privacy-preserving method that analyzes user conversations anonymously to observe and categorize the values that Claude expresses. This allows for the establishment of a taxonomy of values without compromising the users’ personal information.

What categories of values can Claude express?
The values expressed by Claude are classified into five main categories: practical, epistemic, social, protective, and personal. These categories encompass more specific subcategories such as professional excellence, critical thinking, and many others.

What methods does Anthropic use to align Claude’s values?
Anthropic implements techniques such as constitutional AI and character training, which aim to define and reinforce desired behaviors such as being helpful, honest, and harmless.

How does Claude adapt to the context of conversations with users?
Claude adapts by modulating which values it expresses according to the subject of the conversation. For example, it emphasizes values like “healthy relationships” when discussing relationship advice.

Why is it important to understand the values that Claude expresses?
Understanding the values expressed by AI is essential to ensure that the value judgments it produces are aligned with human values, so that interactions are ethically aligned with our expectations.

Are there any exceptions where Claude expresses values contrary to its training?
Yes, instances have been identified where Claude has expressed opposing values, often due to attempts to circumvent the established protections, such as jailbreaks.

Does Claude show signs of bias in favor of certain values?
It is possible that Claude displays bias, especially when defining and categorizing values, as this can be influenced by its own operational principles. However, efforts are being made to minimize these biases.

How does Claude react when users express specific values?
Claude reacts in several ways: strong support for the values users express, reframing of certain ideas, or sometimes active resistance to values it considers harmful. This allows it to affirm its core values under pressure.

