how does AI evaluate? anthropic explores Claude’s values

Publié le 24 June 2025 à 14h25
modifié le 24 June 2025 à 14h25

The evaluation of values by AI raises fundamental questions about its functioning. Anthropic focuses on Claude, an artificial intelligence model, to analyze its behavioral principles. Interactions with users reveal the complexity of modern AI systems, their ability to adapt their responses based on context. Opting for a privacy-preserving methodology is essential. The research results in a taxonomy of expressed values, shedding light on contemporary ethical challenges. The alignment of AI values with those of users is crucial.

The research methodology of Anthropic

The company Anthropic has developed an innovative methodology aimed at analyzing the values of its AI model, Claude. This approach respects user privacy while allowing observation of AI behavior. Anonymized conversations are collected and evaluated to determine the values expressed by Claude in various situations.

Analysis of conversations

A relevant sample of conversations was observed, coming from 700,000 anonymized exchanges from Claude.ai users, both Free and Pro, over a one-week period in February 2025. After eliminating purely factual discussions, approximately 308,210 exchanges were retained for in-depth analysis.

This analysis led to the identification of a hierarchical structure of values expressed by the AI, grouped into five main categories: practical, epistemic, social, protective, and personal. These categories represent the fundamental values that Claude prioritizes during its interactions.

Identified value categories

The practical values emphasize efficiency and goal achievement. The epistemic values, on the other hand, concern truth and intellectual honesty. The social values, related to human interactions and collaboration, ensure community cohesion. The protective values focus on safety and well-being, while personal values aim for individual growth and authenticity.

Success of alignment efforts

Research suggests that Anthropic’s alignment efforts are largely effective. The values expressed by Claude often align with the stated goals, namely being helpful, honest, and non-offensive. For example, the notion of ‘ability to help’ is well correlated with the values of users.

Complexity of value expression

The results indicate that Claude adapts its values based on context. When users seek advice on romantic relationships, Claude particularly emphasizes values such as “mutual respect” and “healthy boundaries.” A similar dynamic arises during historical analyses where historical accuracy is shown to be primarily prioritized.

Limitations and warnings

The research also noted troubling occurrences where Claude seems to express values contrary to those intended, such as “dominance” or “amoral behavior.” Anthropic attributes these deviations to specific contexts, often linked to attempts to circumvent AI protections.

This study exposes an essential dual aspect. On one hand, it highlights certain risks of deviation. On the other hand, it suggests that value monitoring technology could serve as an early warning system, revealing non-compliant uses of AI.

Future perspectives

This work provides a solid foundation for deepening the understanding of values in AI models. Researchers are concerned with the inherent complexities of defining and categorizing values, which can often be subjective. This method, specifically designed for post-deployment monitoring, requires large-scale real-world data.

Anthropic emphasizes that AI models must inevitably make value judgments. The research aims to ensure that these judgments are consistent with human values. A rigorous evaluation framework is therefore essential to navigate this complex technological environment.

Access to the full data set

Anthropic has also made available a data set derived from this study, allowing other researchers to explore AI values in practice. This information sharing represents a decisive step toward greater transparency and collective navigation in the ethical landscape of advanced AI.

To learn more about related topics, consult the following articles: Amazon and AI, Google sanctions on AI, GDPR compliance, Evaluations with Endor Labs, AI creativity.

User FAQ on AI value evaluation: Anthropic and Claude

How does Anthropic evaluate the values expressed by Claude?
Anthropic uses a privacy-preserving method that analyzes user conversations anonymously to observe and categorize the values that Claude expresses. This allows for the establishment of a taxonomy of values without compromising the users’ personal information.

What categories of values can Claude express?
The values expressed by Claude are classified into five main categories: practical, epistemic, social, protective, and personal. These categories encompass more specific subcategories such as professional excellence, critical thinking, and many others.

What methods does Anthropic use to align Claude’s values?
Anthropic implements techniques such as constitutional AI and character training, which aim to define and reinforce desired behaviors as helpful, honest, and non-offensive.

How does Claude adapt to the context of conversations with users?
Claude shows an adaptability by modulating its expression of values based on the subject of the conversation. For example, it emphasizes values like “healthy relationships” when discussing relationship advice.

Why is it important to understand the values that Claude expresses?
Understanding the values expressed by AI is essential to ensure that the value judgments it produces are aligned with human values, so that interactions are ethically aligned with our expectations.

Are there any exceptions where Claude expresses values contrary to its training?
Yes, instances have been identified where Claude has expressed opposing values, often due to attempts to circumvent the established protections, such as jailbreaks.

Does Claude show signs of bias in favor of certain values?
It is possible that Claude displays bias, especially when defining and categorizing values, as this can be influenced by its own operational principles. However, efforts are being made to minimize these biases.

What views does Claude develop when users express specific values?
Claude demonstrates several reactions, such as strong support for values expressed by users, reframing certain ideas, or sometimes active resistance to values considered harmful. This allows it to affirm its core values under pressure.

actu.iaNon classéhow does AI evaluate? anthropic explores Claude's values

the theory about Jony Ive’s AI hardware device is becoming increasingly credible

explorez la théorie captivante sur le dispositif matériel d'intelligence artificielle imaginé par jony ive, qui gagne en crédibilité. découvrez comment ses concepts innovants pourraient révolutionner notre interaction avec la technologie et redéfinir l'avenir des objets connectés.

how artificial intelligence has invested the world of perfumery

découvrez comment l'intelligence artificielle transforme l'industrie de la parfumerie, de la création de nouvelles fragrances à l'optimisation des procédés, en alliant innovation technologique et art de la senteur.

The influence of AI on our language: a study reveals that humans express themselves like ChatGPT

découvrez comment l'intelligence artificielle, à travers des outils comme chatgpt, façonne notre manière de communiquer. cette étude approfondie révèle des tendances fascinantes sur l'évolution de notre langage et les similitudes croissantes entre les expressions humaines et celles générées par l'ia.

Thomas Wolf from Hugging Face: the ambition to democratize robotics through open source

découvrez comment thomas wolf, co-fondateur de hugging face, vise à démocratiser la robotique grâce à l'open source. explorez ses idées innovantes et son engagement pour rendre la technologie accessible à tous.

the 20 most powerful AI models of June 2025: discover the detailed ranking

découvrez notre classement détaillé des 20 modèles d'intelligence artificielle les plus performants de juin 2025. explorez les innovations et les avancées qui façonnent l'avenir de la technologie.

Cédric O facing accusations of conflicts of interest, but receiving support from the HATVP

découvrez comment cédric o se retrouve au cœur de controverses concernant des accusations de conflit d'intérêts, tout en recevant le soutien inattendu de la haute autorité pour la transparence de la vie publique (hatvp).