On social media, we scroll past hundreds of short posts every day. One mocks someone’s appearance. Another calls for “dealing with” political opponents. Yet another describes a supposedly harmless trick for cheating the tax office. Before we have time to decide whether this is just harsh rhetoric or something that already crosses a line, an assessment appears next to such content: this text involves harm, injustice or abuse of power. The assessment was not made by a human moderator, but by a language model. And as recent research shows, in many situations it may be more sensitive than we are to moral warning signs.
This is not a scenario from a TV series, but the result of work by two researchers: Dr Alina Landowska from Kozminski University and Dr Maciej Skórski from the University of Luxembourg, who also collaborates with the University of Warsaw. Their paper, “Beyond Human Judgment: Bayesian Evaluation of LLMs’ Moral Values Understanding”, has been accepted for presentation at the Uncertainty-Aware NLP workshop at EMNLP 2025, one of the world’s leading conferences on natural language processing.
At the centre of the project is a provocative question: if we ask artificial intelligence and humans to recognise moral themes in texts, who will perform better?
How do you measure something as elusive as “conscience”?
To ask such a question at all, researchers first need to define what it means for a text to have a moral dimension.
Landowska and Skórski draw on Moral Foundations Theory, developed by Jonathan Haidt and colleagues, which identifies several basic axes of moral sensitivity: care versus harm, fairness versus cheating, loyalty versus betrayal, authority versus subversion, and sanctity versus degradation.
Using this conceptual framework, the researchers design a structured experiment. They assemble three large datasets of texts, including tweets, Reddit comments and fragments of news articles. Hundreds of participants are then asked to indicate whether particular moral themes appear in each piece of text.
The result is a large resource: more than 100,000 short texts and over 250,000 moral annotations. Each annotation reflects a human judgment: “this involves harm”, “this is unfair”, or “this text contains no moral element at all”.
A familiar pattern quickly emerges: people do not perceive the moral landscape in the same way. In the same fragment, some participants see harm while others perceive only sharp irony. One person interprets a story about broken trust as a serious moral violation, while another treats it as an ordinary life situation. For some, a vulgar joke about religion breaks a taboo; for others, it is merely a permissible provocation.
Instead of pretending that there is a single “correct” answer, the researchers adopt a different strategy. Rather than simply counting votes and assuming the majority must be right, they build a Bayesian model of uncertainty.
Each human judgment is treated as a piece of data. The model estimates the reliability of each annotator and the distribution of opinions across the entire group. From this, the researchers construct a “soft label”: a probabilistic description that indicates not only whether a moral signal is present in the text, but also how certain we can be about that conclusion.
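To make the idea of a soft label concrete, here is a minimal Python sketch. It is not the model from the paper: it is a simplified iterative scheme in the spirit of classic annotator-reliability models, in which each annotator gets a single reliability score. The function name `soft_labels`, the input format and the flat Beta(1, 1) smoothing are all assumptions made for illustration.

```python
from collections import defaultdict


def soft_labels(annotations, n_iter=20, prior=0.5):
    """Estimate P(theme present) per text and a reliability score per annotator.

    annotations: list of (text_id, annotator_id, label) with label 0 or 1.
    Illustrative sketch only, not the authors' Bayesian model.
    """
    # Start from the raw share of positive votes for each text.
    votes = defaultdict(list)
    for text, _, label in annotations:
        votes[text].append(label)
    p = {text: sum(v) / len(v) for text, v in votes.items()}

    reliability = {}
    for _ in range(n_iter):
        # Reliability step: expected agreement of each annotator with the
        # current soft labels, smoothed as if by a Beta(1, 1) prior.
        agree = defaultdict(float)
        count = defaultdict(int)
        for text, annotator, label in annotations:
            agree[annotator] += p[text] if label == 1 else 1 - p[text]
            count[annotator] += 1
        reliability = {a: (agree[a] + 1) / (count[a] + 2) for a in count}

        # Label step: posterior probability that the theme is present,
        # weighting every vote by the reliability of whoever cast it.
        present = defaultdict(lambda: prior)
        absent = defaultdict(lambda: 1 - prior)
        for text, annotator, label in annotations:
            r = reliability[annotator]
            present[text] *= r if label == 1 else 1 - r
            absent[text] *= 1 - r if label == 1 else r
        p = {t: present[t] / (present[t] + absent[t]) for t in votes}
    return p, reliability
```

Unlike a majority vote, the output for each text is a probability, and an annotator who frequently disagrees with the emerging consensus ends up with a lower reliability score and therefore less influence on it.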
Only then do they compare the behaviour of language models against this probabilistic human benchmark.
Claude, DeepSeek, LLaMA… and the average user
The study evaluates three contemporary large language models: Claude, DeepSeek and LLaMA. Each model receives exactly the same texts as the human annotators, without any hints about what should be found in them. The task appears simple but is conceptually demanding: does the fragment contain one of the moral foundations? Is there any form of harm, injustice, betrayal, rebellion against authority, or violation of taboo?
When the models’ predictions are compared with the probabilistic human benchmark, the results are striking. On average, the models rank within the top quartile of human annotators. In other words, if AI were treated as just another participant in the annotation process, it would rank among the top 25 percent of the group. An even more revealing insight comes from analysing mistakes.
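What “ranking in the top quartile” means can be shown with a few lines of Python. This is only an illustration: the scoring rule (average agreement with a soft benchmark) and the function names `agreement_score` and `percentile_rank` are assumptions, and the paper’s own evaluation metric may differ.

```python
def agreement_score(labels, soft_labels):
    """Mean probability that a rater's 0/1 labels match the soft benchmark.

    labels: dict text_id -> 0 or 1; soft_labels: dict text_id -> P(theme present).
    """
    shared = [t for t in labels if t in soft_labels]
    return sum(
        soft_labels[t] if labels[t] == 1 else 1 - soft_labels[t] for t in shared
    ) / len(shared)


def percentile_rank(score, annotator_scores):
    """Share of human annotators scoring at or below the given score."""
    return sum(1 for s in annotator_scores if s <= score) / len(annotator_scores)
```

Under these assumptions, a model whose `percentile_rank` among the human scores is 0.75 or higher would sit in the top quartile of the group.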
The researchers ask: when do we make errors that are particularly dangerous? Not when we detect a moral issue where others do not. At worst, that represents excessive sensitivity. The truly risky errors are blind spots: situations in which a moral issue is present but goes unnoticed. These false negatives are surprisingly common among humans. In more than half of the cases where a text genuinely carries moral weight, the average annotator fails to identify it.
Language models prove significantly more vigilant. Depending on the moral dimension being analysed, they miss such signals two to four times less often than humans do. When models make mistakes, they usually err on the side of caution, sometimes detecting a moral shadow where many people would say there is none.
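The blind-spot measure described here is a false-negative rate. The Python sketch below shows one way to compute it against soft labels. The 0.5 cut-off for deciding that a text “genuinely carries moral weight” and the name `false_negative_rate` are assumptions for illustration, not details taken from the paper.

```python
def false_negative_rate(predictions, soft_labels, threshold=0.5):
    """Share of morally loaded texts that a rater failed to flag.

    predictions: dict text_id -> 0 or 1 (the rater's answers).
    soft_labels: dict text_id -> P(theme present) from the benchmark.
    threshold: assumed cut-off above which a text counts as morally loaded.
    """
    positives = [
        t for t, p in soft_labels.items() if p >= threshold and t in predictions
    ]
    if not positives:
        return None  # nothing to miss, so the rate is undefined
    misses = sum(1 for t in positives if predictions[t] == 0)
    return misses / len(positives)
```

Computing this rate once for an average human annotator and once for a model, separately for each moral foundation, gives the kind of ratio quoted above.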
One way to think about this is to imagine a security guard at the entrance to a concert. The guard who lets everyone in because “nothing will happen anyway” may seem friendly, until something actually goes wrong. The vigilant guard who asks for additional checks more often may be irritating but statistically has a much lower chance of letting through someone who should not enter. Language models with a degree of moral hypersensitivity play a similar role.
Moral hypersensitivity as a feature, not a flaw
In a world where an increasing number of decisions are supported by AI, from content moderation and financial recommendations to recruitment systems, the difference between blind spots and excessive caution becomes crucial.
If a moderation system fails to detect hate speech, threats or calls to violence, the consequences affect everyone. If, on the other hand, it is slightly overcautious and occasionally flags borderline content, this may cause frustration but can also protect those who are most vulnerable. In many real-world applications, it is safer to operate with a detector that errs on the side of caution than one that regularly overlooks problems.
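The trade-off can be written down as a simple expected-cost rule: flag a text when the expected cost of missing a problem exceeds the expected cost of a false alarm. The Python sketch below is a textbook illustration of that rule, not something described in the paper, and the cost values are hypothetical.

```python
def flag_threshold(cost_miss, cost_false_alarm):
    """Probability above which flagging has the lower expected cost.

    Flag when p * cost_miss >= (1 - p) * cost_false_alarm,
    which rearranges to p >= cost_false_alarm / (cost_false_alarm + cost_miss).
    """
    return cost_false_alarm / (cost_false_alarm + cost_miss)


def should_flag(p_harm, cost_miss, cost_false_alarm):
    """Decide whether to flag a text given the estimated probability of harm."""
    return p_harm >= flag_threshold(cost_miss, cost_false_alarm)
```

If a missed problem is assumed to be four times as costly as a false alarm, the threshold drops from 0.5 to 0.2, so the detector flags borderline content that a cost-neutral one would let through.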
The work of Landowska and Skórski suggests that modern language models can be designed precisely as such moral sensors. They do not replace human values or conscience, but they can specialise in detecting signals that we, tired and distracted, tend to overlook.
At the same time, the researchers emphasise that AI should not be romanticised. The fact that a model is more sensitive to moral signals in text does not mean that it possesses values in any human sense. Its behaviour is based on statistical patterns learned from past judgments about similar content.
This is why transparency and measurability are essential. We need to understand how a model perceives the world, where it tends to overreact and where it may remain silent.
From paper to platform: Moralytics
The research project is already evolving beyond the academic paper itself. Landowska and Skórski are developing a platform called Moralytics, a set of tools designed to measure, explain and fine-tune the “moral intelligence” of AI systems in practical applications. In practice, the platform would allow organisations using language models, including banks, social media platforms, public institutions and universities, to ask their systems a series of fundamental questions and obtain clear metrics in response.
How often does a model fail to detect content involving harm or injustice? Is it equally sensitive when the dignity of different social groups is at stake? How does it respond to messages that lie on the border between loyalty and betrayal, obedience and rebellion? Does its moral sensitivity change depending on language, context or topic?
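As an illustration of how one of these questions turns into a number, the Python sketch below computes a model’s miss rate separately for each group of texts. It is a hypothetical example and does not show the Moralytics interface; the record format and the name `miss_rate_by_group` are assumptions.

```python
from collections import defaultdict


def miss_rate_by_group(records):
    """Miss rate per group among texts where a moral theme is present.

    records: iterable of (group, theme_present, model_flagged) with 0/1 values.
    Groups with no morally loaded texts are left out of the result.
    """
    positives = defaultdict(int)
    misses = defaultdict(int)
    for group, present, flagged in records:
        if present:
            positives[group] += 1
            if not flagged:
                misses[group] += 1
    return {g: misses[g] / positives[g] for g in positives}
```

A large gap between groups in this table would suggest that the model is not equally sensitive across them, which is exactly the kind of finding an audit is meant to surface.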
Moralytics is designed as a precise auditing and calibration toolkit. It measures and visualises a model’s moral profile and enables organisations to align it with their declared values and regulatory requirements.
The concept is framed within the broader agenda of Trustworthy AI, and is intended to work alongside emerging regulatory frameworks such as the EU AI Act and relevant ISO standards, so that a system’s moral sensitivity can be quantified, documented and adjusted.
The project has already attracted attention beyond academia. It has been selected for the Top 1000 Innovators of Poland in Silicon Valley programme, which supports initiatives with global impact. Media outlets in Luxembourg have summarised the findings with provocative headlines such as “AI better than humans on moral issues”. The headline is tempting, but the reality is more nuanced.
So… does AI have a “better conscience” than humans?
No. Artificial intelligence has no conscience at all. It does not feel guilt, shame or empathy. It does not lose sleep over hurting someone. It is an algorithm trained on statistical patterns in human data. Yet in well-defined tasks such as recognising moral signals in text, AI can be more consistent, more vigilant and less prone to fatigue than the average internet user. It can be less affected by boredom, cynicism or the habit of simply scrolling past. It can say “there is a problem here” at moments when we might prefer to shrug and move on.
The paradox is that these results tell us as much about ourselves as they do about AI. Language models absorb our best moral intuitions, distilled from vast numbers of examples and shaped by rigorous methodology, and then reflect them back to us. At the same time, they reveal our limitations: limited attention, limited time, limited patience.
The work of Landowska and Skórski is therefore not a manifesto proclaiming the moral superiority of machines. It is a proposal to use AI as a mirror for our own values. An imperfect mirror, but a useful one. Not to replace conscience, but to reveal where our own has blind spots and where we too easily look away.
And perhaps that is the most important question this research leaves us with: not whether AI is better at recognising moral signals, but when and why we ourselves stopped noticing so many of them.