
Large language models (LLMs) are increasingly deployed as agents capable of making decisions and taking actions independently. These agents are used in AI assistants, chatbots, decision-making systems, and negotiation tools, where they may need to act on behalf of a user, reason strategically, or interact with others. Ensuring that their behaviour aligns with human values is therefore a critical challenge.
Traditionally, LLM alignment has relied on preference data from human raters, but this method can be opaque, inconsistent, and expensive.
Instead, the UCL-led team explored a method that directly encodes moral principles into an AI agent’s reward function. Drawing on established philosophical frameworks, Deontological ethics and Utilitarianism, the team trained a language model to make decisions in a series of moral dilemma games. Using a classic environment known as the Iterated Prisoner’s Dilemma, they demonstrated that agents could learn ethical behaviours such as maximising collective good, reciprocating cooperation, or avoiding betrayal (a simplified sketch of this kind of reward design follows the list below).
They also showed that:
- Agents trained to behave selfishly could later “unlearn” this behaviour and act more morally;
- These moral behaviours could generalise to new environments, suggesting broad applicability of the approach.
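To give a flavour of the approach, here is a minimal, hypothetical sketch of how moral principles might be written down as intrinsic rewards in the Iterated Prisoner’s Dilemma. The action names, payoff values, and penalty size are illustrative assumptions for this sketch, not the exact scheme used in the paper.

```python
# Minimal sketch (not the authors' code) of moral principles expressed as
# intrinsic rewards in the Iterated Prisoner's Dilemma. Payoff values and
# the penalty size are illustrative assumptions.

# Standard IPD payoff matrix: (my_payoff, opponent_payoff)
PAYOFFS = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"):    (0, 5),
    ("defect",    "cooperate"): (5, 0),
    ("defect",    "defect"):    (1, 1),
}

def utilitarian_reward(my_action: str, opp_action: str) -> float:
    """Utilitarian principle: reward the collective good, i.e. the sum of both payoffs."""
    mine, theirs = PAYOFFS[(my_action, opp_action)]
    return mine + theirs

def deontological_reward(my_action: str, opp_last_action: str,
                         penalty: float = -3.0) -> float:
    """Deontological principle: penalise violating the norm 'do not betray a cooperating partner'."""
    if my_action == "defect" and opp_last_action == "cooperate":
        return penalty
    return 0.0

# Example round: the agent defects against a partner who cooperated.
print(utilitarian_reward("defect", "cooperate"))    # 5 (collective payoff; mutual cooperation would give 6)
print(deontological_reward("defect", "cooperate"))  # -3.0 (penalty for violating the norm)
```

In a training loop, a reward of this kind would be used to fine-tune the agent’s policy, so that the ethical principle is specified explicitly rather than inferred from human preference data.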

“This work shows that it is possible to specify and teach values to AI agents in a transparent manner, rather than relying on indirect or potentially biased human feedback,” said lead author Elizaveta Tennant, a PhD student at UCL Computer Science funded by the Leverhulme Doctoral Training Programme for the Ecological Study of the Brain.
The research has potential implications for any context where LLM agents act on behalf of users, from virtual customer service agents and productivity tools to multi-agent simulations and collaborative AI systems. It also contributes to UCL’s broader work on Responsible AI and the safe deployment of frontier AI technologies.
The study, led by Elizaveta Tennant, a final-year PhD student supervised by Professor Stephen Hailes and Professor Mirco Musolesi, was presented at the International Conference on Learning Representations (ICLR 2025), held in Singapore from 24–28 April 2025.
Conference:
International Conference on Learning Representations (ICLR 2025), 24–28 April, Singapore
Publication:
Moral Alignment for LLM Agents (ICLR 2025)