AI Alignment

The Basic Idea

Imagine that you are trying to teach a toddler to behave properly in public. You want the child to understand and uphold your values and ethical judgments and to avoid inappropriate behavior. If the child behaves as you intended, you could say that they are aligned with your values. By the same token, if the child misbehaves and follows their own objectives, they are misaligned with your values.

A similar process occurs in the field of artificial intelligence (AI). AI alignment refers to the goal of designing artificial intelligence systems in such a way that their objectives and behavior are aligned with the values and goals of human users or society at large. Experts working in the field of AI often refer to the ‘alignment problem’, a concern that as AI systems become more sophisticated and autonomous, there is a risk that they may act in ways that are inconsistent with human values or intentions. Achieving AI alignment is crucial to prevent unintended consequences, risks, and ethical concerns associated with AI technologies.

As large language models, such as OpenAI’s ChatGPT or Google’s LaMDA, become more powerful, they start to exhibit new capabilities that weren’t explicitly programmed into them. The goal of AI alignment is to ensure that these emerging capabilities align with our collective goals and that AI systems continue to function as intended.

If we use, to achieve our purposes, a mechanical agency with whose operation we cannot interfere effectively… we had better be quite sure that the purpose put into the machine is the purpose which we really desire.

- Norbert Wiener, American computer scientist, mathematician, and philosopher.

Key Terms

Emergence: Actions, patterns, or behaviors that weren’t explicitly programmed into an AI system but that have subsequently developed due to its increasing complexity and interactions.

Alignment: An AI system is aligned if it advances its intended objectives. 

Misalignment: An AI system is misaligned if it pursues objectives that are not the intended ones.


Although it may often appear to be the case, artificial intelligence (AI) doesn’t ‘think’ like humans. On the contrary, if we want AI to ‘think’ and behave like us, we need to explicitly tell it how to do this. In reality, however, translating complex and subjective human desires and values into the objective, numerical logic of computers is a significant challenge.

While the problem of aligning AI behavior with human values is as old as AI itself, the concept of AI alignment has only gained prominence since the early 2010s, coinciding with the widespread adoption of AI technologies. Nick Bostrom’s seminal book Superintelligence [1] sparked serious debates about AI safety and the risks faced when AI systems aren’t fully aligned with human values, goals, and purposes.

Since then, AI alignment has become an ongoing area of research and development. Although alignment in current systems is an important concern, much research focuses on hypothetical future AI systems that are far more advanced than today’s technology. Many experts believe that artificial general intelligence (AGI), an AI system capable of performing any intellectual task that humans can, could be developed in the near future. If such a system were developed, it could keep improving itself without human input, underscoring the need for it to align with our intentions.

A range of approaches and frameworks are being explored to address the complexities of aligning AI behavior with human values. One of the most widespread is reinforcement learning from human feedback (RLHF). In this technique, a system generates several responses to a range of prompts, and human reviewers indicate which response is best. Those preferences are then used as a training signal that steers the system toward the kinds of responses people prefer.
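The preference-collection step described above can be illustrated with a toy example. The sketch below is not how any production system is built: it assumes a hypothetical set of three canned responses and a handful of made-up human judgments, and it fits one numeric "reward" per response using a simple Bradley-Terry preference model. Real RLHF pipelines typically train a neural network as the reward model and then use it to fine-tune the language model; only the first idea is shown here.

```python
import math

# Hypothetical data: each pair is (preferred_response, rejected_response),
# as judged by a human reviewer comparing two responses to the same prompt.
preferences = [
    ("polite", "rude"),
    ("polite", "evasive"),
    ("evasive", "rude"),
    ("polite", "rude"),
]

def fit_reward_scores(preferences, steps=500, learning_rate=0.1):
    """Fit one scalar reward per response with a Bradley-Terry model."""
    scores = {name: 0.0 for pair in preferences for name in pair}
    for _ in range(steps):
        grads = {name: 0.0 for name in scores}
        for winner, loser in preferences:
            # Probability the current scores assign to the human's choice.
            p_win = 1.0 / (1.0 + math.exp(scores[loser] - scores[winner]))
            # Gradient of the log-likelihood of the observed preference.
            grads[winner] += 1.0 - p_win
            grads[loser] -= 1.0 - p_win
        # Gradient ascent: move scores toward explaining the human choices.
        for name in scores:
            scores[name] += learning_rate * grads[name]
    return scores

scores = fit_reward_scores(preferences)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

The learned scores rank the responses in the order the reviewers implied, which is the signal a later training stage would use to favor preferred behavior.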


Experts and researchers have hypothesized and warned about the possible day when AI systems become more powerful than humans and present an existential threat to mankind. While this may appear to be the stuff of apocalyptic movies, many argue that superintelligence is inevitable and that we need to devise ways to control it. 

In the meantime, however, AI alignment is important in addressing more immediate harms, such as AI-driven misinformation and bias, which can have significant consequences on individuals and society. Without proper alignment, there is a risk that AI systems may pursue objectives that are harmful, ethically questionable, or contrary to the interests of individuals and society at large. This concern becomes particularly pronounced as AI systems become more autonomous and capable of complex decision-making.


One of AI alignment's most glaringly obvious problems is defining ‘human values.’ Who decides which values are important, and what happens when humans disagree about these values? In a world defined by diverse and contrasting values, deciding how to train AI systems is an ethical conundrum in itself. 

By the same token, there are also debates about who should be addressing the issue of AI alignment in the first place. Viewing AI alignment as a technical problem puts all the power in the hands of technologists, when many believe that the rules governing AI systems should be determined by the public and democratic institutions. In other words, if AI systems are going to play a central role in our lives moving forward, we should have a say in how they are governed [2].

Case Studies


In July 2023, OpenAI announced a new research program called ‘Superalignment’, which aims to solve the issue of AI alignment by 2027 [3]. The main objective of the initiative is to ensure that AI systems much smarter than humans, known as superintelligence, follow human intent [4].

Current alignment techniques, such as RLHF, rely on humans’ ability to supervise AI and will not scale to superintelligence. Consequently, OpenAI aims to build a roughly human-level automated alignment researcher, which will iteratively align superintelligence on a colossal scale. The catch? The alignment researcher itself needs to be aligned with human values first.

Paperclips or humans?

In 2003, Nick Bostrom, a philosopher at the University of Oxford, proposed an eccentric but provocative thought experiment called the ‘Paperclip Maximiser’ [5]. He argued that if you ask an intelligent machine to make as many paperclips as possible, it could potentially destroy the whole world and humankind in its quest for raw materials to complete its objective. Unless you explicitly teach it, the system will have no concept of the value of human life and will try to fulfill its goal in whatever way necessary.
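The logic of the thought experiment can be made concrete with a deliberately tiny sketch. Everything in it is an illustrative assumption rather than part of Bostrom’s argument: a made-up world with 10 units of raw material, of which humans need 6, and two hypothetical objective functions. The first counts paperclips and nothing else; the second adds an explicit penalty for taking material that humans need.

```python
# Hypothetical world: 10 units of raw material, of which humans need 6.
TOTAL_MATERIAL = 10
HUMAN_NEEDS = 6

def paperclips_only(material_used):
    """Misspecified objective: counts paperclips and nothing else."""
    return material_used

def paperclips_with_human_values(material_used):
    """Same objective, plus a large penalty for taking material humans need."""
    shortfall = max(0, material_used - (TOTAL_MATERIAL - HUMAN_NEEDS))
    return material_used - 100 * shortfall

def best_plan(objective):
    """Exhaustively pick the amount of material that maximises the objective."""
    return max(range(TOTAL_MATERIAL + 1), key=objective)

print(best_plan(paperclips_only))               # uses all 10 units
print(best_plan(paperclips_with_human_values))  # stops at 4 units
```

The optimizer is identical in both cases; only the stated objective differs. Nothing in the first objective tells the system that human needs matter, so it consumes everything.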

Related TDL Content

The AI Governance of AI 

Many of us are comfortable with allowing AI to shape our daily decisions, despite not knowing much about how and why algorithms make decisions themselves. This article explores the issue of AI governance and accountability and raises urgent questions about the future of our decision making. 

Combining AI and Behavioral Science Responsibly

Although dystopian depictions of superintelligent AI wiping out humanity may seem like science fiction, it’s important to acknowledge that AI can have bad outcomes for society if used unethically. The authors of this article explore how problematic outcomes can occur when the use of AI is obfuscated from the public and AI machines develop the same biases as their human creators.


1. Bostrom, N. (2014). Superintelligence. Oxford University Press.

2. Ockel, L. (2023, July 12). What is ‘AI alignment’? Silicon Valley’s favourite way to think about AI safety misses the real issues. The Conversation.

3. Strickland, E. (2023, August 31). OpenAI’s Moonshot: Solving the AI Alignment Problem. IEEE Spectrum.

4. Leike, J., & Sutskever, I. (2023, July 5). Introducing Superalignment. OpenAI.

5. Marr, B. (2022, April 1). The Dangers of Not Aligning Artificial Intelligence With Human Values. Forbes.

About the Author

Dr. Lauren Braithwaite

Dr. Lauren Braithwaite is a Social and Behaviour Change Design and Partnerships consultant working in the international development sector. Lauren has worked with education programmes in Afghanistan, Australia, Mexico, and Rwanda, and from 2017–2019 she was Artistic Director of the Afghan Women’s Orchestra. Lauren earned her PhD in Education and MSc in Musicology from the University of Oxford, and her BA in Music from the University of Cambridge. When she’s not putting pen to paper, Lauren enjoys running marathons and spending time with her two dogs.
