AI Alignment

The Basic Idea

Imagine that you are trying to teach a toddler to behave properly in public. You want the child to understand and uphold your values and ethical judgments and to avoid inappropriate behavior. If the child behaves as you intended, you could say that they are aligned with your values. By the same token, if the child misbehaves and pursues their own objectives, they are misaligned with your values.

A similar process occurs in the field of artificial intelligence (AI). AI alignment refers to the goal of designing AI systems so that their objectives and behavior are aligned with the values and goals of their human users or of society at large. Experts in the field often refer to the ‘alignment problem’: the concern that, as AI systems become more sophisticated and autonomous, they may act in ways that are inconsistent with human values or intentions. Achieving AI alignment is crucial for preventing the unintended consequences, risks, and ethical problems that AI technologies can create.

As large language models such as OpenAI’s ChatGPT or Google’s LaMDA become more powerful, they begin to exhibit new capabilities that were never explicitly programmed into them. The goal of AI alignment is to ensure that these emerging capabilities are consistent with our collective goals and that AI systems continue to function as intended.

“If we use, to achieve our purposes, a mechanical agency with whose operation we cannot interfere effectively… we had better be quite sure that the purpose put into the machine is the purpose which we really desire.”

- Norbert Wiener, American computer scientist, mathematician, and philosopher.

Key Terms

Emergence: Actions, patterns, or behaviors that were not explicitly programmed into an AI system but develop as a result of its increasing complexity and interactions.

Alignment: An AI system is aligned if it advances the objectives its designers intended.

Misalignment: An AI system is misaligned if it pursues objectives other than the intended ones.

History

Although it may often appear otherwise, AI does not ‘think’ like humans do. If we want AI to ‘think’ and behave like us, we need to tell it explicitly how to do so, and translating complex, subjective human desires and values into the objective, numerical logic of computers is a significant challenge.

While the problem of aligning AI behavior with human values is as old as AI itself, the concept of AI alignment has only gained prominence since the early 2010s, coinciding with the widespread adoption of AI technologies. Nick Bostrom’s seminal 2014 book Superintelligence¹ sparked serious debate about AI safety and the risks posed by AI systems that are not fully aligned with human values, goals, and purposes.

Since then, AI alignment has become an active area of research and development. Although alignment in current systems is an important concern, much of this research focuses on hypothetical future AI systems that are far more advanced than today’s technology. Many experts believe that artificial general intelligence (AGI), an AI system capable of performing any intellectual task a human can, could be developed in the near future. Such a system might keep improving itself without human input, which underscores the need for it to be aligned with our intentions.

A range of approaches and frameworks is being explored to address the complexities of aligning AI behavior with human values. One of the most widespread is reinforcement learning from human feedback (RLHF). In this technique, the system produces several responses to each of a range of prompts, and human evaluators judge which response is best. These judgments are used to train a reward model, which in turn is used to fine-tune the system so that it favors the kinds of responses people preferred.
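To make the feedback loop more concrete, here is a minimal Python sketch of the reward-modeling step only. It assumes that each response can be summarized by two hypothetical hand-picked scores (helpfulness and harmfulness) and that human judgments arrive as (preferred, rejected) pairs; all data is invented. It fits a linear reward with a Bradley-Terry style pairwise loss, a common choice for this step. Real systems instead train a neural network on the text itself and then fine-tune the language model against that reward, which is not shown here.

```python
import math
from typing import List, Sequence, Tuple

Features = Sequence[float]


def reward(weights: Sequence[float], features: Features) -> float:
    """Linear reward model: a weighted sum of a response's (hypothetical) features."""
    return sum(w * f for w, f in zip(weights, features))


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_reward_model(
    comparisons: List[Tuple[Features, Features]],
    n_features: int,
    lr: float = 0.5,
    epochs: int = 200,
) -> List[float]:
    """Fit weights so that human-preferred responses score higher than rejected ones.

    Each comparison is a (preferred, rejected) pair. The per-pair loss is the
    Bradley-Terry negative log-likelihood -log(sigmoid(r_preferred - r_rejected)),
    minimized here with plain stochastic gradient descent.
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # The loss gradient is -(1 - sigmoid(margin)) * (preferred - rejected), so
            # stepping against it moves the weights toward the preferred response's features.
            scale = 1.0 - sigmoid(margin)
            for i in range(n_features):
                weights[i] += lr * scale * (preferred[i] - rejected[i])
    return weights


# Hypothetical human judgments; each response is scored as [helpfulness, harmfulness].
comparisons = [
    ([0.9, 0.1], [0.4, 0.1]),  # the more helpful answer was preferred
    ([0.6, 0.0], [0.8, 0.9]),  # a harmless answer beat a more helpful but harmful one
    ([0.7, 0.2], [0.7, 0.8]),  # equally helpful, so the less harmful one was preferred
]
weights = train_reward_model(comparisons, n_features=2)

# Use the learned reward to rank new candidate responses to a prompt.
candidates = {"A": [0.8, 0.1], "B": [0.9, 0.9], "C": [0.3, 0.0]}
best = max(candidates, key=lambda name: reward(weights, candidates[name]))
print(weights, best)
```

With these made-up comparisons, the learned weights reward helpfulness and penalize harmfulness, so candidate A ranks highest. The model can only learn what the comparisons reveal: a feature that never differs within any compared pair keeps a weight of zero.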

Consequences

Some experts and researchers have warned that AI systems may one day become more powerful than humans and pose an existential threat to humanity. While this may sound like the stuff of apocalyptic movies, many argue that superintelligence is inevitable and that we need to devise ways to control it.

In the meantime, however, AI alignment is important for addressing more immediate harms, such as AI-driven misinformation and bias, which can have significant consequences for individuals and society. Without proper alignment, AI systems may pursue objectives that are harmful, ethically questionable, or contrary to the interests of individuals and society at large. This concern becomes particularly pronounced as AI systems become more autonomous and capable of complex decision-making.

Controversies

One of the most obvious problems in AI alignment is defining ‘human values’. Who decides which values matter, and what happens when people disagree about them? In a world of diverse and often conflicting values, deciding how to train AI systems is an ethical conundrum in itself.

There is also debate about who should be addressing AI alignment in the first place. Treating alignment as a technical problem puts all the power in the hands of technologists, whereas many believe that the rules governing AI systems should be determined by the public and democratic institutions. In other words, if AI systems are going to play a central role in our lives, we should have a say in how they are governed².

Case Studies

Superintelligence

In July 2023, OpenAI announced a new research program called ‘Superalignment’, which aims to solve the core technical challenges of aligning superintelligent AI by 2027.³ The main objective of the initiative is to ensure that AI systems much smarter than humans, known as superintelligence, follow human intent.⁴

Current alignment techniques, such as RLHF, rely on humans’ ability to supervise AI and will not scale to superintelligence. OpenAI therefore aims to build a roughly human-level automated alignment researcher that could then be used to iteratively align superintelligence at a massive scale. The catch? The automated alignment researcher would itself need to be aligned with human values first.

Paperclips or humans?

In 2003, Nick Bostrom, a philosopher at the University of Oxford, proposed an eccentric but provocative thought experiment known as the ‘paperclip maximiser’.⁵ He suggested that if you ask an intelligent machine to make as many paperclips as possible, it could end up destroying the whole world, humankind included, in its quest for raw materials. Unless you explicitly teach it otherwise, the system will have no concept of the value of human life and will pursue its goal by whatever means necessary.
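At heart, the thought experiment is about a misspecified objective. The toy Python sketch below assumes a handful of hypothetical plans with invented paperclip and harm numbers (the `Plan` fields and both objective functions are illustrative names, not part of Bostrom's argument) and shows the same simple agent choosing very differently depending on whether harm to humans appears in the score it maximizes. It illustrates the idea only; it is not a model of how a real AI system plans.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Plan:
    name: str
    paperclips: float  # how many paperclips the plan yields (invented numbers)
    human_harm: float  # hypothetical harm score; the naive objective ignores it


# Hypothetical options available to the agent; all numbers are made up.
PLANS: List[Plan] = [
    Plan("run the factory normally", paperclips=1_000, human_harm=0),
    Plan("buy up all the world's steel", paperclips=50_000, human_harm=3),
    Plan("strip-mine cities for iron", paperclips=9_000_000, human_harm=10),
]


def naive_objective(plan: Plan) -> float:
    # "Make as many paperclips as possible": harm never enters the score.
    return plan.paperclips


def constrained_objective(plan: Plan, harm_penalty: float = 1e9) -> float:
    # Same goal, but harm to humans is now explicitly and heavily penalized.
    return plan.paperclips - harm_penalty * plan.human_harm


def choose(plans: List[Plan], objective: Callable[[Plan], float]) -> str:
    # The agent picks whichever plan scores highest under the objective it was given.
    return max(plans, key=objective).name


naive_choice = choose(PLANS, naive_objective)
constrained_choice = choose(PLANS, constrained_objective)
print(naive_choice)        # strip-mine cities for iron
print(constrained_choice)  # run the factory normally
```

The constrained version only behaves because someone thought to include `human_harm` and set its penalty high enough; any value left out of the objective, like human life in Bostrom’s example, remains invisible to the optimizer.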

References

1. Bostrom, N. (2014). Superintelligence. Oxford University Press.

2. Ockel, L. (2023, July 12). What is ‘AI alignment’? Silicon Valley’s favourite way to think about AI safety misses the real issues. The Conversation. https://theconversation.com/what-is-ai-alignment-silicon-valleys-favourite-way-to-think-about-ai-safety-misses-the-real-issues-209330

3. Strickland, E. (2023, August 31). OpenAI’s Moonshot: Solving the AI Alignment Problem. IEEE Spectrum. https://spectrum.ieee.org/the-alignment-problem-openai

4. Leike, J., & Sutskever, I. (2023, July 5). Introducing Superalignment. OpenAI. https://openai.com/blog/introducing-superalignment

5. Marr, B. (2022, April 1). The Dangers of Not Aligning Artificial Intelligence With Human Values. Forbes. https://www.forbes.com/sites/bernardmarr/2022/04/01/the-dangers-of-not-aligning-artificial-intelligence-with-human-values/?sh=7210fb23751c

About the Author

Dr. Lauren Braithwaite

Dr. Lauren Braithwaite is a Social and Behaviour Change Design and Partnerships consultant working in the international development sector. Lauren has worked with education programmes in Afghanistan, Australia, Mexico, and Rwanda, and from 2017 to 2019 she was Artistic Director of the Afghan Women’s Orchestra. Lauren earned her PhD in Education and MSc in Musicology from the University of Oxford, and her BA in Music from the University of Cambridge. When she’s not putting pen to paper, Lauren enjoys running marathons and spending time with her two dogs.