Beyond Alexa and Towards Her: Psychological Principles for Designing Voice AI Interactions
It’s early May 2024 and the internet is ablaze with rumors that OpenAI, the leader in large language models (LLMs; for laymen, the technology behind AI chatbots), is preparing to showcase a major advancement.
On May 10th, OpenAI CEO Sam Altman shed (a bit) more light on what would be revealed via a post on X (formerly Twitter), dashing any hope that the next big version of ChatGPT (GPT-5) would be released. Instead, the announcement would focus on how we interact with pre-existing models.
A few days later, in a room filled with warm wooden undertones that looked more like a fancy coffee shop than a stage for a keynote, Mira Murati, the Chief Technology Officer of OpenAI, revealed two important new developments. The first was that the company would release a new version (GPT-4o), one step ahead of the existing model at the time (GPT-4), free for all non-premium users.2 This was significant, as most users who had tried LLMs had only experienced the free versions, which have vastly diminished capabilities compared to the frontier models.
The second announcement, the one we will be discussing here, was that users would soon be able to interact with ChatGPT, the company’s flagship product, via voice alone. Let’s dive into what made this innovation special and explore its psychological implications for users.
But voice assistants suck…right? What’s the big deal with this one?
The demo that Mira and two other OpenAI employees gave during the keynote was impressive, and vastly different from the typical voice interactions we’re used to, for three main reasons.1
1. Real-Time Response Rates
Until now, nearly every voice interaction we’ve had with a machine has been characterized by delays before and after we give a command. Siri, Alexa, and pretty much all AI assistants require a significant pause after they’ve been prompted, processing the command before outputting a result.
OpenAI’s new voice model vastly improves this response time, delivering what we recognize as real-time interaction. This means that for the first time, voice interactions feel immediate, and thus more natural, tapping into our strong innate preference for fluid conversation.
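To make the latency point concrete, here is a minimal sketch of how a developer might measure perceived responsiveness, assuming the official OpenAI Python SDK; the model name and prompt are illustrative. The key metric is time-to-first-token: with streaming, the user starts hearing a reply as soon as the first tokens arrive, rather than after the whole response is generated.

```python
# Minimal sketch: measuring time-to-first-token with streaming.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY environment
# variable; the model name and prompt are illustrative.
import time
from openai import OpenAI

client = OpenAI()
start = time.perf_counter()

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Greet me in one short sentence."}],
    stream=True,  # tokens arrive as they are generated
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        # This, not total generation time, is what a listener
        # perceives as the assistant's "response time".
        print(f"First token after {time.perf_counter() - start:.2f}s")
        break
```

In a voice product, those first tokens would be handed to speech synthesis immediately, which is why streaming architectures feel conversational even when total generation time is unchanged.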
2. Enhanced Voice Recognition and Generation
OpenAI’s new voice model employs advanced algorithms that enhance both voice recognition and generation beyond anything that came before.
On the recognition front, the model uses deep learning techniques to better process and understand speech patterns, improving its ability to handle various accents, dialects, and even speech impediments (to varying degrees). Most importantly, for those of us who have tried barking orders at Siri while on the go, often with little success, OpenAI’s model features enhanced noise cancellation and contextual understanding. This allows for much better interpretation of spoken language in noisy environments and conversational contexts.
Meanwhile, on the speech generation front, GPT-4o manages to sound remarkably human-like, complete with nuanced changes in tone, cadence, and emotion that adapt to the context of a given conversation. What is truly astounding is that the generated speech seems to exhibit natural prosody, along with good use of pauses and emphasis, making it sound very lifelike and quite expressive.
These improvements in recognition and generation drastically reduce the gap between human and machine communication, leveraging our innate tendency to anthropomorphize and create an interaction that builds more trust.
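For context on what an end-to-end voice model replaces, earlier voice assistants typically chained three separate models, with each hop adding latency and discarding paralinguistic cues such as tone. Below is a minimal sketch of that classic cascade, assuming the OpenAI Python SDK; the file paths and model choices are illustrative.

```python
# Sketch of the classic cascaded voice pipeline: speech-to-text,
# then a language model, then text-to-speech. Assumes the OpenAI
# Python SDK; file paths and model names are illustrative.
from openai import OpenAI

client = OpenAI()

# 1. Recognition: transcribe the user's spoken request.
with open("user_question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. Reasoning: generate a text reply to the transcript. Any emotion
# or emphasis in the user's voice was already lost in step 1.
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": transcript.text}],
)

# 3. Generation: synthesize the reply back into speech.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply.choices[0].message.content,
)
with open("assistant_reply.mp3", "wb") as out:
    out.write(speech.content)
```

Collapsing these stages into a single model is what lets GPT-4o both respond faster and react to how something was said, not just what was said.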
3. Multimodal Capabilities
This model extends beyond just voice—it supports multimodal capabilities, meaning it can understand and respond to text and voice inputs interchangeably. This degree of flexibility enables a more versatile application of the model, opening up a whole world of possibilities for how it can be used.
An impressive (and applicable!) example of this was demoed during the keynote by a student who shared their screen with ChatGPT and asked it (via voice) to guide them through solving a math problem. The student could ask questions, dig deeper into items they didn’t understand, and scribble down notes to discuss with the chatbot.
A multimodal approach aligns well with cognitive load theory, allowing users to process information through multiple, distinct sensory channels, which could enhance their learning and problem-solving capabilities.
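As a rough sketch of what multimodal input looks like at the API level, the request below pairs a transcribed voice question with an image, much like the screen-share demo. It assumes the OpenAI Python SDK, and the image URL is a placeholder.

```python
# Sketch: mixing text (from a transcribed voice question) and an image
# in a single request. Assumes the OpenAI Python SDK; the URL is a
# placeholder for a screenshot of the student's math problem.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Walk me through the first step of solving this equation."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/math_problem.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```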
These three components—real-time response rates, enhanced voice recognition and generation, and multimodal capabilities—are important building blocks in creating a chatbot experience that feels natural and intuitive for users. More specifically, the real-time responses create a sense of immediate engagement, the improved voice recognition and generation allow for more human-like interactions, and the multimodal capabilities provide flexibility in how users can communicate with the model.
…but is this enough?
Most of the LLMs released in the last year have some sort of voice interaction capability. But keep in mind, these responses are not real-time: the system simply transcribes what the user says and feeds it to the model as text, with a noticeable delay. OpenAI is the first organization to design voice interactions with its LLM that appear, at least to the user, to happen in real time.
However, as the field matures and others follow suit, many of us here at The Decision Lab believe that experts who want to create interfaces that mimic human interactions need to go beyond this advancement. In particular, they need to take into account a more advanced level of conversational design and the cognitive load that the user experiences.
Conversational Design
Until now, assistants such as Siri or Alexa incorporated a very basic level of conversational design, focused mainly on accurate speech recognition, clear articulation, and responses that directly address the user’s query. Even within a text-based interface, new LLMs can go beyond that: they can retain context across multiple turns and offer basic personalization, enhancing the natural flow of conversation, especially when migrating from a text interface to a voice interface.
We believe that the next frontier of these models will have advanced dialogue management abilities (the process of handling the conversational logic of a system) that ensure coherence, relevance, and most importantly, progression toward users’ decision-making goals. Meanwhile, what we’ve seen so far is that after a long series of interactions between a user and a chatbot, the chatbot seems to forget the initial context provided when the chat started, resulting in the user not being guided toward their decision-making goal.
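One plausible mitigation, sketched below with entirely hypothetical names, is to keep the user’s goal and key facts in explicit dialogue state and re-inject them into the system prompt on every turn, rather than relying on an ever-growing transcript alone.

```python
# Sketch: persisting the user's decision-making goal outside the chat
# transcript. All names here are hypothetical; the point is the pattern,
# not a specific framework.
from dataclasses import dataclass, field

@dataclass
class DialogueState:
    goal: str                                            # the decision-making goal
    facts: dict[str, str] = field(default_factory=dict)  # durable user facts

    def system_prompt(self) -> str:
        facts = "; ".join(f"{k}: {v}" for k, v in self.facts.items())
        return (
            f"You are guiding the user toward this goal: {self.goal}. "
            f"Known facts: {facts or 'none yet'}. "
            "Keep every reply relevant to that goal."
        )

state = DialogueState(goal="choose a health insurance plan")
state.facts["physical_therapy"] = "monthly visits, duration uncertain"
# This prompt is rebuilt and prepended on every model call, so the goal
# survives even when early turns scroll out of the context window.
print(state.system_prompt())
```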
Psychological Principles
At The Decision Lab, we're all about leveraging decision science principles to make AI interactions work better for people in real-world situations. When it comes to designing such effective interactions, understanding key psychological principles isn't just helpful—it's pivotal.
Such concepts can be leveraged to guide how information is presented, manage cognitive load (so users' brains don't feel fried), and create experiences that feel natural and engaging. Let's explore some of the key psychological principles that can be applied.
- Cognitive Load Management: The model should be deliberate in how information is structured to avoid overwhelming the user. This can be achieved by breaking complex information into smaller chunks, and by revealing information gradually, as needed, instead of all at once.
- Anchoring and Adjustment: The model should “think ahead” by being mindful of how initial information can disproportionately influence subsequent decisions that the user makes.
- Framing Effects: The model should consider how the manner in which options are presented can significantly impact the choice that the user makes.
- Reciprocity: The model should try to replicate human interactions by building rapport through a give-and-take interaction style, while still being efficient in guiding the user toward the final decision-making goal.
These psychological principles can be applied to "resolve" not only the last user query but the actual goal behind it, or in other words, the underlying intent that prompted the question in the first place. For example, if a user asks about the weather in their city, the LLM should be able to infer that they're planning an outdoor activity and provide relevant suggestions along with the weather information. By anticipating the user's needs in this way, the interaction becomes more comprehensive, valuable, and holistic.
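A minimal sketch of how this two-step resolution might look, assuming the OpenAI Python SDK; the prompts and the weather query are illustrative, not a tested design.

```python
# Sketch: answer the literal query, but also the inferred goal behind it.
# Assumes the OpenAI Python SDK; prompts are illustrative.
from openai import OpenAI

client = OpenAI()
user_query = "What's the weather like this weekend?"

# Step 1: infer the likely underlying intent behind the question.
intent = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "In one short sentence, state the user's likely underlying goal."},
        {"role": "user", "content": user_query},
    ],
).choices[0].message.content

# Step 2: answer the question and address the inferred goal, e.g.
# suggesting indoor alternatives if rain looks likely.
answer = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": f"Answer the user's question, then briefly help with this inferred goal: {intent}"},
        {"role": "user", "content": user_query},
    ],
)
print(answer.choices[0].message.content)
```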
Chatbot Chronicles: Real-World LLM Magic
Let’s run through a scenario that demonstrates some of the psychological principles we discussed above with a focus on how they apply specifically to voice interactions.
Imagine that a user asks a chatbot to help them choose a health insurance plan appropriate for their situation. The chatbot’s goal, programmed by the insurance company, is to guide the user toward a decision that balances comprehensive coverage with affordability. Let’s call the chatbot… Bluey!
Opening with a neutral tone
Bluey: Welcome to our health insurance plan explorer. I'm here to help you understand your options. Would you like to start by hearing a brief overview of our plans, or do you have specific questions in mind?
User: I'd like to hear a brief overview, please.
Bluey: Certainly! I'll provide a concise summary of our plans. Please feel free to interrupt me at any point if you need clarification or have questions. Are you ready for me to begin?
User: Yes, go ahead.
By starting with an open-ended question in a neutral tone, Bluey allows the user to direct the conversation, giving them a high degree of freedom and respecting their autonomy. This approach taps into the user’s self-determination, the psychological principle that explains why people are more engaged when they feel they have a say in the process. In a voice interaction, Bluey's invitation for interruptions acknowledges the fluid nature of spoken dialogue and encourages active listening.
Gathering information with open-ended questions
Bluey: To help find a plan that best suits you, could you tell me a bit about your healthcare needs? For example, how often do you typically visit the doctor and what other health practitioners do you see on a regular basis?
User: I see my primary care doctor about twice a year for check-ups, and I've been seeing a physical therapist once a month for the past six months due to a sports injury.
Bluey: Thank you for sharing that. It sounds like you have regular check-ups and some recent physical therapy needs. Let's explore this a bit further. Do you anticipate needing ongoing physical therapy, or do you think it's a temporary treatment?
User: I'm not sure. The doctor said it could take anywhere from a few more months to a year.
Bluey: I see. It's good to consider both short-term and long-term needs. Let's keep that in mind as we look at plans. Is there anything else about your health situation you think would be important for choosing an insurance plan?
Through this line of questioning, the user is encouraged to reflect on their actual needs without pushing them in any particular direction. Since this is a voice interaction, Bluey's follow-up questions demonstrate active listening and help to create a more natural, conversational flow. This approach is particularly important in audio formats where users can't easily review previously stated information.
In addition, these questions make use of the availability heuristic, whereby people weigh their judgments heavily toward information that comes to mind easily, such as recent experiences. By making recent medical experiences more salient, Bluey can help the user decide what kind of coverage they need based on past expenses. Meanwhile, LLMs can extrapolate both the types and amounts of coverage required from the user’s answers about how often they visit their doctor and other practitioners.
However, there is also a danger here: the user may weigh recent expenses more heavily than their actual long-term needs. One way to prevent this is to help the user go through their expenses before asking whether they believe their treatment is short-term or long-term. (This is exactly what Bluey did when inquiring about how long the user’s physical therapy will last!) This process, sketched in code after the list below, helps to counteract the recency effect and promotes more balanced decision-making by…
- Explicitly distinguishing between recurring and temporary health needs, encouraging users to look beyond their recent medical experiences and consider their health patterns in the long run.
- Prompting users to critically evaluate the relevance of recent health events to their overall healthcare needs, reducing the tendency to overemphasize recent experiences when making insurance decisions.
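One way this debiasing step could be built in, sketched below with illustrative (untested) prompt text, is as a standing instruction appended to Bluey’s system prompt on every turn.

```python
# Sketch: encoding the recency/availability debiasing step as a standing
# instruction. The prompt wording is illustrative, not a tested design.
BASE_PROMPT = "You are Bluey, a health insurance plan guide."

DEBIAS_INSTRUCTION = (
    "Before recommending coverage, ask the user to classify each recent "
    "medical expense as recurring or temporary, and to estimate how long "
    "temporary treatments will continue. Base recommendations on long-term "
    "patterns, not only on the most recent or most salient expenses."
)

system_prompt = BASE_PROMPT + "\n" + DEBIAS_INSTRUCTION  # sent on every turn
print(system_prompt)
```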
Presenting options in a balanced way
Bluey: We offer three main plans: Bronze, Silver, and Gold. Each has different levels of coverage and costs. Would you like me to explain them one at a time, or compare them side by side?
User: I think it would be easier for me to understand if you compared them side by side.
By giving options regarding how information is presented, Bluey learns the user’s preferences—how they would like to receive the information as well as “process” it. This consideration takes cognitive fit into account, which suggests that performance improves when there's a good match between the way the problem is described and the task itself.
This is where a difference starts to emerge between information presented in text and information presented in audio. For example, while it’s easy for a user to compare options laid out side by side in a table, in a verbal conversation Bluey must describe the options in a fundamentally different format.
Let’s assume the user would want the information to be presented as a comparison—what would this look like in action?
Bluey: Let’s compare the plans for you by first going through the main types of coverage each one offers. We'll be looking at five main areas: monthly premiums, basic health coverage, coverage for additional practitioners like chiropractors, dental coverage, and vision care. Does that make sense?
User: Yes, that sounds good.
After that, Bluey would be able to dig into each dimension of coverage one at a time, breaking the information up to make it more digestible.
Bluey: Okay, let's begin with monthly premiums. The Bronze plan is $200 per month, Silver is $300, and Gold is $400. That's a $100 increase between each level. How does that sound to you?
User: That makes sense! Please continue.
Bluey: Next is basic health coverage. Actually, this is the same for all plans: they all cover 80% of the costs for doctor visits, hospital stays, and prescription medications. Any questions about that?
This type of conversation demonstrates how Bluey keeps in mind the end goal of the conversation while modulating each prompt based on the user’s preferences and selected interface. It also showcases how chunking (breaking down information into manageable pieces), anchoring (starting with price information that serves as a reference point), and the primacy effect (presenting important information first) can play a role in creating successful voice-only interactions with LLMs.
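A small sketch of how the chunking itself might be implemented, with illustrative plan data and hypothetical names: the tabular comparison is flattened into one speakable utterance per dimension, each ending with a confirmation check so the listener sets the pace.

```python
# Sketch: converting tabular plan data into voice-sized chunks, one
# comparison dimension at a time. Data and wording are illustrative.
PLANS = {
    "Bronze": {"monthly premium": "$200", "basic coverage": "80%"},
    "Silver": {"monthly premium": "$300", "basic coverage": "80%"},
    "Gold":   {"monthly premium": "$400", "basic coverage": "80%"},
}

def spoken_comparisons(plans: dict[str, dict[str, str]]):
    """Yield one short, speakable comparison per dimension (chunking),
    with a confirmation check so the listener is never overloaded."""
    dimensions = next(iter(plans.values())).keys()
    for dim in dimensions:
        values = ", ".join(f"{name} is {attrs[dim]}" for name, attrs in plans.items())
        yield f"Let's look at {dim}: {values}. Shall I continue?"

for utterance in spoken_comparisons(PLANS):
    print(utterance)  # each chunk would be handed to text-to-speech
```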
Beyond Alexa and Towards Her
As voice interfaces for LLMs continue to evolve, it's crucial that designers, developers, and product managers focus not only on the cutting-edge technological advancements that help models resemble humans speaking to other humans, but also on the psychological principles that are key to effective human-computer interaction. This way, we can go beyond what voice assistants like Alexa currently offer and edge closer to the natural, humanlike interactions of Her.
Here at The Decision Lab, we believe that by incorporating these principles, experts can create voice interfaces that are not just technologically impressive, but also intuitive, engaging, and truly helpful in guiding users through complex decision-making processes. Get in touch with us today to see how we can help you achieve just this.
References
- OpenAI. (n.d.). Hello GPT-4o. OpenAI. https://openai.com/index/hello-gpt-4o/
- King, C. (2024, May 14). iPad AI tutor demo: Apple shows off impressive AI-powered teaching tool. 9to5Mac. https://9to5mac.com/2024/05/14/ipad-ai-tutor-demo/
- Frontitude. (n.d.). What is conversational design. Frontitude. https://www.frontitude.com/glossary-posts/what-is-conversational-design
About the Author
Dan Pilat
Dan is a Co-Founder and Managing Director at The Decision Lab. He is the bestselling author of Intention, a book he wrote with Wiley on the mindful application of behavioral science in organizations. Dan has a background in organizational decision making, with a BComm in Decision & Information Systems from McGill University. He has worked on enterprise-level behavioral architecture at TD Securities and BMO Capital Markets, where he advised management on the implementation of systems processing billions of dollars per week. Driven by an appetite for the latest in technology, Dan created a course on business intelligence and lectured at McGill University, and has applied behavioral science to topics such as augmented and virtual reality.