Speech Recognition

What is Speech Recognition?

Speech recognition technology enables devices to understand and convert human voice into text or commands, fundamentally bridging human and machine interactions. Beyond recognizing speech, it can execute specific actions by comprehending the input. Now prevalent in healthcare, customer service, automotive industries, and virtual assistants, speech recognition improves accessibility and efficiency in our daily lives.

The Basic Idea

You likely own an electronic device equipped with speech recognition technology, such as a smartphone or smart home device. At first glance, the process behind speech recognition might seem straightforward: the device recognizes the sounds you're making, translates them into written text, and interprets them as commands. However, the actual process involves several sophisticated stages:1

Signal Processing: The first step in speech recognition involves converting spoken words into a digital format that a computer can understand. When you speak, a microphone captures your voice as an acoustic signal. This signal is transformed into a digital representation the computer can process. This step lays the foundation for the next stages.
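
To make this step concrete, here is a minimal Python sketch of digitization. It assumes a 16 kHz sample rate and 16-bit depth, which are common choices for speech but are not specified in this article, and it uses a synthetic 440 Hz tone in place of a real microphone. The function names are hypothetical.

```python
import math


def sample_and_quantize(signal, duration_s, sample_rate=16_000, bit_depth=16):
    """Sample a continuous signal and quantize it to signed integers (PCM).

    `signal` is any function mapping time in seconds to an amplitude in [-1, 1].
    The sample rate and bit depth are assumed values, not taken from the article.
    """
    max_level = 2 ** (bit_depth - 1) - 1  # 32767 for 16-bit audio
    n_samples = int(duration_s * sample_rate)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate  # time of the n-th sample
        amplitude = max(-1.0, min(1.0, signal(t)))  # clip to the valid range
        samples.append(round(amplitude * max_level))
    return samples


def tone(t):
    """Stand-in for a microphone: a 440 Hz sine wave at half volume."""
    return 0.5 * math.sin(2 * math.pi * 440 * t)


pcm = sample_and_quantize(tone, duration_s=0.01)
print(len(pcm), "samples; first five:", pcm[:5])
```

The resulting list of integers is the "digital representation" that later stages work with.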

Speech Feature Extraction: You can think of this step as the sieve of speech recognition. The main goal is to filter out the background noise and irrelevant information, only keeping the critical data for the next stages. The computer recognizes features in your voice such as pitch, tone and rhythm to make an accurate extraction.
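
As a rough illustration, the sketch below splits a digitized signal into short overlapping frames and computes two very simple features per frame: short-time energy (how loud the frame is) and zero-crossing rate (how often the waveform changes sign). These two features are stand-ins chosen for simplicity; production systems typically rely on richer representations such as mel-frequency cepstral coefficients (MFCCs). The frame length of 400 samples and hop of 160 samples are assumptions that correspond to 25 ms and 10 ms at 16 kHz.

```python
def frame_signal(samples, frame_len=400, hop=160):
    """Split samples into overlapping frames (assumed 25 ms frames, 10 ms hop at 16 kHz)."""
    return [
        samples[start:start + frame_len]
        for start in range(0, len(samples) - frame_len + 1, hop)
    ]


def short_time_energy(frame):
    """Average squared amplitude: low for silence, high for speech."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs where the signal changes sign."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / (len(frame) - 1)


def extract_features(samples):
    """Return one (energy, zero-crossing rate) pair per frame."""
    return [
        (short_time_energy(f), zero_crossing_rate(f))
        for f in frame_signal(samples)
    ]
```

A frame with near-zero energy can be treated as silence and dropped, which is one simple way of keeping only the data that matters for the next stages.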

Acoustic & Language Models: Extracted features are then recognized and processed by acoustic and language models. The acoustic model maps these features to phonemes (the basic units of sound in a language), while the language model adds context by understanding how words typically fit together in a language. These models use predictions based on previous words to guess and confirm the next words, ensuring coherence and increasing accuracy.
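
The "predict the next word from the previous word" idea can be sketched with a bigram model, one of the simplest kinds of language model. This is an illustration only: the three-sentence corpus is made up, the function names are hypothetical, and modern systems use neural language models trained on far more text.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probability(model, prev, candidate):
    """Estimate P(candidate | prev) from the counts; 0.0 if prev was never seen."""
    followers = model.get(prev, Counter())
    total = sum(followers.values())
    return followers[candidate] / total if total else 0.0


# Tiny made-up corpus, for illustration only.
corpus = ["turn on the light", "turn on the alarm", "turn off the light"]
model = train_bigram_model(corpus)
print(next_word_probability(model, "turn", "on"))
print(next_word_probability(model, "the", "alarm"))
```

In this toy corpus, "on" follows "turn" in two of three cases, so the model considers "turn on" more likely than "turn off" before hearing any audio.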

Decoding: The final stage involves combining the outputs of acoustic and language models to generate the most probable transcription of what was spoken. This involves searching through possible combinations of words to find the sequence that best matches the input.
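
Here is a minimal sketch of that search, assuming the acoustic model has already produced a few candidate words per position along with log-probabilities. It uses a Viterbi-style dynamic program to combine those scores with bigram language model probabilities. All numbers and names below are invented for illustration; real decoders search far larger spaces and work with phonemes or sub-word units.

```python
import math


def decode(lattice, bigram_prob, lm_weight=1.0):
    """Find the most probable word sequence.

    lattice: one dict per position, mapping candidate word -> acoustic log-probability.
    bigram_prob: function (previous word, word) -> probability greater than zero.
    lm_weight: how strongly the language model influences the result.
    """
    # best[word] = (score, path) for the best path that ends in `word`
    best = {word: (score, [word]) for word, score in lattice[0].items()}
    for step in lattice[1:]:
        new_best = {}
        for word, acoustic in step.items():
            new_best[word] = max(
                (
                    prev_score + acoustic + lm_weight * math.log(bigram_prob(prev, word)),
                    path + [word],
                )
                for prev, (prev_score, path) in best.items()
            )
        best = new_best
    return max(best.values())[1]


# Invented language model probabilities; unseen pairs get a small floor value.
BIGRAMS = {
    ("the", "right"): 0.05,
    ("the", "write"): 0.001,
    ("right", "answer"): 0.1,
    ("write", "answer"): 0.01,
}


def bigram_prob(prev, word):
    return BIGRAMS.get((prev, word), 1e-6)


# Invented acoustic scores: "write" sounds slightly more likely than "right".
lattice = [
    {"the": math.log(0.9)},
    {"write": math.log(0.55), "right": math.log(0.45)},
    {"answer": math.log(0.9)},
]

print(decode(lattice, bigram_prob))
```

Even though the acoustic scores slightly favor "write" in this example, the language model tips the final transcription toward "the right answer". Setting `lm_weight=0` removes that context and lets the acoustic scores decide alone.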

It’s also important to differentiate between speech recognition and voice recognition. Speech recognition technology focuses on what is being said, without considering who is saying it. On the other hand, voice recognition is concerned with identifying who is speaking, rather than the content of what they are saying.2

For example, Amazon’s Alexa is equipped with both technologies. Imagine your home's security alarm is set up with Alexa, but you also have children, and you need to ensure that the alarm isn’t activated accidentally by one of them. Through speech recognition, Alexa understands the command to activate the alarm, but through voice recognition, it can differentiate between your voice and your child’s, preventing unauthorized activation or deactivation.
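
The logic of this example can be sketched as follows. This is not how Alexa is implemented; it is a hypothetical illustration in which speech recognition has already produced a transcript (what was said) and voice recognition has already produced a speaker label (who said it). The command names and speaker labels are made up.

```python
# Hypothetical configuration: which speakers may issue which sensitive commands.
AUTHORIZED_SPEAKERS = {"parent"}
SENSITIVE_COMMANDS = {"arm alarm", "disarm alarm"}


def handle_command(transcript, speaker_id):
    """Combine both technologies.

    transcript: output of speech recognition (what was said).
    speaker_id: output of voice recognition (who said it).
    """
    command = transcript.lower().strip()
    if command in SENSITIVE_COMMANDS and speaker_id not in AUTHORIZED_SPEAKERS:
        return "denied"
    return f"executing: {command}"


print(handle_command("Disarm alarm", "child"))
print(handle_command("Disarm alarm", "parent"))
print(handle_command("Play music", "child"))
```

Only the sensitive commands depend on who is speaking; everything else needs speech recognition alone.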

Now, you might be asking: how does it actually “understand” humans? This is where artificial intelligence, and more specifically deep learning, comes in. Although other approaches exist, such as Gaussian mixture models (GMMs) and hidden Markov models (HMMs), most software and devices today use deep learning.

This type of AI learns from “experience,” that is, from vast amounts of data. Alexa, for instance, is good at interpreting humans because it has been “fed” millions of commands and human voices. Through speech recognition and deep learning, technology can understand and interact with us, making it significantly more intuitive and responsive than ever.3

Key Terms

  • Automatic Speech Recognition (ASR): The technology that enables computers to identify spoken words and convert them into text.
  • Voice Recognition: Technology that identifies a user's unique voice patterns.
  • Artificial Intelligence: The simulation of human intelligence processes by machines, particularly computer systems. These processes include learning (acquiring information and rules for using it), reasoning (using rules to reach approximate or definite conclusions), and self-correction. AI applications range from simple tasks like language translation and image recognition to complex problem-solving and decision-making.
  • Deep Learning: A subset of machine learning that uses neural networks with many layers to model complex patterns and representations in large datasets. It is particularly well-suited for tasks such as image and speech recognition, natural language processing, and autonomous systems.
  • Natural Language Processing (NLP): A branch of artificial intelligence that helps computers understand, interpret, and generate human language in a meaningful and valuable way.
  • Speech Synthesis: The artificial production of human speech, often used in conjunction with speech recognition to create conversational interfaces. It translates written text into spoken words using algorithms.


History

Speech recognition started in 1952, when Bell Laboratories created AUDREY (Automatic Digit Recognizer). This machine was able to recognize the digits 0-9 with 90% accuracy when spoken by its inventor, K. H. Davis. Although AUDREY was the size of a whole room and could only recognize one voice, it was a groundbreaking invention.4

Almost a decade later, in 1961, William C. Dersch of IBM presented “Shoebox,” a machine the size of a shoebox that could understand numbers and arithmetic commands, for example: “Seven plus three plus six plus nine plus five. Subtotal.” Shoebox marked a revolution in human-machine interaction because it could act on commands given by a human.5 The idea of talking to machines also took hold in pop culture, one of the most notable examples being the 1966 TV show Star Trek, where Captain Kirk could speak to his spaceship, the USS Enterprise.

From the late 1960s to the mid-1970s, hidden Markov models (HMMs) were introduced and applied to speech recognition, improving its accuracy and functionality. HMMs are statistical models for analyzing sequential data and are also used in areas such as bioinformatics and economics.4

Speech recognizers were not only being invented in the US. They were also being developed elsewhere, for example at University College London and at NEC in Japan.

In 1971, the Defense Advanced Research Projects Agency (DARPA), a US research and development agency for national security, launched and funded a speech recognition program to develop a system that could recognize 1,000 words. By 1976, this goal was achieved by HARPY, a system written by a Carnegie Mellon University graduate student.4

Later, in 1987, Kai-Fu Lee, another Carnegie Mellon graduate, developed SPHINX-1, the first speaker-independent continuous speech recognition system. This was groundbreaking because the system didn’t need to be trained on a particular speaker’s voice, which saved a lot of “training” time. By 1997, Dragon NaturallySpeaking, a general-purpose continuous speech system, could recognize 100 words per minute.4

Finally, throughout the 2000s, the introduction of deep learning techniques revolutionized the accuracy and capabilities of speech recognition systems. These advancements enabled more natural and accurate speech recognition, leading to the voice-enabled applications that have saturated our current market, such as Alexa, Siri, Cortana, and Google Home.4


People

James K. Baker: American entrepreneur and computer scientist, and co-founder of Dragon Systems, the speech recognition company that created the first general-purpose speech dictation and transcription software, Dragon NaturallySpeaking. Baker’s work has had a significant impact on the ASR world. For example, Apple's Siri uses Dragon’s technology. Baker is also a professor at Carnegie Mellon University. His research now focuses on achieving superior intelligence by creating teams of AI and humans.6

Ray Kurzweil: Computer scientist who invented the first reading machine for those with visual impairments. Kurzweil is now an advocate of and speaker on transhumanism, a movement that believes the human condition can be enhanced with the help of technology.7

Geoffrey Hinton: Known as the “Godfather of Deep Learning,” Hinton has made contributions that are essential to modern AI. He left his job at Google in 2023 to focus on the risks associated with AI. In Hinton’s own words: “Sometimes I think it’s as if aliens had landed and people haven’t realized because they speak very good English.”8


Speech recognition systems can capture and process spoken language, converting it into actions or textual data that machines can understand and respond to. This capability enables users to interact with devices naturally and intuitively, significantly enhancing user experience.

As previously mentioned, most speech recognition systems today rely on deep learning, which helps them evolve with each interaction, refining their responses and actions based on the data they accumulate. As a consequence, these systems start to feel unique to each user, and you can customize options to meet your needs.

Speech recognition has also improved accessibility for those with disabilities, enhanced productivity with voice-driven multitasking, and introduced new levels of convenience into everyday tasks. It also gives users the option to execute commands “hands-free.” Imagine you’re driving: rather than taking your eyes off the road to unlock your phone (not advised!), open the music app, find the jazz playlist, and press play, you can simply say “Hey Siri, shuffle my jazz playlist.”

The bottom line is that when you speak to a machine, it not only understands what you're saying but also performs the action you requested, and that feels great. There’s a reason Siri and Alexa are called virtual assistants. Speech recognition technology gives you access to an enormous amount of information in a couple of seconds and performs tasks efficiently and accurately. It is also extremely accessible: you don’t need much training or advanced tech knowledge to use it.


Accuracy Issues

It all sounds great, except when the system is inaccurate or doesn’t understand what you are saying. When it doesn’t work properly, it can cause more frustration and annoyance than if the user had performed the action themselves from the beginning. Speech recognition technology has become so common that users expect flawless interactions with their devices.
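
Accuracy is usually quantified with the word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the system's transcript into the correct one, divided by the number of words in the correct transcript. The article does not name this metric, so treat the sketch below as background; the function name and example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the two texts, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("shuffle my jazz playlist", "shuffle my chess playlist"))
```

A WER of 0 means a perfect transcript. One wrong word in a four-word command already gives a WER of 0.25, which is enough to trigger the wrong action.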

Background Noise

One major challenge for speech recognition systems is dealing with background noise. Although algorithms and feature extraction play an important role in filtering, noise remains a problem once these products are launched in the real world. The simplest way to address it is to consider, during design and testing, the different environments that users and products will be exposed to. Other solutions rely on microphone hardware and on linear noise reduction filters (e.g., a Gaussian mask).
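
To show what a linear filter with a Gaussian mask does, the sketch below replaces each sample with a weighted average of its neighbors, where the weights follow a bell curve. This smooths out short spikes of noise. The radius and sigma values are arbitrary assumptions, and real noise reduction pipelines are considerably more elaborate.

```python
import math


def gaussian_kernel(radius, sigma):
    """Bell-shaped weights centered on the current sample, normalized to sum to 1."""
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(samples, radius=2, sigma=1.0):
    """Apply the Gaussian kernel to every sample (edges reuse the nearest sample)."""
    kernel = gaussian_kernel(radius, sigma)
    last = len(samples) - 1
    smoothed = []
    for n in range(len(samples)):
        value = 0.0
        for k, weight in enumerate(kernel):
            index = min(max(n + k - radius, 0), last)  # clamp at the edges
            value += weight * samples[index]
        smoothed.append(value)
    return smoothed


noisy = [0, 0, 0, 10, 0, 0, 0]  # a single loud click in otherwise silent audio
print([round(v, 2) for v in smooth(noisy)])
```

The click is spread out and its peak is reduced. The trade-off is that the same averaging also blurs genuine rapid changes in speech, which is one reason filtering alone does not solve the noise problem.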

Privacy Concerns

Additionally, speech recognition raises privacy concerns, and I’m quite sure this is not the first time you’ve pondered the question: is it always listening? This is especially worrying when it comes to voice-activated systems. Virtual assistants listen continuously for their names (Alexa, Siri, Google), but they are programmed to record only after the trigger word (e.g., “Hey Siri”). Before that, these devices store only minimal amounts of audio, so they are not constantly eavesdropping. Amazon has also explained that, even though Alexa is always listening, it’s not always recording.9 Despite these assurances, many users might still feel uneasy knowing that an electronic device is always listening.
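
The difference between listening and recording can be sketched with a small rolling buffer. In this simplified, hypothetical model, the device keeps only the last few "frames" it heard, discards older ones automatically, and starts keeping audio only once the trigger word appears. Words stand in for audio frames here; real devices run a small on-device acoustic model on raw audio rather than matching text.

```python
from collections import deque


def listen(frames, wake_word="hey siri", buffer_size=3):
    """Return (recorded frames, frames still in the rolling buffer)."""
    buffer = deque(maxlen=buffer_size)  # older frames fall out automatically
    recorded = []
    awake = False
    for frame in frames:
        if awake:
            recorded.append(frame)  # only now is anything kept
        else:
            buffer.append(frame)
            if wake_word in " ".join(buffer):
                awake = True
    return recorded, list(buffer)


heard = ["private", "chat", "hey", "siri", "play", "jazz"]
recorded, remaining = listen(heard)
print("recorded:", recorded)
print("still in buffer:", remaining)
```

In this run, the word "private" has already been discarded by the time the trigger word is detected, and only the words after the trigger are recorded.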

Specific privacy safeguards follow from this “constant listening.” For example, it’s crucial to obtain user consent before activating continuous listening features so that users are aware and can make an informed choice. Also, recorded data should be anonymized and encrypted to protect user identity and prevent unauthorized access.
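
One small piece of this, replacing a user's identity with a pseudonym before a recording is stored, can be sketched with a keyed hash from Python's standard library. This is a hypothetical illustration, not a complete privacy solution: the record layout and function name are made up, and encryption of the audio itself is left out because it requires a vetted cryptography library.

```python
import hashlib
import hmac
import secrets


def pseudonymize(user_id, secret_key):
    """Replace a user ID with a keyed hash so stored records don't name the user."""
    return hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


# The key must be stored separately from the recordings (assumption of this sketch).
secret_key = secrets.token_bytes(32)

record = {
    "user": pseudonymize("mariana@example.com", secret_key),
    "transcript": "shuffle my jazz playlist",
}
print(record)
```

The same user always maps to the same pseudonym, so the system can still group a user's recordings, but someone who obtains the stored records without the key cannot read the identity directly.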

Homophones, Accents & Dialects

Homophones present another challenge. Words that sound the same but have different meanings (e.g., "heal" and "heel") require speech recognition software to obtain contextual understanding to differentiate them correctly. This requires sophisticated algorithms that can understand the context of a conversation, which is an ongoing area of development.

Another significant challenge for speech recognition systems is dealing with accents and dialects. These systems have a history of performing poorly with accents that are not well represented in training data, leading to accusations of bias and even discrimination. Speakers have different accents across countries and regions, which must be considered when developing products with a diverse audience in mind. Therefore, it's important to create systems that respect and empower users by training them ethically, on data that represents the full range of their users' accents.

Moreover, mainstream speech recognition technology was created with English-speaking users in mind (with a bias toward American English), which raises several concerns. Systems should work in people’s first language and provide results of the same quality as in English. Even though English is one of the most spoken languages globally, people will almost always prefer to communicate in their native language because it is more intuitive. The whole point of speech recognition systems is to make human-computer interactions simple and helpful, but sometimes that is easier said than done.

Case Study

Revamping Call Center Efficiency with Speech Emotion Recognition10

In call centers, particularly those handling emergencies and healthcare services, the priority is always to address the most urgent calls quickly. However, most call centers have traditional queuing systems which treat all calls equally. This is where Speech Emotion Recognition (SER) technology comes into play, offering a solution to intelligently prioritize calls based on the emotional urgency of the caller.

The concept was put to the test in a simulated environment that mimicked the hectic, high-pressure conditions of a call center. The SER system analyzed callers’ emotions in real time and prioritized calls from people displaying distress indicators like fear or anger. As a result, urgent calls were promptly advanced to the front of the queue.

Response times for urgent calls improved by 30% after SER was put into place. This not only improved the call center's performance but also significantly raised satisfaction, which is crucial in emergency scenarios where every second counts.

The experiment showed how useful emotion detection technologies can be in call centers: using SER set a new bar for emergency service and operational efficiency. Finally, this case shows how speech recognition technologies are quickly being adapted to aid as many sectors as possible.
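
The queuing idea can be sketched with a priority queue. This is not the system from the study; it is a minimal illustration that assumes the emotion label for each call is already available, and the urgency ranking (fear before anger before neutral) is an assumption made for the example.

```python
import heapq
from itertools import count

# Assumed ranking for illustration: a lower number is answered first.
URGENCY = {"fear": 0, "anger": 1, "neutral": 2}


class CallQueue:
    """Answers the most urgent call first; ties are broken by arrival order."""

    def __init__(self):
        self._heap = []
        self._arrival = count()  # increasing counter keeps equal-priority calls in order

    def add(self, caller, emotion):
        priority = URGENCY.get(emotion, URGENCY["neutral"])  # unknown labels are not prioritized
        heapq.heappush(self._heap, (priority, next(self._arrival), caller))

    def next_call(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


queue = CallQueue()
queue.add("caller A", "neutral")
queue.add("caller B", "fear")
queue.add("caller C", "neutral")
queue.add("caller D", "anger")

order = [queue.next_call() for _ in range(len(queue))]
print(order)
```

A traditional queue would answer the callers in arrival order (A, B, C, D). Here the two distressed callers move to the front, while the neutral callers keep their relative order.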

Related TDL Content

Human-Computer Interaction

This article explores the multidisciplinary field of Human-Computer Interaction (HCI), which aims to enhance how humans interact with computers. It delves into how integrating computer science, psychology, design, and more can lead to intuitive and user-centric computer interfaces.


  1. Shrawankar, U. & Mahajan, A. (2013) Speech: A Challenge to Digital Signal Processing Technology for Human-to-Computer Interaction. Retrieved Jun 17, 2024 from: https://arxiv.org/pdf/1305.1925
  2. Picovoice. (n.d.) Speech Recognition vs. Voice Recognition. Retrieved May 10, 2024 from: https://picovoice.ai/blog/speech-recognition-voice-recognition/
  3. IBM. (n.d.) What is Speech Recognition? Retrieved May 10, 2024 from: https://www.ibm.com/topics/speech-recognition#:~:text=Speech%20recognition%2C%20also%20known%20as,speech%20into%20a%20written%20format.
  4. Spicer, D. (2021) Audrey, Alexa, Hal, and More. Computer History. Retrieved May 10, 2024 from: https://computerhistory.org/blog/audrey-alexa-hal-and-more/
  5. IBM. (n.d.) Speech Recognition. Retrieved May 10, 2024 from: https://www.ibm.com/history/voice-recognition
  6. Computer History. (n.d.) James Baker. Retrieved May 10, 2024 from: https://computerhistory.org/profile/james-baker/
  7. Craine, A.G. (2024) Ray Kurzweil. Britannica. Retrieved May 10, 2024 from: https://www.britannica.com/biography/Raymond-Kurzweil
  8. Heaven, W.D. (2023). Geoffrey Hinton tells us why he’s now scared of the tech he helped build. MIT Technology Review. Retrieved May 10, 2024 from: https://www.technologyreview.com/2023/05/02/1072528/geoffrey-hinton-google-why-scared-ai/
  9. Asurion. (n.d.) Does Alexa listen to your conversations at home? Retrieved May 10, 2024 from: https://www.asurion.com/connect/tech-tips/is-alexa-listening-to-conversations-at-home/#:~:text=Does%20Amazon%20Alexa%20listen%20all,means%20it's%20technically%20always%20listening.
  10. Bojanić, M., Delić, V., & Karpov, A. (2020). Call Redistribution for a Call Center Based on Speech Emotion Recognition. Applied Sciences 10, no. 13: 4653. https://doi.org/10.3390/app10134653

About the Author

Mariana Ontañón

Mariana holds a BSc in Pharmaceutical Biological Chemistry and an MSc in Women’s Health. She’s passionate about understanding human behavior in a holistic way. Mariana combines her knowledge of health sciences with a keen interest in how societal factors influence individual behaviors. Her writing bridges the gap between intricate scientific information and everyday understanding, aiming to foster informed decisions.
