Speech Recognition
What is Speech Recognition?
Speech recognition technology enables devices to convert human speech into text or commands, bridging the gap between people and machines. Beyond transcribing speech, it can trigger specific actions based on what was said. Now prevalent in healthcare, customer service, the automotive industry, and virtual assistants, speech recognition improves accessibility and efficiency in our daily lives.
The Basic Idea
You likely own an electronic device equipped with speech recognition technology, such as a smartphone or smart home device. At first glance, the process behind speech recognition might seem straightforward: the device recognizes the sounds you're making, translates them into written text, and interprets that text as commands. However, the actual process involves several sophisticated stages [1]:
Signal Processing: The first step in speech recognition involves converting spoken words into a digital format that a computer can understand. When you speak, a microphone captures your voice as an acoustic signal. This signal is transformed into a digital representation the computer can process. This step lays the foundation for the next stages.
Speech Feature Extraction: You can think of this step as the sieve of speech recognition. The main goal is to filter out background noise and irrelevant information, keeping only the data that matters for the next stages. The computer measures characteristics of your voice, such as pitch, tone, and rhythm, and condenses them into a compact set of features.
Acoustic & Language Models: The extracted features are then processed by acoustic and language models. The acoustic model maps these features to phonemes (the basic units of sound in a language), while the language model adds context by capturing how words typically fit together in a language. The language model uses the previous words to predict which words are likely to come next, which keeps the transcription coherent and increases accuracy.
Decoding: The final stage combines the outputs of the acoustic and language models to generate the most probable transcription of what was spoken. This involves searching through possible combinations of words to find the sequence that best matches the input.
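To make the four stages more concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not how any particular product works: the sine tone stands in for a microphone, log energy stands in for real speech features, and the `ACOUSTIC` and `BIGRAM` tables are made-up probabilities chosen only for illustration. Production systems typically use richer features and a more efficient search than the brute-force loop shown here.

```python
import math
from itertools import product

SAMPLE_RATE = 16_000  # samples per second (an assumed, commonly used rate)

def digitize(duration_s, freq_hz, amplitude):
    """Signal processing stand-in: sample a sine tone into a list of numbers."""
    n = int(duration_s * SAMPLE_RATE)
    return [amplitude * math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE)
            for t in range(n)]

def split_frames(signal, size=400, step=160):
    """Cut the signal into overlapping 25 ms frames, one every 10 ms."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]

def log_energy(frame):
    """Feature extraction stand-in: one loudness number per frame."""
    return math.log(sum(x * x for x in frame) / len(frame) + 1e-10)

# Hypothetical acoustic model output: log P(audio | word) for two word slots.
# On sound alone, "beach" looks slightly more likely than "speech".
ACOUSTIC = [
    {"recognize": math.log(0.5), "wreck a nice": math.log(0.5)},
    {"speech": math.log(0.4), "beach": math.log(0.6)},
]

# Hypothetical bigram language model: log P(word | previous word).
BIGRAM = {
    ("<s>", "recognize"): math.log(0.6),
    ("<s>", "wreck a nice"): math.log(0.4),
    ("recognize", "speech"): math.log(0.9),
    ("recognize", "beach"): math.log(0.1),
    ("wreck a nice", "speech"): math.log(0.2),
    ("wreck a nice", "beach"): math.log(0.8),
}

def decode(acoustic, bigram, lm_weight=1.0):
    """Decoding: score every word sequence and keep the most probable one."""
    best_words, best_score = None, -math.inf
    for words in product(*(slot.keys() for slot in acoustic)):
        score, prev = 0.0, "<s>"
        for slot, word in zip(acoustic, words):
            # Add acoustic evidence and (weighted) language model evidence.
            score += slot[word] + lm_weight * bigram[(prev, word)]
            prev = word
        if score > best_score:
            best_words, best_score = words, score
    return " ".join(best_words), best_score

if __name__ == "__main__":
    signal = digitize(1.0, 440, 0.5)
    features = [log_energy(f) for f in split_frames(signal)]
    print(len(features), "feature frames")
    print(decode(ACOUSTIC, BIGRAM)[0])  # prints "recognize speech"
```

Although the made-up acoustic scores favor "beach", the language model knows that "recognize speech" is a far more common word pairing, so the combined score picks it. Setting `lm_weight=0` ignores the language model and yields a transcription ending in "beach" instead.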
It’s also important to differentiate between speech recognition and voice recognition. Speech recognition technology focuses on what is being said, without considering who is saying it. On the other hand, voice recognition is concerned with identifying who is speaking, rather than the content of what they are saying. [2]
For example, Amazon’s Alexa is equipped with both technologies. Imagine your home's security alarm is set up with Alexa, but you also have children, and you need to ensure that the alarm isn’t activated accidentally by one of them. Through speech recognition, Alexa understands the command to activate the alarm, but through voice recognition, it can differentiate between your voice and your child’s, preventing unauthorized activation or deactivation.
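The way the two technologies work together can be sketched in a few lines of Python. This is purely illustrative and does not reflect Alexa's actual implementation or API: the voiceprint vectors, the `ENROLLED_ADULT` profile, the `handle_request` function, and the 0.95 threshold are all hypothetical. The sketch assumes speech recognition has already produced a transcript and that voice recognition has reduced the speaker's voice to a short list of numbers.

```python
import math

def cosine_similarity(a, b):
    """Compare two voiceprints: values near 1.0 mean very similar voices."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical voiceprint stored when the adult enrolled their voice.
ENROLLED_ADULT = [0.9, 0.1, 0.3]

# Commands that should only work for an authorized speaker.
ALARM_COMMANDS = {"activate the alarm", "deactivate the alarm"}

def handle_request(transcript, voiceprint, threshold=0.95):
    """Use WHAT was said (transcript) and WHO said it (voiceprint)."""
    command = transcript.lower().strip()
    if command not in ALARM_COMMANDS:
        return "ignored"   # speech recognition: not an alarm command
    if cosine_similarity(voiceprint, ENROLLED_ADULT) < threshold:
        return "denied"    # voice recognition: speaker is not the enrolled adult
    return "executed"

if __name__ == "__main__":
    print(handle_request("Activate the alarm", [0.88, 0.12, 0.31]))  # prints "executed"
    print(handle_request("Activate the alarm", [0.2, 0.9, 0.4]))     # prints "denied"
```

The same spoken words lead to different outcomes depending on the speaker, which is exactly the distinction between recognizing speech and recognizing a voice.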
Now, you might be asking: how does it actually “understand” humans? This is where artificial intelligence, and more specifically deep learning, comes in. Although other approaches exist, such as Gaussian mixture models (GMMs) and hidden Markov models (HMMs), most software and devices use deep learning.
This type of AI learns from “experience,” meaning vast amounts of data. Alexa, for instance, is good at interpreting people because it has been “fed” millions of commands and human voices. Through speech recognition and deep learning, technology is able to understand and interact with us, making it significantly more intuitive and responsive than ever. [3]
About the Author
Mariana Ontañón
Mariana holds a BSc in Pharmaceutical Biological Chemistry and an MSc in Women’s Health. She’s passionate about understanding human behavior in a holistic way. Mariana combines her knowledge of health sciences with a keen interest in how societal factors influence individual behaviors. Her writing bridges the gap between intricate scientific information and everyday understanding, aiming to foster informed decisions.