What Automated Speech Recognition Reveals
About Our Relationship with AI

AI is equally awe-inspiring and alienating, forming the fault line for one of our most divisive cultural debates. Is it the messianic tool that will streamline industry and everyday life, or is it the harbinger of our obsolescence? 

It’s a high-stakes question and perspectives are understandably polarized, especially in the early innings of societal integration. It’s not surprising that the rapid adoption of machine learning tools triggers apprehension. It would be weird if it didn’t.

ML platforms appear and operate like mysterious black boxes. Their learning architecture, quite literally, has hidden layers between data ingestion and output, concealing the very means of their refinement. It leaves a lot to the imagination, but we can be buoyed by the fact that these tools rest on an inherently human foundation.

This balance is exemplified by Automated Speech Recognition (ASR) engines and the transcription tools they support. The real-time speed with which these tools transcribe (and annotate) complex medical discussions can be humbling and give the appearance of full machine sentience; but believe it or not, they become even more impressive when you see how they leverage human input.

The ASR Pipeline

Take a moment to envision a human transcriptionist. What are they doing? Likely listening to someone, or a recording, and jotting down what they hear. You might picture much the same when you imagine a machine transcribing. Audio goes in, something-something-digibrain, words.

The hidden layers of it all make this a completely reasonable impulse, but the machine’s process is actually longer than ours; it just happens at incomprehensible speed. It’s a whole pipeline, the ASR pipeline to be exact, and here’s how it works:

The process begins with a recording of a patient consulting their care provider. Once complete, the captured audio undergoes pre-processing, where background noise is filtered out and volume is normalized so the ASR engine works with the cleanest possible signal.
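
To make this step a little less abstract, here’s a minimal sketch of what pre-processing could look like in Python with NumPy and SciPy. The 80 Hz high-pass cutoff and the peak-normalization target are illustrative assumptions, not a description of any particular engine.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfilt

def preprocess(path: str, highpass_hz: float = 80.0, target_peak: float = 0.9):
    """Load a consultation recording, strip low-frequency rumble, and normalize volume."""
    rate, audio = wavfile.read(path)              # sample rate (Hz) and raw samples
    audio = audio.astype(np.float32)
    if audio.ndim > 1:                            # fold stereo down to mono
        audio = audio.mean(axis=1)
    audio /= np.abs(audio).max() + 1e-9           # scale samples into [-1, 1]

    # A gentle high-pass filter removes HVAC hum and microphone-handling noise.
    sos = butter(4, highpass_hz, btype="highpass", fs=rate, output="sos")
    audio = sosfilt(sos, audio)

    # Peak-normalize so every recording reaches the engine at a consistent level.
    audio *= target_peak / (np.abs(audio).max() + 1e-9)
    return rate, audio
```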

I See What You’re Saying

Let’s briefly return to our human transcriptionist. Unlike them, an ASR engine doesn’t immediately recognize phonemes, the units of sound that distinguish one word from another [e.g., the sh in ship, or the o (aw) in dog]. As such, the engine has to take a step back and visualize those sounds as a spectrogram.

To do this, it splits the recording into small, overlapping frames and, through a process called the Short-Time Fourier Transform (STFT), maps those frames onto the x- and y-axes of a spectrogram. The former tracks time across the recording while the latter shows the frequencies present in the captured audio. Mapped sounds are also assigned a color gradient based on their intensity, or amplitude.
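
Here’s a small sketch of that framing step built on SciPy’s STFT. The 25-millisecond frame length and 10-millisecond hop are common textbook defaults assumed for illustration, and the `rate` and `audio` inputs are presumed to come from a pre-processing step like the one sketched above.

```python
import numpy as np
from scipy.signal import stft

def to_spectrogram(audio: np.ndarray, rate: int, frame_ms: float = 25.0, hop_ms: float = 10.0):
    """Split the waveform into short overlapping frames and map them to time/frequency bins."""
    nperseg = int(rate * frame_ms / 1000)             # samples per frame (~25 ms is a common choice)
    noverlap = nperseg - int(rate * hop_ms / 1000)    # overlap so frames advance every ~10 ms
    freqs, times, Z = stft(audio, fs=rate, nperseg=nperseg, noverlap=noverlap)

    # times -> x-axis (where we are in the recording), freqs -> y-axis (pitch content),
    # and the magnitude (converted to decibels) becomes the color intensity, i.e. amplitude.
    magnitude_db = 20 * np.log10(np.abs(Z) + 1e-10)
    return freqs, times, magnitude_db
```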

Visualizing these audible elements helps ASR engines discern consonant and vowel enunciations, transitions between words, and vocal inflections that may distinguish a question from a statement.

Accessing the Neural Network

Now for the more familiar, mystifying aspects of AI: the learning in machine learning. Once recorded audio is converted into a spectrogram, it’s like that abstract painting you ponder for several minutes in a museum. There’s meaning there, but it’s not abundantly clear. That’s where the acoustic model comes in.

The acoustic model translates mapped audio signals into phonemes using one or more neural networks. These networks are complex data interpretation models composed of hundreds (sometimes thousands) of interconnected nodes, similar to the neurons and synapses in our brain. A Convolutional Neural Network (CNN), for example, can be used by an ASR engine to compare overlapping spectrogram frames, identify recurrent phonetic patterns, and associate them with word segments.

A CNN may also work with a Recurrent Neural Network (RNN), such as a Long Short-Term Memory (LSTM) network, which reviews existing word segments and predicts subsequent ones based on preceding syntax and intonation. The acoustic model’s output resembles a dialogue tree, where multiple responses are connected to each question or prompt.
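
A toy PyTorch sketch can make that pairing concrete. Everything here is a placeholder: the layer sizes, the 201 frequency bins (what a 25 ms frame at 16 kHz would produce), and the 48-phoneme inventory stand in for the far larger, more carefully engineered models a production engine would use.

```python
import torch
import torch.nn as nn

class TinyAcousticModel(nn.Module):
    """Toy CNN + LSTM acoustic model: spectrogram frames in, per-frame phoneme scores out."""
    def __init__(self, n_freq_bins: int = 201, n_phonemes: int = 48):
        super().__init__()
        # The CNN scans overlapping time/frequency patches for local phonetic patterns.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
        )
        # The LSTM reads those features in order, carrying context from earlier frames forward.
        self.lstm = nn.LSTM(16 * n_freq_bins, 128, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 128, n_phonemes)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, time, frequency) magnitude spectrogram
        x = self.cnn(spec.unsqueeze(1))           # -> (batch, channels, time, frequency)
        x = x.permute(0, 2, 1, 3).flatten(2)      # -> (batch, time, channels * frequency)
        x, _ = self.lstm(x)
        return self.classifier(x)                 # per-frame phoneme scores (logits)

# Example: 100 spectrogram frames produce 100 rows of competing phoneme scores.
scores = TinyAcousticModel()(torch.randn(1, 100, 201))
print(scores.shape)  # torch.Size([1, 100, 48])
```

Each row of that output is a set of competing guesses for one slice of audio, which is what gives the next stage its branching, dialogue-tree shape.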

Following the Map

Once the acoustic model produces its sprawling outline of what a patient and care provider could be talking about, it passes it on to the ASR engine’s language model. Essentially, the former has provided the latter with the world’s least-helpful roadmap, where every possible route (even the most roundabout) has been charted. 

The language model is tasked with finding the best one (i.e., the most coherent, accurate transcription), but it doesn’t do this alone. It references a wide variety of human-imparted source material, including a series of Natural Language Processing (NLP) libraries that help it identify specific languages, dialects, and other conversational nuances that influence meaning.
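
As a toy illustration of that “pick the best route” step, here’s a tiny bigram scorer in Python. The vocabulary and probabilities are hand-picked stand-ins; real language models are statistical or neural systems trained on vastly more text.

```python
import math

# Toy bigram language model: score each candidate route and keep the most plausible one.
BIGRAM_LOG_PROBS = {                              # illustrative, hand-picked values
    ("patient", "reports"): math.log(0.20),
    ("patient", "report"): math.log(0.02),
    ("reports", "chest"): math.log(0.15),
    ("chest", "pain"): math.log(0.30),
    ("chest", "pane"): math.log(0.001),
}
FLOOR = math.log(1e-6)                            # penalty for word pairs we have never seen

def lm_score(words: list[str]) -> float:
    """Sum log-probabilities over adjacent word pairs; higher means more coherent."""
    return sum(BIGRAM_LOG_PROBS.get(pair, FLOOR) for pair in zip(words, words[1:]))

candidates = [
    "patient reports chest pain".split(),
    "patient report chest pane".split(),
]
best = max(candidates, key=lm_score)
print(" ".join(best))  # -> "patient reports chest pain"
```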

Language models are also highly customized, especially those that facilitate clinical transcription. These tools need to accurately represent patient conversations and maintain a thorough comprehension of evolving, specialty-specific terminology and treatments. To ensure this, language models are trained with information from medical dictionaries, clinical notes, research papers, consultation transcripts, and practice-specific lexicons.
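
One simple way to imagine that customization is a practice-specific lexicon that snaps near-miss words onto known clinical terms. The sketch below uses Python’s difflib for fuzzy matching; the cardiology terms and the similarity cutoff are hypothetical examples, not any vendor’s actual mechanism.

```python
import difflib

# Hypothetical practice-specific lexicon; in a real deployment these terms would be drawn
# from medical dictionaries, clinical notes, and the clinic's own documentation.
CARDIOLOGY_LEXICON = ["metoprolol", "echocardiogram", "atrial fibrillation", "troponin"]

def snap_to_lexicon(term: str, cutoff: float = 0.8) -> str:
    """Replace a near-miss with its closest in-domain term, if one is similar enough."""
    match = difflib.get_close_matches(term.lower(), CARDIOLOGY_LEXICON, n=1, cutoff=cutoff)
    return match[0] if match else term

print(snap_to_lexicon("metoprolil"))   # -> "metoprolol"
print(snap_to_lexicon("breakfast"))    # -> "breakfast" (left alone; not a clinical term)
```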

Preserving the Human Touch

When the language model produces its transcription, thisishowitreads. That’s where the decoder comes in. This mechanism applies necessary grammatical corrections and formatting adjustments, like capitalization and punctuation, to make transcriptions readable.
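
A minimal sketch of that formatting pass might look like the following, assuming the raw text already has word boundaries. Real post-processing handles far more (medical abbreviations, numerals, section headings), but the spirit is the same.

```python
import re

def tidy_transcript(raw: str) -> str:
    """A very small stand-in for the formatting pass: punctuation spacing and capitalization."""
    text = re.sub(r"\s+", " ", raw.strip())               # collapse stray whitespace
    text = re.sub(r"\s+([,.?!])", r"\1", text)            # no space before punctuation
    sentences = re.split(r"(?<=[.?!])\s+", text)          # split on sentence boundaries
    sentences = [s[:1].upper() + s[1:] for s in sentences if s]
    text = " ".join(sentences)
    return text if text.endswith((".", "?", "!")) else text + "."

print(tidy_transcript("the patient denies chest pain .  follow up in two weeks"))
# -> "The patient denies chest pain. Follow up in two weeks."
```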

It may not be the most fascinating stage, but it does remind us that ASR is a process for humans. It’s representative of what we often overlook when envisioning the future of AI and our place within it. Some may see an AI tool produce a perfectly transcribed, annotated dictation, or an outline for an 800-page novel, or a concise breakdown of the most complicated thing they can think of, in seconds, and understandably wonder, “Well, what good am I?”

But before they spiral, I’d ask them to reconsider those neural network nodes. The way they communicate is similar to the way our own brains do, but that’s not the only similarity. Like us, these nodes are also conditioned by experiential bias. Each one cross-references data against a unique set of parameters that allow it to assess input relevance and measure the discrepancy between its interpretation and reality. This discrepancy, or loss value, is calculated continuously so the network’s optimization algorithm can improve processing accuracy over time.
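
To ground that idea, here’s a deliberately tiny PyTorch sketch of a single training update: compute the loss between the network’s interpretation and the expected answer, then nudge the parameters to shrink it. The one-layer model and random data are placeholders; in a real pipeline the expected answers come from verified transcripts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A deliberately tiny stand-in: one linear layer scoring 48 hypothetical phonemes per frame.
model = nn.Linear(201, 48)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

frames = torch.randn(32, 201)                 # 32 spectrogram frames (placeholder data)
labels = torch.randint(0, 48, (32,))          # expected phoneme for each frame (placeholder)

logits = model(frames)                        # the network's interpretation
loss = F.cross_entropy(logits, labels)        # the discrepancy between interpretation and reality
print(f"loss before update: {loss.item():.3f}")

optimizer.zero_grad()
loss.backward()                               # trace how much each parameter contributed to the error
optimizer.step()                              # nudge the parameters to shrink the loss next time
```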

And who is absolutely instrumental in shaping those parameters…? That’s right, us.

We provide the knowledge base and serve as the final quality check. The ASR pipeline doesn’t end with a blindly accepted transcript; it’s punctuated with careful, collective review from medical editors, practitioners, and even patients. 

Our relationship with AI is often characterized as an adversarial one, where only one side can prevail. In actuality, AI isn’t a rival; it’s a mirror. Its sophistication, goodwill, and evolution will reflect our own, and that might be what scares some. But for every gloomy prognostication, we should also take heart when AI is utilized in ways that bring us closer. Innovations like ASR and clinical dictation remind us that one of the most powerful engines driving this period of staggering (sometimes jarring) technological advancement is our own instinct to care for one another.