AIMenta
Acronym · Intermediate · Natural Language Processing

Automatic Speech Recognition (ASR)

Converting spoken audio into text — the foundation of voice assistants, transcription services, and most speech-to-text workflows.

Automatic Speech Recognition maps an audio waveform to a sequence of words. Classical ASR pipelines separated acoustic modelling (audio → phonemes) from language modelling (phonemes → text); modern neural ASR fuses both into a single encoder-decoder model trained end-to-end on millions of hours of audio paired with transcripts.
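Many end-to-end ASR models (CTC-trained encoders in particular) emit one token per audio frame, then collapse that frame sequence into text. A toy illustration of that collapse step, using `_` as the blank symbol — the function name and alphabet here are illustrative, not from any specific library:

```python
BLANK = "_"  # CTC blank symbol (illustrative choice)

def ctc_collapse(frame_tokens):
    """Merge consecutive duplicate tokens, then drop blanks."""
    out = []
    prev = None
    for tok in frame_tokens:
        if tok != prev:          # merge repeats ("hh" -> "h")
            if tok != BLANK:     # drop the blank separator
                out.append(tok)
        prev = tok
    return "".join(out)

# Frame-level output for a short utterance; the blank between
# the two "l" frames keeps the real double letter.
print(ctc_collapse(list("hh_e_l_l_oo")))  # -> hello
```

Real decoders add beam search and a language-model rescoring pass on top of this greedy collapse, but the repeat-merge-then-drop-blank rule is the core of how frame predictions become words.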

The 2022 release of **Whisper** by OpenAI reset expectations. A single 1.5B-parameter transformer, trained on 680K hours of multilingual web audio, now serves as the de facto baseline for self-hosted transcription. Commercial systems (Deepgram, AssemblyAI, Google Speech-to-Text) still lead on latency, diarization, and specialised vocabularies, but Whisper's open weights democratised quality.

Production ASR decisions hinge on three axes: **latency budget** (real-time streaming vs batch transcription), **domain fit** (general conversation vs medical/legal/call-centre vocabulary), and **language coverage** (global apps may need 40+ languages, which quickly tightens the vendor shortlist). For APAC mid-market, Whisper-large-v3 plus a domain lexicon is usually the best quality-per-cost starting point; graduate to Deepgram or AssemblyAI only when streaming latency or speaker labels become hard requirements.
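The three axes above can be sketched as a shortlist filter. This is a hedged illustration only: the capability flags and language counts in `ENGINES` are placeholder assumptions for the example, not verified vendor specifications.

```python
def shortlist(engines, need_streaming=False, need_diarization=False,
              min_languages=1):
    """Return engine names that satisfy every hard requirement."""
    return [name for name, caps in engines.items()
            if (not need_streaming or caps["streaming"])
            and (not need_diarization or caps["diarization"])
            and caps["languages"] >= min_languages]

# Placeholder capability data for illustration only.
ENGINES = {
    "whisper-large-v3": {"streaming": False, "diarization": False, "languages": 99},
    "deepgram":         {"streaming": True,  "diarization": True,  "languages": 36},
    "assemblyai":       {"streaming": True,  "diarization": True,  "languages": 17},
}

# A live-captioning product with 30+ target languages:
print(shortlist(ENGINES, need_streaming=True, min_languages=30))  # -> ['deepgram']
```

With no hard streaming or diarization requirement, the filter keeps the self-hosted option in play, which matches the guidance above: start with Whisper and graduate only when a hard requirement forces it.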

Where AIMenta applies this

Service lines where this concept becomes a deliverable for clients.

Beyond this term

Where this concept ships in practice.

Encyclopedia entries name the moving parts. The links below show where AIMenta turns these concepts into engagements — across service pillars, industry verticals, and Asian markets.
