Automatic Speech Recognition (ASR) maps an audio waveform to a sequence of words. Classical pipelines chained an acoustic model (audio → phoneme probabilities), a pronunciation lexicon (phonemes → words), and a language model (scoring candidate word sequences); modern neural ASR fuses these stages into a single encoder-decoder model trained end-to-end on hundreds of thousands to millions of hours of audio paired with transcripts.
The 2022 release of **Whisper** by OpenAI reset expectations. A single 1.5B-parameter transformer, trained on 680K hours of multilingual web audio, now serves as the de facto baseline for self-hosted transcription. Commercial systems (Deepgram, AssemblyAI, Google Speech-to-Text) still lead on latency, diarization, and specialised vocabularies, but Whisper's open weights democratised quality.
Production ASR decisions hinge on three axes: **latency budget** (real-time streaming vs batch transcription), **domain fit** (general conversation vs medical, legal, or call-centre vocabulary), and **language coverage** (a global app that needs 40+ languages tightens the vendor shortlist quickly). For APAC mid-market deployments, Whisper-large-v3 plus a domain lexicon is usually the cheapest path to acceptable quality; graduate to Deepgram or AssemblyAI only when streaming latency or speaker labels become hard requirements.
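A minimal sketch of that starting point: batch transcription with the open-source `openai-whisper` package, followed by a lexicon pass that rewrites common misrecognitions into in-domain terms. The lexicon entries and audio path here are hypothetical examples, not shipped assets.

```python
# Sketch: Whisper batch transcription + domain-lexicon correction.
# Assumes `pip install openai-whisper`; lexicon contents are illustrative.
import re


def apply_lexicon(text: str, lexicon: dict[str, str]) -> str:
    """Rewrite known misrecognitions into in-domain vocabulary."""
    for wrong, right in lexicon.items():
        # Whole-word, case-insensitive replacement.
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text, flags=re.IGNORECASE)
    return text


def transcribe(path: str, lexicon: dict[str, str]) -> str:
    """Batch-transcribe one file, then apply the domain lexicon."""
    import whisper  # deferred: model download is heavy

    model = whisper.load_model("large-v3")
    result = model.transcribe(path)
    return apply_lexicon(result["text"], lexicon)


# Hypothetical medical lexicon: maps frequent mishearings to correct terms.
LEXICON = {"new monia": "pneumonia", "metforman": "metformin"}
```

The lexicon pass is deliberately dumb string rewriting; for production use you would typically constrain it with word timestamps or confidence scores so it cannot overwrite correct transcript spans.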