AIMenta
Acronym intermediate · Generative AI

Text-to-Speech (TTS)

Synthesising natural-sounding speech from written text: the inverse of automatic speech recognition (ASR), now approaching human parity for major languages.

Text-to-Speech crossed a qualitative threshold in the 2023–2025 window. Neural TTS systems (Tacotron, FastSpeech, VITS, and the diffusion-based generative models reported to power ElevenLabs and OpenAI's TTS API) produce speech that most listeners cannot distinguish from a real recording in short samples. Two gaps remain: prosody in long-form narration, and emotional range in hard edge cases.

Two deployment shapes dominate: **cloud API** (ElevenLabs, OpenAI TTS, Google Cloud Text-to-Speech, Azure neural voices) for cost-effective scale and regular voice refreshes, and **self-hosted open models** (Piper, XTTS, StyleTTS) when privacy or latency requirements rule out sending data off-premises. Voice cloning, which synthesises a specific person's voice from seconds of audio, is increasingly gated by consent verification and watermarking to reduce misuse.
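
The choice between the two shapes can be written down as a small decision rule. The sketch below is illustrative only: the `TtsRequirements` fields, the `choose_deployment` function, and the 300 ms default round trip are hypothetical names and values, not part of any vendor API. It encodes just the rule stated above, that privacy or latency constraints push a project towards self-hosting.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TtsRequirements:
    """Hypothetical project constraints; field names are illustrative."""
    data_may_leave_network: bool               # False when privacy rules forbid egress
    max_latency_ms: Optional[float] = None     # None means no hard latency budget
    expected_cloud_latency_ms: float = 300.0   # assumed round trip; measure your own


def choose_deployment(req: TtsRequirements) -> str:
    """Return "self_hosted" when privacy or latency rules out egress, else "cloud_api"."""
    if not req.data_may_leave_network:
        return "self_hosted"
    if (req.max_latency_ms is not None
            and req.expected_cloud_latency_ms > req.max_latency_ms):
        return "self_hosted"
    return "cloud_api"


if __name__ == "__main__":
    # No constraints: the cloud API is the default shape.
    print(choose_deployment(TtsRequirements(data_may_leave_network=True)))
    # Privacy forbids egress: self-host.
    print(choose_deployment(TtsRequirements(data_may_leave_network=False)))
    # Latency budget tighter than the assumed cloud round trip: self-host.
    print(choose_deployment(TtsRequirements(True, max_latency_ms=100.0)))
```

In practice the latency figure should come from measurements against the specific provider and region, and other factors (voice quality, language coverage, operating cost) would also weigh in.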

Enterprise TTS use cases in APAC include Mandarin, Cantonese, and Japanese IVR replacement, e-learning narration at scale, accessibility compliance (screen-reader quality), and multilingual marketing audio from a single source script. Cost per 1M characters has dropped roughly 10× in three years and continues to fall.
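
To make per-character pricing concrete, here is a minimal arithmetic sketch. The price of 15.0 per 1M characters and the 250,000-character script are made-up inputs chosen for easy arithmetic, not quotes from any provider; `tts_cost` and `price_after_drop` are hypothetical helper names.

```python
def tts_cost(num_chars: int, price_per_million_chars: float) -> float:
    """Cost of synthesising num_chars characters at a per-1M-character price."""
    if num_chars < 0 or price_per_million_chars < 0:
        raise ValueError("inputs must be non-negative")
    return num_chars / 1_000_000 * price_per_million_chars


def price_after_drop(old_price: float, factor: float = 10.0) -> float:
    """Price after a drop by the given factor (10.0 models a ~10x reduction)."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return old_price / factor


if __name__ == "__main__":
    script_chars = 250_000   # hypothetical e-learning script length
    old_price = 15.0         # hypothetical price per 1M characters
    new_price = price_after_drop(old_price)

    print(f"before: {tts_cost(script_chars, old_price):.3f}")
    print(f"after:  {tts_cost(script_chars, new_price):.3f}")
```

Because cost scales linearly with character count, a 10× price drop translates directly into a 10× lower bill for the same script.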
