intermediate · Deep Learning

Encoder-Decoder Architecture

A two-stage architecture where an encoder compresses input into a latent representation and a decoder generates output from it — the foundation of seq2seq models.

The encoder-decoder architecture is a two-stage design where an **encoder** network compresses the input into a latent representation and a **decoder** network generates the output from that representation (usually conditioned on previous outputs as well). The pattern was the workhorse of sequence-to-sequence learning throughout the mid-2010s — machine translation, summarisation, question-answering — and remains a useful mental model for understanding Transformer variants today.

In the classical RNN era, the encoder was a recurrent network that read the input one token at a time and produced a final hidden state; the decoder was another recurrent network that generated output one token at a time, initialised from the encoder's state. The 2014-2017 wave of improvements — attention mechanisms (Bahdanau et al., 2014), Google's Neural Machine Translation system (2016), the original 2017 Transformer paper — were all structured as encoder-decoder. The Transformer's key contribution was replacing the recurrence with self-attention inside each stage and cross-attention between them, making training parallel and scaling tractable.
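The RNN-era loop described above — encoder reads tokens into a final hidden state, decoder generates from that state conditioned on its own previous outputs — can be sketched in a few lines of numpy. This is a toy with random, untrained weights and illustrative sizes, not a usable model; all names here are our own:

```python
import numpy as np

rng = np.random.default_rng(0)
H, V = 8, 5  # hidden size and vocab size (toy values)

# Vanilla RNN cell parameters: h' = tanh(W_x x + W_h h)
W_xe, W_he = rng.normal(0, 0.1, (H, V)), rng.normal(0, 0.1, (H, H))  # encoder
W_xd, W_hd = rng.normal(0, 0.1, (H, V)), rng.normal(0, 0.1, (H, H))  # decoder
W_out = rng.normal(0, 0.1, (V, H))  # decoder hidden state -> vocab logits

def one_hot(i):
    v = np.zeros(V)
    v[i] = 1.0
    return v

def encode(tokens):
    """Read the input one token at a time; return the final hidden state."""
    h = np.zeros(H)
    for t in tokens:
        h = np.tanh(W_xe @ one_hot(t) + W_he @ h)
    return h

def decode(h, max_len=4, bos=0):
    """Generate one token at a time, initialised from the encoder's state."""
    out, prev = [], bos
    for _ in range(max_len):
        h = np.tanh(W_xd @ one_hot(prev) + W_hd @ h)
        prev = int(np.argmax(W_out @ h))  # greedy choice of next token
        out.append(prev)
    return out

context = encode([1, 3, 2])   # the whole input compressed into one vector
print(decode(context))        # four generated token ids
```

The single fixed-size `context` vector is the bottleneck that attention (Bahdanau et al., 2014) was introduced to relieve: instead of one summary vector, the decoder attends over all encoder states at every step.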

Modern Transformer architectures fall into three families defined by their relationship to the encoder-decoder split. **Encoder-only** (BERT, RoBERTa) uses just the encoder half and is specialised for classification, retrieval, and embedding tasks. **Decoder-only** (GPT, Claude, Llama) uses just the decoder half with causal masking and is the foundation of every modern generative LLM. **Encoder-decoder** (T5, BART, FLAN-T5, original Whisper) retains both halves and remains preferred for tasks with clean input-output distinctions — machine translation, structured summarisation, text-to-SQL, speech recognition.
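One concrete way to see the three families is through their attention masks: encoder self-attention is bidirectional, decoder self-attention is causal, and cross-attention (present only in encoder-decoder models) lets every decoder position read the full encoded input. The sketch below just builds those boolean masks; the lengths and names are illustrative:

```python
import numpy as np

L_in, L_out = 4, 3  # toy source and target sequence lengths

# Encoder-only (BERT-style) self-attention: every position sees every position.
enc_mask = np.ones((L_in, L_in), dtype=bool)

# Decoder-only (GPT-style) self-attention: causal — position i sees only
# positions <= i, enforced with a lower-triangular mask.
dec_mask = np.tril(np.ones((L_out, L_out), dtype=bool))

# Cross-attention (encoder-decoder models only): each decoder position
# attends over the full encoded input.
cross_mask = np.ones((L_out, L_in), dtype=bool)

print(dec_mask.astype(int))  # 1s on and below the diagonal, 0s above
```

An encoder-decoder model uses all three masks; encoder-only and decoder-only models each keep just their own self-attention pattern.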

For APAC mid-market teams, the architectural choice is usually made when you select a foundation model. Decoder-only LLMs handle nearly every text-generation task despite the theoretical case for encoder-decoder on specific tasks. For **speech-to-text** and **machine translation** workflows, encoder-decoder models (Whisper, SeamlessM4T, mBART) are often still the best choice. The general rule: if your task is structured as clear-input-maps-to-clear-output (translation, speech-to-text, table-to-text), encoder-decoder is usually stronger; if your task is open-ended generation, decoder-only is the default.
