intermediate · Generative AI

Diffusion Model

A generative-model architecture that learns to reverse a noise-adding process — used in Stable Diffusion, DALL-E, Midjourney, and most modern image/video generators.

Diffusion models learn to generate data by reversing a noise-adding process. During training, real images (or video frames, or audio spectrograms) are gradually corrupted with Gaussian noise across many steps until they become pure static; the model is trained to predict the noise added at each step, which is mathematically equivalent to learning to denoise. At generation time the process runs in reverse: start from random noise, iteratively denoise, arrive at a coherent sample. The technique was popularised by Ho et al.'s 2020 **DDPM** paper and has since come to dominate generative image, video, and audio modelling, and increasingly protein-structure generation.
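The forward (noising) half of this process has a convenient closed form: the noisy sample at any step can be drawn directly from the clean input, without simulating every intermediate step. A minimal sketch, assuming the linear beta schedule from the DDPM paper (the schedule endpoints and the 8×8 "image" are illustrative, not tied to any specific model):

```python
import numpy as np

# Linear variance schedule over T steps, as in Ho et al. (2020).
T = 1000
betas = np.linspace(1e-4, 0.02, T)    # per-step noise variance
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)        # cumulative signal retention at step t

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps.
    Training would ask the model to predict eps given (x_t, t)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))      # stand-in for an image
x_mid, _ = q_sample(x0, 500, rng)     # partially noised
x_end, _ = q_sample(x0, T - 1, rng)   # almost pure static: alpha_bar is tiny
```

Generation runs the loop the other way: starting from pure noise, the trained network's noise estimate is subtracted a little at a time over the same `T` steps (or a reduced schedule) until a clean sample remains.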

The architectures underneath shifted in two big steps. Early diffusion models used a **U-Net** backbone operating in pixel space — too expensive to train at high resolution. **Latent diffusion** (Stable Diffusion's breakthrough) moved the diffusion process into the latent space of a pretrained autoencoder, slashing compute by 10×+ at comparable quality. The more recent shift is to **diffusion transformers** (DiT) — transformer backbones replacing U-Net — which scale more predictably and now power Stable Diffusion 3 / 3.5, FLUX, Sora, and Veo.
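The latent-diffusion saving is easy to make concrete. A back-of-envelope sketch, assuming Stable Diffusion's typical shapes (a 512×512 RGB image encoded to a 64×64×4 latent by an 8×-downsampling autoencoder; the exact factor varies by model):

```python
# Elements the denoising backbone must process per step, pixel vs latent space.
pixel_elems = 512 * 512 * 3      # raw RGB image: 786,432 values
latent_elems = 64 * 64 * 4       # autoencoder latent: 16,384 values
reduction = pixel_elems / latent_elems
print(reduction)                 # 48.0 — ~48x fewer values per denoising step
```

Per-step compute scales with (at least) the number of values being denoised, which is why moving diffusion into latent space yields the order-of-magnitude savings noted above.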

For image generation specifically, the 2026 commercial landscape is dominated by diffusion: Midjourney, DALL-E 3, Stable Diffusion 3.5, FLUX.1 [pro], Ideogram, Adobe Firefly, and most self-hosted stacks. Video follows the same architecture with temporal layers: Sora 2, Veo 3, Kling 2.0, Runway Gen-4, Luma Ray 2. The diffusion paradigm has proven more robust for high-fidelity perceptual generation than autoregressive approaches; autoregressive still wins for text and code.

For APAC mid-market enterprises, the architectural detail rarely matters in purchasing decisions — the question is almost always which **hosted product** or which **self-hostable checkpoint** to integrate. The architecture does matter when evaluating fine-tuning: latent-diffusion checkpoints fine-tune cheaply with LoRA at consumer-GPU scale, which is why every brand-style customisation workflow you'll encounter is built on Stable Diffusion or FLUX rather than on closed models.
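The reason LoRA fine-tuning fits on consumer GPUs is a parameter-count argument. A rough sketch for a single attention projection, assuming an illustrative 320×320 weight matrix and rank r=8 (both numbers are hypothetical stand-ins, not taken from any specific checkpoint):

```python
# LoRA freezes the pretrained weight W and trains only a low-rank update:
#   W_eff = W + B @ A,  with B of shape (d_out, r) and A of shape (r, d_in).
d_out, d_in, r = 320, 320, 8
full_params = d_out * d_in           # 102,400 trainable if tuning W directly
lora_params = r * (d_out + d_in)     # 5,120 trainable in the B and A factors
ratio = full_params // lora_params
print(ratio)                         # 20 — ~20x fewer trainable parameters
```

Multiplied across every targeted layer, this is what turns fine-tuning from a multi-GPU job into something a single consumer card can handle — and why the self-hostable checkpoints dominate brand-style customisation.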

Where AIMenta applies this

Service lines where this concept becomes a deliverable for clients.

Beyond this term

Where this concept ships in practice.

Encyclopedia entries name the moving parts. The links below show where AIMenta turns these concepts into engagements — across service pillars, industry verticals, and Asian markets.
