foundational · Deep Learning

Activation Function

The nonlinear function applied at each neuron — what gives neural networks their expressive power. Common choices: ReLU, GELU, Sigmoid, Tanh.

Activation functions are the nonlinearity applied to each neuron's weighted sum. Without them, a stack of linear layers collapses mathematically into a single linear layer and the network can only represent linear functions. Introducing a nonlinearity between layers is what gives deep networks their expressive power — the ability to approximate any continuous function given enough width or depth (universal approximation theorem).
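
The collapse of stacked linear layers can be seen directly in a few lines of numpy — a minimal sketch with arbitrary layer sizes, showing that two linear layers compose into one, and that inserting a ReLU breaks the collapse:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))       # batch of 4 inputs, 8 features
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 3))

# Two stacked linear layers with no activation...
deep = x @ W1 @ W2
# ...equal a single linear layer with the composed weight matrix.
single = x @ (W1 @ W2)
assert np.allclose(deep, single)

# Insert a ReLU between the layers and the collapse no longer holds.
relu = lambda z: np.maximum(z, 0.0)
nonlinear = relu(x @ W1) @ W2
assert not np.allclose(nonlinear, single)
```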

The historical progression tells the story of deep learning. **Sigmoid** and **tanh** — smooth, bounded, the defaults in classical neural networks — saturate for large inputs, producing vanishing gradients that prevented networks deeper than a few layers from training. **ReLU** — piecewise linear, unbounded above — unlocked deep learning by keeping gradients alive through many layers. **Leaky ReLU**, **ELU**, **PReLU** — variants that fix ReLU's dead-neuron failure mode. **GELU** and **Swish / SiLU** — smooth variants that slightly outperform ReLU in Transformers and modern CV backbones. **GLU**, **SwiGLU**, **GeGLU** — gated variants used in the feed-forward layers of state-of-the-art LLMs.
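
Most of these functions are one-liners. The sketch below writes a few of them out in numpy, plus a SwiGLU-style gated feed-forward block — the GELU shown is the common tanh approximation, and the weight names `W`, `V`, `W_out` are illustrative rather than taken from any specific model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # bounded in (0, 1); saturates for large |x|

def relu(x):
    return np.maximum(x, 0.0)              # zero gradient for x < 0: dead-neuron risk

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small negative slope avoids dead neurons

def silu(x):                               # a.k.a. Swish
    return x * sigmoid(x)

def gelu(x):                               # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def swiglu_ffn(x, W, V, W_out):
    # SwiGLU-style gated feed-forward: SiLU-gated elementwise product of two
    # linear projections, followed by an output projection.
    return (silu(x @ W) * (x @ V)) @ W_out
```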

For production ML, the activation function is almost always a decision that has already been made by whichever pretrained architecture you adopt. Llama uses SwiGLU; GPT uses GELU; ResNets use ReLU; vision transformers use GELU. Swapping activations in a pretrained network rarely helps and usually breaks something. The place where activation choice still matters is custom architectures — a small MLP you are designing yourself — where the defaults (ReLU for speed, GELU for slight quality gain) are still the right starting points.
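
In a custom architecture, the activation is just a swappable parameter of the forward pass — a minimal numpy sketch, where the `mlp_forward` helper and layer sizes are illustrative:

```python
import numpy as np

def mlp_forward(x, layers, activation):
    # Apply each hidden layer followed by the activation; the final layer
    # stays linear, as is conventional for regression / logits.
    for W, b in layers[:-1]:
        x = activation(x @ W + b)
    W, b = layers[-1]
    return x @ W + b

relu = lambda z: np.maximum(z, 0.0)   # the fast default; swap in GELU for a small quality gain

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 8)) * 0.5, np.zeros(8)),
          (rng.normal(size=(8, 2)) * 0.5, np.zeros(2))]
out = mlp_forward(rng.normal(size=(3, 4)), layers, relu)
print(out.shape)  # (3, 2)
```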

The non-obvious operational note: activation choice interacts with initialisation and normalisation. He (Kaiming) initialisation was derived for ReLU; analogous variance-preserving variants exist for GELU. Networks without batch norm or layer norm are more sensitive to activation choice than networks with normalisation. If you inherit a pretrained architecture, inherit its initialisation and normalisation with it.
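
The interaction with initialisation can be checked numerically — a minimal numpy sketch of He initialisation, whose `std = sqrt(2 / fan_in)` is chosen so that ReLU zeroing half of each layer's pre-activations does not shrink the signal:

```python
import numpy as np

def he_init(fan_in, fan_out, rng):
    # He (Kaiming) initialisation for ReLU layers: std = sqrt(2 / fan_in)
    # compensates for ReLU discarding half of each pre-activation's variance.
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

# Sanity check: signal magnitude stays roughly constant through a deep
# ReLU stack instead of vanishing or exploding.
rng = np.random.default_rng(0)
x = rng.normal(size=(1024, 256))
for _ in range(20):
    x = np.maximum(x @ he_init(256, 256, rng), 0.0)
print(np.mean(x**2))  # stays on the order of 1 after 20 layers
```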

Where AIMenta applies this

Service lines where this concept becomes a deliverable for clients.

Beyond this term

Where this concept ships in practice.

Encyclopedia entries name the moving parts. The links below show where AIMenta turns these concepts into engagements — across service pillars, industry verticals, and Asian markets.

Continue with All terms · AI tools · Insights · Case studies