
Convolutional Neural Network (CNN)

A neural-network architecture specialised for grid-structured data (especially images) via learned convolutional filters that exploit spatial locality.

A Convolutional Neural Network (CNN) is a neural-network architecture specialised for grid-structured data — most famously images, but also audio spectrograms, time series, and any data with local spatial or temporal structure. The defining layer is the **convolutional layer**: a small learned filter slides across the input, computing a dot product at each position to produce a feature map. The approach exploits two strong priors about images: **locality** (nearby pixels are more related than distant ones) and **translation invariance** (a feature's meaning does not depend on its absolute position). These priors dramatically reduce the parameter count compared with dense layers, which is what made deep CNNs trainable on the datasets and GPUs available in the early 2010s.
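The sliding-filter computation can be sketched in a few lines of NumPy (a minimal illustration of the operation, not how production frameworks implement it — they use heavily optimised kernels):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, 'valid' padding),
    taking a dot product at each position to build the feature map."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = image[y:y + kh, x:x + kw]
            out[y, x] = np.sum(patch * kernel)  # dot product at this position
    return out

# Toy 6x6 image: dark left half, bright right half.
image = np.array([[0, 0, 0, 1, 1, 1]] * 6, dtype=float)

# A classic 3x3 vertical-edge filter; the SAME 9 weights are reused at
# every position (weight sharing), so the layer has 9 parameters total,
# versus 36 * 16 = 576 for a dense layer mapping 36 inputs to 16 outputs.
vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]], dtype=float)

fmap = conv2d_valid(image, vertical_edge)
print(fmap.shape)  # (4, 4) — the filter fires only along the edge columns
```

The filter responds strongly (here, with value -3) exactly where the dark-to-bright boundary falls under its window, and is silent in the flat regions — locality and translation invariance in action.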

The field moved through several architectural eras. **LeNet** (LeCun, 1989-1998) established the basic pattern. **AlexNet** (2012) triggered the deep-learning revolution by winning ImageNet with a GPU-trained deep CNN. **VGG** (2014) showed that stacking many small 3×3 filters made deeper networks more accurate. **ResNet** (2015) introduced residual connections that made 100+-layer networks trainable, and dominated ImageNet leaderboards for years. **EfficientNet** (2019) formalised compound scaling of depth, width, and resolution. **ConvNeXt** (2022) modernised CNN design to compete with Transformers. Meanwhile, **Vision Transformers** (ViT, 2020) and **hybrid architectures** (Swin, CoAtNet) have taken the top of many leaderboards — though CNNs remain strong for many practical tasks, especially on small datasets or when hardware efficiency matters.

For APAC mid-market teams working on vision tasks, the practical question is rarely CNN-vs-ViT in the abstract — it is "which pretrained model on Hugging Face fine-tunes best on my data budget". For small-dataset classification (under 10K labelled images), pretrained CNNs like ResNet and ConvNeXt often fine-tune more sample-efficiently than large ViTs. For larger datasets and complex scenes, modern ViTs and hybrids frequently edge ahead. Benchmark both on your actual data before committing.

The non-obvious operational note: **CNN architectures age well for edge deployment**. Mobile-optimised CNN families (MobileNet, EfficientNet-Lite) remain the strongest choices for on-device inference where every millisecond and milliwatt counts. If your use case is on-device vision — retail self-checkout, factory inspection on edge appliances, smartphone cameras — start with a CNN, not a ViT.
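Much of the efficiency of mobile CNN families comes from depthwise-separable convolutions, which factor a standard convolution into a per-channel spatial filter plus a 1×1 channel mixer. A back-of-envelope parameter count shows why (a hedged sketch — real MobileNet layers also carry batch-norm and bias terms, and the example layer sizes are illustrative):

```python
def standard_conv_params(c_in, c_out, k):
    # Standard conv: one k x k filter per (input-channel, output-channel) pair.
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    # Depthwise step: one k x k spatial filter per input channel.
    # Pointwise step: a 1x1 conv mixing c_in channels into c_out.
    return c_in * k * k + c_in * c_out

# A typical mid-network layer: 128 -> 128 channels, 3x3 kernels.
std = standard_conv_params(128, 128, 3)        # 147,456 weights
sep = depthwise_separable_params(128, 128, 3)  # 17,536 weights
print(round(std / sep, 1))  # roughly 8.4x fewer parameters
```

The same factoring cuts multiply-accumulate operations by a similar ratio, which is where the millisecond and milliwatt savings on edge hardware come from.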
