AIMenta

Computer Vision

The AI subfield concerned with enabling machines to interpret images and video — object detection, classification, segmentation, tracking, and generation.

Computer vision is the AI subfield concerned with enabling machines to interpret and generate visual content — images, video, 3D scenes. The canonical tasks include **classification** (what is in this image), **detection** (where is each object), **segmentation** (pixel-level object boundaries), **tracking** (following objects across frames), **pose estimation** (body and hand configurations), **depth estimation**, **scene understanding**, **generation** (text-to-image, image-to-image, video generation), and **multimodal reasoning** (visual question answering, document understanding). The field moved from hand-crafted feature extractors (SIFT, HOG) to deep learning with AlexNet in 2012, and to Transformer-based models (ViT, DETR, SAM) over the 2020s.
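Each of these tasks comes with standard metrics; detection, for instance, is scored by intersection-over-union (IoU) between predicted and ground-truth boxes. A minimal sketch in plain Python, assuming the common `(x1, y1, x2, y2)` corner convention for boxes:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp at zero: non-overlapping boxes have no intersection area.
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A prediction covering half of the ground-truth box scores 1/3:
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # → 0.3333...
```

Detection benchmarks such as COCO typically count a prediction as correct only above an IoU cutoff (commonly 0.5), which is why localisation quality matters as much as the class label.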

The current architectural landscape is roughly three families. **Convolutional networks** (ResNet, ConvNeXt, EfficientNet) remain strong for classification and detection, especially at small dataset scales and on edge hardware. **Vision Transformers** (ViT, Swin, DINOv2) lead most large-scale benchmarks and are the default for new foundation-scale vision work. **Multimodal vision-language models** (CLIP, SigLIP, BLIP-2, LLaVA, Gemini Vision, GPT-4V, Claude Vision) combine vision with language and have become the default for open-ended tasks: VQA, document understanding, visual reasoning. Cutting across these families, **Segment Anything** (SAM, SAM 2) demonstrated that a single foundation model can segment nearly anything given a point or box prompt, collapsing segmentation into a general-purpose capability.
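The CLIP-style models in the third family classify by embedding the image and one text prompt per candidate label into a shared space, then scoring labels by cosine similarity. A toy sketch of that scoring step, with hand-written 4-d vectors standing in for real encoder outputs (the embeddings and the temperature value are illustrative assumptions, not outputs of any actual model):

```python
import math

def zero_shot_probs(image_emb, text_embs, temperature=0.07):
    """CLIP-style zero-shot scoring: cosine similarity between an image
    embedding and one text embedding per candidate label, softmaxed."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    def cosine(a, b):
        return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

    # Similarities scaled by a temperature, then a numerically stable softmax.
    logits = [cosine(image_emb, t) / temperature for t in text_embs]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 4-d embeddings standing in for real encoder outputs (illustrative only):
image = [0.9, 0.1, 0.0, 0.1]
texts = [
    [1.0, 0.0, 0.0, 0.0],  # e.g. embedding of "a photo of a cat"
    [0.0, 1.0, 0.0, 0.0],  # e.g. embedding of "a photo of a dog"
]
probs = zero_shot_probs(image, texts)
```

Because the labels are just text prompts, the same model can classify against a new label set with no retraining, which is what makes this family the usual starting point for open-ended tasks.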

For APAC mid-market enterprises, computer vision use cases cluster around several mature patterns. **Manufacturing quality inspection** — defect detection on production lines — is one of the highest-ROI industrial applications, often running on edge hardware. **Retail analytics** — shelf compliance, foot-traffic counting, checkout automation — is widely deployed. **Document intelligence** — extracting structure from invoices, contracts, forms — has been transformed by vision-language models that understand layout natively. **Medical imaging** operates under region-specific regulatory regimes (PMDA, NMPA, HSA) that gate deployment. For open-ended vision tasks, the right starting point is usually a vision-language model API; for specialised narrow tasks, a fine-tuned CNN or ViT often wins on cost and latency.

A non-obvious operational note: **computer vision training data is more brittle than text data**. Lighting, camera model, angle, resolution, compression, and scene context all shift the distribution a model sees. A model trained on clean studio images often fails on real-world phone photos; a model trained in one factory's lighting often fails in another. Rigorous evaluation on the actual deployment distribution, not the training distribution, is the difference between a model that works in pilots and one that works in production.
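This failure mode can be made concrete with a toy simulation: a brightness-threshold "detector" tuned under one lighting condition, evaluated under a shifted one. All distributions and numbers below are invented for illustration, not drawn from any real deployment:

```python
import random

random.seed(0)

# Toy defect detector: flag a part as defective when mean image brightness
# exceeds a threshold tuned under one factory's lighting (values invented).
THRESHOLD = 0.5

def sample(n, lighting_offset):
    """Generate (brightness, is_defective) pairs; defective parts run brighter."""
    data = []
    for _ in range(n):
        defective = random.random() < 0.5
        base = 0.6 if defective else 0.4
        data.append((base + lighting_offset + random.gauss(0, 0.05), defective))
    return data

def accuracy(data):
    return sum((b > THRESHOLD) == y for b, y in data) / len(data)

in_dist = sample(1000, lighting_offset=0.0)   # same lighting as training
shifted = sample(1000, lighting_offset=0.25)  # brighter deployment site
```

Under training-like lighting the toy detector is near-perfect; with a modest brightness offset it degrades toward chance, which is exactly why evaluation sets should mirror deployment conditions rather than the training distribution.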
