What it does

Key features

Pull-based metric scraping — Prometheus scrapes instrumented targets at configurable intervals with Kubernetes auto-discovery
PromQL — powerful functional query language for multi-dimensional metric aggregation and alerting rule expression
Alertmanager — alert routing, grouping, and notification integration with PagerDuty, OpsGenie, Slack, and email
Kubernetes native — first-class kube-state-metrics and kubelet metrics collection for Kubernetes cluster monitoring
OpenMetrics compatibility — standard exposition format adopted across APAC cloud providers and instrumentation libraries
Grafana integration — canonical Grafana data source enabling the full LGTM observability stack
Long-term storage via Thanos/Mimir — distributed Prometheus for multi-cluster APAC enterprise deployments

When to reach for it

Best for

APAC Kubernetes platform teams wanting the cloud-native metrics standard without per-series SaaS pricing
Engineering teams adopting the LGTM (Loki/Grafana/Tempo/Mimir) open-source observability stack
APAC platform teams with Kubernetes expertise wanting operational control over their metrics infrastructure
Cost-conscious APAC organisations where managed observability platforms are cost-prohibitive at production metric volumes

Don't get burned

Limitations to know

! Single-node Prometheus has storage scaling limits at large APAC Kubernetes clusters — Thanos or Mimir required at scale
! PromQL has a learning curve for APAC teams not familiar with functional query language paradigms
! Operational complexity — running production Prometheus with high availability and long-term storage requires platform engineering investment
! Not a complete observability solution — Prometheus handles metrics only; logs and traces require Loki and Tempo alongside it

Context

About Prometheus

Prometheus is an open-source monitoring and alerting toolkit that has become the de facto standard for metrics collection in cloud-native Kubernetes environments. It is widely used by APAC engineering teams running Kubernetes workloads as the primary metrics backend for infrastructure and application monitoring, and is most commonly visualised through Grafana dashboards alongside the LGTM (Loki/Grafana/Tempo/Mimir) observability stack.

Prometheus's pull-based architecture, in which Prometheus scrapes metrics from instrumented targets (Kubernetes pods, nodes, and services) at configurable intervals rather than having targets push metrics to a centralised receiver, gives APAC platform teams several operational advantages. First, Prometheus can detect when a target becomes unreachable: a failed scrape sets that target's automatically generated "up" series to 0, providing implicit availability monitoring without explicit health check instrumentation. Second, Prometheus controls the scrape rate, which prevents the target-side metric flooding that can overwhelm push-based collectors under high load. Third, Kubernetes service discovery lets Prometheus find pods and services to scrape based on labels and annotations, so as APAC Kubernetes workloads scale horizontally, new pod instances are discovered and scraped without manual configuration.
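The pull model and its implicit availability signal can be sketched in a few lines of Python. This is an illustration only, not how Prometheus is implemented: `scrape`, `parse_samples`, and the two target functions are hypothetical names, the targets are plain callables standing in for an HTTP fetch of a metrics endpoint, and the parser handles only bare "name value" lines. The one detail taken from Prometheus itself is the "up" value, recorded as 1 for a successful scrape and 0 for a failed one.

```python
from typing import Callable, Dict

# Hypothetical stand-in for an HTTP GET against a target's metrics endpoint.
Fetch = Callable[[], str]


def parse_samples(body: str) -> Dict[str, float]:
    """Parse a tiny subset of the text exposition format: 'name value' lines."""
    samples: Dict[str, float] = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        name, value = line.rsplit(" ", 1)
        samples[name] = float(value)
    return samples


def scrape(targets: Dict[str, Fetch]) -> Dict[str, Dict[str, float]]:
    """Pull from every target; the scraper, not the target, sets the pace."""
    results: Dict[str, Dict[str, float]] = {}
    for instance, fetch in targets.items():
        try:
            samples = parse_samples(fetch())
            samples["up"] = 1.0  # scrape succeeded
        except Exception:
            samples = {"up": 0.0}  # scrape failed: the target counts as down
        results[instance] = samples
    return results


def healthy_target() -> str:
    return "# TYPE http_requests_total counter\nhttp_requests_total 42\n"


def unreachable_target() -> str:
    raise ConnectionError("connection refused")


result = scrape({"pod-a:8080": healthy_target, "pod-b:8080": unreachable_target})
```

Here `result["pod-b:8080"]` contains only the "up" sample with value 0, which is the signal an availability alert would key on.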

Prometheus's data model, in which each time series is identified by a metric name and a set of key-value label pairs, provides APAC platform teams with multi-dimensional metric data that supports the aggregation patterns required for Kubernetes monitoring. A single HTTP request counter can be labelled with namespace, service, method, and status code dimensions, enabling PromQL queries that aggregate by any combination of those dimensions (total requests by service, error rates by namespace) from the same metric. Latency percentiles such as p99 by method follow the same labelling pattern but are computed from a histogram metric rather than from a request counter.
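As a rough illustration of label-based aggregation, the Python sketch below models each series as a set of label pairs and sums values by a chosen subset of labels, loosely analogous to a PromQL `sum by (...)` aggregation. The sample values, label values, and the helper names `series`, `sum_by`, and `error_ratio_by` are all made up for the example, and it works on raw counter values where a real query would normally aggregate a rate.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Tuple

Labels = FrozenSet[Tuple[str, str]]


def series(**labels: str) -> Labels:
    """Build a hashable label set identifying one time series."""
    return frozenset(labels.items())


# Hypothetical current values of one request counter, one entry per label set.
samples: Dict[Labels, float] = {
    series(namespace="shop", service="cart", method="GET", status="200"): 120.0,
    series(namespace="shop", service="cart", method="POST", status="500"): 5.0,
    series(namespace="shop", service="checkout", method="POST", status="200"): 40.0,
    series(namespace="ops", service="auth", method="GET", status="500"): 2.0,
}


def sum_by(data: Dict[Labels, float], *by: str) -> Dict[Tuple[str, ...], float]:
    """Sum values, keeping only the labels named in `by`."""
    totals: Dict[Tuple[str, ...], float] = defaultdict(float)
    for labels, value in data.items():
        label_map = dict(labels)
        totals[tuple(label_map[name] for name in by)] += value
    return dict(totals)


def error_ratio_by(data: Dict[Labels, float], *by: str) -> Dict[Tuple[str, ...], float]:
    """Share of requests with a 5xx status, grouped by the labels in `by`."""
    errors = {k: v for k, v in data.items() if dict(k)["status"].startswith("5")}
    total = sum_by(data, *by)
    failed = sum_by(errors, *by)
    return {key: failed.get(key, 0.0) / total[key] for key in total}


requests_by_service = sum_by(samples, "service")
errors_by_namespace = error_ratio_by(samples, "namespace")
```

The same four series answer both questions; only the set of labels kept in the grouping key changes.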

PromQL (Prometheus Query Language), Prometheus's functional query language, gives APAC SRE teams the expressive power to calculate derived metrics, apply time-based aggregation, and express alerting conditions precisely. Writing alert rules that express SLO burn rate conditions (for example, alert when the error rate over the last 5 minutes is consuming the error budget fast enough to exhaust it within 2 hours) takes some investment in learning PromQL, but it enables precise, actionable alerting that can produce fewer false positives than simple threshold-based alerting.
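The arithmetic behind a burn rate condition can be worked through in Python. This is a sketch under stated assumptions, not a PromQL rule: the 99.9% SLO target and the 30-day (720-hour) SLO period are example values that do not come from the text, the function names are hypothetical, and the calculation assumes none of the error budget has been spent yet.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_ratio / error_budget


def hours_to_exhaustion(error_ratio: float, slo_target: float,
                        slo_period_hours: float) -> float:
    """Hours until a full error budget is gone at the current error ratio."""
    rate = burn_rate(error_ratio, slo_target)
    if rate == 0:
        return float("inf")
    return slo_period_hours / rate


def alert_threshold(slo_target: float, slo_period_hours: float,
                    exhaust_within_hours: float) -> float:
    """Error ratio above which the budget would be exhausted within the limit."""
    return (1.0 - slo_target) * (slo_period_hours / exhaust_within_hours)


def should_alert(error_ratio: float, slo_target: float,
                 slo_period_hours: float, exhaust_within_hours: float) -> bool:
    return error_ratio >= alert_threshold(
        slo_target, slo_period_hours, exhaust_within_hours)


# Assumed example: 99.9% SLO over 30 days, alert if the budget would last < 2 hours.
SLO_TARGET = 0.999
SLO_PERIOD_HOURS = 30 * 24

threshold = alert_threshold(SLO_TARGET, SLO_PERIOD_HOURS, 2.0)  # roughly 0.36
```

Under these assumptions, exhausting a 30-day budget in 2 hours corresponds to a burn rate of 720 / 2 = 360, so the alert would fire only when roughly 36% of requests in the measured window are failing. A longer exhaustion horizon gives a proportionally lower threshold.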

Prometheus's operational characteristics need consideration at APAC enterprise scale. The single-node deployment model that works well for smaller APAC Kubernetes clusters runs into storage and cardinality limits at large cluster sizes and high series counts. APAC platform teams operating large Prometheus deployments typically adopt Thanos, Cortex, or Grafana Mimir (which began as a fork of Cortex) to provide long-term storage, a global query view, and high availability across multiple Prometheus instances.
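To see why cardinality becomes a limit, it helps to estimate how many series one metric can produce. The Python sketch below computes an upper bound as the product of the number of distinct values per label. The label names and counts are invented for illustration, and real series counts are usually lower because not every label combination occurs.

```python
from math import prod
from typing import Dict


def max_series(label_value_counts: Dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())


# Hypothetical distinct-value counts for one request counter.
http_requests = {"namespace": 20, "service": 50, "method": 5, "status": 10}

baseline = max_series(http_requests)                  # 20 * 50 * 5 * 10
with_pod = max_series({**http_requests, "pod": 200})  # one extra label multiplies the bound
```

In this example the bound is 50,000 series, and adding a single per-pod label with 200 values raises it to 10,000,000, which is the kind of growth that pushes a deployment past what one Prometheus node handles comfortably.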
