The APAC Observability Cost Problem
APAC platform engineering teams running Kubernetes at scale share a common operational tension: observability data volume grows much faster than application traffic. An APAC Kubernetes cluster handling 1,000 requests per second generates:
- Metrics: 50,000+ time series from Prometheus scraped every 15 seconds — 200 MB/hour, 4.8 GB/day
- Logs: 10,000 log lines per second across 50 APAC services — 360 GB/day raw, before compression
- Traces: 10 million spans per day at 10% sampling rate — 50 GB/day in Jaeger format
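These figures can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes an average log line of roughly 420 bytes — an illustrative assumption chosen to match the stated ~360 GB/day, not a measured value:

```python
# Back-of-envelope observability volume estimates for the cluster above.
# avg_line_bytes (~420) is an assumption, not a measurement.

def daily_log_volume_gb(lines_per_second: float, avg_line_bytes: float) -> float:
    """Raw (pre-compression) log volume per day, in GB."""
    return lines_per_second * avg_line_bytes * 86_400 / 1e9

def daily_metric_volume_gb(mb_per_hour: float) -> float:
    """Metric sample volume per day, in GB, from an hourly MB rate."""
    return mb_per_hour * 24 / 1_000

print(round(daily_log_volume_gb(10_000, 420)))  # ~363 GB/day of raw logs
print(daily_metric_volume_gb(200))              # 4.8 GB/day of metrics
```

At ~420 bytes per line, 10,000 lines/second works out to roughly the 360 GB/day quoted above; the 200 MB/hour metric rate similarly yields 4.8 GB/day.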
Traditional observability stacks — Elasticsearch for logs, Jaeger backed by Elasticsearch for traces, Prometheus with Thanos for long-term metrics — require dedicated infrastructure that many APAC mid-market teams cannot justify at this scale.
The Grafana observability stack — Loki for logs, Tempo for traces, VictoriaMetrics for long-term metrics — addresses the APAC cost problem through a shared principle: store the data payload in cheap object storage (S3, GCS, Azure Blob at $0.02-0.03/GB), index only the metadata needed for query routing, and query through Grafana using familiar Prometheus-derived syntax.
For APAC platform teams already running Prometheus and Grafana — which covers the majority of Kubernetes-native APAC engineering organizations — adding Loki, Tempo, and VictoriaMetrics completes the observability triad without adopting an entirely different toolchain.
Grafana Loki: APAC Log Aggregation Without Elasticsearch
The Elasticsearch log storage economics problem
Elasticsearch's full-text indexing model — where all log content is tokenized and indexed for fast arbitrary string search — creates storage costs that scale with log text volume, not just log count. An APAC Kubernetes cluster with 50 services generating 500 MB of raw logs per hour requires approximately 1.5-2 GB of Elasticsearch index storage per hour (the inverted index for full-text search adds 3-4x storage overhead on raw text volume).
At 12 months of APAC log retention, this becomes 13-17 TB of Elasticsearch storage — requiring SSD-backed nodes for acceptable query performance, totaling $800-1,500/month in dedicated APAC storage infrastructure before compute and memory costs.
Loki's label-only indexing model stores the same APAC log streams in compressed S3 at approximately 20-50x less storage than Elasticsearch — the same 12-month APAC retention costs $30-80/month in S3 storage.
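A rough cost model makes the gap concrete. The per-GB prices below are assumptions for illustration (~$0.08/GB-month for SSD-backed Elasticsearch nodes, ~$0.023/GB-month for S3 standard), and the 30x reduction is the midpoint of the 20-50x range cited above:

```python
# Rough monthly storage cost comparison for 12 months of log retention.
# Prices are illustrative assumptions, not quotes: ~$0.08/GB-month for
# SSD-backed Elasticsearch storage, ~$0.023/GB-month for S3 standard.

def monthly_storage_cost(total_gb: float, price_per_gb_month: float) -> float:
    return total_gb * price_per_gb_month

es_gb = 15_000        # ~15 TB of Elasticsearch index (midpoint of 13-17 TB)
loki_gb = es_gb / 30  # assume ~30x reduction (midpoint of 20-50x)

es_cost = monthly_storage_cost(es_gb, 0.08)
loki_cost = monthly_storage_cost(loki_gb, 0.023)
print(f"Elasticsearch: ${es_cost:,.0f}/month, Loki on S3: ${loki_cost:,.0f}/month")
```

Even before compute and memory savings, storage alone drops from four figures to low double digits per month under these assumptions.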
Loki deployment for APAC Kubernetes
# Add Grafana Helm repo
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Deploy Loki stack (Loki + Promtail) for APAC cluster
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=10Gi \
  --values apac-loki-values.yaml
# apac-loki-values.yaml:
# loki:
#   config:
#     storage_config:
#       aws:
#         s3: s3://apac-loki-logs/   # APAC S3 bucket for log storage
#         region: ap-southeast-1     # Singapore
#     compactor:
#       retention_enabled: true
#       retention_period: 2160h      # 90-day log retention
# promtail:
#   config:
#     clients:
#       - url: http://loki-gateway.monitoring.svc.cluster.local/loki/api/v1/push
Promtail automatically discovers all APAC pods running on each Kubernetes node and ships their stdout/stderr logs to Loki with Kubernetes labels attached — no application code changes required for APAC log collection.
LogQL for APAC service debugging
# Per-second rate of APAC payment errors over the last minute, by pod
sum(rate({namespace="apac-payments", container="api"} |= "ERROR" [1m])) by (pod)
# APAC slow query log filter with JSON parsing
{namespace="apac-database"} | json | query_duration_ms > 1000
# APAC trace correlation: find logs for a specific trace ID
{namespace="apac-payments"} |= "trace_id=abc123def456"
# APAC log volume by service (for capacity planning)
sum by (app) (bytes_over_time({namespace="apac-production"}[24h]))
APAC platform teams new to LogQL find the label selector syntax immediately familiar from Prometheus ({label="value"}); the line-filter operations (|= "string", |~ "regex") are the main extension.
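These LogQL queries can also be run programmatically through Loki's HTTP query_range API. A minimal sketch — the gateway hostname matches the in-cluster Promtail client URL used in the Helm values above, and is an assumption about your deployment:

```python
from urllib.parse import urlencode

# Base URL assumes the in-cluster loki-gateway service from the Helm
# deployment above; adjust for your environment.
LOKI_BASE = "http://loki-gateway.monitoring.svc.cluster.local"

def loki_query_range_url(logql: str, start_ns: int, end_ns: int,
                         limit: int = 100) -> str:
    """Build a URL for Loki's /loki/api/v1/query_range endpoint."""
    params = urlencode({"query": logql, "start": start_ns,
                        "end": end_ns, "limit": limit})
    return f"{LOKI_BASE}/loki/api/v1/query_range?{params}"

# Fetch recent payment errors (timestamps are nanosecond Unix epochs)
url = loki_query_range_url('{namespace="apac-payments"} |= "ERROR"',
                           start_ns=0, end_ns=1, limit=50)
print(url)
```

The same endpoint backs Grafana's Explore view, so any query that works in a dashboard can be scripted this way for alert enrichment or batch analysis.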
Grafana Tempo: APAC Distributed Tracing at Object Storage Costs
Distributed tracing for APAC microservices
As APAC microservice architectures grow beyond 5-10 services, debugging latency issues becomes increasingly difficult with logs alone. A Singapore user seeing a 3-second API response on an e-commerce checkout may be hitting latency in the inventory service, the payment processor, the notification service, or the network hops between them — and no individual service log identifies the causative layer.
Distributed tracing — where a unique trace ID propagates through all APAC service calls, and each service records its span (start time, duration, status, attributes) against that trace ID — provides the APAC request execution timeline that identifies where latency originated.
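Over HTTP, this propagation is carried by the W3C Trace Context `traceparent` header, which OpenTelemetry SDKs inject and extract automatically. A sketch of the header format (version-trace_id-parent_id-flags), shown only to illustrate what actually travels between services:

```python
import re

# W3C Trace Context "traceparent" header: version-trace_id-parent_id-flags.
# OpenTelemetry SDKs handle this automatically; this parser is only
# illustrative of the on-the-wire format.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(header: str):
    """Return (trace_id, parent_span_id, sampled) or None if malformed."""
    m = TRACEPARENT_RE.match(header)
    if not m:
        return None
    trace_id, span_id, flags = m.groups()
    return trace_id, span_id, int(flags, 16) & 0x01 == 0x01

hdr = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(parse_traceparent(hdr))
# → ('4bf92f3577b34da6a3ce929d0e0e4736', '00f067aa0ba902b7', True)
```

The trailing flags byte carries the sampling decision, which is how a 10% sampling rate stays consistent across every service a single request touches.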
Tempo stores these APAC traces in object storage for months of history, enabling APAC platform teams to compare current APAC service behavior with historical baselines.
OpenTelemetry instrumentation for APAC services
# APAC Python service: OpenTelemetry instrumentation with Tempo
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure Tempo exporter (in-cluster OTLP gRPC endpoint)
apac_exporter = OTLPSpanExporter(
    endpoint="http://tempo.monitoring.svc.cluster.local:4317",
    insecure=True
)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(apac_exporter))
trace.set_tracer_provider(provider)
apac_tracer = trace.get_tracer("apac-payments-service")

# Instrument APAC payment processing
with apac_tracer.start_as_current_span("process-apac-payment") as span:
    span.set_attribute("payment.amount", 150.00)
    span.set_attribute("payment.currency", "SGD")
    span.set_attribute("customer.region", "SG")
    # This span's trace ID propagates to all downstream service calls
    # (call_apac_fraud_check / call_apac_payment_gateway are placeholders)
    result = call_apac_fraud_check(payment_data)
    result = call_apac_payment_gateway(payment_data)
# Tempo Kubernetes deployment for APAC cluster
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tempo
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: tempo
  template:
    metadata:
      labels:
        app: tempo
    spec:
      containers:
        - name: tempo
          image: grafana/tempo:latest
          args: ["-config.file=/etc/tempo.yaml"]
          volumeMounts:
            - name: tempo-config
              mountPath: /etc/tempo.yaml
              subPath: tempo.yaml
      volumes:
        - name: tempo-config
          configMap:
            name: tempo-config  # ConfigMap containing the tempo.yaml below
# tempo.yaml: APAC Tempo configuration
server:
  http_listen_port: 3200
distributor:
  receivers:
    otlp:
      protocols:
        grpc:    # APAC services send OTLP gRPC to port 4317
        http:    # APAC services send OTLP HTTP to port 4318
    jaeger:
      protocols:
        grpc:    # legacy Jaeger clients still supported
    zipkin:      # legacy Zipkin clients
storage:
  trace:
    backend: s3
    s3:
      bucket: apac-tempo-traces
      endpoint: s3.ap-southeast-1.amazonaws.com
      region: ap-southeast-1
Grafana trace-to-log-to-metric correlation for APAC incidents
The most valuable Grafana observability integration for APAC incident response is the tri-correlation: from an APAC metric alert, navigate to the trace causing the metric anomaly, then navigate to the logs generated during that trace:
APAC Incident Response Flow:
1. Grafana alert: APAC payment API p99 latency > 2s
   (VictoriaMetrics metric: http_request_duration_seconds{service="apac-payments"})
2. Grafana Explore: click the trace exemplar on the metric spike
   → opens the Tempo trace showing 2.3s in the apac-fraud-check span
3. Tempo trace detail: click "Logs for this span"
   → Loki shows: "APAC external fraud API timeout after 2000ms - retrying"
Resolution: external fraud service SLA breach identified in 90 seconds,
not after 30 minutes of log grep across services.
VictoriaMetrics: APAC Long-Term Metrics Storage and Federation
Why Prometheus alone isn't enough for APAC multi-cluster environments
Prometheus is excellent for APAC real-time metrics scraping and alerting within a single Kubernetes cluster. Its limitations appear when APAC platform teams need:
- Long-term retention: Prometheus defaults to 15 days of retention, while APAC capacity planning requires 3-12 months of metric history. Extending Prometheus local retention to 6 months requires significant SSD storage per cluster.
- Multi-cluster aggregation: APAC enterprises operating 5 EKS clusters (Singapore, Tokyo, Seoul, Sydney, Jakarta) need a federated view of cross-cluster metrics for capacity planning — Prometheus federation is architecturally complex and produces high-cardinality federation queries.
- Cardinality at APAC scale: Kubernetes clusters with 500+ pods, 100+ services, and dynamic label cardinality (pod IDs, deployment hashes) can exceed Prometheus's practical cardinality limits, causing high memory pressure and slow queries.
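To see why cardinality explodes, note that active series scale roughly as the product of targets, metrics per target, and label combinations per metric. A back-of-envelope sketch with assumed counts (illustrative, not measured):

```python
# Rough active-series estimate: series ≈ pods × metrics-per-pod ×
# average label combinations per metric. All counts are assumptions
# chosen to illustrate the scale, not measurements.

def estimated_series(pods: int, metrics_per_pod: int, label_combos: int) -> int:
    return pods * metrics_per_pod * label_combos

# 500 pods, each exposing ~100 metrics with ~3 label combinations each
print(estimated_series(500, 100, 3))  # 150,000 active series
```

Pod churn makes this worse: every deployment rollout replaces pod-ID labels, so each rollout adds a fresh batch of series on top of the steady-state count.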
VictoriaMetrics addresses all three limitations while remaining fully compatible with existing APAC Prometheus infrastructure.
VictoriaMetrics for APAC Kubernetes
# Deploy VictoriaMetrics single-node for APAC long-term storage
helm repo add vm https://victoriametrics.github.io/helm-charts/
helm repo update

# retentionPeriod=12 → 12-month metric retention
helm install victoria-metrics vm/victoria-metrics-single \
  --namespace monitoring \
  --set server.retentionPeriod=12 \
  --set server.persistentVolume.size=100Gi \
  --values apac-vm-values.yaml
# Prometheus remote_write to VictoriaMetrics (keep Prometheus for alerting)
# prometheus.yml addition:
remote_write:
  - url: http://victoria-metrics.monitoring.svc.cluster.local:8428/api/v1/write
    queue_config:
      max_samples_per_send: 10000
      max_shards: 30
      capacity: 100000
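The queue_config values above set a rough ceiling on remote_write throughput. A back-of-envelope sketch — the 1-second round-trip time is an assumption, and real throughput also depends on batch_send_deadline, retries, and backpressure:

```python
# Rough remote_write throughput ceiling for the queue_config above.
# Assumes each shard completes one send per round trip; rtt_seconds
# is an assumed value, and real-world throughput will be lower once
# retries and batch_send_deadline come into play.

def max_samples_per_second(max_shards: int, max_samples_per_send: int,
                           rtt_seconds: float) -> float:
    return max_shards * max_samples_per_send / rtt_seconds

# 30 shards × 10,000 samples per send, assuming ~1s per round trip
print(max_samples_per_second(30, 10_000, 1.0))  # 300,000 samples/s ceiling
```

For a cluster producing 50,000+ series scraped every 15 seconds (a few thousand samples per second), this configuration leaves ample headroom.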
APAC multi-cluster metric federation with VictoriaMetrics Cluster
VictoriaMetrics APAC Cluster Architecture:
APAC Regional Clusters (Prometheus scrapes locally):
├── EKS ap-southeast-1 (Singapore)
│   └── Prometheus → remote_write → vminsert.apac-vm:8480
├── EKS ap-northeast-1 (Tokyo)
│   └── Prometheus → remote_write → vminsert.apac-vm:8480
└── EKS ap-northeast-2 (Seoul)
    └── Prometheus → remote_write → vminsert.apac-vm:8480
VictoriaMetrics Cluster (central APAC metrics store):
├── vminsert (receives remote_write from all APAC Prometheus)
├── vmstorage (stores APAC metrics with 12-month retention)
└── vmselect (serves MetricsQL queries to APAC Grafana)
Grafana (APAC unified metrics dashboard):
└── VictoriaMetrics data source → cross-cluster APAC queries
"Show p99 latency for apac-payments service across all regions"
APAC platform teams configure Grafana dashboards that query VictoriaMetrics for cross-cluster metrics using cluster label filters — letting the infrastructure lead view Singapore, Tokyo, and Seoul cluster metrics side by side without maintaining Prometheus federation or a Thanos query layer.
MetricsQL advantages for APAC monitoring
VictoriaMetrics supports PromQL plus MetricsQL extensions useful for APAC monitoring:
# Median latency by APAC region (quantile aggregation)
quantile(0.5,
  rate(http_request_duration_seconds_sum{region=~"apac-.*"}[5m])
    /
  rate(http_request_duration_seconds_count{region=~"apac-.*"}[5m])
) by (region, service)

# APAC error rate with gap filling (MetricsQL keep_last_value function)
keep_last_value(
  100 * (
    rate(http_requests_total{status=~"5..", namespace="apac-payments"}[5m])
      /
    rate(http_requests_total{namespace="apac-payments"}[5m])
  )
)  # MetricsQL: fill gaps in the series with the last non-empty value
The APAC Grafana Observability Stack
The three tools complete the Grafana observability stack that many APAC platform teams are partially running:
APAC Observability Stack (all queryable from single Grafana):
Metrics (existing + enhanced):
├── Prometheus — APAC scraping and alerting (keep existing)
└── VictoriaMetrics — long-term APAC metric storage + multi-cluster federation
Logs (new, replaces APAC log grepping):
└── Loki — APAC log aggregation via Promtail DaemonSet
Traces (new, enables APAC distributed tracing):
└── Tempo — APAC trace storage via OpenTelemetry
Dashboards (existing):
└── Grafana — unified APAC metrics + logs + traces in one interface
- Correlate: metric spike → trace exemplar → log line
APAC platform teams deploying this stack report that the primary operational benefit is time-to-resolution during incidents: navigating from a metric alert to the causative trace to the specific log line in 2-3 clicks replaces the 15-30 minute manual investigation cycle that grep-based log debugging requires.
Related APAC Platform Engineering Resources
For the Kubernetes platform that runs these APAC observability components, see the APAC Kubernetes platform engineering guide covering vCluster, External Secrets, and ExternalDNS.
For the DevSecOps controls governing APAC observability component images, see the APAC Kubernetes DevSecOps guide covering Kyverno, Cosign, and Kubescape.
For AIOps tools that process APAC observability data for anomaly detection and incident correlation, see the APAC AIOps guide covering Dynatrace, PagerDuty, and Datadog.