The Data Lineage Gap in APAC Data Engineering
APAC data engineering teams operating large-scale pipelines without lineage tracking face a recurring crisis when data quality incidents occur: the analytics team reports that yesterday's APAC revenue figure is wrong, but no one knows which of the 47 pipeline jobs that ran overnight is responsible. The data engineer manually traces backwards through pipeline dependencies — checking each job's source tables, joining against data quality logs, reading pipeline code to understand transformations — a process that takes hours and requires deep institutional knowledge that exists only in senior engineers' heads.
Data lineage solves this by recording, for every pipeline run, exactly which APAC source datasets were consumed, which APAC output datasets were produced, what transformations were applied, and (at column-level) which specific APAC columns derive from which source columns. With lineage in place, a data quality incident becomes a graph traversal: start from the incorrect APAC output column, traverse upstream edges to find which APAC source changed, and identify which APAC pipeline job introduced the fault.
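The graph traversal described above can be sketched with a plain adjacency map. This is a minimal illustration, not a real backend: dataset and job names are hypothetical, and a production system would query a lineage service rather than a hard-coded dict.

```python
from collections import deque

# Lineage as a reverse-adjacency map: each output dataset maps to the
# job that produced it and the datasets that job consumed.
# All names below are illustrative.
LINEAGE = {
    "apac_mart.fct_revenue": {"job": "apac_fct_revenue_build",
                              "inputs": ["apac_staging.stg_payments"]},
    "apac_staging.stg_payments": {"job": "apac_transform_payments",
                                  "inputs": ["apac_raw.payment_transactions"]},
}

def upstream_jobs(dataset: str) -> list[str]:
    """Walk upstream edges from a bad output to every producing job."""
    jobs, queue, seen = [], deque([dataset]), set()
    while queue:
        node = queue.popleft()
        if node in seen or node not in LINEAGE:
            continue  # raw sources have no producing job in the map
        seen.add(node)
        entry = LINEAGE[node]
        jobs.append(entry["job"])
        queue.extend(entry["inputs"])
    return jobs

print(upstream_jobs("apac_mart.fct_revenue"))
# ['apac_fct_revenue_build', 'apac_transform_payments']
```

The hours-long manual trace collapses to a breadth-first walk: every job on the returned list is a root-cause candidate, in order of proximity to the broken output.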
Three tools cover the APAC data lineage implementation spectrum:
OpenLineage — open standard for data lineage metadata, with integrations for Spark, Airflow, dbt, and Flink that emit structured lineage events to any compatible backend.
Marquez — open-source lineage metadata service and OpenLineage reference backend for storing, querying, and visualizing APAC pipeline lineage.
Spline — Apache Spark-specific lineage tracking with column-level capture from Spark execution plans without application code changes.
APAC Data Lineage Fundamentals
Job-level vs column-level lineage
APAC Job-level lineage (OpenLineage / Marquez):
Job: apac_stg_payments_transform
Inputs: [apac_raw.payment_transactions, apac_raw.currency_rates]
Outputs: [apac_staging.stg_payments]
← Tells you which APAC datasets are involved, but not which columns
APAC Column-level lineage (Spline for Spark):
apac_staging.stg_payments.apac_payment_amount_usd
← apac_raw.payment_transactions.amount_cents / 100
← apac_raw.currency_rates.usd_rate
← Tells you exactly which APAC source columns feed each output
APAC lineage use cases
APAC Impact analysis (change upstream → what breaks?):
"If APAC DBA renames raw_payments.amount_cents"
→ Lineage graph shows: 3 APAC Spark jobs, 7 APAC dbt models, 2 APAC BI reports affected
APAC Root cause analysis (broken output → what caused it?):
"APAC fct_revenue has wrong data for 2026-04-23"
→ Traverse APAC lineage backwards → find stg_payments loaded 0 rows
→ Cause: APAC Kafka lag; raw_payments not fully synced before job ran
APAC Compliance / data sovereignty:
"Which APAC reports consume personal data from apac_raw.customer_pii?"
→ Forward lineage traversal shows all APAC downstream consumers
APAC Schema evolution safety:
"Is APAC column apac_user_id used anywhere downstream?"
→ Lineage confirms before safe APAC column removal
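Mechanically, the impact-analysis and schema-evolution checks above are a forward traversal over column-level edges. A minimal sketch, with hypothetical column names:

```python
# Column-level edges: each source column maps to the output columns
# derived from it. Names are illustrative.
COLUMN_EDGES = {
    "apac_raw.payment_transactions.amount_cents": [
        "apac_staging.stg_payments.apac_payment_amount_usd",
    ],
    "apac_staging.stg_payments.apac_payment_amount_usd": [
        "apac_mart.fct_revenue.apac_revenue_usd",
    ],
}

def impacted(column: str, edges=COLUMN_EDGES) -> set[str]:
    """Everything downstream of `column` -- what breaks if it changes.
    An empty result means the column can be removed safely."""
    out: set[str] = set()
    stack = [column]
    while stack:
        for child in edges.get(stack.pop(), []):
            if child not in out:
                out.add(child)
                stack.append(child)
    return out
```

The same function answers both questions: a rename's blast radius is the full result set, and a column is safe to drop only when `impacted()` returns empty.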
OpenLineage: APAC Pipeline Instrumentation Standard
OpenLineage Airflow integration — APAC automatic emission
# airflow/dags/apac_payments_pipeline.py
# OpenLineage instruments Airflow automatically via the OpenLineage provider
# No code changes needed — install the provider and configure the backend
# pip install apache-airflow-providers-openlineage
# APAC Airflow environment variables:
# OPENLINEAGE_URL=http://marquez.apac-internal:5000
# OPENLINEAGE_NAMESPACE=apac-payments-pipeline
from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator
from datetime import datetime
with DAG(
    dag_id="apac_payments_etl",
    start_date=datetime(2026, 1, 1),
    schedule="@daily",
) as dag:
    # OpenLineage automatically captures:
    #   Input:  apac_raw.payment_transactions (with schema + row count)
    #   Output: apac_staging.stg_payments (with schema + row count)
    # No additional code required
    apac_transform = PostgresOperator(
        task_id="apac_transform_payments",
        postgres_conn_id="apac_warehouse",
        sql="""
            INSERT INTO apac_staging.stg_payments
            SELECT
                transaction_id AS apac_payment_id,
                CAST(amount_cents AS NUMERIC) / 100 AS apac_payment_amount_usd,
                CONVERT_TIMEZONE('UTC', created_at) AS apac_created_at_utc
            FROM apac_raw.payment_transactions
            WHERE DATE(created_at) = '{{ ds }}'
        """,
    )
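For jobs that run outside an instrumented framework, lineage events can be emitted by hand. The sketch below assembles a simplified OpenLineage RunEvent and posts it to Marquez's ingestion endpoint (`/api/v1/lineage`). It is a stdlib-only illustration with assumed namespace and producer values; in production, the official `openlineage-python` client is the better path, and the full spec requires additional fields such as `schemaURL`.

```python
import json
import uuid
from datetime import datetime, timezone
from urllib import request

OPENLINEAGE_URL = "http://marquez.apac-internal:5000"  # assumed endpoint

def build_event(state: str, job: str,
                inputs: list[str], outputs: list[str]) -> dict:
    """Assemble a minimal, simplified OpenLineage run event payload."""
    return {
        "eventType": state,  # START / COMPLETE / FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "apac-payments-pipeline", "name": job},
        "inputs": [{"namespace": "apac-warehouse", "name": n} for n in inputs],
        "outputs": [{"namespace": "apac-warehouse", "name": n} for n in outputs],
        # Producer URI is illustrative -- identify your emitting tool here.
        "producer": "https://example.com/apac-custom-loader",
    }

def emit(event: dict) -> None:
    """POST the event to Marquez's OpenLineage ingestion endpoint."""
    req = request.Request(
        f"{OPENLINEAGE_URL}/api/v1/lineage",
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
    )
    request.urlopen(req)

event = build_event("COMPLETE", "apac_custom_loader",
                    ["apac_raw.currency_rates"], ["apac_staging.stg_rates"])
```

A custom loader would call `emit(event)` once per run; Marquez then stitches the hand-emitted job into the same graph as the Airflow-emitted ones.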
OpenLineage event structure — APAC lineage payload
{
  "eventType": "COMPLETE",
  "eventTime": "2026-04-24T08:00:00Z",
  "run": {
    "runId": "apac-run-7f3a9b2c",
    "facets": {
      "apac_environment": {"region": "ap-southeast-1", "cluster": "apac-prod"}
    }
  },
  "job": {
    "namespace": "apac-payments-pipeline",
    "name": "apac_transform_payments"
  },
  "inputs": [{
    "namespace": "apac-warehouse",
    "name": "apac_raw.payment_transactions",
    "facets": {
      "schema": {
        "fields": [
          {"name": "transaction_id", "type": "VARCHAR"},
          {"name": "amount_cents", "type": "INTEGER"},
          {"name": "created_at", "type": "TIMESTAMP"}
        ]
      },
      "dataQualityMetrics": {
        "rowCount": 847293,
        "nullCount": {"transaction_id": 0}
      }
    }
  }],
  "outputs": [{
    "namespace": "apac-warehouse",
    "name": "apac_staging.stg_payments",
    "facets": {
      "schema": {
        "fields": [
          {"name": "apac_payment_id", "type": "VARCHAR"},
          {"name": "apac_payment_amount_usd", "type": "NUMERIC"},
          {"name": "apac_created_at_utc", "type": "TIMESTAMP"}
        ]
      }
    }
  }]
}
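Because facets travel with every event, downstream consumers can act on them directly. The sketch below flags suspiciously low row counts from a `dataQualityMetrics` facet, catching the "stg_payments loaded 0 rows" class of incident from the use cases above; the event shape is assumed to follow the example payload.

```python
def row_count_alerts(event: dict, min_rows: int = 1) -> list[str]:
    """Flag input datasets whose dataQualityMetrics facet reports
    fewer rows than expected."""
    alerts = []
    for ds in event.get("inputs", []):
        metrics = ds.get("facets", {}).get("dataQualityMetrics", {})
        count = metrics.get("rowCount")
        if count is not None and count < min_rows:
            alerts.append(f"{ds['name']}: rowCount={count}")
    return alerts

# Illustrative event with an empty input -- the shape mirrors the payload above.
bad_event = {"inputs": [{"name": "apac_raw.payment_transactions",
                         "facets": {"dataQualityMetrics": {"rowCount": 0}}}]}
print(row_count_alerts(bad_event))
# ['apac_raw.payment_transactions: rowCount=0']
```

Wired into an event consumer, a non-empty alert list can fail the run before bad data propagates downstream.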
Marquez: APAC Lineage Backend and Visualization
Marquez deployment — APAC Docker Compose setup
# docker-compose.apac-marquez.yml
version: "3"
services:
  marquez-api:
    image: marquezproject/marquez:latest
    ports:
      - "5000:5000"   # APAC OpenLineage API endpoint
      - "5001:5001"   # APAC Marquez admin API
    environment:
      MARQUEZ_CONFIG: /usr/src/app/marquez.yml
    volumes:
      - ./marquez.yml:/usr/src/app/marquez.yml
  marquez-web:
    image: marquezproject/marquez-web:latest
    ports:
      - "3000:3000"   # APAC Marquez lineage UI
    environment:
      MARQUEZ_HOST: marquez-api
      MARQUEZ_PORT: 5000
  postgresql:
    image: postgres:16
    environment:
      POSTGRES_USER: marquez
      POSTGRES_PASSWORD: apac_marquez_secret
      POSTGRES_DB: marquez_db
Marquez API — APAC lineage graph query
# APAC: Query Marquez lineage API for upstream dependencies
# Which APAC jobs write to apac_staging.stg_payments?
curl "http://marquez.apac-internal:5000/api/v1/lineage?nodeId=dataset:apac-warehouse:apac_staging.stg_payments&depth=3" \
| jq '.graph.nodes[] | select(.type == "JOB") | .data.name'
# Output:
# "apac_transform_payments"
# "apac_backfill_payments_2025"
# APAC: What datasets does apac_transform_payments consume?
curl "http://marquez.apac-internal:5000/api/v1/jobs/apac-payments-pipeline/apac_transform_payments" \
| jq '.latestRun.inputVersions[].name'
# Output:
# "apac_raw.payment_transactions"
# "apac_raw.currency_rates"
# APAC: Get run history (identify when APAC data went wrong)
curl "http://marquez.apac-internal:5000/api/v1/jobs/apac-payments-pipeline/apac_transform_payments/runs?limit=7" \
| jq '.runs[] | {state, startedAt, endedAt}'
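The same queries can be scripted instead of piped through jq. A sketch of the first filter in Python, operating on a response whose shape follows the example above; the exact Marquez response schema should be verified against the installed version's API docs.

```python
def job_nodes(graph: dict) -> list[str]:
    """Equivalent of the jq filter: pull JOB node names out of a
    lineage response (shape assumed from the example above)."""
    return [n["data"]["name"]
            for n in graph.get("graph", {}).get("nodes", [])
            if n.get("type") == "JOB"]

# Illustrative response fragment mirroring the curl output above.
sample = {"graph": {"nodes": [
    {"type": "JOB", "data": {"name": "apac_transform_payments"}},
    {"type": "DATASET", "data": {"name": "apac_staging.stg_payments"}},
]}}
print(job_nodes(sample))
# ['apac_transform_payments']
```

In a real script the `sample` dict would come from an HTTP GET against the lineage endpoint shown above.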
Spline: APAC Spark Column-Level Lineage
Spline — APAC Spark integration (zero code changes)
// APAC Spark job: no code changes needed for Spline
// 1. Add Spline agent JAR to Spark submit:
// spark-submit \
// --packages za.co.absa.spline.agent.spark:spark-3.3-spline-agent-bundle_2.12:1.0.0 \
// --conf spark.sql.queryExecutionListeners=za.co.absa.spline.harvester.SplineQueryExecutionListener \
// --conf spline.producer.url=http://spline-rest.apac-internal:8080/producer \
// apac_payments_transform.jar
// APAC Spark application code — unchanged:
import org.apache.spark.sql.SparkSession

object APACPaymentsTransform {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("APAC Payments Transform")
      .getOrCreate()

    // Spline automatically captures: input → transformation → output lineage
    val apacRawPayments = spark.read
      .format("delta")
      .load("s3://apac-data-lake/raw/payment_transactions/")

    val apacStagedPayments = apacRawPayments
      .filter("transaction_id IS NOT NULL")
      .selectExpr(
        "transaction_id AS apac_payment_id",
        "CAST(amount_cents AS DOUBLE) / 100.0 AS apac_payment_amount_usd",
        "to_utc_timestamp(created_at, 'Asia/Singapore') AS apac_created_at_utc"
      )

    // Spline captures column-level lineage at write:
    //   apac_payment_amount_usd ← amount_cents (CAST + divide by 100)
    apacStagedPayments.write
      .format("delta")
      .mode("overwrite")
      .save("s3://apac-data-lake/staging/stg_payments/")
  }
}
APAC Data Lineage Tool Selection
APAC Data Lineage Need → Tool → Why

APAC multi-framework lineage (Airflow + dbt + Spark + Flink)
→ OpenLineage
→ Standard across Spark + Airflow + dbt + Flink; single APAC backend

APAC lineage storage + visualization (self-hosted, lightweight)
→ Marquez
→ OpenLineage reference backend; graph UI; APAC Docker Compose deploy

APAC Spark-primary column lineage (Databricks/EMR teams)
→ Spline
→ Zero-code APAC Spark integration; column-level derivation tracking

APAC enterprise data catalog (governance + lineage combined)
→ DataHub / OpenMetadata
→ Full APAC governance: business glossary + APAC lineage + stewardship

APAC impact analysis at scale (many teams, many APAC pipelines)
→ OpenLineage + DataHub
→ Standard events feed the DataHub APAC lineage graph
Related APAC Data Engineering Resources
For the analytics engineering tools (dbt Core, SQLMesh, Coalesce) that emit OpenLineage events for SQL transformation lineage, see the APAC analytics engineering guide.
For the data catalog tools (DataHub, OpenMetadata, Atlas) that consume OpenLineage events from Marquez to provide a unified APAC governance view, see the APAC data governance guide covering Collibra, Alation, and Atlan.
For the data quality tools (Great Expectations, Soda) that emit quality facets through OpenLineage to enrich APAC lineage events with quality context, see the APAC data quality guide.