APAC Data Lineage Guide 2026: OpenLineage, Marquez, and Spline for Data Pipeline Transparency

A practitioner guide for APAC data engineering teams implementing data lineage tracking in 2026. It covers OpenLineage, the open standard for emitting structured lineage events from Spark, Airflow, dbt, and Flink pipelines, enriched with schema and data quality metadata facets; Marquez, the OpenLineage reference backend for storing job run history, tracking dataset versions, and visualizing interactive lineage graphs, deployable via Docker Compose; and Spline, which captures column-level Apache Spark lineage with zero application code changes, recording exactly which source columns contributed to each output column through which transformations, for APAC impact analysis and root cause investigation.

By AIMenta Editorial Team

The Data Lineage Gap in APAC Data Engineering

APAC data engineering teams operating large-scale pipelines without lineage tracking face a recurring crisis when data quality incidents occur: the analytics team reports that yesterday's APAC revenue figure is wrong, but no one knows which of the 47 pipeline jobs that ran overnight is responsible. The data engineer manually traces backwards through pipeline dependencies — checking each job's source tables, joining against data quality logs, reading pipeline code to understand transformations — a process that takes hours and requires deep institutional knowledge that exists only in senior engineers' heads.

Data lineage solves this by recording, for every pipeline run, exactly which APAC source datasets were consumed, which APAC output datasets were produced, what transformations were applied, and (at column-level) which specific APAC columns derive from which source columns. With lineage in place, a data quality incident becomes a graph traversal: start from the incorrect APAC output column, traverse upstream edges to find which APAC source changed, and identify which APAC pipeline job introduced the fault.
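That traversal can be sketched in a few lines of Python; the `upstream` edge map below is a hand-written stand-in for edges that would normally be assembled from lineage events:

```python
from collections import deque

# Hypothetical edge map: each dataset -> the datasets it is built from.
# In practice these edges come from lineage events, not a hard-coded dict.
upstream = {
    "apac_mart.fct_revenue": ["apac_staging.stg_payments"],
    "apac_staging.stg_payments": ["apac_raw.payment_transactions",
                                  "apac_raw.currency_rates"],
}

def trace_upstream(dataset: str) -> list[str]:
    """Breadth-first walk from a broken output back to its sources."""
    seen: set[str] = set()
    queue, order = deque([dataset]), []
    while queue:
        node = queue.popleft()
        for parent in upstream.get(node, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order

print(trace_upstream("apac_mart.fct_revenue"))
# ['apac_staging.stg_payments', 'apac_raw.payment_transactions', 'apac_raw.currency_rates']
```

The same walk that takes a senior engineer hours by hand is a constant-time-per-edge traversal once the edges exist.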

Three tools cover the APAC data lineage implementation spectrum:

OpenLineage — open standard for data lineage metadata, with integrations for Spark, Airflow, dbt, and Flink that emit structured lineage events to any compatible backend.

Marquez — open-source lineage metadata service and OpenLineage reference backend for storing, querying, and visualizing APAC pipeline lineage.

Spline — Apache Spark-specific lineage tracking with column-level capture from Spark execution plans without application code changes.


APAC Data Lineage Fundamentals

Job-level vs column-level lineage

APAC Job-level lineage (OpenLineage / Marquez):
  Job: apac_stg_payments_transform
  Inputs:  [apac_raw.payment_transactions, apac_raw.currency_rates]
  Outputs: [apac_staging.stg_payments]
  ← Tells you which datasets are involved, not which columns

APAC Column-level lineage (Spline for Spark):
  apac_staging.stg_payments.apac_payment_amount_usd
    ← apac_raw.payment_transactions.amount_cents / 100
    ← apac_raw.currency_rates.usd_rate
  ← Tells you exactly which source columns feed each output column

APAC lineage use cases

APAC Impact analysis (change upstream → what breaks?):
  "If APAC DBA renames raw_payments.amount_cents"
  → Lineage graph shows: 3 APAC Spark jobs, 7 APAC dbt models, 2 APAC BI reports affected

APAC Root cause analysis (broken output → what caused it?):
  "APAC fct_revenue has wrong data for 2026-04-23"
  → Traverse APAC lineage backwards → find stg_payments loaded 0 rows
  → Cause: APAC Kafka lag; raw_payments not fully synced before job ran

APAC Compliance / data sovereignty:
  "Which APAC reports consume personal data from apac_raw.customer_pii?"
  → Forward lineage traversal shows all APAC downstream consumers

APAC Schema evolution safety:
  "Is APAC column apac_user_id used anywhere downstream?"
  → Lineage confirms before safe APAC column removal
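The impact-analysis direction is the same walk over reversed edges; a minimal sketch with a hypothetical `downstream` edge map:

```python
def affected_downstream(dataset: str, downstream: dict[str, list[str]]) -> set[str]:
    """Return every dataset transitively derived from `dataset`."""
    affected: set[str] = set()
    stack = [dataset]
    while stack:
        node = stack.pop()
        for child in downstream.get(node, []):
            if child not in affected:
                affected.add(child)
                stack.append(child)
    return affected

# Hypothetical consumers of the raw payments table:
downstream = {
    "apac_raw.payment_transactions": ["apac_staging.stg_payments"],
    "apac_staging.stg_payments": ["apac_mart.fct_revenue", "apac_mart.fct_refunds"],
}
print(sorted(affected_downstream("apac_raw.payment_transactions", downstream)))
# ['apac_mart.fct_refunds', 'apac_mart.fct_revenue', 'apac_staging.stg_payments']
```

Backends like Marquez store the edges once and serve both directions, so impact and root cause queries are two views of the same graph.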

OpenLineage: APAC Pipeline Instrumentation Standard

OpenLineage Airflow integration — APAC automatic emission

# airflow/dags/apac_payments_pipeline.py
# OpenLineage instruments Airflow automatically via the OpenLineage provider
# No code changes needed — install the provider and configure the backend

# pip install apache-airflow-providers-openlineage

# APAC Airflow environment variables:
# OPENLINEAGE_URL=http://marquez.apac-internal:5000
# OPENLINEAGE_NAMESPACE=apac-payments-pipeline

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator
from datetime import datetime

with DAG(
    dag_id="apac_payments_etl",
    start_date=datetime(2026, 1, 1),
    schedule="@daily",
) as dag:

    # OpenLineage automatically captures:
    # Input: apac_raw.payment_transactions (with schema + row count)
    # Output: apac_staging.stg_payments (with schema + row count)
    # No additional code required
    apac_transform = PostgresOperator(
        task_id="apac_transform_payments",
        postgres_conn_id="apac_warehouse",
        sql="""
            INSERT INTO apac_staging.stg_payments
            SELECT
                transaction_id AS apac_payment_id,
                CAST(amount_cents AS NUMERIC) / 100 AS apac_payment_amount_usd,
                created_at AT TIME ZONE 'UTC' AS apac_created_at_utc  -- Postgres syntax (CONVERT_TIMEZONE is Redshift/Snowflake)
            FROM apac_raw.payment_transactions
            WHERE DATE(created_at) = '{{ ds }}'
        """,
    )

OpenLineage event structure — APAC lineage payload

{
  "eventType": "COMPLETE",
  "eventTime": "2026-04-24T08:00:00Z",
  "run": {
    "runId": "apac-run-7f3a9b2c",
    "facets": {
      "apac_environment": {"region": "ap-southeast-1", "cluster": "apac-prod"}
    }
  },
  "job": {
    "namespace": "apac-payments-pipeline",
    "name": "apac_transform_payments"
  },
  "inputs": [{
    "namespace": "apac-warehouse",
    "name": "apac_raw.payment_transactions",
    "facets": {
      "schema": {
        "fields": [
          {"name": "transaction_id", "type": "VARCHAR"},
          {"name": "amount_cents", "type": "INTEGER"},
          {"name": "created_at", "type": "TIMESTAMP"}
        ]
      },
      "dataQualityMetrics": {
        "rowCount": 847293,
        "nullCount": {"transaction_id": 0}
      }
    }
  }],
  "outputs": [{
    "namespace": "apac-warehouse",
    "name": "apac_staging.stg_payments",
    "facets": {
      "schema": {
        "fields": [
          {"name": "apac_payment_id", "type": "VARCHAR"},
          {"name": "apac_payment_amount_usd", "type": "NUMERIC"},
          {"name": "apac_created_at_utc", "type": "TIMESTAMP"}
        ]
      }
    }
  }]
}
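Events in this shape can be emitted from any language with an HTTP client; the OpenLineage project also ships client libraries. A standard-library-only Python sketch (the producer URI and backend host are placeholders; Marquez accepts events at `POST /api/v1/lineage`):

```python
import json
import urllib.request
import uuid
from datetime import datetime, timezone

def build_event(job: str, inputs: list[str], outputs: list[str]) -> dict:
    """Assemble a minimal OpenLineage COMPLETE event (no facets)."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": "https://example.com/apac-custom-emitter",  # placeholder URI
        "run": {"runId": str(uuid.uuid4())},  # spec requires a UUID
        "job": {"namespace": "apac-payments-pipeline", "name": job},
        "inputs": [{"namespace": "apac-warehouse", "name": n} for n in inputs],
        "outputs": [{"namespace": "apac-warehouse", "name": n} for n in outputs],
    }

def emit(event: dict, url: str = "http://marquez.apac-internal:5000") -> None:
    """POST one event to an OpenLineage-compatible backend."""
    req = urllib.request.Request(
        f"{url}/api/v1/lineage",
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # raises on non-2xx responses
```

For production emitters, prefer the official `openlineage-python` client, which handles transports, retries, and facet types.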

Marquez: APAC Lineage Backend and Visualization

Marquez deployment — APAC Docker Compose setup

# docker-compose.apac-marquez.yml
version: "3"
services:
  marquez-api:
    image: marquezproject/marquez:latest
    ports:
      - "5000:5000"    # APAC OpenLineage API endpoint
      - "5001:5001"    # APAC Marquez admin API
    environment:
      MARQUEZ_CONFIG: /usr/src/app/marquez.yml
    volumes:
      - ./marquez.yml:/usr/src/app/marquez.yml

  marquez-web:
    image: marquezproject/marquez-web:latest
    ports:
      - "3000:3000"    # APAC Marquez lineage UI
    environment:
      MARQUEZ_HOST: marquez-api
      MARQUEZ_PORT: 5000

  postgresql:
    image: postgres:16
    environment:
      POSTGRES_USER: marquez
      POSTGRES_PASSWORD: apac_marquez_secret
      POSTGRES_DB: marquez_db

Marquez API — APAC lineage graph query

# APAC: Query Marquez lineage API for upstream dependencies

# Which APAC jobs write to apac_staging.stg_payments?
curl "http://marquez.apac-internal:5000/api/v1/lineage?nodeId=dataset:apac-warehouse:apac_staging.stg_payments&depth=3" \
  | jq '.graph.nodes[] | select(.type == "JOB") | .data.name'

# Output:
# "apac_transform_payments"
# "apac_backfill_payments_2025"

# APAC: What datasets does apac_transform_payments consume?
curl "http://marquez.apac-internal:5000/api/v1/jobs/apac-payments-pipeline/apac_transform_payments" \
  | jq '.latestRun.inputVersions[].name'

# Output:
# "apac_raw.payment_transactions"
# "apac_raw.currency_rates"

# APAC: Get run history (identify when APAC data went wrong)
curl "http://marquez.apac-internal:5000/api/v1/jobs/apac-payments-pipeline/apac_transform_payments/runs?limit=7" \
  | jq '.runs[] | {state, startedAt, endedAt}'

Spline: APAC Spark Column-Level Lineage

Spline — APAC Spark integration (zero code changes)

// APAC Spark job: no code changes needed for Spline

// 1. Add Spline agent JAR to Spark submit:
// spark-submit \
//   --packages za.co.absa.spline.agent.spark:spark-3.3-spline-agent-bundle_2.12:1.0.0 \
//   --conf spark.sql.queryExecutionListeners=za.co.absa.spline.harvester.listener.SplineQueryExecutionListener \
//   --conf spline.producer.url=http://spline-rest.apac-internal:8080/producer \
//   apac_payments_transform.jar

// APAC Spark application code — unchanged:
import org.apache.spark.sql.SparkSession

object APACPaymentsTransform {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("APAC Payments Transform")
      .getOrCreate()

    // Spline automatically captures: input → transformation → output lineage
    val apacRawPayments = spark.read
      .format("delta")
      .load("s3://apac-data-lake/raw/payment_transactions/")

    val apacStagedPayments = apacRawPayments
      .filter("transaction_id IS NOT NULL")
      .selectExpr(
        "transaction_id AS apac_payment_id",
        "CAST(amount_cents AS DOUBLE) / 100.0 AS apac_payment_amount_usd",
        "to_utc_timestamp(created_at, 'Asia/Singapore') AS apac_created_at_utc"
      )

    // Spline captures column-level lineage at write:
    // apac_payment_amount_usd ← amount_cents (CAST + divide by 100)
    apacStagedPayments.write
      .format("delta")
      .mode("overwrite")
      .save("s3://apac-data-lake/staging/stg_payments/")
  }
}
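Conceptually, the column-level records Spline harvests from the execution plan boil down to a per-output-column dependency map. A toy Python illustration (hand-written; not Spline's actual storage model or API):

```python
# Toy model of column-level lineage: output column ->
# (expression, source columns). Spline derives this from the Spark
# logical plan; the dict here is hand-written for illustration.
column_lineage = {
    "apac_payment_id": {
        "expr": "transaction_id",
        "sources": ["apac_raw.payment_transactions.transaction_id"],
    },
    "apac_payment_amount_usd": {
        "expr": "CAST(amount_cents AS DOUBLE) / 100.0",
        "sources": ["apac_raw.payment_transactions.amount_cents"],
    },
}

def sources_of(column: str) -> list[str]:
    """Answer: which physical source columns feed this output column?"""
    return column_lineage[column]["sources"]

print(sources_of("apac_payment_amount_usd"))
# ['apac_raw.payment_transactions.amount_cents']
```

This is the record that makes "is amount_cents safe to rename?" answerable without reading any Spark application code.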

APAC Data Lineage Tool Selection

APAC Data Lineage Need                → Tool            → Why

Multi-framework lineage               → OpenLineage     → One standard across Spark,
(Airflow + dbt + Spark + Flink)                           Airflow, dbt, and Flink;
                                                          single APAC backend

Lineage storage + visualization       → Marquez         → OpenLineage reference
(self-hosted, lightweight)                                backend; graph UI; simple
                                                          Docker Compose deployment

Spark-primary column-level lineage    → Spline          → Zero-code Spark integration;
(Databricks/EMR teams)                                    column-level derivation
                                                          tracking

Enterprise data catalog               → DataHub /       → Full governance: business
(governance + lineage combined)         OpenMetadata      glossary + lineage +
                                                          data stewardship

Impact analysis at scale              → OpenLineage     → Standard events feed the
(many teams, many pipelines)            + DataHub         DataHub lineage graph

Related APAC Data Engineering Resources

For the analytics engineering tools (dbt Core, SQLMesh, Coalesce) that emit OpenLineage events for SQL transformation lineage, see the APAC analytics engineering guide.

For the data catalog tools (DataHub, OpenMetadata, Atlas) that consume OpenLineage events from Marquez to provide a unified APAC governance view, see the APAC data governance guide covering Collibra, Alation, and Atlan.

For the data quality tools (Great Expectations, Soda) that emit quality facets through OpenLineage to enrich APAC lineage events with quality context, see the APAC data quality guide.

Want this applied to your firm?

We use these frameworks daily in client engagements. Let's see what they look like for your stage and market.