Skip to main content
Global
AIMenta
Blog

APAC Data Lineage Guide 2026: OpenLineage, Marquez, and Spline for Data Pipeline Transparency

A practitioner guide for APAC data engineering teams implementing data lineage tracking in 2026 — covering OpenLineage as the open standard for structured lineage event emission from Spark, Airflow, dbt, and Flink pipelines with schema and data quality metadata facets; Marquez as the OpenLineage-compatible reference backend for storing job run history, dataset version tracking, and interactive lineage graph visualization with Docker Compose deployment; and Spline for zero-code-change column-level Apache Spark lineage capture that records exactly which source columns contributed to each output column through which transformations for APAC impact analysis and root cause investigation.

AE By AIMenta Editorial Team ·

The Data Lineage Gap in APAC Data Engineering

APAC data engineering teams operating large-scale pipelines without lineage tracking face a recurring crisis when data quality incidents occur: the analytics team reports that yesterday's APAC revenue figure is wrong, but no one knows which of the 47 pipeline jobs that ran overnight is responsible. The data engineer manually traces backwards through pipeline dependencies — checking each job's source tables, joining against data quality logs, reading pipeline code to understand transformations — a process that takes hours and requires deep institutional knowledge that exists only in senior engineers' heads.

Data lineage solves this by recording, for every pipeline run, exactly which APAC source datasets were consumed, which APAC output datasets were produced, what transformations were applied, and (at column-level) which specific APAC columns derive from which source columns. With lineage in place, a data quality incident becomes a graph traversal: start from the incorrect APAC output column, traverse upstream edges to find which APAC source changed, and identify which APAC pipeline job introduced the fault.

Three tools cover the APAC data lineage implementation spectrum:

OpenLineage — open standard for data lineage metadata, with integrations for Spark, Airflow, dbt, and Flink that emit structured lineage events to any compatible backend.

Marquez — open-source lineage metadata service and OpenLineage reference backend for storing, querying, and visualizing APAC pipeline lineage.

Spline — Apache Spark-specific lineage tracking with column-level capture from Spark execution plans without application code changes.


APAC Data Lineage Fundamentals

Job-level vs column-level lineage

APAC Job-level lineage (OpenLineage / Marquez):
  Job: apac_stg_payments_transform
  Inputs:  [apac_raw.payment_transactions, apac_raw.currency_rates]
  Outputs: [apac_staging.stg_payments]
  ← Tells APAC which datasets are involved; not which columns

APAC Column-level lineage (Spline for Spark):
  apac_staging.stg_payments.apac_payment_amount_usd
    ← apac_raw.payment_transactions.amount_cents / 100
    ← apac_raw.currency_rates.usd_rate
  ← Tells APAC exactly which APAC source columns feed each output

APAC lineage use cases

APAC Impact analysis (change upstream → what breaks?):
  "If APAC DBA renames raw_payments.amount_cents"
  → Lineage graph shows: 3 APAC Spark jobs, 7 APAC dbt models, 2 APAC BI reports affected

APAC Root cause analysis (broken output → what caused it?):
  "APAC fct_revenue has wrong data for 2026-04-23"
  → Traverse APAC lineage backwards → find stg_payments loaded 0 rows
  → Cause: APAC Kafka lag; raw_payments not fully synced before job ran

APAC Compliance / data sovereignty:
  "Which APAC reports consume personal data from apac_raw.customer_pii?"
  → Forward lineage traversal shows all APAC downstream consumers

APAC Schema evolution safety:
  "Is APAC column apac_user_id used anywhere downstream?"
  → Lineage confirms before safe APAC column removal

OpenLineage: APAC Pipeline Instrumentation Standard

OpenLineage Airflow integration — APAC automatic emission

# airflow/dags/apac_payments_pipeline.py
# OpenLineage instruments Airflow automatically via the OpenLineage provider
# No code changes needed — install the provider and configure the backend

# pip install apache-airflow-providers-openlineage

# APAC Airflow environment variables:
# OPENLINEAGE_URL=http://marquez.apac-internal:5000
# OPENLINEAGE_NAMESPACE=apac-payments-pipeline

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator
from datetime import datetime

with DAG(
    dag_id="apac_payments_etl",
    start_date=datetime(2026, 1, 1),
    schedule="@daily",
) as dag:

    # OpenLineage automatically captures:
    # Input: apac_raw.payment_transactions (with schema + row count)
    # Output: apac_staging.stg_payments (with schema + row count)
    # No additional code required
    apac_transform = PostgresOperator(
        task_id="apac_transform_payments",
        postgres_conn_id="apac_warehouse",
        sql="""
            INSERT INTO apac_staging.stg_payments
            SELECT
                transaction_id AS apac_payment_id,
                CAST(amount_cents AS NUMERIC) / 100 AS apac_payment_amount_usd,
                CONVERT_TIMEZONE('UTC', created_at) AS apac_created_at_utc
            FROM apac_raw.payment_transactions
            WHERE DATE(created_at) = '{{ ds }}'
        """,
    )

OpenLineage event structure — APAC lineage payload

{
  "eventType": "COMPLETE",
  "eventTime": "2026-04-24T08:00:00Z",
  "run": {
    "runId": "apac-run-7f3a9b2c",
    "facets": {
      "apac_environment": {"region": "ap-southeast-1", "cluster": "apac-prod"}
    }
  },
  "job": {
    "namespace": "apac-payments-pipeline",
    "name": "apac_transform_payments"
  },
  "inputs": [{
    "namespace": "apac-warehouse",
    "name": "apac_raw.payment_transactions",
    "facets": {
      "schema": {
        "fields": [
          {"name": "transaction_id", "type": "VARCHAR"},
          {"name": "amount_cents", "type": "INTEGER"},
          {"name": "created_at", "type": "TIMESTAMP"}
        ]
      },
      "dataQualityMetrics": {
        "rowCount": 847293,
        "nullCount": {"transaction_id": 0}
      }
    }
  }],
  "outputs": [{
    "namespace": "apac-warehouse",
    "name": "apac_staging.stg_payments",
    "facets": {
      "schema": {
        "fields": [
          {"name": "apac_payment_id", "type": "VARCHAR"},
          {"name": "apac_payment_amount_usd", "type": "NUMERIC"},
          {"name": "apac_created_at_utc", "type": "TIMESTAMP"}
        ]
      }
    }
  }]
}

Marquez: APAC Lineage Backend and Visualization

Marquez deployment — APAC Docker Compose setup

# docker-compose.apac-marquez.yml
version: "3"
services:
  marquez-api:
    image: marquezproject/marquez:latest
    ports:
      - "5000:5000"    # APAC OpenLineage API endpoint
      - "5001:5001"    # APAC Marquez admin API
    environment:
      MARQUEZ_CONFIG: /usr/src/app/marquez.yml
    volumes:
      - ./marquez.yml:/usr/src/app/marquez.yml

  marquez-web:
    image: marquezproject/marquez-web:latest
    ports:
      - "3000:3000"    # APAC Marquez lineage UI
    environment:
      MARQUEZ_HOST: marquez-api
      MARQUEZ_PORT: 5000

  postgresql:
    image: postgres:16
    environment:
      POSTGRES_USER: marquez
      POSTGRES_PASSWORD: apac_marquez_secret
      POSTGRES_DB: marquez_db

Marquez API — APAC lineage graph query

# APAC: Query Marquez lineage API for upstream dependencies

# Which APAC jobs write to apac_staging.stg_payments?
curl "http://marquez.apac-internal:5000/api/v1/lineage?nodeId=dataset:apac-warehouse:apac_staging.stg_payments&depth=3" \
  | jq '.graph.nodes[] | select(.type == "JOB") | .data.name'

# Output:
# "apac_transform_payments"
# "apac_backfill_payments_2025"

# APAC: What datasets does apac_transform_payments consume?
curl "http://marquez.apac-internal:5000/api/v1/jobs/apac-payments-pipeline/apac_transform_payments" \
  | jq '.latestRun.inputVersions[].name'

# Output:
# "apac_raw.payment_transactions"
# "apac_raw.currency_rates"

# APAC: Get run history (identify when APAC data went wrong)
curl "http://marquez.apac-internal:5000/api/v1/jobs/apac-payments-pipeline/apac_transform_payments/runs?limit=7" \
  | jq '.runs[] | {state, startedAt, endedAt}'

Spline: APAC Spark Column-Level Lineage

Spline — APAC Spark integration (zero code changes)

// APAC Spark job: no code changes needed for Spline

// 1. Add Spline agent JAR to Spark submit:
// spark-submit \
//   --packages za.co.absa.spline.agent.spark:spark-3.3-spline-agent-bundle_2.12:1.0.0 \
//   --conf spark.sql.queryExecutionListeners=za.co.absa.spline.harvester.SplineQueryExecutionListener \
//   --conf spline.producer.url=http://spline-rest.apac-internal:8080/producer \
//   apac_payments_transform.jar

// APAC Spark application code — unchanged:
import org.apache.spark.sql.SparkSession

object APACPaymentsTransform {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("APAC Payments Transform")
      .getOrCreate()

    // Spline automatically captures: input → transformation → output lineage
    val apacRawPayments = spark.read
      .format("delta")
      .load("s3://apac-data-lake/raw/payment_transactions/")

    val apacStagedPayments = apacRawPayments
      .filter("transaction_id IS NOT NULL")
      .selectExpr(
        "transaction_id AS apac_payment_id",
        "CAST(amount_cents AS DOUBLE) / 100.0 AS apac_payment_amount_usd",
        "to_utc_timestamp(created_at, 'Asia/Singapore') AS apac_created_at_utc"
      )

    // Spline captures column-level lineage at write:
    // apac_payment_amount_usd ← amount_cents (CAST + divide by 100)
    apacStagedPayments.write
      .format("delta")
      .mode("overwrite")
      .save("s3://apac-data-lake/staging/stg_payments/")
  }
}

APAC Data Lineage Tool Selection

APAC Data Lineage Need                → Tool          → Why

APAC multi-framework lineage          → OpenLineage   Standard across Spark +
(Airflow + dbt + Spark + Flink)      →               Airflow + dbt + Flink;
                                                      single APAC backend

APAC lineage storage + visualization  → Marquez       OpenLineage reference
(self-hosted, lightweight)            →               backend; graph UI;
                                                      APAC Docker Compose deploy

APAC Spark-primary column lineage     → Spline        Zero-code APAC Spark
(Databricks/EMR teams)               →               integration; column-level
                                                      derivation tracking

APAC enterprise data catalog          → DataHub /     Full APAC governance:
(governance + lineage combined)       → OpenMetadata  business glossary + APAC
                                                      lineage + stewardship

APAC impact analysis at scale         → OpenLineage   Standard events feed
(many teams, many APAC pipelines)    → + DataHub      DataHub APAC lineage graph

Related APAC Data Engineering Resources

For the analytics engineering tools (dbt Core, SQLMesh, Coalesce) that emit OpenLineage events for SQL transformation lineage, see the APAC analytics engineering guide.

For the data catalog tools (DataHub, OpenMetadata, Atlas) that consume OpenLineage events from Marquez to provide a unified APAC governance view, see the APAC data governance guide covering Collibra, Alation, and Atlan.

For the data quality tools (Great Expectations, Soda, Airflow) that emit quality facets through OpenLineage to enrich APAC lineage events with quality context, see the APAC data quality guide.

Beyond this insight

Cross-reference our practice depth.

If this article matches your stage of thinking, the underlying capabilities ship across all six pillars, ten verticals, and nine Asian markets.

Keep reading

Related reading

Blog

APAC Computer Vision Deployment Guide 2026: Ultralytics, LandingAI, and Roboflow Inference

A practitioner guide for APAC ML and engineering teams building and deploying computer vision systems in 2026 — covering Ultralytics YOLO as the state-of-the-art real-time CV framework for training, fine-tuning, and exporting YOLO models to TensorRT, ONNX, and TFLite for APAC edge and cloud deployment with one Python API; LandingAI as a no-code visual inspection platform enabling APAC factory quality engineers to build defect detection models using active learning with 50-200 labeled images and no ML expertise, with edge deployment for on-premise factory inference; and Roboflow Inference as an open-source CV model serving engine that deploys YOLO, GroundingDINO, and SAM2 as Docker APIs with one command, with Workflows for chaining multi-model CV pipelines into single API calls for APAC engineering teams.

Blog

APAC ML Experiment Tracking and Data Versioning Guide 2026: DagsHub, Aim, and DVC

A practitioner guide for APAC data science teams implementing ML reproducibility through data versioning and experiment tracking in 2026 — covering DVC as a Git-compatible data version control tool that tracks large datasets and model artifacts in APAC cloud storage while storing lightweight metadata in Git, enabling reproducible ML pipelines with pipeline stage caching that skips unchanged preprocessing stages; DagsHub as an integrated ML project collaboration platform combining Git hosting, DVC data versioning, MLflow-compatible experiment tracking, and model registry in a GitHub-like interface; and Aim as an open-source self-hosted ML experiment tracker providing APAC regulated industry teams with complete data sovereignty over training metadata, rich run comparison, and hyperparameter visualization without cloud vendor dependency.

Blog

APAC AI Podcast Production Guide 2026: Podcastle, Cleanvoice AI, and Alitu

A practitioner guide for APAC thought leaders, corporate communicators, and content teams launching AI-assisted podcast production workflows in 2026 — covering Podcastle as an AI podcast recording platform with remote multi-track recording for distributed APAC guest networks, AI audio enhancement for non-studio recordings, and transcript-based text editing that removes audio mistakes by deleting transcript text; Cleanvoice AI as a specialized audio cleanup service that automatically removes filler words, mouth noises, dead air, and stutters from APAC podcast recordings via API, with a case study showing 54 hours of editor time saved on 12 back episodes; and Alitu as an all-in-one podcast production and hosting platform where non-technical APAC creators record, clean, assemble, and publish to Apple Podcasts and Spotify in under 90 minutes total without audio engineering knowledge.

Want this applied to your firm?

We use these frameworks daily in client engagements. Let's see what they look like for your stage and market.