DataHub

by Acryl Data

LinkedIn's open-source metadata platform and data catalog enabling APAC data engineering and analytics teams to discover data assets, trace end-to-end lineage across pipelines and warehouses, establish data ownership, and enforce data quality assertions with 70+ native integrations.

AIMenta verdict
Recommended
5/5

"DataHub is LinkedIn's open-source data catalog for APAC data teams — lineage tracking, schema discovery, ownership, and data quality assertions across Snowflake, BigQuery, Kafka, and 70+ integrations. Best for APAC data engineering teams centralising metadata governance."

What it does

Key features

  • Automated metadata ingestion — 70+ connectors covering Snowflake, BigQuery, dbt, Airflow, Kafka, and Spark across APAC data stacks
  • End-to-end lineage — trace APAC data from raw sources through transformations to dashboards and ML models
  • Data ownership — assign APAC data assets to owners, stewards, and business domains with domain visibility
  • Data quality assertions — freshness, volume, and custom SQL health checks directly in DataHub
  • Business glossary — APAC enterprise data terminology linked to physical data assets for semantic consistency
  • Search and discovery — full-text and faceted search across all catalogued APAC data assets
  • GraphQL API — programmatic metadata access for APAC data portal and ML feature store integration
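The GraphQL API in the last bullet can be exercised with any HTTP client. A minimal sketch in Python that only builds the request body — the `search` query shape shown here follows DataHub's GraphQL schema but may differ between versions, so treat the field names as assumptions and confirm them in the GraphiQL explorer on your instance:

```python
import json

# Illustrative dataset search query; field names are assumptions based on
# DataHub's GraphQL schema and may vary by version.
SEARCH_QUERY = """
query search($input: SearchInput!) {
  search(input: $input) {
    total
    searchResults {
      entity {
        urn
        type
      }
    }
  }
}
"""

def build_search_payload(text: str, entity_type: str = "DATASET", count: int = 10) -> str:
    """Serialise a GraphQL search request body for a DataHub /api/graphql endpoint."""
    return json.dumps({
        "query": SEARCH_QUERY,
        "variables": {"input": {"type": entity_type, "query": text, "start": 0, "count": count}},
    })

payload = build_search_payload("orders")
print(json.loads(payload)["variables"]["input"]["query"])  # -> orders
```

POST the payload to `http://<your-datahub-host>/api/graphql` with a bearer token to run the search; the host and authentication scheme depend on your deployment.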
When to reach for it

Best for

  • APAC data engineering teams at scale (10+ engineers) who need centralised metadata governance across a heterogeneous data stack of Snowflake, Kafka, dbt, and Airflow without a commercial catalog license
  • Data platform teams managing APAC data mesh architectures who need domain-based ownership and data product registration across multiple APAC business domains
  • APAC financial services and healthcare data teams with regulatory requirements for data lineage provenance and impact analysis when upstream systems change
  • Engineering organisations building APAC internal data portals or ML feature stores that need programmatic metadata access via DataHub's GraphQL API
Don't get burned

Limitations to know

  • ! Operational complexity — self-hosted DataHub requires running multiple components (GMS, MAE consumer, MCE consumer, Elasticsearch, MySQL, Kafka); APAC teams without dedicated platform engineering capacity should evaluate the Acryl Cloud managed offering
  • ! Ingestion freshness — DataHub's default ingestion is scheduled batch pull; real-time metadata updates require push-based ingestion via DataHub's Kafka integration, adding Kafka to APAC infrastructure dependencies
  • ! UI for non-technical users — DataHub's interface is optimised for APAC data engineers; non-technical data stewards or business analysts may find column-level lineage and technical metadata less accessible than commercial catalog products like Collibra or Alation
  • ! Community support vs commercial SLA — self-hosted DataHub support is community-based (Slack, GitHub Issues); APAC enterprises requiring SLA-backed support need Acryl Cloud or a support contract
Context

About DataHub

DataHub is an open-source metadata platform originally built at LinkedIn and now stewarded by Acryl Data under the LF AI & Data Foundation. It provides APAC data engineering and analytics teams with a unified data catalog, lineage graph, data ownership model, and data quality assertion framework that integrates with the full modern APAC data stack, including Snowflake, BigQuery, Redshift, dbt, Airflow, Kafka, Spark, and 70+ additional connectors.

DataHub's metadata ingestion model — where DataHub ingests metadata from connected data sources on a schedule using Python-based recipes (`datahub ingest -c snowflake-recipe.yaml`) — enables APAC data teams to automatically discover and catalogue all tables, views, dashboards, pipelines, and topics in their data estate without manual data asset registration, ensuring DataHub's catalog reflects the current state of APAC data infrastructure.
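A recipe is a small YAML file pairing a source connector with a sink. The sketch below shows the common `source`/`sink` shape for a Snowflake recipe — the account, credential, and role values are placeholders, and the full option list lives in the connector documentation:

```yaml
# snowflake-recipe.yaml -- illustrative only; credential fields are placeholders.
source:
  type: snowflake
  config:
    account_id: "my_account"          # placeholder Snowflake account
    warehouse: "COMPUTE_WH"
    username: "datahub_reader"
    password: "${SNOWFLAKE_PASSWORD}" # injected from the environment
    role: "datahub_role"
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"   # your DataHub GMS endpoint
```

Run it with the command from the paragraph above, `datahub ingest -c snowflake-recipe.yaml`, either ad hoc or on a scheduler such as Airflow or cron.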

DataHub's lineage graph — where end-to-end data lineage is traced from raw data sources through transformation pipelines (dbt models, Spark jobs, Airflow tasks) to downstream dashboards and ML models — enables APAC data engineers and analysts to understand the impact of upstream schema changes, identify root causes of data quality failures by tracing lineage backwards, and satisfy APAC regulatory audit requirements for data provenance in financial services and healthcare.
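Mechanically, impact analysis over a lineage graph is a downstream graph traversal. A conceptual sketch in plain Python (not the DataHub API) using a made-up adjacency map of upstream-to-downstream edges:

```python
from collections import deque

# Hypothetical lineage edges: upstream asset -> downstream assets.
LINEAGE = {
    "snowflake.raw.orders": ["dbt.stg_orders"],
    "dbt.stg_orders": ["dbt.fct_orders"],
    "dbt.fct_orders": ["looker.revenue_dashboard", "ml.churn_features"],
}

def downstream_impact(asset: str) -> set:
    """Breadth-first walk of downstream lineage edges from `asset`."""
    impacted, queue = set(), deque([asset])
    while queue:
        for child in LINEAGE.get(queue.popleft(), []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted

# A schema change on the raw orders table touches every transitive consumer.
print(sorted(downstream_impact("snowflake.raw.orders")))
```

Tracing a quality failure backwards is the same walk over the reversed edge map.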

DataHub's data ownership and domain model — where APAC data assets are assigned to data owners, data stewards, and business glossary terms with domain-scoped visibility — enables APAC data platform teams to establish clear accountability for each data asset, lets data consumers find the right owner to contact when data quality issues arise, and lets APAC governance teams audit ownership coverage across business domains.
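The ownership-coverage audit mentioned above reduces to a per-domain tally. A conceptual sketch (not the DataHub API) over a hypothetical catalog extract of `(domain, asset, owner)` records:

```python
# Hypothetical catalog extract: (domain, asset, owner-or-None).
ASSETS = [
    ("finance", "fct_revenue", "alice@example.com"),
    ("finance", "stg_invoices", None),
    ("marketing", "dim_campaigns", "bob@example.com"),
]

def ownership_coverage(assets):
    """Fraction of assets in each domain that have an assigned owner."""
    totals, owned = {}, {}
    for domain, _asset, owner in assets:
        totals[domain] = totals.get(domain, 0) + 1
        owned[domain] = owned.get(domain, 0) + (owner is not None)
    return {domain: owned[domain] / totals[domain] for domain in totals}

print(ownership_coverage(ASSETS))  # finance at 50% coverage, marketing at 100%
```

In practice the extract would come from DataHub's API rather than a literal list.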

DataHub's data quality assertions — where APAC data engineers define freshness assertions ('this table should be updated at least every 24 hours'), volume assertions ('this table should have more than 10,000 rows'), and custom SQL assertions directly in DataHub — enables APAC data teams to monitor data health across their catalog without building custom monitoring pipelines, with assertion failures surfaced in DataHub's UI and routed to Slack or PagerDuty for APAC on-call response.
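The freshness and volume assertions described above boil down to simple predicates evaluated against table metadata. A conceptual sketch in plain Python (DataHub defines these declaratively; the function names here are illustrative, not part of its API):

```python
from datetime import datetime, timedelta, timezone

def check_freshness(last_updated: datetime, max_age_hours: int = 24) -> bool:
    """Freshness assertion: the table was updated within the last `max_age_hours`."""
    return datetime.now(timezone.utc) - last_updated <= timedelta(hours=max_age_hours)

def check_volume(row_count: int, min_rows: int = 10_000) -> bool:
    """Volume assertion: the table has at least `min_rows` rows."""
    return row_count >= min_rows

# A table last touched 30 hours ago fails the 24-hour freshness assertion.
stale = datetime.now(timezone.utc) - timedelta(hours=30)
print(check_freshness(stale), check_volume(12_500))  # -> False True
```

A failing predicate is what DataHub surfaces in its UI and routes to Slack or PagerDuty.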
