DataHub

by Acryl Data

LinkedIn's open-source metadata platform and data catalog enabling APAC data engineering and analytics teams to discover data assets, trace end-to-end lineage across pipelines and warehouses, establish data ownership, and enforce data quality assertions with 70+ native integrations.

AIMenta verdict
Recommended
5/5

"DataHub is LinkedIn's open-source data catalog for APAC data teams — lineage tracking, schema discovery, ownership, and data quality assertions across Snowflake, BigQuery, Kafka, and 70+ integrations. Best for APAC data engineering teams centralising metadata governance."

What it does

Key features

  • Automated metadata ingestion — 70+ connectors covering Snowflake, BigQuery, dbt, Airflow, Kafka, and Spark across APAC data stacks
  • End-to-end lineage — trace APAC data from raw sources through transformations to dashboards and ML models
  • Data ownership — assign APAC data assets to owners, stewards, and business domains with domain visibility
  • Data quality assertions — freshness, volume, and custom SQL health checks directly in DataHub
  • Business glossary — APAC enterprise data terminology linked to physical data assets for semantic consistency
  • Search and discovery — full-text and faceted search across all catalogued APAC data assets
  • GraphQL API — programmatic metadata access for APAC data portal and ML feature store integration
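The GraphQL API in the last bullet can be exercised with any HTTP client. A minimal sketch in Python that only builds the request body — the `search` query shape shown here follows DataHub's GraphQL schema but may differ between versions, so treat the field names as assumptions and confirm them in the GraphiQL explorer on your instance:

```python
import json

# Illustrative dataset search query; field names are assumptions based on
# DataHub's GraphQL schema and may vary by version.
SEARCH_QUERY = """
query search($input: SearchInput!) {
  search(input: $input) {
    total
    searchResults {
      entity {
        urn
        type
      }
    }
  }
}
"""

def build_search_payload(text: str, entity_type: str = "DATASET", count: int = 10) -> str:
    """Serialise a GraphQL search request body for a DataHub /api/graphql endpoint."""
    return json.dumps({
        "query": SEARCH_QUERY,
        "variables": {"input": {"type": entity_type, "query": text, "start": 0, "count": count}},
    })

payload = build_search_payload("orders")
print(json.loads(payload)["variables"]["input"]["query"])  # -> orders
```

POST the payload to `http://<your-datahub-host>/api/graphql` with a bearer token to run the search; the host and authentication scheme depend on your deployment.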
When to reach for it

Best for

  • APAC data engineering teams at scale (10+ engineers) who need centralised metadata governance across a heterogeneous data stack of Snowflake, Kafka, dbt, and Airflow without a commercial catalog license
  • Data platform teams managing APAC data mesh architectures who need domain-based ownership and data product registration across multiple APAC business domains
  • APAC financial services and healthcare data teams with regulatory requirements for data lineage provenance and impact analysis when upstream systems change
  • Engineering organisations building APAC internal data portals or ML feature stores that need programmatic metadata access via DataHub's GraphQL API
Don't get burned

Limitations to know

  • ! Operational complexity — self-hosted DataHub requires running multiple components (GMS, MAE consumer, MCE consumer, Elasticsearch, MySQL, Kafka); APAC teams without dedicated platform engineering capacity should evaluate the Acryl Cloud managed offering
  • ! Ingestion freshness — DataHub's default ingestion is scheduled batch pull; real-time metadata updates require push-based ingestion via DataHub's Kafka integration, adding Kafka to APAC infrastructure dependencies
  • ! UI for non-technical users — DataHub's interface is optimised for APAC data engineers; non-technical data stewards or business analysts may find column-level lineage and technical metadata less accessible than commercial catalog products like Collibra or Alation
  • ! Community support vs commercial SLA — self-hosted DataHub support is community-based (Slack, GitHub Issues); APAC enterprises requiring SLA-backed support need Acryl Cloud or a support contract
Context

About DataHub

DataHub is an open-source metadata platform originally built at LinkedIn and now stewarded by Acryl Data under the LF AI & Data Foundation. It provides APAC data engineering and analytics teams with a unified data catalog, lineage graph, data ownership model, and data quality assertion framework that integrates with the full modern APAC data stack, including Snowflake, BigQuery, Redshift, dbt, Airflow, Kafka, Spark, and 70+ additional connectors.

DataHub's metadata ingestion model — where DataHub ingests metadata from connected data sources on a schedule using Python-based recipes (`datahub ingest -c snowflake-recipe.yaml`) — enables APAC data teams to automatically discover and catalogue all tables, views, dashboards, pipelines, and topics in their data estate without manual data asset registration, ensuring DataHub's catalog reflects the current state of APAC data infrastructure.
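A recipe is a small YAML file pairing a source connector with a sink. The sketch below shows the common `source`/`sink` shape for a Snowflake recipe — the account, credential, and role values are placeholders, and the full option list lives in the connector documentation:

```yaml
# snowflake-recipe.yaml -- illustrative only; credential fields are placeholders.
source:
  type: snowflake
  config:
    account_id: "my_account"          # placeholder Snowflake account
    warehouse: "COMPUTE_WH"
    username: "datahub_reader"
    password: "${SNOWFLAKE_PASSWORD}" # injected from the environment
    role: "datahub_role"
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"   # your DataHub GMS endpoint
```

Run it with the command from the paragraph above, `datahub ingest -c snowflake-recipe.yaml`, either ad hoc or on a scheduler such as Airflow or cron.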

DataHub's lineage graph — where end-to-end data lineage is traced from raw data sources through transformation pipelines (dbt models, Spark jobs, Airflow tasks) to downstream dashboards and ML models — enables APAC data engineers and analysts to understand the impact of upstream schema changes, identify root causes of data quality failures by tracing lineage backwards, and satisfy APAC regulatory audit requirements for data provenance in financial services and healthcare.
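Mechanically, impact analysis over a lineage graph is a downstream graph traversal. A conceptual sketch in plain Python (not the DataHub API) using a made-up adjacency map of upstream-to-downstream edges:

```python
from collections import deque

# Hypothetical lineage edges: upstream asset -> downstream assets.
LINEAGE = {
    "snowflake.raw.orders": ["dbt.stg_orders"],
    "dbt.stg_orders": ["dbt.fct_orders"],
    "dbt.fct_orders": ["looker.revenue_dashboard", "ml.churn_features"],
}

def downstream_impact(asset: str) -> set:
    """Breadth-first walk of downstream lineage edges from `asset`."""
    impacted, queue = set(), deque([asset])
    while queue:
        for child in LINEAGE.get(queue.popleft(), []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted

# A schema change on the raw orders table touches every transitive consumer.
print(sorted(downstream_impact("snowflake.raw.orders")))
```

Tracing a quality failure backwards is the same walk over the reversed edge map.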

DataHub's data ownership and domain model — where APAC data assets are assigned to data owners, data stewards, and business glossary terms with domain-scoped visibility — enables APAC data platform teams to establish clear accountability for each data asset, lets data consumers find the right owner to contact when data quality issues arise, and lets APAC governance teams audit ownership coverage across business domains.
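The ownership-coverage audit mentioned above reduces to a per-domain tally. A conceptual sketch (not the DataHub API) over a hypothetical catalog extract of `(domain, asset, owner)` records:

```python
# Hypothetical catalog extract: (domain, asset, owner-or-None).
ASSETS = [
    ("finance", "fct_revenue", "alice@example.com"),
    ("finance", "stg_invoices", None),
    ("marketing", "dim_campaigns", "bob@example.com"),
]

def ownership_coverage(assets):
    """Fraction of assets in each domain that have an assigned owner."""
    totals, owned = {}, {}
    for domain, _asset, owner in assets:
        totals[domain] = totals.get(domain, 0) + 1
        owned[domain] = owned.get(domain, 0) + (owner is not None)
    return {domain: owned[domain] / totals[domain] for domain in totals}

print(ownership_coverage(ASSETS))  # finance at 50% coverage, marketing at 100%
```

In practice the extract would come from DataHub's API rather than a literal list.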

DataHub's data quality assertions — where APAC data engineers define freshness assertions ('this table should be updated at least every 24 hours'), volume assertions ('this table should have more than 10,000 rows'), and custom SQL assertions directly in DataHub — enables APAC data teams to monitor data health across their catalog without building custom monitoring pipelines, with assertion failures surfaced in DataHub's UI and routed to Slack or PagerDuty for APAC on-call response.
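The freshness and volume assertions described above boil down to simple predicates evaluated against table metadata. A conceptual sketch in plain Python (DataHub defines these declaratively; the function names here are illustrative, not part of its API):

```python
from datetime import datetime, timedelta, timezone

def check_freshness(last_updated: datetime, max_age_hours: int = 24) -> bool:
    """Freshness assertion: the table was updated within the last `max_age_hours`."""
    return datetime.now(timezone.utc) - last_updated <= timedelta(hours=max_age_hours)

def check_volume(row_count: int, min_rows: int = 10_000) -> bool:
    """Volume assertion: the table has at least `min_rows` rows."""
    return row_count >= min_rows

# A table last touched 30 hours ago fails the 24-hour freshness assertion.
stale = datetime.now(timezone.utc) - timedelta(hours=30)
print(check_freshness(stale), check_volume(12_500))  # -> False True
```

A failing predicate is what DataHub surfaces in its UI and routes to Slack or PagerDuty.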
