
Apache Spark

by Apache Software Foundation

Open-source distributed data processing engine providing APAC data engineering teams with unified in-memory computation for SQL analytics, streaming, machine learning (MLlib), and graph processing across Kubernetes, Hadoop, and cloud-managed clusters.

AIMenta verdict
Recommended
5/5

"Apache Spark is the open-source distributed data processing engine for APAC data engineering teams — in-memory SQL, streaming, graph, and ML workloads across clusters of hundreds of nodes. Best for APAC teams processing terabyte-scale data lakes and training large ML models."

7 features · 4 use cases · 4 watch outs
What it does

Key features

  • Spark SQL — distributed SQL engine with Catalyst optimiser for APAC data lake and warehouse queries
  • PySpark API — Python DataFrame API with Pandas compatibility for APAC data engineering and science workflows
  • Structured Streaming — unified batch+streaming API with exactly-once semantics for APAC real-time pipelines
  • MLlib — distributed machine learning library covering classification, regression, and feature engineering for APAC ML at scale
  • Spark on Kubernetes — native Kubernetes scheduler support for running Spark jobs directly on APAC Kubernetes clusters
  • Delta Lake integration — ACID transactions and time travel on APAC data lakes via Delta Lake (originally a Databricks project, now governed by the Linux Foundation)
  • Adaptive query execution — runtime query plan optimisation for APAC workloads with skewed data distributions
When to reach for it

Best for

  • APAC data engineering teams processing terabyte-scale datasets that exceed single-machine processing capacity in batch ELT pipelines
  • Data science teams training ML models on APAC datasets too large for single-node scikit-learn or pandas processing
  • Engineering organisations running unified batch and streaming data pipelines from the same Spark codebase
  • APAC teams using Databricks as their managed data platform, where Spark is the underlying compute engine
Don't get burned

Limitations to know

  • ! Operational complexity for self-managed clusters — running Spark on Kubernetes or Hadoop requires significant platform engineering investment; managed offerings (Databricks, EMR) reduce but do not eliminate the operational burden
  • ! Memory management overhead — Spark's in-memory execution model requires careful cluster memory configuration; misconfiguration causes job failures or excessive garbage-collection pauses
  • ! Overkill for small data — APAC data pipelines processing gigabytes rather than terabytes often run faster on DuckDB or single-node Pandas than on multi-node Spark due to distributed coordination overhead
  • ! Python dependency management on distributed clusters — PySpark requires a consistent Python environment across all worker nodes; keeping library versions aligned across the cluster typically requires container-based worker deployment
Context

About Apache Spark

Apache Spark is an open-source distributed data processing framework that provides APAC data engineering and data science teams with a unified computation engine for large-scale batch SQL analytics (Spark SQL), near-real-time streaming (Spark Structured Streaming), machine learning (MLlib), and graph processing (GraphX) — processing data across clusters of tens to hundreds of nodes through an in-memory execution model that is 10-100x faster than MapReduce for iterative algorithms.

Spark's DataFrame API — where data is represented as distributed tabular datasets with a Pandas-compatible Python API (PySpark) or SQL query interface, processed through a lazy evaluation model that Spark's Catalyst query optimiser compiles into an efficient distributed execution plan — enables APAC data engineering teams to write data transformation logic in familiar Python or SQL syntax that Spark executes across the full cluster, without requiring APAC engineers to reason about distributed system internals.

Spark Structured Streaming — where the same Spark DataFrame API used for batch processing can process continuously arriving data streams (Kafka topics, Kinesis streams, Delta Lake change data feed) with exactly-once semantics and watermark-based late data handling — enables APAC data engineering teams to share a single code path between batch and streaming processing, reducing the operational overhead of maintaining separate batch and streaming pipeline codebases.
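A sketch of that shared batch/streaming code path, wiring a Kafka source to a Parquet sink. This is configuration-level pseudocode rather than a runnable job: the broker address, topic, and paths are placeholders, and a real Kafka cluster would be required.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-sketch").getOrCreate()

# Same DataFrame API as batch, but reading a continuously arriving source.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "orders")                     # placeholder topic
    .load()
)

# The watermark bounds how late data may arrive; the windowed count uses
# the identical aggregation API a batch job would use.
counts = (
    events.selectExpr("CAST(value AS STRING) AS value", "timestamp")
    .withWatermark("timestamp", "10 minutes")
    .groupBy(F.window("timestamp", "5 minutes"))
    .count()
)

# The checkpoint location is what underpins exactly-once semantics.
query = (
    counts.writeStream
    .outputMode("append")
    .format("parquet")
    .option("path", "/data/out")                 # placeholder sink path
    .option("checkpointLocation", "/data/ckpt")  # placeholder checkpoint
    .start()
)
query.awaitTermination()
```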

MLlib, Spark's distributed machine learning library — covering classification, regression, clustering, collaborative filtering, feature engineering, and model evaluation — enables APAC data science teams to train machine learning models on datasets that exceed single-machine memory limits, distributing both data and computation across the Spark cluster through a scikit-learn-inspired Python Pipeline API familiar to data scientists.

Spark's managed deployment options — where APAC teams can run Spark on Kubernetes (native Spark-on-Kubernetes scheduler), Databricks (managed Apache Spark PaaS), Amazon EMR, Google Dataproc, or Azure HDInsight — enable APAC organisations to choose between the operational control of self-managed Spark clusters and the convenience of fully managed Spark services that handle cluster provisioning, autoscaling, and fault recovery automatically.
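As a configuration sketch (not run here), a Spark-on-Kubernetes session differs from a local one mainly in its master URL and a few settings; the API server address, image name, and instance count below are placeholders, not real endpoints:

```python
from pyspark.sql import SparkSession

# Config-only sketch: every value below is a placeholder for a real cluster.
spark = (
    SparkSession.builder
    .master("k8s://https://k8s-apiserver.example:6443")  # placeholder API server
    .appName("spark-on-k8s-sketch")
    .config("spark.executor.instances", "4")             # executor pod count
    .config("spark.kubernetes.container.image", "registry.example/spark:3.5.0")
    .getOrCreate()
)
```

Managed services (Databricks, EMR, Dataproc, HDInsight) hide this layer entirely: the session is provisioned for you and application code starts from `spark` directly.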

Beyond this tool

Where this category meets practice.

A tool only matters in context. Browse the service pillars that operationalise it, the industries where it ships, and the Asian markets where AIMenta runs adoption programs.