Nº07 · Processing
Apache Spark
The distributed engine for processing data at large scale.
What is it?
Apache Spark is a distributed processing engine: it splits work across many machines (or many cores on one machine) to transform datasets too large for a single machine to handle. Its Python API, PySpark, offers a DataFrame interface that will feel familiar to pandas users, but it scales to terabytes.
What is it for?
- Heavy ETL/ELT over data lakes (Parquet, Iceberg) and databases.
- Large-scale batch processing and also streaming (Structured Streaming).
- Preparing features and training models over huge datasets (with MLlib).
When to use it / when not
Use it when the data doesn't fit on one machine or when you need real parallelism over a cluster.
Think twice if your data fits in memory on one machine: operating Spark is complex, and for datasets of a few gigabytes, single-node tools such as DuckDB or Polars are simpler and often faster.
Get started in 1 minute
# Shell: install PySpark (it also needs a Java runtime on the machine)
pip install pyspark
# Python: total amount per country, largest first
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.appName("demo").getOrCreate()
df = spark.read.parquet("sales.parquet")
(df.groupBy("country")
   .agg(F.sum("amount").alias("total"))
   .orderBy(F.desc("total"))
   .show())
Official documentation
The source of truth lives there. Here we orient you; the depth is up to you.
Open official docs ↗
Updated 2026-06-08