Nº07 · Processing

Apache Spark

The distributed engine for processing data at large scale.

Engine / DB · Intermediate · Data Engineer, Data Scientist · python

What is it?

Apache Spark is a distributed processing engine: it spreads work across many machines (or many cores on one machine) to transform datasets that are too large for a single machine. Its Python API, PySpark, offers a DataFrame interface that feels similar to pandas but scales to terabytes.
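
To make the "spread the work, then combine" idea concrete, here is a toy sketch in plain Python. It is not how Spark is implemented and it uses no Spark API: threads stand in for cluster workers, and the function names (split_into_partitions, partial_totals, total_by_country) are hypothetical names chosen for illustration. It only shows the general pattern of aggregating each partition independently and then merging the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Deal rows round-robin into `num_partitions` lists."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def partial_totals(partition):
    """Aggregate one partition on its own, as a single worker would."""
    totals = Counter()
    for country, amount in partition:
        totals[country] += amount
    return totals


def total_by_country(rows, num_partitions=4):
    """Sum amounts per country by combining per-partition partial sums."""
    partitions = split_into_partitions(rows, num_partitions)
    # Threads are a stand-in for workers; a real engine would ship
    # each partition to a different core or machine.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(partial_totals, partitions))
    # Merge step: add the partial sums together.
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    # Largest total first.
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)


rows = [("ES", 120), ("MX", 80), ("ES", 30), ("AR", 45), ("MX", 10)]
print(total_by_country(rows, num_partitions=2))
# → [('ES', 150), ('MX', 90), ('AR', 45)]
```

Because each partition is summed without looking at the others, the per-partition step can run anywhere; only the small partial results need to be brought together at the end.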

What is it for?

Heavy ETL/ELT over data lakes (Parquet, Iceberg) and databases.
Large-scale batch processing and also streaming (Structured Streaming).
Preparing features and training models over huge datasets (with MLlib).

When to use it / when not

Use it when the data doesn't fit on one machine or when you need real parallelism over a cluster.

Think twice if your data fits in memory on one machine: Spark is complex to run, and for gigabyte-scale data DuckDB or Polars are simpler and often faster on a single node.

Get started in 1 minute

pip install pyspark

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

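# Start a Spark session, or reuse one that already exists
spark = SparkSession.builder.appName("demo").getOrCreate()

# Load a Parquet file into a DataFrame
df = spark.read.parquet("sales.parquet")

# Total amount per country, largest total first
(df.groupBy("country")
   .agg(F.sum("amount").alias("total"))
   .orderBy(F.desc("total"))
   .show())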

Official documentation

The source of truth lives there. Here we orient you; the depth is up to you.

What to learn next

Apache Kafka

The nervous system for real-time data.

Intermediate · python

Nº30 · Processing

Trino

One SQL to query data wherever it lives.

Intermediate · sql

Nº07 · Updated 2026-06-08