Open-source curation · Python-first · in Spanish & English

Nº07 · Processing

Apache Spark

The distributed engine for processing data at large scale.

Engine / DB · Intermediate · Data Engineer · Data Scientist · python

What is it?

Apache Spark is a distributed processing engine: it spreads work across many machines (or many cores) to transform datasets too large for a single machine. Its Python API, PySpark, offers a DataFrame interface that feels similar to pandas but scales to terabytes.
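
To make "spreading the work" concrete, here is a minimal pure-Python sketch of the split, aggregate, and merge pattern that a distributed group-by relies on. It is not Spark code: the sample rows, the helper names (`split`, `partial_sum`, `merge`), and the use of threads as stand-ins for cluster workers are all assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample rows: (country, amount). Not from any real dataset.
rows = [("ES", 10.0), ("MX", 5.0), ("ES", 2.5),
        ("AR", 7.0), ("MX", 1.0), ("AR", 3.0)]


def split(data, n_partitions):
    """Deal rows round-robin into n_partitions chunks."""
    return [data[i::n_partitions] for i in range(n_partitions)]


def partial_sum(partition):
    """Aggregate one partition on its own, as a single worker would."""
    totals = Counter()
    for country, amount in partition:
        totals[country] += amount
    return totals


def merge(partials):
    """Combine the per-partition results into the final answer."""
    final = Counter()
    for partial in partials:
        final.update(partial)  # adds values key by key
    return dict(final)


partitions = split(rows, 3)

# Threads stand in for cluster workers here; this shows the data flow,
# not a real speedup.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(partial_sum, partitions))

totals = merge(partials)
print(totals)
```

Each partition is summarised independently and only the small partial results are combined at the end, which is why this kind of aggregation can scale out across machines.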

What is it for?

  • Heavy ETL/ELT over data lakes (Parquet, Iceberg) and databases.
  • Large-scale batch processing and also streaming (Structured Streaming).
  • Preparing features and training models over huge datasets (with MLlib).

When to use it / when not

Use it when the data doesn't fit on one machine or when you need real parallelism over a cluster.

Think twice if your data fits in memory: running Spark is operationally complex, and for gigabyte-scale data DuckDB or Polars are simpler and often faster on a single node.

Get started in 1 minute

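# Run in a shell first (PySpark also needs a Java runtime installed):
#   pip install pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start (or reuse) a Spark session.
spark = SparkSession.builder.appName("demo").getOrCreate()

# Read a Parquet file into a distributed DataFrame.
df = spark.read.parquet("sales.parquet")

# Total amount per country, largest first.
(df.groupBy("country")
   .agg(F.sum("amount").alias("total"))
   .orderBy(F.desc("total"))
   .show())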

Official documentation

The official documentation is the source of truth. This page orients you; the depth is up to you.

Nº07 · Updated 2026-06-08