Nº06 · Storage

Apache Parquet

The columnar format that makes file-based analytics cheap and fast.

Format—Intro—Data Engineer·Data Scientist

What is it?

Apache Parquet is an open, binary, columnar file format designed to store tabular data efficiently for analytics. Unlike a CSV —which stores data row by row as text— Parquet groups values by column, adds compression, and stores statistics (min, max) for each block.

The result: much smaller files and much faster queries. It is the de facto format of modern analytical storage: data lakes, lakehouses (Iceberg, Delta), and nearly every OSS engine read it natively.

What is it for?

Storing tables for analytics. It is the natural output of a data pipeline: a fraction of the size of CSV and faster to query.
Selective reads (column pruning). A query using 3 of 40 columns reads only those 3 from disk. On a CSV you'd have to read the whole thing.
Interoperating across tools. pandas, Polars, DuckDB, Spark, Trino, and warehouses read and write Parquet — it is the common language for moving data without lock-in.
The basis of table formats. Iceberg and Delta Lake store their data as Parquet files underneath, adding transactions and versioning on top.

When to use it / when not

Use it whenever you store data to read it analytically: pipeline outputs, intermediate lake layers, datasets you share across tools. It is almost always a better choice than CSV at serious volumes.

Think twice for:

Transactional workloads with frequent row-by-row writes/updates: a database like PostgreSQL is right. Parquet is for reading, not mutating record by record.
Exchange with humans or text-expecting tools (open in Excel, a simple script): a CSV is more universal for that one-off case.
Updates/deletes on the lake: if you need to modify already-written data with guarantees, step up to Iceberg or Delta (which use Parquet underneath).

Get started in 1 minute

With DuckDB you convert a CSV to Parquet and see the difference, no server:

pip install duckdb

import duckdb

# Write a sample table to Parquet (compressed and columnar)
duckdb.sql("""
  COPY (SELECT * FROM (VALUES ('PE', 100), ('CL', 80)) AS t(country, amount))
  TO 'sales.parquet' (FORMAT parquet)
""")

# Read only the column you need — without loading the rest
print(duckdb.sql("SELECT country FROM 'sales.parquet'").df())

Quick trivia — test what you just read.

How much do you know about Apache Parquet?

Official documentation

The source of truth lives there. Here we orient you; the depth is up to you.

Open official docs ↗

What to learn next

DuckDB

The analytical database that runs inside your process — no server.

Introsql

Nº03Storage

Apache Iceberg

Tables with database guarantees on top of your data lake.

IntermediateOSS

Nº23Analysis

Polars

DataFrames in Rust: fast, parallel and with lazy evaluation.

Intropython

Nº06 · Updated 2026-06-25