Nº06 · Storage
Apache Parquet
The columnar format that makes file-based analytics cheap and fast.
What is it?
Apache Parquet is an open, binary, columnar file format designed to store tabular data efficiently for analytics. Unlike a CSV, which stores data row by row as text, Parquet groups values by column, compresses them, and stores statistics (min, max) for each block of rows.
The result: much smaller files and much faster queries. It is the de facto format of modern analytical storage: data lakes and lakehouses (Iceberg, Delta) are built on it, and nearly every OSS engine reads it natively.
What is it for?
- Storing tables for analytics. It is the natural output of a data pipeline: a fraction of the size of CSV and faster to query.
- Selective reads (column pruning). A query using 3 of 40 columns reads only those 3 from disk. On a CSV you'd have to read the whole thing.
- Interoperating across tools. pandas, Polars, DuckDB, Spark, Trino, and warehouses read and write Parquet — it is the common language for moving data without lock-in.
- The basis of table formats. Iceberg and Delta Lake store their data as Parquet files underneath, adding transactions and versioning on top.
When to use it / when not
Use it whenever you store data to read it analytically: pipeline outputs, intermediate lake layers, datasets you share across tools. It is almost always a better choice than CSV at serious volumes.
Think twice for:
- Transactional workloads with frequent row-by-row writes/updates: a database like PostgreSQL is the right tool. Parquet is for reading, not for mutating record by record.
- Exchange with humans or text-expecting tools (open in Excel, a simple script): a CSV is more universal for that one-off case.
- Updates/deletes on the lake: if you need to modify already-written data with guarantees, step up to Iceberg or Delta (which use Parquet underneath).
Get started in 1 minute
With DuckDB you can write a small table to Parquet and read it back, with no server to run:
pip install duckdb
import duckdb
# Write a sample table to Parquet (compressed and columnar)
duckdb.sql("""
COPY (SELECT * FROM (VALUES ('PE', 100), ('CL', 80)) AS t(country, amount))
TO 'sales.parquet' (FORMAT parquet)
""")
# Read only the column you need, without loading the rest
# (.show() prints the result without requiring pandas, which .df() would need)
duckdb.sql("SELECT country FROM 'sales.parquet'").show()
Official documentation
The source of truth lives there. Here we orient you; the depth is up to you.
Open official docs ↗
Nº06 · Updated 2026-06-25