Nº03 · Storage

Apache Iceberg

Tables with database guarantees on top of your data lake.

Format · Intermediate · Data Engineer

Apache Iceberg is an open table format built to sit on top of object stores and distributed file systems (S3, GCS, HDFS). Rather than leaving data as loose Parquet files, Iceberg adds a metadata layer that enables ACID transactions, schema evolution, and time travel, all without moving or rewriting your existing files.

What is it?

Iceberg defines how data files (Parquet, ORC, Avro) are organized and tracked in a lake. Its metadata layer is the core piece: it records every snapshot of a table, which files belong to each snapshot, and what the schema looked like at that point in time. A catalog then maps each table name to its current metadata. This turns a collection of files into a table with database-like behavior:

ACID: commits are atomic, so no read ever sees a half-written state.
Schema evolution: add, rename, or drop columns without rewriting historical data.
Time travel: query the table as of any retained snapshot, for example with VERSION AS OF or TIMESTAMP AS OF in Spark SQL.
Partition evolution: change your partitioning strategy without manual migrations; existing files keep their old layout and new writes use the new one.
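
To make the list above concrete, here is a toy in-memory model of snapshot tracking in plain Python. It is not Iceberg's implementation or file layout; the names `Snapshot`, `ToyTable`, `commit`, and `files_as_of` are hypothetical. It only illustrates the idea that each commit records an immutable list of data files, and that time travel amounts to reading an older list.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """Immutable record of which data files made up the table at one commit."""
    snapshot_id: int
    files: tuple[str, ...]


@dataclass
class ToyTable:
    """Hypothetical stand-in for a table's metadata: a log of snapshots."""
    snapshots: list[Snapshot] = field(default_factory=list)

    def commit(self, new_files: list[str]) -> Snapshot:
        # A new snapshot references the previous files plus the new ones.
        # Data files that already exist are never modified or moved.
        previous = self.snapshots[-1].files if self.snapshots else ()
        snap = Snapshot(len(self.snapshots) + 1, previous + tuple(new_files))
        self.snapshots.append(snap)  # the single step that publishes the commit
        return snap

    def files_as_of(self, snapshot_id: int) -> tuple[str, ...]:
        # "Time travel": return the file list recorded by an older snapshot.
        for snap in self.snapshots:
            if snap.snapshot_id == snapshot_id:
                return snap.files
        raise KeyError(f"snapshot {snapshot_id} not found")


table = ToyTable()
table.commit(["part-0001.parquet"])
table.commit(["part-0002.parquet"])

first_files = table.files_as_of(1)        # state after the first commit
current_files = table.snapshots[-1].files  # state after the latest commit
print(first_files)
print(current_files)
```

Because older snapshots are never mutated, a reader holding snapshot 1 keeps seeing the same files even while later commits land.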

The format is a first-class citizen of the lakehouse ecosystem and is supported by Spark, Trino, Flink, DuckDB, and other engines.

What is it for?

High-quality ingestion pipelines. Write data in batches with atomic commits; if something fails, the previous snapshot remains intact.
Audit trails and rollback. Restore the exact state of a table at a past timestamp without maintaining extra backups, as long as the corresponding snapshot has not been expired.
Multi-engine analytics platforms. The same set of files can be read by Spark (ETL), Trino (ad-hoc SQL), and DuckDB (local exploration) without duplicating data.
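
The first two use cases can be sketched with a small in-memory model. This is an illustration only, under the assumption that a commit is published in one final step; `BatchTable`, `append`, `rollback_to`, and `require_positive_id` are hypothetical names and not part of any Iceberg API.

```python
class BatchTable:
    """Toy table: every successful append publishes a new snapshot."""

    def __init__(self):
        self.snapshots = {0: []}  # snapshot id -> rows visible at that snapshot
        self.current_id = 0       # pointer to the snapshot readers see

    def append(self, rows, validate):
        # Stage the new state privately; readers still see current_id.
        staged = list(self.snapshots[self.current_id])
        for row in rows:
            staged.append(validate(row))  # may raise before anything is published
        # Publish: register the new snapshot and move the pointer.
        new_id = max(self.snapshots) + 1
        self.snapshots[new_id] = staged
        self.current_id = new_id
        return new_id

    def rollback_to(self, snapshot_id):
        # Rollback only moves the pointer; no data is copied or restored.
        if snapshot_id not in self.snapshots:
            raise KeyError(f"snapshot {snapshot_id} not found")
        self.current_id = snapshot_id

    def rows(self):
        return list(self.snapshots[self.current_id])


def require_positive_id(row):
    if row["id"] <= 0:
        raise ValueError(f"invalid id: {row['id']}")
    return row


table = BatchTable()
table.append([{"id": 1}, {"id": 2}], require_positive_id)

try:
    # The second row is invalid, so the whole batch is discarded.
    table.append([{"id": 3}, {"id": -1}], require_positive_id)
except ValueError:
    pass
rows_after_failure = table.rows()

table.append([{"id": 3}], require_positive_id)  # snapshot 2
table.rollback_to(1)                             # back to the first good batch
rows_after_rollback = table.rows()
print(rows_after_failure, rows_after_rollback)
```

In a real deployment the rollback is carried out by the engine rather than by hand; Spark, for instance, exposes a rollback_to_snapshot procedure for Iceberg tables.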

When to use it / when not to

Use it when:

You manage large tables (hundreds of GB or more) queried by multiple engines.
You need ACID guarantees over a data lake (audit requirements, compliance, rollback).
Your schemas change frequently and you want to avoid costly rewrites.
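
Why schema changes can avoid rewrites: Iceberg tracks columns by stable IDs rather than by name. The sketch below is a simplified model of that idea in plain Python, not the real file format; `data_file_rows`, `schema_v1`, `schema_v2`, and `read` are hypothetical names chosen for illustration.

```python
# Rows as stored in a data file: values are keyed by column ID, not by name.
data_file_rows = [
    {1: 101, 2: "login"},
    {1: 102, 2: "purchase"},
]

# A schema maps each column ID to its current name.
schema_v1 = {1: "id", 2: "event"}

# Rename "event" to "event_type" and add a new column "country" (ID 3).
# Only the schema changes; data_file_rows is left untouched.
schema_v2 = {1: "id", 2: "event_type", 3: "country"}


def read(rows, schema):
    """Project stored rows through a schema.

    Columns that did not exist when the file was written read as None.
    """
    return [
        {name: row.get(col_id) for col_id, name in schema.items()}
        for row in rows
    ]


rows_v1 = read(data_file_rows, schema_v1)
rows_v2 = read(data_file_rows, schema_v2)
print(rows_v1)
print(rows_v2)
```

The same stored rows serve both schema versions, which is why a rename or an added column is a metadata-only operation in this model.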

Think twice when:

The dataset is small or read by a single process — plain Parquet is simpler and sufficient.
Your stack does not include a compatible engine — Spark and Trino are the most mature; without an engine that understands the Iceberg catalog, the format adds overhead without benefit.
You want quick local results without infrastructure — start with DuckDB directly in that case.

Get started in 1 minute

Create and read an Iceberg table locally with pyiceberg (no Spark cluster needed):

pip install "pyiceberg[sql-sqlite,pyarrow,duckdb]"

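import os
from datetime import datetime

import pyarrow as pa
from pyiceberg.catalog.sql import SqlCatalog

# The warehouse directory holds data and metadata files; create it up front.
warehouse_path = "/tmp/iceberg_warehouse"
os.makedirs(warehouse_path, exist_ok=True)

# Local catalog backed by SQLite
catalog = SqlCatalog(
    "local",
    **{
        "uri": "sqlite:///iceberg_local.db",
        "warehouse": f"file://{warehouse_path}",
    },
)

# These calls raise an error if the namespace or table already exists;
# delete iceberg_local.db and the warehouse directory to start over.
catalog.create_namespace("demo")

schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("event", pa.string()),
    pa.field("ts", pa.timestamp("us")),
])

table = catalog.create_table("demo.events", schema=schema)

# Write data; the timestamp column is built from datetime objects
batch = pa.table(
    {
        "id": [1, 2, 3],
        "event": ["login", "purchase", "logout"],
        "ts": [
            datetime(2026, 6, 8, 10, 0),
            datetime(2026, 6, 8, 10, 5),
            datetime(2026, 6, 8, 10, 10),
        ],
    },
    schema=schema,
)

table.append(batch)

# Read back as a PyArrow table
result = table.scan().to_arrow()
print(result)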

For production with Spark, see the official configuration guide.
