Nº12 · Storage

Delta Lake

Tables with ACID guarantees and time travel on top of your data lake.

Format · Intermediate · Data Engineer

Delta Lake is an open-source table format that adds ACID transactions, schema evolution, and time travel on top of Parquet files in a data lake. It is one of the foundations of the lakehouse pattern (reliable tables on object storage). Databricks created and popularized it, but it is open source (a Linux Foundation project) and runs outside their cloud too.

What is it?

Delta Lake does not invent a new file format. It stores your data as Parquet and keeps a transaction log alongside it (the _delta_log directory) that records every change as an ordered, numbered commit. That log is the key piece, because it turns a folder of files into a table with database-like behavior:

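ACID: each commit is atomic, so a reader never sees a half-written state.
Updates, deletes, and merge: modify rows on object storage, not just append.
Schema evolution: add columns (or, where the engine supports column mapping, rename and drop them) without rewriting historical data files.
Time travel: query the table at a past version or timestamp, as long as the old data files and log entries are still retained (VACUUM and log cleanup eventually remove them).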
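
To make the role of the log concrete, here is a toy model in plain Python. It is a simplified sketch, not Delta's actual implementation: the helper names commit and live_files and the part-N.parquet file names are hypothetical, and a real _delta_log also carries metadata, protocol, and checkpoint entries. What it keeps from the real format is the idea of zero-padded, numbered JSON commit files holding add and remove actions, replayed in order.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def commit(log_dir: Path, version: int, actions: list) -> None:
    """Write one commit as newline-delimited JSON actions."""
    path = log_dir / f"{version:020d}.json"
    # Mode "x" fails if the file already exists, so two writers
    # cannot both claim the same version number.
    with open(path, "x") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")


def live_files(log_dir: Path, as_of: Optional[int] = None) -> set:
    """Replay commits in order; return the data files visible at `as_of`."""
    files = set()
    for path in sorted(log_dir.glob("*.json")):
        if as_of is not None and int(path.stem) > as_of:
            break
        for line in path.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


log_dir = Path(tempfile.mkdtemp()) / "_delta_log"
log_dir.mkdir()

# Version 0: initial write. Version 1: append.
# Version 2: an update that replaces part-0 with a rewritten part-2.
commit(log_dir, 0, [{"add": {"path": "part-0.parquet"}}])
commit(log_dir, 1, [{"add": {"path": "part-1.parquet"}}])
commit(log_dir, 2, [
    {"remove": {"path": "part-0.parquet"}},
    {"add": {"path": "part-2.parquet"}},
])

latest = live_files(log_dir)         # current state of the table
at_v0 = live_files(log_dir, as_of=0) # "time travel" to version 0
print(sorted(latest))
print(sorted(at_v0))
```

In this model, time travel is just replaying fewer commits, and an update never edits a file in place: it adds a new file and marks the old one as removed, which is why old versions stay readable until their files are cleaned up.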

What is it for?

Reliable tables on object storage. Pipelines writing to S3/GCS/ADLS with ACID guarantees instead of fragile loose files.
Updates, deletes, and merge. Apply incremental changes or upserts (CDC) over data that would be immutable in plain Parquet.
Time travel. Restore the exact state of a table at a past point for audit, reproducibility, or rollback.
Foundation of a lakehouse. The base for reliable analytical tables shared by multiple workloads (ETL, BI, ML) without duplicating data.

When to use it / when not to

Use it when:

You need ACID, updates/deletes/merge, and time travel over a data lake.
Your stack lives in the Spark/Databricks world, where Delta is the most native and mature option.

Think twice when:

The dataset is small and read by a single process: plain Parquet is simpler and sufficient.
Your ecosystem is multi-engine or more neutral. Iceberg solves almost the same problem (ACID, schema evolution, time travel) and tends to integrate better outside Spark. The practical difference between Delta and Iceberg is less about features than about ecosystem: pick Delta if you gravitate toward Spark/Databricks, Iceberg if you want multi-engine neutrality (Trino, Flink, DuckDB).

Get started in 1 minute

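Create and read a Delta table locally from pandas with the deltalake package, the Python bindings for delta-rs (no Spark cluster needed). pyarrow is listed explicitly because to_pandas() relies on it and newer deltalake releases may not install it automatically:

pip install deltalake pandas pyarrow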

import tempfile

import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Use a fresh folder on every run: the default mode="error" refuses to
# write where a Delta table already exists
path = tempfile.mkdtemp(prefix="delta_events_")

# Write a DataFrame as a Delta table in a local folder (version 0)
df = pd.DataFrame({
    "id": [1, 2, 3],
    "event": ["login", "purchase", "logout"],
})
write_deltalake(path, df)

# Read it back
dt = DeltaTable(path)
print(dt.to_pandas())

# Append more data (another commit in the log, version 1)
more = pd.DataFrame({"id": [4], "event": ["refund"]})
write_deltalake(path, more, mode="append")

# Time travel: read version 0 (before the append)
print(DeltaTable(path, version=0).to_pandas())

Nº12 · Updated 2026-06-26