Nº14 · Analysis

DuckDB

The analytical database that runs inside your process — no server.

Engine / DB · Intro · Data Engineer, Data Scientist · SQL

What is it?

DuckDB is an in-process (embedded) analytical database: no server to run, no infrastructure to manage — it runs inside your Python script, your notebook or your terminal. Think "SQLite, but for analytics": a fast SQL engine that lives next to your code.

What is it for?

Querying Parquet and CSV files directly with SQL, without loading them into a database first.
Fast local analysis over medium-to-large datasets (GBs) on a single machine, without paying for a data warehouse.
Working alongside pandas/Polars: query a DataFrame with SQL and get another DataFrame back, mixing the best of both worlds.

When to use it / when not

Use it when you want analytical SQL over local files or object storage, to prototype transformations, or to speed up exploration that gets slow in pandas.

Think twice if you need concurrent writes from many users, an always-on transactional service (that's PostgreSQL), or distributed petabyte-scale processing (that's Spark or Trino). DuckDB shines on a single node.

Get started in 1 minute

pip install duckdb pandas

import duckdb

# Query a Parquet file directly, without loading it into a table
df = duckdb.sql("""
    SELECT country, SUM(amount) AS total
    FROM 'sales.parquet'
    GROUP BY country
    ORDER BY total DESC
""").df()

print(df)

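That's it: no server, no upfront schema, no CREATE TABLE. DuckDB reads the file, takes the column types from the Parquet metadata, runs the SQL, and .df() hands the result back as a pandas DataFrame (so pandas needs to be installed).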
