Nº25 · Languages

Python

The lingua franca of the data stack: from scripts to pipelines and ML.

Language · Intro · Base / cross-cutting · Data Engineer · Data Scientist · python

What is it?

Python is a general-purpose, interpreted, dynamically typed language created by Guido van Rossum and first released in 1991. In the data world it dominates not because it is the fastest or the strictest language, but because of its ecosystem: there is a Python library for almost every problem in the data stack, and most modern frameworks (orchestrators, ML platforms, transformation tools) expose their primary API in Python. That makes it the lingua franca of the field: the language that connects layers that would otherwise have no common interface.
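To make "dynamically typed" concrete, here is a minimal sketch. The helper name `describe` is hypothetical and exists only for illustration. It shows that a name can be rebound to objects of different types, and that a type error surfaces only when the offending line actually runs.

```python
def describe(value):
    """Return the runtime type name of whatever object is passed in."""
    return type(value).__name__


x = 42                      # x is bound to an int
kinds = [describe(x)]
x = "forty-two"             # the same name can be rebound to a str
kinds.append(describe(x))
x = [4, 2]                  # ...or to a list
kinds.append(describe(x))
print(kinds)

# Type errors are raised at run time, when the expression is evaluated.
try:
    "1" + 1
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)
```

The type belongs to the object, not to the name, which is why no declaration is needed and why mistakes like the one in the `try` block are caught late rather than before the program starts.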

→ Official docs: docs.python.org/3

What is it used for?

Pipeline glue. Orchestrators like Airflow or Prefect define their workflows in Python, and ingestion tools (Singer, dlt, the Airbyte CDK) are written or configured in Python. Python code is what wires these otherwise separate parts of the stack together.
Data analysis and transformation. Libraries like pandas, NumPy, and Polars let you explore, clean, and transform in-memory datasets. This is usually the starting point before deciding whether a workload needs to scale to Spark or stay in SQL.
Machine learning and AI. The ML ecosystem — scikit-learn, PyTorch, TensorFlow, XGBoost, Hugging Face — lives almost entirely in Python. From feature engineering to model serving, Python is the common thread.
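As a rough illustration of the "pipeline glue" role, the sketch below chains three small steps using only the standard library. The sample data, the `orders` table, and the function names `extract`, `transform`, and `load` are assumptions made up for this example; a real pipeline would typically read from files or APIs and be scheduled by an orchestrator.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in a real pipeline this would come from a file or an API.
RAW_CSV = """order_id,country,amount
1,ES,10.50
2,es,4.00
3,FR,
4,FR,20.00
"""


def extract(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Drop rows with a missing amount and normalize types and casing."""
    return [
        (int(r["order_id"]), r["country"].upper(), float(r["amount"]))
        for r in rows
        if r["amount"]
    ]


def load(conn, records):
    """Write cleaned records into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(row_count)
```

Each step is a plain function, so the same code can be run by hand, covered by tests, or wrapped as a task in whichever orchestrator the team uses.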

When to use it / when not to?

Use it for virtually any data stack task: ingestion, lightweight transformation, orchestration, automation scripting, model training and serving, ad-hoc exploration.

Think twice in these scenarios:

Pure analytical queries over structured data: SQL (in PostgreSQL, DuckDB, Trino, BigQuery…) is typically more concise, more readable for the team, and more efficient. Python is then best used to run the query, not to replace it.
Extreme performance or low-level concurrency: languages like Rust or Go win on throughput and memory usage. Tools like Polars or Apache Arrow narrow this gap from Python, but when the bottleneck is the interpreter itself, consider dedicated engines.
Massive-scale transformations: when the data volume exceeds a single machine's memory, distributed frameworks (Spark, Flink) or cloud warehouses do the heavy lifting; Python remains the interface, but not the engine.
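The first scenario above, using Python to run the query rather than replace it, can be sketched as follows. SQLite stands in here for whatever SQL engine is actually in use, and the `sales` table and its rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100.0), ("south", 50.0), ("north", 25.0), ("south", 75.0)],
)

# Option 1: pull every row into Python and aggregate by hand.
totals_python = {}
for region, amount in conn.execute("SELECT region, amount FROM sales"):
    totals_python[region] = totals_python.get(region, 0.0) + amount

# Option 2: let the engine aggregate; Python only sends the query
# and reads back one row per region.
query = "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
totals_sql = dict(conn.execute(query).fetchall())

print(totals_sql)
```

Both options give the same totals, but the second moves only the aggregated rows across the connection and states the intent in one line of SQL. The difference grows with table size.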

Start in 1 minute

Check your Python version and install a data library in an isolated environment:

# Check installed version (3.10+ recommended)
python3 --version

# Create an isolated virtual environment
python3 -m venv .venv
source .venv/bin/activate   # Linux/macOS
# .venv\Scripts\activate    # Windows

# Install a data library (example: pandas)
pip install pandas

# Confirm it works
python3 -c "import pandas; print(pandas.__version__)"

For production projects, consider package managers like uv or Poetry instead of bare pip: they record exact dependency versions in a lockfile, which makes installs reproducible across machines.
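The version and environment checks from the shell steps can also be done from inside Python, which is handy at the top of a script. This is a minimal sketch: the helper names `in_virtualenv` and `meets_minimum` are hypothetical, and the check relies on `sys.prefix` differing from `sys.base_prefix`, which is how environments created with `venv` behave.

```python
import sys


def in_virtualenv():
    """True when the interpreter runs inside a venv (prefix differs from the base install)."""
    return sys.prefix != sys.base_prefix


def meets_minimum(minimum=(3, 10)):
    """True when the running interpreter is at least the given (major, minor) version."""
    return sys.version_info[:2] >= minimum


print("Python", ".".join(map(str, sys.version_info[:3])))
print("Inside a virtual environment:", in_virtualenv())
print("Meets the 3.10+ recommendation:", meets_minimum())
```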

Official documentation

The source of truth lives there. Here we orient you; the depth is up to you.

→ Official docs: docs.python.org/3

What to learn next

pandas

The Swiss Army knife for manipulating and analyzing tabular data in Python.

Intro · python

Nº20 · Analysis

NumPy

Python's numeric foundation: fast, vectorized arrays.

Intro · python

Nº25 · Updated 2026-06-08