The kiosk for learning data · open source

Find your path into the world of data

Pick your role, learn the skills that matter, and master them with 100% open-source tools. One at a time.

Curation, not duplication · we orient you and link to the official docs

The data journey

The path data follows end to end — from entering the ecosystem to becoming a chart. Each tool in its place in the flow. Click to open its guide.

ingestion → storage → processing → analysis → visualization

01 · Ingestion
Data enters the ecosystem, in real time or in batches.
Platform: Apache NiFi · Apache Kafka · Airbyte
02 · Storage
Where data lives: databases, formats and tables.
Engine / DB: PostgreSQL
Format: Apache Parquet · Apache Iceberg · Delta Lake
Storage: MinIO · Ceph
03 · Processing
Transform and combine data at scale.
Library / framework: dbt
Engine / DB: Apache Spark
04 · Query & analysis
Query, explore and understand the data.
Library / framework: pandas · Polars · NumPy
Engine / DB: Trino · DuckDB
Environment: Jupyter
05 · Visualization
Turn data into charts and dashboards.
Library / framework: Matplotlib · seaborn
Platform: Apache Superset

They span the whole flow

Cross-cutting layers

Languages

Language: Python · SQL

Orchestration

Platform: Apache Airflow · Dagster

Machine Learning

Library / framework: scikit-learn · PyTorch

Governance

Library / framework: Great Expectations

Platform: OpenMetadata

Infrastructure

Docker · Git

Self-assessment

Not sure where to start?

Take your role's self-assessment: five questions on the core skills, and you walk away with your level and a suggested learning path — all with open-source tools.

Data Engineer · Data Scientist · Base / cross-cutting

Data Engineer or Data Scientist?

Profiles and what they share

Two specialties with a shared core. See what they have in common and where each diverges — and pick where to go next.

Data Engineer

Move reliable data, on time and at scale.

Apache NiFi · Apache Kafka · Airbyte · PostgreSQL · MinIO · Ceph · Apache Parquet · Apache Iceberg · Delta Lake · Apache Spark · dbt · Apache Airflow · Dagster · Trino · Great Expectations · OpenMetadata

Shared fundamentals

What every profile needs. Start here.

Python · SQL · Git · Docker · DuckDB

Data Scientist

Analyze, model and communicate with data.

pandas · NumPy · Polars · Jupyter · scikit-learn · PyTorch · Matplotlib · seaborn · Apache Superset

By layer

Where it fits

Languages (2) · Analysis (5) · Storage (6) · Processing (4) · Orchestration (4) · Machine Learning (2) · Visualization (3) · Governance (2) · Infrastructure (2)

By kind

What kind of thing it is

Language (2) · Library / framework (9) · Engine / DB (4) · Platform (7) · Format (3) · Storage (2) · Environment (1) · Infrastructure (2)

The catalogue

All editions

Nº 01 · Orchestration

Airbyte

Move data from any source to your warehouse with ready-made connectors.

Intro · OSS

Nº 02 · Orchestration

Apache Airflow

Orchestrate data pipelines as code: schedule, run and monitor.

Intro · python

Nº 03 · Storage

Apache Iceberg

Tables with database guarantees on top of your data lake.

Intermediate · OSS

Nº 04 · Processing

Apache Kafka

The nervous system for real-time data.

Intermediate · python

Nº 05 · Orchestration

Apache NiFi

Move data between systems with visual flows, no code required.

Intermediate · OSS

Nº 06 · Storage

Apache Parquet

The columnar format that makes file-based analytics cheap and fast.

Intro · OSS

Nº 07 · Processing

Apache Spark

The distributed engine for processing data at large scale.

Intermediate · python

Nº 08 · Visualization

Apache Superset

Data exploration and BI dashboards, open-source and SQL-native.

Intro · OSS

Nº 09 · Storage

Ceph

Distributed storage at production scale: objects, blocks and files.

Intermediate · OSS

Nº 10 · Orchestration

Dagster

Pipeline orchestration centered on the data (assets), not just on tasks.

Intermediate · python

Nº 11 · Processing

dbt

Transform data in your warehouse with SQL, treated like software.

Intro · sql

Nº 12 · Storage

Delta Lake

Tables with ACID guarantees and time travel on top of your data lake.

Intermediate · OSS

Nº 13 · Infrastructure

Docker

Package any stack tool into a reproducible container.

Intro · OSS

Nº 14 · Analysis

DuckDB

The analytical database that runs inside your process, with no server.

Intro · sql

Nº 15 · Infrastructure

Git

The version control that underpins all reproducible data work.

Intro · OSS

Nº 16 · Governance

Great Expectations

Quality tests for your data: define expectations and validate every load.

Intermediate · python

Nº 17 · Analysis

Jupyter

The interactive notebook where data analysis takes shape.

Intro · python

Nº 18 · Visualization

Matplotlib

The foundational library for visualizing data with code in Python.

Intro · python

Nº 19 · Storage

MinIO

S3-compatible object storage to run your own data lake.

Intro · OSS

Nº 20 · Analysis

NumPy

Python's numeric foundation: fast, vectorized arrays.

Intro · python

Nº 21 · Governance

OpenMetadata

The open catalog to discover and trace the lineage of your data.

Intermediate · OSS

Nº 22 · Analysis

pandas

The Swiss Army knife for manipulating and analyzing tabular data in Python.

Intro · python

Nº 23 · Analysis

Polars

DataFrames written in Rust: fast, parallel and lazily evaluated.

Intro · python

Nº 24 · Storage

PostgreSQL

The reference open-source relational database, reliable and extensible.

Intro · sql

Nº 25 · Languages

Python

The lingua franca of the data stack: from scripts to pipelines and ML.

Intro · python

Nº 26 · Machine Learning

PyTorch

The flexible, Pythonic deep learning framework.

Intermediate · python

Nº 27 · Machine Learning

scikit-learn

The toolbox for classic machine learning in Python.

Intro · python

Nº 28 · Visualization

seaborn

Elegant statistical charts in one line, on top of Matplotlib.

Intro · python

Nº 29 · Languages

SQL

The universal language for asking questions of your data.

Intro · sql

Nº 30 · Processing

Trino

One SQL engine to query data wherever it lives.

Intermediate · sql
