The kiosk to learn data · open source
Find your path into the world of data
Pick your role, learn the skills that matter, and master them with 100% open-source tools. One at a time.
Curation, not duplication · we orient you and link the official docs
The data journey
The path data follows end to end, from entering the ecosystem to becoming a chart. Each tool sits in its place in the flow. Click to open its guide.
ingestion → storage → processing → analysis → visualization
01
Ingestion
Data enters the ecosystem, in real time or in batches.
02
Storage
Where data lives: databases, formats and tables.
03
Processing
Transform and combine data at scale.
04
Query & analysis
Query, explore and understand the data.
05
Visualization
Turn data into charts and dashboards.
They span the whole flow
Cross-cutting layers
Orchestration
Machine Learning
Governance
Self-assessment
Not sure where to start?
Take your role's self-assessment: five questions on the core skills, and you walk away with your level and a suggested learning path — all with open-source tools.
Data Engineer or Data Scientist?
Profiles and what they share
Two specialties with a shared core. See what they have in common and where each diverges, then pick where to go next.
Data Engineer
Move reliable data, on time and at scale.
Data Scientist
Analyze, model and communicate with data.
By layer
Where it fits
By kind
What kind of thing it is
The catalogue
All editions
Airbyte
Move data from any source to your warehouse with ready-made connectors.
Apache Airflow
Orchestrate data pipelines as code: schedule, run and monitor.
Apache Iceberg
Tables with database guarantees on top of your data lake.
Apache Kafka
The nervous system for real-time data.
Apache NiFi
Move data between systems with visual flows, no code required.
Apache Parquet
The columnar format that makes file-based analytics cheap and fast.
Apache Spark
The distributed engine for processing data at large scale.
Apache Superset
Data exploration and BI dashboards, open-source and SQL-native.
Ceph
Distributed storage at production scale: objects, blocks and files.
Dagster
Pipeline orchestration centered on the data (assets), not just on tasks.
dbt
Transform data in your warehouse with SQL, treated like software.
Delta Lake
Tables with ACID guarantees and time travel on top of your data lake.
Docker
Package any tool in the stack into a reproducible container.
DuckDB
The analytical database that runs inside your process — no server.
Git
The version control system that underpins all reproducible data work.
Great Expectations
Quality tests for your data: define expectations and validate every load.
Jupyter
The interactive notebook where data analysis takes shape.
Matplotlib
The foundational library for visualizing data with code in Python.
MinIO
S3-compatible object storage to run your own data lake.
NumPy
Python's numeric foundation: fast, vectorized arrays.
OpenMetadata
The open catalog to discover and trace the lineage of your data.
pandas
The Swiss Army knife for manipulating and analyzing tabular data in Python.
Polars
DataFrames in Rust: fast, parallel and with lazy evaluation.
PostgreSQL
The reference open-source relational database — reliable and extensible.
Python
The lingua franca of the data stack: from scripts to pipelines and ML.
PyTorch
The flexible, Pythonic deep learning framework.
scikit-learn
The toolbox for classic machine learning in Python.
seaborn
Elegant statistical charts in one line, on top of Matplotlib.
SQL
The universal language for asking questions of your data.
Trino
One SQL engine to query data wherever it lives.