Open-source curation · Python-first · in Spanish & English

The kiosk for learning data · open source

Find your path into the world of data

Pick your role, learn the skills that matter, and master them with 100% open-source tools. One at a time.

Curation, not duplication · we orient you and link to the official docs

The data journey

The path data follows end to end, from entering the ecosystem to becoming a chart. Each tool has its place in the flow. Click a tool to open its guide.

ingestion → storage → processing → analysis → visualization

  1. Ingestion

    Data enters the ecosystem, in real time or in batches.

  2. Storage

    Where data lives: databases, formats and tables.

  3. Processing

    Transform and combine data at scale.

  4. Query & analysis

    Query, explore and understand the data.

  5. Visualization

    Turn data into charts and dashboards.

They span the whole flow

Cross-cutting layers

Languages · Orchestration · Machine Learning · Governance · Infrastructure

Self-assessment

Not sure where to start?

Take your role's self-assessment: five questions on the core skills, and you walk away with your level and a suggested learning path — all with open-source tools.

Data Engineer or Data Scientist?

Profiles and what they share

Two specialties with a shared core. See what they have in common and where each diverges — and pick where to go next.

By layer: where it fits

By kind: what kind of thing it is

The catalogue

All editions

Nº 01 · Orchestration

Airbyte

Move data from any source to your warehouse with ready-made connectors.

Intro · OSS

Nº 02 · Orchestration

Apache Airflow

Orchestrate data pipelines as code: schedule, run and monitor.

Intro · Python

Nº 03 · Storage

Apache Iceberg

Tables with database guarantees on top of your data lake.

Intermediate · OSS

Nº 04 · Processing

Apache Kafka

The nervous system for real-time data.

Intermediate · Python

Nº 05 · Orchestration

Apache NiFi

Move data between systems with visual flows, no code required.

Intermediate · OSS

Nº 06 · Storage

Apache Parquet

The columnar format that makes file-based analytics cheap and fast.

Intro · OSS

Nº 07 · Processing

Apache Spark

The distributed engine for processing data at large scale.

Intermediate · Python

Nº 08 · Visualization

Apache Superset

Data exploration and BI dashboards, open-source and SQL-native.

Intro · OSS

Nº 09 · Storage

Ceph

Distributed storage at production scale: objects, blocks and files.

Intermediate · OSS

Nº 10 · Orchestration

Dagster

Pipeline orchestration centered on the data (assets), not just on tasks.

Intermediate · Python

Nº 11 · Processing

dbt

Transform data in your warehouse with SQL, treated like software.

Intro · SQL

Nº 12 · Storage

Delta Lake

Tables with ACID guarantees and time travel on top of your data lake.

Intermediate · OSS

Nº 13 · Infrastructure

Docker

Package any tool in your stack into a reproducible container.

Intro · OSS

Nº 14 · Analysis

DuckDB

The analytical database that runs inside your process — no server.

Intro · SQL

Nº 15 · Infrastructure

Git

The version control that underpins all reproducible data work.

Intro · OSS

Nº 16 · Governance

Great Expectations

Quality tests for your data: define expectations and validate every load.

Intermediate · Python

Nº 17 · Analysis

Jupyter

The interactive notebook where data analysis takes shape.

Intro · Python

Nº 18 · Visualization

Matplotlib

The foundational library for visualizing data with code in Python.

Intro · Python

Nº 19 · Storage

MinIO

S3-compatible object storage to run your own data lake.

Intro · OSS

Nº 20 · Analysis

NumPy

Python's numeric foundation: fast, vectorized arrays.

Intro · Python

Nº 21 · Governance

OpenMetadata

The open catalog to discover and trace the lineage of your data.

Intermediate · OSS

Nº 22 · Analysis

pandas

The Swiss Army knife for manipulating and analyzing tabular data in Python.

Intro · Python

Nº 23 · Analysis

Polars

DataFrames in Rust: fast, parallel and with lazy evaluation.

Intro · Python

Nº 24 · Storage

PostgreSQL

The reference open-source relational database — reliable and extensible.

Intro · SQL

Nº 25 · Languages

Python

The lingua franca of the data stack: from scripts to pipelines and ML.

Intro · Python

Nº 26 · Machine Learning

PyTorch

The flexible, Pythonic deep learning framework.

Intermediate · Python

Nº 27 · Machine Learning

scikit-learn

The toolbox for classic machine learning in Python.

Intro · Python

Nº 28 · Visualization

seaborn

Elegant statistical charts in one line, on top of Matplotlib.

Intro · Python

Nº 29 · Languages

SQL

The universal language for asking questions of your data.

Intro · SQL

Nº 30 · Processing

Trino

One SQL engine to query data wherever it lives.

Intermediate · SQL