Nº27 · Machine Learning

scikit-learn

The toolbox for classic machine learning in Python.

Library / framework · Intro · Data Scientist · python

What is it?

scikit-learn (commonly imported as sklearn) is the standard library for classic machine learning in Python. It provides consistent implementations of dozens of algorithms, from linear regression to random forests and SVMs, under a unified API: fit, predict, transform. It is the usual entry point into ML for practitioners working with tabular data.
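To make the unified API concrete, here is a minimal sketch, assuming the bundled iris dataset and default settings chosen only for illustration. A transformer (StandardScaler) uses fit and transform, while two unrelated classifiers (LogisticRegression and SVC) are driven by identical fit and predict calls.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Transformers learn their parameters with fit and apply them with transform.
scaler = StandardScaler()
X_scaled = scaler.fit(X).transform(X)

# Estimators with very different internals share the same fit/predict calls.
# max_iter=1000 is an illustrative choice to give the solver room to converge.
models = [LogisticRegression(max_iter=1000), SVC()]
predictions = {}
for model in models:
    model.fit(X_scaled, y)
    predictions[type(model).__name__] = model.predict(X_scaled[:5])

for name, preds in predictions.items():
    print(name, preds)
```

Because every estimator follows the same convention, swapping one model for another usually means changing a single line.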

What is it for?

Classification, regression, and clustering. Direct access to algorithms like LogisticRegression, RandomForestClassifier, KMeans, and GradientBoostingRegressor, all ready to use on tabular data out of the box.
Pipelines and preprocessing. Chain cleaning, scaling, encoding, and modeling steps into a single Pipeline object, ensuring preprocessing is applied consistently between training and inference.
Model evaluation. Built-in cross-validation, metrics (accuracy_score, roc_auc_score, mean_squared_error), and hyperparameter search (GridSearchCV, RandomizedSearchCV).
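The pipeline and evaluation pieces above can be combined in a few lines. The sketch below is one possible arrangement, not a recommended recipe: the step names "scaler" and "clf", the choice of LogisticRegression, and the 5-fold setting are all assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling and modeling live in one object, so the scaler is refit
# on the training portion of every fold and never sees the held-out rows.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 5-fold cross-validation; returns one accuracy value per fold.
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(f"Mean CV accuracy: {scores.mean():.2f}")
```

Passing the whole pipeline to cross_val_score, rather than scaling the data first, is what keeps preprocessing consistent between training and evaluation.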

When to use it / when not to?

Use it when working with structured tabular data and you need classic ML: binary or multiclass classification, regression, clustering, dimensionality reduction, or anomaly detection. It is the safe, battle-tested choice for datasets that fit in memory and for building quick baselines before exploring more complex approaches.

Think twice if your problem calls for deep learning (deep neural networks, image processing, or large-scale text), where PyTorch or TensorFlow are better suited. It is also not the right tool for large-scale distributed data: scikit-learn operates in memory on a single node, so that is where Spark MLlib or similar frameworks come in.

Start in 1 minute

pip install scikit-learn

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample data (150 rows, 4 features, 3 classes)
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

preds = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds):.2f}")

From here, the natural next step is to use Pipeline to chain a StandardScaler before the model, and GridSearchCV to tune hyperparameters. The official documentation includes worked examples for most algorithms.
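As a hedged sketch of that next step, the example below wraps the same RandomForestClassifier in a Pipeline with a StandardScaler and tunes it with GridSearchCV. The step names ("scaler", "model"), the grid values, and cv=3 are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(random_state=42)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {
    "model__n_estimators": [50, 100],
    "model__max_depth": [None, 3],
}

# Every combination is scored with 3-fold cross-validation on the training set;
# the best one is then refit on all training data.
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_train, y_train)

print(search.best_params_)
print(f"Test accuracy: {search.score(X_test, y_test):.2f}")
```

Tree-based models such as random forests are largely insensitive to feature scaling, so the scaler here mainly demonstrates the pattern; it matters more for models like LogisticRegression or SVC.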

Official documentation

The source of truth lives there. Here we orient you; the depth is up to you.

What to learn next

pandas

The Swiss Army knife for manipulating and analyzing tabular data in Python.

Intro · python

Nº20 · Analysis

NumPy

Python's numeric foundation: fast, vectorized arrays.

Intro · python

Nº27 · Updated 2026-06-08