Day 01 Foundations

Your First ML Model with scikit-learn

Understand the ML workflow end-to-end and train your first classification model — from raw data to predictions.

~1 hour · Hands-on · Precision AI Academy

Today’s Objective


What you'll build: a trained scikit-learn classifier that predicts whether a tumor is malignant or benign, evaluated with accuracy, precision, and recall, plus a reusable training pipeline you can drop into any project.

The Machine Learning Workflow

Every ML project follows the same lifecycle: get data, explore it, preprocess it, train a model, evaluate it, and deploy it. Today we'll complete that full loop on a real dataset.

ML Workflow Steps
1. Load & Explore
Understand what data you have, its shape, types, missing values
2. Preprocess
Clean nulls, encode categories, scale numbers
3. Train/Test Split
Hold out 20% of data to evaluate on — never train on test data
4. Train Model
Fit model on training data
5. Evaluate
Check accuracy, precision, recall, F1 on test data
6. Iterate
Tune hyperparameters, try different algorithms

Install and Load Data

We'll use scikit-learn's built-in datasets so you don't need to download anything. The breast cancer dataset is a real medical classification problem with 30 features.

terminal
BASH
pip install scikit-learn pandas numpy matplotlib
train.py
PYTHON
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

print(f"Shape: {df.shape}") # (569, 31)
print(f"Target counts:\n{df.target.value_counts()}")
print(df.describe())  # summary statistics for each column

# Check for missing values
print(f"Missing: {df.isnull().sum().sum()}")  # 0

Preprocess, Train, and Evaluate

Split the data, scale features (ML algorithms work better when all features are on the same scale), then train a Random Forest classifier.

train.py (continued)
PYTHON
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Features and target
X = df.drop('target', axis=1)
y = df['target']

# Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train) # fit AND transform train
X_test = scaler.transform(X_test) # ONLY transform test

# Train Random Forest
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))  # rows: true class, cols: predicted
print(classification_report(y_test, y_pred))
#               precision    recall  f1-score   support
#            0       0.96      0.95      0.96        42
#            1       0.97      0.98      0.97        72
#     accuracy                           0.96       114

Critical rule: Only call fit_transform() on training data. On test data, call transform() only — using the scaler fitted on train. If you fit on test data, you're leaking information.
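To see why this matters, here is a small synthetic demonstration (toy data, not the lesson's dataset): the test set is deliberately drawn from a shifted distribution, and fitting the scaler on train + test together "leaks" the test statistics, hiding that shift.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=(80, 1))
test = rng.normal(loc=5.0, scale=1.0, size=(20, 1))  # shifted distribution

# Correct: mean and std come from train only
scaler = StandardScaler().fit(train)
test_ok = scaler.transform(test)

# Leaky: mean and std computed on train + test together
leaky = StandardScaler().fit(np.vstack([train, test]))
test_leaky = leaky.transform(test)

# The leaky scaler has already "seen" the test mean, so the shift shrinks
print(f"Correct scaled test mean: {test_ok.mean():.2f}")
print(f"Leaky scaled test mean:   {test_leaky.mean():.2f}")
```

With the correct scaler the test data still looks far from the training distribution, which is exactly the signal an honest evaluation needs; the leaky scaler pulls it toward zero and makes the model's job look easier than it is.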

Save and Load the Model

A trained model is useless if you have to retrain every time. Save it with joblib so you can load it in your API or app.

train.py (continued)
PYTHON
import joblib

# Save model and scaler together
pipeline = {'model': model, 'scaler': scaler, 'features': list(X.columns)}
joblib.dump(pipeline, 'model_pipeline.pkl')
print("Model saved to model_pipeline.pkl")

# Load and predict in a different script
pipeline = joblib.load('model_pipeline.pkl')
model = pipeline['model']
scaler = pipeline['scaler']

# Predict a single sample
sample = X_test[0:1]  # already scaled
pred = model.predict(sample)
prob = model.predict_proba(sample)
print(f"Prediction: {pred[0]} (confidence: {prob[0].max():.1%})")
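Note that the sample above was already scaled because it came from X_test. In a real "different script" (an API or app), new samples arrive raw, so you must run them through the saved scaler first. A minimal self-contained sketch (the model and scaler are retrained here just so the example runs on its own; the pickle filename matches the lesson's):

```python
import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Train and save, as in the lesson (condensed: fit on the full dataset)
data = load_breast_cancer()
scaler = StandardScaler().fit(data.data)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(
    scaler.transform(data.data), data.target
)
joblib.dump({'model': model, 'scaler': scaler,
             'features': list(data.feature_names)}, 'model_pipeline.pkl')

# Later, elsewhere: new data arrives RAW, so scale it before predicting
pipeline = joblib.load('model_pipeline.pkl')
raw_sample = data.data[0:1]                        # one unscaled row, shape (1, 30)
scaled = pipeline['scaler'].transform(raw_sample)  # reuse the fitted scaler
pred = pipeline['model'].predict(scaled)
print(f"Prediction: {pred[0]}")
```

Forgetting the `transform()` step is a common production bug: the model still returns predictions on unscaled input, they are just silently wrong.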

Supporting Resources

Go deeper with these references.

scikit-learn Documentation (scikit-learn): complete API reference and user guide for scikit-learn estimators.
Kaggle Learn: ML Course (Kaggle): free hands-on ML course with Jupyter notebooks and datasets.
StatQuest with Josh Starmer (YouTube): clear visual explanations of ML algorithms, widely considered the best free resource.

Day 1 Checkpoint

Before moving on, make sure you can answer these without looking:

1. Why must you fit the scaler on training data only, and what goes wrong if you fit it on test data?
2. What does stratify=y do in train_test_split, and why does it matter for classification?
3. What is the difference between precision and recall?
4. Why should the scaler be saved alongside the model?

Continue To Day 2
Data Preprocessing and Feature Engineering