Day 01 Foundations

Your First ML Model with scikit-learn

Understand the ML workflow end-to-end and train your first classification model — from raw data to predictions.

~1 hour · Hands-on · Precision AI Academy

Today’s Objective


What you'll build: a trained scikit-learn classifier that predicts whether a tumor is malignant or benign, evaluated with accuracy, precision, and recall, plus a reusable training pipeline you can drop into any project.

The Machine Learning Workflow

Every ML project follows the same lifecycle: get data, explore it, preprocess it, train a model, evaluate it, and deploy it. Today we'll complete that full loop on a real dataset.

ML Workflow Steps
1. Load & Explore
Understand what data you have, its shape, types, missing values
2. Preprocess
Clean nulls, encode categories, scale numbers
3. Train/Test Split
Hold out 20% of data to evaluate on — never train on test data
4. Train Model
Fit model on training data
5. Evaluate
Check accuracy, precision, recall, F1 on test data
6. Iterate
Tune hyperparameters, try different algorithms

Install and Load Data

We'll use scikit-learn's built-in datasets so you don't need to download anything. The breast cancer dataset is a real medical classification problem with 30 features.

terminal
BASH
pip install scikit-learn pandas numpy matplotlib
train.py
PYTHON
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

print(f"Shape: {df.shape}") # (569, 31)
print(f"Target counts:\n{df.target.value_counts()}")
print(df.describe())  # summary statistics for each column

# Check for missing values
print(f"Missing: {df.isnull().sum().sum()}")  # 0

Preprocess, Train, and Evaluate

Split the data, scale features (ML algorithms work better when all features are on the same scale), then train a Random Forest classifier.

train.py (continued)
PYTHON
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Features and target
X = df.drop('target', axis=1)
y = df['target']

# Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train) # fit AND transform train
X_test = scaler.transform(X_test) # ONLY transform test

# Train Random Forest
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))  # rows: true class, cols: predicted
print(classification_report(y_test, y_pred))
#               precision    recall  f1-score   support
#            0       0.96      0.95      0.96        42
#            1       0.97      0.98      0.97        72
#     accuracy                           0.96       114

Critical rule: Only call fit_transform() on training data. On test data, call transform() only — using the scaler fitted on train. If you fit on test data, you're leaking information.
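To see why this matters, here is a small synthetic demonstration (toy data, not the lesson's dataset): the test set is deliberately drawn from a shifted distribution, and fitting the scaler on train + test together "leaks" the test statistics, hiding that shift.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=(80, 1))
test = rng.normal(loc=5.0, scale=1.0, size=(20, 1))  # shifted distribution

# Correct: mean and std come from train only
scaler = StandardScaler().fit(train)
test_ok = scaler.transform(test)

# Leaky: mean and std computed on train + test together
leaky = StandardScaler().fit(np.vstack([train, test]))
test_leaky = leaky.transform(test)

# The leaky scaler has already "seen" the test mean, so the shift shrinks
print(f"Correct scaled test mean: {test_ok.mean():.2f}")
print(f"Leaky scaled test mean:   {test_leaky.mean():.2f}")
```

With the correct scaler the test data still looks far from the training distribution, which is exactly the signal an honest evaluation needs; the leaky scaler pulls it toward zero and makes the model's job look easier than it is.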

Save and Load the Model

A trained model is useless if you have to retrain every time. Save it with joblib so you can load it in your API or app.

train.py (continued)
PYTHON
import joblib

# Save model and scaler together
pipeline = {'model': model, 'scaler': scaler, 'features': list(X.columns)}
joblib.dump(pipeline, 'model_pipeline.pkl')
print("Model saved to model_pipeline.pkl")

# Load and predict in a different script
pipeline = joblib.load('model_pipeline.pkl')
model = pipeline['model']
scaler = pipeline['scaler']

# Predict a single sample
sample = X_test[0:1]  # already scaled
pred = model.predict(sample)
prob = model.predict_proba(sample)
print(f"Prediction: {pred[0]} (confidence: {prob[0].max():.1%})")
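Note that the sample above was already scaled because it came from X_test. In a real "different script" (an API or app), new samples arrive raw, so you must run them through the saved scaler first. A minimal self-contained sketch (the model and scaler are retrained here just so the example runs on its own; the pickle filename matches the lesson's):

```python
import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Train and save, as in the lesson (condensed: fit on the full dataset)
data = load_breast_cancer()
scaler = StandardScaler().fit(data.data)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(
    scaler.transform(data.data), data.target
)
joblib.dump({'model': model, 'scaler': scaler,
             'features': list(data.feature_names)}, 'model_pipeline.pkl')

# Later, elsewhere: new data arrives RAW, so scale it before predicting
pipeline = joblib.load('model_pipeline.pkl')
raw_sample = data.data[0:1]                        # one unscaled row, shape (1, 30)
scaled = pipeline['scaler'].transform(raw_sample)  # reuse the fitted scaler
pred = pipeline['model'].predict(scaled)
print(f"Prediction: {pred[0]}")
```

Forgetting the `transform()` step is a common production bug: the model still returns predictions on unscaled input, they are just silently wrong.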

Supporting Resources

Go deeper with these references.

scikit-learn Documentation (scikit-learn): complete API reference and user guide for scikit-learn estimators.
Kaggle Learn: ML Course (Kaggle): free hands-on ML course with Jupyter notebooks and datasets.
StatQuest with Josh Starmer (YouTube): clear visual explanations of ML algorithms, widely considered the best free resource.

Day 1 Checkpoint

Before moving on, make sure you can answer these without looking:

1. Why must you fit the scaler on training data only, and what goes wrong if you fit it on test data?
2. What does stratify=y do in train_test_split, and why does it matter for classification?
3. What is the difference between precision and recall?
4. Why should the scaler be saved alongside the model?

Continue To Day 2
Data Preprocessing and Feature Engineering