Understand the ML workflow end-to-end and train your first classification model — from raw data to predictions.
A trained scikit-learn classifier that predicts whether a customer will churn, evaluated with accuracy, precision, and recall — plus a reusable training pipeline you can drop into any project.
Every ML project follows the same lifecycle: get data, explore it, preprocess it, train a model, evaluate it, and deploy it. Today we'll complete that full loop on a real dataset.
We'll use scikit-learn's built-in datasets so you don't need to download anything. The breast cancer dataset is a real medical classification problem with 30 features.
```bash
pip install scikit-learn pandas numpy matplotlib
```

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

print(f"Shape: {df.shape}")  # (569, 31)
print(f"Target counts: {df.target.value_counts()}")
print(df.describe())  # statistics for each column

# Check for missing values
print(f"Missing: {df.isnull().sum().sum()}")  # 0
```
Split the data, scale the features (many ML algorithms perform better when all features are on the same scale), then train a Random Forest classifier.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Features and target
X = df.drop('target', axis=1)
y = df['target']

# Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit AND transform train
X_test = scaler.transform(X_test)        # ONLY transform test

# Train Random Forest
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
#               precision    recall  f1-score   support
#            0       0.96      0.95      0.96        42
#            1       0.97      0.98      0.97        72
#     accuracy                           0.96       114
```
Critical rule: Only call fit_transform() on training data. On test data, call transform() only — using the scaler fitted on train. If you fit on test data, you're leaking test-set information into your model.
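One way to make this mistake structurally impossible is scikit-learn's `Pipeline`, which chains the scaler and model so that `fit()` only ever fits on training data and `predict()`/`score()` only transform. A minimal sketch, using the same dataset and hyperparameters as above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling lives inside the pipeline, so leakage can't happen by accident
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipe.fit(X_train, y_train)             # scaler is fit on X_train internally
accuracy = pipe.score(X_test, y_test)  # scaler only transforms X_test here
print(f"Accuracy: {accuracy:.2f}")
```

The pipeline is also a single object, which simplifies saving and loading later.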
A trained model is useless if you have to retrain every time. Save it with joblib so you can load it in your API or app.
```python
import joblib

# Save model and scaler together
pipeline = {'model': model, 'scaler': scaler, 'features': list(X.columns)}
joblib.dump(pipeline, 'model_pipeline.pkl')
print("Model saved to model_pipeline.pkl")

# Load and predict in a different script
pipeline = joblib.load('model_pipeline.pkl')
model = pipeline['model']
scaler = pipeline['scaler']

# Predict a single sample
sample = X_test[0:1]  # already scaled
pred = model.predict(sample)
prob = model.predict_proba(sample)
print(f"Prediction: {pred[0]} (confidence: {prob[0].max():.1%})")
```
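Note that the sample above was already scaled. In production, new samples arrive with raw feature values, so the saved scaler must be applied before predicting. A minimal sketch of that end-to-end flow (it rebuilds the same artifact first so it runs standalone; the filename `model_pipeline.pkl` matches the example above):

```python
import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Recreate the saved artifact (same steps as the lesson)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler().fit(X_train)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(scaler.transform(X_train), y_train)
joblib.dump({'model': model, 'scaler': scaler}, 'model_pipeline.pkl')

# Later, in another script: a raw (unscaled) sample arrives
pipeline = joblib.load('model_pipeline.pkl')
raw_sample = X_test[0:1]                           # raw feature values
scaled = pipeline['scaler'].transform(raw_sample)  # scale first!
pred = pipeline['model'].predict(scaled)
print(f"Prediction: {pred[0]}")
```

Forgetting the `transform()` step on raw inputs is a common deployment bug: the model still returns predictions, but on out-of-distribution values, so they are silently wrong.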
Before moving on, make sure you can answer these without looking: