Day 04 Advanced Topics

Model Evaluation and Interpretation

Go beyond accuracy — understand what your model actually learned, find where it fails, and explain predictions to non-technical stakeholders.

~1 hour · Hands-on · Precision AI Academy

Today’s Objective

By the end of today you'll have a full model evaluation report with a confusion matrix, ROC curve, precision-recall curve, feature importance chart, and SHAP values: the kind of analysis that gets models approved in enterprise settings.

Confusion Matrix and Classification Metrics

Accuracy is often misleading. If 95% of transactions are legit, a model that always predicts "legit" has 95% accuracy — but catches zero fraud. Use precision, recall, and F1.

evaluate.py
PYTHON
from sklearn.metrics import (
    confusion_matrix,
    classification_report,
    ConfusionMatrixDisplay,
    roc_auc_score,
)
import matplotlib.pyplot as plt

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
# [[TN  FP]
#  [FN  TP]]

# Plot it
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Benign', 'Malignant'])
disp.plot(cmap='Blues')
plt.title('Confusion Matrix')
plt.savefig('confusion_matrix.png', dpi=150)

# Full report
print(classification_report(y_test, y_pred, target_names=['Benign', 'Malignant']))
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.3f}")
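The accuracy trap is easy to demonstrate. The 95/5 class split and the always-"legit" baseline below are synthetic assumptions for illustration, not your trained model:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic imbalanced labels: roughly 95% legit (0), 5% fraud (1)
rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.05).astype(int)

# A "model" that always predicts legit
y_always_legit = np.zeros_like(y_true)

# High accuracy, yet it catches zero fraud
print(f"Accuracy: {accuracy_score(y_true, y_always_legit):.2f}")
print(f"Recall:   {recall_score(y_true, y_always_legit, zero_division=0):.2f}")
```

This is exactly why the classification report pairs accuracy with precision, recall, and F1.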

ROC Curve and Precision-Recall Curve

ROC-AUC measures how well the model separates classes across all possible decision thresholds. PR-AUC is better for imbalanced datasets. Plot both to understand the tradeoffs.

evaluate.py (continued)
PYTHON
from sklearn.metrics import roc_curve, precision_recall_curve, auc

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
ax1.plot(fpr, tpr, label=f'AUC = {roc_auc:.3f}')
ax1.plot([0, 1], [0, 1], '--', color='gray', label='Random')
ax1.set_xlabel('False Positive Rate')
ax1.set_ylabel('True Positive Rate')
ax1.set_title('ROC Curve')
ax1.legend()

# Precision-Recall Curve
precision, recall, _ = precision_recall_curve(y_test, y_prob)
pr_auc = auc(recall, precision)
ax2.plot(recall, precision, label=f'PR-AUC = {pr_auc:.3f}')
ax2.set_xlabel('Recall')
ax2.set_ylabel('Precision')
ax2.set_title('Precision-Recall Curve')
ax2.legend()

plt.tight_layout()
plt.savefig('curves.png', dpi=150)
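To see why PR-AUC is the better signal on imbalanced data, score a deliberately useless model on a dataset with few positives. The 2% positive rate and random scores below are synthetic assumptions; this sketch uses `average_precision_score`, a step-wise estimate of the area under the PR curve:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(7)
y = (rng.random(10_000) < 0.02).astype(int)  # ~2% positives
scores = rng.random(10_000)                  # scores carry no signal

# A useless model still looks "random" on ROC-AUC (~0.5), but PR-AUC
# collapses to roughly the positive base rate, exposing how little
# precision the model actually delivers.
print(f"ROC-AUC: {roc_auc_score(y, scores):.2f}")
print(f"PR-AUC:  {average_precision_score(y, scores):.2f}")
```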

Feature Importance and SHAP Values

Feature importance tells you which columns drive predictions. SHAP (SHapley Additive exPlanations) goes further — it explains individual predictions, which is critical for medical, legal, and financial models.

terminal
BASH
pip install shap
explain.py
PYTHON
import shap
import pandas as pd
import matplotlib.pyplot as plt

# Built-in feature importance (Random Forest)
importances = pd.Series(model.feature_importances_, index=feature_names)
importances.nlargest(10).plot(kind='barh')
plt.title('Top 10 Feature Importances')
plt.savefig('feature_importance.png')

# SHAP values (fast, exact explanations for tree models)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# For a binary classifier, older shap releases return a list of two arrays
# (one per class), so [1] selects contributions toward the positive class.
# Newer releases may return a single 3-D array instead -- check the shape.

# Summary plot: global feature impact
shap.summary_plot(shap_values[1], X_test, feature_names=feature_names)

# Explain one prediction
shap.force_plot(
    explainer.expected_value[1],
    shap_values[1][0],
    X_test[0],
    feature_names=feature_names,
    matplotlib=True,
)

SHAP for trust: When a doctor asks "why did your model say cancer?", you need SHAP. A list of the features that pushed the prediction up or down gives actionable, defensible explanations.
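A minimal sketch of how such an explanation can be assembled from per-feature SHAP values. The feature names and contribution numbers here are hypothetical, standing in for one row of the SHAP output above:

```python
import numpy as np

# Hypothetical SHAP values for a single prediction (one value per feature)
feature_names = ["radius_mean", "texture_mean", "perimeter_mean", "area_mean"]
sample_shap = np.array([0.31, -0.05, 0.12, -0.22])

# Rank features by absolute contribution, largest first
order = np.argsort(-np.abs(sample_shap))
for i in order:
    direction = "pushes toward malignant" if sample_shap[i] > 0 else "pushes toward benign"
    print(f"{feature_names[i]:<16} {sample_shap[i]:+.2f}  ({direction})")
```

A ranked list like this, in plain language, is usually what a clinician or auditor actually wants to see.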

Threshold Tuning

By default, classifiers turn probabilities into labels at a 0.5 threshold, but you can adjust it. In fraud detection you might accept lower precision to catch more fraud (lower the threshold). In medical screening you want high recall, even at the cost of more false alarms.

threshold.py
PYTHON
import numpy as np
import pandas as pd
from sklearn.metrics import precision_score, recall_score, f1_score

thresholds = np.arange(0.1, 0.9, 0.05)
results = []

for t in thresholds:
    y_pred_t = (y_prob >= t).astype(int)
    results.append({
        'threshold': round(t, 2),
        'precision': precision_score(y_test, y_pred_t),
        'recall': recall_score(y_test, y_pred_t),
        'f1': f1_score(y_test, y_pred_t),
    })

print(pd.DataFrame(results).to_string())

# For a cancer model, choose threshold that maximizes recall
# (better to have false alarms than miss a real cancer)
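One way to act on that comment is to pick the highest-precision threshold that still meets a recall floor. A sketch with synthetic data; the labels, probabilities, and the 0.95 recall floor are assumptions for illustration:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Synthetic labels and probabilities loosely correlated with them
rng = np.random.default_rng(42)
y_test = rng.integers(0, 2, 200)
y_prob = np.clip(y_test * 0.6 + rng.random(200) * 0.5, 0, 1)

best = None
for t in np.arange(0.1, 0.9, 0.05):
    y_pred_t = (y_prob >= t).astype(int)
    rec = recall_score(y_test, y_pred_t)
    prec = precision_score(y_test, y_pred_t, zero_division=0)
    # Keep the highest-precision threshold that still recalls >= 95% of positives
    if rec >= 0.95 and (best is None or prec > best[1]):
        best = (round(t, 2), prec, rec)

print(f"threshold={best[0]}, precision={best[1]:.2f}, recall={best[2]:.2f}")
```

Framing the search as "maximize precision subject to a recall floor" makes the clinical constraint explicit instead of burying it in a table of numbers.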

Supporting Resources

Go deeper with these references.

- scikit-learn Documentation: complete API reference and user guide for scikit-learn estimators.
- Kaggle Learn: ML Course: free hands-on ML course with Jupyter notebooks and datasets.
- StatQuest with Josh Starmer (YouTube): clear visual explanations of ML algorithms, widely considered the best free resource.

Day 4 Checkpoint

Before moving on, make sure you can answer the checkpoint questions without looking at the code.

Continue To Day 5
Deploy an ML Model as an API