Go beyond accuracy — understand what your model actually learned, find where it fails, and explain predictions to non-technical stakeholders.
A full model evaluation report with confusion matrix, ROC curve, precision-recall curve, feature importance chart, and SHAP values — the kind of analysis that gets models approved in enterprise settings.
Accuracy is often misleading. If 95% of transactions are legit, a model that always predicts "legit" has 95% accuracy — but catches zero fraud. Use precision, recall, and F1.
```python
from sklearn.metrics import (
    confusion_matrix, classification_report,
    ConfusionMatrixDisplay, roc_auc_score
)
import matplotlib.pyplot as plt

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
# [[TN FP]
#  [FN TP]]

# Plot it
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=['Benign', 'Malignant'])
disp.plot(cmap='Blues')
plt.title('Confusion Matrix')
plt.savefig('confusion_matrix.png', dpi=150)

# Full report
print(classification_report(y_test, y_pred,
                            target_names=['Benign', 'Malignant']))
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.3f}")
```
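The accuracy trap is easy to reproduce. A minimal sketch, using synthetic fraud labels and a majority-class baseline (both illustrative assumptions, not the dataset from this tutorial):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.05).astype(int)  # ~5% fraud (positive class)
y_always_legit = np.zeros_like(y_true)          # "model" that always predicts legit

print(f"Accuracy:  {accuracy_score(y_true, y_always_legit):.3f}")  # ~0.95
print(f"Precision: {precision_score(y_true, y_always_legit, zero_division=0):.3f}")
print(f"Recall:    {recall_score(y_true, y_always_legit):.3f}")    # 0.0 — catches no fraud
```

Accuracy looks excellent while recall is zero: the baseline never catches a single fraudulent transaction, which is exactly why precision, recall, and F1 matter here.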
ROC-AUC measures how well the model separates classes across all possible decision thresholds. PR-AUC is better for imbalanced datasets. Plot both to understand the tradeoffs.
```python
from sklearn.metrics import roc_curve, precision_recall_curve, auc

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# ROC curve
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
ax1.plot(fpr, tpr, label=f'AUC = {roc_auc:.3f}')
ax1.plot([0, 1], [0, 1], '--', color='gray', label='Random')
ax1.set_xlabel('False Positive Rate')
ax1.set_ylabel('True Positive Rate')
ax1.set_title('ROC Curve')
ax1.legend()

# Precision-recall curve
precision, recall, _ = precision_recall_curve(y_test, y_prob)
pr_auc = auc(recall, precision)
ax2.plot(recall, precision, label=f'PR-AUC = {pr_auc:.3f}')
ax2.set_xlabel('Recall')
ax2.set_ylabel('Precision')
ax2.set_title('Precision-Recall Curve')
ax2.legend()

plt.tight_layout()
plt.savefig('curves.png', dpi=150)
```
Feature importance tells you which columns drive predictions. SHAP (SHapley Additive exPlanations) goes further — it explains individual predictions, which is critical for medical, legal, and financial models.
```python
# pip install shap
import shap
import pandas as pd
import matplotlib.pyplot as plt

# Built-in feature importance (Random Forest)
importances = pd.Series(model.feature_importances_, index=feature_names)
importances.nlargest(10).plot(kind='barh')
plt.title('Top 10 Feature Importances')
plt.savefig('feature_importance.png')

# SHAP values (TreeExplainer is the fast path for tree models;
# KernelExplainer is the model-agnostic fallback)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# For binary classifiers, older SHAP versions return a list of
# per-class arrays; index 1 = the positive class

# Summary plot: global feature impact
shap.summary_plot(shap_values[1], X_test, feature_names=feature_names)

# Explain one prediction
shap.force_plot(explainer.expected_value[1], shap_values[1][0], X_test[0],
                feature_names=feature_names, matplotlib=True)
```
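Impurity-based `feature_importances_` can be biased toward high-cardinality features. A hedged alternative, using scikit-learn's `permutation_importance` instead of SHAP — the dataset and model below are assumptions for a self-contained sketch, not the tutorial's own:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=42, stratify=data.target)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Shuffle each feature and measure the drop in test score:
# a large drop means the model genuinely relies on that feature
result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=42)
perm = pd.Series(result.importances_mean, index=data.feature_names)
print(perm.nlargest(5))
```

Because it is computed on held-out data, permutation importance reflects what the model uses for generalization, not just what it split on during training.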
SHAP for trust: When a doctor asks "why did your model say cancer?", you need SHAP. A list of the features that pushed the prediction up or down gives actionable, defensible explanations.
By default, classifiers use a 0.5 probability threshold — but you can adjust it. In fraud detection you might accept lower precision to catch more fraud (lower the threshold). In medical screening you also want high recall: a false alarm is cheaper than a missed case.
```python
import numpy as np
import pandas as pd
from sklearn.metrics import precision_score, recall_score, f1_score

thresholds = np.arange(0.1, 0.9, 0.05)
results = []
for t in thresholds:
    y_pred_t = (y_prob >= t).astype(int)
    results.append({
        'threshold': round(t, 2),
        'precision': precision_score(y_test, y_pred_t),
        'recall': recall_score(y_test, y_pred_t),
        'f1': f1_score(y_test, y_pred_t)
    })

print(pd.DataFrame(results).to_string())

# For a cancer model, choose the threshold that maximizes recall
# (better to raise false alarms than to miss a real cancer)
```
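The sweep above can be turned into a selection rule: pick the highest threshold that still meets a recall floor, so precision is as high as possible without missing cases. A sketch under assumptions — the 0.95 recall target, the breast-cancer dataset, and the logistic-regression model are all illustrative, not from this tutorial:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]

chosen = None
for t in np.arange(0.1, 0.9, 0.05):
    y_pred_t = (y_prob >= t).astype(int)
    if recall_score(y_test, y_pred_t) >= 0.95:
        # overwrite on each pass: the last hit is the highest qualifying threshold
        chosen = {'threshold': round(float(t), 2),
                  'precision': round(precision_score(y_test, y_pred_t), 3),
                  'recall': round(recall_score(y_test, y_pred_t), 3)}

print(chosen)
```

In production you would tune the threshold on a validation split, not the test set, to avoid leaking test information into the decision rule.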
Before moving on, make sure you can answer these without looking: