Day 02 Core Concepts

Data Preprocessing and Feature Engineering

~1 hour · Hands-on · Precision AI Academy

Today’s Objective

Handle real-world messy data: missing values, categorical encoding, feature scaling, and creating new features that improve model performance.

Deliverable: a preprocessing pipeline using scikit-learn's Pipeline and ColumnTransformer that handles mixed-type data, ready to drop into any ML project with messy real-world data.

Handle Missing Values

Real-world data almost always has missing values. You have three options: drop rows, drop columns, or impute (fill in estimated values); the drop-based options are sketched after the code below. The right choice depends on how much data is missing and why it is missing.

preprocess.py
PYTHON
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

# Simulate messy data
df = pd.DataFrame({
    'age': [25, np.nan, 35, 28, np.nan],
    'salary': [50000, 60000, np.nan, 55000, 70000],
    'department': ['Eng', 'Sales', np.nan, 'Eng', 'HR'],
    'churned': [0, 1, 0, 1, 0]
})

# See missing count per column
print(df.isnull().sum())

# Impute numeric: replace NaN with column median
num_imputer = SimpleImputer(strategy='median')
df[['age', 'salary']] = num_imputer.fit_transform(df[['age', 'salary']])

# Impute categorical: replace NaN with most frequent value
cat_imputer = SimpleImputer(strategy='most_frequent')
df[['department']] = cat_imputer.fit_transform(df[['department']])
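
For reference, the drop-based options are one-liners. A minimal sketch, applied to the original df before imputation (the 50% threshold is illustrative, not a rule):

# Option 1: drop any row containing a missing value; fine when few rows are affected
df_dropped_rows = df.dropna()

# Option 2: drop columns that are mostly missing
# (keep only columns with at least 50% non-null values)
df_dropped_cols = df.dropna(axis=1, thresh=int(0.5 * len(df)))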

Encode Categorical Features

ML models need numbers, so categorical columns like 'department' must be encoded. Two main approaches: One-Hot Encoding for nominal categories (no inherent order) and Ordinal Encoding for ordinal categories (ordered, like Low/Medium/High). Note that scikit-learn's LabelEncoder is intended for target labels, not input features; use OrdinalEncoder for ordinal inputs.

preprocess.py (continued)
PYTHON
from sklearn.preprocessing import OneHotEncoder

# One-Hot Encoding (nominal: no order)
# With drop='first', 'department' becomes: department_HR, department_Sales
# ('Eng' is the dropped baseline, implied when both columns are zero)
ohe = OneHotEncoder(sparse_output=False, drop='first')
dept_encoded = ohe.fit_transform(df[['department']])
dept_df = pd.DataFrame(dept_encoded, columns=ohe.get_feature_names_out())
df = pd.concat([df.drop('department', axis=1), dept_df], axis=1)

# Ordinal Encoding (ordinal: has order, like Low/Medium/High)
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
# Maps Low=0, Medium=1, High=2
# e.g. oe.fit_transform([['Low'], ['High']]) -> [[0.], [2.]]

Why drop='first'? With three categories, you only need two columns; the third is implied when both are zero. This avoids perfect multicollinearity (the dummy variable trap), which matters especially for linear models.
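
A quick demonstration of the baseline behavior, as a minimal self-contained sketch:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

demo = pd.DataFrame({'department': ['Eng', 'HR', 'Sales']})
ohe = OneHotEncoder(sparse_output=False, drop='first')
encoded = pd.DataFrame(ohe.fit_transform(demo), columns=ohe.get_feature_names_out())
print(encoded)
#    department_HR  department_Sales
# 0            0.0               0.0   <- 'Eng': both zero, the implied baseline
# 1            1.0               0.0
# 2            0.0               1.0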

Build a sklearn Pipeline

Pipelines chain preprocessing and model training into one object. This prevents data leakage (preprocessing statistics are learned from the training data only) and makes deployment much cleaner: you call .fit() once and .predict() anywhere.

pipeline.py
PYTHON
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

numeric_features = ['age', 'salary']
categorical_features = ['department']

# Numeric: impute then scale
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical: impute then encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine both, routing each column list to its transformer
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Full pipeline: preprocessing + model
pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Split the raw data (df as defined at the top of preprocess.py,
# before any manual imputation or encoding)
X = df[numeric_features + categorical_features]
y = df['churned']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train on raw data; the pipeline handles all preprocessing
pipe.fit(X_train, y_train)
print(f"Accuracy: {pipe.score(X_test, y_test):.1%}")

Feature Engineering

Feature engineering means creating new columns from existing ones that give the model more predictive signal. This is where domain knowledge beats raw algorithmic power.

features.py
PYTHON
# Common feature engineering patterns
# (illustrative: assumes columns like 'years_exp' and 'join_date' exist in df)

# 1. Ratios (the +1 guards against division by zero)
df['salary_per_year_experience'] = df['salary'] / (df['years_exp'] + 1)

# 2. Interaction features
df['age_x_salary'] = df['age'] * df['salary']

# 3. Date features
df['join_date'] = pd.to_datetime(df['join_date'])
df['tenure_days'] = (pd.Timestamp.now() - df['join_date']).dt.days
df['join_month'] = df['join_date'].dt.month

# 4. Binning a continuous column into categories
df['age_group'] = pd.cut(
    df['age'],
    bins=[0, 25, 35, 50, 100],
    labels=['young', 'mid', 'senior', 'veteran']
)

# 5. Log transform for skewed distributions
df['log_salary'] = np.log1p(df['salary'])  # log1p handles zeros
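
A quick way to confirm the transform helped is to compare skewness before and after (values near zero indicate a roughly symmetric distribution):

# pandas' .skew() reports sample skewness; large positive values mean a long right tail
print(f"raw salary skew: {df['salary'].skew():.2f}")
print(f"log salary skew: {df['log_salary'].skew():.2f}")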

Supporting Resources

Go deeper with these references.

- scikit-learn Documentation (scikit-learn): complete API reference and user guide for scikit-learn estimators.
- Kaggle Learn: ML Course (Kaggle): free hands-on ML course with Jupyter notebooks and datasets.
- StatQuest with Josh Starmer (YouTube): clear visual explanations of ML algorithms, widely considered the best free resource.

Day 2 Checkpoint

Before moving on, make sure you can answer these without looking:

- When should you impute missing values instead of dropping rows or columns?
- Why does drop='first' in one-hot encoding avoid the dummy variable trap?
- How does putting preprocessing inside a Pipeline prevent data leakage?
- When would you apply a log transform to a feature?

Continue to Day 3
Model Selection and Hyperparameter Tuning