Handle real-world messy data: missing values, categorical encoding, feature scaling, and creating new features that improve model performance.
This lesson builds a preprocessing pipeline using scikit-learn's Pipeline and ColumnTransformer that handles mixed-type data, ready to drop into any ML project with messy real-world data.
Real data almost always has missing values. You have three choices: drop rows, drop columns, or impute (fill in estimated values). The right choice depends on how much is missing and why.
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

# Simulate messy data
df = pd.DataFrame({
    'age': [25, np.nan, 35, 28, np.nan],
    'salary': [50000, 60000, np.nan, 55000, 70000],
    'department': ['Eng', 'Sales', np.nan, 'Eng', 'HR'],
    'churned': [0, 1, 0, 1, 0]
})

# See missing count per column
print(df.isnull().sum())

# Impute numeric: replace NaN with column median
num_imputer = SimpleImputer(strategy='median')
df[['age', 'salary']] = num_imputer.fit_transform(df[['age', 'salary']])

# Impute categorical: replace NaN with most frequent value
cat_imputer = SimpleImputer(strategy='most_frequent')
df[['department']] = cat_imputer.fit_transform(df[['department']])
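The code above covers imputation; the other two options, dropping rows or columns, can be sketched with pandas' dropna. The 80% threshold below is an illustrative choice, not a rule:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'age': [25, np.nan, 35, 28, np.nan],
    'salary': [50000, 60000, np.nan, 55000, 70000],
})

# Option 1: drop any row with a missing value (loses 3 of 5 rows here)
rows_dropped = df.dropna()

# Option 2: drop columns that are mostly missing
# thresh = minimum number of non-null values a column needs to survive
cols_dropped = df.dropna(axis=1, thresh=int(0.8 * len(df)))

print(rows_dropped.shape)          # (2, 2)
print(list(cols_dropped.columns))  # ['salary'] — 'age' has too many NaNs
```

Dropping is simple but discards signal; prefer imputation when the missing fraction is small and the column is predictive.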
ML models need numbers, so categorical columns like 'department' must be encoded. Two main approaches: One-Hot Encoding for nominal (unordered) categories, and Ordinal Encoding for ordered ones. (sklearn's LabelEncoder looks similar to OrdinalEncoder but is intended for target labels, not features.)
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# One-Hot Encoding (nominal: no order)
# With drop='first', 'department' becomes two columns:
# department_HR, department_Sales (Eng is the dropped baseline)
ohe = OneHotEncoder(sparse_output=False, drop='first')
dept_encoded = ohe.fit_transform(df[['department']])
dept_df = pd.DataFrame(dept_encoded, columns=ohe.get_feature_names_out())
df = pd.concat([df.drop('department', axis=1), dept_df], axis=1)

# Ordinal Encoding (ordinal: has order like Low/Med/High)
oe = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])  # Low=0, Medium=1, High=2
Why drop='first'? With 3 categories, you only need 2 columns — the third is implied. This avoids multicollinearity (the dummy variable trap).
Pipelines chain preprocessing and model training into one object. This prevents data leakage (transformers are fit only on training data, never on the test set) and makes deployment much cleaner: you call .fit() once and .predict() anywhere.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

numeric_features = ['age', 'salary']
categorical_features = ['department']

# Numeric: impute then scale
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical: impute then encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine both
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Full pipeline: preprocessing + model
pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Split the raw data — the pipeline handles all preprocessing
X = df.drop('churned', axis=1)
y = df['churned']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe.fit(X_train, y_train)
print(f"Accuracy: {pipe.score(X_test, y_test):.1%}")
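One concrete payoff of wrapping preprocessing inside the pipeline: cross-validation re-fits the imputer and scaler on each training fold only, so no fold's validation statistics leak into training. A minimal sketch with synthetic data (the column names and pipeline structure mirror the example above; the data itself is made up):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    'age': rng.normal(35, 10, n),
    'salary': rng.normal(60000, 15000, n),
    'department': rng.choice(['Eng', 'Sales', 'HR'], n),
})
X.loc[rng.choice(n, 20, replace=False), 'age'] = np.nan  # inject missing values
y = pd.Series(rng.integers(0, 2, n))

preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                      ('scaler', StandardScaler())]), ['age', 'salary']),
    ('cat', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                      ('encoder', OneHotEncoder(handle_unknown='ignore'))]),
     ['department']),
])
pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=50, random_state=42))
])

# Each fold fits the imputer/scaler on its own training split — no leakage
scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.1%} ± {scores.std():.1%}")
```

If you instead imputed and scaled the full DataFrame before splitting, every fold's "training" statistics would include validation rows, which is the leakage the pipeline design avoids.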
Feature engineering is the practice of deriving new columns from existing ones to give the model more predictive signal. This is where domain knowledge beats raw algorithm power.
# Common feature engineering patterns

# 1. Ratios
df['salary_per_year_experience'] = df['salary'] / (df['years_exp'] + 1)

# 2. Interaction features
df['age_x_salary'] = df['age'] * df['salary']

# 3. Date features
df['join_date'] = pd.to_datetime(df['join_date'])
df['tenure_days'] = (pd.Timestamp.now() - df['join_date']).dt.days
df['join_month'] = df['join_date'].dt.month

# 4. Binning continuous to categories
df['age_group'] = pd.cut(df['age'],
                         bins=[0, 25, 35, 50, 100],
                         labels=['young', 'mid', 'senior', 'veteran'])

# 5. Log transform for skewed distributions
df['log_salary'] = np.log1p(df['salary'])  # log1p handles zeros
Before moving on, make sure you can answer these without looking: