Handle real-world messy data: missing values, categorical encoding, feature scaling, and creating new features that improve model performance.
This lesson builds a preprocessing pipeline using scikit-learn's Pipeline and ColumnTransformer that handles mixed-type data, ready to drop into any ML project with messy real-world data.
Real data almost always has missing values. You have three choices: drop rows, drop columns, or impute (fill in estimated values). The right choice depends on how much is missing and why.
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

# Simulate messy data
df = pd.DataFrame({
    'age': [25, np.nan, 35, 28, np.nan],
    'salary': [50000, 60000, np.nan, 55000, 70000],
    'department': ['Eng', 'Sales', np.nan, 'Eng', 'HR'],
    'churned': [0, 1, 0, 1, 0]
})

# See missing count per column
print(df.isnull().sum())

# Impute numeric: replace NaN with column median
num_imputer = SimpleImputer(strategy='median')
df[['age', 'salary']] = num_imputer.fit_transform(df[['age', 'salary']])

# Impute categorical: replace NaN with most frequent value
cat_imputer = SimpleImputer(strategy='most_frequent')
df[['department']] = cat_imputer.fit_transform(df[['department']])
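The code above covers imputation; the other two options, dropping rows or columns, can be sketched with pandas' dropna. The 80% threshold below is an illustrative choice, not a rule:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'age': [25, np.nan, 35, 28, np.nan],
    'salary': [50000, 60000, np.nan, 55000, 70000],
})

# Option 1: drop any row with a missing value (loses 3 of 5 rows here)
rows_dropped = df.dropna()

# Option 2: drop columns that are mostly missing
# thresh = minimum number of non-null values a column needs to survive
cols_dropped = df.dropna(axis=1, thresh=int(0.8 * len(df)))

print(rows_dropped.shape)          # (2, 2)
print(list(cols_dropped.columns))  # ['salary'] — 'age' has too many NaNs
```

Dropping is simple but discards signal; prefer imputation when the missing fraction is small and the column is predictive.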
ML models need numbers, so categorical columns like 'department' must be encoded. Two main approaches: One-Hot Encoding for nominal (unordered) categories, and Ordinal Encoding for ordered ones. (sklearn's LabelEncoder looks similar to OrdinalEncoder but is intended for target labels, not features.)
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# One-Hot Encoding (nominal: no order)
# With drop='first', 'department' becomes two columns:
# department_HR, department_Sales (Eng is the dropped baseline)
ohe = OneHotEncoder(sparse_output=False, drop='first')
dept_encoded = ohe.fit_transform(df[['department']])
dept_df = pd.DataFrame(dept_encoded, columns=ohe.get_feature_names_out())
df = pd.concat([df.drop('department', axis=1), dept_df], axis=1)

# Ordinal Encoding (ordinal: has order like Low/Med/High)
oe = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])  # Low=0, Medium=1, High=2
Why drop='first'? With 3 categories, you only need 2 columns — the third is implied. This avoids multicollinearity (the dummy variable trap).
Pipelines chain preprocessing and model training into one object. This prevents data leakage (transformers are fit only on training data, never on the test set) and makes deployment much cleaner: you call .fit() once and .predict() anywhere.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

numeric_features = ['age', 'salary']
categorical_features = ['department']

# Numeric: impute then scale
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical: impute then encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine both
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Full pipeline: preprocessing + model
pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Split the raw data — the pipeline handles all preprocessing
X = df.drop('churned', axis=1)
y = df['churned']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe.fit(X_train, y_train)
print(f"Accuracy: {pipe.score(X_test, y_test):.1%}")
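One concrete payoff of wrapping preprocessing inside the pipeline: cross-validation re-fits the imputer and scaler on each training fold only, so no fold's validation statistics leak into training. A minimal sketch with synthetic data (the column names and pipeline structure mirror the example above; the data itself is made up):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    'age': rng.normal(35, 10, n),
    'salary': rng.normal(60000, 15000, n),
    'department': rng.choice(['Eng', 'Sales', 'HR'], n),
})
X.loc[rng.choice(n, 20, replace=False), 'age'] = np.nan  # inject missing values
y = pd.Series(rng.integers(0, 2, n))

preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                      ('scaler', StandardScaler())]), ['age', 'salary']),
    ('cat', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                      ('encoder', OneHotEncoder(handle_unknown='ignore'))]),
     ['department']),
])
pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=50, random_state=42))
])

# Each fold fits the imputer/scaler on its own training split — no leakage
scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.1%} ± {scores.std():.1%}")
```

If you instead imputed and scaled the full DataFrame before splitting, every fold's "training" statistics would include validation rows, which is the leakage the pipeline design avoids.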
Feature engineering is the practice of deriving new columns from existing ones to give the model more predictive signal. This is where domain knowledge beats raw algorithm power.
# Common feature engineering patterns

# 1. Ratios
df['salary_per_year_experience'] = df['salary'] / (df['years_exp'] + 1)

# 2. Interaction features
df['age_x_salary'] = df['age'] * df['salary']

# 3. Date features
df['join_date'] = pd.to_datetime(df['join_date'])
df['tenure_days'] = (pd.Timestamp.now() - df['join_date']).dt.days
df['join_month'] = df['join_date'].dt.month

# 4. Binning continuous to categories
df['age_group'] = pd.cut(df['age'],
                         bins=[0, 25, 35, 50, 100],
                         labels=['young', 'mid', 'senior', 'veteran'])

# 5. Log transform for skewed distributions
df['log_salary'] = np.log1p(df['salary'])  # log1p handles zeros
Before moving on, make sure you can answer these without looking: