Skip to content
Unverified — AI-generated content. Help verify this page

Advanced Feature Engineering

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive model. Andrew Ng famously said: "Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering." This page covers advanced techniques that go beyond basic one-hot encoding and scaling.

Why Feature Engineering Matters

A well-engineered feature can replace an entire model upgrade:

ScenarioRaw FeaturesEngineered FeaturesImpact
House price predictionSquare footage, bedroomsPrice per sq ft in neighborhood+15% R²
Fraud detectionTransaction amountRatio to user's average spend+20% recall
Customer churnAccount age (days)Days since last login+25% AUC
Click predictionUser ID, page IDUser's historical CTR on similar pages+30% AUC

Target Encoding

The Problem with One-Hot Encoding

High-cardinality categorical features (ZIP codes, product IDs, user IDs) create thousands of binary columns with one-hot encoding. This causes:

  • Dimensionality explosion: 10,000 ZIP codes = 10,000 new features
  • Sparse matrices: Most values are 0
  • No ordinal information: ZIP 10001 and 10002 are treated as completely unrelated

Basic Target Encoding

Replace each category with the mean of the target for that category:

TE(c)=E[y|x=c]=1nci:xi=cyi

Problem: Categories with few samples have noisy estimates. A category with 1 sample gets encoded as exactly 0 or 1 — pure memorization.

Bayesian Smoothing (Regularized Target Encoding)

Blend the category mean with the global mean using a smoothing factor:

TEsmooth(c)=ncy¯c+my¯globalnc+m

where:

  • y¯c = mean target for category c
  • nc = number of samples in category c
  • y¯global = global mean target
  • m = smoothing parameter (higher = more regularization)

Interpretation: When ncm, trust the category mean. When ncm, fall back to the global mean.

From Scratch Implementation

python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

class TargetEncoderFromScratch:
    """Target encoding with Bayesian smoothing and CV regularization."""

    def __init__(self, smoothing=10, cv=5, random_state=42):
        self.smoothing = smoothing
        self.cv = cv
        self.random_state = random_state

    def fit_transform(self, X, y, columns):
        """Fit and transform using out-of-fold encoding to prevent leakage."""
        X = X.copy()
        self.encodings_ = {}
        self.global_mean_ = y.mean()

        for col in columns:
            # For transform (test data): use full training data
            self.encodings_[col] = self._compute_encodings(X[col], y)

            # For train data: use out-of-fold to prevent leakage
            X[f'{col}_target_enc'] = np.nan
            kf = KFold(n_splits=self.cv, shuffle=True,
                       random_state=self.random_state)

            for train_idx, val_idx in kf.split(X):
                encodings_fold = self._compute_encodings(
                    X.iloc[train_idx][col], y.iloc[train_idx]
                )
                X.iloc[val_idx, X.columns.get_loc(f'{col}_target_enc')] = (
                    X.iloc[val_idx][col].map(encodings_fold)
                )

            # Fill unknown categories with global mean
            X[f'{col}_target_enc'].fillna(self.global_mean_, inplace=True)

        return X

    def transform(self, X, columns):
        """Transform new data using fitted encodings."""
        X = X.copy()
        for col in columns:
            X[f'{col}_target_enc'] = (
                X[col].map(self.encodings_[col]).fillna(self.global_mean_)
            )
        return X

    def _compute_encodings(self, feature, target):
        """Compute smoothed target encoding."""
        stats = pd.DataFrame({'feature': feature, 'target': target})
        agg = stats.groupby('feature')['target'].agg(['mean', 'count'])

        smooth = (
            (agg['count'] * agg['mean'] + self.smoothing * self.global_mean_) /
            (agg['count'] + self.smoothing)
        )
        return smooth.to_dict()


# ---- Demo ----
np.random.seed(42)
n = 1000
df = pd.DataFrame({
    'city': np.random.choice(['NYC', 'LA', 'Chicago', 'Houston', 'Phoenix',
                               'Rare1', 'Rare2', 'Rare3'], n,
                              p=[0.25, 0.2, 0.15, 0.15, 0.1, 0.05, 0.05, 0.05]),
    'category': np.random.choice(['A', 'B', 'C', 'D'], n),
})
# Target depends on city
city_effect = {'NYC': 0.8, 'LA': 0.6, 'Chicago': 0.5, 'Houston': 0.4,
               'Phoenix': 0.3, 'Rare1': 0.9, 'Rare2': 0.1, 'Rare3': 0.5}
df['target'] = [np.random.binomial(1, city_effect.get(c, 0.5)) for c in df['city']]

encoder = TargetEncoderFromScratch(smoothing=10, cv=5)
df_encoded = encoder.fit_transform(df, df['target'], ['city', 'category'])

print("Target encoding results:")
print(df_encoded.groupby('city')['city_target_enc'].first().sort_values())
print(f"\nGlobal mean: {df['target'].mean():.3f}")
print("Note: Rare categories are pulled toward global mean by smoothing")

Scikit-learn Target Encoding

python
from sklearn.preprocessing import TargetEncoder

te = TargetEncoder(smooth='auto', cv=5, random_state=42)
X_encoded = te.fit_transform(df[['city', 'category']], df['target'])

print("Sklearn TargetEncoder:")
print(pd.DataFrame(X_encoded, columns=['city_enc', 'category_enc']).describe())

Interaction Features

Polynomial Features

Create all polynomial and interaction terms up to a given degree:

For features [x1,x2] with degree 2:

[1,x1,x2,x12,x1x2,x22]

Number of output features: (d+degreedegree) where d is the number of input features.

python
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

X_demo = np.array([[2, 3], [4, 5], [6, 7]])

poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)
X_poly = poly.fit_transform(X_demo)

print("Feature names:", poly.get_feature_names_out())
print("Shape:", X_demo.shape, "->", X_poly.shape)
print(X_poly)

Targeted Interaction Features

Instead of generating all interactions (combinatorial explosion), create domain-specific ones:

python
def create_interaction_features(df):
    """Create domain-specific interaction features."""
    df = df.copy()

    # Ratios (more interpretable than products)
    df['price_per_sqft'] = df['price'] / df['sqft'].clip(lower=1)
    df['rooms_per_sqft'] = df['rooms'] / df['sqft'].clip(lower=1)
    df['bathrooms_ratio'] = df['bathrooms'] / df['bedrooms'].clip(lower=1)

    # Products (capture joint effects)
    df['location_size'] = df['location_score'] * df['sqft']
    df['age_condition'] = df['age'] * df['condition_score']

    # Differences
    df['listed_minus_avg'] = df['listed_price'] - df['neighborhood_avg_price']

    # Grouping + aggregation (statistical features)
    df['price_vs_zip_median'] = df.groupby('zipcode')['price'].transform(
        lambda x: x / x.median()
    )

    return df

Feature Crosses for Categorical Variables

python
def create_feature_crosses(df, col1, col2):
    """Create a feature cross (combination) of two categorical columns."""
    df = df.copy()
    df[f'{col1}_x_{col2}'] = df[col1].astype(str) + '_' + df[col2].astype(str)
    return df

# Example: city + property_type -> "NYC_condo", "LA_house"
# This captures interactions like "condos are expensive in NYC but cheap in Houston"

Automated Feature Engineering with Featuretools

Featuretools automatically creates features from relational data using Deep Feature Synthesis (DFS).

Primitives

TypePrimitiveExample
Aggregationcount, sum, mean, max, stdMean transaction amount per customer
Transformyear, month, hour, absoluteExtract month from date
Aggregationnum_unique, mode, percent_trueNumber of unique products bought
python
import featuretools as ft
import pandas as pd
import numpy as np

# ---- Create sample relational data ----
np.random.seed(42)
n_customers = 200
n_transactions = 2000

customers = pd.DataFrame({
    'customer_id': range(n_customers),
    'signup_date': pd.date_range('2023-01-01', periods=n_customers, freq='D'),
    'age': np.random.randint(18, 70, n_customers),
    'region': np.random.choice(['East', 'West', 'Central'], n_customers),
})

transactions = pd.DataFrame({
    'transaction_id': range(n_transactions),
    'customer_id': np.random.choice(n_customers, n_transactions),
    'amount': np.random.exponential(50, n_transactions).round(2),
    'product_category': np.random.choice(['Electronics', 'Clothing', 'Food', 'Books'],
                                          n_transactions),
    'timestamp': pd.date_range('2023-01-01', periods=n_transactions, freq='2h'),
})

# ---- Create EntitySet ----
es = ft.EntitySet(id='retail')

es.add_dataframe(
    dataframe_name='customers',
    dataframe=customers,
    index='customer_id',
    time_index='signup_date'
)

es.add_dataframe(
    dataframe_name='transactions',
    dataframe=transactions,
    index='transaction_id',
    time_index='timestamp'
)

es.add_relationship('customers', 'customer_id', 'transactions', 'customer_id')

# ---- Deep Feature Synthesis ----
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name='customers',
    max_depth=2,                   # depth of primitive stacking
    agg_primitives=['count', 'mean', 'sum', 'std', 'max', 'min',
                    'num_unique', 'mode'],
    trans_primitives=['year', 'month', 'weekday'],
    verbose=True
)

print(f"\nGenerated {len(feature_defs)} features:")
for fd in feature_defs[:20]:
    print(f"  {fd}")

print(f"\nFeature matrix shape: {feature_matrix.shape}")
print(feature_matrix.head())

Selecting the Best Auto-Generated Features

python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Create a target (e.g., high-value customer)
customer_total_spend = transactions.groupby('customer_id')['amount'].sum()
y = (customer_total_spend > customer_total_spend.median()).astype(int)
y = y.reindex(feature_matrix.index).fillna(0).astype(int)

# Handle missing values
X_auto = feature_matrix.select_dtypes(include=[np.number])

pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('model', RandomForestClassifier(n_estimators=100, random_state=42))
])

scores = cross_val_score(pipe, X_auto, y, cv=5, scoring='accuracy')
print(f"\nAuto-engineered features accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")

# Compare with raw features only
X_raw = customers[['age']].values
scores_raw = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X_raw, y, cv=5, scoring='accuracy'
)
print(f"Raw features only accuracy: {scores_raw.mean():.4f} +/- {scores_raw.std():.4f}")

Feature Importance Comparison

Different methods measure "importance" differently. Always compare multiple methods.

Method 1: Permutation Importance (Model-Agnostic)

Randomly shuffle one feature and measure how much the score drops. Large drop = important.

PI(fj)=s1Kk=1Ksk,j

where s is the baseline score and sk,j is the score after the k-th permutation of feature j.

Method 2: Tree-Based Impurity Importance

Sum of impurity decreases (Gini or entropy) across all splits using feature j, weighted by the number of samples reaching each node.

Warning: Biased toward high-cardinality features (more split opportunities).

Method 3: SHAP Values

Based on cooperative game theory. Measures each feature's contribution to each individual prediction.

Comprehensive Comparison

python
from sklearn.inspection import permutation_importance
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
import numpy as np

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
feature_names = cancer.feature_names

# ---- Train a Random Forest ----
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X, y)

# ---- Method 1: Impurity-based importance ----
imp_impurity = rf.feature_importances_

# ---- Method 2: Permutation importance ----
perm_result = permutation_importance(rf, X, y, n_repeats=30,
                                      random_state=42, n_jobs=-1)
imp_perm = perm_result.importances_mean

# ---- Method 3: Coefficient-based (Logistic Regression) ----
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
lr = LogisticRegression(max_iter=5000, C=1.0, random_state=42)
lr.fit(X_scaled, y)
imp_coef = np.abs(lr.coef_[0])

# ---- Method 4: Mutual Information ----
from sklearn.feature_selection import mutual_info_classif
imp_mi = mutual_info_classif(X, y, random_state=42)

# ---- Normalize all to [0, 1] for comparison ----
def normalize(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-10)

importances = {
    'Impurity (RF)': normalize(imp_impurity),
    'Permutation (RF)': normalize(imp_perm),
    'Coefficient (LR)': normalize(imp_coef),
    'Mutual Info': normalize(imp_mi),
}

# ---- Compare top features across methods ----
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

for ax, (name, imp) in zip(axes.ravel(), importances.items()):
    sorted_idx = np.argsort(imp)[-15:]  # top 15
    ax.barh(range(15), imp[sorted_idx], color=f'C{list(importances.keys()).index(name)}')
    ax.set_yticks(range(15))
    ax.set_yticklabels(feature_names[sorted_idx], fontsize=8)
    ax.set_title(name)
    ax.set_xlabel('Normalized Importance')

plt.suptitle('Feature Importance: Four Methods Compared', fontsize=14)
plt.tight_layout()
plt.savefig('feature_importance_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

# ---- Rank correlation between methods ----
from scipy.stats import spearmanr

methods = list(importances.keys())
print("\nSpearman rank correlation between importance methods:")
print(f"{'':>20}", end='')
for m in methods:
    print(f"{m:>20}", end='')
print()

for m1 in methods:
    print(f"{m1:>20}", end='')
    for m2 in methods:
        corr, _ = spearmanr(importances[m1], importances[m2])
        print(f"{corr:>20.3f}", end='')
    print()

Feature Selection Strategies

Filter Methods (Before Training)

python
from sklearn.feature_selection import (SelectKBest, f_classif,
                                        mutual_info_classif,
                                        VarianceThreshold)

# Remove constant/near-constant features
vt = VarianceThreshold(threshold=0.01)
X_var = vt.fit_transform(X)
print(f"Variance threshold: {X.shape[1]} -> {X_var.shape[1]} features")

# Select top K by ANOVA F-score
selector = SelectKBest(f_classif, k=10)
X_best = selector.fit_transform(X, y)
selected = feature_names[selector.get_support()]
print(f"\nTop 10 features (F-test): {list(selected)}")

# Select by mutual information
selector_mi = SelectKBest(mutual_info_classif, k=10)
X_mi = selector_mi.fit_transform(X, y)
selected_mi = feature_names[selector_mi.get_support()]
print(f"Top 10 features (MI):     {list(selected_mi)}")

Wrapper Methods (With Training)

python
from sklearn.feature_selection import RFECV

# Recursive Feature Elimination with CV
rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=100, random_state=42),
    step=1,
    cv=5,
    scoring='accuracy',
    min_features_to_select=5,
    n_jobs=-1
)
rfecv.fit(X, y)

print(f"\nOptimal features: {rfecv.n_features_}")
print(f"Selected: {list(feature_names[rfecv.support_])}")
print(f"CV score with selected: {rfecv.cv_results_['mean_test_score'][rfecv.n_features_ - 5]:.4f}")

Embedded Methods (During Training)

python
from sklearn.linear_model import LassoCV
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()

# L1 regularization (Lasso) naturally zeros out unimportant features
lasso = LassoCV(cv=5, random_state=42)
lasso.fit(diabetes.data, diabetes.target)

feature_importance = np.abs(lasso.coef_)
selected = feature_importance > 0
print(f"\nLasso selected {selected.sum()} of {len(selected)} features")
print(f"Best alpha: {lasso.alpha_:.4f}")
for name, coef in zip(diabetes.feature_names, lasso.coef_):
    status = "SELECTED" if abs(coef) > 0 else "dropped"
    print(f"  {name:>10}: {coef:>8.3f} ({status})")

Time-Based Feature Engineering

python
def create_time_features(df, datetime_col='timestamp'):
    """Extract temporal features from a datetime column."""
    df = df.copy()
    dt = pd.to_datetime(df[datetime_col])

    # Cyclical encoding (preserves continuity: hour 23 is close to hour 0)
    df['hour_sin'] = np.sin(2 * np.pi * dt.dt.hour / 24)
    df['hour_cos'] = np.cos(2 * np.pi * dt.dt.hour / 24)
    df['day_sin'] = np.sin(2 * np.pi * dt.dt.dayofweek / 7)
    df['day_cos'] = np.cos(2 * np.pi * dt.dt.dayofweek / 7)
    df['month_sin'] = np.sin(2 * np.pi * dt.dt.month / 12)
    df['month_cos'] = np.cos(2 * np.pi * dt.dt.month / 12)

    # Binary features
    df['is_weekend'] = dt.dt.dayofweek.isin([5, 6]).astype(int)
    df['is_month_start'] = dt.dt.is_month_start.astype(int)
    df['is_month_end'] = dt.dt.is_month_end.astype(int)

    # Ordinal
    df['day_of_year'] = dt.dt.dayofyear
    df['week_of_year'] = dt.dt.isocalendar().week.astype(int)
    df['quarter'] = dt.dt.quarter

    return df

Feature Engineering Checklist

StepTechniqueWhen to Use
1Missing value indicatorsWhen missingness is informative
2Log/sqrt transformsSkewed distributions
3Target encodingHigh-cardinality categories
4Interaction featuresDomain knowledge suggests feature pairs matter
5Cyclical encodingPeriodic features (hour, day, month)
6Aggregation featuresRelational/grouped data
7Lag featuresTime series
8Feature selectionAfter generating many candidates

Key Takeaways

ConceptRemember
Target encoding replaces categories with target meansMust use out-of-fold encoding to prevent leakage
Bayesian smoothing regularizes rare categoriesm controls pull toward global mean
Interaction features capture joint effectsDomain-driven beats exhaustive generation
Featuretools automates relational feature creationDFS generates hundreds of candidates automatically
Always compare multiple importance methodsNo single method is universally correct
Permutation importance is model-agnosticWorks with any model, measures real predictive value
Impurity importance is biased toward high cardinalityUse permutation importance instead
Feature selection reduces overfittingFilter → Wrapper → Embedded, in order of speed

"What I cannot create, I do not understand." — Richard Feynman