Skip to content
Unverified — AI-generated content. Help verify this page

Univariate Categorical Analysis

Categorical variables are everywhere — product types, customer segments, regions, status codes, error messages, survey responses. They look simple, but they hide traps that can wreck your models. A column with 5 clean categories and one with 10,000 typo-ridden free-text entries are both "categorical," but they demand completely different treatment.

This page covers every tool you need: value counts, frequency tables, bar plots, why pie charts lie, entropy as a measure of diversity, Pareto analysis for the vital few, and how to detect and handle category imbalance.

The Dataset

We will use a synthetic e-commerce dataset with realistic categorical distributions.

python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

np.random.seed(42)
n = 5000

# Product category: power-law distribution (few popular, many niche)
categories = [
    "Electronics", "Clothing", "Home & Garden", "Sports", "Books",
    "Toys", "Automotive", "Health", "Food", "Beauty",
    "Office", "Pet Supplies", "Music", "Software", "Crafts",
]
cat_probs = np.array([0.25, 0.18, 0.12, 0.10, 0.08, 0.06, 0.05, 0.04, 0.03,
                       0.025, 0.02, 0.015, 0.01, 0.008, 0.002])
cat_probs = cat_probs / cat_probs.sum()
product_category = np.random.choice(categories, size=n, p=cat_probs)

# Payment method: moderate imbalance
payment_methods = ["Credit Card", "Debit Card", "PayPal", "Apple Pay", "Crypto"]
pay_probs = [0.45, 0.25, 0.18, 0.10, 0.02]
payment = np.random.choice(payment_methods, size=n, p=pay_probs)

# Region: roughly balanced
regions = ["North America", "Europe", "Asia", "South America", "Africa", "Oceania"]
reg_probs = [0.30, 0.28, 0.22, 0.10, 0.05, 0.05]
region = np.random.choice(regions, size=n, p=reg_probs)

# Order status: heavily imbalanced
statuses = ["Delivered", "Shipped", "Processing", "Cancelled", "Returned", "Fraud"]
status_probs = [0.65, 0.15, 0.10, 0.05, 0.04, 0.01]
order_status = np.random.choice(statuses, size=n, p=status_probs)

# Satisfaction rating: ordinal
ratings = ["Very Dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very Satisfied"]
rat_probs = [0.05, 0.10, 0.25, 0.35, 0.25]
satisfaction = np.random.choice(ratings, size=n, p=rat_probs)

df = pd.DataFrame({
    "product_category": product_category,
    "payment_method": payment,
    "region": region,
    "order_status": order_status,
    "satisfaction": satisfaction,
})

print(df.shape)
print(df.dtypes)
df.head()

Value Counts: The Starting Point

Value counts are to categorical variables what describe() is to numerical ones. Always look at both absolute and relative frequencies.

python
def detailed_value_counts(series, name="variable"):
    """Compute value counts with cumulative percentages."""
    vc = series.value_counts()
    result = pd.DataFrame({
        "count": vc,
        "pct": (vc / len(series) * 100).round(2),
        "cumulative_count": vc.cumsum(),
        "cumulative_pct": (vc.cumsum() / len(series) * 100).round(2),
    })
    print(f"\n{'='*60}")
    print(f"  Value Counts: {name}")
    print(f"  Unique values: {series.nunique()}")
    print(f"  Missing: {series.isna().sum()} ({series.isna().mean()*100:.1f}%)")
    print(f"  Mode: {series.mode().iloc[0]} ({vc.iloc[0]:,} / {vc.iloc[0]/len(series)*100:.1f}%)")
    print(f"{'='*60}")
    print(result.to_string())
    return result

for col in df.columns:
    detailed_value_counts(df[col], col)

Bar Plots: The Right Way to Visualize Categories

Bar plots are the correct default for categorical data. Horizontal bars work better when category names are long.

python
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Vertical bar — good for few categories
vc = df["payment_method"].value_counts()
axes[0, 0].bar(vc.index, vc.values, color="steelblue", edgecolor="black")
axes[0, 0].set_title("Payment Method (Vertical Bar)", fontsize=13)
for i, (val, count) in enumerate(zip(vc.index, vc.values)):
    axes[0, 0].text(i, count + 20, f"{count:,}\n({count/n*100:.1f}%)",
                    ha="center", fontsize=9)

# Horizontal bar — better for many categories or long names
vc = df["product_category"].value_counts()
axes[0, 1].barh(vc.index[::-1], vc.values[::-1], color="steelblue", edgecolor="black")
axes[0, 1].set_title("Product Category (Horizontal Bar)", fontsize=13)
for i, (val, count) in enumerate(zip(vc.values[::-1], vc.index[::-1])):
    axes[0, 1].text(val + 10, i, f" {val:,}", va="center", fontsize=9)

# Ordered by frequency — always do this
vc = df["order_status"].value_counts()
colors = ["#2ecc71" if v > n*0.1 else "#e74c3c" if v < n*0.02 else "#f39c12" for v in vc.values]
axes[1, 0].bar(vc.index, vc.values, color=colors, edgecolor="black")
axes[1, 0].set_title("Order Status (Color-coded by frequency)", fontsize=13)
axes[1, 0].set_ylabel("Count")

# Percentage bar — often more informative than counts
vc_pct = df["region"].value_counts(normalize=True) * 100
axes[1, 1].barh(vc_pct.index[::-1], vc_pct.values[::-1], color="steelblue", edgecolor="black")
axes[1, 1].set_title("Region (Percentage)", fontsize=13)
axes[1, 1].set_xlabel("Percentage (%)")
for i, pct in enumerate(vc_pct.values[::-1]):
    axes[1, 1].text(pct + 0.2, i, f" {pct:.1f}%", va="center", fontsize=9)

plt.tight_layout()
plt.savefig("bar_plots.png", dpi=150, bbox_inches="tight")
plt.show()

Why Pie Charts Are Almost Always Wrong

Pie charts are the most overused and least informative chart type. Human brains are terrible at comparing angles and areas.

python
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# The same data shown three ways
vc = df["region"].value_counts()

# Pie chart — hard to compare similar-sized slices
axes[0].pie(vc.values, labels=vc.index, autopct="%1.1f%%", startangle=90)
axes[0].set_title("Pie Chart\n(Can you tell North America vs Europe?)", fontsize=12)

# Bar chart — easy to compare
axes[1].barh(vc.index[::-1], vc.values[::-1], color="steelblue", edgecolor="black")
axes[1].set_title("Bar Chart\n(Much easier to compare)", fontsize=12)

# Worst case: too many categories
vc_many = df["product_category"].value_counts()
axes[2].pie(vc_many.values, labels=vc_many.index, autopct="%1.1f%%",
            startangle=90, textprops={"fontsize": 7})
axes[2].set_title("Pie Chart with Many Categories\n(Completely unreadable)", fontsize=12)

plt.tight_layout()
plt.savefig("pie_vs_bar.png", dpi=150, bbox_inches="tight")
plt.show()

When are pie charts acceptable?

Almost never. The only defensible use case is showing that one category dominates (>60%) everything else. Even then, a stacked bar at 100% width is clearer. If you must use a pie chart, never use more than 5 slices and always add percentage labels.

The Perception Problem

Frequency Tables and Cross-Tabulation

Beyond simple value counts, frequency tables let you compute marginal and joint distributions.

python
# Frequency table with multiple aggregations
def frequency_table(series, name="variable"):
    """Build a detailed frequency table."""
    vc = series.value_counts().sort_index()
    total = len(series)

    table = pd.DataFrame({
        "count": vc,
        "relative_freq": vc / total,
        "percentage": (vc / total * 100).round(2),
        "cumulative_pct": (vc.cumsum() / total * 100).round(2),
    })

    # Add expected count under uniform distribution
    n_cats = series.nunique()
    table["expected_uniform"] = total / n_cats
    table["observed_vs_expected"] = (table["count"] / table["expected_uniform"]).round(3)

    return table

freq = frequency_table(df["product_category"], "product_category")
print(freq.sort_values("count", ascending=False).to_string())

Entropy: Measuring Category Diversity

Entropy quantifies the "unpredictability" or "diversity" of a categorical distribution. Maximum entropy means all categories are equally likely. Minimum entropy (zero) means only one category exists.

python
from scipy.stats import entropy as sp_entropy

def category_entropy(series, name="variable"):
    """Compute entropy and normalized entropy for a categorical variable."""
    vc = series.value_counts(normalize=True)
    n_cats = series.nunique()

    # Shannon entropy (base 2, in bits)
    h = sp_entropy(vc, base=2)

    # Maximum possible entropy (uniform distribution)
    h_max = np.log2(n_cats)

    # Normalized entropy (0 = single category, 1 = perfectly uniform)
    h_norm = h / h_max if h_max > 0 else 0

    print(f"\n--- Entropy Analysis: {name} ---")
    print(f"  Unique categories:    {n_cats}")
    print(f"  Shannon entropy:      {h:.4f} bits")
    print(f"  Maximum entropy:      {h_max:.4f} bits")
    print(f"  Normalized entropy:   {h_norm:.4f} (1.0 = perfectly balanced)")
    print(f"  Interpretation:       ", end="")
    if h_norm > 0.9:
        print("Very balanced — near-uniform distribution")
    elif h_norm > 0.7:
        print("Moderately balanced — some categories dominate")
    elif h_norm > 0.5:
        print("Imbalanced — few categories carry most of the data")
    else:
        print("Highly imbalanced — dominated by one or two categories")

    return {"entropy": h, "max_entropy": h_max, "normalized_entropy": h_norm}

# Compare entropy across all categorical columns
entropy_results = {}
for col in df.columns:
    entropy_results[col] = category_entropy(df[col], col)

# Visualize
ent_df = pd.DataFrame(entropy_results).T
fig, ax = plt.subplots(figsize=(10, 5))
bars = ax.barh(ent_df.index, ent_df["normalized_entropy"],
               color=["#e74c3c" if v < 0.7 else "#2ecc71" for v in ent_df["normalized_entropy"]],
               edgecolor="black")
ax.set_xlabel("Normalized Entropy (1.0 = perfectly balanced)")
ax.set_title("Category Balance by Variable", fontsize=14)
ax.set_xlim(0, 1.05)
for bar, val in zip(bars, ent_df["normalized_entropy"]):
    ax.text(val + 0.01, bar.get_y() + bar.get_height()/2, f"{val:.3f}",
            va="center", fontsize=10)
plt.tight_layout()
plt.savefig("entropy_comparison.png", dpi=150, bbox_inches="tight")
plt.show()

Why Entropy Matters

Pareto Analysis (The 80/20 Rule)

Pareto analysis tells you which categories carry most of the weight. In practice, you often find that 20% of categories account for 80% of observations.

python
def pareto_chart(series, name="variable", figsize=(12, 6)):
    """Create a Pareto chart with cumulative line."""
    vc = series.value_counts()
    cumulative_pct = vc.cumsum() / vc.sum() * 100

    fig, ax1 = plt.subplots(figsize=figsize)

    # Bars
    bars = ax1.bar(range(len(vc)), vc.values, color="steelblue",
                   edgecolor="black", alpha=0.8)
    ax1.set_xticks(range(len(vc)))
    ax1.set_xticklabels(vc.index, rotation=45, ha="right")
    ax1.set_ylabel("Count", color="steelblue")
    ax1.tick_params(axis="y", labelcolor="steelblue")

    # Cumulative line
    ax2 = ax1.twinx()
    ax2.plot(range(len(vc)), cumulative_pct, color="crimson",
             marker="o", linewidth=2, markersize=6)
    ax2.set_ylabel("Cumulative %", color="crimson")
    ax2.tick_params(axis="y", labelcolor="crimson")
    ax2.set_ylim(0, 105)

    # 80% reference line
    ax2.axhline(y=80, color="gray", linestyle="--", alpha=0.7, label="80% line")

    # Find how many categories hit 80%
    n_80 = (cumulative_pct <= 80).sum() + 1
    ax1.axvspan(-0.5, n_80 - 0.5, alpha=0.1, color="crimson",
                label=f"Top {n_80} categories = ~80%")

    ax1.set_title(f"Pareto Chart: {name}\n"
                  f"Top {n_80} of {len(vc)} categories cover ~80% of data",
                  fontsize=14)
    ax1.legend(loc="upper left")
    plt.tight_layout()
    plt.savefig(f"pareto_{name}.png", dpi=150, bbox_inches="tight")
    plt.show()

    return n_80, len(vc)

n80, total = pareto_chart(df["product_category"], "Product Category")
print(f"\n80/20 Analysis: {n80} categories ({n80/total*100:.0f}%) cover ~80% of orders")

Category Imbalance Detection

Imbalanced categories cause real problems in machine learning — minority classes get ignored, models optimize for the majority, and your precision/recall on rare classes tanks.

python
def imbalance_report(series, name="variable"):
    """Detect and quantify category imbalance."""
    vc = series.value_counts()
    n_cats = series.nunique()
    expected = len(series) / n_cats

    # Imbalance ratio: majority count / minority count
    imbalance_ratio = vc.iloc[0] / vc.iloc[-1]

    # Chi-square goodness of fit (vs uniform)
    chi2, p_value = stats.chisquare(vc)

    # Gini impurity
    probs = vc / vc.sum()
    gini = 1 - (probs ** 2).sum()
    gini_max = 1 - 1/n_cats
    gini_normalized = gini / gini_max if gini_max > 0 else 0

    print(f"\n{'='*60}")
    print(f"  Imbalance Report: {name}")
    print(f"{'='*60}")
    print(f"  Categories:        {n_cats}")
    print(f"  Most common:       {vc.index[0]} ({vc.iloc[0]:,}, {vc.iloc[0]/len(series)*100:.1f}%)")
    print(f"  Least common:      {vc.index[-1]} ({vc.iloc[-1]:,}, {vc.iloc[-1]/len(series)*100:.1f}%)")
    print(f"  Imbalance ratio:   {imbalance_ratio:.1f}:1")
    print(f"  Gini impurity:     {gini:.4f} (normalized: {gini_normalized:.4f})")
    print(f"  Chi² vs uniform:   {chi2:.1f} (p={p_value:.2e})")

    # Flag severely underrepresented categories
    threshold = expected * 0.1  # less than 10% of expected
    rare = vc[vc < threshold]
    if len(rare) > 0:
        print(f"\n  WARNING: {len(rare)} categories have < 10% of expected count:")
        for cat, count in rare.items():
            print(f"    - {cat}: {count:,} (expected ~{expected:.0f})")

    return {
        "imbalance_ratio": imbalance_ratio,
        "gini": gini,
        "gini_normalized": gini_normalized,
    }

for col in df.columns:
    imbalance_report(df[col], col)

Imbalance Severity Scale

Imbalance RatioSeverityImpact on ML
1:1 to 3:1MildUsually fine without special treatment
3:1 to 10:1ModerateUse stratified sampling, consider class weights
10:1 to 100:1SevereSMOTE, cost-sensitive learning, specialized metrics
> 100:1ExtremeAnomaly detection framing may be better than classification

Handling Dirty Categories

Real-world categorical data is messy. The same category appears as "New York", "new york", "NY", "N.Y.", and "NewYork".

python
# Simulate dirty data
dirty_categories = np.random.choice(
    ["Electronics", "electronics", "ELECTRONICS", "Electrnics",
     "Clothing", "clothing", "Clothng", "Clothes",
     "Home & Garden", "Home and Garden", "Home&Garden", "home garden"],
    size=500,
)
dirty_series = pd.Series(dirty_categories)

print("Before cleaning:")
print(dirty_series.value_counts())
print(f"Unique values: {dirty_series.nunique()}")

# Step 1: Normalize case and whitespace
cleaned = dirty_series.str.lower().str.strip()

# Step 2: Remove special characters
cleaned = cleaned.str.replace(r"[&+]", " and ", regex=True)
cleaned = cleaned.str.replace(r"\s+", " ", regex=True)

# Step 3: Fuzzy matching for typos
from difflib import get_close_matches

canonical = ["electronics", "clothing", "home and garden"]

def fuzzy_map(value, canonical_list, cutoff=0.7):
    matches = get_close_matches(value, canonical_list, n=1, cutoff=cutoff)
    return matches[0] if matches else value

cleaned = cleaned.apply(lambda x: fuzzy_map(x, canonical))

print("\nAfter cleaning:")
print(cleaned.value_counts())
print(f"Unique values: {cleaned.nunique()}")

Complete Categorical Profile Function

python
def categorical_profile(series, name="variable", top_n=15, figsize=(16, 10)):
    """Generate a complete categorical variable profile."""
    fig = plt.figure(figsize=figsize)

    vc = series.value_counts()
    vc_top = vc.head(top_n)
    pct = vc / len(series) * 100
    cumulative_pct = vc.cumsum() / len(series) * 100
    probs = vc / vc.sum()

    # Layout
    ax1 = fig.add_subplot(2, 2, 1)  # Bar plot
    ax2 = fig.add_subplot(2, 2, 2)  # Pareto
    ax3 = fig.add_subplot(2, 2, 3)  # Proportion with CI
    ax4 = fig.add_subplot(2, 2, 4)  # Stats text

    # 1. Horizontal bar
    ax1.barh(vc_top.index[::-1], vc_top.values[::-1],
             color="steelblue", edgecolor="black")
    ax1.set_title(f"Top {min(top_n, len(vc))} Categories", fontsize=12)
    ax1.set_xlabel("Count")

    # 2. Pareto (cumulative)
    ax2.bar(range(len(vc_top)), vc_top.values, color="steelblue",
            edgecolor="black", alpha=0.7)
    ax2_twin = ax2.twinx()
    ax2_twin.plot(range(len(vc_top)), cumulative_pct.head(top_n),
                  color="crimson", marker="o", linewidth=2)
    ax2_twin.axhline(80, color="gray", linestyle="--", alpha=0.5)
    ax2_twin.set_ylabel("Cumulative %")
    ax2.set_xticks(range(len(vc_top)))
    ax2.set_xticklabels(vc_top.index, rotation=45, ha="right", fontsize=8)
    ax2.set_title("Pareto (Cumulative %)", fontsize=12)

    # 3. Proportion with 95% confidence intervals
    n_total = len(series)
    for i, (cat, count) in enumerate(vc_top.items()):
        p = count / n_total
        se = np.sqrt(p * (1 - p) / n_total)
        ci_low = max(0, p - 1.96 * se)
        ci_high = min(1, p + 1.96 * se)
        ax3.barh(i, p, color="steelblue", edgecolor="black", alpha=0.7)
        ax3.errorbar(p, i, xerr=[[p - ci_low], [ci_high - p]],
                     fmt="none", color="black", capsize=4)
    ax3.set_yticks(range(len(vc_top)))
    ax3.set_yticklabels(vc_top.index)
    ax3.set_xlabel("Proportion (with 95% CI)")
    ax3.set_title("Proportions with Confidence Intervals", fontsize=12)

    # 4. Stats summary
    ax4.axis("off")
    h = sp_entropy(probs, base=2)
    h_max = np.log2(series.nunique())
    gini = 1 - (probs ** 2).sum()
    imb = vc.iloc[0] / vc.iloc[-1] if vc.iloc[-1] > 0 else float("inf")

    stats_text = (
        f"Total observations:   {len(series):,}\n"
        f"Unique categories:    {series.nunique()}\n"
        f"Missing values:       {series.isna().sum()}\n\n"
        f"Mode:                 {vc.index[0]}\n"
        f"Mode frequency:       {vc.iloc[0]:,} ({pct.iloc[0]:.1f}%)\n"
        f"Least common:         {vc.index[-1]}\n"
        f"Least common freq:    {vc.iloc[-1]:,} ({pct.iloc[-1]:.1f}%)\n\n"
        f"Imbalance ratio:      {imb:.1f}:1\n"
        f"Shannon entropy:      {h:.3f} bits\n"
        f"Normalized entropy:   {h/h_max:.3f}\n"
        f"Gini impurity:        {gini:.3f}"
    )
    ax4.text(0.05, 0.5, stats_text, transform=ax4.transAxes, fontsize=11,
             fontfamily="monospace", verticalalignment="center",
             bbox=dict(boxstyle="round", facecolor="lightyellow", alpha=0.8))
    ax4.set_title("Summary Statistics", fontsize=12)

    fig.suptitle(f"Categorical Profile: {name}", fontsize=16, fontweight="bold")
    plt.tight_layout()
    plt.savefig(f"cat_profile_{name}.png", dpi=150, bbox_inches="tight")
    plt.show()

for col in df.columns:
    categorical_profile(df[col], col)

Ordinal vs. Nominal: It Matters

python
# Ordinal encoding that preserves order
satisfaction_order = {
    "Very Dissatisfied": 1,
    "Dissatisfied": 2,
    "Neutral": 3,
    "Satisfied": 4,
    "Very Satisfied": 5,
}

df["satisfaction_ordinal"] = df["satisfaction"].map(satisfaction_order)

# Now you can meaningfully compute mean, median, etc.
print(f"Mean satisfaction: {df['satisfaction_ordinal'].mean():.2f}")
print(f"Median satisfaction: {df['satisfaction_ordinal'].median():.1f}")

# Distribution of ordinal variable
fig, ax = plt.subplots(figsize=(10, 5))
order = list(satisfaction_order.keys())
vc = df["satisfaction"].value_counts().reindex(order)
colors = plt.cm.RdYlGn(np.linspace(0.2, 0.8, len(order)))
ax.bar(order, vc.values, color=colors, edgecolor="black")
ax.set_title("Satisfaction Distribution (Ordered)", fontsize=14)
ax.set_ylabel("Count")
for i, (cat, count) in enumerate(zip(order, vc.values)):
    ax.text(i, count + 20, f"{count:,}\n({count/n*100:.1f}%)", ha="center", fontsize=9)
plt.tight_layout()
plt.savefig("ordinal_distribution.png", dpi=150, bbox_inches="tight")
plt.show()

Rare Category Strategies

python
def group_rare_categories(series, min_pct=2.0, other_label="Other"):
    """Group categories below a minimum percentage threshold."""
    vc = series.value_counts(normalize=True) * 100
    rare_cats = vc[vc < min_pct].index.tolist()

    grouped = series.copy()
    grouped[grouped.isin(rare_cats)] = other_label

    print(f"Grouped {len(rare_cats)} rare categories into '{other_label}':")
    for cat in rare_cats:
        print(f"  - {cat} ({vc[cat]:.2f}%)")
    print(f"\nBefore: {series.nunique()} categories")
    print(f"After:  {grouped.nunique()} categories")

    return grouped

grouped_category = group_rare_categories(df["product_category"], min_pct=3.0)
print("\nNew distribution:")
print(grouped_category.value_counts())

Practical Checklist

Before moving on from univariate categorical analysis, verify the following for every categorical column:

  1. Cardinality: How many unique values? Is it nominal, ordinal, or secretly high-cardinality?
  2. Frequency distribution: Is it balanced, skewed, or dominated by one category?
  3. Rare categories: Are there categories with very few observations? How will you handle them?
  4. Data quality: Are there typos, inconsistent casing, or duplicate representations?
  5. Missing values: Are they random or concentrated in certain categories?
  6. Entropy: How diverse is the distribution? Is the imbalance expected or a data issue?
  7. Ordering: If ordinal, have you encoded the order explicitly?

Key Takeaways

  • Always compute value counts with both absolute and cumulative percentages.
  • Bar plots are the correct default visualization for categories. Pie charts fail at comparison.
  • Use entropy to quantify category balance — a normalized entropy below 0.7 signals meaningful imbalance.
  • Pareto analysis reveals the vital few categories that carry most of the data.
  • Imbalance ratios above 10:1 require special treatment in ML pipelines (stratified splits, class weights, or resampling).
  • Clean dirty categories before analysis: normalize case, trim whitespace, fix typos with fuzzy matching.
  • Distinguish ordinal from nominal — encoding ordinal variables as unordered categories destroys information.

Try It Yourself

Exercise 1: An e-commerce dataset has a product_category column with 200 unique values. The top 5 categories make up 78% of orders, and 150 categories each have fewer than 10 orders (out of 50,000 total). Write code to do a Pareto analysis, compute entropy, and group rare categories into an "Other" bucket.

Solution
python
import pandas as pd
import numpy as np
from scipy.stats import entropy as sp_entropy

vc = df['product_category'].value_counts()
cumulative_pct = vc.cumsum() / vc.sum() * 100

# Pareto analysis
n_80 = (cumulative_pct <= 80).sum() + 1
print(f"Pareto: {n_80} of {len(vc)} categories cover 80% of orders")
print(f"That is {n_80/len(vc)*100:.1f}% of categories")

# Entropy
probs = vc / vc.sum()
h = sp_entropy(probs, base=2)
h_max = np.log2(len(vc))
h_norm = h / h_max
print(f"\nShannon entropy: {h:.3f} bits")
print(f"Normalized entropy: {h_norm:.3f} (1.0 = balanced)")
print(f"Interpretation: {'Highly imbalanced' if h_norm < 0.5 else 'Imbalanced'}")

# Group rare categories
min_count = 50  # categories with fewer than 50 orders -> "Other"
rare_cats = vc[vc < min_count].index.tolist()
df['product_category_grouped'] = df['product_category'].copy()
df.loc[df['product_category_grouped'].isin(rare_cats), 'product_category_grouped'] = 'Other'

print(f"\nBefore grouping: {df['product_category'].nunique()} categories")
print(f"After grouping:  {df['product_category_grouped'].nunique()} categories")
print(f"Grouped {len(rare_cats)} rare categories into 'Other'")
print(f"\nNew top 10:")
print(df['product_category_grouped'].value_counts().head(10))

Exercise 2: You have a survey column satisfaction with values: "Very Satisfied", "Satisfied", "Neutral", "Dissatisfied", "Very Dissatisfied", plus some messy entries like "very satisfied", "V. Satisfied", "N/A", and empty strings. Write code to clean it, recognize it as ordinal, and produce a proper ordered bar chart.

Solution
python
import pandas as pd
import numpy as np
from difflib import get_close_matches
import matplotlib.pyplot as plt

# Step 1: Clean — normalize case, strip whitespace
cleaned = df['satisfaction'].str.strip().str.lower()

# Step 2: Map known null values
null_values = {'', 'n/a', 'na', 'none', 'null', '-'}
cleaned = cleaned.replace(null_values, np.nan)

# Step 3: Fuzzy match to canonical values
canonical = ['very satisfied', 'satisfied', 'neutral', 'dissatisfied', 'very dissatisfied']

def fuzzy_map(val, canonical_list, cutoff=0.6):
    if pd.isna(val):
        return np.nan
    matches = get_close_matches(val, canonical_list, n=1, cutoff=cutoff)
    return matches[0] if matches else np.nan

cleaned = cleaned.apply(lambda x: fuzzy_map(x, canonical))

# Step 4: Convert to ordered categorical
order = ['Very Dissatisfied', 'Dissatisfied', 'Neutral', 'Satisfied', 'Very Satisfied']
cleaned = cleaned.str.title()
df['satisfaction_clean'] = pd.Categorical(cleaned, categories=order, ordered=True)

print(f"Parse success: {df['satisfaction_clean'].notna().sum()}/{len(df)}")
print(df['satisfaction_clean'].value_counts().reindex(order))

# Step 5: Ordered bar chart
fig, ax = plt.subplots(figsize=(10, 5))
vc = df['satisfaction_clean'].value_counts().reindex(order)
colors = plt.cm.RdYlGn(np.linspace(0.2, 0.8, len(order)))
ax.bar(order, vc.values, color=colors, edgecolor='black')
ax.set_title('Satisfaction Distribution (Ordered)')
ax.set_ylabel('Count')
plt.tight_layout()
plt.show()

Exercise 3: A dataset has a payment_method column. The imbalance ratio is 45:1 (Credit Card: 90%, Crypto: 2%). Write code to compute the imbalance ratio, Gini impurity, and chi-squared goodness-of-fit test against a uniform distribution. Then recommend whether special handling is needed for ML modeling.

Solution
python
import pandas as pd
import numpy as np
from scipy import stats

vc = df['payment_method'].value_counts()
n_cats = df['payment_method'].nunique()
total = len(df)
expected = total / n_cats

# Imbalance ratio
imbalance_ratio = vc.iloc[0] / vc.iloc[-1]
print(f"Most common:  {vc.index[0]} ({vc.iloc[0]:,}, {vc.iloc[0]/total*100:.1f}%)")
print(f"Least common: {vc.index[-1]} ({vc.iloc[-1]:,}, {vc.iloc[-1]/total*100:.1f}%)")
print(f"Imbalance ratio: {imbalance_ratio:.1f}:1")

# Gini impurity
probs = vc / total
gini = 1 - (probs ** 2).sum()
gini_max = 1 - 1/n_cats
gini_norm = gini / gini_max
print(f"\nGini impurity: {gini:.4f} (normalized: {gini_norm:.4f})")

# Chi-squared vs uniform
chi2, p_value = stats.chisquare(vc)
print(f"Chi-squared vs uniform: chi2={chi2:.1f}, p={p_value:.2e}")

# Recommendation
print(f"\n--- ML Recommendation ---")
if imbalance_ratio > 10:
    print(f"Imbalance ratio {imbalance_ratio:.0f}:1 is SEVERE.")
    print("1. Use stratified train/test splits")
    print("2. Consider grouping Crypto into 'Other' if it has <50 samples")
    print("3. If payment_method is the TARGET, use SMOTE or class weights")
    print("4. If it is a FEATURE, use target encoding instead of one-hot")
    print("   (one-hot creates a near-zero-variance column for Crypto)")

Quick Quiz

1. Why are pie charts almost always the wrong choice for categorical data?

  • a) They require too much computation
  • b) Humans are poor at comparing angles and areas, leading to 10-25% perception error
  • c) They cannot display more than 3 categories
Answer

b) Humans are poor at comparing angles and areas, leading to 10-25% perception error. Research in visual perception shows that humans compare positions on a common scale (bar charts) with about 0.5% error, but compare angles and areas (pie charts) with 10-25% error. Two slices at 28% and 32% look nearly identical in a pie chart but are clearly different in a bar chart.

2. What does normalized entropy of 0.45 tell you about a categorical variable?

  • a) The variable is nearly uniformly distributed
  • b) The variable is highly imbalanced, dominated by one or two categories
  • c) The variable has too many categories
Answer

b) The variable is highly imbalanced, dominated by one or two categories. Normalized entropy ranges from 0 (single category) to 1 (perfectly uniform). A value of 0.45 means the distribution uses less than half of its possible "diversity." This typically means a few categories carry most of the data while many others are rare.

3. When should you use ordinal encoding instead of one-hot encoding?

  • a) When the categories have a natural, meaningful order (e.g., education level)
  • b) When there are more than 10 categories
  • c) When using tree-based models
Answer

a) When the categories have a natural, meaningful order (e.g., education level). Ordinal encoding assigns integers that respect the inherent order: HS=1, BS=2, MS=3, PhD=4. This preserves the information that PhD > MS > BS > HS. One-hot encoding would treat them as completely unrelated categories, destroying the ordering information.

4. What is a Pareto analysis used for in categorical EDA?

  • a) Testing if a variable is normally distributed
  • b) Identifying the vital few categories that account for most of the data (80/20 rule)
  • c) Computing the correlation between two categorical variables
Answer

b) Identifying the vital few categories that account for most of the data (80/20 rule). A Pareto chart plots categories in descending frequency with a cumulative percentage line. It reveals that often 20% of categories account for 80% of observations. This guides decisions about which categories to keep, which to group into "Other," and where to focus analysis.

5. You have a column with values "New York", "new york", "NY", "N.Y.", and "NewYork". What is the correct cleaning approach?

  • a) Delete all rows with non-standard values
  • b) Normalize case, strip whitespace, then use fuzzy matching to map variants to canonical values
  • c) One-hot encode all variants as separate categories
Answer

b) Normalize case, strip whitespace, then use fuzzy matching to map variants to canonical values. The correct approach is: (1) str.lower().str.strip() to normalize case and whitespace, (2) replace known abbreviations (NY -> new york), and (3) use difflib.get_close_matches() with a similarity cutoff to catch typos. This collapses all 5 variants into a single canonical "new york" category.

"What I cannot create, I do not understand." — Richard Feynman