Univariate Categorical Analysis

Categorical variables are everywhere — product types, customer segments, regions, status codes, error messages, survey responses. They look simple, but they hide traps that can wreck your models. A column with 5 clean categories and one with 10,000 typo-ridden free-text entries are both "categorical," but they demand completely different treatment.

This page covers every tool you need: value counts, frequency tables, bar plots, why pie charts lie, entropy as a measure of diversity, Pareto analysis for the vital few, and how to detect and handle category imbalance.

The Dataset

We will use a synthetic e-commerce dataset with realistic categorical distributions.

python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

np.random.seed(42)
n = 5000

# Product category: power-law distribution (few popular, many niche)
categories = [
    "Electronics", "Clothing", "Home & Garden", "Sports", "Books",
    "Toys", "Automotive", "Health", "Food", "Beauty",
    "Office", "Pet Supplies", "Music", "Software", "Crafts",
]
cat_probs = np.array([0.25, 0.18, 0.12, 0.10, 0.08, 0.06, 0.05, 0.04, 0.03,
                       0.025, 0.02, 0.015, 0.01, 0.008, 0.002])
cat_probs = cat_probs / cat_probs.sum()
product_category = np.random.choice(categories, size=n, p=cat_probs)

# Payment method: moderate imbalance
payment_methods = ["Credit Card", "Debit Card", "PayPal", "Apple Pay", "Crypto"]
pay_probs = [0.45, 0.25, 0.18, 0.10, 0.02]
payment = np.random.choice(payment_methods, size=n, p=pay_probs)

# Region: roughly balanced
regions = ["North America", "Europe", "Asia", "South America", "Africa", "Oceania"]
reg_probs = [0.30, 0.28, 0.22, 0.10, 0.05, 0.05]
region = np.random.choice(regions, size=n, p=reg_probs)

# Order status: heavily imbalanced
statuses = ["Delivered", "Shipped", "Processing", "Cancelled", "Returned", "Fraud"]
status_probs = [0.65, 0.15, 0.10, 0.05, 0.04, 0.01]
order_status = np.random.choice(statuses, size=n, p=status_probs)

# Satisfaction rating: ordinal
ratings = ["Very Dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very Satisfied"]
rat_probs = [0.05, 0.10, 0.25, 0.35, 0.25]
satisfaction = np.random.choice(ratings, size=n, p=rat_probs)

df = pd.DataFrame({
    "product_category": product_category,
    "payment_method": payment,
    "region": region,
    "order_status": order_status,
    "satisfaction": satisfaction,
})

print(df.shape)
print(df.dtypes)
df.head()

Value Counts: The Starting Point

Value counts are to categorical variables what describe() is to numerical ones. Always look at both absolute and relative frequencies.

python

def detailed_value_counts(series, name="variable"):
    """Compute value counts with cumulative percentages."""
    vc = series.value_counts()
    result = pd.DataFrame({
        "count": vc,
        "pct": (vc / len(series) * 100).round(2),
        "cumulative_count": vc.cumsum(),
        "cumulative_pct": (vc.cumsum() / len(series) * 100).round(2),
    })
    print(f"\n{'='*60}")
    print(f"  Value Counts: {name}")
    print(f"  Unique values: {series.nunique()}")
    print(f"  Missing: {series.isna().sum()} ({series.isna().mean()*100:.1f}%)")
    print(f"  Mode: {series.mode().iloc[0]} ({vc.iloc[0]:,} / {vc.iloc[0]/len(series)*100:.1f}%)")
    print(f"{'='*60}")
    print(result.to_string())
    return result

for col in df.columns:
    detailed_value_counts(df[col], col)

Bar Plots: The Right Way to Visualize Categories

Bar plots are the correct default for categorical data. Horizontal bars work better when category names are long.

python

fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Vertical bar — good for few categories
vc = df["payment_method"].value_counts()
axes[0, 0].bar(vc.index, vc.values, color="steelblue", edgecolor="black")
axes[0, 0].set_title("Payment Method (Vertical Bar)", fontsize=13)
for i, (val, count) in enumerate(zip(vc.index, vc.values)):
    axes[0, 0].text(i, count + 20, f"{count:,}\n({count/n*100:.1f}%)",
                    ha="center", fontsize=9)

# Horizontal bar — better for many categories or long names
vc = df["product_category"].value_counts()
axes[0, 1].barh(vc.index[::-1], vc.values[::-1], color="steelblue", edgecolor="black")
axes[0, 1].set_title("Product Category (Horizontal Bar)", fontsize=13)
for i, (val, count) in enumerate(zip(vc.values[::-1], vc.index[::-1])):
    axes[0, 1].text(val + 10, i, f" {val:,}", va="center", fontsize=9)

# Ordered by frequency — always do this
vc = df["order_status"].value_counts()
colors = ["#2ecc71" if v > n*0.1 else "#e74c3c" if v < n*0.02 else "#f39c12" for v in vc.values]
axes[1, 0].bar(vc.index, vc.values, color=colors, edgecolor="black")
axes[1, 0].set_title("Order Status (Color-coded by frequency)", fontsize=13)
axes[1, 0].set_ylabel("Count")

# Percentage bar — often more informative than counts
vc_pct = df["region"].value_counts(normalize=True) * 100
axes[1, 1].barh(vc_pct.index[::-1], vc_pct.values[::-1], color="steelblue", edgecolor="black")
axes[1, 1].set_title("Region (Percentage)", fontsize=13)
axes[1, 1].set_xlabel("Percentage (%)")
for i, pct in enumerate(vc_pct.values[::-1]):
    axes[1, 1].text(pct + 0.2, i, f" {pct:.1f}%", va="center", fontsize=9)

plt.tight_layout()
plt.savefig("bar_plots.png", dpi=150, bbox_inches="tight")
plt.show()

Why Pie Charts Are Almost Always Wrong

Pie charts are the most overused and least informative chart type. Human brains are terrible at comparing angles and areas.

python

fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# The same data shown three ways
vc = df["region"].value_counts()

# Pie chart — hard to compare similar-sized slices
axes[0].pie(vc.values, labels=vc.index, autopct="%1.1f%%", startangle=90)
axes[0].set_title("Pie Chart\n(Can you tell North America vs Europe?)", fontsize=12)

# Bar chart — easy to compare
axes[1].barh(vc.index[::-1], vc.values[::-1], color="steelblue", edgecolor="black")
axes[1].set_title("Bar Chart\n(Much easier to compare)", fontsize=12)

# Worst case: too many categories
vc_many = df["product_category"].value_counts()
axes[2].pie(vc_many.values, labels=vc_many.index, autopct="%1.1f%%",
            startangle=90, textprops={"fontsize": 7})
axes[2].set_title("Pie Chart with Many Categories\n(Completely unreadable)", fontsize=12)

plt.tight_layout()
plt.savefig("pie_vs_bar.png", dpi=150, bbox_inches="tight")
plt.show()

When are pie charts acceptable?

Almost never. The only defensible use case is showing that one category dominates (>60%) everything else. Even then, a stacked bar at 100% width is clearer. If you must use a pie chart, never use more than 5 slices and always add percentage labels.

The Perception Problem

Frequency Tables and Cross-Tabulation

Beyond simple value counts, frequency tables let you compute marginal and joint distributions.

python

# Frequency table with multiple aggregations
def frequency_table(series, name="variable"):
    """Build a detailed frequency table."""
    vc = series.value_counts().sort_index()
    total = len(series)

    table = pd.DataFrame({
        "count": vc,
        "relative_freq": vc / total,
        "percentage": (vc / total * 100).round(2),
        "cumulative_pct": (vc.cumsum() / total * 100).round(2),
    })

    # Add expected count under uniform distribution
    n_cats = series.nunique()
    table["expected_uniform"] = total / n_cats
    table["observed_vs_expected"] = (table["count"] / table["expected_uniform"]).round(3)

    return table

freq = frequency_table(df["product_category"], "product_category")
print(freq.sort_values("count", ascending=False).to_string())

Entropy: Measuring Category Diversity

Entropy quantifies the "unpredictability" or "diversity" of a categorical distribution. Maximum entropy means all categories are equally likely. Minimum entropy (zero) means only one category exists.

python

from scipy.stats import entropy as sp_entropy

def category_entropy(series, name="variable"):
    """Compute entropy and normalized entropy for a categorical variable."""
    vc = series.value_counts(normalize=True)
    n_cats = series.nunique()

    # Shannon entropy (base 2, in bits)
    h = sp_entropy(vc, base=2)

    # Maximum possible entropy (uniform distribution)
    h_max = np.log2(n_cats)

    # Normalized entropy (0 = single category, 1 = perfectly uniform)
    h_norm = h / h_max if h_max > 0 else 0

    print(f"\n--- Entropy Analysis: {name} ---")
    print(f"  Unique categories:    {n_cats}")
    print(f"  Shannon entropy:      {h:.4f} bits")
    print(f"  Maximum entropy:      {h_max:.4f} bits")
    print(f"  Normalized entropy:   {h_norm:.4f} (1.0 = perfectly balanced)")
    print(f"  Interpretation:       ", end="")
    if h_norm > 0.9:
        print("Very balanced — near-uniform distribution")
    elif h_norm > 0.7:
        print("Moderately balanced — some categories dominate")
    elif h_norm > 0.5:
        print("Imbalanced — few categories carry most of the data")
    else:
        print("Highly imbalanced — dominated by one or two categories")

    return {"entropy": h, "max_entropy": h_max, "normalized_entropy": h_norm}

# Compare entropy across all categorical columns
entropy_results = {}
for col in df.columns:
    entropy_results[col] = category_entropy(df[col], col)

# Visualize
ent_df = pd.DataFrame(entropy_results).T
fig, ax = plt.subplots(figsize=(10, 5))
bars = ax.barh(ent_df.index, ent_df["normalized_entropy"],
               color=["#e74c3c" if v < 0.7 else "#2ecc71" for v in ent_df["normalized_entropy"]],
               edgecolor="black")
ax.set_xlabel("Normalized Entropy (1.0 = perfectly balanced)")
ax.set_title("Category Balance by Variable", fontsize=14)
ax.set_xlim(0, 1.05)
for bar, val in zip(bars, ent_df["normalized_entropy"]):
    ax.text(val + 0.01, bar.get_y() + bar.get_height()/2, f"{val:.3f}",
            va="center", fontsize=10)
plt.tight_layout()
plt.savefig("entropy_comparison.png", dpi=150, bbox_inches="tight")
plt.show()

Why Entropy Matters

Pareto Analysis (The 80/20 Rule)

Pareto analysis tells you which categories carry most of the weight. In practice, you often find that 20% of categories account for 80% of observations.

python

def pareto_chart(series, name="variable", figsize=(12, 6)):
    """Create a Pareto chart with cumulative line."""
    vc = series.value_counts()
    cumulative_pct = vc.cumsum() / vc.sum() * 100

    fig, ax1 = plt.subplots(figsize=figsize)

    # Bars
    bars = ax1.bar(range(len(vc)), vc.values, color="steelblue",
                   edgecolor="black", alpha=0.8)
    ax1.set_xticks(range(len(vc)))
    ax1.set_xticklabels(vc.index, rotation=45, ha="right")
    ax1.set_ylabel("Count", color="steelblue")
    ax1.tick_params(axis="y", labelcolor="steelblue")

    # Cumulative line
    ax2 = ax1.twinx()
    ax2.plot(range(len(vc)), cumulative_pct, color="crimson",
             marker="o", linewidth=2, markersize=6)
    ax2.set_ylabel("Cumulative %", color="crimson")
    ax2.tick_params(axis="y", labelcolor="crimson")
    ax2.set_ylim(0, 105)

    # 80% reference line
    ax2.axhline(y=80, color="gray", linestyle="--", alpha=0.7, label="80% line")

    # Find how many categories hit 80%
    n_80 = (cumulative_pct <= 80).sum() + 1
    ax1.axvspan(-0.5, n_80 - 0.5, alpha=0.1, color="crimson",
                label=f"Top {n_80} categories = ~80%")

    ax1.set_title(f"Pareto Chart: {name}\n"
                  f"Top {n_80} of {len(vc)} categories cover ~80% of data",
                  fontsize=14)
    ax1.legend(loc="upper left")
    plt.tight_layout()
    plt.savefig(f"pareto_{name}.png", dpi=150, bbox_inches="tight")
    plt.show()

    return n_80, len(vc)

n80, total = pareto_chart(df["product_category"], "Product Category")
print(f"\n80/20 Analysis: {n80} categories ({n80/total*100:.0f}%) cover ~80% of orders")

Category Imbalance Detection

Imbalanced categories cause real problems in machine learning — minority classes get ignored, models optimize for the majority, and your precision/recall on rare classes tanks.

python

def imbalance_report(series, name="variable"):
    """Detect and quantify category imbalance."""
    vc = series.value_counts()
    n_cats = series.nunique()
    expected = len(series) / n_cats

    # Imbalance ratio: majority count / minority count
    imbalance_ratio = vc.iloc[0] / vc.iloc[-1]

    # Chi-square goodness of fit (vs uniform)
    chi2, p_value = stats.chisquare(vc)

    # Gini impurity
    probs = vc / vc.sum()
    gini = 1 - (probs ** 2).sum()
    gini_max = 1 - 1/n_cats
    gini_normalized = gini / gini_max if gini_max > 0 else 0

    print(f"\n{'='*60}")
    print(f"  Imbalance Report: {name}")
    print(f"{'='*60}")
    print(f"  Categories:        {n_cats}")
    print(f"  Most common:       {vc.index[0]} ({vc.iloc[0]:,}, {vc.iloc[0]/len(series)*100:.1f}%)")
    print(f"  Least common:      {vc.index[-1]} ({vc.iloc[-1]:,}, {vc.iloc[-1]/len(series)*100:.1f}%)")
    print(f"  Imbalance ratio:   {imbalance_ratio:.1f}:1")
    print(f"  Gini impurity:     {gini:.4f} (normalized: {gini_normalized:.4f})")
    print(f"  Chi² vs uniform:   {chi2:.1f} (p={p_value:.2e})")

    # Flag severely underrepresented categories
    threshold = expected * 0.1  # less than 10% of expected
    rare = vc[vc < threshold]
    if len(rare) > 0:
        print(f"\n  WARNING: {len(rare)} categories have < 10% of expected count:")
        for cat, count in rare.items():
            print(f"    - {cat}: {count:,} (expected ~{expected:.0f})")

    return {
        "imbalance_ratio": imbalance_ratio,
        "gini": gini,
        "gini_normalized": gini_normalized,
    }

for col in df.columns:
    imbalance_report(df[col], col)

Imbalance Severity Scale

Imbalance Ratio	Severity	Impact on ML
1:1 to 3:1	Mild	Usually fine without special treatment
3:1 to 10:1	Moderate	Use stratified sampling, consider class weights
10:1 to 100:1	Severe	SMOTE, cost-sensitive learning, specialized metrics
> 100:1	Extreme	Anomaly detection framing may be better than classification

Handling Dirty Categories

Real-world categorical data is messy. The same category appears as "New York", "new york", "NY", "N.Y.", and "NewYork".

python

# Simulate dirty data
dirty_categories = np.random.choice(
    ["Electronics", "electronics", "ELECTRONICS", "Electrnics",
     "Clothing", "clothing", "Clothng", "Clothes",
     "Home & Garden", "Home and Garden", "Home&Garden", "home garden"],
    size=500,
)
dirty_series = pd.Series(dirty_categories)

print("Before cleaning:")
print(dirty_series.value_counts())
print(f"Unique values: {dirty_series.nunique()}")

# Step 1: Normalize case and whitespace
cleaned = dirty_series.str.lower().str.strip()

# Step 2: Remove special characters
cleaned = cleaned.str.replace(r"[&+]", " and ", regex=True)
cleaned = cleaned.str.replace(r"\s+", " ", regex=True)

# Step 3: Fuzzy matching for typos
from difflib import get_close_matches

canonical = ["electronics", "clothing", "home and garden"]

def fuzzy_map(value, canonical_list, cutoff=0.7):
    matches = get_close_matches(value, canonical_list, n=1, cutoff=cutoff)
    return matches[0] if matches else value

cleaned = cleaned.apply(lambda x: fuzzy_map(x, canonical))

print("\nAfter cleaning:")
print(cleaned.value_counts())
print(f"Unique values: {cleaned.nunique()}")

Complete Categorical Profile Function

python

def categorical_profile(series, name="variable", top_n=15, figsize=(16, 10)):
    """Generate a complete categorical variable profile."""
    fig = plt.figure(figsize=figsize)

    vc = series.value_counts()
    vc_top = vc.head(top_n)
    pct = vc / len(series) * 100
    cumulative_pct = vc.cumsum() / len(series) * 100
    probs = vc / vc.sum()

    # Layout
    ax1 = fig.add_subplot(2, 2, 1)  # Bar plot
    ax2 = fig.add_subplot(2, 2, 2)  # Pareto
    ax3 = fig.add_subplot(2, 2, 3)  # Proportion with CI
    ax4 = fig.add_subplot(2, 2, 4)  # Stats text

    # 1. Horizontal bar
    ax1.barh(vc_top.index[::-1], vc_top.values[::-1],
             color="steelblue", edgecolor="black")
    ax1.set_title(f"Top {min(top_n, len(vc))} Categories", fontsize=12)
    ax1.set_xlabel("Count")

    # 2. Pareto (cumulative)
    ax2.bar(range(len(vc_top)), vc_top.values, color="steelblue",
            edgecolor="black", alpha=0.7)
    ax2_twin = ax2.twinx()
    ax2_twin.plot(range(len(vc_top)), cumulative_pct.head(top_n),
                  color="crimson", marker="o", linewidth=2)
    ax2_twin.axhline(80, color="gray", linestyle="--", alpha=0.5)
    ax2_twin.set_ylabel("Cumulative %")
    ax2.set_xticks(range(len(vc_top)))
    ax2.set_xticklabels(vc_top.index, rotation=45, ha="right", fontsize=8)
    ax2.set_title("Pareto (Cumulative %)", fontsize=12)

    # 3. Proportion with 95% confidence intervals
    n_total = len(series)
    for i, (cat, count) in enumerate(vc_top.items()):
        p = count / n_total
        se = np.sqrt(p * (1 - p) / n_total)
        ci_low = max(0, p - 1.96 * se)
        ci_high = min(1, p + 1.96 * se)
        ax3.barh(i, p, color="steelblue", edgecolor="black", alpha=0.7)
        ax3.errorbar(p, i, xerr=[[p - ci_low], [ci_high - p]],
                     fmt="none", color="black", capsize=4)
    ax3.set_yticks(range(len(vc_top)))
    ax3.set_yticklabels(vc_top.index)
    ax3.set_xlabel("Proportion (with 95% CI)")
    ax3.set_title("Proportions with Confidence Intervals", fontsize=12)

    # 4. Stats summary
    ax4.axis("off")
    h = sp_entropy(probs, base=2)
    h_max = np.log2(series.nunique())
    gini = 1 - (probs ** 2).sum()
    imb = vc.iloc[0] / vc.iloc[-1] if vc.iloc[-1] > 0 else float("inf")

    stats_text = (
        f"Total observations:   {len(series):,}\n"
        f"Unique categories:    {series.nunique()}\n"
        f"Missing values:       {series.isna().sum()}\n\n"
        f"Mode:                 {vc.index[0]}\n"
        f"Mode frequency:       {vc.iloc[0]:,} ({pct.iloc[0]:.1f}%)\n"
        f"Least common:         {vc.index[-1]}\n"
        f"Least common freq:    {vc.iloc[-1]:,} ({pct.iloc[-1]:.1f}%)\n\n"
        f"Imbalance ratio:      {imb:.1f}:1\n"
        f"Shannon entropy:      {h:.3f} bits\n"
        f"Normalized entropy:   {h/h_max:.3f}\n"
        f"Gini impurity:        {gini:.3f}"
    )
    ax4.text(0.05, 0.5, stats_text, transform=ax4.transAxes, fontsize=11,
             fontfamily="monospace", verticalalignment="center",
             bbox=dict(boxstyle="round", facecolor="lightyellow", alpha=0.8))
    ax4.set_title("Summary Statistics", fontsize=12)

    fig.suptitle(f"Categorical Profile: {name}", fontsize=16, fontweight="bold")
    plt.tight_layout()
    plt.savefig(f"cat_profile_{name}.png", dpi=150, bbox_inches="tight")
    plt.show()

for col in df.columns:
    categorical_profile(df[col], col)

Ordinal vs. Nominal: It Matters

python

# Ordinal encoding that preserves order
satisfaction_order = {
    "Very Dissatisfied": 1,
    "Dissatisfied": 2,
    "Neutral": 3,
    "Satisfied": 4,
    "Very Satisfied": 5,
}

df["satisfaction_ordinal"] = df["satisfaction"].map(satisfaction_order)

# Now you can meaningfully compute mean, median, etc.
print(f"Mean satisfaction: {df['satisfaction_ordinal'].mean():.2f}")
print(f"Median satisfaction: {df['satisfaction_ordinal'].median():.1f}")

# Distribution of ordinal variable
fig, ax = plt.subplots(figsize=(10, 5))
order = list(satisfaction_order.keys())
vc = df["satisfaction"].value_counts().reindex(order)
colors = plt.cm.RdYlGn(np.linspace(0.2, 0.8, len(order)))
ax.bar(order, vc.values, color=colors, edgecolor="black")
ax.set_title("Satisfaction Distribution (Ordered)", fontsize=14)
ax.set_ylabel("Count")
for i, (cat, count) in enumerate(zip(order, vc.values)):
    ax.text(i, count + 20, f"{count:,}\n({count/n*100:.1f}%)", ha="center", fontsize=9)
plt.tight_layout()
plt.savefig("ordinal_distribution.png", dpi=150, bbox_inches="tight")
plt.show()

Rare Category Strategies

python

def group_rare_categories(series, min_pct=2.0, other_label="Other"):
    """Group categories below a minimum percentage threshold."""
    vc = series.value_counts(normalize=True) * 100
    rare_cats = vc[vc < min_pct].index.tolist()

    grouped = series.copy()
    grouped[grouped.isin(rare_cats)] = other_label

    print(f"Grouped {len(rare_cats)} rare categories into '{other_label}':")
    for cat in rare_cats:
        print(f"  - {cat} ({vc[cat]:.2f}%)")
    print(f"\nBefore: {series.nunique()} categories")
    print(f"After:  {grouped.nunique()} categories")

    return grouped

grouped_category = group_rare_categories(df["product_category"], min_pct=3.0)
print("\nNew distribution:")
print(grouped_category.value_counts())

Practical Checklist

Before moving on from univariate categorical analysis, verify the following for every categorical column:

Cardinality: How many unique values? Is it nominal, ordinal, or secretly high-cardinality?
Frequency distribution: Is it balanced, skewed, or dominated by one category?
Rare categories: Are there categories with very few observations? How will you handle them?
Data quality: Are there typos, inconsistent casing, or duplicate representations?
Missing values: Are they random or concentrated in certain categories?
Entropy: How diverse is the distribution? Is the imbalance expected or a data issue?
Ordering: If ordinal, have you encoded the order explicitly?

Key Takeaways

Always compute value counts with both absolute and cumulative percentages.
Bar plots are the correct default visualization for categories. Pie charts fail at comparison.
Use entropy to quantify category balance — a normalized entropy below 0.7 signals meaningful imbalance.
Pareto analysis reveals the vital few categories that carry most of the data.
Imbalance ratios above 10:1 require special treatment in ML pipelines (stratified splits, class weights, or resampling).
Clean dirty categories before analysis: normalize case, trim whitespace, fix typos with fuzzy matching.
Distinguish ordinal from nominal — encoding ordinal variables as unordered categories destroys information.

Try It Yourself

Exercise 1: An e-commerce dataset has a product_category column with 200 unique values. The top 5 categories make up 78% of orders, and 150 categories each have fewer than 10 orders (out of 50,000 total). Write code to do a Pareto analysis, compute entropy, and group rare categories into an "Other" bucket.

Solution

python

import pandas as pd
import numpy as np
from scipy.stats import entropy as sp_entropy

vc = df['product_category'].value_counts()
cumulative_pct = vc.cumsum() / vc.sum() * 100

# Pareto analysis
n_80 = (cumulative_pct <= 80).sum() + 1
print(f"Pareto: {n_80} of {len(vc)} categories cover 80% of orders")
print(f"That is {n_80/len(vc)*100:.1f}% of categories")

# Entropy
probs = vc / vc.sum()
h = sp_entropy(probs, base=2)
h_max = np.log2(len(vc))
h_norm = h / h_max
print(f"\nShannon entropy: {h:.3f} bits")
print(f"Normalized entropy: {h_norm:.3f} (1.0 = balanced)")
print(f"Interpretation: {'Highly imbalanced' if h_norm < 0.5 else 'Imbalanced'}")

# Group rare categories
min_count = 50  # categories with fewer than 50 orders -> "Other"
rare_cats = vc[vc < min_count].index.tolist()
df['product_category_grouped'] = df['product_category'].copy()
df.loc[df['product_category_grouped'].isin(rare_cats), 'product_category_grouped'] = 'Other'

print(f"\nBefore grouping: {df['product_category'].nunique()} categories")
print(f"After grouping:  {df['product_category_grouped'].nunique()} categories")
print(f"Grouped {len(rare_cats)} rare categories into 'Other'")
print(f"\nNew top 10:")
print(df['product_category_grouped'].value_counts().head(10))

Exercise 2: You have a survey column satisfaction with values: "Very Satisfied", "Satisfied", "Neutral", "Dissatisfied", "Very Dissatisfied", plus some messy entries like "very satisfied", "V. Satisfied", "N/A", and empty strings. Write code to clean it, recognize it as ordinal, and produce a proper ordered bar chart.

Solution

python

import pandas as pd
import numpy as np
from difflib import get_close_matches
import matplotlib.pyplot as plt

# Step 1: Clean — normalize case, strip whitespace
cleaned = df['satisfaction'].str.strip().str.lower()

# Step 2: Map known null values
null_values = {'', 'n/a', 'na', 'none', 'null', '-'}
cleaned = cleaned.replace(null_values, np.nan)

# Step 3: Fuzzy match to canonical values
canonical = ['very satisfied', 'satisfied', 'neutral', 'dissatisfied', 'very dissatisfied']

def fuzzy_map(val, canonical_list, cutoff=0.6):
    if pd.isna(val):
        return np.nan
    matches = get_close_matches(val, canonical_list, n=1, cutoff=cutoff)
    return matches[0] if matches else np.nan

cleaned = cleaned.apply(lambda x: fuzzy_map(x, canonical))

# Step 4: Convert to ordered categorical
order = ['Very Dissatisfied', 'Dissatisfied', 'Neutral', 'Satisfied', 'Very Satisfied']
cleaned = cleaned.str.title()
df['satisfaction_clean'] = pd.Categorical(cleaned, categories=order, ordered=True)

print(f"Parse success: {df['satisfaction_clean'].notna().sum()}/{len(df)}")
print(df['satisfaction_clean'].value_counts().reindex(order))

# Step 5: Ordered bar chart
fig, ax = plt.subplots(figsize=(10, 5))
vc = df['satisfaction_clean'].value_counts().reindex(order)
colors = plt.cm.RdYlGn(np.linspace(0.2, 0.8, len(order)))
ax.bar(order, vc.values, color=colors, edgecolor='black')
ax.set_title('Satisfaction Distribution (Ordered)')
ax.set_ylabel('Count')
plt.tight_layout()
plt.show()

Exercise 3: A dataset has a payment_method column. The imbalance ratio is 45:1 (Credit Card: 90%, Crypto: 2%). Write code to compute the imbalance ratio, Gini impurity, and chi-squared goodness-of-fit test against a uniform distribution. Then recommend whether special handling is needed for ML modeling.

Solution

python

import pandas as pd
import numpy as np
from scipy import stats

vc = df['payment_method'].value_counts()
n_cats = df['payment_method'].nunique()
total = len(df)
expected = total / n_cats

# Imbalance ratio
imbalance_ratio = vc.iloc[0] / vc.iloc[-1]
print(f"Most common:  {vc.index[0]} ({vc.iloc[0]:,}, {vc.iloc[0]/total*100:.1f}%)")
print(f"Least common: {vc.index[-1]} ({vc.iloc[-1]:,}, {vc.iloc[-1]/total*100:.1f}%)")
print(f"Imbalance ratio: {imbalance_ratio:.1f}:1")

# Gini impurity
probs = vc / total
gini = 1 - (probs ** 2).sum()
gini_max = 1 - 1/n_cats
gini_norm = gini / gini_max
print(f"\nGini impurity: {gini:.4f} (normalized: {gini_norm:.4f})")

# Chi-squared vs uniform
chi2, p_value = stats.chisquare(vc)
print(f"Chi-squared vs uniform: chi2={chi2:.1f}, p={p_value:.2e}")

# Recommendation
print(f"\n--- ML Recommendation ---")
if imbalance_ratio > 10:
    print(f"Imbalance ratio {imbalance_ratio:.0f}:1 is SEVERE.")
    print("1. Use stratified train/test splits")
    print("2. Consider grouping Crypto into 'Other' if it has <50 samples")
    print("3. If payment_method is the TARGET, use SMOTE or class weights")
    print("4. If it is a FEATURE, use target encoding instead of one-hot")
    print("   (one-hot creates a near-zero-variance column for Crypto)")

Quick Quiz

1. Why are pie charts almost always the wrong choice for categorical data?

a) They require too much computation
b) Humans are poor at comparing angles and areas, leading to 10-25% perception error
c) They cannot display more than 3 categories

Answer

b) Humans are poor at comparing angles and areas, leading to 10-25% perception error. Research in visual perception shows that humans compare positions on a common scale (bar charts) with about 0.5% error, but compare angles and areas (pie charts) with 10-25% error. Two slices at 28% and 32% look nearly identical in a pie chart but are clearly different in a bar chart.

2. What does normalized entropy of 0.45 tell you about a categorical variable?

a) The variable is nearly uniformly distributed
b) The variable is highly imbalanced, dominated by one or two categories
c) The variable has too many categories

Answer

b) The variable is highly imbalanced, dominated by one or two categories. Normalized entropy ranges from 0 (single category) to 1 (perfectly uniform). A value of 0.45 means the distribution uses less than half of its possible "diversity." This typically means a few categories carry most of the data while many others are rare.

3. When should you use ordinal encoding instead of one-hot encoding?

a) When the categories have a natural, meaningful order (e.g., education level)
b) When there are more than 10 categories
c) When using tree-based models

Answer

a) When the categories have a natural, meaningful order (e.g., education level). Ordinal encoding assigns integers that respect the inherent order: HS=1, BS=2, MS=3, PhD=4. This preserves the information that PhD > MS > BS > HS. One-hot encoding would treat them as completely unrelated categories, destroying the ordering information.

4. What is a Pareto analysis used for in categorical EDA?

a) Testing if a variable is normally distributed
b) Identifying the vital few categories that account for most of the data (80/20 rule)
c) Computing the correlation between two categorical variables

Answer

b) Identifying the vital few categories that account for most of the data (80/20 rule). A Pareto chart plots categories in descending frequency with a cumulative percentage line. It reveals that often 20% of categories account for 80% of observations. This guides decisions about which categories to keep, which to group into "Other," and where to focus analysis.

5. You have a column with values "New York", "new york", "NY", "N.Y.", and "NewYork". What is the correct cleaning approach?

a) Delete all rows with non-standard values
b) Normalize case, strip whitespace, then use fuzzy matching to map variants to canonical values
c) One-hot encode all variants as separate categories

Answer

b) Normalize case, strip whitespace, then use fuzzy matching to map variants to canonical values. The correct approach is: (1) str.lower().str.strip() to normalize case and whitespace, (2) replace known abbreviations (NY -> new york), and (3) use difflib.get_close_matches() with a similarity cutoff to catch typos. This collapses all 5 variants into a single canonical "new york" category.

Univariate Categorical Analysis ​

The Dataset ​

Value Counts: The Starting Point ​

Bar Plots: The Right Way to Visualize Categories ​

Why Pie Charts Are Almost Always Wrong ​

The Perception Problem ​

Frequency Tables and Cross-Tabulation ​

Entropy: Measuring Category Diversity ​

Why Entropy Matters ​

Pareto Analysis (The 80/20 Rule) ​

Category Imbalance Detection ​

Imbalance Severity Scale ​

Handling Dirty Categories ​

Complete Categorical Profile Function ​

Ordinal vs. Nominal: It Matters ​

Rare Category Strategies ​

Practical Checklist ​

Key Takeaways ​

Try It Yourself ​

Quick Quiz ​

Univariate Categorical Analysis

The Dataset

Value Counts: The Starting Point

Bar Plots: The Right Way to Visualize Categories

Why Pie Charts Are Almost Always Wrong

The Perception Problem

Frequency Tables and Cross-Tabulation

Entropy: Measuring Category Diversity

Why Entropy Matters

Pareto Analysis (The 80/20 Rule)

Category Imbalance Detection

Imbalance Severity Scale

Handling Dirty Categories

Complete Categorical Profile Function

Ordinal vs. Nominal: It Matters

Rare Category Strategies

Practical Checklist

Key Takeaways

Try It Yourself

Quick Quiz