Skip to content
Unverified — AI-generated content. Help verify this page

Univariate Numerical Analysis

Every exploratory analysis starts with a single column. Before you build models, compute correlations, or engineer features, you need to understand each numerical variable in isolation. What is its center? How spread out is it? Is it symmetric or skewed? Are there outliers? Is it roughly normal, or does it follow some other distribution entirely?

This page covers every tool you need to answer those questions — histograms, kernel density estimates, box plots, violin plots, QQ plots, descriptive statistics, and formal normality tests — with real data and full Python code.

The Dataset

We will use a synthetic but realistic dataset of 2,000 employee records throughout this page.

python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

np.random.seed(42)
n = 2000

# Realistic salary distribution: right-skewed with a long tail
base_salary = np.random.lognormal(mean=11.0, sigma=0.4, size=n)
salary = np.clip(base_salary, 30_000, 500_000).astype(int)

# Age: roughly normal, bounded
age = np.random.normal(loc=38, scale=10, size=n).astype(int)
age = np.clip(age, 18, 70)

# Tenure in months: exponential (many short, few long)
tenure_months = np.random.exponential(scale=36, size=n).astype(int)
tenure_months = np.clip(tenure_months, 0, 360)

# Performance score: slightly left-skewed (most people score OK or above)
performance = np.random.beta(a=5, b=2, size=n) * 100

df = pd.DataFrame({
    "salary": salary,
    "age": age,
    "tenure_months": tenure_months,
    "performance_score": performance,
})

print(df.shape)
print(df.dtypes)
df.head(10)

Descriptive Statistics: What Each Metric Means

Before any visualization, compute the summary statistics. But do not just glance at them — understand what each one tells you.

python
desc = df.describe().T
desc["skewness"] = df.skew()
desc["kurtosis"] = df.kurtosis()  # excess kurtosis (normal = 0)
desc["iqr"] = desc["75%"] - desc["25%"]
desc["cv"] = desc["std"] / desc["mean"]  # coefficient of variation
print(desc.round(2))

What Each Statistic Tells You

StatisticWhat It MeasuresWatch For
MeanArithmetic center of gravitySensitive to outliers. A single $10M salary drags the mean far right.
Median (50%)The middle value when sortedRobust to outliers. If median << mean, you have right skew.
StdAverage distance from the meanOnly meaningful for symmetric distributions.
Min / MaxExtremesCheck for impossible values (negative ages, 999999 placeholders).
25% / 75%Quartile boundariesThe IQR (75% - 25%) captures the middle 50%.
SkewnessAsymmetry direction and degree0 = symmetric, >0 = right tail, <0 = left tail.
KurtosisTail heaviness (excess)0 = normal tails, >0 = heavier tails, <0 = lighter tails.
CVRelative variability (std/mean)Allows comparing spread across different scales.

Mean vs. Median

If the mean and median differ by more than 10-20%, your distribution is meaningfully skewed. In that case, the median is almost always a better measure of "typical." Report both, but lead with the median.

python
# Demonstrate the mean vs median gap on salary
print(f"Salary mean:   ${df['salary'].mean():,.0f}")
print(f"Salary median: ${df['salary'].median():,.0f}")
print(f"Gap:           {((df['salary'].mean() / df['salary'].median()) - 1) * 100:.1f}%")

# Compare with age (roughly symmetric)
print(f"\nAge mean:   {df['age'].mean():.1f}")
print(f"Age median: {df['age'].median():.1f}")
print(f"Gap:        {((df['age'].mean() / df['age'].median()) - 1) * 100:.1f}%")

Histograms: The Foundation of Distribution Visualization

A histogram bins your data into intervals and counts how many observations fall into each bin. It is the single most important EDA plot.

The Bin Size Problem

Bin size changes everything. Too few bins hide the shape. Too many bins create noise. There is no universally correct answer, but there are principled defaults.

python
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

# Different bin strategies for salary
bin_strategies = {
    "5 bins (too few)": 5,
    "Sturges": "sturges",
    "Scott": "scott",
    "Freedman-Diaconis": "fd",
    "50 bins": 50,
    "200 bins (too many)": 200,
}

for ax, (name, bins) in zip(axes.flat, bin_strategies.items()):
    ax.hist(df["salary"], bins=bins, edgecolor="black", alpha=0.7, color="steelblue")
    if isinstance(bins, int):
        actual_bins = bins
    else:
        actual_bins = len(np.histogram(df["salary"], bins=bins)[0])
    ax.set_title(f"{name} ({actual_bins} bins)", fontsize=12)
    ax.set_xlabel("Salary ($)")
    ax.set_ylabel("Count")

plt.suptitle("How Bin Count Changes Your Perception", fontsize=16, fontweight="bold")
plt.tight_layout()
plt.savefig("histogram_bins.png", dpi=150, bbox_inches="tight")
plt.show()

Bin Selection Rules

RuleFormulaBest For
Sturges1 + log2(n)Small, roughly normal data. Underbins for large n.
Scott3.49 * std * n^(-1/3)Normal or near-normal data.
Freedman-Diaconis2 * IQR * n^(-1/3)Robust to outliers and skew. Best general default.
Square rootsqrt(n)Simple, but no theoretical backing.
python
# Calculate each bin width explicitly for salary
n_obs = len(df["salary"])
salary_std = df["salary"].std()
salary_iqr = stats.iqr(df["salary"])

sturges_bins = int(np.ceil(1 + np.log2(n_obs)))
scott_width = 3.49 * salary_std * n_obs ** (-1/3)
fd_width = 2 * salary_iqr * n_obs ** (-1/3)
sqrt_bins = int(np.ceil(np.sqrt(n_obs)))

print(f"Sturges:           {sturges_bins} bins")
print(f"Scott:             width=${scott_width:,.0f} → ~{int((df['salary'].max() - df['salary'].min()) / scott_width)} bins")
print(f"Freedman-Diaconis: width=${fd_width:,.0f} → ~{int((df['salary'].max() - df['salary'].min()) / fd_width)} bins")
print(f"Square root:       {sqrt_bins} bins")

When in doubt, use Freedman-Diaconis

It uses the IQR instead of standard deviation, making it robust to outliers. For heavily skewed data (like salary), it usually gives better results than Scott or Sturges.

Kernel Density Estimation (KDE)

A histogram is a discrete approximation. KDE gives you a smooth, continuous estimate of the probability density function.

python
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# KDE with different bandwidths
bandwidths = [0.5, 1.0, 3.0]
for ax, bw_adjust in zip(axes, bandwidths):
    ax.hist(df["salary"], bins="fd", density=True, alpha=0.3,
            color="steelblue", edgecolor="black", label="Histogram")
    sns.kdeplot(df["salary"], bw_adjust=bw_adjust, ax=ax,
                color="crimson", linewidth=2, label=f"KDE (bw_adjust={bw_adjust})")
    ax.set_title(f"Bandwidth adjust = {bw_adjust}", fontsize=12)
    ax.set_xlabel("Salary ($)")
    ax.legend()

plt.suptitle("KDE Bandwidth Controls Smoothness", fontsize=16, fontweight="bold")
plt.tight_layout()
plt.savefig("kde_bandwidth.png", dpi=150, bbox_inches="tight")
plt.show()

When KDE Misleads

KDE can create phantom density where no data exists — particularly at boundaries and in gaps.

python
# Demonstrate boundary artifact
bimodal = np.concatenate([
    np.random.normal(50, 5, 500),
    np.random.normal(80, 5, 500),
])

fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(bimodal, bins=40, density=True, alpha=0.3, color="steelblue", edgecolor="black")
sns.kdeplot(bimodal, ax=ax, color="crimson", linewidth=2, label="Default KDE")
ax.axvspan(58, 72, alpha=0.1, color="red", label="Gap region — KDE invents density here")
ax.legend()
ax.set_title("KDE Can Hallucinate Density in Gaps", fontsize=14)
plt.tight_layout()
plt.savefig("kde_artifact.png", dpi=150, bbox_inches="tight")
plt.show()

Box Plots and Violin Plots

Box plots summarize the five-number summary (min, Q1, median, Q3, max) plus outliers. Violin plots add a KDE on each side to show the full distribution shape.

python
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Box plots for all numeric columns
df_melted = df.melt(var_name="variable", value_name="value")
# Normalize for comparable display
from sklearn.preprocessing import StandardScaler
df_scaled = pd.DataFrame(
    StandardScaler().fit_transform(df),
    columns=df.columns,
)
df_scaled_melted = df_scaled.melt(var_name="variable", value_name="z_score")

# Box plot
sns.boxplot(data=df_scaled_melted, x="variable", y="z_score", ax=axes[0],
            palette="Set2", flierprops=dict(marker="o", markersize=3, alpha=0.5))
axes[0].set_title("Box Plots (Standardized)", fontsize=14)
axes[0].set_ylabel("Z-Score")

# Violin plot
sns.violinplot(data=df_scaled_melted, x="variable", y="z_score", ax=axes[1],
               palette="Set2", inner="quartile")
axes[1].set_title("Violin Plots (Standardized)", fontsize=14)
axes[1].set_ylabel("Z-Score")

plt.tight_layout()
plt.savefig("box_violin.png", dpi=150, bbox_inches="tight")
plt.show()

Reading a Box Plot

When to Use Which

PlotBest ForLimitations
Box plotComparing distributions across groups, spotting outliersHides multimodality — two very different distributions can produce identical box plots
Violin plotSeeing the full shape while comparing groupsCan be confusing for non-technical audiences
HistogramDetailed shape of a single distributionHard to overlay many groups
KDESmooth comparison of 2-3 distributionsBandwidth-sensitive, boundary artifacts
python
# Box plots HIDE multimodality — demonstration
bimodal_data = np.concatenate([np.random.normal(30, 5, 500), np.random.normal(70, 5, 500)])
unimodal_data = np.random.normal(50, 20, 1000)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Box plots look similar
box_df = pd.DataFrame({
    "Bimodal": bimodal_data,
    "Unimodal": unimodal_data,
})
sns.boxplot(data=box_df, ax=axes[0], palette=["coral", "steelblue"])
axes[0].set_title("Box Plots Look Similar", fontsize=14)

# But histograms reveal the truth
axes[1].hist(bimodal_data, bins=40, alpha=0.5, color="coral", label="Bimodal", density=True)
axes[1].hist(unimodal_data, bins=40, alpha=0.5, color="steelblue", label="Unimodal", density=True)
axes[1].set_title("Histograms Reveal the Difference", fontsize=14)
axes[1].legend()

plt.tight_layout()
plt.savefig("boxplot_trap.png", dpi=150, bbox_inches="tight")
plt.show()

QQ Plots (Quantile-Quantile)

A QQ plot compares your data's quantiles against the theoretical quantiles of a reference distribution (usually normal). If the points fall on the diagonal line, your data matches that distribution.

python
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

columns = ["salary", "age", "tenure_months", "performance_score"]
titles = [
    "Salary (right-skewed — expect upward curve)",
    "Age (roughly normal — expect near-diagonal)",
    "Tenure (exponential — expect strong curve)",
    "Performance (left-skewed — expect downward curve at top)",
]

for ax, col, title in zip(axes.flat, columns, titles):
    stats.probplot(df[col], dist="norm", plot=ax)
    ax.set_title(title, fontsize=11)
    ax.get_lines()[0].set_markerfacecolor("steelblue")
    ax.get_lines()[0].set_alpha(0.5)
    ax.get_lines()[0].set_markersize(3)
    ax.get_lines()[1].set_color("crimson")

plt.suptitle("QQ Plots — How Departures from Normality Look", fontsize=16, fontweight="bold")
plt.tight_layout()
plt.savefig("qq_plots.png", dpi=150, bbox_inches="tight")
plt.show()

Reading QQ Plot Patterns

Normality Tests

Visual inspection is necessary but not sufficient. Formal tests give you a p-value.

python
def test_normality(series, name):
    """Run multiple normality tests and return results."""
    results = {}

    # Shapiro-Wilk (best for n < 5000)
    if len(series) <= 5000:
        stat, p = stats.shapiro(series)
        results["Shapiro-Wilk"] = {"statistic": stat, "p_value": p}

    # D'Agostino-Pearson (requires n >= 20)
    stat, p = stats.normaltest(series)
    results["D'Agostino-Pearson"] = {"statistic": stat, "p_value": p}

    # Kolmogorov-Smirnov (against normal with sample mean/std)
    stat, p = stats.kstest(series, "norm", args=(series.mean(), series.std()))
    results["Kolmogorov-Smirnov"] = {"statistic": stat, "p_value": p}

    # Anderson-Darling
    result = stats.anderson(series, dist="norm")
    results["Anderson-Darling"] = {
        "statistic": result.statistic,
        "critical_values": dict(zip(result.significance_level, result.critical_values)),
    }

    # Jarque-Bera (tests skewness + kurtosis)
    stat, p = stats.jarque_bera(series)
    results["Jarque-Bera"] = {"statistic": stat, "p_value": p}

    return pd.DataFrame(results).T

for col in df.columns:
    print(f"\n{'='*60}")
    print(f"  Normality Tests: {col}")
    print(f"{'='*60}")
    result_df = test_normality(df[col], col)
    print(result_df.to_string())

Which Normality Test to Use

TestStrengthsWeaknesses
Shapiro-WilkBest overall power for n < 5000Computationally heavy for large n
D'Agostino-PearsonTests skewness and kurtosis specificallyNeeds n >= 20
Kolmogorov-SmirnovWorks for any reference distributionConservative; low power
Anderson-DarlingMore sensitive to tails than KSOnly for specific distributions
Jarque-BeraFast, based on momentsOnly detects skew/kurtosis deviations

All normality tests reject at large n

With n > 1000, even trivial departures from normality become statistically significant. A p-value of 0.001 does not mean your data is "far from normal" — it means you have enough data to detect a tiny deviation. Always combine tests with QQ plots and common sense.

Outlier Detection Methods

python
def detect_outliers(series, name="variable"):
    """Apply multiple outlier detection methods."""
    results = {}

    # 1. IQR method (Tukey's fences)
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower_iqr = q1 - 1.5 * iqr
    upper_iqr = q3 + 1.5 * iqr
    iqr_outliers = ((series < lower_iqr) | (series > upper_iqr)).sum()
    results["IQR (1.5x)"] = {
        "lower_bound": lower_iqr,
        "upper_bound": upper_iqr,
        "n_outliers": iqr_outliers,
        "pct": f"{iqr_outliers / len(series) * 100:.1f}%",
    }

    # 2. Z-score method
    z_scores = np.abs(stats.zscore(series))
    z_outliers = (z_scores > 3).sum()
    results["Z-score (|z|>3)"] = {
        "lower_bound": series.mean() - 3 * series.std(),
        "upper_bound": series.mean() + 3 * series.std(),
        "n_outliers": z_outliers,
        "pct": f"{z_outliers / len(series) * 100:.1f}%",
    }

    # 3. Modified Z-score (using median absolute deviation)
    median = series.median()
    mad = np.median(np.abs(series - median))
    modified_z = 0.6745 * (series - median) / mad if mad != 0 else pd.Series([0] * len(series))
    mad_outliers = (np.abs(modified_z) > 3.5).sum()
    results["Modified Z (MAD)"] = {
        "lower_bound": median - 3.5 * mad / 0.6745,
        "upper_bound": median + 3.5 * mad / 0.6745,
        "n_outliers": mad_outliers,
        "pct": f"{mad_outliers / len(series) * 100:.1f}%",
    }

    # 4. Percentile method
    lower_p = series.quantile(0.01)
    upper_p = series.quantile(0.99)
    pct_outliers = ((series < lower_p) | (series > upper_p)).sum()
    results["Percentile (1-99%)"] = {
        "lower_bound": lower_p,
        "upper_bound": upper_p,
        "n_outliers": pct_outliers,
        "pct": f"{pct_outliers / len(series) * 100:.1f}%",
    }

    return pd.DataFrame(results).T

for col in df.columns:
    print(f"\n--- Outlier Detection: {col} ---")
    print(detect_outliers(df[col], col).to_string())

Outlier Detection Decision Flow

Putting It All Together: Full Univariate Profile

python
def univariate_profile(series, name="variable", figsize=(18, 12)):
    """Generate a complete univariate profile for a numerical variable."""
    fig = plt.figure(figsize=figsize)

    # Layout: 2 rows, 3 columns
    ax1 = fig.add_subplot(2, 3, 1)  # Histogram + KDE
    ax2 = fig.add_subplot(2, 3, 2)  # Box plot
    ax3 = fig.add_subplot(2, 3, 3)  # QQ plot
    ax4 = fig.add_subplot(2, 3, 4)  # Violin plot
    ax5 = fig.add_subplot(2, 3, 5)  # ECDF
    ax6 = fig.add_subplot(2, 3, 6)  # Stats table

    # 1. Histogram + KDE
    ax1.hist(series, bins="fd", density=True, alpha=0.5, color="steelblue", edgecolor="black")
    sns.kdeplot(series, ax=ax1, color="crimson", linewidth=2)
    ax1.axvline(series.mean(), color="orange", linestyle="--", label=f"Mean: {series.mean():.1f}")
    ax1.axvline(series.median(), color="green", linestyle="-.", label=f"Median: {series.median():.1f}")
    ax1.legend(fontsize=9)
    ax1.set_title("Histogram + KDE")

    # 2. Box plot
    bp = ax2.boxplot(series, vert=True, patch_artist=True,
                     boxprops=dict(facecolor="steelblue", alpha=0.7),
                     flierprops=dict(marker="o", markersize=3, alpha=0.5))
    ax2.set_title("Box Plot")

    # 3. QQ plot
    stats.probplot(series, dist="norm", plot=ax3)
    ax3.set_title("QQ Plot (vs Normal)")

    # 4. Violin plot
    parts = ax4.violinplot(series, showmeans=True, showmedians=True)
    ax4.set_title("Violin Plot")

    # 5. ECDF
    sorted_data = np.sort(series)
    ecdf = np.arange(1, len(sorted_data) + 1) / len(sorted_data)
    ax5.plot(sorted_data, ecdf, color="steelblue", linewidth=1.5)
    ax5.set_xlabel(name)
    ax5.set_ylabel("Cumulative Probability")
    ax5.set_title("Empirical CDF")
    ax5.grid(True, alpha=0.3)

    # 6. Stats summary as text
    ax6.axis("off")
    stat_text = (
        f"Count:      {len(series):,}\n"
        f"Mean:       {series.mean():,.2f}\n"
        f"Median:     {series.median():,.2f}\n"
        f"Std Dev:    {series.std():,.2f}\n"
        f"Skewness:   {series.skew():.3f}\n"
        f"Kurtosis:   {series.kurtosis():.3f}\n"
        f"Min:        {series.min():,.2f}\n"
        f"Max:        {series.max():,.2f}\n"
        f"IQR:        {series.quantile(0.75) - series.quantile(0.25):,.2f}\n"
        f"CV:         {series.std() / series.mean():.3f}\n"
        f"Missing:    {series.isna().sum()}"
    )
    ax6.text(0.1, 0.5, stat_text, transform=ax6.transAxes, fontsize=12,
             fontfamily="monospace", verticalalignment="center",
             bbox=dict(boxstyle="round", facecolor="lightyellow", alpha=0.8))
    ax6.set_title("Summary Statistics")

    fig.suptitle(f"Univariate Profile: {name}", fontsize=16, fontweight="bold")
    plt.tight_layout()
    plt.savefig(f"profile_{name}.png", dpi=150, bbox_inches="tight")
    plt.show()

    # Normality test summary
    _, shapiro_p = stats.shapiro(series[:5000])
    _, dagostino_p = stats.normaltest(series)
    print(f"\nNormality Tests for {name}:")
    print(f"  Shapiro-Wilk p-value:      {shapiro_p:.6f}")
    print(f"  D'Agostino-Pearson p-value: {dagostino_p:.6f}")

# Run for each column
for col in df.columns:
    univariate_profile(df[col], col)

Common Distribution Shapes and What They Mean

python
# Generate each distribution shape for visual reference
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
distributions = {
    "Normal": np.random.normal(50, 10, 5000),
    "Right-Skewed": np.random.lognormal(3.5, 0.5, 5000),
    "Left-Skewed": 100 - np.random.lognormal(3.0, 0.5, 5000),
    "Bimodal": np.concatenate([np.random.normal(30, 5, 2500), np.random.normal(70, 5, 2500)]),
    "Uniform": np.random.uniform(0, 100, 5000),
    "Exponential": np.random.exponential(20, 5000),
}

for ax, (name, data) in zip(axes.flat, distributions.items()):
    ax.hist(data, bins=50, density=True, alpha=0.5, color="steelblue", edgecolor="black")
    sns.kdeplot(data, ax=ax, color="crimson", linewidth=2)
    ax.set_title(f"{name}\nskew={pd.Series(data).skew():.2f}, kurt={pd.Series(data).kurtosis():.2f}",
                 fontsize=11)

plt.suptitle("Common Distribution Shapes", fontsize=16, fontweight="bold")
plt.tight_layout()
plt.savefig("distribution_shapes.png", dpi=150, bbox_inches="tight")
plt.show()

Practical Checklist

Before you move on from univariate numerical analysis, make sure you can answer these questions for every numerical column:

  1. Center: What is the typical value? Is mean or median more appropriate?
  2. Spread: How variable is it? Is the spread large relative to the center (high CV)?
  3. Shape: Is it symmetric, skewed, bimodal, or uniform?
  4. Outliers: Are there extreme values? Are they data errors or genuine?
  5. Normality: Does it approximate normal? Does it matter for your downstream analysis?
  6. Bounds: Are there impossible values (negative counts, ages over 200)?
  7. Missing: How many values are missing? Is missingness random or systematic?

Rule of thumb for transformations

If skewness > 1 (or < -1), consider a log or Box-Cox transform before modeling. If kurtosis > 3 (excess > 0), your data has heavier tails than normal — robust methods or outlier-aware models may be needed.

Key Takeaways

  • Always start with describe() extended with skewness, kurtosis, IQR, and coefficient of variation.
  • Use Freedman-Diaconis bins as your histogram default. Manually try 2-3 bin counts to confirm.
  • Overlay KDE on histograms for smooth density estimation, but watch for boundary and gap artifacts.
  • QQ plots reveal how your data departs from normality better than any hypothesis test.
  • Formal normality tests are useful for small samples but over-reject at large n. Use them alongside visual evidence.
  • Box plots are excellent for comparison but dangerously hide multimodality. Pair with violin plots or histograms.
  • Outlier detection method should match your distribution shape: IQR/z-score for symmetric data, MAD for skewed data.

Try It Yourself

Exercise 1: You have a dataset of 5,000 employee salaries. The mean is $85,000, the median is $62,000, and the skewness is 2.3. Without plotting anything, diagnose the distribution shape, choose the appropriate measure of center, and recommend a transformation.

Solution
python
import numpy as np
import pandas as pd
from scipy import stats

# Diagnosis from the numbers alone:
# 1. Mean ($85K) >> Median ($62K) -> gap is 37%, strongly right-skewed
# 2. Skewness = 2.3 (> 1) -> confirms severe right skew
# 3. The right tail (high earners) is pulling the mean up

print("Diagnosis: Strongly right-skewed distribution")
print(f"Mean/Median gap: {(85000/62000 - 1)*100:.1f}% -> lead with MEDIAN")

# Appropriate measure of center: MEDIAN ($62,000)
# The mean is inflated by high earners and does not represent "typical"

# Recommended transformation: log transform (skewness > 2)
log_salary = np.log(df['salary'])
print(f"\nOriginal skewness: {df['salary'].skew():.2f}")
print(f"Log skewness: {log_salary.skew():.2f}")

# Also try Box-Cox for optimal normalization
bc_salary, bc_lambda = stats.boxcox(df['salary'])
print(f"Box-Cox skewness (lambda={bc_lambda:.2f}): {pd.Series(bc_salary).skew():.2f}")

# Report: "The median salary is $62,000. The mean ($85,000) is inflated
# by the right tail. Log-transformed salary has near-zero skewness
# and is appropriate for parametric modeling."

Exercise 2: Given an array of 2,000 values, write code to produce a complete univariate profile: histogram with Freedman-Diaconis bins, KDE overlay, QQ plot, box plot, and a table of descriptive statistics including skewness, kurtosis, and coefficient of variation.

Solution
python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# data = your 2,000 values
series = pd.Series(data)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Histogram + KDE with Freedman-Diaconis bins
axes[0, 0].hist(series, bins='fd', density=True, alpha=0.5,
                color='steelblue', edgecolor='black')
sns.kdeplot(series, ax=axes[0, 0], color='crimson', linewidth=2)
axes[0, 0].axvline(series.mean(), color='orange', linestyle='--', label=f'Mean: {series.mean():.1f}')
axes[0, 0].axvline(series.median(), color='green', linestyle='-.', label=f'Median: {series.median():.1f}')
axes[0, 0].legend()
axes[0, 0].set_title('Histogram + KDE (Freedman-Diaconis)')

# 2. QQ Plot
stats.probplot(series, dist='norm', plot=axes[0, 1])
axes[0, 1].set_title('QQ Plot (vs Normal)')

# 3. Box Plot
axes[1, 0].boxplot(series, vert=True, patch_artist=True,
                    boxprops=dict(facecolor='steelblue', alpha=0.7))
axes[1, 0].set_title('Box Plot')

# 4. Descriptive Statistics Table
axes[1, 1].axis('off')
stat_text = (
    f"Count:     {len(series):,}\n"
    f"Mean:      {series.mean():,.2f}\n"
    f"Median:    {series.median():,.2f}\n"
    f"Std Dev:   {series.std():,.2f}\n"
    f"Skewness:  {series.skew():.3f}\n"
    f"Kurtosis:  {series.kurtosis():.3f}\n"
    f"IQR:       {series.quantile(0.75) - series.quantile(0.25):,.2f}\n"
    f"CV:        {series.std() / series.mean():.3f}\n"
    f"Min:       {series.min():,.2f}\n"
    f"Max:       {series.max():,.2f}"
)
axes[1, 1].text(0.1, 0.5, stat_text, fontsize=12, fontfamily='monospace',
                verticalalignment='center',
                bbox=dict(boxstyle='round', facecolor='lightyellow'))
axes[1, 1].set_title('Summary Statistics')

plt.suptitle('Complete Univariate Profile', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

Exercise 3: You run a Shapiro-Wilk normality test on a dataset of 10,000 values and get p = 0.0001, but the QQ plot looks nearly perfectly linear with only slight deviation at the tails. Explain why the test and the visual disagree and what you should conclude.

Solution
python
import numpy as np
from scipy import stats

# The disagreement is expected and is a well-known phenomenon:
# With large n (>1000), normality tests have extremely high power
# and will reject the null hypothesis for TRIVIAL departures from normality.

# Demonstration:
np.random.seed(42)
data = np.random.normal(0, 1, 10000)  # perfectly generated normal data
data = data + 0.001 * data**3         # add a TINY cubic distortion

stat, p = stats.shapiro(data[:5000])
print(f"Shapiro-Wilk: W={stat:.6f}, p={p:.6f}")
print(f"Skewness: {pd.Series(data).skew():.4f}")
print(f"Kurtosis: {pd.Series(data).kurtosis():.4f}")

# Conclusion:
print("\n--- What to conclude ---")
print("1. The p=0.0001 means: 'there IS a statistically detectable departure'")
print("2. The QQ plot means: 'the departure is practically negligible'")
print("3. The data is 'close enough to normal' for any practical purpose")
print("4. At n=10,000, even data generated FROM a normal distribution")
print("   can fail the test due to floating-point arithmetic")
print("\nRule: For n > 1000, trust the QQ plot over the p-value.")
print("Use the test for small samples where visual inspection is unreliable.")

Quick Quiz

1. The mean of a dataset is 100 and the median is 72. What does this tell you about the distribution?

  • a) It is left-skewed
  • b) It is right-skewed
  • c) It is bimodal
Answer

b) It is right-skewed. When the mean is substantially larger than the median (here, 39% higher), it means the right tail contains extreme values that pull the mean upward while the median remains at the true center. In this case, the median (72) is the better measure of "typical."

2. Which bin selection rule is most robust to outliers and skew?

  • a) Sturges
  • b) Scott
  • c) Freedman-Diaconis
Answer

c) Freedman-Diaconis. It uses the IQR instead of standard deviation to compute bin width (2 * IQR * n^(-1/3)). Since IQR is based on quartiles, it is not affected by extreme outliers or heavy tails, unlike Scott's rule which uses the standard deviation.

3. A QQ plot curves upward at the right end. What does this indicate?

  • a) The data has a lighter right tail than normal
  • b) The data has a heavier right tail than normal (right-skewed)
  • c) The data is perfectly normal
Answer

b) The data has a heavier right tail than normal (right-skewed). In a QQ plot against the normal distribution, points curving upward at the right mean the data's upper quantiles are larger than the normal distribution would predict. This is the signature of right skew or heavy right tails.

4. What does a high coefficient of variation (CV) indicate?

  • a) The data is normally distributed
  • b) The spread is large relative to the mean
  • c) There are no outliers
Answer

b) The spread is large relative to the mean. CV = std / mean, so a high CV means the standard deviation is a large fraction of the mean. This is useful for comparing variability across features on different scales. For example, a CV of 0.8 means the standard deviation is 80% of the mean, indicating high relative variability.

5. You see a box plot with the median line near the bottom of the box. What does this mean?

  • a) The data is left-skewed
  • b) The data is right-skewed — the upper 50% is more spread out than the lower 50%
  • c) The data is uniformly distributed
Answer

b) The data is right-skewed -- the upper 50% is more spread out than the lower 50%. When the median sits near Q1 (bottom of the box), it means the distance from Q1 to the median is small but the distance from the median to Q3 is large. This asymmetry indicates right skew, where most values cluster near the low end with a long tail stretching upward.

"What I cannot create, I do not understand." — Richard Feynman