EDA Checklist

A systematic 60-item checklist covering every phase of exploratory data analysis. Use it as a template for each new dataset. Not every item applies to every dataset, but scanning the full list makes it much less likely that you skip a critical step.


Phase 0: Before You Start

python
# Template: paste into your notebook header
"""
EDA SESSION LOG
===============
Dataset:     [name]
Source:       [origin, date acquired]
Goal:        [what question are you trying to answer?]
Analyst:     [your name]
Date:        [today]
Environment: Python 3.x, pandas x.x
"""

Checklist Items 1-8: Setup

| # | Item | Status | Notes |
|---|------|--------|-------|
| 1 | Define the analysis goal/question | [ ] | What decision does this EDA support? |
| 2 | Confirm data source and collection method | [ ] | How was data collected? Any known biases? |
| 3 | Verify data access permissions | [ ] | Do you have authorization? |
| 4 | Set up reproducible environment | [ ] | requirements.txt, random seed, Git |
| 5 | Set random seed globally | [ ] | np.random.seed(42) |
| 6 | Check data documentation/data dictionary | [ ] | Column definitions, units, codes |
| 7 | Note the data freshness/time range | [ ] | When was data extracted? |
| 8 | Identify the unit of observation | [ ] | What does one row represent? |

python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

np.random.seed(42)
sns.set_theme(style='whitegrid')

# Item 7: Data freshness
# df = pd.read_csv('data.csv', parse_dates=['date_col'])
# print(f"Date range: {df['date_col'].min()} to {df['date_col'].max()}")
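
Items 4 and 5 can be made concrete with a small helper that records library versions and the seed for the session log. The sketch below is one possible approach, not a prescribed one: `environment_summary` is a hypothetical helper name, and it uses NumPy's `default_rng` Generator as an alternative to the global `np.random.seed` shown above.

```python
import platform

import numpy as np
import pandas as pd


def environment_summary(seed=42):
    """Return versions and the seed so they can be pasted into the session log."""
    return {
        "python": platform.python_version(),
        "numpy": np.__version__,
        "pandas": pd.__version__,
        "seed": seed,
    }


# A local Generator keeps randomness reproducible without relying on global state.
rng = np.random.default_rng(42)
sample_a = rng.integers(0, 100, size=5)

# Re-creating the Generator with the same seed reproduces the same draws.
rng_again = np.random.default_rng(42)
sample_b = rng_again.integers(0, 100, size=5)

env = environment_summary()
print(env)
print("Reproducible draws:", np.array_equal(sample_a, sample_b))
```

Pinning the same versions in requirements.txt and committing the notebook to Git covers the rest of Item 4.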

Phase 1: Data Loading and First Look

Checklist Items 9-18: Structure

| # | Item | Status | Notes |
|---|------|--------|-------|
| 9 | Load data with explicit dtypes | [ ] | pd.read_csv(dtype=...) |
| 10 | Check shape (rows x columns) | [ ] | df.shape |
| 11 | View first/last rows | [ ] | df.head(), df.tail() |
| 12 | Review column names | [ ] | Standardize: lowercase, no spaces |
| 13 | Check data types per column | [ ] | df.dtypes, df.info() |
| 14 | Verify correct parsing (dates, categories) | [ ] | Are dates strings? Are IDs numeric? |
| 15 | Check memory usage | [ ] | df.memory_usage(deep=True).sum() |
| 16 | Count duplicate rows | [ ] | df.duplicated().sum() |
| 17 | Identify the primary key/ID column | [ ] | Is it unique? |
| 18 | Check for constant columns | [ ] | df.nunique() == 1 |

python
def phase1_checklist(df):
    """Run Phase 1 checks automatically."""
    print("PHASE 1: DATA STRUCTURE")
    print("=" * 50)
    print(f"[10] Shape: {df.shape}")
    print(f"[15] Memory: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    print(f"[16] Duplicate rows: {df.duplicated().sum()}")

    # [13] Data types
    print(f"\n[13] Data types:")
    for dtype, count in df.dtypes.value_counts().items():
        print(f"  {dtype}: {count}")

    # [18] Constant columns
    constant = df.columns[df.nunique() <= 1].tolist()
    if constant:
        print(f"\n[18] ALERT: Constant columns: {constant}")

    # [12] Column name issues (str() guards against non-string labels such as integers)
    bad_names = [c for c in df.columns if ' ' in str(c) or str(c) != str(c).lower()]
    if bad_names:
        print(f"\n[12] ALERT: Non-standard column names: {bad_names}")

    # [17] Potential primary keys (all unique)
    potential_pks = [c for c in df.columns if df[c].nunique() == len(df)]
    print(f"\n[17] Potential primary keys: {potential_pks or 'None found'}")

# Usage: phase1_checklist(df)
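
Items 9, 14, and 17 are about loading rather than inspecting, so the function above does not cover them. As a minimal sketch, assuming a hypothetical CSV with the columns `customer_id`, `signup_date`, `plan`, and `monthly_fee`, explicit dtypes might look like this:

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for a real file.
raw_csv = io.StringIO(
    "customer_id,signup_date,plan,monthly_fee\n"
    "00123,2024-01-05,basic,9.99\n"
    "00456,2024-02-17,pro,29.99\n"
    "00789,2024-03-02,basic,9.99\n"
)

# [9] Explicit dtypes stop pandas from guessing; [14] dates are parsed on load.
df = pd.read_csv(
    raw_csv,
    dtype={"customer_id": "string", "plan": "category", "monthly_fee": "float64"},
    parse_dates=["signup_date"],
)

print(df.dtypes)

# IDs read as strings keep their leading zeros.
print(df["customer_id"].tolist())

# [17] A primary key candidate should be unique and contain no nulls.
is_valid_key = bool(df["customer_id"].is_unique and df["customer_id"].notna().all())
print("customer_id is a valid key:", is_valid_key)
```

Reading the ID as a string is a deliberate choice here: inferred as an integer, `00123` would silently become `123`.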

Phase 2: Missing Data

Checklist Items 19-26: Nulls

| # | Item | Status | Notes |
|---|------|--------|-------|
| 19 | Count missing values per column | [ ] | df.isna().sum() |
| 20 | Calculate missing percentages | [ ] | Drop columns > 80% missing? |
| 21 | Visualize missing pattern | [ ] | Nullity matrix, bar chart |
| 22 | Check if missing is MCAR/MAR/MNAR | [ ] | Does missingness correlate with other columns? |
| 23 | Test if missing correlates with target | [ ] | Critical for modeling |
| 24 | Document imputation strategy | [ ] | Median? Group median? Drop? |
| 25 | Verify no NaN-like strings | [ ] | 'N/A', 'null', '', '-', '?' |
| 26 | Check for sentinel values | [ ] | -999, 9999, 0 where 0 is impossible |

python
def phase2_checklist(df, target=None):
    """Run Phase 2 missing data checks."""
    print("PHASE 2: MISSING DATA")
    print("=" * 50)

    # [19-20] Missing counts and percentages
    missing = df.isna().sum()
    missing_pct = (df.isna().mean() * 100).round(2)
    missing_df = pd.DataFrame({'count': missing, 'pct': missing_pct})
    missing_df = missing_df[missing_df['count'] > 0].sort_values('pct', ascending=False)

    if len(missing_df) == 0:
        print("[19] No missing values found")
    else:
        print(f"[19] Columns with missing values: {len(missing_df)}")
        print(missing_df)

    # [25] Check for NaN-like strings
    for col in df.select_dtypes(include='object').columns:
        nan_like = df[col].isin(['', 'N/A', 'NA', 'null', 'NULL', 'None', '-', '?', 'n/a']).sum()
        if nan_like > 0:
            print(f"\n[25] ALERT: '{col}' has {nan_like} NaN-like strings")

    # [26] Sentinel values in numeric columns
    for col in df.select_dtypes(include='number').columns:
        for sentinel in [-999, -1, 9999, 99999]:
            count = (df[col] == sentinel).sum()
            if count > 0:
                print(f"[26] ALERT: '{col}' has {count} values equal to {sentinel}")

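    # [23] Missing vs target: compare the target mean with and without each value.
    # Requires a numeric target; the 0.1 flag threshold suits a 0/1 target.
    if target and target in df.columns and pd.api.types.is_numeric_dtype(df[target]):
        print(f"\n[23] Missing value correlation with target '{target}':")
        for col in missing_df.index.drop(target, errors='ignore'):
            has_val = df.loc[df[col].notna(), target].mean()
            no_val = df.loc[df[col].isna(), target].mean()
            diff = abs(has_val - no_val)
            flag = "***" if diff > 0.1 else ""
            print(f"  {col}: present={has_val:.3f}, missing={no_val:.3f}, diff={diff:.3f} {flag}")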

# Usage: phase2_checklist(df, target='survived')
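
Item 22 stays on the manual list, but a rough first pass can be automated. The sketch below compares each numeric column between rows where another column is missing and rows where it is present. `missingness_vs_numeric` is a hypothetical helper, and the synthetic data is built so that missingness depends on age. A large difference hints that the data are not MCAR; it cannot distinguish MAR from MNAR, which still needs domain knowledge.

```python
import numpy as np
import pandas as pd


def missingness_vs_numeric(df, min_missing=1):
    """Standardized mean difference of each numeric column between rows where
    another column is missing and rows where it is present."""
    numeric_cols = df.select_dtypes(include="number").columns
    rows = []
    for col in df.columns:
        is_missing = df[col].isna()
        # Skip columns with too few missing values or with nothing observed.
        if is_missing.sum() < min_missing or is_missing.all():
            continue
        for other in numeric_cols:
            if other == col:
                continue
            std = df[other].std()
            if not std or np.isnan(std):
                continue
            diff = df.loc[is_missing, other].mean() - df.loc[~is_missing, other].mean()
            rows.append({"missing_col": col, "compared_col": other, "std_mean_diff": diff / std})
    return pd.DataFrame(rows, columns=["missing_col", "compared_col", "std_mean_diff"])


# Synthetic example: income is missing only for respondents older than 65.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"age": rng.integers(18, 80, size=200).astype(float)})
demo["income"] = rng.normal(50_000, 10_000, size=200)
demo.loc[demo["age"] > 65, "income"] = np.nan

report = missingness_vs_numeric(demo)
print(report)
```

Here the rows with missing income have a clearly higher mean age, so dropping them or imputing with a global median would bias any age-related analysis.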

Phase 3: Univariate Analysis

Checklist Items 27-36: Single Variables

| # | Item | Status | Notes |
|---|------|--------|-------|
| 27 | Describe all numeric columns | [ ] | df.describe() |
| 28 | Check skewness of each numeric column | [ ] | abs(skew) > 1 needs attention |
| 29 | Check kurtosis (heavy tails?) | [ ] | Excess kurtosis > 0 (raw kurtosis > 3) = heavier tails than normal |
| 30 | Identify outliers (IQR and z-score) | [ ] | How many? Real or errors? |
| 31 | Plot histograms for all numerics | [ ] | Bin size matters |
| 32 | Plot box plots for all numerics | [ ] | See outliers |
| 33 | Count value frequencies for categoricals | [ ] | value_counts() |
| 34 | Check cardinality of categoricals | [ ] | High-cardinality = potential ID |
| 35 | Check for rare categories | [ ] | Categories with < 1% frequency |
| 36 | Test normality where needed | [ ] | Shapiro-Wilk, QQ plot |

python
def phase3_checklist(df):
    """Run Phase 3 univariate checks."""
    print("PHASE 3: UNIVARIATE ANALYSIS")
    print("=" * 50)

    numeric = df.select_dtypes(include='number')

    # [28] Skewness alerts
    skew = numeric.skew()
    skewed = skew[abs(skew) > 1]
    if len(skewed) > 0:
        print("[28] Highly skewed columns (|skew| > 1):")
        for col, s in skewed.items():
            print(f"  {col}: skew={s:.2f}")

    # [30] Outlier counts (IQR method)
    print("\n[30] Outlier counts (IQR method):")
    for col in numeric.columns:
        q1, q3 = numeric[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        n_outliers = ((numeric[col] < q1 - 1.5 * iqr) | (numeric[col] > q3 + 1.5 * iqr)).sum()
        pct = n_outliers / len(numeric) * 100
        flag = "***" if pct > 5 else ""
        print(f"  {col:<25} {n_outliers:>6} ({pct:.1f}%) {flag}")

    # [34] Cardinality of categoricals
    categorical = df.select_dtypes(include=['object', 'category'])
    if len(categorical.columns) > 0:
        print("\n[34] Categorical cardinality:")
        for col in categorical.columns:
            n_unique = df[col].nunique()
            ratio = n_unique / len(df)
            flag = "HIGH-CARD" if ratio > 0.5 else ""
            print(f"  {col:<25} {n_unique:>6} unique ({ratio:.1%}) {flag}")

    # [35] Rare categories
    print("\n[35] Rare categories (< 1%):")
    for col in categorical.columns:
        vc = df[col].value_counts(normalize=True)
        rare = vc[vc < 0.01]
        if len(rare) > 0:
            print(f"  {col}: {len(rare)} rare categories")

# Usage: phase3_checklist(df)
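
The function above covers the IQR half of Item 30 but not the z-score half, and it does not report kurtosis (Item 29). A possible complement is sketched below; `tail_and_zscore_report` and the threshold of 3 standard deviations are illustrative choices, not part of the checklist itself.

```python
import numpy as np
import pandas as pd


def tail_and_zscore_report(df, z_threshold=3.0):
    """Excess kurtosis and z-score outlier counts for each numeric column."""
    numeric = df.select_dtypes(include="number")
    rows = []
    for col in numeric.columns:
        series = numeric[col].dropna()
        std = series.std()
        # Kurtosis needs at least 4 values; constant columns have no z-scores.
        if len(series) < 4 or std == 0:
            continue
        z = (series - series.mean()) / std
        rows.append({
            "column": col,
            "excess_kurtosis": series.kurt(),  # pandas reports excess kurtosis (normal = 0)
            "n_z_outliers": int((z.abs() > z_threshold).sum()),
        })
    columns = ["column", "excess_kurtosis", "n_z_outliers"]
    return pd.DataFrame(rows, columns=columns).set_index("column")


# Synthetic example: the same normal sample with and without one extreme value.
rng = np.random.default_rng(1)
demo = pd.DataFrame({"normal_like": rng.normal(0, 1, 1000)})
demo["with_spike"] = demo["normal_like"].copy()
demo.loc[0, "with_spike"] = 25.0

report = tail_and_zscore_report(demo)
print(report)
```

Z-scores rely on the mean and standard deviation, which outliers themselves inflate, so it is worth comparing these counts with the IQR counts rather than trusting either alone.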

Phase 4: Bivariate Analysis

Checklist Items 37-44: Relationships

| # | Item | Status | Notes |
|---|------|--------|-------|
| 37 | Compute correlation matrix | [ ] | Pearson and Spearman |
| 38 | Identify highly correlated pairs | [ ] | abs(r) > 0.7 |
| 39 | Check target correlations | [ ] | Which features predict the target? |
| 40 | Scatter plots for top correlations | [ ] | Confirm linearity |
| 41 | Box plots: numeric vs categorical | [ ] | Compare distributions |
| 42 | Chi-squared for categorical pairs | [ ] | Independence test |
| 43 | Cross-tabulation for key pairs | [ ] | pd.crosstab() |
| 44 | Check for nonlinear relationships | [ ] | Spearman vs Pearson difference |

python
def phase4_checklist(df, target=None):
    """Run Phase 4 bivariate checks."""
    print("PHASE 4: BIVARIATE ANALYSIS")
    print("=" * 50)

    numeric = df.select_dtypes(include='number')

    # [37-38] Correlation
    if len(numeric.columns) >= 2:
        corr = numeric.corr()
        print("[38] Highly correlated pairs (|r| > 0.7):")
        for i in range(len(corr)):
            for j in range(i+1, len(corr)):
                r = corr.iloc[i, j]
                if abs(r) > 0.7:
                    print(f"  {corr.columns[i]} x {corr.columns[j]}: r={r:+.3f}")

    # [39] Target correlations
    # Recomputed here because `corr` above exists only when there are >= 2 numeric columns.
    if target and target in numeric.columns and len(numeric.columns) >= 2:
        target_corr = numeric.corr()[target].drop(target).sort_values(key=abs, ascending=False)
        print(f"\n[39] Correlation with '{target}':")
        for col, r in target_corr.head(10).items():
            print(f"  {col:<25} r={r:+.3f}")

    # [44] Nonlinear relationships
    if len(numeric.columns) >= 2:
        pearson = numeric.corr(method='pearson')
        spearman = numeric.corr(method='spearman')
        diff = (spearman - pearson).abs()
        print("\n[44] Potential nonlinear relationships (|Spearman - Pearson| > 0.1):")
        for i in range(len(diff)):
            for j in range(i+1, len(diff)):
                if diff.iloc[i, j] > 0.1:
                    print(f"  {diff.columns[i]} x {diff.columns[j]}: "
                          f"Pearson={pearson.iloc[i,j]:.3f}, Spearman={spearman.iloc[i,j]:.3f}")

# Usage: phase4_checklist(df, target='target_column')
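
Items 42 and 43 concern categorical pairs, which the correlation-based function above does not touch. The sketch below computes the chi-squared statistic and Cramér's V directly from `pd.crosstab`; `chi2_and_cramers_v` is a hypothetical helper and the data is synthetic. It applies no continuity correction and returns no p-value; `scipy.stats.chi2_contingency` is the usual route when a formal test is needed.

```python
import numpy as np
import pandas as pd


def chi2_and_cramers_v(df, col_a, col_b):
    """Chi-squared statistic and Cramér's V for two categorical columns."""
    observed = pd.crosstab(df[col_a], df[col_b]).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: row total * column total / n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    cramers_v = np.sqrt(chi2 / (n * min_dim)) if min_dim > 0 else np.nan
    return chi2, cramers_v


demo = pd.DataFrame({
    "plan": ["basic", "basic", "pro", "pro"] * 25,
    "support_tier": ["email", "email", "phone", "phone"] * 25,  # fully determined by plan
    "region": ["north", "south", "north", "south"] * 25,        # balanced across plans
})

chi2_dep, v_dep = chi2_and_cramers_v(demo, "plan", "support_tier")
chi2_ind, v_ind = chi2_and_cramers_v(demo, "plan", "region")
print(f"plan x support_tier: chi2={chi2_dep:.1f}, V={v_dep:.2f}")
print(f"plan x region:       chi2={chi2_ind:.1f}, V={v_ind:.2f}")
```

Cramér's V ranges from 0 (no association) to 1 (one column determines the other), which makes pairs comparable even when their tables differ in size.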

Phase 5: Multivariate and Feature Engineering

Checklist Items 45-52: Deeper Analysis

| # | Item | Status | Notes |
|---|------|--------|-------|
| 45 | Check for multicollinearity (VIF) | [ ] | VIF > 5 is concerning |
| 46 | Create interaction features | [ ] | Product of correlated predictors |
| 47 | Bin continuous variables | [ ] | pd.cut(), pd.qcut() |
| 48 | Encode categorical variables | [ ] | One-hot, ordinal, target encoding |
| 49 | Create time-based features | [ ] | Day of week, month, lag features |
| 50 | PCA / dimensionality analysis | [ ] | How many components explain 95%? |
| 51 | Cluster analysis (exploratory) | [ ] | Natural groupings? |
| 52 | Cross-tabulation with target | [ ] | Interactions between features |

python
def phase5_checklist(df):
    """Run Phase 5 multivariate checks."""
    print("PHASE 5: MULTIVARIATE")
    print("=" * 50)

    numeric = df.select_dtypes(include='number')

    # [45] VIF (Variance Inflation Factor)
    # VIF_j is the j-th diagonal element of the inverse correlation matrix.
    if len(numeric.columns) >= 2:
        clean = numeric.dropna()
        # Constant columns have zero variance, which makes correlations undefined (NaN)
        clean = clean.loc[:, clean.nunique() > 1]
        if len(clean) > 2 and clean.shape[1] >= 2:
            try:
                inv_corr = np.linalg.inv(clean.corr().values)
                vif = pd.Series(np.diag(inv_corr), index=clean.columns)
                high_vif = vif[vif > 5]
                if len(high_vif) > 0:
                    print("[45] High VIF columns (VIF > 5):")
                    for col, v in high_vif.items():
                        print(f"  {col}: VIF={v:.1f}")
                else:
                    print("[45] No multicollinearity issues (all VIF <= 5)")
            except np.linalg.LinAlgError:
                print("[45] Could not compute VIF (singular matrix)")

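    # [50] PCA dimensionality
    if len(numeric.columns) >= 3:
        clean = numeric.dropna()
        clean = clean.loc[:, clean.nunique() > 1]  # constant columns cannot be scaled
        if len(clean) > 10 and clean.shape[1] >= 2:
            # Standardize so that large-scale columns do not dominate the components
            scaled = (clean - clean.mean()) / clean.std()
            s = np.linalg.svd(scaled.values, compute_uv=False)
            explained = (s ** 2) / (s ** 2).sum()
            cumulative = np.cumsum(explained)
            # Number of components needed to reach at least 95% cumulative variance
            n_95 = (cumulative < 0.95).sum() + 1
            print(f"\n[50] PCA: {n_95}/{clean.shape[1]} components explain 95% variance")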

# Usage: phase5_checklist(df)
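
Items 47 to 49 are transformations rather than checks, so they are easier to show than to automate. The following is a small sketch on a made-up frame; the column names and bin edges are assumptions chosen for illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-01", "2024-01-06", "2024-02-14", "2024-03-31"]),
    "age": [22, 37, 51, 68],
    "channel": ["web", "store", "web", "app"],
})

# [47] Bins with explicit edges (pd.cut) and equal-frequency bins (pd.qcut)
demo["age_band"] = pd.cut(demo["age"], bins=[0, 30, 50, 120], labels=["<=30", "31-50", "51+"])
demo["age_half"] = pd.qcut(demo["age"], q=2, labels=["lower", "upper"])

# [48] One-hot encode a low-cardinality categorical
encoded = pd.get_dummies(demo, columns=["channel"], prefix="channel")

# [49] Calendar features from a datetime column
encoded["day_of_week"] = encoded["order_date"].dt.dayofweek  # Monday = 0
encoded["month"] = encoded["order_date"].dt.month
encoded["is_weekend"] = encoded["day_of_week"] >= 5

print(encoded)
```

During EDA these derived columns are mainly useful for plotting and grouping. Target encoding and lag features depend on the target or on row order, so they should be fitted on training data only to avoid the leakage that Item 58 warns about.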

Phase 6: Data Quality and Integrity

Checklist Items 53-58: Quality Assurance

| # | Item | Status | Notes |
|---|------|--------|-------|
| 53 | Verify value ranges make sense | [ ] | Age: 0-120, percentages: 0-100 |
| 54 | Check referential integrity (joins) | [ ] | All FKs have matching PKs? |
| 55 | Look for data entry errors | [ ] | Typos in categoricals |
| 56 | Check temporal consistency | [ ] | End date > start date? |
| 57 | Validate business rules | [ ] | Total = qty x price? |
| 58 | Check for data leakage features | [ ] | Future info in training data? |
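
Phase 6 has no automated function because the rules depend on the dataset, but the pattern is the same each time: express each rule as a boolean mask and count the violations. The sketch below assumes a hypothetical `orders` table with a foreign key into a `customers` table; every column name in it is illustrative.

```python
import pandas as pd


def phase6_sketch(orders, customers):
    """Count rows that violate each rule (hypothetical schema)."""
    issues = {}
    # [53] Value ranges: quantities must be positive
    issues["quantity_not_positive"] = int((orders["quantity"] <= 0).sum())
    # [54] Referential integrity: every order must point at a known customer
    orphan = ~orders["customer_id"].isin(customers["customer_id"])
    issues["orphan_customer_id"] = int(orphan.sum())
    # [56] Temporal consistency: shipping cannot precede ordering
    issues["shipped_before_ordered"] = int((orders["shipped_at"] < orders["ordered_at"]).sum())
    # [57] Business rule: total = quantity x unit_price (small tolerance for rounding)
    expected_total = orders["quantity"] * orders["unit_price"]
    issues["total_mismatch"] = int(((orders["total"] - expected_total).abs() > 0.01).sum())
    return issues


customers = pd.DataFrame({"customer_id": [1, 2, 3]})
orders = pd.DataFrame({
    "customer_id": [1, 2, 9, 3],
    "quantity": [2, 0, 1, 3],
    "unit_price": [5.0, 3.0, 4.0, 2.0],
    "total": [10.0, 0.0, 4.0, 7.0],
    "ordered_at": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05", "2024-01-06"]),
    "shipped_at": pd.to_datetime(["2024-01-03", "2024-01-04", "2024-01-04", "2024-01-08"]),
})

issues = phase6_sketch(orders, customers)
print(issues)
```

Counting violations is only the first step: Item 53 in particular still needs a domain expert to confirm which ranges are plausible.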

Phase 7: Summary and Communication

Checklist Items 59-60: Deliverables

| # | Item | Status | Notes |
|---|------|--------|-------|
| 59 | Write findings summary (top 5-10 insights) | [ ] | Plain language, actionable |
| 60 | Document next steps and recommendations | [ ] | What further analysis is needed? |

Full Automated Checklist Runner

python
def run_full_eda_checklist(df, target=None):
    """Run all automated checklist items."""
    print("=" * 60)
    print("FULL EDA CHECKLIST REPORT")
    print("=" * 60)
    print(f"Dataset: {df.shape[0]:,} rows x {df.shape[1]} columns")
    print(f"Target: {target or 'Not specified'}")
    print()

    phase1_checklist(df)
    print()
    phase2_checklist(df, target)
    print()
    phase3_checklist(df)
    print()
    phase4_checklist(df, target)
    print()
    phase5_checklist(df)

    print("\n" + "=" * 60)
    print("MANUAL ITEMS REMAINING")
    print("=" * 60)
    manual_items = [
        "[1] Define analysis goal",
        "[2] Confirm data source",
        "[6] Review data dictionary",
        "[22] Classify missing mechanism (MCAR/MAR/MNAR)",
        "[40] Create scatter plots for top correlations",
        "[51] Run cluster analysis if appropriate",
        "[53] Validate value ranges with domain expert",
        "[57] Check business rule consistency",
        "[58] Check for data leakage",
        "[59] Write findings summary",
        "[60] Document next steps",
    ]
    for item in manual_items:
        print(f"  [ ] {item}")

# Usage:
# run_full_eda_checklist(df, target='survived')

Key Takeaways

  • This checklist is a template, not a rigid procedure — skip items that do not apply to your dataset
  • The automated runner covers only the mechanical checks; the items it lists as manual require domain knowledge and judgment
  • Items 22, 30, and 58 (missing mechanism, outlier investigation, leakage detection) are the most commonly skipped and most impactful
  • Run the checklist iteratively — insights from Phase 4 often send you back to Phase 2 for re-imputation
  • Always end with written findings (Item 59) — code without a summary is incomplete EDA
  • Keep the checklist in your notebook template so it becomes automatic

"What I cannot create, I do not understand." — Richard Feynman