
Pandera Schema Validation

Pandera is a lightweight, Pythonic library for validating pandas DataFrames. Where Great Expectations is a full platform with configuration, checkpoints, and HTML reports, Pandera integrates directly into your Python code. You define schemas as Python objects or classes, validate a DataFrame with a single call, and get clear error messages when validation fails. For most projects it is the shortest path from "no validation" to "validated pipeline."


Why Pandera over Great Expectations

| Aspect | Pandera | Great Expectations |
| --- | --- | --- |
| Learning curve | 15 minutes | 2-3 hours |
| Setup | pip install pandera | Project initialization |
| Integration style | Python code, decorators | Configuration files, checkpoints |
| Best for | In-code validation | Platform-level validation |
| HTML reports | No (use with pytest) | Yes (Data Docs) |
| Complexity | Simple | Full-featured |

Use Pandera when: you want validation embedded in your Python pipeline code. Use Great Expectations when: you need a validation platform with reporting and alerting.


DataFrameSchema Basics

python
# schema_basics.py — Define and validate DataFrame schemas
import pandera as pa
from pandera import Column, Check, Index, DataFrameSchema
import pandas as pd
import numpy as np


# Method 1: Object-based schema definition
product_schema = DataFrameSchema(
    columns={
        "id": Column(
            int,
            checks=[
                Check.greater_than(0),
            ],
            unique=True,  # Uniqueness is a Column argument, not a built-in Check
            nullable=False,
        ),
        "name": Column(
            str,
            checks=[
                Check.str_length(min_value=1, max_value=500),
            ],
            nullable=False,
        ),
        "price": Column(
            float,
            checks=[
                Check.greater_than_or_equal_to(0),
                Check.less_than(1_000_000),
            ],
            nullable=False,
        ),
        "category": Column(
            str,
            checks=[
                Check.isin(["electronics", "clothing", "food", "books", "home"]),
            ],
            nullable=True,
        ),
        "rating": Column(
            float,
            checks=[
                Check.in_range(0, 5),
            ],
            nullable=True,
        ),
        "created_at": Column(
            "datetime64[ns]",
            nullable=False,
        ),
    },
    # DataFrame-level checks
    checks=[
        Check(lambda df: len(df) > 0, error="DataFrame must not be empty"),
        Check(lambda df: df["price"].mean() < 10000, error="Average price too high"),
    ],
    index=Index(int),
    coerce=True,  # Auto-coerce types before validation
    strict=False,  # Allow extra columns not in schema
)


# Validate
df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Widget", "Gadget", "Doohickey"],
    "price": [19.99, 29.99, 9.99],
    "category": ["electronics", "electronics", "home"],
    "rating": [4.5, 3.8, 4.2],
    "created_at": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
})

validated_df = product_schema.validate(df)
print("Validation passed!")


# Method 2: Class-based schema (recommended for complex schemas)
class ProductSchema(pa.DataFrameModel):
    """Schema for product data using class-based API."""

    id: pa.typing.Series[int] = pa.Field(gt=0, unique=True, nullable=False)
    name: pa.typing.Series[str] = pa.Field(str_length={"min_value": 1, "max_value": 500}, nullable=False)
    price: pa.typing.Series[float] = pa.Field(ge=0, lt=1_000_000, nullable=False)
    category: pa.typing.Series[str] = pa.Field(
        isin=["electronics", "clothing", "food", "books", "home"],
        nullable=True,
    )
    rating: pa.typing.Series[float] = pa.Field(ge=0, le=5, nullable=True)
    created_at: pa.typing.Series[pa.DateTime] = pa.Field(nullable=False)

    class Config:
        coerce = True
        strict = False

    @pa.check("price")
    def price_not_suspiciously_round(cls, series: pd.Series) -> bool:
        """Fail if every price is a whole number (possible placeholder values)."""
        # Column-level result: passes when at least one price has a fractional part
        return bool((series % 1 != 0).any())

    @pa.dataframe_check
    def at_least_one_row(cls, df: pd.DataFrame) -> bool:
        return len(df) > 0


# Validate with class-based schema
validated = ProductSchema.validate(df)

Column Checks

python
# column_checks.py — Every type of column check
import pandera as pa
from pandera import Column, Check
import pandas as pd
import numpy as np


# Built-in checks
comprehensive_schema = pa.DataFrameSchema({
    # Numeric checks
    "age": Column(int, [
        Check.greater_than_or_equal_to(0),
        Check.less_than_or_equal_to(150),
        Check.not_equal_to(-1),  # Sentinel value check
    ]),

    "score": Column(float, [
        Check.in_range(0, 100),
        Check(lambda s: s.std() > 0, error="Score has zero variance"),
    ]),

    # String checks
    "email": Column(str, [
        Check.str_matches(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"),
        Check.str_length(min_value=5, max_value=254),
    ]),

    "phone": Column(str, [
        Check.str_matches(r"^\+?1?\d{10,11}$"),
    ], nullable=True),

    "country_code": Column(str, [
        Check.str_length(min_value=2, max_value=2),
        Check.str_matches(r"^[A-Z]{2}$"),
    ]),

    # Categorical checks
    "status": Column(str, [
        Check.isin(["active", "inactive", "pending", "suspended"]),
    ]),

    # Date checks
    "created_at": Column("datetime64[ns]", [
        Check.greater_than(pd.Timestamp("2020-01-01")),
        Check.less_than(pd.Timestamp("2030-01-01")),
    ]),

    # Custom lambda checks
    "balance": Column(float, [
        Check(
            lambda s: (s >= 0).mean() > 0.95,
            error="More than 5% of balances are negative",
        ),
    ]),
})


# Custom reusable checks
def check_no_whitespace_padding(series: pd.Series) -> pd.Series:
    """Check that strings have no leading/trailing whitespace."""
    return series.apply(
        lambda x: x == x.strip() if isinstance(x, str) else True
    )

def check_monotonically_increasing(series: pd.Series) -> bool:
    """Check that values are monotonically increasing."""
    return series.is_monotonic_increasing

no_whitespace = Check(check_no_whitespace_padding, element_wise=False)
monotonic = Check(check_monotonically_increasing, element_wise=False)

# Apply custom checks
clean_schema = pa.DataFrameSchema({
    "name": Column(str, [no_whitespace]),
    "timestamp": Column("datetime64[ns]", [monotonic]),
})

Hypothesis Tests

python
# hypothesis_tests.py — Statistical hypothesis testing in schemas
import pandera as pa
from pandera import Column, Check, Hypothesis
import pandas as pd
import numpy as np


schema_with_hypotheses = pa.DataFrameSchema({
    "group": Column(str, Check.isin(["control", "treatment"])),
    # Two-sample t-test: treatment group should not differ
    # significantly from control (passes when p >= 0.05).
    # With groupby, sample1/sample2 name the groups to compare.
    # Requires scipy: pip install 'pandera[hypotheses]'
    "value": Column(float, Hypothesis.two_sample_ttest(
        sample1="control",
        sample2="treatment",
        groupby="group",
        relationship="equal",
        alpha=0.05,
    )),
    "category": Column(str, Check.isin(["A", "B", "C"])),
})


# Custom hypothesis checks
class ExperimentSchema(pa.DataFrameModel):
    group: pa.typing.Series[str] = pa.Field(isin=["control", "treatment"])
    conversion: pa.typing.Series[int] = pa.Field(isin=[0, 1])
    revenue: pa.typing.Series[float] = pa.Field(ge=0)

    @pa.dataframe_check
    def balanced_groups(cls, df: pd.DataFrame) -> bool:
        """Groups should be approximately balanced (within 10%)."""
        counts = df["group"].value_counts()
        if len(counts) < 2:
            return False
        ratio = counts.min() / counts.max()
        return ratio > 0.9

    @pa.dataframe_check
    def minimum_sample_size(cls, df: pd.DataFrame) -> bool:
        """Each group should have at least 100 observations."""
        counts = df["group"].value_counts()
        return counts.min() >= 100

Schema Inference

python
# schema_inference.py — Auto-generate schemas from data
import pandera as pa
import pandas as pd
import json


def infer_and_export_schema(df: pd.DataFrame) -> pa.DataFrameSchema:
    """
    Infer a schema from data — starting point for refinement.

    WARNING: Inferred schemas are a starting point only.
    Always review and tighten constraints manually.
    """
    schema = pa.infer_schema(df)

    # Print the inferred schema as Python code
    print(schema.to_script())

    # Export as YAML for version control
    yaml_str = schema.to_yaml()
    print(yaml_str)

    return schema


def schema_from_yaml(yaml_path: str) -> pa.DataFrameSchema:
    """Load schema from YAML file (version-controlled)."""
    return pa.DataFrameSchema.from_yaml(yaml_path)


# Example YAML schema
YAML_SCHEMA = """
schema_type: dataframe
version: 0.18.0
columns:
  id:
    dtype: int64
    nullable: false
    checks:
      greater_than: 0
    unique: true
  name:
    dtype: str
    nullable: false
    checks:
      str_length:
        min_value: 1
        max_value: 500
  price:
    dtype: float64
    nullable: false
    checks:
      greater_than_or_equal_to: 0
      less_than: 1000000
  category:
    dtype: str
    nullable: true
    checks:
      isin:
        - electronics
        - clothing
        - food
        - books
coerce: true
strict: false
"""

Lazy Validation

python
# lazy_validation.py — Collect all errors instead of failing on first
import pandera as pa
from pandera import Column, Check
from pandera.errors import SchemaErrors
import pandas as pd


schema = pa.DataFrameSchema({
    "id": Column(int, Check.greater_than(0), nullable=False),
    "name": Column(str, Check.str_length(min_value=1), nullable=False),
    "price": Column(float, Check.greater_than_or_equal_to(0), nullable=False),
    "email": Column(str, Check.str_matches(r".+@.+\..+"), nullable=True),
})


# Bad data with multiple issues
bad_df = pd.DataFrame({
    "id": [1, -2, 3, 0],         # -2 and 0 violate > 0
    "name": ["Widget", "", None, "OK"],  # "" and None violate constraints
    "price": [19.99, -5.0, 29.99, 0],   # -5.0 violates >= 0
    "email": ["a@b.com", "invalid", None, "c@d.com"],  # "invalid" fails regex
})


# Lazy validation: collect ALL errors
try:
    schema.validate(bad_df, lazy=True)
except SchemaErrors as e:
    print(f"Found {len(e.failure_cases)} validation failures:\n")

    # DataFrame of all failures
    failure_df = e.failure_cases
    print(failure_df.to_string())

    # Structured access to errors
    for _, failure in failure_df.iterrows():
        print(
            f"  Column: {failure.get('column', 'N/A')}, "
            f"  Check: {failure.get('check', 'N/A')}, "
            f"  Index: {failure.get('index', 'N/A')}, "
            f"  Value: {failure.get('failure_case', 'N/A')}"
        )

    # Get the error DataFrame for programmatic handling
    # e.failure_cases is a DataFrame with columns:
    # schema_context, column, check, check_number, failure_case, index

Decorator-Based Validation

python
# decorator_validation.py — Validate function inputs/outputs automatically
import pandera as pa
from pandera.typing import DataFrame
import pandas as pd


class InputSchema(pa.DataFrameModel):
    """Schema for raw input data."""
    id: pa.typing.Series[int] = pa.Field(gt=0)
    name: pa.typing.Series[str] = pa.Field(nullable=False)
    price: pa.typing.Series[str] = pa.Field(nullable=False)  # String (raw)


class OutputSchema(pa.DataFrameModel):
    """Schema for cleaned output data."""
    id: pa.typing.Series[int] = pa.Field(gt=0, unique=True)
    name: pa.typing.Series[str] = pa.Field(str_length={"min_value": 1})
    price: pa.typing.Series[float] = pa.Field(ge=0)
    name_length: pa.typing.Series[int] = pa.Field(ge=0)


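# check_input/check_output are documented for schema objects,
# so convert the model classes with to_schema()
@pa.check_input(InputSchema.to_schema())
@pa.check_output(OutputSchema.to_schema())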
def clean_products(df: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    """
    Clean product data.
    Input is validated against InputSchema before execution.
    Output is validated against OutputSchema after execution.
    If either fails, a SchemaError is raised.
    """
    result = df.copy()
    result["name"] = result["name"].str.strip().str.title()
    result["price"] = pd.to_numeric(result["price"], errors="coerce")
    result = result.dropna(subset=["price"])
    result = result[result["price"] >= 0]
    result = result.drop_duplicates(subset=["id"])
    result["name_length"] = result["name"].str.len()
    return result


# The decorators automatically validate on every call
df_raw = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["  widget  ", "Gadget", "Doohickey"],
    "price": ["19.99", "29.99", "invalid"],
})

df_clean = clean_products(df_raw)  # Validates input AND output

Integration with Polars

python
# polars_validation.py — Pandera with Polars DataFrames
import pandera.polars as pa
import polars as pl


class PolarsProductSchema(pa.DataFrameModel):
    """Schema for validating Polars DataFrames."""
    id: int = pa.Field(gt=0, unique=True)
    name: str = pa.Field(nullable=False)
    price: float = pa.Field(ge=0, lt=1_000_000)
    category: str = pa.Field(
        isin=["electronics", "clothing", "food"],
        nullable=True,
    )

    class Config:
        coerce = True


# Validate a Polars DataFrame
df_polars = pl.DataFrame({
    "id": [1, 2, 3],
    "name": ["Widget", "Gadget", "Thing"],
    "price": [19.99, 29.99, 9.99],
    "category": ["electronics", "clothing", "food"],
})

validated = PolarsProductSchema.validate(df_polars)

CI Pipeline Integration

python
# test_data_schemas.py — pytest tests for data validation
import pytest
import pandera as pa
from pandera.errors import SchemaError, SchemaErrors
import pandas as pd
from my_pipeline.schemas import ProductSchema, OrderSchema
from my_pipeline.transforms import clean_products


class TestProductSchema:
    """Test product data schema validation."""

    def test_valid_data_passes(self):
        """Valid data should pass validation."""
        df = pd.DataFrame({
            "id": [1, 2, 3],
            "name": ["Widget", "Gadget", "Thing"],
            "price": [19.99, 29.99, 9.99],
            "category": ["electronics", "clothing", "food"],
            "rating": [4.5, 3.8, 4.2],  # nullable, but still a required column
            "created_at": pd.to_datetime(["2024-01-01"] * 3),
        })
        validated = ProductSchema.validate(df)
        assert len(validated) == 3

    def test_negative_price_fails(self):
        """Negative prices should fail validation."""
        df = pd.DataFrame({
            "id": [1],
            "name": ["Widget"],
            "price": [-5.0],
            "category": ["electronics"],
            "rating": [4.5],  # present so that only the price violates the schema
            "created_at": pd.to_datetime(["2024-01-01"]),
        })
        with pytest.raises(SchemaError):
            ProductSchema.validate(df)

    def test_null_name_fails(self):
        """Null names should fail validation."""
        df = pd.DataFrame({
            "id": [1],
            "name": [None],
            "price": [19.99],
            "category": ["electronics"],
            "rating": [4.5],  # present so that only the name violates the schema
            "created_at": pd.to_datetime(["2024-01-01"]),
        })
        with pytest.raises(SchemaError):
            ProductSchema.validate(df)

    def test_transform_produces_valid_output(self):
        """Transformation should always produce schema-valid output."""
        raw = pd.DataFrame({
            "id": [1, 2, 2],  # Has duplicate
            "name": ["  Widget  ", "Gadget", "Gadget"],
            "price": ["19.99", "invalid", "29.99"],
            "category": ["electronics", "clothing", "clothing"],
            "rating": [4.5, 3.8, 3.8],  # nullable, but still a required column
            "created_at": ["2024-01-01", "2024-01-02", "2024-01-02"],
        })
        result = clean_products(raw)
        # Should not raise
        ProductSchema.validate(result)


# Run with: pytest test_data_schemas.py -v

Quick Reference

| Check | Meaning | Description |
| --- | --- | --- |
| Check.gt(0) | Greater than 0 | Numeric lower bound (exclusive) |
| Check.ge(0) | Greater than or equal to 0 | Numeric lower bound (inclusive) |
| Check.lt(100) | Less than 100 | Numeric upper bound (exclusive) |
| Check.in_range(0, 100) | Between 0 and 100 | Numeric range |
| Check.isin(["a", "b"]) | In set | Categorical membership |
| Check.str_matches(r"...") | Regex match | String pattern |
| Check.str_length(min, max) | String length | Length bounds |
| Column(..., unique=True) or Field(unique=True) | All values unique | Uniqueness (a column argument, not a Check) |
| Check(lambda s: ...) | Custom check | Any custom logic |
| @pa.dataframe_check | DataFrame-level | Cross-column checks |

| Feature | Syntax |
| --- | --- |
| Coerce types | coerce=True in Config |
| Allow extra columns | strict=False |
| Collect all errors | schema.validate(df, lazy=True) |
| Input validation | @pa.check_input(Schema) |
| Output validation | @pa.check_output(Schema) |
| YAML export | schema.to_yaml() |
| Schema inference | pa.infer_schema(df) |

Key Takeaway

  • Pandera embeds validation directly in Python code with decorators and type annotations, making schemas part of the function signature rather than external configuration.
  • Lazy validation (lazy=True) collects all errors in one pass instead of failing on the first violation, making debugging faster.
  • Schema inference generates a starting schema from a DataFrame sample, which you then tighten with domain-specific constraints.
Exercise

Create a Pandera Schema for Customer Data

Define a Pandera DataFrameModel that validates:

  1. customer_id: positive integer, unique, not nullable.
  2. email: string matching a basic email regex, not nullable.
  3. age: integer between 18 and 120, nullable.
  4. signup_date: datetime, not nullable, must be in the past.
  5. tier: one of "free", "basic", "premium", "enterprise".
  6. DataFrame-level check: email must contain no duplicates.

Test it by creating a DataFrame with intentional violations and using lazy=True to see all errors.

Solution Sketch

python
import pandera as pa
from pandera.typing import Series
import pandas as pd


class CustomerSchema(pa.DataFrameModel):
    customer_id: Series[int] = pa.Field(gt=0, unique=True, nullable=False)
    email: Series[str] = pa.Field(str_matches=r".+@.+\..+", nullable=False)
    # Nullable integer dtype: a plain int column cannot hold missing values
    age: Series[pd.Int64Dtype] = pa.Field(ge=18, le=120, nullable=True, coerce=True)
    signup_date: Series[pa.DateTime] = pa.Field(nullable=False)
    tier: Series[str] = pa.Field(isin=["free", "basic", "premium", "enterprise"])

    @pa.check("signup_date")
    def signup_in_past(cls, series: pd.Series) -> pd.Series:
        return series <= pd.Timestamp.now()

    @pa.dataframe_check
    def email_not_duplicated(cls, df: pd.DataFrame) -> pd.Series:
        # Row-wise result: False for every row whose email appears more than once
        return ~df["email"].duplicated(keep=False)


# Test with bad data
bad_df = pd.DataFrame({
    "customer_id": [1, 1, -3],  # duplicate, negative
    "email": ["good@test.com", "bad", "ok@test.com"],  # invalid email
    "age": [25, 200, None],  # out of range
    "signup_date": pd.to_datetime(["2024-01-01", "2030-01-01", "2024-06-01"]),  # future
    "tier": ["free", "invalid", "premium"],  # invalid tier
})

try:
    CustomerSchema.validate(bad_df, lazy=True)
except pa.errors.SchemaErrors as e:
    print(e.failure_cases)
Debugging Scenario

Your Pandera schema validation passes in tests but fails in production with SchemaError: expected int64 but got object on a column that contains integers.

Diagnose and fix it.

Answer

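The production data is loaded from a CSV, and pandas inferred the column as object (string) because one or more values are non-numeric (e.g., "unknown", "-", or a header row mixed into the data). Tokens that pandas treats as missing by default, such as "N/A" or an empty field, produce float64 instead of object. In tests, you construct the DataFrame with explicit integer values, so the problem never appears.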

Fixes:

  1. Enable coercion: add coerce=True to the Column definition or in the schema Config. Pandera will attempt to cast the column to the expected type before validation.
  2. Clean before validation: ensure the pipeline casts types (pd.to_numeric(col, errors="coerce")) before Pandera validation runs.
  3. Use a nullable integer: if the column can contain missing values, declare it with pandas' nullable integer dtype ("Int64" or pd.Int64Dtype()) instead of int, since NaN forces a plain integer column to float64 or object.
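
The second and third fixes can be reproduced with pandas alone. The sketch below is illustrative only: the CSV content and the column name quantity are made up, and "unknown" stands in for whatever stray token the production file actually contains.

```python
import io

import pandas as pd

# Hypothetical CSV: one stray token turns an integer column into strings
csv_text = "id,quantity\n1,10\n2,unknown\n3,30\n"
raw = pd.read_csv(io.StringIO(csv_text))
print(raw["quantity"].dtype)  # a string/object dtype, not int64

# Fix 2: cast explicitly; values that cannot be parsed become NaN
as_float = pd.to_numeric(raw["quantity"], errors="coerce")

# Fix 3: the nullable "Int64" dtype keeps integers alongside missing values
quantity = as_float.astype("Int64")
print(quantity.tolist())
```

Casting before validation makes the lossy step explicit: the unparseable row is visible as a missing value, and a nullable=False rule in the schema can then decide whether that is acceptable.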

Common Misconceptions

  • "Pandera replaces Great Expectations." Pandera is a code-level validation library. Great Expectations is a validation platform with reporting, checkpoints, and alerting. They serve different scales and use cases.
  • "Schema inference gives a production-ready schema." Inferred schemas describe what the data looks like now, not what it should look like. They may encode current bugs as valid patterns and set bounds too tightly.
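  • "coerce=True fixes all type issues." Coercion only casts; it does not clean. A value that cannot be cast (e.g., "123abc" in an integer column) still fails validation with a coercion error, whereas pd.to_numeric(col, errors="coerce") would turn it into NaN. Casts that do succeed change values silently, so keep value checks in place after coercion to catch unexpected conversions.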
  • "Pandera only works with pandas." Pandera supports pandas, Polars, Spark DataFrames (via pyspark), and Modin, making it usable across the Python data ecosystem.
Quiz

1. What is the difference between Check.gt(0) and Check.ge(0)?

Check.gt(0) requires values to be strictly greater than 0 (excludes 0). Check.ge(0) requires values to be greater than or equal to 0 (includes 0).

2. What does lazy=True do in Pandera validation?

Instead of stopping at the first validation error, lazy validation collects all errors across all columns and returns them together in a SchemaErrors exception, making it easier to see every issue at once.

3. How does @pa.check_input differ from calling schema.validate()?

@pa.check_input(Schema) is a decorator that automatically validates the function's input DataFrame before the function executes. It integrates validation into the function signature rather than requiring explicit validation calls.

4. What is schema inference, and how do you use it?

pa.infer_schema(df) examines a DataFrame and generates a schema with detected types, ranges, and null patterns. Use it as a starting point, then manually tighten constraints based on domain knowledge.

5. Can Pandera validate relationships between columns?

Yes, using @pa.dataframe_check decorators in DataFrameModel classes or Check objects with multi-column lambda functions. For example, verifying that end_date > start_date.

One-Liner Summary: Pandera is the fastest path from "no validation" to "validated pipeline" -- define schemas as Python code, validate with decorators, and catch every error with lazy mode.

"What I cannot create, I do not understand." — Richard Feynman