Design Search Ranking System

Search ranking is how Google, Amazon, LinkedIn, and every search-powered product decide which results to show you first. The challenge is not finding results — inverted indexes do that. The challenge is ordering millions of candidates by relevance in under 200 milliseconds, while balancing freshness, personalization, diversity, and business objectives.

This walkthrough covers the full search ranking pipeline: query understanding, candidate retrieval, multi-stage ranking with learning-to-rank models, feature engineering, and evaluation frameworks.

1. Problem Statement & Requirements

Functional Requirements

Query processing — Accept text queries, handle typos, synonyms, and intent classification
Candidate retrieval — Fetch relevant documents from a corpus of billions
Ranking — Order candidates by relevance using ML models
Personalization — Adjust ranking based on user history and preferences
Freshness — Recent content ranked appropriately (news, social posts)
Spell correction — "pythn tutoral" should return Python tutorials
Query suggestions — Autocomplete and "did you mean" suggestions
Filters and facets — Filter by category, price range, date, etc.
Result snippets — Generate relevant text snippets highlighting query terms

Non-Functional Requirements

Latency — P99 < 200 ms end-to-end
Throughput — 100K queries per second at peak
Availability — 99.99% (search is the core product)
Corpus size — 10 billion documents
Freshness — New documents searchable within 5 minutes
Scalability — Support 10x growth without architecture changes

Clarifying Questions

Questions to Ask

What type of content are we searching? (Web pages, products, people, posts?)
Is this a general search engine or domain-specific (e-commerce, job search)?
How important is personalization vs universal relevance?
What is the current click-through rate we are trying to improve?
Do we need to handle multiple languages?
Are there any content quality or safety constraints?
Is real-time indexing required (social) or is batch acceptable (web)?

2. Back-of-Envelope Estimation

Traffic

100K QPS at peak, 40K QPS average
Each query touches the ranking pipeline

Daily queries = 40,000 \times 86,400 = 3.46 billion/day

Ranking Compute

Candidate retrieval returns ~10,000 documents per query
L1 ranker reduces to ~500 candidates
L2 ranker scores 500 candidates with full feature set

L2 scoring throughput = 100 K QPS \times 500 candidates = 50 M scorings/second

Index Size

10 billion documents, average 5 KB per document
Raw corpus: 50 TB
Inverted index: ~15% of corpus = 7.5 TB
Forward index (features): ~20% = 10 TB
Total index storage: ~70 TB (sharded across hundreds of machines)

Latency Budget

Stage	Budget	Parallelism
Query understanding	10 ms	Serial
Candidate retrieval	30 ms	Parallel across shards
L1 ranking	20 ms	Parallel
L2 ranking	50 ms	Sequential on reduced set
Snippet generation	15 ms	Parallel
Blending / business rules	5 ms	Sequential
Total	~130 ms

3. High-Level Design

4. Deep Dive: Query Understanding

Query understanding transforms raw user input into a structured intent before retrieval begins.

Query Understanding Pipeline

Spell Correction

python

class SpellCorrector:
    """Noisy channel model: P(correction|query) ~ P(query|correction) * P(correction)"""

    def __init__(self, language_model, error_model):
        self.lm = language_model        # P(correction) from query logs
        self.em = error_model            # P(query|correction) from edit distance

    def correct(self, query: str) -> str:
        tokens = query.split()
        corrected = []

        for token in tokens:
            if token in self.lm.vocab:
                corrected.append(token)
                continue

            # Generate candidates within edit distance 2
            candidates = self.generate_candidates(token, max_edit_dist=2)

            # Score by noisy channel model
            best = max(
                candidates,
                key=lambda c: (
                    self.em.log_prob(token, c) +  # Error model
                    self.lm.log_prob(c)            # Language model
                )
            )
            corrected.append(best)

        return ' '.join(corrected)

Intent Classification

Query	Intent	Ranking Strategy
"python tutorials"	Informational	Prioritize educational, high-quality content
"buy iphone 15 pro"	Transactional	Prioritize product pages, price, reviews
"amazon.com"	Navigational	Direct to the specific site
"weather nyc"	Instant answer	Show answer card above results
"pizza near me"	Local	Trigger local map results

5. Deep Dive: Candidate Retrieval

Retrieval reduces the 10-billion-document corpus to ~10,000 candidates. This stage optimizes for recall — do not miss relevant results. Precision is handled by the ranker.

Multi-Path Retrieval

Term-Based Retrieval (BM25)

The classic inverted index with BM25 scoring:

python

# BM25 scoring (simplified)
# score(D, Q) = SUM over query terms t:
#   IDF(t) * (tf(t,D) * (k1 + 1)) / (tf(t,D) + k1 * (1 - b + b * |D|/avgdl))

def bm25_score(query_terms, document, corpus_stats, k1=1.2, b=0.75):
    score = 0.0
    doc_len = len(document)
    avg_doc_len = corpus_stats.avg_doc_length

    for term in query_terms:
        tf = document.term_frequency(term)
        df = corpus_stats.doc_frequency(term)
        N = corpus_stats.total_docs

        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf_component = (tf * (k1 + 1)) / (
            tf + k1 * (1 - b + b * doc_len / avg_doc_len)
        )
        score += idf * tf_component

    return score

Semantic Retrieval (Dense Vectors)

For queries where term matching fails (e.g., "how to not feel sad" should match "dealing with depression"):

python

# Encode query and documents with a bi-encoder (e.g., sentence-transformers)
from sentence_transformers import SentenceTransformer
import faiss

# Offline: build ANN index
model = SentenceTransformer('all-MiniLM-L6-v2')
doc_embeddings = model.encode(all_documents)  # Shape: (N, 384)

# Build FAISS index for approximate nearest neighbor
index = faiss.IndexIVFFlat(
    faiss.IndexFlatIP(384),  # Inner product (cosine after normalization)
    384,                      # Dimension
    1024                      # Number of clusters (nlist)
)
index.train(doc_embeddings)
index.add(doc_embeddings)

# Online: query
query_embedding = model.encode(["how to not feel sad"])
distances, indices = index.search(query_embedding, k=1000)

TIP

Use hybrid retrieval — merge BM25 results with semantic results. Term matching catches exact keyword matches (product names, error codes), while semantic search catches conceptual matches. The ranker learns to reconcile both signals.

6. Deep Dive: Multi-Stage Ranking

The Ranking Funnel

L1 Ranker (Lightweight)

The L1 ranker scores 10K candidates in < 20 ms. It uses a simple model with cheap-to-compute features:

python

class L1Ranker:
    """Lightweight ranker: logistic regression or small GBDT."""

    def __init__(self):
        self.model = load_model("l1_ranker_v12.onnx")

    def score(self, query, candidates):
        features = []
        for doc in candidates:
            f = [
                doc.bm25_score,
                doc.semantic_similarity,
                doc.page_rank,
                doc.freshness_score,
                doc.click_through_rate,
                doc.domain_authority,
                len(set(query.terms) & set(doc.title_terms)),  # title overlap
                len(set(query.terms) & set(doc.url_terms)),    # URL overlap
            ]
            features.append(f)

        scores = self.model.predict(np.array(features))
        return sorted(
            zip(candidates, scores),
            key=lambda x: x[1],
            reverse=True
        )[:500]

L2 Ranker (Heavy ML)

The L2 ranker scores 500 candidates with a full feature set (~200 features) using a cross-encoder or large GBDT:

Feature Category	Count	Examples
Query-document	40	BM25, semantic sim, title/body/URL term overlap
Document quality	30	PageRank, domain authority, content length, reading level
Freshness	10	Age, update frequency, time decay
User context	25	Past clicks on this domain, query history similarity
Engagement	30	Historical CTR, dwell time, bounce rate, share rate
Semantic	20	Cross-encoder score, entity match, topic overlap
Business	15	Sponsored flag, content type boost, diversity penalty
Click model	30	Position-debiased CTR, examination probability

Learning-to-Rank (LTR)

Approach	Loss Function	Pros	Cons
Pointwise	MSE on relevance label	Simple, fast training	Ignores relative ordering
Pairwise (LambdaMART)	Cross-entropy on pairs	Best accuracy for GBDT	O(n^2) pairs per query
Listwise (ApproxNDCG)	Differentiable NDCG	Directly optimizes target metric	Harder to train, needs neural model

python

import lightgbm as lgb

# LambdaMART with LightGBM
train_data = lgb.Dataset(
    X_train,
    label=y_train,
    group=query_groups_train  # How many docs per query
)

params = {
    'objective': 'lambdarank',
    'metric': 'ndcg',
    'ndcg_eval_at': [5, 10],
    'num_leaves': 255,
    'learning_rate': 0.05,
    'min_data_in_leaf': 50,
    'feature_fraction': 0.8,
}

model = lgb.train(
    params,
    train_data,
    num_boost_round=1000,
    valid_sets=[valid_data],
    callbacks=[lgb.early_stopping(50)]
)

7. Deep Dive: Click Models and Position Bias

Users are far more likely to click results at position 1 than position 5, regardless of relevance. Raw click data is biased by position.

Position Bias

Position	Examination Prob	Observed CTR	True Relevance CTR
1	100%	35%	35%
2	85%	22%	26%
3	70%	14%	20%
5	45%	6%	13%
10	15%	1.5%	10%

Inverse Propensity Weighting

python

def debias_click_label(click, position, propensity_model):
    """Debias click signal using inverse propensity weighting."""
    exam_prob = propensity_model.examination_probability(position)

    if click:
        # Upweight clicks at low positions (they are more meaningful)
        return 1.0 / exam_prob
    else:
        # A non-click at position 1 means "not relevant"
        # A non-click at position 10 might mean "not examined"
        return 0.0  # Or use a soft label based on exam_prob

# Propensity estimated from randomization experiments
class PropensityModel:
    def examination_probability(self, position):
        # Estimated from position-swap experiments
        return 1.0 / (1.0 + 0.2 * position)

WARNING

If you train a ranking model on raw clicks without position debiasing, the model learns to rank already-top-ranked documents higher, creating a feedback loop that prevents new good content from surfacing. Always debias.

8. Deep Dive: Online and Offline Evaluation

Offline Metrics

Metric	What It Measures	Formula
NDCG@K	Quality of top K results	Normalized Discounted Cumulative Gain
MAP	Average precision across queries	Mean of per-query Average Precision
MRR	Position of first relevant result	Mean of 1/rank of first relevant result
Precision@K	Fraction of top K that are relevant	Relevant in top K / K
Recall@K	Fraction of relevant docs in top K	Relevant in top K / total relevant

python

def ndcg_at_k(relevance_scores, k=10):
    """Compute NDCG@K for a ranked list."""
    dcg = sum(
        rel / math.log2(i + 2)  # i+2 because log2(1) = 0
        for i, rel in enumerate(relevance_scores[:k])
    )

    ideal = sorted(relevance_scores, reverse=True)[:k]
    idcg = sum(
        rel / math.log2(i + 2)
        for i, rel in enumerate(ideal)
    )

    return dcg / idcg if idcg > 0 else 0.0

Online Metrics (A/B Testing)

Metric	Direction	What It Signals
Click-through rate	Higher is better	Results look relevant
Mean reciprocal rank of clicks	Higher is better	Relevant results ranked higher
Dwell time	Longer is better	Users found useful content
Pogo-sticking rate	Lower is better	Users do not click back quickly
Queries per session	Depends	Lower can mean better (found answer), or worse (gave up)
Zero-result rate	Lower is better	Fewer queries with no results
Abandonment rate	Lower is better	Users engage with results

TIP

No single metric tells the whole story. A model that increases CTR but also increases pogo-sticking (quick back-clicks) is showing clickbait, not better results. Always monitor a dashboard of complementary metrics.

Interleaving Experiments

Interleaving is faster than A/B testing for detecting ranking differences. Instead of splitting users, interleave results from both models on every query:

python

def team_draft_interleave(ranking_a, ranking_b, k=10):
    """Interleave two rankings using team-draft method."""
    interleaved = []
    team_a, team_b = [], []
    i, j = 0, 0

    while len(interleaved) < k and (i < len(ranking_a) or j < len(ranking_b)):
        # Alternate which team picks first (coin flip for first)
        if len(team_a) <= len(team_b):
            # Team A picks
            while i < len(ranking_a) and ranking_a[i] in interleaved:
                i += 1
            if i < len(ranking_a):
                interleaved.append(ranking_a[i])
                team_a.append(ranking_a[i])
                i += 1
        else:
            # Team B picks
            while j < len(ranking_b) and ranking_b[j] in interleaved:
                j += 1
            if j < len(ranking_b):
                interleaved.append(ranking_b[j])
                team_b.append(ranking_b[j])
                j += 1

    return interleaved, team_a, team_b

9. System Architecture Details

Index Sharding

Sharding Strategy	How	When
Document-based	Docs partitioned by hash(doc_id)	General purpose, even load
Term-based	Each shard owns certain terms	Rare, complex to manage
Tiered	Hot tier (recent, popular) + cold tier (archival)	Web search, news
Geographic	Docs sharded by region	Local search, maps

Real-Time Indexing

TIP

Use a dual-index architecture: a large immutable base index rebuilt periodically, plus a small real-time delta index. Queries hit both, with results merged. The delta index is periodically folded into the base index. This is the approach used by Elasticsearch, Solr, and most production search systems.

10. Handling Edge Cases

Query-Specific Challenges

Challenge	Solution
No results	Relax query (drop terms), suggest alternatives, show related content
Ambiguous queries	Show diverse results ("apple" = company + fruit), add disambiguation card
Very long queries	Extract key terms, use semantic understanding to summarize intent
Adversarial queries	SEO spam detection, content quality classifier in ranking features
Trending queries	Boost freshness weight, pre-compute popular query results
Sensitive queries	Content safety classifier, safe search enforcement

Key Takeaways

Multi-stage funnel — retrieval (recall-optimized), L1 ranker (lightweight filter), L2 ranker (heavy ML), blender (diversity/business)
Hybrid retrieval — combine term-based (BM25) and semantic (ANN) retrieval for best recall
LambdaMART dominates — for GBDT-based ranking, pairwise learning-to-rank is the industry standard
Position bias is real — always debias click data before training ranking models
Feature engineering matters more than model architecture — 200 well-crafted features outperform a fancy model with poor features
Interleaving > A/B testing — interleaving detects ranking quality differences 10x faster
Latency budget is sacred — every millisecond in the ranking pipeline must be justified; pre-compute everything possible

Design Search Ranking System ​

1. Problem Statement & Requirements ​

Functional Requirements ​

Non-Functional Requirements ​

Clarifying Questions ​

2. Back-of-Envelope Estimation ​

Traffic ​

Ranking Compute ​

Index Size ​

Latency Budget ​

3. High-Level Design ​

4. Deep Dive: Query Understanding ​

Query Understanding Pipeline ​

Spell Correction ​

Intent Classification ​

5. Deep Dive: Candidate Retrieval ​

Multi-Path Retrieval ​

Term-Based Retrieval (BM25) ​

Semantic Retrieval (Dense Vectors) ​

6. Deep Dive: Multi-Stage Ranking ​

The Ranking Funnel ​

L1 Ranker (Lightweight) ​

L2 Ranker (Heavy ML) ​

Learning-to-Rank (LTR) ​

7. Deep Dive: Click Models and Position Bias ​

Position Bias ​

Inverse Propensity Weighting ​

8. Deep Dive: Online and Offline Evaluation ​

Offline Metrics ​

Online Metrics (A/B Testing) ​

Interleaving Experiments ​

9. System Architecture Details ​

Index Sharding ​

Real-Time Indexing ​

10. Handling Edge Cases ​

Query-Specific Challenges ​

Key Takeaways ​

Related Pages

Design Search Ranking System

1. Problem Statement & Requirements

Functional Requirements

Non-Functional Requirements

Clarifying Questions

2. Back-of-Envelope Estimation

Traffic

Ranking Compute

Index Size

Latency Budget

3. High-Level Design

4. Deep Dive: Query Understanding

Query Understanding Pipeline

Spell Correction

Intent Classification

5. Deep Dive: Candidate Retrieval

Multi-Path Retrieval

Term-Based Retrieval (BM25)

Semantic Retrieval (Dense Vectors)

6. Deep Dive: Multi-Stage Ranking

The Ranking Funnel

L1 Ranker (Lightweight)

L2 Ranker (Heavy ML)

Learning-to-Rank (LTR)

7. Deep Dive: Click Models and Position Bias

Position Bias

Inverse Propensity Weighting

8. Deep Dive: Online and Offline Evaluation

Offline Metrics

Online Metrics (A/B Testing)

Interleaving Experiments

9. System Architecture Details

Index Sharding

Real-Time Indexing

10. Handling Edge Cases

Query-Specific Challenges

Key Takeaways