NLP Fundamentals

Natural language processing bridges human language and machine computation. Before transformers can process text, raw characters must become numbers. This page builds that pipeline from the ground up: tokenization algorithms (including BPE from scratch), word embedding methods (Word2Vec, GloVe, FastText) with full mathematical derivations, and an end-to-end IMDB text classifier.

Text Preprocessing Pipeline

Tokenization

Word-Level Tokenization

Split on whitespace and punctuation. Simple but creates huge vocabularies and cannot handle out-of-vocabulary (OOV) words.

python

def word_tokenize(text):
    import re
    text = text.lower()
    tokens = re.findall(r'\b\w+\b', text)
    return tokens

word_tokenize("The cat sat on the mat.")
# ['the', 'cat', 'sat', 'on', 'the', 'mat']

Problems: "unhappiness" is OOV if not in training data. Vocabulary of 100K+ words needed for English.

Character-Level Tokenization

Every character is a token. Tiny vocabulary (~256) but very long sequences, and the model must learn word structure from scratch.

Subword Tokenization: BPE

Byte Pair Encoding (Sennrich et al., 2016) finds the optimal middle ground. Start with characters, iteratively merge the most frequent pair.

BPE from scratch:

python

def get_pair_counts(vocab):
    """Count frequency of all adjacent symbol pairs."""
    pairs = {}
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pair = (symbols[i], symbols[i + 1])
            pairs[pair] = pairs.get(pair, 0) + freq
    return pairs

def merge_pair(pair, vocab):
    """Merge all occurrences of a pair in the vocabulary."""
    new_vocab = {}
    bigram = ' '.join(pair)
    replacement = ''.join(pair)
    for word, freq in vocab.items():
        new_word = word.replace(bigram, replacement)
        new_vocab[new_word] = freq
    return new_vocab

def learn_bpe(corpus, num_merges=100):
    """Learn BPE merge operations from a corpus.

    Args:
        corpus: dict mapping word -> frequency
        num_merges: number of merge operations to learn
    Returns:
        merges: list of (pair, merged) tuples
    """
    # Initialize: split each word into characters + end-of-word marker
    vocab = {}
    for word, freq in corpus.items():
        symbols = ' '.join(list(word)) + ' </w>'
        vocab[symbols] = freq

    merges = []
    for i in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best_pair = max(pairs, key=pairs.get)
        vocab = merge_pair(best_pair, vocab)
        merges.append(best_pair)
        if (i + 1) % 20 == 0:
            print(f"Merge {i+1}: {best_pair} (freq={pairs[best_pair]})")

    return merges, vocab

# Example usage
corpus = {
    'low': 5, 'lower': 2, 'newest': 6, 'widest': 3,
    'new': 4, 'wide': 2, 'the': 10, 'there': 3,
}
merges, final_vocab = learn_bpe(corpus, num_merges=30)

print("\nFinal vocabulary tokens:")
all_tokens = set()
for word in final_vocab:
    all_tokens.update(word.split())
print(sorted(all_tokens))

BPE Tokenization (Applying Learned Merges)

python

def apply_bpe(word, merges):
    """Tokenize a word using learned BPE merges."""
    symbols = list(word) + ['</w>']

    for pair in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == pair[0] and symbols[i + 1] == pair[1]:
                symbols = symbols[:i] + [''.join(pair)] + symbols[i + 2:]
            else:
                i += 1

    return symbols

# Example
tokens = apply_bpe('lowest', merges)
print(tokens)  # e.g., ['low', 'est</w>'] after sufficient merges

WordPiece (Used by BERT)

Similar to BPE but selects the pair that maximizes the likelihood of the training data:

score (a, b) = \frac{freq (a b)}{freq (a) \times freq (b)}

Subword tokens are prefixed with ## (e.g., "playing" becomes ["play", "##ing"]).

Worked Example — BPE Merge Steps

Input corpus:

Initial vocabulary (character-level with end-of-word marker):

l o w </w> : 5
l o w e s t </w> : 2
n e w </w> : 4
n e w e s t </w> : 6

Step 1: Count all adjacent pairs:

(e, s): 2 + 6 = 8
(s, t): 2 + 6 = 8
(n, e): 4 + 6 = 10
(e, w): 4 + 6 = 10
(l, o): 5 + 2 = 7
(o, w): 5 + 2 = 7
(w, </w>): 5 + 4 = 9
...

Most frequent pair: tie between (n, e) and (e, w) at 10. Pick (e, s) or break ties. Let's say (n, e) wins.

Merge "n" + "e" -> "ne":

l o w </w> : 5
l o w e s t </w> : 2
ne w </w> : 4
ne w e s t </w> : 6

Step 2: Recount. Now (e, s) = 2 + 6 = 8, (ne, w) = 4 + 6 = 10. Merge "ne" + "w" -> "new":

l o w </w> : 5
l o w e s t </w> : 2
new </w> : 4
new e s t </w> : 6

Step 3: (e, s) = 8, (l, o) = 7, (new, </w>) = 4, etc. Merge "e" + "s" -> "es":

new es t </w> : 6
l o w es t </w> : 2

After more merges: "est", "low", "newest", "lowest" become single tokens.

Result: BPE learned that "est" and "new" are frequent subwords. The word "newest" tokenizes as ["new", "est"] --- meaningful morphological units discovered automatically from frequency alone.

SentencePiece

Language-agnostic tokenizer that treats the input as a raw byte stream (no pre-tokenization). Used by T5, LLaMA, and most multilingual models.

Tokenizer Comparison

Tokenizer	Used By	Vocab Size	Approach
BPE	GPT-2, GPT-3	50K	Frequency-based merging
WordPiece	BERT	30K	Likelihood-based merging
SentencePiece (Unigram)	T5, LLaMA	32K	Probabilistic subword
Tiktoken (BPE)	GPT-4, Claude	100K	Byte-level BPE

Word2Vec

Skip-gram Model

Given a center word $w_{c}$ , predict context words $w_{o}$ within a window.

The probability of context word $w_{o}$ given center word $w_{c}$ :

P (w_{o} | w_{c}) = \frac{\exp (v_{w_{o}}^{'} \cdot v_{w_{c}})}{\sum_{w = 1}^{V} \exp (v_{w}^{'} \cdot v_{w_{c}})}

where $v_{w_{c}}$ is the center word embedding and $v_{w_{o}}^{'}$ is the context word embedding. $V$ is the vocabulary size.

Objective: Maximize the log-likelihood over the corpus:

L = \sum_{t = 1}^{T} \sum_{- c \leq j \leq c, j \neq 0} \log P (w_{t + j} | w_{t})

Negative Sampling

Computing the softmax denominator over the entire vocabulary is too expensive. Negative sampling approximates it by training a binary classifier:

For each positive pair $(w_{c}, w_{o})$ , sample $k$ negative words $w_{n e g}$ from a noise distribution $P_{n} (w) \propto f (w)^{3 / 4}$ (unigram distribution raised to the 3/4 power).

L_{NEG} = \log σ (v_{w_{o}}^{'} \cdot v_{w_{c}}) + \sum_{i = 1}^{k} E_{w_{i} \sim P_{n}} [\log σ (- v_{w_{i}}^{'} \cdot v_{w_{c}})]

This turns the problem from $V$ -class classification to $k + 1$ binary classifications.

Worked Example — Word2Vec Skip-gram Probability

Setup: Vocabulary of 5 words. Center word "cat" (index 2), context word "sat" (index 3).

Center embeddings ( $v$ , 2-dim):

$v_{cat} = [0.5, 0.8]$

Context embeddings ( $v^{'}$ , 2-dim):

$v_{the}^{'} = [0.1, 0.3]$
$v_{dog}^{'} = [0.6, 0.7]$
$v_{cat}^{'} = [0.4, 0.5]$
$v_{sat}^{'} = [0.9, 0.2]$
$v_{on}^{'} = [0.2, 0.1]$

Step 1: Compute dot products $v_{w}^{'} \cdot v_{cat}$ for all words:

the: $0.1 (0.5) + 0.3 (0.8) = 0.29$
dog: $0.6 (0.5) + 0.7 (0.8) = 0.86$
cat: $0.4 (0.5) + 0.5 (0.8) = 0.60$
sat: $0.9 (0.5) + 0.2 (0.8) = 0.61$
on: $0.2 (0.5) + 0.1 (0.8) = 0.18$

Step 2: Softmax:

\exp values = [1.336, 2.363, 1.822, 1.840, 1.197]

sum = 8.558

P (sat | cat) = \frac{1.840}{8.558} = 0.215

Step 3: With negative sampling ( $k = 2$ , negatives: "the", "on"):

L = \log σ (0.61) + \log σ (- 0.29) + \log σ (- 0.18)

= \log (0.648) + \log (0.428) + \log (0.455)

= - 0.435 + (- 0.849) + (- 0.787) = - 2.071

Result: The skip-gram probability $P (sat | cat) = 0.215$ . Training will increase the dot product $v_{sat}^{'} \cdot v_{cat}$ (making "cat" and "sat" closer in embedding space) while decreasing dot products with negative samples.

Word2Vec Implementation

python

import torch
import torch.nn as nn
import torch.optim as optim
from collections import Counter
import numpy as np

class Word2Vec(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.center_embeddings = nn.Embedding(vocab_size, embed_dim)
        self.context_embeddings = nn.Embedding(vocab_size, embed_dim)
        # Initialize
        nn.init.xavier_uniform_(self.center_embeddings.weight)
        nn.init.zeros_(self.context_embeddings.weight)

    def forward(self, center, context, negatives):
        """
        center: (batch,) center word IDs
        context: (batch,) positive context word IDs
        negatives: (batch, k) negative sample IDs
        """
        center_emb = self.center_embeddings(center)     # (batch, dim)
        context_emb = self.context_embeddings(context)   # (batch, dim)
        neg_emb = self.context_embeddings(negatives)     # (batch, k, dim)

        # Positive score
        pos_score = torch.sum(center_emb * context_emb, dim=1)  # (batch,)
        pos_loss = -torch.log(torch.sigmoid(pos_score) + 1e-10)

        # Negative scores
        neg_score = torch.bmm(neg_emb, center_emb.unsqueeze(2)).squeeze(2)
        neg_loss = -torch.sum(torch.log(torch.sigmoid(-neg_score) + 1e-10), dim=1)

        return (pos_loss + neg_loss).mean()

def train_word2vec(corpus, embed_dim=100, window=5, k_neg=5, epochs=5):
    # Build vocabulary
    word_counts = Counter(corpus)
    vocab = sorted(word_counts.keys())
    word2idx = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    # Negative sampling distribution: P(w) ∝ freq(w)^(3/4)
    freqs = np.array([word_counts[w] for w in vocab], dtype=np.float64)
    freqs = freqs ** 0.75
    neg_dist = freqs / freqs.sum()

    model = Word2Vec(vocab_size, embed_dim)
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    indices = [word2idx[w] for w in corpus]

    for epoch in range(epochs):
        total_loss = 0
        pairs = []
        for i, center_idx in enumerate(indices):
            for j in range(max(0, i - window), min(len(indices), i + window + 1)):
                if j != i:
                    pairs.append((center_idx, indices[j]))

        # Mini-batch training
        np.random.shuffle(pairs)
        batch_size = 512
        for start in range(0, len(pairs), batch_size):
            batch = pairs[start:start + batch_size]
            centers = torch.tensor([p[0] for p in batch])
            contexts = torch.tensor([p[1] for p in batch])
            negatives = torch.tensor(
                np.random.choice(vocab_size, (len(batch), k_neg), p=neg_dist)
            )

            optimizer.zero_grad()
            loss = model(centers, contexts, negatives)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        print(f"Epoch {epoch+1}: Loss={total_loss:.4f}")

    return model, word2idx, vocab

GloVe: Global Vectors

GloVe (Pennington et al., 2014) combines the benefits of count-based and prediction-based methods.

Objective

Minimize:

J = \sum_{i, j = 1}^{V} f (X_{i j}) {(w_{i}^{T} {\tilde{w}}_{j} + b_{i} + {\tilde{b}}_{j} - \log X_{i j})}^{2}

where $X_{i j}$ is the co-occurrence count, $f (x) = min ({(\frac{x}{x_{max}})}^{3 / 4}, 1)$ is a weighting function that downweights very frequent pairs.

Key insight: The ratio of co-occurrence probabilities encodes semantic relationships:

\frac{P (k | ice)}{P (k | steam)} \approx {\begin{cases} large & if k = solid \\ small & if k = gas \\ \approx 1 & if k = water or fashion \end{cases}

FastText

FastText (Bojanowski et al., 2017) extends Word2Vec by representing each word as a bag of character n-grams.

For the word "where" with $n = 3$ : <wh, whe, her, ere, re>, plus the whole word <where>.

The word vector is the sum of its n-gram vectors:

v_{w} = \sum_{g \in G (w)} z_{g}

Advantage: Can generate embeddings for OOV words by composing their n-grams.

Embedding Properties

Analogies

The famous $king - man + woman \approx queen$ works because embeddings capture linear relationships:

python

def analogy(word2idx, embeddings, a, b, c, top_k=5):
    """a is to b as c is to ?"""
    vec = embeddings[word2idx[b]] - embeddings[word2idx[a]] + embeddings[word2idx[c]]
    # Find nearest neighbors
    sims = torch.cosine_similarity(vec.unsqueeze(0), embeddings)
    # Exclude input words
    for w in [a, b, c]:
        sims[word2idx[w]] = -1
    top_indices = sims.topk(top_k).indices
    return [vocab[i] for i in top_indices]

Bias in Embeddings

Word embeddings trained on real-world text encode societal biases:

\vec{doctor} - \vec{man} + \vec{woman} \approx \vec{nurse}

Debiasing techniques include post-hoc projection (Bolukbasi et al., 2016) and training-time interventions.

Text Classification: IMDB

End-to-end text classification using pretrained embeddings and a simple CNN:

python

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Yoon Kim (2014) CNN for text classification."""
    def __init__(self, vocab_size, embed_dim, num_classes,
                 filter_sizes=[3, 4, 5], num_filters=100, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList([
            nn.Conv1d(embed_dim, num_filters, fs) for fs in filter_sizes
        ])
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def forward(self, x):
        """x: (batch, seq_len) token IDs."""
        x = self.embedding(x)       # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)       # (batch, embed_dim, seq_len) for Conv1d

        # Apply each filter size and max-pool
        conv_outputs = []
        for conv in self.convs:
            c = F.relu(conv(x))              # (batch, num_filters, seq_len - fs + 1)
            c = F.max_pool1d(c, c.size(2))   # (batch, num_filters, 1)
            conv_outputs.append(c.squeeze(2))

        # Concatenate all filter outputs
        x = torch.cat(conv_outputs, dim=1)   # (batch, num_filters * len(filter_sizes))
        x = self.dropout(x)
        return self.fc(x)

# Training
model = TextCNN(vocab_size=30000, embed_dim=128, num_classes=2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Use IMDB dataset loading from the RNN-LSTM page
# Expected accuracy: ~87-89%

Pretrained Embeddings

Loading GloVe

python

def load_glove(path, word2idx, embed_dim=300):
    """Load GloVe vectors for words in vocabulary."""
    embeddings = torch.randn(len(word2idx), embed_dim) * 0.01
    found = 0
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            if word in word2idx:
                vector = torch.tensor([float(v) for v in values[1:]])
                embeddings[word2idx[word]] = vector
                found += 1

    print(f"Found {found}/{len(word2idx)} words in GloVe")
    return embeddings

# Use in model
pretrained = load_glove('glove.6B.300d.txt', word2idx)
model.embedding.weight.data.copy_(pretrained)
model.embedding.weight.requires_grad = False  # Freeze initially

Text Preprocessing Best Practices

Cleaning Pipeline

python

import re
import html

def clean_text(text):
    """Standard text cleaning pipeline."""
    # Decode HTML entities
    text = html.unescape(text)
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '[URL]', text)
    # Remove email addresses
    text = re.sub(r'\S+@\S+', '[EMAIL]', text)
    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    # Remove repeated punctuation
    text = re.sub(r'([!?.]){2,}', r'\1', text)
    return text

# Example
raw = "<p>Check out  https://example.com !! It's great!!</p>"
print(clean_text(raw))
# "Check out [URL] ! It's great!"

Handling Out-of-Vocabulary Words

Strategy	Approach	Trade-off
Subword tokenization	BPE, WordPiece, Unigram	Best general approach
Character fallback	Unknown words → character n-grams	Handles any input
Hash embedding	Hash word to fixed bucket	Fixed vocab, collision risk
UNK token	Replace rare words with `<UNK>`	Simple but loses information

Vocabulary Size Trade-offs

Vocab Size	Pros	Cons
Small (1K-5K)	Short sequences, fast	Many multi-token words, loses meaning
Medium (30K-50K)	Good balance (BERT, GPT-2)	Standard choice
Large (100K+)	Fewer tokens per word, multilingual	Large embedding matrix

TF-IDF: The Classical Baseline

Before deep learning, TF-IDF was the standard text representation:

TF-IDF (t, d, D) = TF (t, d) \times IDF (t, D)

TF (t, d) = \frac{count (t, d)}{| d |}, IDF (t, D) = \log \frac{| D |}{| {d \in D : t \in d} |}

Worked Example — TF-IDF Calculation

Setup: 3 documents, computing TF-IDF for the word "cat":

Doc 1: "the cat sat on the mat" (6 words, "cat" appears 1 time)
Doc 2: "the cat cat played" (4 words, "cat" appears 2 times)
Doc 3: "the dog ran fast" (4 words, "cat" appears 0 times)

Step 1 --- Term Frequency:

T F (cat, d_{1}) = 1 / 6 = 0.167, T F (cat, d_{2}) = 2 / 4 = 0.500, T F (cat, d_{3}) = 0 / 4 = 0

Step 2 --- Inverse Document Frequency: "cat" appears in 2 out of 3 documents:

I D F (cat) = \log \frac{3}{2} = \log (1.5) = 0.405

Step 3 --- TF-IDF:

TF-IDF (cat, d_{1}) = 0.167 \times 0.405 = 0.068

TF-IDF (cat, d_{2}) = 0.500 \times 0.405 = 0.203

TF-IDF (cat, d_{3}) = 0 \times 0.405 = 0

Compare with "the": appears in all 3 docs, so $I D F = \log (3 / 3) = 0$ and TF-IDF = 0 for all docs.

Result: "cat" has a moderate TF-IDF in Doc 2 (0.203) because it's frequent there but not universal. "the" has TF-IDF = 0 everywhere because it appears in every document --- IDF kills universally common words. This is why TF-IDF is effective: it highlights distinctive terms.

python

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# TF-IDF + Logistic Regression baseline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, ngram_range=(1, 2))),
    ('clf', LogisticRegression(max_iter=1000)),
])

pipeline.fit(train_texts, train_labels)
accuracy = pipeline.score(test_texts, test_labels)
print(f"TF-IDF Baseline: {accuracy:.4f}")
# Often achieves 85-88% on IMDB --- a strong baseline!

Always Start with TF-IDF

TF-IDF + logistic regression is a surprisingly strong baseline for text classification. It trains in seconds, needs no GPU, and often achieves 90%+ of BERT performance. Always try it first.

Embedding Evaluation

Intrinsic Evaluation

Test embedding quality directly on linguistic tasks:

Task	Dataset	What It Tests
Word similarity	SimLex-999, WS-353	Semantic similarity
Analogies	Google analogy dataset	Relational structure
Categorization	AP dataset	Cluster quality

Extrinsic Evaluation

Test embeddings on downstream tasks:

Task	Metric	Shows
Text classification	Accuracy/F1	Discriminative quality
NER	F1	Token-level representation
Semantic similarity	Spearman correlation	Sentence-level quality

Common NLP Pitfalls

Pitfall	Impact	Fix
Not lowercasing (when appropriate)	Vocabulary explosion	Lowercase for uncased models
Ignoring class imbalance	Model predicts majority class	Weighted loss, oversampling
Truncating long documents	Losing important context	Summarize, chunk, or use Longformer
Not handling special characters	Tokenizer errors	Clean text before tokenizing
Using wrong pretrained embeddings	Poor downstream performance	Match domain (biomedical? legal?)

Cross-References

Modern NLP: Language Models --- from Word2Vec to GPT
Transformers: Transformers --- self-attention for NLP
BERT: BERT Family --- contextual embeddings
Generation: Text Generation --- decoding strategies
Sequences: RNN and LSTM --- recurrent models for text

NLP Fundamentals ​

Text Preprocessing Pipeline ​

Tokenization ​

Word-Level Tokenization ​

Character-Level Tokenization ​

Subword Tokenization: BPE ​

BPE Tokenization (Applying Learned Merges) ​

WordPiece (Used by BERT) ​

SentencePiece ​

Tokenizer Comparison ​

Word2Vec ​

Skip-gram Model ​

Negative Sampling ​

Word2Vec Implementation ​

GloVe: Global Vectors ​

Objective ​

FastText ​

Embedding Properties ​

Analogies ​

Bias in Embeddings ​

Text Classification: IMDB ​

Pretrained Embeddings ​

Loading GloVe ​

Text Preprocessing Best Practices ​

Cleaning Pipeline ​

Handling Out-of-Vocabulary Words ​

Vocabulary Size Trade-offs ​

TF-IDF: The Classical Baseline ​

Embedding Evaluation ​

Intrinsic Evaluation ​

Extrinsic Evaluation ​

Common NLP Pitfalls ​

Cross-References ​

NLP Fundamentals

Text Preprocessing Pipeline

Tokenization

Word-Level Tokenization

Character-Level Tokenization

Subword Tokenization: BPE

BPE Tokenization (Applying Learned Merges)

WordPiece (Used by BERT)

SentencePiece

Tokenizer Comparison

Word2Vec

Skip-gram Model

Negative Sampling

Word2Vec Implementation

GloVe: Global Vectors

Objective

FastText

Embedding Properties

Analogies

Bias in Embeddings

Text Classification: IMDB

Pretrained Embeddings

Loading GloVe

Text Preprocessing Best Practices

Cleaning Pipeline

Handling Out-of-Vocabulary Words

Vocabulary Size Trade-offs

TF-IDF: The Classical Baseline

Embedding Evaluation

Intrinsic Evaluation

Extrinsic Evaluation

Common NLP Pitfalls

Cross-References