NLP Fundamentals
Natural language processing bridges human language and machine computation. Before transformers can process text, raw characters must become numbers. This page builds that pipeline from the ground up: tokenization algorithms (including BPE from scratch), word embedding methods (Word2Vec, GloVe, FastText) with full mathematical derivations, and an end-to-end IMDB text classifier.
Text Preprocessing Pipeline
Tokenization
Word-Level Tokenization
Split on whitespace and punctuation. Simple but creates huge vocabularies and cannot handle out-of-vocabulary (OOV) words.
def word_tokenize(text):
import re
text = text.lower()
tokens = re.findall(r'\b\w+\b', text)
return tokens
word_tokenize("The cat sat on the mat.")
# ['the', 'cat', 'sat', 'on', 'the', 'mat']Problems: "unhappiness" is OOV if not in training data. Vocabulary of 100K+ words needed for English.
Character-Level Tokenization
Every character is a token. Tiny vocabulary (~256) but very long sequences, and the model must learn word structure from scratch.
Subword Tokenization: BPE
Byte Pair Encoding (Sennrich et al., 2016) finds the optimal middle ground. Start with characters, iteratively merge the most frequent pair.
BPE from scratch:
def get_pair_counts(vocab):
"""Count frequency of all adjacent symbol pairs."""
pairs = {}
for word, freq in vocab.items():
symbols = word.split()
for i in range(len(symbols) - 1):
pair = (symbols[i], symbols[i + 1])
pairs[pair] = pairs.get(pair, 0) + freq
return pairs
def merge_pair(pair, vocab):
"""Merge all occurrences of a pair in the vocabulary."""
new_vocab = {}
bigram = ' '.join(pair)
replacement = ''.join(pair)
for word, freq in vocab.items():
new_word = word.replace(bigram, replacement)
new_vocab[new_word] = freq
return new_vocab
def learn_bpe(corpus, num_merges=100):
"""Learn BPE merge operations from a corpus.
Args:
corpus: dict mapping word -> frequency
num_merges: number of merge operations to learn
Returns:
merges: list of (pair, merged) tuples
"""
# Initialize: split each word into characters + end-of-word marker
vocab = {}
for word, freq in corpus.items():
symbols = ' '.join(list(word)) + ' </w>'
vocab[symbols] = freq
merges = []
for i in range(num_merges):
pairs = get_pair_counts(vocab)
if not pairs:
break
best_pair = max(pairs, key=pairs.get)
vocab = merge_pair(best_pair, vocab)
merges.append(best_pair)
if (i + 1) % 20 == 0:
print(f"Merge {i+1}: {best_pair} (freq={pairs[best_pair]})")
return merges, vocab
# Example usage
corpus = {
'low': 5, 'lower': 2, 'newest': 6, 'widest': 3,
'new': 4, 'wide': 2, 'the': 10, 'there': 3,
}
merges, final_vocab = learn_bpe(corpus, num_merges=30)
print("\nFinal vocabulary tokens:")
all_tokens = set()
for word in final_vocab:
all_tokens.update(word.split())
print(sorted(all_tokens))BPE Tokenization (Applying Learned Merges)
def apply_bpe(word, merges):
"""Tokenize a word using learned BPE merges."""
symbols = list(word) + ['</w>']
for pair in merges:
i = 0
while i < len(symbols) - 1:
if symbols[i] == pair[0] and symbols[i + 1] == pair[1]:
symbols = symbols[:i] + [''.join(pair)] + symbols[i + 2:]
else:
i += 1
return symbols
# Example
tokens = apply_bpe('lowest', merges)
print(tokens) # e.g., ['low', 'est</w>'] after sufficient mergesWordPiece (Used by BERT)
Similar to BPE but selects the pair that maximizes the likelihood of the training data:
Subword tokens are prefixed with ## (e.g., "playing" becomes ["play", "##ing"]).
Worked Example — BPE Merge Steps
Input corpus:
Initial vocabulary (character-level with end-of-word marker):
l o w </w>: 5l o w e s t </w>: 2n e w </w>: 4n e w e s t </w>: 6
Step 1: Count all adjacent pairs:
- (e, s): 2 + 6 = 8
- (s, t): 2 + 6 = 8
- (n, e): 4 + 6 = 10
- (e, w): 4 + 6 = 10
- (l, o): 5 + 2 = 7
- (o, w): 5 + 2 = 7
- (w,
</w>): 5 + 4 = 9 - ...
Most frequent pair: tie between (n, e) and (e, w) at 10. Pick (e, s) or break ties. Let's say (n, e) wins.
Merge "n" + "e" -> "ne":
l o w </w>: 5l o w e s t </w>: 2ne w </w>: 4ne w e s t </w>: 6
Step 2: Recount. Now (e, s) = 2 + 6 = 8, (ne, w) = 4 + 6 = 10. Merge "ne" + "w" -> "new":
l o w </w>: 5l o w e s t </w>: 2new </w>: 4new e s t </w>: 6
Step 3: (e, s) = 8, (l, o) = 7, (new, </w>) = 4, etc. Merge "e" + "s" -> "es":
new es t </w>: 6l o w es t </w>: 2
After more merges: "est", "low", "newest", "lowest" become single tokens.
Result: BPE learned that "est" and "new" are frequent subwords. The word "newest" tokenizes as ["new", "est"] --- meaningful morphological units discovered automatically from frequency alone.
SentencePiece
Language-agnostic tokenizer that treats the input as a raw byte stream (no pre-tokenization). Used by T5, LLaMA, and most multilingual models.
Tokenizer Comparison
| Tokenizer | Used By | Vocab Size | Approach |
|---|---|---|---|
| BPE | GPT-2, GPT-3 | 50K | Frequency-based merging |
| WordPiece | BERT | 30K | Likelihood-based merging |
| SentencePiece (Unigram) | T5, LLaMA | 32K | Probabilistic subword |
| Tiktoken (BPE) | GPT-4, Claude | 100K | Byte-level BPE |
Word2Vec
Skip-gram Model
Given a center word
The probability of context word
where
Objective: Maximize the log-likelihood over the corpus:
Negative Sampling
Computing the softmax denominator over the entire vocabulary is too expensive. Negative sampling approximates it by training a binary classifier:
For each positive pair
This turns the problem from
Worked Example — Word2Vec Skip-gram Probability
Setup: Vocabulary of 5 words. Center word "cat" (index 2), context word "sat" (index 3).
Center embeddings (
Context embeddings (
Step 1: Compute dot products
- the:
- dog:
- cat:
- sat:
- on:
Step 2: Softmax:
Step 3: With negative sampling (
Result: The skip-gram probability
Word2Vec Implementation
import torch
import torch.nn as nn
import torch.optim as optim
from collections import Counter
import numpy as np
class Word2Vec(nn.Module):
def __init__(self, vocab_size, embed_dim):
super().__init__()
self.center_embeddings = nn.Embedding(vocab_size, embed_dim)
self.context_embeddings = nn.Embedding(vocab_size, embed_dim)
# Initialize
nn.init.xavier_uniform_(self.center_embeddings.weight)
nn.init.zeros_(self.context_embeddings.weight)
def forward(self, center, context, negatives):
"""
center: (batch,) center word IDs
context: (batch,) positive context word IDs
negatives: (batch, k) negative sample IDs
"""
center_emb = self.center_embeddings(center) # (batch, dim)
context_emb = self.context_embeddings(context) # (batch, dim)
neg_emb = self.context_embeddings(negatives) # (batch, k, dim)
# Positive score
pos_score = torch.sum(center_emb * context_emb, dim=1) # (batch,)
pos_loss = -torch.log(torch.sigmoid(pos_score) + 1e-10)
# Negative scores
neg_score = torch.bmm(neg_emb, center_emb.unsqueeze(2)).squeeze(2)
neg_loss = -torch.sum(torch.log(torch.sigmoid(-neg_score) + 1e-10), dim=1)
return (pos_loss + neg_loss).mean()
def train_word2vec(corpus, embed_dim=100, window=5, k_neg=5, epochs=5):
# Build vocabulary
word_counts = Counter(corpus)
vocab = sorted(word_counts.keys())
word2idx = {w: i for i, w in enumerate(vocab)}
vocab_size = len(vocab)
# Negative sampling distribution: P(w) ∝ freq(w)^(3/4)
freqs = np.array([word_counts[w] for w in vocab], dtype=np.float64)
freqs = freqs ** 0.75
neg_dist = freqs / freqs.sum()
model = Word2Vec(vocab_size, embed_dim)
optimizer = optim.Adam(model.parameters(), lr=0.001)
indices = [word2idx[w] for w in corpus]
for epoch in range(epochs):
total_loss = 0
pairs = []
for i, center_idx in enumerate(indices):
for j in range(max(0, i - window), min(len(indices), i + window + 1)):
if j != i:
pairs.append((center_idx, indices[j]))
# Mini-batch training
np.random.shuffle(pairs)
batch_size = 512
for start in range(0, len(pairs), batch_size):
batch = pairs[start:start + batch_size]
centers = torch.tensor([p[0] for p in batch])
contexts = torch.tensor([p[1] for p in batch])
negatives = torch.tensor(
np.random.choice(vocab_size, (len(batch), k_neg), p=neg_dist)
)
optimizer.zero_grad()
loss = model(centers, contexts, negatives)
loss.backward()
optimizer.step()
total_loss += loss.item()
print(f"Epoch {epoch+1}: Loss={total_loss:.4f}")
return model, word2idx, vocabGloVe: Global Vectors
GloVe (Pennington et al., 2014) combines the benefits of count-based and prediction-based methods.
Objective
Minimize:
where
Key insight: The ratio of co-occurrence probabilities encodes semantic relationships:
FastText
FastText (Bojanowski et al., 2017) extends Word2Vec by representing each word as a bag of character n-grams.
For the word "where" with <wh, whe, her, ere, re>, plus the whole word <where>.
The word vector is the sum of its n-gram vectors:
Advantage: Can generate embeddings for OOV words by composing their n-grams.
Embedding Properties
Analogies
The famous
def analogy(word2idx, embeddings, a, b, c, top_k=5):
"""a is to b as c is to ?"""
vec = embeddings[word2idx[b]] - embeddings[word2idx[a]] + embeddings[word2idx[c]]
# Find nearest neighbors
sims = torch.cosine_similarity(vec.unsqueeze(0), embeddings)
# Exclude input words
for w in [a, b, c]:
sims[word2idx[w]] = -1
top_indices = sims.topk(top_k).indices
return [vocab[i] for i in top_indices]Bias in Embeddings
Word embeddings trained on real-world text encode societal biases:
Debiasing techniques include post-hoc projection (Bolukbasi et al., 2016) and training-time interventions.
Text Classification: IMDB
End-to-end text classification using pretrained embeddings and a simple CNN:
import torch
import torch.nn as nn
import torch.nn.functional as F
class TextCNN(nn.Module):
"""Yoon Kim (2014) CNN for text classification."""
def __init__(self, vocab_size, embed_dim, num_classes,
filter_sizes=[3, 4, 5], num_filters=100, dropout=0.5):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embed_dim)
self.convs = nn.ModuleList([
nn.Conv1d(embed_dim, num_filters, fs) for fs in filter_sizes
])
self.dropout = nn.Dropout(dropout)
self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)
def forward(self, x):
"""x: (batch, seq_len) token IDs."""
x = self.embedding(x) # (batch, seq_len, embed_dim)
x = x.transpose(1, 2) # (batch, embed_dim, seq_len) for Conv1d
# Apply each filter size and max-pool
conv_outputs = []
for conv in self.convs:
c = F.relu(conv(x)) # (batch, num_filters, seq_len - fs + 1)
c = F.max_pool1d(c, c.size(2)) # (batch, num_filters, 1)
conv_outputs.append(c.squeeze(2))
# Concatenate all filter outputs
x = torch.cat(conv_outputs, dim=1) # (batch, num_filters * len(filter_sizes))
x = self.dropout(x)
return self.fc(x)
# Training
model = TextCNN(vocab_size=30000, embed_dim=128, num_classes=2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Use IMDB dataset loading from the RNN-LSTM page
# Expected accuracy: ~87-89%Pretrained Embeddings
Loading GloVe
def load_glove(path, word2idx, embed_dim=300):
"""Load GloVe vectors for words in vocabulary."""
embeddings = torch.randn(len(word2idx), embed_dim) * 0.01
found = 0
with open(path, 'r', encoding='utf-8') as f:
for line in f:
values = line.split()
word = values[0]
if word in word2idx:
vector = torch.tensor([float(v) for v in values[1:]])
embeddings[word2idx[word]] = vector
found += 1
print(f"Found {found}/{len(word2idx)} words in GloVe")
return embeddings
# Use in model
pretrained = load_glove('glove.6B.300d.txt', word2idx)
model.embedding.weight.data.copy_(pretrained)
model.embedding.weight.requires_grad = False # Freeze initiallyText Preprocessing Best Practices
Cleaning Pipeline
import re
import html
def clean_text(text):
"""Standard text cleaning pipeline."""
# Decode HTML entities
text = html.unescape(text)
# Remove HTML tags
text = re.sub(r'<[^>]+>', '', text)
# Remove URLs
text = re.sub(r'https?://\S+|www\.\S+', '[URL]', text)
# Remove email addresses
text = re.sub(r'\S+@\S+', '[EMAIL]', text)
# Normalize whitespace
text = re.sub(r'\s+', ' ', text).strip()
# Remove repeated punctuation
text = re.sub(r'([!?.]){2,}', r'\1', text)
return text
# Example
raw = "<p>Check out https://example.com !! It's great!!</p>"
print(clean_text(raw))
# "Check out [URL] ! It's great!"Handling Out-of-Vocabulary Words
| Strategy | Approach | Trade-off |
|---|---|---|
| Subword tokenization | BPE, WordPiece, Unigram | Best general approach |
| Character fallback | Unknown words → character n-grams | Handles any input |
| Hash embedding | Hash word to fixed bucket | Fixed vocab, collision risk |
| UNK token | Replace rare words with <UNK> | Simple but loses information |
Vocabulary Size Trade-offs
| Vocab Size | Pros | Cons |
|---|---|---|
| Small (1K-5K) | Short sequences, fast | Many multi-token words, loses meaning |
| Medium (30K-50K) | Good balance (BERT, GPT-2) | Standard choice |
| Large (100K+) | Fewer tokens per word, multilingual | Large embedding matrix |
TF-IDF: The Classical Baseline
Before deep learning, TF-IDF was the standard text representation:
Worked Example — TF-IDF Calculation
Setup: 3 documents, computing TF-IDF for the word "cat":
- Doc 1: "the cat sat on the mat" (6 words, "cat" appears 1 time)
- Doc 2: "the cat cat played" (4 words, "cat" appears 2 times)
- Doc 3: "the dog ran fast" (4 words, "cat" appears 0 times)
Step 1 --- Term Frequency:
Step 2 --- Inverse Document Frequency: "cat" appears in 2 out of 3 documents:
Step 3 --- TF-IDF:
Compare with "the": appears in all 3 docs, so
Result: "cat" has a moderate TF-IDF in Doc 2 (0.203) because it's frequent there but not universal. "the" has TF-IDF = 0 everywhere because it appears in every document --- IDF kills universally common words. This is why TF-IDF is effective: it highlights distinctive terms.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# TF-IDF + Logistic Regression baseline
pipeline = Pipeline([
('tfidf', TfidfVectorizer(max_features=10000, ngram_range=(1, 2))),
('clf', LogisticRegression(max_iter=1000)),
])
pipeline.fit(train_texts, train_labels)
accuracy = pipeline.score(test_texts, test_labels)
print(f"TF-IDF Baseline: {accuracy:.4f}")
# Often achieves 85-88% on IMDB --- a strong baseline!Always Start with TF-IDF
TF-IDF + logistic regression is a surprisingly strong baseline for text classification. It trains in seconds, needs no GPU, and often achieves 90%+ of BERT performance. Always try it first.
Embedding Evaluation
Intrinsic Evaluation
Test embedding quality directly on linguistic tasks:
| Task | Dataset | What It Tests |
|---|---|---|
| Word similarity | SimLex-999, WS-353 | Semantic similarity |
| Analogies | Google analogy dataset | Relational structure |
| Categorization | AP dataset | Cluster quality |
Extrinsic Evaluation
Test embeddings on downstream tasks:
| Task | Metric | Shows |
|---|---|---|
| Text classification | Accuracy/F1 | Discriminative quality |
| NER | F1 | Token-level representation |
| Semantic similarity | Spearman correlation | Sentence-level quality |
Common NLP Pitfalls
| Pitfall | Impact | Fix |
|---|---|---|
| Not lowercasing (when appropriate) | Vocabulary explosion | Lowercase for uncased models |
| Ignoring class imbalance | Model predicts majority class | Weighted loss, oversampling |
| Truncating long documents | Losing important context | Summarize, chunk, or use Longformer |
| Not handling special characters | Tokenizer errors | Clean text before tokenizing |
| Using wrong pretrained embeddings | Poor downstream performance | Match domain (biomedical? legal?) |
Cross-References
- Modern NLP: Language Models --- from Word2Vec to GPT
- Transformers: Transformers --- self-attention for NLP
- BERT: BERT Family --- contextual embeddings
- Generation: Text Generation --- decoding strategies
- Sequences: RNN and LSTM --- recurrent models for text