Skip to content
Unverified — AI-generated content. Help verify this page

NLP Fundamentals

Natural language processing bridges human language and machine computation. Before transformers can process text, raw characters must become numbers. This page builds that pipeline from the ground up: tokenization algorithms (including BPE from scratch), word embedding methods (Word2Vec, GloVe, FastText) with full mathematical derivations, and an end-to-end IMDB text classifier.

Text Preprocessing Pipeline

Tokenization

Word-Level Tokenization

Split on whitespace and punctuation. Simple but creates huge vocabularies and cannot handle out-of-vocabulary (OOV) words.

python
def word_tokenize(text):
    import re
    text = text.lower()
    tokens = re.findall(r'\b\w+\b', text)
    return tokens

word_tokenize("The cat sat on the mat.")
# ['the', 'cat', 'sat', 'on', 'the', 'mat']

Problems: "unhappiness" is OOV if not in training data. Vocabulary of 100K+ words needed for English.

Character-Level Tokenization

Every character is a token. Tiny vocabulary (~256) but very long sequences, and the model must learn word structure from scratch.

Subword Tokenization: BPE

Byte Pair Encoding (Sennrich et al., 2016) finds the optimal middle ground. Start with characters, iteratively merge the most frequent pair.

BPE from scratch:

python
def get_pair_counts(vocab):
    """Count frequency of all adjacent symbol pairs."""
    pairs = {}
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pair = (symbols[i], symbols[i + 1])
            pairs[pair] = pairs.get(pair, 0) + freq
    return pairs

def merge_pair(pair, vocab):
    """Merge all occurrences of a pair in the vocabulary."""
    new_vocab = {}
    bigram = ' '.join(pair)
    replacement = ''.join(pair)
    for word, freq in vocab.items():
        new_word = word.replace(bigram, replacement)
        new_vocab[new_word] = freq
    return new_vocab

def learn_bpe(corpus, num_merges=100):
    """Learn BPE merge operations from a corpus.

    Args:
        corpus: dict mapping word -> frequency
        num_merges: number of merge operations to learn
    Returns:
        merges: list of (pair, merged) tuples
    """
    # Initialize: split each word into characters + end-of-word marker
    vocab = {}
    for word, freq in corpus.items():
        symbols = ' '.join(list(word)) + ' </w>'
        vocab[symbols] = freq

    merges = []
    for i in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best_pair = max(pairs, key=pairs.get)
        vocab = merge_pair(best_pair, vocab)
        merges.append(best_pair)
        if (i + 1) % 20 == 0:
            print(f"Merge {i+1}: {best_pair} (freq={pairs[best_pair]})")

    return merges, vocab

# Example usage
corpus = {
    'low': 5, 'lower': 2, 'newest': 6, 'widest': 3,
    'new': 4, 'wide': 2, 'the': 10, 'there': 3,
}
merges, final_vocab = learn_bpe(corpus, num_merges=30)

print("\nFinal vocabulary tokens:")
all_tokens = set()
for word in final_vocab:
    all_tokens.update(word.split())
print(sorted(all_tokens))

BPE Tokenization (Applying Learned Merges)

python
def apply_bpe(word, merges):
    """Tokenize a word using learned BPE merges."""
    symbols = list(word) + ['</w>']

    for pair in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == pair[0] and symbols[i + 1] == pair[1]:
                symbols = symbols[:i] + [''.join(pair)] + symbols[i + 2:]
            else:
                i += 1

    return symbols

# Example
tokens = apply_bpe('lowest', merges)
print(tokens)  # e.g., ['low', 'est</w>'] after sufficient merges

WordPiece (Used by BERT)

Similar to BPE but selects the pair that maximizes the likelihood of the training data:

score(a,b)=freq(ab)freq(a)×freq(b)

Subword tokens are prefixed with ## (e.g., "playing" becomes ["play", "##ing"]).

Worked Example — BPE Merge Steps

Input corpus:

Initial vocabulary (character-level with end-of-word marker):

  • l o w </w> : 5
  • l o w e s t </w> : 2
  • n e w </w> : 4
  • n e w e s t </w> : 6

Step 1: Count all adjacent pairs:

  • (e, s): 2 + 6 = 8
  • (s, t): 2 + 6 = 8
  • (n, e): 4 + 6 = 10
  • (e, w): 4 + 6 = 10
  • (l, o): 5 + 2 = 7
  • (o, w): 5 + 2 = 7
  • (w, </w>): 5 + 4 = 9
  • ...

Most frequent pair: tie between (n, e) and (e, w) at 10. Pick (e, s) or break ties. Let's say (n, e) wins.

Merge "n" + "e" -> "ne":

  • l o w </w> : 5
  • l o w e s t </w> : 2
  • ne w </w> : 4
  • ne w e s t </w> : 6

Step 2: Recount. Now (e, s) = 2 + 6 = 8, (ne, w) = 4 + 6 = 10. Merge "ne" + "w" -> "new":

  • l o w </w> : 5
  • l o w e s t </w> : 2
  • new </w> : 4
  • new e s t </w> : 6

Step 3: (e, s) = 8, (l, o) = 7, (new, </w>) = 4, etc. Merge "e" + "s" -> "es":

  • new es t </w> : 6
  • l o w es t </w> : 2

After more merges: "est", "low", "newest", "lowest" become single tokens.

Result: BPE learned that "est" and "new" are frequent subwords. The word "newest" tokenizes as ["new", "est"] --- meaningful morphological units discovered automatically from frequency alone.

SentencePiece

Language-agnostic tokenizer that treats the input as a raw byte stream (no pre-tokenization). Used by T5, LLaMA, and most multilingual models.

Tokenizer Comparison

TokenizerUsed ByVocab SizeApproach
BPEGPT-2, GPT-350KFrequency-based merging
WordPieceBERT30KLikelihood-based merging
SentencePiece (Unigram)T5, LLaMA32KProbabilistic subword
Tiktoken (BPE)GPT-4, Claude100KByte-level BPE

Word2Vec

Skip-gram Model

Given a center word wc, predict context words wo within a window.

The probability of context word wo given center word wc:

P(wo|wc)=exp(vwovwc)w=1Vexp(vwvwc)

where vwc is the center word embedding and vwo is the context word embedding. V is the vocabulary size.

Objective: Maximize the log-likelihood over the corpus:

L=t=1Tcjc,j0logP(wt+j|wt)

Negative Sampling

Computing the softmax denominator over the entire vocabulary is too expensive. Negative sampling approximates it by training a binary classifier:

For each positive pair (wc,wo), sample k negative words wneg from a noise distribution Pn(w)f(w)3/4 (unigram distribution raised to the 3/4 power).

LNEG=logσ(vwovwc)+i=1kEwiPn[logσ(vwivwc)]

This turns the problem from V-class classification to k+1 binary classifications.

Worked Example — Word2Vec Skip-gram Probability

Setup: Vocabulary of 5 words. Center word "cat" (index 2), context word "sat" (index 3).

Center embeddings (v, 2-dim):

  • vcat=[0.5,0.8]

Context embeddings (v, 2-dim):

  • vthe=[0.1,0.3]
  • vdog=[0.6,0.7]
  • vcat=[0.4,0.5]
  • vsat=[0.9,0.2]
  • von=[0.2,0.1]

Step 1: Compute dot products vwvcat for all words:

  • the: 0.1(0.5)+0.3(0.8)=0.29
  • dog: 0.6(0.5)+0.7(0.8)=0.86
  • cat: 0.4(0.5)+0.5(0.8)=0.60
  • sat: 0.9(0.5)+0.2(0.8)=0.61
  • on: 0.2(0.5)+0.1(0.8)=0.18

Step 2: Softmax:

exp values=[1.336,2.363,1.822,1.840,1.197]sum=8.558P(sat|cat)=1.8408.558=0.215

Step 3: With negative sampling (k=2, negatives: "the", "on"):

L=logσ(0.61)+logσ(0.29)+logσ(0.18)=log(0.648)+log(0.428)+log(0.455)=0.435+(0.849)+(0.787)=2.071

Result: The skip-gram probability P(sat|cat)=0.215. Training will increase the dot product vsatvcat (making "cat" and "sat" closer in embedding space) while decreasing dot products with negative samples.

Word2Vec Implementation

python
import torch
import torch.nn as nn
import torch.optim as optim
from collections import Counter
import numpy as np

class Word2Vec(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.center_embeddings = nn.Embedding(vocab_size, embed_dim)
        self.context_embeddings = nn.Embedding(vocab_size, embed_dim)
        # Initialize
        nn.init.xavier_uniform_(self.center_embeddings.weight)
        nn.init.zeros_(self.context_embeddings.weight)

    def forward(self, center, context, negatives):
        """
        center: (batch,) center word IDs
        context: (batch,) positive context word IDs
        negatives: (batch, k) negative sample IDs
        """
        center_emb = self.center_embeddings(center)     # (batch, dim)
        context_emb = self.context_embeddings(context)   # (batch, dim)
        neg_emb = self.context_embeddings(negatives)     # (batch, k, dim)

        # Positive score
        pos_score = torch.sum(center_emb * context_emb, dim=1)  # (batch,)
        pos_loss = -torch.log(torch.sigmoid(pos_score) + 1e-10)

        # Negative scores
        neg_score = torch.bmm(neg_emb, center_emb.unsqueeze(2)).squeeze(2)
        neg_loss = -torch.sum(torch.log(torch.sigmoid(-neg_score) + 1e-10), dim=1)

        return (pos_loss + neg_loss).mean()

def train_word2vec(corpus, embed_dim=100, window=5, k_neg=5, epochs=5):
    # Build vocabulary
    word_counts = Counter(corpus)
    vocab = sorted(word_counts.keys())
    word2idx = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    # Negative sampling distribution: P(w) ∝ freq(w)^(3/4)
    freqs = np.array([word_counts[w] for w in vocab], dtype=np.float64)
    freqs = freqs ** 0.75
    neg_dist = freqs / freqs.sum()

    model = Word2Vec(vocab_size, embed_dim)
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    indices = [word2idx[w] for w in corpus]

    for epoch in range(epochs):
        total_loss = 0
        pairs = []
        for i, center_idx in enumerate(indices):
            for j in range(max(0, i - window), min(len(indices), i + window + 1)):
                if j != i:
                    pairs.append((center_idx, indices[j]))

        # Mini-batch training
        np.random.shuffle(pairs)
        batch_size = 512
        for start in range(0, len(pairs), batch_size):
            batch = pairs[start:start + batch_size]
            centers = torch.tensor([p[0] for p in batch])
            contexts = torch.tensor([p[1] for p in batch])
            negatives = torch.tensor(
                np.random.choice(vocab_size, (len(batch), k_neg), p=neg_dist)
            )

            optimizer.zero_grad()
            loss = model(centers, contexts, negatives)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        print(f"Epoch {epoch+1}: Loss={total_loss:.4f}")

    return model, word2idx, vocab

GloVe: Global Vectors

GloVe (Pennington et al., 2014) combines the benefits of count-based and prediction-based methods.

Objective

Minimize:

J=i,j=1Vf(Xij)(wiTw~j+bi+b~jlogXij)2

where Xij is the co-occurrence count, f(x)=min((xxmax)3/4,1) is a weighting function that downweights very frequent pairs.

Key insight: The ratio of co-occurrence probabilities encodes semantic relationships:

P(k|ice)P(k|steam){largeif k=solidsmallif k=gas1if k=water or fashion

FastText

FastText (Bojanowski et al., 2017) extends Word2Vec by representing each word as a bag of character n-grams.

For the word "where" with n=3: <wh, whe, her, ere, re>, plus the whole word <where>.

The word vector is the sum of its n-gram vectors:

vw=gG(w)zg

Advantage: Can generate embeddings for OOV words by composing their n-grams.

Embedding Properties

Analogies

The famous kingman+womanqueen works because embeddings capture linear relationships:

python
def analogy(word2idx, embeddings, a, b, c, top_k=5):
    """a is to b as c is to ?"""
    vec = embeddings[word2idx[b]] - embeddings[word2idx[a]] + embeddings[word2idx[c]]
    # Find nearest neighbors
    sims = torch.cosine_similarity(vec.unsqueeze(0), embeddings)
    # Exclude input words
    for w in [a, b, c]:
        sims[word2idx[w]] = -1
    top_indices = sims.topk(top_k).indices
    return [vocab[i] for i in top_indices]

Bias in Embeddings

Word embeddings trained on real-world text encode societal biases:

doctorman+womannurse

Debiasing techniques include post-hoc projection (Bolukbasi et al., 2016) and training-time interventions.

Text Classification: IMDB

End-to-end text classification using pretrained embeddings and a simple CNN:

python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Yoon Kim (2014) CNN for text classification."""
    def __init__(self, vocab_size, embed_dim, num_classes,
                 filter_sizes=[3, 4, 5], num_filters=100, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList([
            nn.Conv1d(embed_dim, num_filters, fs) for fs in filter_sizes
        ])
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def forward(self, x):
        """x: (batch, seq_len) token IDs."""
        x = self.embedding(x)       # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)       # (batch, embed_dim, seq_len) for Conv1d

        # Apply each filter size and max-pool
        conv_outputs = []
        for conv in self.convs:
            c = F.relu(conv(x))              # (batch, num_filters, seq_len - fs + 1)
            c = F.max_pool1d(c, c.size(2))   # (batch, num_filters, 1)
            conv_outputs.append(c.squeeze(2))

        # Concatenate all filter outputs
        x = torch.cat(conv_outputs, dim=1)   # (batch, num_filters * len(filter_sizes))
        x = self.dropout(x)
        return self.fc(x)

# Training
model = TextCNN(vocab_size=30000, embed_dim=128, num_classes=2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Use IMDB dataset loading from the RNN-LSTM page
# Expected accuracy: ~87-89%

Pretrained Embeddings

Loading GloVe

python
def load_glove(path, word2idx, embed_dim=300):
    """Load GloVe vectors for words in vocabulary."""
    embeddings = torch.randn(len(word2idx), embed_dim) * 0.01
    found = 0
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            if word in word2idx:
                vector = torch.tensor([float(v) for v in values[1:]])
                embeddings[word2idx[word]] = vector
                found += 1

    print(f"Found {found}/{len(word2idx)} words in GloVe")
    return embeddings

# Use in model
pretrained = load_glove('glove.6B.300d.txt', word2idx)
model.embedding.weight.data.copy_(pretrained)
model.embedding.weight.requires_grad = False  # Freeze initially

Text Preprocessing Best Practices

Cleaning Pipeline

python
import re
import html

def clean_text(text):
    """Standard text cleaning pipeline."""
    # Decode HTML entities
    text = html.unescape(text)
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '[URL]', text)
    # Remove email addresses
    text = re.sub(r'\S+@\S+', '[EMAIL]', text)
    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    # Remove repeated punctuation
    text = re.sub(r'([!?.]){2,}', r'\1', text)
    return text

# Example
raw = "<p>Check out  https://example.com !! It's great!!</p>"
print(clean_text(raw))
# "Check out [URL] ! It's great!"

Handling Out-of-Vocabulary Words

StrategyApproachTrade-off
Subword tokenizationBPE, WordPiece, UnigramBest general approach
Character fallbackUnknown words → character n-gramsHandles any input
Hash embeddingHash word to fixed bucketFixed vocab, collision risk
UNK tokenReplace rare words with <UNK>Simple but loses information

Vocabulary Size Trade-offs

Vocab SizeProsCons
Small (1K-5K)Short sequences, fastMany multi-token words, loses meaning
Medium (30K-50K)Good balance (BERT, GPT-2)Standard choice
Large (100K+)Fewer tokens per word, multilingualLarge embedding matrix

TF-IDF: The Classical Baseline

Before deep learning, TF-IDF was the standard text representation:

TF-IDF(t,d,D)=TF(t,d)×IDF(t,D)TF(t,d)=count(t,d)|d|,IDF(t,D)=log|D||{dD:td}|
Worked Example — TF-IDF Calculation

Setup: 3 documents, computing TF-IDF for the word "cat":

  • Doc 1: "the cat sat on the mat" (6 words, "cat" appears 1 time)
  • Doc 2: "the cat cat played" (4 words, "cat" appears 2 times)
  • Doc 3: "the dog ran fast" (4 words, "cat" appears 0 times)

Step 1 --- Term Frequency:

TF(cat,d1)=1/6=0.167,TF(cat,d2)=2/4=0.500,TF(cat,d3)=0/4=0

Step 2 --- Inverse Document Frequency: "cat" appears in 2 out of 3 documents:

IDF(cat)=log32=log(1.5)=0.405

Step 3 --- TF-IDF:

TF-IDF(cat,d1)=0.167×0.405=0.068TF-IDF(cat,d2)=0.500×0.405=0.203TF-IDF(cat,d3)=0×0.405=0

Compare with "the": appears in all 3 docs, so IDF=log(3/3)=0 and TF-IDF = 0 for all docs.

Result: "cat" has a moderate TF-IDF in Doc 2 (0.203) because it's frequent there but not universal. "the" has TF-IDF = 0 everywhere because it appears in every document --- IDF kills universally common words. This is why TF-IDF is effective: it highlights distinctive terms.

python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# TF-IDF + Logistic Regression baseline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, ngram_range=(1, 2))),
    ('clf', LogisticRegression(max_iter=1000)),
])

pipeline.fit(train_texts, train_labels)
accuracy = pipeline.score(test_texts, test_labels)
print(f"TF-IDF Baseline: {accuracy:.4f}")
# Often achieves 85-88% on IMDB --- a strong baseline!

Always Start with TF-IDF

TF-IDF + logistic regression is a surprisingly strong baseline for text classification. It trains in seconds, needs no GPU, and often achieves 90%+ of BERT performance. Always try it first.

Embedding Evaluation

Intrinsic Evaluation

Test embedding quality directly on linguistic tasks:

TaskDatasetWhat It Tests
Word similaritySimLex-999, WS-353Semantic similarity
AnalogiesGoogle analogy datasetRelational structure
CategorizationAP datasetCluster quality

Extrinsic Evaluation

Test embeddings on downstream tasks:

TaskMetricShows
Text classificationAccuracy/F1Discriminative quality
NERF1Token-level representation
Semantic similaritySpearman correlationSentence-level quality

Common NLP Pitfalls

PitfallImpactFix
Not lowercasing (when appropriate)Vocabulary explosionLowercase for uncased models
Ignoring class imbalanceModel predicts majority classWeighted loss, oversampling
Truncating long documentsLosing important contextSummarize, chunk, or use Longformer
Not handling special charactersTokenizer errorsClean text before tokenizing
Using wrong pretrained embeddingsPoor downstream performanceMatch domain (biomedical? legal?)

Cross-References

"What I cannot create, I do not understand." — Richard Feynman