A comparative study of BPE vs. morpheme-based tokenizers for GPT-2 models on Hindi text, achieving 23% lower perplexity with linguistically-informed tokenization.
- Two Tokenizer Implementations:
  - Standard BPE tokenizer
  - Morphological tokenizer (prefix/root/suffix decomposition)
- Optimized GPT-2 Architecture:
  - 8-layer model with 384-dim embeddings
  - Trained on a 100 MB Hindi corpus
- Evaluation Framework:
  - Perplexity (PPL) metrics
  - Output coherence analysis
| Metric | BPE | Morpheme (Ours) |
|---|---|---|
| Perplexity | 24.3 | 18.7 |
| Training Loss | 5.07 | 3.39 |
| OOV Rate | 4.7% | 2.1% |
```bash
pip install -r requirements.txt
# requirements.txt includes:
# transformers==4.30.0
# tokenizers==0.13.3
# torch==2.0.1
```
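With these dependencies installed, a 12k-vocabulary BPE baseline tokenizer like the one described below can be trained with the pinned `tokenizers` package roughly as follows. This is a sketch: the corpus filename, output path, and special tokens are assumptions, not the exact settings used in this repo.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Byte-pair-encoding baseline: merge frequent symbol pairs up to a 12k vocabulary.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=12_000,
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],  # assumed special tokens
)
tokenizer.train(files=["data/hindi_clean_100mb.txt"], trainer=trainer)  # assumed path
tokenizer.save("bpe_hindi_12k.json")
```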
- Source: Hindi Wikipedia dump (~1 GB)
- Cleaned corpus sizes: 10MB, 20MB, 50MB, 100MB
- Tokenization:
  - BPE: Trained using standard byte-pair merging (12k vocab)
  - Morpheme: Custom-built using Hindi prefixes, suffixes, and roots
    - 80+ prefixes
    - 200+ suffixes
    - 5,000+ root words
    - Recursive segmentation algorithm (sketched below)
    - Preserves meaning better than BPE
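A minimal sketch of the recursive prefix/root/suffix segmentation idea follows. The toy dictionaries, the `##` continuation marker, and the character-level fallback for out-of-dictionary words (see Challenges below) are illustrative assumptions, not the repo's exact implementation.

```python
from typing import List, Optional

# Toy stand-ins for the repo's 80+ prefixes, 200+ suffixes, and 5,000+ roots.
PREFIXES = {"अ", "अन", "सु"}
SUFFIXES = {"ता", "ओं", "इक"}
ROOTS = {"सुंदर", "संभव", "लेख"}

def segment(word: str) -> Optional[List[str]]:
    """Recursively peel known suffixes/prefixes until a known root remains."""
    if word in ROOTS:
        return [word]
    for s in sorted(SUFFIXES, key=len, reverse=True):        # longest suffix first
        if word.endswith(s) and len(word) > len(s):
            rest = segment(word[: -len(s)])
            if rest is not None:
                return rest + ["##" + s]                      # '##' marks a continuation piece
    for p in sorted(PREFIXES, key=len, reverse=True):         # longest prefix first
        if word.startswith(p) and len(word) > len(p):
            rest = segment(word[len(p):])
            if rest is not None:
                return [p + "##"] + rest
    return None                                               # no morphological parse found

def tokenize_word(word: str) -> List[str]:
    return segment(word) or list(word)                        # rare morphemes: fall back to characters

print(tokenize_word("सुंदरता"))   # ['सुंदर', '##ता']
print(tokenize_word("असंभव"))     # ['अ##', 'संभव']
print(tokenize_word("कंप्यूटर"))  # loanword: character fallback
```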
| Component | Value |
|---|---|
| Layers | 8 |
| Embedding Dim | 384 |
| Attention Heads | 6 |
| FFN Hidden Dim | 1536 |
| Vocab Size | 12,000 |
| Optimizer | AdamW |
| Learning Rate | 5e-4 |
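The table above maps onto Hugging Face's `GPT2Config` roughly as follows (a sketch; the context length and other unstated defaults are assumptions):

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Hyperparameters taken from the table above; n_positions is an assumption (not listed).
config = GPT2Config(
    vocab_size=12_000,
    n_layer=8,        # Layers
    n_embd=384,       # Embedding Dim
    n_head=6,         # Attention Heads
    n_inner=1536,     # FFN Hidden Dim
    n_positions=512,  # context window (assumed)
)
model = GPT2LMHeadModel(config)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)  # AdamW, LR 5e-4
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```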
- Metric: Perplexity (PPL)
- Results:
  - BPE model: 24.3
  - Morpheme model: 18.7 ✅
The morpheme-based tokenizer reduces perplexity by 23% relative to the BPE baseline, indicating that linguistically informed tokenization is effective in this low-resource setting.
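Perplexity here is the exponential of the average per-token cross-entropy on held-out text. A sketch of how it can be computed with a language-modeling head (the function and dataloader names are illustrative, not part of the repo):

```python
import math
import torch

@torch.no_grad()
def perplexity(model, dataloader, device="cuda"):
    """exp(mean token-level cross-entropy) over a held-out set."""
    model.eval().to(device)
    total_loss, total_tokens = 0.0, 0
    for batch in dataloader:                       # batches of token IDs, shape (B, T)
        input_ids = batch["input_ids"].to(device)
        out = model(input_ids, labels=input_ids)   # HF shifts labels internally
        n = input_ids.numel() - input_ids.size(0)  # predicted positions: T - 1 per sequence
        total_loss += out.loss.item() * n
        total_tokens += n
    return math.exp(total_loss / total_tokens)
```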
A simple web interface is included under `gpt2_web_demo/` to compare BPE and morpheme model outputs interactively.
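The core comparison behind the demo can also be reproduced from a short script along these lines (a sketch; the checkpoint paths, tokenizer files, prompt, and generation settings are assumptions):

```python
from transformers import GPT2LMHeadModel, PreTrainedTokenizerFast

def generate(model_dir, tokenizer_file, prompt, max_new_tokens=50):
    tokenizer = PreTrainedTokenizerFast(tokenizer_file=tokenizer_file)  # assumed tokenizer format
    model = GPT2LMHeadModel.from_pretrained(model_dir).eval()           # assumed checkpoint layout
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=True, top_p=0.9)
    return tokenizer.decode(out[0], skip_special_tokens=True)

prompt = "भारत की राजधानी"
print("BPE     :", generate("checkpoints/bpe", "bpe_hindi_12k.json", prompt))        # assumed paths
print("Morpheme:", generate("checkpoints/morpheme", "morpheme_hindi.json", prompt))  # assumed paths
```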
| Challenge | Solution |
|---|---|
| Rare morphemes | Fallback to characters |
| Training instability | Gradient clipping |
| Compute limitations | Mixed precision (fp16) |
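For the last two rows, a minimal training-step sketch that combines fp16 autocast with gradient clipping (the clipping threshold and batch layout are assumed values, not the repo's exact settings):

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # fp16 mixed precision for limited compute

def train_step(model, batch, optimizer, max_grad_norm=1.0):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():                           # run forward pass in fp16 where safe
        loss = model(batch["input_ids"], labels=batch["input_ids"]).loss
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                                # unscale gradients before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)  # curb training instability
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```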
- Hindi Chatbots (education, support)
- Story and article generation
- Indian-language translation pipelines
- Hindi-only support (for now)
- Morpheme dictionary has ~85% coverage
- Expand to other Indian languages
- Combine morpheme and subword tokenization (hybrid)
- Scale to larger GPT models
- Jabbar, Haris. “MorphPiece: A Linguistic Tokenizer for Large Language Models.” arXiv:2307.07262
- Vaswani et al. “Attention Is All You Need.” NeurIPS 2017
- Chandan Kumar (2024PCS0022) : [email protected]
- Rohit Tiwari (2024PCS0036) : [email protected]
- Under the guidance of Dr. B. N. Subudhi (IIT)