This repository contains a custom-built language model trained on over 400,000 carefully curated paragraphs from a diverse corpus. The architecture is a Transformer decoder that stacks multi-head attention layers and pointwise feed-forward networks, paired with Byte Pair Encoding (BPE) tokenization to handle a large vocabulary. The model is designed to generate high-quality text and uses a warmup learning rate schedule to stabilize training.
The model utilizes a custom Transformer decoder architecture, featuring the following components (a code sketch of how they fit together follows this list):

- Tokenization is handled by SentencePiece using BPE, optimizing vocabulary size while retaining semantic integrity. Each paragraph is tokenized into a sequence of integers and padded to a fixed length of 100 tokens.
- Tokens are mapped to dense vectors of dimension 204 (d_model=204) via an embedding layer, followed by positional encoding to capture sequential token relationships.
- Multi-head attention uses 6 heads (num_heads=6), each operating over a depth of 34 (d_model / num_heads), allowing the model to attend to different parts of the input sequence in parallel.
- The pointwise feed-forward network (PFFN) comprises two dense layers: the first expands the dimensionality to 512 (dff=512) with ReLU activation, and the second projects back to 204 (d_model=204), enabling complex input transformations.
- The architecture stacks 8 decoder layers (num_layers=8), each combining multi-head attention and a PFFN, with residual connections and layer normalization improving training stability and performance.
- Dropout with a rate of 0.3 is applied to prevent overfitting.
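The sketch below shows how these hyperparameters could be wired together. It assumes a TensorFlow/Keras implementation; the framework, class names, and the `vocab_size`/`max_len` defaults are illustrative assumptions rather than details taken from this repository's code.

```python
# Minimal sketch of the decoder, assuming TensorFlow/Keras (framework is an assumption).
import numpy as np
import tensorflow as tf

D_MODEL, NUM_HEADS, DFF, NUM_LAYERS, DROPOUT = 204, 6, 512, 8, 0.3

def positional_encoding(length, d_model):
    """Sinusoidal positional encoding with shape (1, length, d_model)."""
    positions = np.arange(length)[:, None]
    dims = np.arange(d_model)[None, :]
    angles = positions / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles[:, 0::2] = np.sin(angles[:, 0::2])
    angles[:, 1::2] = np.cos(angles[:, 1::2])
    return tf.cast(angles[None, ...], tf.float32)

class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self):
        super().__init__()
        # 6 heads, each attending over a depth of d_model / num_heads = 34.
        self.mha = tf.keras.layers.MultiHeadAttention(
            num_heads=NUM_HEADS, key_dim=D_MODEL // NUM_HEADS)
        # Pointwise feed-forward network: 204 -> 512 (ReLU) -> 204.
        self.pffn = tf.keras.Sequential([
            tf.keras.layers.Dense(DFF, activation="relu"),
            tf.keras.layers.Dense(D_MODEL),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.norm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.drop1 = tf.keras.layers.Dropout(DROPOUT)
        self.drop2 = tf.keras.layers.Dropout(DROPOUT)

    def call(self, x, training=False):
        # Causal self-attention with a residual connection and normalization.
        attn = self.mha(query=x, value=x, key=x, use_causal_mask=True)
        x = self.norm1(x + self.drop1(attn, training=training))
        # Feed-forward block with a residual connection.
        return self.norm2(x + self.drop2(self.pffn(x), training=training))

class Decoder(tf.keras.Model):
    """Embedding + positional encoding followed by 8 decoder layers."""
    def __init__(self, vocab_size=20000, max_len=100):
        super().__init__()
        self.embed = tf.keras.layers.Embedding(vocab_size, D_MODEL)
        self.pos_enc = positional_encoding(max_len, D_MODEL)
        self.blocks = [DecoderLayer() for _ in range(NUM_LAYERS)]
        self.logits = tf.keras.layers.Dense(vocab_size)  # next-token logits

    def call(self, tokens, training=False):
        x = self.embed(tokens) * tf.math.sqrt(tf.cast(D_MODEL, tf.float32))
        x = x + self.pos_enc[:, :tf.shape(tokens)[1], :]
        for block in self.blocks:
            x = block(x, training=training)
        return self.logits(x)
```

Calling `Decoder()` on a batch of padded token ids returns logits of shape (batch, sequence_length, vocab_size) for next-token prediction.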
The dataset comprises 400,000+ paragraphs, filtered using NLP techniques to ensure quality and relevance. Key preparation steps include:

- Removal of noise, special characters, and irrelevant content (a minimal cleaning sketch follows the example paragraph below).
- Tokenization with SentencePiece using BPE, resulting in a balanced and manageable vocabulary.
- Standardization of sequences to 300 tokens, ensuring consistent model input.
Example Paragraph:
"a third person would mean an opportunity to come together with her and with konnor to help keep them all alive. that is, if konnor didn't go completely over the deep end. she would have to take time to better assess his condition. even as she considered doing that, she knew she would have problems."
The model employs BPE for tokenization, which handles rare words effectively and keeps the vocabulary to a fixed 20,000 subword tokens by merging frequent byte pairs, giving an efficient subword representation.
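As a sketch of how the tokenizer could be trained and applied with the stated 20,000-token vocabulary (the file names and the padding helper below are assumptions, not paths from this repository):

```python
# Training and applying a SentencePiece BPE model with a 20,000-token vocabulary.
# File names ("corpus.txt", "bpe") are illustrative assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",      # assumed layout: one cleaned paragraph per line
    model_prefix="bpe",
    vocab_size=20000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="bpe.model")

def encode_and_pad(text, max_len=100, pad_id=0):
    """Encode a paragraph to subword ids, then truncate/pad to a fixed length.
    SentencePiece reserves no pad id by default, so pad_id=0 is an assumption."""
    ids = sp.encode(text, out_type=int)[:max_len]
    return ids + [pad_id] * (max_len - len(ids))

token_ids = encode_and_pad("she would have to take time to better assess his condition.")
```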
To ensure robust training, the model employs a warmup learning rate schedule. The learning rate starts low, gradually increasing during the warmup phase, then decays smoothly using a cosine function, promoting stability and preventing premature convergence.
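A minimal sketch of such a schedule, again assuming TensorFlow/Keras; the peak rate and step counts are placeholder values, not the repository's settings:

```python
# Warmup followed by cosine decay, assuming TensorFlow/Keras; the peak learning
# rate and step counts below are placeholder assumptions.
import math
import tensorflow as tf

class WarmupCosineSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, peak_lr=1e-3, warmup_steps=4000, total_steps=100_000):
        super().__init__()
        self.peak_lr = peak_lr
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        # Linear warmup: rise from 0 to peak_lr over warmup_steps.
        warmup_lr = self.peak_lr * step / self.warmup_steps
        # Cosine decay: fall smoothly from peak_lr toward 0 afterwards.
        progress = (step - self.warmup_steps) / (self.total_steps - self.warmup_steps)
        progress = tf.clip_by_value(progress, 0.0, 1.0)
        cosine_lr = 0.5 * self.peak_lr * (1.0 + tf.cos(math.pi * progress))
        return tf.where(step < self.warmup_steps, warmup_lr, cosine_lr)

optimizer = tf.keras.optimizers.Adam(learning_rate=WarmupCosineSchedule())
```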
Key achievements:

- Designed and implemented a Transformer-based language model optimized for text generation.
- Integrated SentencePiece with BPE, improving vocabulary management and the handling of rare words.
- Implemented a warmup learning rate schedule, resulting in improved generalization and stable training dynamics.
- Processed and curated a large-scale dataset of 400,000+ paragraphs, providing a solid foundation for language modeling.