This project explores language modeling using LSTM-based architectures trained on the WikiText-2 dataset. Two models are implemented: a standard LSTM language model and an advanced AWD-LSTM variant with regularization techniques such as weight dropout and locked dropout. Given a text prompt, both models generate coherent sentence continuations.
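Below is a minimal PyTorch sketch of the two regularizers that distinguish the AWD-LSTM variant: weight dropout (DropConnect on the recurrent hidden-to-hidden weights) and locked dropout (one dropout mask shared across all time steps). The class names, layer sizes, and dropout rates are illustrative, not the repository's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LockedDropout(nn.Module):
    """Variational dropout: one mask sampled per sequence, shared across time steps."""

    def forward(self, x, p=0.5):
        # x: (seq_len, batch, features)
        if not self.training or p == 0.0:
            return x
        mask = x.new_empty(1, x.size(1), x.size(2)).bernoulli_(1 - p) / (1 - p)
        return x * mask


class WeightDrop(nn.Module):
    """DropConnect on selected weights, e.g. the LSTM's hidden-to-hidden matrix."""

    def __init__(self, module, weight_names, dropout=0.5):
        super().__init__()
        self.module = module
        self.weight_names = weight_names
        self.dropout = dropout
        for name in weight_names:
            raw = getattr(module, name)
            del module._parameters[name]  # detach the original parameter
            module.register_parameter(name + "_raw", nn.Parameter(raw.data))

    def forward(self, *args, **kwargs):
        # Re-mask the raw weights on every forward pass, then run the wrapped module.
        for name in self.weight_names:
            raw = getattr(self.module, name + "_raw")
            setattr(self.module, name,
                    F.dropout(raw, p=self.dropout, training=self.training))
        return self.module(*args, **kwargs)


# Usage sketch: recurrent weight dropout plus locked dropout on the LSTM outputs.
lstm = WeightDrop(nn.LSTM(256, 512), ["weight_hh_l0"], dropout=0.5)
locked = LockedDropout()
x = torch.randn(35, 20, 256)          # (seq_len, batch, emb_dim) dummy batch
out, _ = lstm(x)
out = locked(out, p=0.4)
```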
A custom BPE tokenizer is built from scratch on WikiText-2 (30k vocabulary). The pipeline covers data cleaning, deduplication, training with the HuggingFace `tokenizers` library, evaluation (compression ratio, UNK-free coverage, consistency), and saving/reloading the result as a `PreTrainedTokenizerFast`.
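The following sketch shows the shape of that pipeline with the HuggingFace `tokenizers` and `transformers` APIs. The stand-in corpus, special tokens, and output directory are placeholders; in the real pipeline the trainer would consume the cleaned, deduplicated WikiText-2 split.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# Byte-pair-encoding model with a whitespace pre-tokenizer and a 30k target vocab.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    vocab_size=30_000,
    special_tokens=["<unk>", "<pad>", "<bos>", "<eos>"],
)

# Stand-in corpus; the real pipeline trains on the cleaned WikiText-2 training split.
corpus = [
    "Senjō no Valkyria 3 is a tactical role playing video game .",
    "The game was released in January 2011 in Japan .",
]
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Quick evaluation: compression ratio = characters per produced token.
sample = corpus[0]
encoding = tokenizer.encode(sample)
print(len(sample) / len(encoding.tokens), encoding.tokens)

# Wrap for the transformers ecosystem, save, reload, and check consistency.
fast = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="<unk>",
    pad_token="<pad>",
    bos_token="<bos>",
    eos_token="<eos>",
)
fast.save_pretrained("bpe-wikitext2-30k")
reloaded = PreTrainedTokenizerFast.from_pretrained("bpe-wikitext2-30k")
assert reloaded.tokenize(sample) == encoding.tokens  # round-trip consistency check
```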