Transformer

Overview

This Transformer model is based on the original "Attention is all you need" paper and trained on the tinyShakespeare dataset. In this case we only implemented the decoder because we are not interested in translation tasks.
In the decoder block causal masked attention is used; hiding the trailing tokens during training. This is done to preserve the auto-regressive property of the model, meaning the output is only dependent on the tokens seen so far.
The token size is 1, so each token is one character. We used a dropout of 20% to decrease training time and generalize the model to prevent overfitting (regularzitation). It works by choosing 20% of the neurons arbitrarily and setting their activations to 0 and scaling the activation of the other neurons by 1 / (1 - 0.2) = 1 / (4/5) = 5/4 = 1.25, so that the sum of remaining activations stays the same.

Training Setup

To train the transformer model and generate a text its important to adjust the hyperparameters in the beginning of tranformer.py. With Cuda you can expect good results after 15 minutes of training with the current settings. We also include two default hyperparameter templates for a Google Colab connected to a Nvidia T4 GPU or a laptop.

Download the tinyShakesPeare dataset by uncommenting the wget line wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt or manually over this link
(Un)comment your preferred hyperparameters and adjust to your specs in the top of tranformer.py (read above).
Run python transformer.py

Files

bigram.py Contains a simple Bigrammodel, which uses an embedding space of (vocab_size, vocab_size) and is the baseline model for the transformer. In this type of model we count occurrences of token_a after token_b. This means that the generation of the next token only depends on the previous token.

transformer.py is the fully fledged decoder of the original Attention is all you need paper.

Embedding Layer: Maps tokens into a 384-dimensional space.
Positional Encoding: Added to embeddings to provide sequence order information.
Decoder Blocks: Each block contains multi-head self-attention, a feed-forward network, and layer normalization.
Output Layer: A final linear projection produces logits over the vocabulary(all possible tokens/characters), followed by softmax sampling for character generation.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
README.md		README.md
bigram.py		bigram.py
custom_nn.py		custom_nn.py
gpt-dev.ipynb		gpt-dev.ipynb
input.txt		input.txt
transformer.py		transformer.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Transformer

Overview

Training Setup

Files

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Transformer

Overview

Training Setup

Files

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages