Downloads, extracts, and tokenizes data from the Pile (https://the-eye.eu/public/AI/pile).
- download_data.py - script for downloading archive files
- extract_zst.py - script for archive extraction
- tokenize.py - full tokenization script that uses Hugging Face parallel processing for faster tokenization
- example.txt - first 100 rows of a sample file from the Pile.
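
To inspect an extracted file before running the full tokenization, a small sample like example.txt can be read with the sketch below. It is not taken from the scripts above: the helper name `head_jsonl` is hypothetical, and it assumes the extracted file is JSON Lines (one JSON object per line, typically with a `text` field).

```python
import itertools
import json


def head_jsonl(path, n=100):
    """Return the records found in the first n lines of a JSON Lines file.

    Hypothetical helper; assumes one JSON object per line.
    """
    with open(path, encoding="utf-8") as f:
        # islice stops after n lines, so the rest of the file is never read.
        first_lines = itertools.islice(f, n)
        # Blank lines are skipped, so fewer than n records may be returned.
        return [json.loads(line) for line in first_lines if line.strip()]
```

For example, `rows = head_jsonl("sample.jsonl", n=100)` would return up to 100 dicts, where `sample.jsonl` stands in for whichever extracted file you want to check.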