marianna13/pile_tokenizer

pile_tokenizer

Downloads, extracts, and tokenizes data from the Pile (https://the-eye.eu/public/AI/pile).

  • download_data.py - script for downloading the archive files
  • extract_zst.py - script for extracting the archives
  • tokenize.py - full script that uses Hugging Face parallel processing for faster tokenization
  • example.txt - the first 100 rows of a sample file from the Pile
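
To illustrate the tokenization step, here is a minimal, standard-library-only sketch of batched tokenization over extracted rows. It is not the code in tokenize.py. It assumes each row is a JSON object with a "text" field (the "meta" contents in the sample are made up for the demo), it uses a whitespace split as a stand-in for a real Hugging Face tokenizer, and it uses a thread pool where the actual script relies on Hugging Face parallel processes. All function names here are hypothetical.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, Iterator, List


def iter_texts(lines: Iterable[str]) -> Iterator[str]:
    """Yield the "text" field of each JSON Lines row, skipping blank lines.

    Assumption: every non-blank row is a JSON object with a "text" key.
    """
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)["text"]


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group items into lists of at most `size` elements."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def tokenize_batch(batch: List[str]) -> List[List[str]]:
    """Stand-in tokenizer: lowercase whitespace split, not a subword tokenizer."""
    return [text.lower().split() for text in batch]


def tokenize_all(
    lines: Iterable[str], batch_size: int = 2, workers: int = 2
) -> List[List[str]]:
    """Tokenize every row by handing batches to a small worker pool.

    pool.map returns results in submission order, so row order is preserved.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(tokenize_batch, batched(iter_texts(lines), batch_size))
        return [tokens for batch in results for tokens in batch]


# Hypothetical sample rows standing in for lines read from an extracted file.
sample = [
    '{"text": "Hello Pile", "meta": {"pile_set_name": "demo"}}',
    "",
    '{"text": "Tokenize me please", "meta": {"pile_set_name": "demo"}}',
    '{"text": "Third row", "meta": {"pile_set_name": "demo"}}',
]
tokens = tokenize_all(sample)
print(tokens)
```

Processing rows in batches keeps per-call overhead low, which is the same reason batched mapping tends to help with real tokenizers. To try the sketch on real data, the lines of an extracted file (or example.txt, if it follows the assumed row format) could be passed in place of `sample`.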
