Pipeline for pre-processing WARC files from Common Crawl.
Initial inspiration came from the paper "The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only" (https://arxiv.org/pdf/2306.01116.pdf).