
warc_processng

A pipeline for pre-processing WARC files from Common Crawl.

Initial inspiration came from "The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only" (https://arxiv.org/pdf/2306.01116.pdf).
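
To give a rough idea of what the first stage of such a pipeline might look like, here is a minimal sketch that walks the records of a WARC stream and keeps only the `response` records. It is not the code in this repository: the function names `iter_warc_records` and `iter_responses` are hypothetical, it uses only the Python standard library, and it assumes well-formed records with a `Content-Length` header. A real pipeline would more likely rely on a dedicated WARC library.

```python
import gzip
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, payload) for each record in an uncompressed WARC stream.

    Illustrative helper (hypothetical name): header names are lower-cased,
    and the payload is read using the record's Content-Length.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separate consecutive records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line at record start: {line!r}")

        # Read "Name: value" header lines until the blank line that ends them.
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()

        length = int(headers.get("content-length", "0"))
        payload = stream.read(length)
        yield headers, payload


def iter_responses(stream: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Yield (target URI, raw HTTP response bytes) for 'response' records only."""
    for headers, payload in iter_warc_records(stream):
        if headers.get("warc-type") == "response":
            yield headers.get("warc-target-uri", ""), payload
```

Common Crawl distributes WARC files gzip-compressed, so the stream would typically come from `gzip.open(path, "rb")`, where `path` points at a locally downloaded `.warc.gz` file. The payload of each response record still contains the HTTP headers followed by the page body, so text extraction and filtering would be separate later steps.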

[Image: Presentation1]