A program for efficiently extracting the graph structure from a Wikidata truthy N-Triples dump.
You can install wd2graph by running the following command:
cargo install wd2graph
Of course, you can also build it from source.
wd2graph requires only the compressed (.gz) Wikidata truthy dump in the N-Triples format as input. You can download it with the following command:
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.gz
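For orientation, each line of the truthy dump is a single N-Triples statement. The sketch below is not wd2graph's own code. It only illustrates how an item-to-item statement could be reduced to numeric IDs, assuming the usual `http://www.wikidata.org/entity/Q…` and `http://www.wikidata.org/prop/direct/P…` IRI forms. The name `parse_edge` is hypothetical.

```python
import re

# Matches an item-to-item truthy statement such as:
# <http://www.wikidata.org/entity/Q42> <http://www.wikidata.org/prop/direct/P31> <http://www.wikidata.org/entity/Q5> .
EDGE_RE = re.compile(
    r"^<http://www\.wikidata\.org/entity/Q(\d+)> "
    r"<http://www\.wikidata\.org/prop/direct/P(\d+)> "
    r"<http://www\.wikidata\.org/entity/Q(\d+)> \.$"
)


def parse_edge(line):
    """Return (subject QID, PID, object QID) as ints, or None for any other line."""
    match = EDGE_RE.match(line.strip())
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())
```

Statements whose object is a literal (labels, dates, quantities) do not match the pattern and yield `None`, so only entity-to-entity links survive.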
After downloading the dump, you can extract the graph data with the following command:
wd2graph --input latest-truthy.nt.gz \
--output-graph graph.parquet \
--output-nodes nodes.parquet
The outputs are written into zstd-compressed Apache Parquet files.
The file given as the --output-nodes argument contains a single column named qid (UInt32) filled with all of the QIDs.
The file given as the --output-graph argument contains three columns named lhs (UInt32), property (UInt32), and rhs (UInt32), holding triples that represent directed edges. lhs and rhs are QIDs, while property is the PID.
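As a rough illustration of consuming these files, the sketch below reads the edge columns and groups them into an adjacency map. It assumes the third-party `pyarrow` package is installed. `load_edges` and `build_adjacency` are hypothetical helper names, not part of wd2graph.

```python
from collections import defaultdict


def load_edges(path, limit=None):
    """Yield (lhs, property, rhs) rows from the graph file (requires pyarrow)."""
    import pyarrow.parquet as pq  # imported lazily so build_adjacency works without it

    table = pq.read_table(path, columns=["lhs", "property", "rhs"])
    if limit is not None:
        table = table.slice(0, limit)  # keep only the first `limit` rows
    return zip(
        table.column("lhs").to_pylist(),
        table.column("property").to_pylist(),
        table.column("rhs").to_pylist(),
    )


def build_adjacency(edges):
    """Map each source QID to a list of (PID, target QID) pairs."""
    adjacency = defaultdict(list)
    for lhs, prop, rhs in edges:
        adjacency[lhs].append((prop, rhs))
    return adjacency
```

For example, `build_adjacency(load_edges("graph.parquet", limit=1_000_000))` builds the map for the first million edges. Note that `read_table` still loads the selected columns in full before slicing, and converting every row of the full graph to Python objects would be slow and memory-hungry, so working on the Arrow columns directly is likely the better choice at that scale.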
wd2graph uses a single thread. On a dump from March 2023, containing ~100,000,000 nodes and ~700,000,000 edges, it takes ~16 minutes to complete, with a peak memory usage of ~22 GB, on an AMD Ryzen Threadripper 3970X CPU and an SSD.