ArchiveGraph is a lightweight Python ingestion pipeline that converts historical archive CSV data into a structured, typed knowledge graph.
It is designed as reusable research infrastructure for digital humanities and cultural heritage datasets: clean the data, deduplicate entities deterministically, build a typed graph, and export standard graph formats.
In digital humanities research, historical datasets are often stored as spreadsheets or CSV files. These tables typically contain:
- repeated entities across many records
- inconsistent naming or extra whitespace
- no explicit graph structure
- manual effort required to build network representations
ArchiveGraph automates the transformation from flat archival records into a structured knowledge graph with typed nodes and relationships.
- CSV ingestion and normalization
- Deterministic entity deduplication:
  - Person nodes deduplicated by: person_name + birth_year + death_year
  - Other entities deduplicated by: normalized label
- Typed graph construction using a NetworkX MultiDiGraph
- Explicit relationship types:
  - LOCATED_IN (Person → Place)
  - AFFILIATED_WITH (Person → Institution)
  - ACTIVE_IN (Person → Domain)
  - MENTIONED_IN_EVENT (Person → Event)
- Graph export formats:
- GraphML (compatible with Gephi / Neo4j)
- Node-link JSON
- Console QA summary:
- Data quality report
- Node/edge counts
- Node distribution by type
- Top connected nodes
- Edge distribution by type
- Command-line interface (CLI) support
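The deterministic deduplication above can be sketched as a pair of key functions: identical keys mean the same entity. The helper names here are illustrative, not ArchiveGraph's actual internals.

```python
def normalize_label(raw: str) -> str:
    """Collapse runs of whitespace and trim, so variant spellings align."""
    return " ".join(raw.split())

def person_key(name: str, birth_year: str, death_year: str) -> tuple:
    """Person nodes deduplicate on (name, birth_year, death_year)."""
    return ("Person", normalize_label(name), birth_year, death_year)

def entity_key(node_type: str, label: str) -> tuple:
    """Every other entity type deduplicates on its normalized label alone."""
    return (node_type, normalize_label(label))

# Two differently formatted mentions collapse to the same key:
assert person_key("Clara  Schumann", "1819", "1896") == \
       person_key("Clara Schumann", "1819", "1896")
assert entity_key("Place", " Frankfurt ") == entity_key("Place", "Frankfurt")
```

Because the keys are pure string functions with no fuzzy matching, repeated runs over the same CSV always produce the same graph.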
Using conda:

```bash
conda create -n archivegraph python=3.11 -y
conda activate archivegraph
pip install -r requirements.txt
```

Run with default sample data:

```bash
python src/main.py
```

Run with custom input and output directory:

```bash
python src/main.py --input data/sample_archive.csv --outdir outputs
```

View help:

```bash
python src/main.py -h
```

Expected columns:
- record_id
- person_name
- birth_year
- death_year
- place
- institution
- domain
- event
Example row:
```csv
1,Mayer Amschel Rothschild,1744,1812,Frankfurt,House of Rothschild,Finance,Napoleonic Wars
```

Example console output:

```text
=== Data Quality Report ===
rows_loaded: 6
rows_kept: 5
rows_dropped_empty_person: 1
missing_place: 0
missing_institution: 1
missing_domain: 0
missing_event: 0

=== Graph Summary ===
Total nodes: 13
Total edges: 19

Nodes by type:
  Person: 3
  Place: 3
  Institution: 2
  Domain: 3
  Event: 2

Top connected nodes (degree):
  Mayer Amschel Rothschild [Person] -> degree 8
  Clara Schumann [Person] -> degree 8

Edges by type:
  LOCATED_IN: 5
  AFFILIATED_WITH: 4
  ACTIVE_IN: 5
  MENTIONED_IN_EVENT: 5
```
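A data quality report like the one above falls out of the cleaning pass. A minimal standard-library sketch (function and counter names are illustrative, not ArchiveGraph's actual load_clean.py):

```python
import csv
import io
from collections import Counter

def load_and_clean(csv_text: str):
    """Strip whitespace, drop rows without a person_name, tally quality stats."""
    stats = Counter()
    kept = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        stats["rows_loaded"] += 1
        cleaned = {k: (v or "").strip() for k, v in row.items()}
        if not cleaned["person_name"]:
            stats["rows_dropped_empty_person"] += 1
            continue
        for col in ("place", "institution", "domain", "event"):
            if not cleaned[col]:
                stats[f"missing_{col}"] += 1
        kept.append(cleaned)
    stats["rows_kept"] = len(kept)
    return kept, stats

sample = """record_id,person_name,birth_year,death_year,place,institution,domain,event
1,Mayer Amschel Rothschild ,1744,1812,Frankfurt,House of Rothschild,Finance,Napoleonic Wars
2, ,,,Leipzig,,Music,
"""
rows, stats = load_and_clean(sample)
# Row 2 has a blank person_name, so it is dropped and counted:
# rows_loaded=2, rows_kept=1, rows_dropped_empty_person=1
```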
Node types:

- Person
- Place
- Institution
- Domain
- Event

All nodes include:

- node_type
- label

Person nodes additionally include:

- birth_year
- death_year

Edge types:

- LOCATED_IN (Person → Place)
- AFFILIATED_WITH (Person → Institution)
- ACTIVE_IN (Person → Domain)
- MENTIONED_IN_EVENT (Person → Event)

Each edge includes:

- edge_type
- label
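The node and edge schema above can be sketched with a NetworkX MultiDiGraph. Node ids and output paths here are illustrative, not ArchiveGraph's actual build_graph.py or export_graph.py:

```python
import json
import os
import networkx as nx
from networkx.readwrite import json_graph

G = nx.MultiDiGraph()

# Every node carries node_type and label; Person nodes add the years.
person = "person:mayer amschel rothschild|1744|1812"
G.add_node(person, node_type="Person", label="Mayer Amschel Rothschild",
           birth_year="1744", death_year="1812")
G.add_node("place:frankfurt", node_type="Place", label="Frankfurt")
G.add_node("event:napoleonic wars", node_type="Event", label="Napoleonic Wars")

# Every edge carries edge_type (with a label mirroring it).
G.add_edge(person, "place:frankfurt",
           edge_type="LOCATED_IN", label="LOCATED_IN")
G.add_edge(person, "event:napoleonic wars",
           edge_type="MENTIONED_IN_EVENT", label="MENTIONED_IN_EVENT")

# Export both formats listed earlier.
os.makedirs("outputs", exist_ok=True)
nx.write_graphml(G, "outputs/graph.graphml")  # opens in Gephi / Neo4j importers
with open("outputs/graph.json", "w") as f:
    json.dump(json_graph.node_link_data(G), f, indent=2)
```

Keeping all attributes as strings sidesteps GraphML typing surprises when the files are round-tripped through external tools.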
```text
archivegraph/
  data/
    sample_archive.csv
  src/
    load_clean.py
    entity_manager.py
    build_graph.py
    export_graph.py
    main.py
  outputs/
  README.md
  requirements.txt
```
ArchiveGraph is intentionally minimal and focused.
It is not:
- a UI or dashboard
- a visualization tool
- a machine learning system
- an RDF/Wikidata integration layer
- a large-scale distributed pipeline
It is a clean, reusable ingestion layer for structured historical graph construction.
ArchiveGraph provides a reproducible and deterministic method for converting historical archive spreadsheets into structured knowledge graphs with typed entities and relationships.
It is intended for:
- research software engineers
- digital humanities labs
- cultural heritage data teams
- knowledge graph researchers