ArchiveGraph is a lightweight Python ingestion pipeline that converts historical archive CSV data into a structured, typed knowledge graph.
It is designed as reusable research infrastructure for digital humanities and cultural heritage datasets: clean the data, deduplicate entities deterministically, build a typed graph, and export standard graph formats.
In digital humanities research, historical datasets are often stored as spreadsheets or CSV files. These tables typically contain:
- repeated entities across many records
- inconsistent naming or extra whitespace
- no explicit graph structure
- manual effort required to build network representations
ArchiveGraph automates the transformation from flat archival records into a structured knowledge graph with typed nodes and relationships.
- CSV ingestion and normalization
- Deterministic entity deduplication:
  - Person nodes deduplicated by: person_name + birth_year + death_year
  - Other entities deduplicated by: normalized label
- Typed graph construction using a NetworkX MultiDiGraph
- Explicit relationship types:
  - LOCATED_IN (Person → Place)
  - AFFILIATED_WITH (Person → Institution)
  - ACTIVE_IN (Person → Domain)
  - MENTIONED_IN_EVENT (Person → Event)
- Graph export formats:
- GraphML (compatible with Gephi / Neo4j)
- Node-link JSON
- Console QA summary:
- Data quality report
- Node/edge counts
- Node distribution by type
- Top connected nodes
- Edge distribution by type
- Command-line interface (CLI) support
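The deterministic deduplication above can be sketched as a pair of key functions: identical keys mean the same entity. The helper names here are illustrative, not ArchiveGraph's actual internals.

```python
def normalize_label(raw: str) -> str:
    """Collapse runs of whitespace and trim, so variant spellings align."""
    return " ".join(raw.split())

def person_key(name: str, birth_year: str, death_year: str) -> tuple:
    """Person nodes deduplicate on (name, birth_year, death_year)."""
    return ("Person", normalize_label(name), birth_year, death_year)

def entity_key(node_type: str, label: str) -> tuple:
    """Every other entity type deduplicates on its normalized label alone."""
    return (node_type, normalize_label(label))

# Two differently formatted mentions collapse to the same key:
assert person_key("Clara  Schumann", "1819", "1896") == \
       person_key("Clara Schumann", "1819", "1896")
assert entity_key("Place", " Frankfurt ") == entity_key("Place", "Frankfurt")
```

Because the keys are pure string functions with no fuzzy matching, repeated runs over the same CSV always produce the same graph.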
Using conda:

```bash
conda create -n archivegraph python=3.11 -y
conda activate archivegraph
pip install -r requirements.txt
```

Run with default sample data:

```bash
python src/main.py
```

Run with custom input and output directory:

```bash
python src/main.py --input data/sample_archive.csv --outdir outputs
```

View help:

```bash
python src/main.py -h
```

Expected columns:
- record_id
- person_name
- birth_year
- death_year
- place
- institution
- domain
- event
Example row:
```csv
1,Mayer Amschel Rothschild,1744,1812,Frankfurt,House of Rothschild,Finance,Napoleonic Wars
```

Example console output:

```text
=== Data Quality Report ===
rows_loaded: 6
rows_kept: 5
rows_dropped_empty_person: 1
missing_place: 0
missing_institution: 1
missing_domain: 0
missing_event: 0

=== Graph Summary ===
Total nodes: 13
Total edges: 19

Nodes by type:
  Person: 3
  Place: 3
  Institution: 2
  Domain: 3
  Event: 2

Top connected nodes (degree):
  Mayer Amschel Rothschild [Person] -> degree 8
  Clara Schumann [Person] -> degree 8

Edges by type:
  LOCATED_IN: 5
  AFFILIATED_WITH: 4
  ACTIVE_IN: 5
  MENTIONED_IN_EVENT: 5
```
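A data quality report like the one above falls out of the cleaning pass. A minimal standard-library sketch (function and counter names are illustrative, not ArchiveGraph's actual load_clean.py):

```python
import csv
import io
from collections import Counter

def load_and_clean(csv_text: str):
    """Strip whitespace, drop rows without a person_name, tally quality stats."""
    stats = Counter()
    kept = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        stats["rows_loaded"] += 1
        cleaned = {k: (v or "").strip() for k, v in row.items()}
        if not cleaned["person_name"]:
            stats["rows_dropped_empty_person"] += 1
            continue
        for col in ("place", "institution", "domain", "event"):
            if not cleaned[col]:
                stats[f"missing_{col}"] += 1
        kept.append(cleaned)
    stats["rows_kept"] = len(kept)
    return kept, stats

sample = """record_id,person_name,birth_year,death_year,place,institution,domain,event
1,Mayer Amschel Rothschild ,1744,1812,Frankfurt,House of Rothschild,Finance,Napoleonic Wars
2, ,,,Leipzig,,Music,
"""
rows, stats = load_and_clean(sample)
# Row 2 has a blank person_name, so it is dropped and counted:
# rows_loaded=2, rows_kept=1, rows_dropped_empty_person=1
```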
Node types:

- Person
- Place
- Institution
- Domain
- Event

All nodes include:

- node_type
- label

Person nodes additionally include:

- birth_year
- death_year

Edge types:

- LOCATED_IN (Person → Place)
- AFFILIATED_WITH (Person → Institution)
- ACTIVE_IN (Person → Domain)
- MENTIONED_IN_EVENT (Person → Event)

Each edge includes:

- edge_type
- label
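The node and edge schema above can be sketched with a NetworkX MultiDiGraph. Node ids and output paths here are illustrative, not ArchiveGraph's actual build_graph.py or export_graph.py:

```python
import json
import os
import networkx as nx
from networkx.readwrite import json_graph

G = nx.MultiDiGraph()

# Every node carries node_type and label; Person nodes add the years.
person = "person:mayer amschel rothschild|1744|1812"
G.add_node(person, node_type="Person", label="Mayer Amschel Rothschild",
           birth_year="1744", death_year="1812")
G.add_node("place:frankfurt", node_type="Place", label="Frankfurt")
G.add_node("event:napoleonic wars", node_type="Event", label="Napoleonic Wars")

# Every edge carries edge_type (with a label mirroring it).
G.add_edge(person, "place:frankfurt",
           edge_type="LOCATED_IN", label="LOCATED_IN")
G.add_edge(person, "event:napoleonic wars",
           edge_type="MENTIONED_IN_EVENT", label="MENTIONED_IN_EVENT")

# Export both formats listed earlier.
os.makedirs("outputs", exist_ok=True)
nx.write_graphml(G, "outputs/graph.graphml")  # opens in Gephi / Neo4j importers
with open("outputs/graph.json", "w") as f:
    json.dump(json_graph.node_link_data(G), f, indent=2)
```

Keeping all attributes as strings sidesteps GraphML typing surprises when the files are round-tripped through external tools.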
```text
archivegraph/
  data/
    sample_archive.csv
  src/
    load_clean.py
    entity_manager.py
    build_graph.py
    export_graph.py
    main.py
  outputs/
  README.md
  requirements.txt
```
ArchiveGraph is intentionally minimal and focused.
It is not:
- a UI or dashboard
- a visualization tool
- a machine learning system
- an RDF/Wikidata integration layer
- a large-scale distributed pipeline
It is a clean, reusable ingestion layer for structured historical graph construction.
ArchiveGraph provides a reproducible and deterministic method for converting historical archive spreadsheets into structured knowledge graphs with typed entities and relationships.
It is intended for:
- research software engineers
- digital humanities labs
- cultural heritage data teams
- knowledge graph researchers