
ArchiveGraph

ArchiveGraph is a lightweight Python ingestion pipeline that converts historical archive CSV data into a structured, typed knowledge graph.

It is designed as reusable research infrastructure for digital humanities and cultural heritage datasets: clean the data, deduplicate entities deterministically, build a typed graph, and export standard graph formats.


Problem

In digital humanities research, historical datasets are often stored as spreadsheets or CSV files. These tables typically exhibit:

  • entities repeated across many records
  • inconsistent naming and stray whitespace
  • no explicit graph structure
  • a need for manual effort to build network representations

ArchiveGraph automates the transformation from flat archival records into a structured knowledge graph with typed nodes and relationships.


Features

  • CSV ingestion and normalization
  • Deterministic entity deduplication
    • Person nodes deduplicated by: person_name + birth_year + death_year
    • Other entities deduplicated by: normalized label
  • Typed graph construction using NetworkX MultiDiGraph
  • Explicit relationship types:
    • LOCATED_IN (Person → Place)
    • AFFILIATED_WITH (Person → Institution)
    • ACTIVE_IN (Person → Domain)
    • MENTIONED_IN_EVENT (Person → Event)
  • Graph export formats:
    • GraphML (compatible with Gephi / Neo4j)
    • Node-link JSON
  • Console QA summary:
    • Data quality report
    • Node/edge counts
    • Node distribution by type
    • Top connected nodes
    • Edge distribution by type
  • Command-line interface (CLI) support
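The deduplication rules above can be sketched as key functions: a Person is identified by name plus life dates, while every other entity collapses to its normalized label. This is an illustrative sketch (the helper names are hypothetical, not necessarily those in entity_manager.py):

```python
def normalize_label(raw: str) -> str:
    """Collapse internal whitespace and trim, so 'Frankfurt ' matches 'Frankfurt'."""
    return " ".join(raw.split())

def person_key(person_name: str, birth_year: str, death_year: str) -> tuple:
    """Deterministic dedup key for Person nodes: name + birth_year + death_year."""
    return (normalize_label(person_name), birth_year, death_year)

def entity_key(label: str) -> str:
    """Deterministic dedup key for Place/Institution/Domain/Event nodes."""
    return normalize_label(label)
```

Because these keys are pure functions of the input fields, re-running the pipeline on the same CSV always produces the same node set.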

Installation

Using conda:

conda create -n archivegraph python=3.11 -y
conda activate archivegraph
pip install -r requirements.txt

Usage

Run with default sample data:

python src/main.py

Run with a custom input file and output directory:

python src/main.py --input data/sample_archive.csv --outdir outputs

View help:

python src/main.py -h
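A minimal argparse setup matching the flags shown above might look like the following. This is a sketch with assumed defaults; the actual main.py may be organized differently:

```python
import argparse

def parse_args(argv=None):
    # Mirrors the CLI shown above: --input CSV path, --outdir output directory.
    parser = argparse.ArgumentParser(
        prog="archivegraph",
        description="Build a typed knowledge graph from archival CSV data.",
    )
    parser.add_argument("--input", default="data/sample_archive.csv",
                        help="path to the input CSV file")
    parser.add_argument("--outdir", default="outputs",
                        help="directory for GraphML / JSON exports")
    return parser.parse_args(argv)
```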

Example Input (CSV)

Expected columns:

  • record_id
  • person_name
  • birth_year
  • death_year
  • place
  • institution
  • domain
  • event

Example row:

1,Mayer Amschel Rothschild,1744,1812,Frankfurt,House of Rothschild,Finance,Napoleonic Wars
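Rows of this shape can be parsed and cleaned with the standard library alone. The sketch below (hypothetical helper; load_clean.py may differ) trims whitespace from every field and drops rows with an empty person_name, matching the quality-report counters shown later:

```python
import csv
import io

COLUMNS = ["record_id", "person_name", "birth_year", "death_year",
           "place", "institution", "domain", "event"]

def load_rows(text: str) -> list[dict]:
    """Parse CSV text, strip whitespace from fields, drop rows with no person."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=COLUMNS)
    rows = []
    for raw in reader:
        row = {k: (v or "").strip() for k, v in raw.items()}
        if row["person_name"]:  # rows without a person are dropped
            rows.append(row)
    return rows
```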

Example Console Output

=== Data Quality Report ===
rows_loaded: 6
rows_kept: 5
rows_dropped_empty_person: 1
missing_place: 0
missing_institution: 1
missing_domain: 0
missing_event: 0

=== Graph Summary ===
Total nodes: 13
Total edges: 19

Nodes by type:
  Person: 3
  Place: 3
  Institution: 2
  Domain: 3
  Event: 2

Top connected nodes (degree):
  Mayer Amschel Rothschild [Person] -> degree 8
  Clara Schumann [Person] -> degree 8

Edges by type:
  LOCATED_IN: 5
  AFFILIATED_WITH: 4
  ACTIVE_IN: 5
  MENTIONED_IN_EVENT: 5

Graph Model

Node Types

  • Person
  • Place
  • Institution
  • Domain
  • Event

Node Attributes

All nodes include:

  • node_type
  • label

Person nodes additionally include:

  • birth_year
  • death_year

Edge Types

  • LOCATED_IN (Person → Place)
  • AFFILIATED_WITH (Person → Institution)
  • ACTIVE_IN (Person → Domain)
  • MENTIONED_IN_EVENT (Person → Event)

Each edge includes:

  • edge_type
  • label
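The node and edge model above maps directly onto a NetworkX MultiDiGraph. The sketch below is illustrative only (build_graph.py may be structured differently): each cleaned row yields one Person node plus one typed edge per non-empty target column.

```python
import networkx as nx

def build_graph(rows):
    """Build a MultiDiGraph with typed nodes and the four edge types."""
    g = nx.MultiDiGraph()
    for row in rows:
        person = row["person_name"]
        g.add_node(person, node_type="Person", label=person,
                   birth_year=row.get("birth_year", ""),
                   death_year=row.get("death_year", ""))
        # (edge_type, target node_type, CSV column)
        for edge_type, node_type, col in [
            ("LOCATED_IN", "Place", "place"),
            ("AFFILIATED_WITH", "Institution", "institution"),
            ("ACTIVE_IN", "Domain", "domain"),
            ("MENTIONED_IN_EVENT", "Event", "event"),
        ]:
            target = row.get(col, "")
            if target:  # skip missing values, e.g. an empty institution field
                g.add_node(target, node_type=node_type, label=target)
                g.add_edge(person, target, edge_type=edge_type, label=edge_type)
    return g
```

From such a graph, the two export formats follow from standard NetworkX calls: `nx.write_graphml(g, path)` for GraphML and `json.dump(nx.node_link_data(g), f)` for node-link JSON.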

Project Structure

archivegraph/
  data/
    sample_archive.csv
  src/
    load_clean.py
    entity_manager.py
    build_graph.py
    export_graph.py
    main.py
  outputs/
  README.md
  requirements.txt

Scope

ArchiveGraph is intentionally minimal and focused.

It is not:

  • a UI or dashboard
  • a visualization tool
  • a machine learning system
  • an RDF/Wikidata integration layer
  • a large-scale distributed pipeline

It is a clean, reusable ingestion layer for structured historical graph construction.


Positioning

ArchiveGraph provides a reproducible and deterministic method for converting historical archive spreadsheets into structured knowledge graphs with typed entities and relationships.

It is intended for:

  • research software engineers
  • digital humanities labs
  • cultural heritage data teams
  • knowledge graph researchers
