DocSpectra is a research and engineering project focused on analyzing the structural organization of technical documentation using graph representations and spectral graph theory. The goal is to extract hyperlink graphs, section hierarchies, and other structural signals from real-world documentation, and to use spectral features to characterize documentation structure, modularity, and navigability-related properties.
This repository contains:
- Scripts for extracting graphs from documentation sources
- Specifications defining dataset structure and parsing rules
- Derived datasets generated from open documentation sources
- Tools for spectral analysis and graph diagnostics
- Documentation for legal usage and dataset release guidelines
This project supports experiments for a planned conference paper submission and serves as the foundation for a larger research direction exploring the relationship between documentation structure and information quality indicators.
Each dataset folder includes:
- graph.edgelist: directed hyperlink graph
- graph.gml: visualization-ready graph
- nodes.csv: page-level metadata
- sections.edgelist: heading hierarchy graph
- spectra.npy (optional): eigenvalue spectrum of the Laplacian
- SOURCE.md: licensing and provenance for the original documentation
- README.md: dataset-specific details
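As an illustration of how a spectrum can be derived from an edge list, here is a minimal sketch using NumPy only. It rests on several assumptions that this README does not confirm: that the edge list is plain text with one whitespace-separated `source target` pair per line, that the graph is symmetrized before analysis, and that the combinatorial Laplacian L = D - A is the operator of interest. The function names `read_edgelist` and `laplacian_spectrum` are hypothetical and are not part of the project's tooling.

```python
import numpy as np


def read_edgelist(path):
    """Read 'source target' pairs from a text file (assumed format)."""
    edges = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            parts = line.split()
            # Skip blank lines, comments, and malformed rows.
            if len(parts) >= 2 and not parts[0].startswith("#"):
                edges.append((parts[0], parts[1]))
    return edges


def laplacian_spectrum(edges):
    """Return ascending eigenvalues of L = D - A for the symmetrized graph."""
    nodes = sorted({node for edge in edges for node in edge})
    index = {node: i for i, node in enumerate(nodes)}
    adj = np.zeros((len(nodes), len(nodes)))
    for src, dst in edges:
        if src == dst:
            continue  # self-loops are ignored in this sketch
        i, j = index[src], index[dst]
        adj[i, j] = adj[j, i] = 1.0  # drop edge direction
    lap = np.diag(adj.sum(axis=1)) - adj
    # eigvalsh is appropriate because lap is real and symmetric.
    return np.linalg.eigvalsh(lap)
```

For a three-page chain such as `[("a", "b"), ("b", "c")]`, the eigenvalues are 0, 1, and 3. The number of zero eigenvalues equals the number of connected components in the symmetrized graph.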
Raw documentation files are never included in this repo and must be downloaded separately.
If you use this software or dataset, please cite the associated paper. See CITATION.cff for full citation metadata.
Preferred citation
D'Antonio, Rocker. "Graph Spectra of Technical Documentation." IEEE SoutheastCon 2026, 2026.
This project analyzes the structural properties of publicly available technical documentation corpora.
All documentation sources remain the property of their respective owners and are used here for non-commercial research analysis only. No documentation content is redistributed.
Analyzed corpora include:
- Python 3.12 Documentation
  Python Software Foundation
  License: Python Software Foundation Documentation License v2
  https://docs.python.org/3.12/
- MkDocs Documentation
  MkDocs Project
  License: BSD-3-Clause
  https://www.mkdocs.org/
- AgOpenGPS Documentation
  AgOpenGPS Project
  License: MIT License
  https://docs.agopengps.com/
- OADA API Documentation
  Open Ag Data Alliance
  License: See repository
  https://github.com/OADA/oada-docs
This repository contains only derived graph representations and summary statistics, not original documentation content.
This project relies on standard open-source scientific computing libraries, including:
- Python
- NumPy
- SciPy
- NetworkX
- BeautifulSoup
All dependencies are used in accordance with their respective licenses.
To reproduce the published dataset directly, run the importer pipeline against the canonical dataset root.
Note: The raw edges.csv may include duplicate (source, target) pairs because repeated <a> links are preserved in QA counts. Spectral analysis deduplicates these by loading edges into a NetworkX DiGraph (which stores a single edge per pair), so repeated links do not affect analysis metrics.
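The effect of that deduplication can be sketched without NetworkX. The example below is illustrative only: `dedupe_edges` is a hypothetical helper rather than project code, and it takes already-parsed (source, target) pairs because the column layout of edges.csv is not specified here.

```python
from collections import Counter


def dedupe_edges(pairs):
    """Collapse repeated (source, target) pairs, keeping first-seen order.

    Returns the unique edges plus a mapping of every repeated pair to its
    raw count, which is the kind of figure a QA report might retain.
    """
    counts = Counter((src, dst) for src, dst in pairs)
    unique = list(counts)  # Counter preserves insertion order
    repeated = {pair: n for pair, n in counts.items() if n > 1}
    return unique, repeated


raw = [("index", "install"), ("index", "install"), ("install", "faq")]
unique, repeated = dedupe_edges(raw)
print(len(raw), len(unique), repeated)
```

A directed graph built from `unique` has the same edge set as one built from `raw`, which is why repeated links leave the spectral metrics unchanged.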
⚠️ The importer downloads source documentation into datasets/DocSpectraSE2026/raw/. Raw files are not committed to Git.

    python docspectra-importer/docspectra_importer/corpora_importer.py -c datasets/DocSpectraSE2026/corpora.yml -o datasets/DocSpectraSE2026
    python -m docspectra_importer.graph_builder -c datasets/DocSpectraSE2026/corpora.yml -d datasets/DocSpectraSE2026
    python -m docspectra_importer.graph_qa -d datasets/DocSpectraSE2026 -c datasets/DocSpectraSE2026/corpora.yml --verbose
    python docspectra_analysis/graph_spectra.py -d datasets/DocSpectraSE2026 --format both

When complete, the dataset directory will contain:
- raw/: downloaded corpora
- graphs/<source_id>/: graph artifacts
- analysis/<source_id>/qa.*: QA reports
- analysis/<source_id>/spectra.*: spectra reports
- analysis/qa.rollup.* and analysis/spectra.rollup.*: rollups
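After a run, the layout above can be sanity-checked with a short script. This is a sketch, not part of the project's tooling: `missing_dirs` and `sources_without_analysis` are hypothetical names, and the check only looks at directory presence, not at the contents of individual reports.

```python
from pathlib import Path

# Top-level directories named in the layout above.
EXPECTED_DIRS = ("raw", "graphs", "analysis")


def missing_dirs(dataset_root):
    """Return the expected top-level directories that do not exist."""
    root = Path(dataset_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


def sources_without_analysis(dataset_root):
    """Return source ids under graphs/ that have no analysis/<source_id>/."""
    root = Path(dataset_root)
    graphs = root / "graphs"
    if not graphs.is_dir():
        return []
    return sorted(
        entry.name
        for entry in graphs.iterdir()
        if entry.is_dir() and not (root / "analysis" / entry.name).is_dir()
    )
```

Calling `missing_dirs("datasets/DocSpectraSE2026")` on a complete run should return an empty list. A non-empty result from `sources_without_analysis` suggests the QA or spectra step did not finish for those sources.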