DocSpectra

DocSpectra is a research and engineering project focused on analyzing the structural organization of technical documentation using graph representations and spectral graph theory. The goal is to extract hyperlink graphs, section hierarchies, and other structural signals from real-world documentation, and to use spectral features to characterize documentation structure, modularity, and navigability-related properties.

This repository contains:

  • Scripts for extracting graphs from documentation sources
  • Specifications defining dataset structure and parsing rules
  • Derived datasets generated from open documentation sources
  • Tools for spectral analysis and graph diagnostics
  • Documentation for legal usage and dataset release guidelines

This project supports experiments for a planned conference paper submission and serves as the foundation for a larger research direction exploring the relationship between documentation structure and information quality indicators.

📦 Datasets

Each dataset folder includes:

  • graph.edgelist — directed hyperlink graph
  • graph.gml — visualization-ready graph
  • nodes.csv — page-level metadata
  • sections.edgelist — heading hierarchy graph
  • spectra.npy (optional) — eigenvalue spectrum of Laplacian
  • SOURCE.md — licensing and provenance for the original documentation
  • README.md — dataset-specific details

Raw documentation files are never included in this repo and must be downloaded separately.
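
As a rough illustration of how the files listed above might be consumed, the sketch below loads the hyperlink graph and, when present, the stored spectrum. It assumes that graph.edgelist holds one whitespace-separated "source target" pair per line (the default format parsed by NetworkX) and that spectra.npy is a plain NumPy array. The helper name load_dataset is hypothetical and not part of this repository.

```python
from pathlib import Path

import networkx as nx
import numpy as np


def load_dataset(folder):
    """Load the directed hyperlink graph and the optional spectrum."""
    folder = Path(folder)
    # Assumption: one whitespace-separated "source target" pair per line.
    graph = nx.read_edgelist(
        str(folder / "graph.edgelist"), create_using=nx.DiGraph
    )
    # spectra.npy is optional, so return None when it is absent.
    spectra_path = folder / "spectra.npy"
    spectrum = np.load(spectra_path) if spectra_path.exists() else None
    return graph, spectrum
```

Returning None for a missing spectrum keeps the optional file optional: callers can decide whether to recompute it or skip spectral steps.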


📖 Citation

If you use this software or dataset, please cite the associated paper. See CITATION.cff for full citation metadata.

Preferred citation

D'Antonio, Rocker. "Graph Spectra of Technical Documentation." IEEE SoutheastCon 2026, 2026.


Data Sources and Attribution

This project analyzes the structural properties of publicly available technical documentation corpora.
All documentation sources remain the property of their respective owners and are used here for non-commercial research analysis only. No documentation content is redistributed.

Analyzed corpora include:

This repository contains only derived graph representations and summary statistics, not original documentation content.

Software Dependencies

This project relies on standard open-source scientific computing libraries, including:

  • Python
  • NumPy
  • SciPy
  • NetworkX
  • BeautifulSoup

All dependencies are used in accordance with their respective licenses.


📦 Generating the DocSpectraSE2026 Dataset

To reproduce the published dataset, run the importer pipeline against the canonical dataset root.

Note: The raw edges.csv may include duplicate (source, target) pairs because repeated <a> links are preserved in QA counts. Spectral analysis deduplicates these by loading edges into a NetworkX DiGraph (which stores a single edge per pair), so repeated links do not affect analysis metrics.
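
The deduplication described in the note can be seen in a small sketch. The edge pairs here are made-up in-memory values rather than rows read from edges.csv, and build_digraph is a hypothetical helper name.

```python
import networkx as nx


def build_digraph(edge_pairs):
    """Load (source, target) pairs into a DiGraph, one edge per pair."""
    graph = nx.DiGraph()
    # Adding an edge that already exists leaves the edge count unchanged.
    graph.add_edges_from(edge_pairs)
    return graph


# Hypothetical raw rows: the first pair is repeated, as a page with two
# <a> links to the same target would produce.
raw_edges = [
    ("index", "install"),
    ("index", "install"),
    ("install", "index"),
    ("index", "api"),
]
graph = build_digraph(raw_edges)
print(len(raw_edges), graph.number_of_edges())
```

The reverse pair ("install", "index") is kept as a separate edge because the graph is directed. Only exact (source, target) repeats collapse.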

⚠️ The importer downloads source documentation into datasets/DocSpectraSE2026/raw/. Raw files are not committed to Git.

1) Download corpora

python docspectra-importer/docspectra_importer/corpora_importer.py -c datasets/DocSpectraSE2026/corpora.yml -o datasets/DocSpectraSE2026

2) Build graphs

python -m docspectra_importer.graph_builder -c datasets/DocSpectraSE2026/corpora.yml -d datasets/DocSpectraSE2026

3) Run graph QA

python -m docspectra_importer.graph_qa -d datasets/DocSpectraSE2026 -c datasets/DocSpectraSE2026/corpora.yml --verbose

4) Compute spectral analysis

python docspectra_analysis/graph_spectra.py -d datasets/DocSpectraSE2026 --format both
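
For readers unfamiliar with this step, a minimal sketch of one way to compute a Laplacian spectrum follows. It assumes the directed graph is symmetrized and that the combinatorial Laplacian L = D - A is used. The Laplacian variant and preprocessing in graph_spectra.py may differ, and laplacian_spectrum is a hypothetical name.

```python
import networkx as nx
import numpy as np


def laplacian_spectrum(digraph):
    """Return ascending eigenvalues of L = D - A for the symmetrized graph."""
    # Assumption: edge direction is dropped so that L is symmetric.
    undirected = digraph.to_undirected()
    nodes = list(undirected.nodes())
    adjacency = nx.to_numpy_array(undirected, nodelist=nodes)
    degree = np.diag(adjacency.sum(axis=1))
    laplacian = degree - adjacency
    # eigvalsh is meant for symmetric matrices and returns sorted real values.
    return np.linalg.eigvalsh(laplacian)


# Toy three-page chain: a -> b -> c.
toy = nx.DiGraph([("a", "b"), ("b", "c")])
spectrum = laplacian_spectrum(toy)
```

A dense eigensolver like this is only practical for small graphs. Larger corpora would likely need sparse methods such as those in SciPy.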

Outputs

When complete, the dataset directory will contain:

  • raw/ downloaded corpora
  • graphs/<source_id>/ graph artifacts
  • analysis/<source_id>/qa.* QA reports
  • analysis/<source_id>/spectra.* spectra reports
  • analysis/qa.rollup.* and analysis/spectra.rollup.* rollups
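
A quick check that a run produced the layout listed above might look like the following sketch. The function name missing_outputs is hypothetical, and the source IDs must be supplied by the caller, since this sketch does not parse corpora.yml.

```python
from pathlib import Path


def missing_outputs(dataset_root, source_ids):
    """Return the expected output locations that are absent."""
    root = Path(dataset_root)
    missing = []
    if not (root / "raw").is_dir():
        missing.append("raw/")
    for source_id in source_ids:
        if not (root / "graphs" / source_id).is_dir():
            missing.append(f"graphs/{source_id}/")
        # Per-source reports may exist in several formats, so match any suffix.
        for stem in ("qa", "spectra"):
            if not any((root / "analysis" / source_id).glob(f"{stem}.*")):
                missing.append(f"analysis/{source_id}/{stem}.*")
    for stem in ("qa.rollup", "spectra.rollup"):
        if not any((root / "analysis").glob(f"{stem}.*")):
            missing.append(f"analysis/{stem}.*")
    return missing
```

An empty list means every expected directory and report pattern was found. It does not validate the contents of those files.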
