Spiderman

Recursively collect DOIs + metadata to create a web of knowledge seeded by your favorite papers.

Give it one DOI (say, a review you love) and it fetches that paper's metadata and reference list from Crossref, inserts every cited DOI as a node in a directed networkx graph (edge = "cites"), then recursively does the same for each reference. The graph is the artifact: a persistent, queryable, renderable JSON object. The code is purely functional: every operation takes a graph and returns a new one, so the graph is a clean value you can always return to and build upon.

Seed used throughout: 10.1088/0034-4885/71/3/036601 (Eggers & Villermaux, Physics of liquid jets, Rep. Prog. Phys. 2008).

Install

Requires Python ≥ 3.10. Runtime dependency: networkx only.

With uv (recommended)

git clone <this-repo> spiderman && cd spiderman
uv venv                      # creates .venv at the repo root
uv pip install -e ".[dev]"   # editable install + pytest

With stdlib venv + pip

python3 -m venv .venv
.venv/bin/pip install -e ".[dev]"

Either way you get two console commands on the venv path, spiderman and spiderman-render. Activate the venv (source .venv/bin/activate) or call them as .venv/bin/spiderman ....

CLI — the commands you will actually use

1. Seed a new graph

Fetch one paper and register all of its references as frontier stubs (one network call):

spiderman 10.1088/0034-4885/71/3/036601
# nodes: 490  edges: 489  frontier (unfetched): 489
# saved: data/knowledge_graph.json

2. Build the web (deepen)

--depth N means N fetch generations: depth 1 fetches the seed, depth 2 also fetches every reference, and so on. Re-running on an existing --out file resumes — already-fetched papers are never re-fetched, the crawl just expands the frontier:

spiderman 10.1088/0034-4885/71/3/036601 --depth 2 --mailto you@example.org
# nodes: 5085  edges: 8453  frontier (unfetched): 4596

Always pass --mailto for real crawls — it puts you in Crossref's polite pool. Depth 2 on a 489-reference review is ~489 sequential HTTP calls (10–30 minutes).
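
For context, Crossref's public REST API recognizes polite-pool callers by a mailto query parameter on the request. A minimal sketch of building such a URL follows; the helper name crossref_work_url is hypothetical, and this is not necessarily how spiderman constructs its requests.

```python
from urllib.parse import quote, urlencode

CROSSREF_WORKS = "https://api.crossref.org/works/"

def crossref_work_url(doi: str, mailto: str) -> str:
    """Build a Crossref works URL that carries a polite-pool contact address.

    Hypothetical helper for illustration; not part of spiderman's documented API.
    """
    # Percent-encode the DOI but keep "/" so the path stays readable.
    path = quote(doi, safe="/")
    # urlencode escapes the "@" in the address as %40.
    query = urlencode({"mailto": mailto})
    return f"{CROSSREF_WORKS}{path}?{query}"

url = crossref_work_url("10.1088/0034-4885/71/3/036601", "you@example.org")
print(url)
```

Building the URL involves no network call, so it is safe to run offline.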

3. Update / keep growing

The same command is the update command. Run it again next week, point it at the same JSON, raise the depth — only unfetched DOIs cost network calls:

spiderman 10.1088/0034-4885/71/3/036601 --depth 3 --out data/knowledge_graph.json
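
The resume behaviour can be illustrated with a small standalone sketch. It assumes a simplified stand-in graph (a dict mapping each fetched DOI to the DOIs it cites) rather than the real networkx graph, and the names crawl and fake_fetch are illustrative, not spiderman's API. Each pass of the outer loop is one fetch generation, and DOIs already present as keys cost no fetch.

```python
from typing import Callable, Optional

# Stand-in graph: fetched DOI -> DOIs it cites. A DOI that is cited but is not
# a key is a frontier stub. (Assumption: the real graph is a networkx DiGraph.)
Graph = dict[str, tuple[str, ...]]
Fetch = Callable[[str], Optional[list[str]]]  # reference DOIs, or None on failure

def crawl(graph: Graph, seed: str, depth: int, fetch: Fetch) -> Graph:
    """Return a NEW graph expanded by `depth` fetch generations from `seed`."""
    grown = dict(graph)            # copy: the input graph is not mutated
    generation, seen = [seed], {seed}
    for _ in range(depth):
        next_generation = []
        for doi in generation:
            if doi not in grown:   # only unfetched DOIs cost a fetch
                refs = fetch(doi)
                if refs is None:   # fetch failed: leave it on the frontier
                    continue
                grown[doi] = tuple(refs)
            for ref in grown[doi]:
                if ref not in seen:
                    seen.add(ref)
                    next_generation.append(ref)
        generation = next_generation
    return grown

# A fake fetcher over an in-memory citation table, recording every call.
CITES = {"seed": ["a", "b"], "a": ["c"], "b": ["a"], "c": []}
calls: list[str] = []

def fake_fetch(doi: str) -> Optional[list[str]]:
    calls.append(doi)
    return CITES.get(doi)

g1 = crawl({}, "seed", 1, fake_fetch)   # fetches the seed only
g2 = crawl(g1, "seed", 2, fake_fetch)   # resumes: fetches only "a" and "b"
print(calls)
```

In this toy run the second crawl never re-fetches the seed, and "c" remains an unfetched frontier entry until the depth is raised again.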

4. Render the proof (visualize)

Turn the saved JSON into a self-contained interactive HTML page — no network, no server, no recrawl:

spiderman-render data/knowledge_graph.json
# rendered: data/knowledge_graph.html (697 of 5085 nodes, min degree 3)

Open the HTML in a browser: nodes are colored so that no two adjacent nodes share a color (greedy proper coloring, classic four-color palette first), edges are arrows from citing to cited, and double-clicking any node or edge opens its metadata panel (title, authors, journal, year, abstract, citation counts). By default only the connected core is drawn — nodes with ≥ 3 edges — and the page header states exactly what fraction you are seeing. Dial it:

spiderman-render data/knowledge_graph.json --min-degree 2   # wider core
spiderman-render data/knowledge_graph.json --min-degree 0   # everything (slow!)
spiderman-render data/knowledge_graph.json --html /tmp/web.html

You can also crawl and render in one shot: spiderman <doi> --depth 2 --html data/knowledge_graph.html.

Full flag reference

Command	Flag	Default	Meaning
`spiderman`	`doi` (positional)	—	seed DOI (`10.x/...`, URL and `doi:` forms accepted)
	`--depth`	`1`	fetch generations
	`--out`	`data/knowledge_graph.json`	graph file; loaded and resumed if it exists
	`--html`	off	also render after crawling
	`--min-degree`	`3`	render threshold (with `--html`)
	`--mailto`	`spiderman@example.org`	Crossref polite-pool contact
`spiderman-render`	`graph_json` (positional)	—	saved graph JSON
	`--html`	`<graph_json>.html`	output path
	`--min-degree`	`3`	only draw nodes with at least this many edges

python -m spiderman ... is identical to spiderman ....
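
The accepted seed forms (bare DOI, URL, and doi: prefix) suggest a normalization step before a DOI becomes a node identity. The sketch below is one plausible way to do that, not the package's normalize_doi: the function name, the prefix list, the lowercasing, and the validation pattern are all assumptions.

```python
import re

# Assumed prefixes; the package's own normalize_doi may accept more or fewer.
_PREFIXES = (
    "https://doi.org/", "http://doi.org/",
    "https://dx.doi.org/", "http://dx.doi.org/",
    "doi:",
)

def normalize_doi_sketch(raw: str) -> str:
    """Reduce a URL, 'doi:' form, or bare DOI to a lowercase bare DOI."""
    text = raw.strip()
    lowered = text.lower()
    for prefix in _PREFIXES:
        if lowered.startswith(prefix):
            text = text[len(prefix):]
            break
    # DOIs are case-insensitive, so lowercasing gives one canonical spelling.
    text = text.strip().lower()
    if not re.fullmatch(r"10\.\d{4,9}/\S+", text):
        raise ValueError(f"not a DOI: {raw!r}")
    return text

print(normalize_doi_sketch("https://doi.org/10.1103/RevModPhys.69.865"))
print(normalize_doi_sketch("doi:10.1088/0034-4885/71/3/036601"))
```

Canonicalizing before insertion matters because the same paper reached through two spellings would otherwise appear as two nodes.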

Python API

Almost everything exported from the top-level package is a pure function: functions take a graph (plus arguments) and return values or new graphs, and your input graph is never mutated. The only effectful functions are fetch_paper_metadata, save_graph, load_graph, and save_graph_html.

Seed, build, and update

from pathlib import Path
from spiderman import (
    crawl_doi, empty_graph, fetch_paper_metadata,
    load_graph, save_graph, save_graph_html,
)

seed = "10.1088/0034-4885/71/3/036601"

graph = empty_graph(seed)
graph = crawl_doi(graph, seed, depth=1, fetch_metadata=fetch_paper_metadata)
save_graph(graph, Path("data/knowledge_graph.json"))

# Later: load, deepen, save again. Fetched papers are never re-fetched.
graph = load_graph(Path("data/knowledge_graph.json"))
graph = crawl_doi(graph, seed, depth=2, fetch_metadata=fetch_paper_metadata)
save_graph(graph, Path("data/knowledge_graph.json"))

fetch_metadata is an injected function of type (str) -> PaperMetadata | None — pass your own (e.g. a closure over a dict) to crawl without a network.
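
A closure over a dict, as mentioned above, might look like the following sketch. PaperStub is a stand-in defined here so the example runs on its own; the real PaperMetadata constructor and its full field list are not shown in this README, so treat the fields below as assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class PaperStub:
    """Stand-in for spiderman's PaperMetadata (fields here are assumptions)."""
    doi: str
    title: str
    references: tuple[str, ...] = ()

def make_fake_fetch(
    papers: dict[str, PaperStub],
) -> Callable[[str], Optional[PaperStub]]:
    """Return a fetch function that answers from `papers` instead of the network."""
    def fetch(doi: str) -> Optional[PaperStub]:
        # Unknown DOIs yield None, mirroring the `PaperMetadata | None` contract.
        return papers.get(doi)
    return fetch

fake_fetch = make_fake_fetch({
    "10.1/a": PaperStub("10.1/a", "Paper A", references=("10.1/b",)),
    "10.1/b": PaperStub("10.1/b", "Paper B"),
})

print(fake_fetch("10.1/a"))
print(fake_fetch("10.1/missing"))
```

To use this pattern with crawl_doi, the closure would need to return genuine PaperMetadata instances in place of the stand-in.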

Query

from spiderman import (
    has_paper, is_fetched, paper_metadata_of, papers_by_author,
    references_of, cited_by, unfetched_dois,
)

paper = paper_metadata_of(graph, seed)     # typed frozen dataclass
paper.title, paper.authors, paper.year     # 'Physics of liquid jets', (...), 2008

references_of(graph, seed)                 # DOIs this paper cites
cited_by(graph, "10.1103/revmodphys.69.865")  # DOIs citing it, within the graph
papers_by_author(graph, "eggers")          # case-insensitive substring match
unfetched_dois(graph)                      # the crawl frontier

Modify

from spiderman import add_paper, add_citation, prune_low_degree_nodes

grown = add_citation(graph, "10.1/a", "10.1/b")   # returns a NEW graph
core = prune_low_degree_nodes(graph, minimum_degree=3)
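
To make the "returns a new graph" contract concrete, here is a standalone sketch of degree-based pruning over an immutable edge set. It is not the package's implementation: it assumes degree counts incoming plus outgoing edges and that pruning is a single pass (not repeated until stable), and the name prune_low_degree is illustrative.

```python
from collections import Counter

Edge = tuple[str, str]  # (citing DOI, cited DOI)

def prune_low_degree(edges: frozenset, minimum_degree: int) -> frozenset:
    """Return a NEW edge set keeping only nodes whose total degree meets the threshold."""
    degree = Counter()
    for citing, cited in edges:
        degree[citing] += 1   # outgoing edge
        degree[cited] += 1    # incoming edge
    keep = {node for node, count in degree.items() if count >= minimum_degree}
    # Keep an edge only if both endpoints survive the cut.
    return frozenset((a, b) for a, b in edges if a in keep and b in keep)

graph_edges = frozenset({("s", "a"), ("s", "b"), ("s", "c"), ("a", "b")})
core_edges = prune_low_degree(graph_edges, minimum_degree=2)

print(sorted(core_edges))    # the pruned copy
print(len(graph_edges))      # the original is untouched
```

Using frozenset for the input makes accidental mutation impossible in the sketch; the real functions provide the same guarantee for networkx graphs by returning a new graph.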

Render

from pathlib import Path
from spiderman import render_graph_html, save_graph_html, color_nodes

html_text = render_graph_html(graph, minimum_degree=3)   # pure: returns the document
save_graph_html(graph, Path("data/web.html"), minimum_degree=3)

coloring = color_nodes(graph)   # proper coloring: adjacent nodes always differ
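
A greedy proper coloring can be sketched in a few lines. The version below works on a plain undirected adjacency dict and visits nodes largest-degree first; the ordering, the integer color indices, and the name greedy_coloring are assumptions and may differ from what color_nodes does internally.

```python
def greedy_coloring(neighbors: dict[str, set[str]]) -> dict[str, int]:
    """Assign each node the smallest color index not used by a colored neighbor."""
    coloring: dict[str, int] = {}
    # Largest degree first, ties broken by name so the result is deterministic.
    order = sorted(neighbors, key=lambda n: (-len(neighbors[n]), n))
    for node in order:
        taken = {coloring[m] for m in neighbors[node] if m in coloring}
        color = 0
        while color in taken:
            color += 1
        coloring[node] = color
    return coloring

# Triangle a-b-c with a pendant node d attached to a (citation direction ignored).
adjacency = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
colors = greedy_coloring(adjacency)
print(colors)
```

Greedy coloring always yields a proper coloring but not necessarily a minimal one, which is consistent with starting from a four-color palette and extending it when needed.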

The artifact

data/knowledge_graph.json is standard node-link JSON — human-readable, jq-able, loss-free on round trip:

{
  "directed": true,
  "multigraph": false,
  "graph": {"seed_doi": "10.1088/0034-4885/71/3/036601"},
  "nodes": [
    {"id": "10.1088/0034-4885/71/3/036601", "title": "Physics of liquid jets",
     "authors": ["Jens Eggers", "Emmanuel Villermaux"],
     "journal": "Reports on Progress in Physics", "year": 2008,
     "abstract": "", "fetched": true},
    {"id": "10.1098/rstl.1805.0005", "fetched": false}
  ],
  "links": [
    {"source": "10.1088/0034-4885/71/3/036601", "target": "10.1098/rstl.1805.0005"}
  ]
}
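
Because the file is plain node-link JSON, it can be inspected with nothing but the standard library. The sketch below relies only on the key names visible in the sample above ("nodes", "links", "fetched", "seed_doi") and embeds a trimmed copy of that sample so it runs without a saved graph.

```python
import json

# Trimmed copy of the sample document shown above.
SAMPLE = """
{
  "directed": true,
  "multigraph": false,
  "graph": {"seed_doi": "10.1088/0034-4885/71/3/036601"},
  "nodes": [
    {"id": "10.1088/0034-4885/71/3/036601", "title": "Physics of liquid jets",
     "year": 2008, "fetched": true},
    {"id": "10.1098/rstl.1805.0005", "fetched": false}
  ],
  "links": [
    {"source": "10.1088/0034-4885/71/3/036601", "target": "10.1098/rstl.1805.0005"}
  ]
}
"""

doc = json.loads(SAMPLE)
seed = doc["graph"]["seed_doi"]

# Nodes seen in a reference list but not yet expanded.
frontier = [node["id"] for node in doc["nodes"] if not node.get("fetched", False)]

# Outgoing "cites" edges of the seed paper.
seed_references = [link["target"] for link in doc["links"] if link["source"] == seed]

print(frontier)
print(seed_references)
```

For a real graph, replace the embedded string with the contents of data/knowledge_graph.json.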

"fetched": false marks a frontier stub: a paper we have seen cited but not yet expanded. The rendered HTML is derived from this file and reproducible at any time; the JSON is canonical.

Architecture

Every significant decision is an ADR in docs/adr/ — metadata source, graph representation, purity discipline, crawl semantics, persistence format, rendering. Read those for the why. The shape of the code:

spiderman/
  doi/           normalize_doi, validate_doi          (canonical node identity)
  metadata/      PaperMetadata, parse_crossref_work,
                 fetch_paper_metadata                 (the only network effect)
  graph/         empty_graph, add_paper, add_citation,
                 queries, prune, JSON <-> graph       (pure graph API)
  crawl/         crawl_doi                            (stateless recursion)
  render/        color_nodes, cytoscape_elements,
                 render_graph_html                    (pure HTML generation)
  persistence/   save_graph, load_graph,
                 save_graph_html                      (the only disk effects)
  cli/           crawl_command, render_command        (console entry points)

One function per file; frozen dataclasses; effects only at named boundaries. Agent roles and coding standards live in Claude.md.

Testing

.venv/bin/python -m pytest -q                  # 94 unit tests, no network
.venv/bin/python -m pytest -q -m integration   # live Crossref test

Unit tests replace the network with a fake fetch closure; integration tests hit the real Crossref API and are deselected by default.
