Recursively collect DOIs + metadata to create a web of knowledge seeded by your favorite papers.
Give it one DOI (say, a review you love) and it fetches that paper's metadata and reference list from Crossref, inserts every cited DOI as a node in a directed networkx graph (edge = "cites"), then recursively does the same for each reference. The graph is the artefact: a persistent, queryable, renderable object stored as JSON. The code is purely functional: every operation takes a graph and returns a new one, so the graph stays a mathematically clean object that you can always reload and operate on.
Seed used throughout: 10.1088/0034-4885/71/3/036601
(Eggers & Villermaux, Physics of liquid jets, Rep. Prog. Phys. 2008).
Requires Python ≥ 3.10. Runtime dependency: networkx only.
git clone <this-repo> spiderman && cd spiderman
uv venv # creates .venv at the repo root
uv pip install -e ".[dev]" # editable install + pytest

python3 -m venv .venv
.venv/bin/pip install -e ".[dev]"

Either way you get two console commands on the venv path, spiderman and
spiderman-render. Activate the venv (source .venv/bin/activate) or call
them as .venv/bin/spiderman ....
Fetch one paper and register all of its references as frontier stubs (one network call):
spiderman 10.1088/0034-4885/71/3/036601
# nodes: 490 edges: 489 frontier (unfetched): 489
# saved: data/knowledge_graph.json

--depth N means N fetch generations: depth 1 fetches the seed, depth 2
also fetches every reference, and so on. Re-running on an existing --out
file resumes; already-fetched papers are never re-fetched, and the crawl
just expands the frontier:
spiderman 10.1088/0034-4885/71/3/036601 --depth 2 --mailto you@example.org
# nodes: 5085 edges: 8453 frontier (unfetched): 4596

Always pass --mailto for real crawls: it puts you in Crossref's polite
pool. Depth 2 on a 489-reference review is ~489 sequential HTTP calls
(10–30 minutes).
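To make the depth and resume semantics concrete, here is a minimal, self-contained sketch of a generation-by-generation crawl over a toy citation map. It is not spiderman's code: the names crawl, fake_fetch, and TOY_REFERENCES and the dict-based graph are assumptions for illustration, whereas the real tool works on a networkx graph and queries Crossref.

```python
from typing import Callable, Mapping, Optional

# Hypothetical toy data: DOI -> DOIs it cites.
TOY_REFERENCES: Mapping[str, tuple[str, ...]] = {
    "10.1/seed": ("10.1/a", "10.1/b"),
    "10.1/a": ("10.1/c",),
    "10.1/b": ("10.1/c", "10.1/d"),
}

# A graph maps each DOI to its references; None marks an unfetched stub.
Graph = Mapping[str, Optional[tuple[str, ...]]]
Fetch = Callable[[str], Optional[tuple[str, ...]]]


def crawl(graph: Graph, seed: str, depth: int, fetch: Fetch) -> dict:
    """Return a NEW graph after `depth` fetch generations starting at `seed`."""
    new_graph = dict(graph)                 # copy: the input is never mutated
    new_graph.setdefault(seed, None)
    generation = [seed]
    for _ in range(depth):
        next_generation = []
        for doi in generation:
            if new_graph[doi] is None:      # only stubs cost a fetch
                references = fetch(doi)
                if references is None:      # the source has no record
                    continue
                new_graph[doi] = tuple(references)
            for cited in new_graph[doi]:
                new_graph.setdefault(cited, None)   # register a frontier stub
                next_generation.append(cited)
        generation = list(dict.fromkeys(next_generation))  # drop duplicates
    return new_graph


calls: list[str] = []


def fake_fetch(doi: str) -> Optional[tuple[str, ...]]:
    calls.append(doi)                       # record every simulated network call
    return TOY_REFERENCES.get(doi)


first = crawl({}, "10.1/seed", depth=1, fetch=fake_fetch)
second = crawl(first, "10.1/seed", depth=2, fetch=fake_fetch)
print(calls)  # ['10.1/seed', '10.1/a', '10.1/b']
```

The second call starts from the saved result of the first, so the seed is walked again but not fetched again; only the two stubs from generation one cost a call.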
The same command is the update command. Run it again next week, point it at the same JSON, raise the depth — only unfetched DOIs cost network calls:
spiderman 10.1088/0034-4885/71/3/036601 --depth 3 --out data/knowledge_graph.json

Turn the saved JSON into a self-contained interactive HTML page (no network, no server, no recrawl):
spiderman-render data/knowledge_graph.json
# rendered: data/knowledge_graph.html (697 of 5085 nodes, min degree 3)

Open the HTML in a browser: nodes are colored so that no two adjacent nodes share a color (greedy proper coloring, classic four-color palette first), edges are arrows from citing to cited, and double-clicking any node or edge opens its metadata panel (title, authors, journal, year, abstract, citation counts). By default only the connected core is drawn, meaning nodes with ≥ 3 edges, and the page header states exactly what fraction you are seeing. Dial it:
spiderman-render data/knowledge_graph.json --min-degree 2 # wider core
spiderman-render data/knowledge_graph.json --min-degree 0 # everything (slow!)
spiderman-render data/knowledge_graph.json --html /tmp/web.html

You can also crawl and render in one shot: spiderman <doi> --depth 2 --html data/knowledge_graph.html.
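The two render steps described above (keep the min-degree core, then color it so neighbors differ) can be sketched on a plain edge list. This is an illustration under stated assumptions, not the package's prune_low_degree_nodes or color_nodes: degree is taken as in-degree plus out-degree measured once on the full graph, nodes are colored in sorted order, and colors are integer indexes rather than a palette.

```python
from collections import Counter, defaultdict

Edge = tuple[str, str]  # (citing DOI, cited DOI)


def prune_low_degree(edges: list[Edge], minimum_degree: int) -> list[Edge]:
    """Keep edges whose two endpoints both have >= minimum_degree incident edges.

    Assumption: degree is measured once on the full graph (a single pass),
    so a surviving node can end up with fewer edges after pruning.
    """
    degree = Counter()
    for citing, cited in edges:
        degree[citing] += 1
        degree[cited] += 1
    keep = {node for node, count in degree.items() if count >= minimum_degree}
    return [(a, b) for a, b in edges if a in keep and b in keep]


def greedy_coloring(edges: list[Edge]) -> dict[str, int]:
    """Give each node the smallest color index unused by its colored neighbors."""
    adjacency = defaultdict(set)
    for citing, cited in edges:      # direction is ignored for coloring
        adjacency[citing].add(cited)
        adjacency[cited].add(citing)
    colors: dict[str, int] = {}
    for node in sorted(adjacency):
        taken = {colors[n] for n in adjacency[node] if n in colors}
        colors[node] = next(c for c in range(len(adjacency)) if c not in taken)
    return colors


edges = [("p", "q"), ("p", "r"), ("q", "r"), ("r", "s")]
core = prune_low_degree(edges, minimum_degree=2)   # drops s, which has one edge
coloring = greedy_coloring(core)
print(core)      # [('p', 'q'), ('p', 'r'), ('q', 'r')]
print(coloring)  # {'p': 0, 'q': 1, 'r': 2}
```

Greedy coloring always yields a proper coloring but not necessarily a minimal one, which would be consistent with a four-color palette being tried "first" and extended when needed.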
| Command | Flag | Default | Meaning |
|---|---|---|---|
| spiderman | doi (positional) | - | seed DOI (10.x/..., URL and doi: forms accepted) |
| | --depth | 1 | fetch generations |
| | --out | data/knowledge_graph.json | graph file; loaded and resumed if it exists |
| | --html | off | also render after crawling |
| | --min-degree | 3 | render threshold (with --html) |
| | --mailto | spiderman@example.org | Crossref polite-pool contact |
| spiderman-render | graph_json (positional) | - | saved graph JSON |
| | --html | <graph_json>.html | output path |
| | --min-degree | 3 | only draw nodes with at least this many edges |
python -m spiderman ... is identical to spiderman ....
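The positional doi accepts bare, URL, and doi: forms. As a rough illustration of how such inputs can be reduced to one canonical node identity, here is a hypothetical normalizer. It is not the package's normalize_doi: the function name, the list of stripped prefixes, the lowercasing (DOIs are case-insensitive), and the loose shape check are all assumptions.

```python
import re

# Prefixes stripped before validation; this list is an assumption.
_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:)\s*", re.IGNORECASE)
# Loose shape check: "10.", a registrant code, "/", and a non-empty suffix.
_SHAPE = re.compile(r"^10\.[^\s/]+/\S+$")


def normalize_doi_sketch(raw: str) -> str:
    """Reduce a DOI given as bare text, URL, or doi: form to one lowercase string."""
    candidate = _PREFIX.sub("", raw.strip()).lower()
    if not _SHAPE.match(candidate):
        raise ValueError(f"not a DOI: {raw!r}")
    return candidate


for form in (
    "10.1088/0034-4885/71/3/036601",
    "https://doi.org/10.1088/0034-4885/71/3/036601",
    "doi:10.1088/0034-4885/71/3/036601",
):
    print(normalize_doi_sketch(form))  # 10.1088/0034-4885/71/3/036601 each time
```

Normalizing before insertion matters because the DOI string is the node key: two spellings of the same DOI would otherwise become two nodes.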
Everything is a pure function exported from the top-level package: functions
take a graph (plus arguments) and return values or new graphs — your
input graph is never mutated. The only effectful functions are
fetch_paper_metadata, save_graph, load_graph, and save_graph_html.
from pathlib import Path
from spiderman import (
crawl_doi, empty_graph, fetch_paper_metadata,
load_graph, save_graph, save_graph_html,
)
seed = "10.1088/0034-4885/71/3/036601"
graph = empty_graph(seed)
graph = crawl_doi(graph, seed, depth=1, fetch_metadata=fetch_paper_metadata)
save_graph(graph, Path("data/knowledge_graph.json"))
# Later: load, deepen, save again. Fetched papers are never re-fetched.
graph = load_graph(Path("data/knowledge_graph.json"))
graph = crawl_doi(graph, seed, depth=2, fetch_metadata=fetch_paper_metadata)
save_graph(graph, Path("data/knowledge_graph.json"))

fetch_metadata is an injected function of type (str) -> PaperMetadata | None;
pass your own (e.g. a closure over a dict) to crawl without a network.
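A closure over a dict might look like the sketch below. StubMetadata is a hypothetical stand-in, because the fields and constructor of the real PaperMetadata are not shown here; make_offline_fetch is likewise an illustrative name, not part of the package.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional


@dataclass(frozen=True)
class StubMetadata:
    """Hypothetical stand-in for spiderman's PaperMetadata (fields assumed)."""
    doi: str
    title: str
    references: tuple[str, ...] = ()


def make_offline_fetch(
    library: Mapping[str, StubMetadata],
) -> Callable[[str], Optional[StubMetadata]]:
    """Return a fetch function that answers only from `library`."""
    snapshot = dict(library)  # copy, so later edits to `library` do not leak in

    def fetch(doi: str) -> Optional[StubMetadata]:
        return snapshot.get(doi)  # None models "the source has no record"

    return fetch


offline_fetch = make_offline_fetch({
    "10.1/a": StubMetadata("10.1/a", "Paper A", references=("10.1/b",)),
})
print(offline_fetch("10.1/a"))
print(offline_fetch("10.1/unknown"))  # None
```

With the real package you would build PaperMetadata values instead and pass the resulting function as fetch_metadata= to crawl_doi.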
from spiderman import (
has_paper, is_fetched, paper_metadata_of, papers_by_author,
references_of, cited_by, unfetched_dois,
)
paper = paper_metadata_of(graph, seed) # typed frozen dataclass
paper.title, paper.authors, paper.year # 'Physics of liquid jets', (...), 2008
references_of(graph, seed) # DOIs this paper cites
cited_by(graph, "10.1103/revmodphys.69.865") # DOIs citing it, within the graph
papers_by_author(graph, "eggers") # case-insensitive substring match
unfetched_dois(graph) # the crawl frontier

from spiderman import add_paper, add_citation, prune_low_degree_nodes
grown = add_citation(graph, "10.1/a", "10.1/b") # returns a NEW graph
core = prune_low_degree_nodes(graph, minimum_degree=3)

from pathlib import Path
from spiderman import render_graph_html, save_graph_html, color_nodes
html_text = render_graph_html(graph, minimum_degree=3) # pure: returns the document
save_graph_html(graph, Path("data/web.html"), minimum_degree=3)
coloring = color_nodes(graph) # proper coloring: adjacent nodes always differ

data/knowledge_graph.json is standard node-link JSON (human-readable,
jq-able, loss-free on round trip):
{
"directed": true,
"multigraph": false,
"graph": {"seed_doi": "10.1088/0034-4885/71/3/036601"},
"nodes": [
{"id": "10.1088/0034-4885/71/3/036601", "title": "Physics of liquid jets",
"authors": ["Jens Eggers", "Emmanuel Villermaux"],
"journal": "Reports on Progress in Physics", "year": 2008,
"abstract": "", "fetched": true},
{"id": "10.1098/rstl.1805.0005", "fetched": false}
],
"links": [
{"source": "10.1088/0034-4885/71/3/036601", "target": "10.1098/rstl.1805.0005"}
]
}

"fetched": false marks a frontier stub: a paper we have seen cited but not
yet expanded. The rendered HTML is derived from this file and reproducible
at any time; the JSON is canonical.
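Because the file is plain node-link JSON, it can be inspected without spiderman or networkx. The sketch below uses only the standard library on an inline, shortened copy of the example document; the key names (nodes, links, id, fetched, source, target) are taken from that example, and the variable names are illustrative.

```python
import json

# Inline, shortened copy of the example; for a real file you would parse
# the text of data/knowledge_graph.json instead.
document = json.loads("""
{
  "directed": true,
  "multigraph": false,
  "graph": {"seed_doi": "10.1088/0034-4885/71/3/036601"},
  "nodes": [
    {"id": "10.1088/0034-4885/71/3/036601", "title": "Physics of liquid jets",
     "year": 2008, "fetched": true},
    {"id": "10.1098/rstl.1805.0005", "fetched": false}
  ],
  "links": [
    {"source": "10.1088/0034-4885/71/3/036601", "target": "10.1098/rstl.1805.0005"}
  ]
}
""")

# Frontier stubs: nodes seen in a reference list but not yet expanded.
frontier = [node["id"] for node in document["nodes"] if not node.get("fetched", False)]

# Incoming citations per DOI, counted within the saved graph only.
cited_by_count: dict[str, int] = {}
for link in document["links"]:
    cited_by_count[link["target"]] = cited_by_count.get(link["target"], 0) + 1

print(frontier)        # ['10.1098/rstl.1805.0005']
print(cited_by_count)  # {'10.1098/rstl.1805.0005': 1}
```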
Every significant decision is an ADR in docs/adr/ —
metadata source, graph representation, purity discipline, crawl semantics,
persistence format, rendering. Read those for the why. The shape of the
code:
spiderman/
doi/ normalize_doi, validate_doi (canonical node identity)
metadata/ PaperMetadata, parse_crossref_work,
fetch_paper_metadata (the only network effect)
graph/ empty_graph, add_paper, add_citation,
queries, prune, JSON <-> graph (pure graph API)
crawl/ crawl_doi (stateless recursion)
render/ color_nodes, cytoscape_elements,
render_graph_html (pure HTML generation)
persistence/ save_graph, load_graph,
save_graph_html (the only disk effects)
cli/ crawl_command, render_command (console entry points)
One function per file; frozen dataclasses; effects only at named boundaries.
Agent roles and coding standards live in Claude.md.
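The purity discipline (frozen dataclasses, functions that return new values instead of mutating their inputs) can be illustrated in a few lines. TinyGraph and with_citation are hypothetical names for this sketch; spiderman itself operates on networkx graphs, so its mechanics differ.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TinyGraph:
    """Hypothetical immutable graph: DOIs plus (citing, cited) edges."""
    nodes: frozenset = frozenset()
    edges: frozenset = frozenset()


def with_citation(graph: TinyGraph, citing: str, cited: str) -> TinyGraph:
    """Return a new graph containing citing -> cited; `graph` is left untouched."""
    return replace(
        graph,
        nodes=graph.nodes | {citing, cited},
        edges=graph.edges | {(citing, cited)},
    )


before = TinyGraph()
after = with_citation(before, "10.1/a", "10.1/b")
print(before == TinyGraph())                # True: the input did not change
print(("10.1/a", "10.1/b") in after.edges)  # True
```

Because every step yields a fresh value, intermediate graphs can be kept, compared, or saved without defensive copying.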
.venv/bin/python -m pytest -q # 94 unit tests, no network
.venv/bin/python -m pytest -q -m integration # live Crossref test

Unit tests replace the network with a fake fetch closure; integration tests hit the real Crossref API and are deselected by default.