vyaas/spiderman

Spiderman

Recursively collect DOIs + metadata to create a web of knowledge seeded by your favorite papers.

Give it one DOI (say, a review you love) and it fetches that paper's metadata and reference list from Crossref, inserts every cited DOI as a node in a directed networkx graph (edge = "cites"), then recursively does the same for each reference. The graph is the artifact: a persistent, queryable, renderable JSON object. The code is purely functional: every graph operation takes a graph and returns a new one, so any saved graph is a clean value you can always load, query, and build upon.

Seed used throughout: 10.1088/0034-4885/71/3/036601 (Eggers & Villermaux, Physics of liquid jets, Rep. Prog. Phys. 2008).


Install

Requires Python ≥ 3.10. Runtime dependency: networkx only.

With uv (recommended)

git clone <this-repo> spiderman && cd spiderman
uv venv                      # creates .venv at the repo root
uv pip install -e ".[dev]"   # editable install + pytest

With stdlib venv + pip

python3 -m venv .venv
.venv/bin/pip install -e ".[dev]"

Either way you get two console commands, spiderman and spiderman-render, installed in .venv/bin. Activate the venv (source .venv/bin/activate) or call them by path, e.g. .venv/bin/spiderman ....


CLI — the commands you will actually use

1. Seed a new graph

Fetch one paper and register all of its references as frontier stubs (one network call):

spiderman 10.1088/0034-4885/71/3/036601
# nodes: 490  edges: 489  frontier (unfetched): 489
# saved: data/knowledge_graph.json

2. Build the web (deepen)

--depth N means N fetch generations: depth 1 fetches the seed, depth 2 also fetches every reference, and so on. Re-running on an existing --out file resumes — already-fetched papers are never re-fetched, the crawl just expands the frontier:

spiderman 10.1088/0034-4885/71/3/036601 --depth 2 --mailto you@example.org
# nodes: 5085  edges: 8453  frontier (unfetched): 4596

Always pass --mailto for real crawls — it puts you in Crossref's polite pool. Depth 2 on a 489-reference review is ~489 sequential HTTP calls (10–30 minutes).
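For orientation, the request behind each of those calls might look like the sketch below. It assumes the public Crossref REST endpoint https://api.crossref.org/works/<doi> with a mailto query parameter, and a response whose message.reference entries carry a DOI key when Crossref has matched the reference. The helper names crossref_work_url and reference_dois are illustrative and are not part of spiderman; the package's own fetch_paper_metadata may build its request differently.

```python
import json
from urllib.parse import quote, urlencode

CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_work_url(doi: str, mailto: str) -> str:
    """Build a works URL that identifies the caller via the mailto parameter."""
    query = urlencode({"mailto": mailto})
    return f"{CROSSREF_WORKS}{quote(doi, safe='/')}?{query}"


def reference_dois(payload: dict) -> list[str]:
    """Collect the DOIs of cited works, skipping references with no DOI."""
    references = payload.get("message", {}).get("reference", [])
    return [entry["DOI"].lower() for entry in references if "DOI" in entry]


# Trimmed, hand-written payload in the assumed Crossref shape (not real API output).
sample = json.loads("""{"message": {"reference": [
  {"key": "ref1", "DOI": "10.1098/rstl.1805.0005"},
  {"key": "ref2", "unstructured": "A reference without a matched DOI"}
]}}""")

url = crossref_work_url("10.1088/0034-4885/71/3/036601", "you@example.org")
dois = reference_dois(sample)
```

No request is sent here; the sketch only shows where the contact address goes and which part of the response feeds the frontier.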

3. Update / keep growing

The same command is the update command. Run it again next week, point it at the same JSON, raise the depth — only unfetched DOIs cost network calls:

spiderman 10.1088/0034-4885/71/3/036601 --depth 3 --out data/knowledge_graph.json
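The depth and resume semantics described in steps 2 and 3 can be sketched with the standard library alone. This is a simplified model, not the package's crawl_doi: the graph is assumed to be a plain mapping from DOI to its tuple of cited DOIs, with None marking an unfetched frontier stub, and the names crawl and fake_fetch are hypothetical.

```python
from typing import Callable, Mapping, Optional

# Hypothetical stand-in for the graph: DOI -> cited DOIs, or None for a frontier stub.
References = Optional[tuple[str, ...]]


def crawl(graph: Mapping[str, References], seed: str, depth: int,
          fetch: Callable[[str], References]) -> dict[str, References]:
    """Return a NEW graph expanded by `depth` fetch generations from `seed`."""
    new_graph = dict(graph)              # copy: the input mapping is never mutated
    generation = [seed]
    for _ in range(depth):
        next_generation: list[str] = []
        for doi in generation:
            references = new_graph.get(doi)
            if references is None:       # stub or unseen: this costs one fetch
                references = fetch(doi)
                if references is None:   # lookup failed: keep it as a stub
                    new_graph.setdefault(doi, None)
                    continue
                new_graph[doi] = references
            for cited in references:     # register cited papers as stubs
                new_graph.setdefault(cited, None)
            next_generation.extend(references)
        generation = list(dict.fromkeys(next_generation))  # dedupe, keep order
    return new_graph


# Offline demo: a four-paper citation table and a fetch that records its calls.
citations = {"a": ("b", "c"), "b": ("c", "d"), "c": (), "d": ("a",)}
calls: list[str] = []


def fake_fetch(doi: str) -> References:
    calls.append(doi)
    return citations.get(doi)


seeded = crawl({}, "a", depth=1, fetch=fake_fetch)        # fetches only "a"
deeper = crawl(seeded, "a", depth=2, fetch=fake_fetch)    # resumes: fetches "b", "c"
```

In this model the second call skips "a" because it already has a reference tuple, which is the behavior the resume paragraph describes: only stubs cost a fetch.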

4. Render the proof (visualize)

Turn the saved JSON into a self-contained interactive HTML page — no network, no server, no recrawl:

spiderman-render data/knowledge_graph.json
# rendered: data/knowledge_graph.html (697 of 5085 nodes, min degree 3)

Open the HTML in a browser: nodes are colored so that no two adjacent nodes share a color (greedy proper coloring, classic four-color palette first), edges are arrows from citing to cited, and double-clicking any node or edge opens its metadata panel (title, authors, journal, year, abstract, citation counts). By default only the connected core is drawn — nodes with ≥ 3 edges — and the page header states exactly what fraction you are seeing. Dial it:

spiderman-render data/knowledge_graph.json --min-degree 2   # wider core
spiderman-render data/knowledge_graph.json --min-degree 0   # everything (slow!)
spiderman-render data/knowledge_graph.json --html /tmp/web.html

You can also crawl and render in one shot: spiderman <doi> --depth 2 --html data/knowledge_graph.html.
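As an illustration of the greedy proper coloring mentioned above, here is a minimal standard-library sketch. It assumes a largest-degree-first visiting order and integer color indices (index 0 to 3 would map onto the four-color palette, higher indices onto extra colors); the ordering that color_nodes actually uses is not stated here, so treat greedy_coloring as a hypothetical stand-in.

```python
def greedy_coloring(edges: list[tuple[str, str]]) -> dict[str, int]:
    """Assign each node the smallest color index unused by its neighbors."""
    neighbors: dict[str, set[str]] = {}
    for source, target in edges:
        # Edge direction is irrelevant for coloring: adjacency is symmetric.
        neighbors.setdefault(source, set()).add(target)
        neighbors.setdefault(target, set()).add(source)

    coloring: dict[str, int] = {}
    # Assumed order: highest degree first, ties broken alphabetically.
    for node in sorted(neighbors, key=lambda n: (-len(neighbors[n]), n)):
        used = {coloring[m] for m in neighbors[node] if m in coloring}
        color = 0
        while color in used:
            color += 1
        coloring[node] = color
    return coloring


# A triangle (a, b, c) with one extra paper d cited by c.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
coloring = greedy_coloring(edges)
```

A triangle forces three colors, and the pendant node reuses one of them, so no edge joins two nodes of the same color.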

Full flag reference

Command           Flag          Default                    Meaning
spiderman         doi           (positional)               seed DOI (10.x/..., URL and doi: forms accepted)
                  --depth       1                          fetch generations
                  --out         data/knowledge_graph.json  graph file; loaded and resumed if it exists
                  --html        off                        also render after crawling
                  --min-degree  3                          render threshold (with --html)
                  --mailto      spiderman@example.org      Crossref polite-pool contact
spiderman-render  graph_json    (positional)               saved graph JSON
                  --html        <graph_json>.html          output path
                  --min-degree  3                          only draw nodes with at least this many edges

python -m spiderman ... is identical to spiderman ....
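The seed argument accepts bare, URL, and doi: forms, which implies some normalization step before a DOI becomes a node identity. A rough sketch of what that could look like follows. The prefix list, the lowercasing, and the loose validation pattern are all assumptions, and the functions normalize and looks_like_doi are illustrative rather than the package's normalize_doi and validate_doi.

```python
import re

# Assumed prefixes to strip; the real package may accept more or fewer forms.
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)

# Loose shape check: "10.", a numeric registrant code, "/", then a suffix.
_DOI_PATTERN = re.compile(r"10\.\d+(\.\d+)*/\S+")


def normalize(raw: str) -> str:
    """Strip a known prefix and lowercase (DOIs are case-insensitive)."""
    text = raw.strip()
    lowered = text.lower()
    for prefix in _PREFIXES:
        if lowered.startswith(prefix):
            text = text[len(prefix):]
            break
    return text.lower()


def looks_like_doi(text: str) -> bool:
    """Return True when the whole string matches the loose DOI shape."""
    return _DOI_PATTERN.fullmatch(text) is not None


from_url = normalize("https://doi.org/10.1088/0034-4885/71/3/036601")
from_prefix = normalize("doi:10.1103/RevModPhys.69.865")
```

Mapping every accepted form to one canonical string is what lets the same paper, reached through different citations, land on a single node.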


Python API

Everything is exported from the top-level package, and almost all of it is pure: functions take a graph (plus arguments) and return values or new graphs, and your input graph is never mutated. The only effectful functions are fetch_paper_metadata (network) and save_graph, load_graph, and save_graph_html (disk).

Seed, build, and update

from pathlib import Path
from spiderman import (
    crawl_doi, empty_graph, fetch_paper_metadata,
    load_graph, save_graph, save_graph_html,
)

seed = "10.1088/0034-4885/71/3/036601"

graph = empty_graph(seed)
graph = crawl_doi(graph, seed, depth=1, fetch_metadata=fetch_paper_metadata)
save_graph(graph, Path("data/knowledge_graph.json"))

# Later: load, deepen, save again. Fetched papers are never re-fetched.
graph = load_graph(Path("data/knowledge_graph.json"))
graph = crawl_doi(graph, seed, depth=2, fetch_metadata=fetch_paper_metadata)
save_graph(graph, Path("data/knowledge_graph.json"))

fetch_metadata is an injected function of type (str) -> PaperMetadata | None — pass your own (e.g. a closure over a dict) to crawl without a network.
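A closure over a dict, as suggested above, might be built like this. The StubMetadata class is a stand-in defined only for this sketch: its doi, title, and references fields are assumptions, so for a real crawl construct the package's own PaperMetadata instead. make_offline_fetch is likewise a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional


@dataclass(frozen=True)
class StubMetadata:
    """Stand-in for spiderman's PaperMetadata; field names are assumed."""
    doi: str
    title: str
    references: tuple[str, ...] = ()


def make_offline_fetch(
    papers: Mapping[str, StubMetadata],
) -> Callable[[str], Optional[StubMetadata]]:
    """Return a fetch function that answers from `papers` and never hits the network."""
    def fetch(doi: str) -> Optional[StubMetadata]:
        return papers.get(doi)   # None signals "not found", like a failed lookup
    return fetch


library = {
    "10.1/a": StubMetadata("10.1/a", "Paper A", references=("10.1/b",)),
    "10.1/b": StubMetadata("10.1/b", "Paper B"),
}
offline_fetch = make_offline_fetch(library)
found = offline_fetch("10.1/a")
missing = offline_fetch("10.1/unknown")
```

The same shape works for tests and for replaying a cached crawl: anything callable as (str) -> metadata or None can be passed as fetch_metadata.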

Query

from spiderman import (
    has_paper, is_fetched, paper_metadata_of, papers_by_author,
    references_of, cited_by, unfetched_dois,
)

paper = paper_metadata_of(graph, seed)     # typed frozen dataclass
paper.title, paper.authors, paper.year     # 'Physics of liquid jets', (...), 2008

references_of(graph, seed)                 # DOIs this paper cites
cited_by(graph, "10.1103/revmodphys.69.865")  # DOIs citing it, within the graph
papers_by_author(graph, "eggers")          # case-insensitive substring match
unfetched_dois(graph)                      # the crawl frontier

Modify

from spiderman import add_paper, add_citation, prune_low_degree_nodes

grown = add_citation(graph, "10.1/a", "10.1/b")   # returns a NEW graph
core = prune_low_degree_nodes(graph, minimum_degree=3)
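To make the degree threshold concrete, here is a small sketch of pruning on a bare edge list. It assumes degree means in-edges plus out-edges and that pruning is a single pass over the original degrees; whether prune_low_degree_nodes repeats the pass until the graph is stable is not stated here. The function prune_low_degree is a hypothetical stand-in.

```python
from collections import Counter


def prune_low_degree(edges: list[tuple[str, str]],
                     minimum_degree: int) -> list[tuple[str, str]]:
    """Return a NEW edge list keeping only edges between sufficiently connected nodes."""
    degree: Counter[str] = Counter()
    for source, target in edges:
        degree[source] += 1      # out-edge counts toward the citing paper
        degree[target] += 1      # in-edge counts toward the cited paper
    keep = {node for node, count in degree.items() if count >= minimum_degree}
    return [(s, t) for s, t in edges if s in keep and t in keep]


edges = [("seed", "a"), ("seed", "b"), ("seed", "c"), ("a", "b")]
wider_core = prune_low_degree(edges, minimum_degree=2)   # drops "c" (degree 1)
tight_core = prune_low_degree(edges, minimum_degree=3)   # only "seed" qualifies
```

Because a new list is returned, the original edges stay available for a different threshold, mirroring the non-mutating style of the API above.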

Render

from pathlib import Path
from spiderman import render_graph_html, save_graph_html, color_nodes

html_text = render_graph_html(graph, minimum_degree=3)   # pure: returns the document
save_graph_html(graph, Path("data/web.html"), minimum_degree=3)

coloring = color_nodes(graph)   # proper coloring: adjacent nodes always differ

The artifact

data/knowledge_graph.json is standard node-link JSON — human-readable, jq-able, loss-free on round trip:

{
  "directed": true,
  "multigraph": false,
  "graph": {"seed_doi": "10.1088/0034-4885/71/3/036601"},
  "nodes": [
    {"id": "10.1088/0034-4885/71/3/036601", "title": "Physics of liquid jets",
     "authors": ["Jens Eggers", "Emmanuel Villermaux"],
     "journal": "Reports on Progress in Physics", "year": 2008,
     "abstract": "", "fetched": true},
    {"id": "10.1098/rstl.1805.0005", "fetched": false}
  ],
  "links": [
    {"source": "10.1088/0034-4885/71/3/036601", "target": "10.1098/rstl.1805.0005"}
  ]
}
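Because the file is plain node-link JSON, it can be inspected without spiderman or networkx installed. The sketch below relies only on the keys visible in the example above (nodes, links, id, fetched, source, target); the function name summarize is illustrative.

```python
import json


def summarize(document: dict) -> tuple[list[str], dict[str, list[str]]]:
    """Return (unfetched DOIs, mapping of citing DOI -> cited DOIs)."""
    frontier = [node["id"] for node in document["nodes"]
                if not node.get("fetched", False)]
    cites: dict[str, list[str]] = {}
    for link in document["links"]:
        cites.setdefault(link["source"], []).append(link["target"])
    return frontier, cites


# A cut-down copy of the example document, inlined so the sketch runs as-is.
text = """{
  "directed": true,
  "multigraph": false,
  "graph": {"seed_doi": "10.1088/0034-4885/71/3/036601"},
  "nodes": [
    {"id": "10.1088/0034-4885/71/3/036601", "title": "Physics of liquid jets",
     "year": 2008, "fetched": true},
    {"id": "10.1098/rstl.1805.0005", "fetched": false}
  ],
  "links": [
    {"source": "10.1088/0034-4885/71/3/036601", "target": "10.1098/rstl.1805.0005"}
  ]
}"""

frontier, cites = summarize(json.loads(text))
```

For a saved crawl, replace the inline text with the contents of data/knowledge_graph.json.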

"fetched": false marks a frontier stub: a paper we have seen cited but not yet expanded. The rendered HTML is derived from this file and reproducible at any time; the JSON is canonical.


Architecture

Every significant decision is an ADR in docs/adr/ — metadata source, graph representation, purity discipline, crawl semantics, persistence format, rendering. Read those for the why. The shape of the code:

spiderman/
  doi/           normalize_doi, validate_doi          (canonical node identity)
  metadata/      PaperMetadata, parse_crossref_work,
                 fetch_paper_metadata                 (the only network effect)
  graph/         empty_graph, add_paper, add_citation,
                 queries, prune, JSON <-> graph       (pure graph API)
  crawl/         crawl_doi                            (stateless recursion)
  render/        color_nodes, cytoscape_elements,
                 render_graph_html                    (pure HTML generation)
  persistence/   save_graph, load_graph,
                 save_graph_html                      (the only disk effects)
  cli/           crawl_command, render_command        (console entry points)

One function per file; frozen dataclasses; effects only at named boundaries. Agent roles and coding standards live in Claude.md.

Testing

.venv/bin/python -m pytest -q                  # 94 unit tests, no network
.venv/bin/python -m pytest -q -m integration   # live Crossref test

Unit tests replace the network with a fake fetch closure; integration tests hit the real Crossref API and are deselected by default.

About

A DOI web generated via a single recursive crawl+extract
