Skip to content

refactor(papers): use opencite for paper sync + declare dependency#309

Merged
neuromechanist merged 3 commits into
developfrom
feature/issue-307-opencite-paper-sync
Jun 5, 2026
Merged

refactor(papers): use opencite for paper sync + declare dependency#309
neuromechanist merged 3 commits into
developfrom
feature/issue-307-opencite-paper-sync

Conversation

@neuromechanist

Copy link
Copy Markdown
Member

Summary

Replaces the homegrown paper fetch layer in papers_sync.py with neuromechanist/opencite (the maintained successor this code inspired), and declares it as a dependency so GitHub's dependency graph attributes OSA under opencite's dependents.

Closes #307. Scope: sync/fetch layer only (per design decision) — the local SQLite + FTS store, upsert_paper, and the search_*_papers tool are unchanged.

What changed

  • sync_openalex_papers / sync_semanticscholar_papers / sync_pubmed_papers now restrict an opencite search to that one source; sync_all_papers runs one deduplicated opencite search across (openalex, s2, pubmed) per query (replacing three sequential per-source fetches).
  • sync_citing_papers uses opencite's CitationExplorer.citing_papers (no manual OpenAlex-ID lookup step).
  • Removed the hand-rolled inverted-index reconstruction and per-source XML/JSON parsing (opencite handles it). Net −101 lines.
  • Public signatures unchangedsrc/cli/sync.py and src/api/scheduler.py call these functions exactly as before.
  • opencite>=0.5.2 added to the server optional-dependencies (+ uv.lock).

Design notes

  • Stable identity / dedup: each opencite Paper maps to a (source, external_id) with priority OpenAlex > Semantic Scholar > PubMed > DOI > arXiv, keeping source labels aligned with rows already in the DB.
  • Default sources kept conservative (openalex, s2, pubmed) so batch coverage matches prior behavior; opencite's broader sources (arxiv/biorxiv/osf/...) are reserved for the opt-in live-search feature (feat(papers): optional live 'search most recent papers' via opencite #308).
  • No Config.from_env(): Config is built directly from OSA-supplied keys, so sync never depends on ambient .env files (opencite's dotenv loader raised OSError on OSA's .env). Keys flow from OSA settings exactly as before; configure_openalex() is retained for the CLI's configure-once pattern.

Tests

  • Real opencite Paper objects (no mocks) cover the source/ID/URL mapping and _store_papers against a real SQLite DB (labels, dedup, skip-no-id, forced-source).
  • Network smoke tests confirm opencite fetches from OpenAlex end-to-end.
  • 245 knowledge + 20 scheduler + CLI sync tests pass; ruff clean.

Follow-ups (separate issues)

Closes #307

Replace the hand-rolled OpenAlex/Semantic Scholar/PubMed fetchers (and
inverted-index reconstruction) with the opencite multi-source client,
which aggregates and deduplicates across sources. Public sync function
signatures are unchanged, so the CLI and scheduler call them as before;
only the fetch layer is swapped. opencite Paper objects map to stable
(source, external_id) pairs compatible with existing rows.

Declares opencite>=0.5.2 as a server dependency, which also attributes
OSA to neuromechanist/opencite in GitHub's dependency graph.

Config is constructed directly (not Config.from_env) so sync never
depends on ambient .env files in the working directory.

Closes #307
…idge

Address PR review:
- sync_all_papers/sync_citing_papers now open a single SearchOrchestrator/
  CitationExplorer for the whole batch instead of one per query/DOI (each
  open spins up 11 HTTP clients). Per-item errors are isolated so one bad
  query/DOI no longer aborts the batch.
- Add _run() helper that uses asyncio.run normally but offloads to a worker
  thread if a loop is already running, so the public sync functions are safe
  to call from any context (CLI, scheduler thread, or future async caller).
- Test _run in both the no-loop and running-loop paths.
@neuromechanist

Copy link
Copy Markdown
Member Author

Review (Sonnet, pr-review-toolkit) + resolutions

No critical issues. Mapping logic, DB-constraint compliance (url NOT NULL, created_at types), Config construction with empty-string keys, error recovery, public-API stability, and import resolution all confirmed solid.

Both 'important' findings addressed (no tech debt carried forward):

  1. One orchestrator per query was wasteful (each open spins up 11 HTTP clients). sync_all_papers/sync_citing_papers now open a single SearchOrchestrator/CitationExplorer for the whole batch and loop queries/DOIs inside it, with per-item error isolation so one bad query/DOI doesn't abort the batch.
  2. asyncio.run would raise inside a running loop (latent fragility). Added a _run() helper that uses asyncio.run normally but offloads to a worker thread when a loop is already running, so the public sync functions are safe from any context. Tested both paths (no-loop and running-loop).

50 scheduler + CLI sync, 23 offline mapping/run, and 2 network smoke tests pass; ruff clean.

@neuromechanist neuromechanist enabled auto-merge (squash) June 5, 2026 07:46
@neuromechanist neuromechanist merged commit b1814f9 into develop Jun 5, 2026
6 checks passed
@neuromechanist neuromechanist deleted the feature/issue-307-opencite-paper-sync branch June 5, 2026 07:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant