refactor(papers): use opencite for paper sync + declare dependency #309
Merged
Replace the hand-rolled OpenAlex/Semantic Scholar/PubMed fetchers (and inverted-index reconstruction) with the opencite multi-source client, which aggregates and deduplicates across sources. Public sync function signatures are unchanged, so the CLI and scheduler call them as before; only the fetch layer is swapped. opencite Paper objects map to stable (source, external_id) pairs compatible with existing rows. Declares opencite>=0.5.2 as a server dependency, which also attributes OSA to neuromechanist/opencite in GitHub's dependency graph. Config is constructed directly (not Config.from_env) so sync never depends on ambient .env files in the working directory. Closes #307
…idge
Address PR review:
- sync_all_papers/sync_citing_papers now open a single SearchOrchestrator/CitationExplorer for the whole batch instead of one per query/DOI (each open spins up 11 HTTP clients). Per-item errors are isolated so one bad query/DOI no longer aborts the batch.
- Add _run() helper that uses asyncio.run normally but offloads to a worker thread if a loop is already running, so the public sync functions are safe to call from any context (CLI, scheduler thread, or future async caller).
- Test _run in both the no-loop and running-loop paths.
Review (Sonnet, pr-review-toolkit) + resolutions
No critical issues. Mapping logic, DB-constraint compliance (url NOT NULL, created_at types), Config construction with empty-string keys, error recovery, public-API stability, and import resolution all confirmed solid. Both 'important' findings addressed (no tech debt carried forward).
…opencite-paper-sync
This was referenced Jun 5, 2026
Merged
Summary
Replaces the homegrown paper fetch layer in papers_sync.py with neuromechanist/opencite (the maintained successor this code inspired), and declares it as a dependency so GitHub's dependency graph attributes OSA under opencite's dependents.

Closes #307.

Scope: sync/fetch layer only (per design decision). The local SQLite + FTS store, upsert_paper, and the search_*_papers tool are unchanged.

What changed
- sync_openalex_papers / sync_semanticscholar_papers / sync_pubmed_papers now restrict an opencite search to that one source.
- sync_all_papers runs one deduplicated opencite search across (openalex, s2, pubmed) per query (replacing three sequential per-source fetches).
- sync_citing_papers uses opencite's CitationExplorer.citing_papers (no manual OpenAlex-ID lookup step).
- src/cli/sync.py and src/api/scheduler.py call these functions exactly as before.
- opencite>=0.5.2 added to the server optional-dependencies (+ uv.lock).

Design notes
- Each opencite Paper maps to a (source, external_id) with priority OpenAlex > Semantic Scholar > PubMed > DOI > arXiv, keeping source labels aligned with rows already in the DB.
- Batch sync is restricted to (openalex, s2, pubmed) so batch coverage matches prior behavior; opencite's broader sources (arxiv/biorxiv/osf/...) are reserved for the opt-in live-search feature (feat(papers): optional live 'search most recent papers' via opencite #308).
- No Config.from_env(): Config is built directly from OSA-supplied keys, so sync never depends on ambient .env files (opencite's dotenv loader raised OSError on OSA's .env). Keys flow from OSA settings exactly as before; configure_openalex() is retained for the CLI's configure-once pattern.

Tests
- Tests built on opencite Paper objects (no mocks) cover the source/ID/URL mapping and _store_papers against a real SQLite DB (labels, dedup, skip-no-id, forced-source).
- 245 knowledge + 20 scheduler + CLI sync tests pass; ruff clean.

Follow-ups (separate issues)
Closes #307
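The identifier priority from the design notes (OpenAlex > Semantic Scholar > PubMed > DOI > arXiv) can be sketched as a first-match lookup. This is an illustration under stated assumptions, not the mapping in papers_sync.py: `PaperIds`, its field names, and the source label strings are hypothetical stand-ins, since the attributes of opencite's Paper and the labels stored in the OSA database are not shown in this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PaperIds:
    """Hypothetical stand-in for the identifiers carried by a paper record."""
    openalex_id: Optional[str] = None
    s2_id: Optional[str] = None
    pmid: Optional[str] = None
    doi: Optional[str] = None
    arxiv_id: Optional[str] = None


# (source label, attribute) pairs in priority order; the labels are assumed.
_PRIORITY = (
    ("openalex", "openalex_id"),
    ("semanticscholar", "s2_id"),
    ("pubmed", "pmid"),
    ("doi", "doi"),
    ("arxiv", "arxiv_id"),
)


def source_and_id(ids: PaperIds) -> Optional[tuple[str, str]]:
    """Return the highest-priority (source, external_id), or None if no ID exists."""
    for source, attr in _PRIORITY:
        value = getattr(ids, attr)
        if value:
            return (source, value)
    # No usable identifier: the caller would skip this paper.
    return None
```

Because the first populated identifier wins, a paper that has both an OpenAlex ID and a DOI keeps the OpenAlex label, which is what keeps new rows aligned with rows stored before the refactor.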