fix(citations): true uncapped per-year counts from OpenAlex #335
Merged
Add a direct OpenAlex client (openalex_citations.py) that resolves a DOI to its work id, returns the complete per-year citation histogram via group_by (uncapped), and cursor-paginates the latest N citing papers sorted by publication date. Add a citation_counts(cites_doi, year, count) table and a replace_citation_counts helper that mirrors the histogram wholesale. opencite caps citing-paper fetches at one page (<=200) with no pagination and no aggregation, which silently truncated recent citations and inverted the per-year curve; this is the foundation for fixing that.
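As a rough illustration of the group_by step, the sketch below turns an OpenAlex-style grouped response into a year-to-count mapping. It is not the PR's code: the function name `parse_counts_by_year` is hypothetical, the sample payload is made up for illustration (only the 318 figure for 2025 echoes this PR), and the "unknown" key is an assumption about what a non-year bucket might look like.

```python
from typing import Any


def parse_counts_by_year(payload: dict[str, Any]) -> dict[int, int]:
    """Turn an OpenAlex-style group_by response into {year: count}."""
    counts: dict[int, int] = {}
    for bucket in payload.get("group_by") or []:
        key = str(bucket.get("key", ""))
        # Skip buckets whose key is not a four-digit year (assumed: "unknown").
        if not (key.isdigit() and len(key) == 4):
            continue
        counts[int(key)] = int(bucket.get("count", 0))
    return counts


# Illustrative payload in the shape OpenAlex uses for grouped results.
sample = {
    "meta": {"count": 3},
    "group_by": [
        {"key": "2024", "key_display_name": "2024", "count": 2},
        {"key": "2025", "key_display_name": "2025", "count": 318},
        {"key": "unknown", "key_display_name": "unknown", "count": 1},
    ],
}
print(parse_counts_by_year(sample))
```

Because the histogram comes back as one aggregated response, its size depends on the number of distinct years rather than the number of citing works, which is why no page cap applies to the counts.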
sync_citing_papers now queries OpenAlex directly per canonical DOI: it stores the exact, complete per-year counts in citation_counts (source of truth for the dashboard) and upserts the latest 2000 citing papers (publication date desc) into the papers table for the search tool. get_citation_stats reads the counts table (empty, not error, before the first sync). The CLI decouples the citation storage cap from the query --limit.
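The "latest 2000 citing papers" fetch can be pictured as a bounded cursor loop. This is a minimal sketch under stated assumptions, not the PR's implementation: `collect_recent`, `fetch_page`, and `max_pages` are hypothetical names, and `fetch_page` stands in for whatever HTTP call returns one page with `results` and `meta.next_cursor`.

```python
from typing import Any, Callable, Optional

Page = dict[str, Any]


def collect_recent(
    fetch_page: Callable[[str], Page],
    limit: int = 2000,
    max_pages: int = 50,  # assumed bound; the PR does not state its value
) -> list[dict[str, Any]]:
    """Collect up to `limit` titled works by following page cursors."""
    works: list[dict[str, Any]] = []
    cursor: Optional[str] = "*"  # OpenAlex cursor paging starts with "*"
    pages = 0
    while cursor and len(works) < limit and pages < max_pages:
        page = fetch_page(cursor)
        pages += 1
        results = page.get("results") or []
        if not results:
            break  # empty page: stop even if the cursor is still non-null
        for work in results:
            if not work.get("title"):
                continue  # titleless works are skipped
            works.append(work)
            if len(works) >= limit:
                break
        # Absent meta or next_cursor yields None, which ends the loop.
        cursor = (page.get("meta") or {}).get("next_cursor")
    return works


# Fake pages keyed by cursor; the last one simulates a stuck cursor.
fake_pages = {
    "*": {"meta": {"next_cursor": "c2"}, "results": [{"title": "A"}, {"title": None}]},
    "c2": {"meta": {"next_cursor": "c3"}, "results": [{"title": "B"}]},
    "c3": {"meta": {"next_cursor": "c3"}, "results": []},
}
print([w["title"] for w in collect_recent(fake_pages.__getitem__)])  # ['A', 'B']
```

Passing the page fetcher in as a callable keeps the loop testable without a network, similar in spirit to injecting a client with a mock transport.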
- OpenAlex client tests via httpx.MockTransport (resolve/404, group_by parsing, cursor pagination, limit, titleless skip, error propagation).
- get_citation_stats and the endpoint now assert against citation_counts; add replace-overwrites and missing-table-is-empty cases.
- End-to-end sync_citing_papers test (real client + real DB, mock transport): stores true counts and links recent papers; unresolved DOI skipped.
- Critical: never wipe stored counts on an empty histogram (likely a transient OpenAlex gap); skip the DOI with a warning instead.
- sync all no longer forwards --limit to citations (would re-cap the stored sample at 100); uses the 2000 default like sync papers.
- recent_citing_papers: bound page count and stop on an empty results page so a stuck/non-null cursor can't spin; build the stored URL from the normalized DOI for consistency.
- replace_citation_counts: explicit rollback so a DOI is never half-replaced.
- Per-DOI failure log includes the exception type.
Tests: empty-counts-does-not-wipe, empty-results stop, absent-meta stop, normalized-URL.
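The delete+insert mirror with rollback, together with the empty-histogram guard, might look roughly like the following. This is a sketch, not the PR's code: `mirror_counts` is a hypothetical name, the column types and primary key are assumptions (the PR only names the columns), and the guard is folded into the helper for brevity even though the PR describes the skip as part of the sync.

```python
import sqlite3


def mirror_counts(conn: sqlite3.Connection, cites_doi: str, counts: dict) -> bool:
    """Replace all rows for one DOI with `counts`; return False if skipped."""
    if not counts:
        # Empty histogram: keep what is stored rather than wiping it.
        return False
    try:
        # DELETE opens an implicit transaction; the INSERTs join it.
        conn.execute("DELETE FROM citation_counts WHERE cites_doi = ?", (cites_doi,))
        conn.executemany(
            "INSERT INTO citation_counts (cites_doi, year, count) VALUES (?, ?, ?)",
            [(cites_doi, year, n) for year, n in sorted(counts.items())],
        )
        conn.commit()
    except sqlite3.Error:
        conn.rollback()  # undo the DELETE so the DOI is never half-replaced
        raise
    return True


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE citation_counts ("
    " cites_doi TEXT NOT NULL, year INTEGER NOT NULL, count INTEGER NOT NULL,"
    " PRIMARY KEY (cites_doi, year))"  # assumed schema
)
doi = "10.1234/example"  # placeholder DOI
mirror_counts(conn, doi, {2024: 2, 2025: 318})
mirror_counts(conn, doi, {})  # skipped: stored rows are kept
print(conn.execute("SELECT year, count FROM citation_counts ORDER BY year").fetchall())
```

Running the delete and the inserts in one transaction means a failure midway leaves the previous histogram intact instead of an empty or partial one.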
Problem
The citations chart was wrong, not just incomplete. The old pipeline fetched citing papers through opencite, which returns a single page (<=200), oldest-first, with no pagination and no aggregation. For a heavily-cited paper that silently dropped all recent citations and produced an inverted curve. Example (BIDS 2016 paper, what we had stored):
2019:18 2020:15 2021:21 2022:11 2023:5 2024:2 2025:0
That curve is declining and hits zero in 2025, whereas OpenAlex actually reports 1,899 citing works, rising to 318 in 2025.
Fix
Query OpenAlex directly and separate counts from storage:
- group_by=publication_year gives the exact, complete per-year histogram (no cap). Stored in a new citation_counts(cites_doi, year, count) table, which is now the source of truth for GET /{community_id}/citations.
- The latest 2,000 citing papers (sort=publication_date:desc) are upserted into papers for the search corpus.
Components
- src/knowledge/openalex_citations.py: OpenAlexCitationClient (DOI→work id, counts_by_year, recent_citing_papers), injectable httpx.Client for testing.
- src/knowledge/db.py: citation_counts table + replace_citation_counts (delete+insert, exact mirror).
- src/knowledge/search.py: get_citation_stats reads citation_counts; empty (not error) before the first sync.
- src/knowledge/papers_sync.py: sync_citing_papers rewritten (counts + latest 2,000); removed the dead opencite citation path. Default storage cap 2,000.
- src/cli/sync.py: citation storage cap decoupled from the query --limit.
Test plan
- tests/test_knowledge/test_openalex_citations.py (11): resolve/404, group_by parsing + non-year buckets, cursor pagination, limit, titleless skip, HTTP error propagation, all via httpx.MockTransport (HTTP fixture, not a logic mock).
- test_citation_stats.py / test_citations_feed.py: now assert against citation_counts; added replace-overwrites and missing-table-is-empty cases.
- test_papers_sync.py: end-to-end sync_citing_papers with a mock transport (real client + real SQLite); stores true counts and links recent papers; unresolved DOI skipped.
Deploy follow-up
After merge + dev redeploy, re-run
sync papers --community {bids,eeglab} --citations
to populate citation_counts. The chart will then match OpenAlex (e.g. BIDS 2025 ≈ 318).
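One way to sanity-check the result after the re-sync is to read the histogram back from citation_counts and compare it with OpenAlex. The sketch below is an assumption-laden illustration rather than the project's get_citation_stats: `read_counts` is a hypothetical name, the schema is assumed, and the in-memory database and placeholder DOI stand in for the real ones. It also shows the "empty, not error" behaviour when the table does not exist yet.

```python
import sqlite3


def read_counts(conn: sqlite3.Connection, cites_doi: str) -> dict:
    """Return {year: count} for one DOI, or {} if the table is missing."""
    try:
        rows = conn.execute(
            "SELECT year, count FROM citation_counts"
            " WHERE cites_doi = ? ORDER BY year",
            (cites_doi,),
        ).fetchall()
    except sqlite3.OperationalError as exc:
        # Before the first sync the table may not exist; treat that as empty.
        if "no such table" in str(exc):
            return {}
        raise
    return dict(rows)


check = sqlite3.connect(":memory:")  # stand-in for the real database
before = read_counts(check, "10.1234/example")  # table missing, so empty

check.execute(
    "CREATE TABLE citation_counts (cites_doi TEXT, year INTEGER, count INTEGER)"
)
check.execute(
    "INSERT INTO citation_counts VALUES (?, ?, ?)", ("10.1234/example", 2025, 318)
)
after = read_counts(check, "10.1234/example")
print(before, after)
```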