fix(citations): true uncapped per-year counts from OpenAlex #335

Merged
neuromechanist merged 4 commits into develop from feature/accurate-citation-counts
Jun 10, 2026

Problem

The citations chart was wrong, not just incomplete. The old pipeline fetched citing papers through opencite, which returns a single page (<=200), oldest-first, with no pagination and no aggregation. For a heavily cited paper, that silently dropped all recent citations and produced an inverted curve. Example (BIDS 2016 paper, what we had stored): 2019:18 2020:15 2021:21 2022:11 2023:5 2024:2 2025:0. The stored counts decline to zero in 2025, whereas OpenAlex actually reports 1,899 citing works in total, with the per-year count rising to 318 in 2025.

Fix

Query OpenAlex directly and separate counts from storage:

  • Counts (true to the data): per canonical DOI, group_by=publication_year gives the exact, complete per-year histogram (no cap). Stored in a new citation_counts(cites_doi, year, count) table, which is now the source of truth for GET /{community_id}/citations.
  • Stored papers (search tool): the latest 2,000 citing papers per DOI (cursor pagination, sort=publication_date:desc) are upserted into papers for the search corpus.
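As a rough illustration of the counts path, the sketch below builds the query parameters and parses a `group_by=publication_year` response into a year-to-count mapping. The function names `build_counts_params` and `parse_counts_by_year`, the work id, and the sample payload are assumptions made for illustration, not code from this PR; the `group_by`, `key`, and `count` fields follow the public OpenAlex response shape.

```python
from __future__ import annotations

from typing import Any


def build_counts_params(work_id: str) -> dict[str, str]:
    """Query parameters asking OpenAlex for a per-year histogram of citing works."""
    return {"filter": f"cites:{work_id}", "group_by": "publication_year"}


def parse_counts_by_year(payload: dict[str, Any]) -> dict[int, int]:
    """Turn a group_by response into {year: count}.

    Buckets whose key is not a plain year (for example "unknown") are skipped.
    """
    counts: dict[int, int] = {}
    for bucket in payload.get("group_by", []):
        key = str(bucket.get("key", ""))
        if not key.isdigit():
            continue  # non-year bucket
        counts[int(key)] = int(bucket.get("count", 0))
    return counts


# Illustrative payload only; the numbers are made up, not real OpenAlex data.
sample = {
    "group_by": [
        {"key": "2024", "key_display_name": "2024", "count": 3},
        {"key": "2025", "key_display_name": "2025", "count": 5},
        {"key": "unknown", "key_display_name": "unknown", "count": 2},
    ]
}
counts = parse_counts_by_year(sample)
params = build_counts_params("W0000000000")  # hypothetical work id
```

Because the histogram is aggregated server-side, the result does not depend on how many citing papers are later stored for search.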

Components

  • src/knowledge/openalex_citations.py: OpenAlexCitationClient (DOI→work id, counts_by_year, recent_citing_papers), injectable httpx.Client for testing.
  • src/knowledge/db.py: citation_counts table + replace_citation_counts (delete+insert, exact mirror).
  • src/knowledge/search.py: get_citation_stats reads citation_counts; empty (not error) before the first sync.
  • src/knowledge/papers_sync.py: sync_citing_papers rewritten (counts + latest 2,000); removed the dead opencite citation path. Default storage cap 2,000.
  • src/cli/sync.py: citation storage cap decoupled from the query --limit.
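The delete+insert mirror in db.py could look roughly like the following minimal sketch. Only the table name and its three columns come from this PR; the column types, the primary key, the `ensure_citation_counts` helper, and the function signature are assumptions.

```python
from __future__ import annotations

import sqlite3


def ensure_citation_counts(conn: sqlite3.Connection) -> None:
    """Create the counts table if missing (types and key are assumed)."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS citation_counts (
            cites_doi TEXT NOT NULL,
            year INTEGER NOT NULL,
            count INTEGER NOT NULL,
            PRIMARY KEY (cites_doi, year)
        )
        """
    )
    conn.commit()


def replace_citation_counts(
    conn: sqlite3.Connection, cites_doi: str, counts: dict[int, int]
) -> None:
    """Replace all rows for one DOI so the table mirrors the histogram exactly."""
    try:
        conn.execute("DELETE FROM citation_counts WHERE cites_doi = ?", (cites_doi,))
        conn.executemany(
            "INSERT INTO citation_counts (cites_doi, year, count) VALUES (?, ?, ?)",
            [(cites_doi, year, count) for year, count in sorted(counts.items())],
        )
        conn.commit()
    except Exception:
        # Undo the DELETE too, so a DOI is never left half-replaced.
        conn.rollback()
        raise


conn = sqlite3.connect(":memory:")
ensure_citation_counts(conn)
replace_citation_counts(conn, "10.1000/example", {2024: 3, 2025: 5})
replace_citation_counts(conn, "10.1000/example", {2025: 7})  # overwrites, drops 2024
rows = conn.execute(
    "SELECT year, count FROM citation_counts ORDER BY year"
).fetchall()
```

Deleting before inserting means years that disappear from the upstream histogram also disappear locally, which an upsert alone would not guarantee.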

Test plan

  • tests/test_knowledge/test_openalex_citations.py (11): resolve/404, group_by parsing + non-year buckets, cursor pagination, limit, titleless skip, HTTP error propagation — via httpx.MockTransport (HTTP fixture, not a logic mock).
  • test_citation_stats.py / test_citations_feed.py: now assert against citation_counts; added replace-overwrites and missing-table-is-empty cases.
  • test_papers_sync.py: end-to-end sync_citing_papers with a mock transport (real client + real SQLite) — stores true counts and links recent papers; unresolved DOI skipped.
  • Regression: db, search, scheduler, sync, community router, cli sync — all green.
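To make the missing-table-is-empty case concrete, here is a hedged sketch of a reader that returns an empty mapping before the first sync. It is a simplified stand-in for get_citation_stats: the signature, the return type, and the `sqlite_master` existence check are assumptions, and the real function may be implemented differently.

```python
from __future__ import annotations

import sqlite3


def get_citation_stats(conn: sqlite3.Connection, cites_doi: str) -> dict[int, int]:
    """Return {year: count} for one DOI, or {} if nothing has been synced yet."""
    table = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = 'citation_counts'"
    ).fetchone()
    if table is None:
        return {}  # before the first sync: empty, not an error
    rows = conn.execute(
        "SELECT year, count FROM citation_counts WHERE cites_doi = ? ORDER BY year",
        (cites_doi,),
    ).fetchall()
    return {int(year): int(count) for year, count in rows}


# Real SQLite, no mocks: a fresh database has no citation_counts table.
fresh = sqlite3.connect(":memory:")
before_sync = get_citation_stats(fresh, "10.1000/example")

# After a (simulated) sync the same call returns the stored histogram.
fresh.execute("CREATE TABLE citation_counts (cites_doi TEXT, year INTEGER, count INTEGER)")
fresh.execute("INSERT INTO citation_counts VALUES ('10.1000/example', 2025, 7)")
after_sync = get_citation_stats(fresh, "10.1000/example")
```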

Deploy follow-up

After merge + dev redeploy, re-run sync papers --community {bids,eeglab} --citations to populate citation_counts. The chart will then match OpenAlex (e.g. BIDS 2025 ≈ 318).

Add a direct OpenAlex client (openalex_citations.py) that resolves a DOI to
its work id, returns the complete per-year citation histogram via group_by
(uncapped), and cursor-paginates the latest N citing papers sorted by
publication date. Add a citation_counts(cites_doi, year, count) table and a
replace_citation_counts helper that mirrors the histogram wholesale.

opencite caps citing-paper fetches at one page (<=200) with no pagination and
no aggregation, which silently truncated recent citations and inverted the
per-year curve; this is the foundation for fixing that.

sync_citing_papers now queries OpenAlex directly per canonical DOI: it stores
the exact, complete per-year counts in citation_counts (source of truth for the
dashboard) and upserts the latest 2000 citing papers (publication date desc)
into the papers table for the search tool. get_citation_stats reads the counts
table (empty, not error, before the first sync). The CLI decouples the citation
storage cap from the query --limit.

- OpenAlex client tests via httpx.MockTransport (resolve/404, group_by parsing,
  cursor pagination, limit, titleless skip, error propagation).
- get_citation_stats and the endpoint now assert against citation_counts;
  add replace-overwrites and missing-table-is-empty cases.
- End-to-end sync_citing_papers test (real client + real DB, mock transport):
  stores true counts and links recent papers; unresolved DOI skipped.

- Critical: never wipe stored counts on an empty histogram (likely a
  transient OpenAlex gap) — skip the DOI with a warning instead.
- sync all no longer forwards --limit to citations (would re-cap the stored
  sample at 100); uses the 2000 default like sync papers.
- recent_citing_papers: bound page count and stop on an empty results page
  so a stuck/non-null cursor can't spin; build the stored URL from the
  normalized DOI for consistency.
- replace_citation_counts: explicit rollback so a DOI is never half-replaced.
- Per-DOI failure log includes the exception type.

Tests: empty-counts-does-not-wipe, empty-results stop, absent-meta stop,
normalized-URL.
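The two defensive rules above (bounded cursor pagination, and never wiping counts on an empty histogram) can be sketched as follows. This is an illustration under assumptions rather than the PR's code: `fetch_page` is a hypothetical injected callable that maps a cursor string to a decoded response, `max_pages` and `update_counts` are invented names, and the in-memory `stored` dict stands in for the citation_counts table. The `meta.next_cursor` and `results` fields follow OpenAlex cursor paging.

```python
from __future__ import annotations

import logging
from typing import Any, Callable


def recent_citing_papers(
    fetch_page: Callable[[str], dict[str, Any]],
    limit: int = 2000,
    max_pages: int = 50,
) -> list[dict[str, Any]]:
    """Collect up to `limit` titled works, following cursors for at most `max_pages`."""
    papers: list[dict[str, Any]] = []
    cursor = "*"  # OpenAlex's starting cursor
    for _ in range(max_pages):  # hard bound: a stuck cursor cannot spin forever
        page = fetch_page(cursor)
        results = page.get("results") or []
        if not results:
            break  # empty page: stop even if a cursor is still returned
        for work in results:
            if not work.get("title"):
                continue  # skip titleless works
            papers.append(work)
            if len(papers) >= limit:
                return papers
        cursor = (page.get("meta") or {}).get("next_cursor")
        if not cursor:
            break  # absent meta or null cursor: last page
    return papers


def update_counts(
    stored: dict[str, dict[int, int]], doi: str, fresh: dict[int, int]
) -> bool:
    """Replace stored counts unless the fresh histogram is empty."""
    if not fresh:
        logging.warning("empty histogram for %s; keeping stored counts", doi)
        return False
    stored[doi] = dict(fresh)
    return True


# Fake pages keyed by cursor; the last one is empty but still advertises a cursor.
pages = {
    "*": {"meta": {"next_cursor": "c1"}, "results": [{"title": "A"}, {"title": None}]},
    "c1": {"meta": {"next_cursor": "c2"}, "results": [{"title": "B"}]},
    "c2": {"meta": {"next_cursor": "c3"}, "results": []},
}
titles = [p["title"] for p in recent_citing_papers(lambda c: pages[c])]
```

Treating an empty histogram as a transient gap trades a possibly stale chart for one that never drops to zero because of a single bad response.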
neuromechanist merged commit 0520b11 into develop on Jun 10, 2026
6 checks passed
neuromechanist deleted the feature/accurate-citation-counts branch on June 10, 2026 01:07