Refactor to use API as source of truth by vincerubinetti · Pull Request #48 · nih-cfde/icc-eval-core

vincerubinetti · 2026-04-13T18:44:53Z

This PR makes the API/database the single source of truth for all of the data. As a result of this, the gather pipeline could be drastically simplified.

After this PR is merged and deployed, we will be able to delete icc-eval-core-private and icc-eval-core-api, making this repo a mono-repo for everything, which simplifies the cross-talk needed between the different systems, and increases transparency imo.

Major changes:

Imports @falquaddoomi 's current contents of icc-eval-core-api into this repo. I opted for this other than merging my code into that repo because this repo has a lot of commit history I'd like to preserve.
No more public/private mode in the pipeline. The pipeline always gathers everything, and expects keys/secrets where auth is needed.
Raw and output JSON is no longer tracked in the repo. Instead, a GitHub Action will still run daily/periodically (from this repo), run the pipeline, take the outputted JSON, import it into the database (on the cloud, so private data is still hidden), then the JSON will just be forgotten/discarded. So this will avoid data duplication, and also avoid potential git LFS issues for large files.
Since raw third-party API responses are no longer being tracked, I no longer prune them to keep the size down. This makes it slightly easier to include other fields in the future, e.g. if we want to keep each contributor's account age or something.
No more cache "mode"/flag in pipeline. Previously there was an env var to flip this on and off, but it adds complication and hasn't been necessary so far. The pipeline that runs daily has always had the cache off, as to make sure to get the latest data always. Caching in general is still present, but it is really only meant for development where you're doing repeated runs and don't want to have to wait a long time. The single cache file recently became too big to store in the repo, making the option much less useful for subsequent gh-actions runs.
Got rid of complicated custom logging utility functions. Logging is now simpler, linear, and more descriptive (at the cost of being more verbose). It was a fun exercise at the time, but is not really necessary.
Got rid of complicated query/queryMulti wrapper abstraction. Again, a fun exercise at the time, but not needed. Replaced with a third party util library to implement concurrency, and got rid of saving of intermediate /raw file.
Deleted CFDE Spring 2026 meeting poster source because what it says is no longer how the system is architected. It will still be available in the commit history for posterity.
Gets rid of PDF printed reports. @cgreene has said that we should be ready to provide PDFs of things if needed, but no one from NIH has ever specifically asked for PDFs of the dashboard. Since I first implemented this, more info and interactivity has been added to the dashboard, making it harder to maintain PDF-friendly styling. If someone ends up asking for this, they can either simply print the page right from the browser, or the script I used to print all pages will still be in the commit history. Either way though, I don't see value in tracking big PDF files in the repo and linking to them in the dashboard, when they could just be generated on-demand as artifacts.
Rename /app to /frontend
Refactor frontend to load all data from API queries. Reduce duplication of work, e.g. transforming publications once in one spot, having tanstack-query hooks for each data source, etc.
Co-locate summary code by domain. That is, put getAnalyticsOverview (CFDE-wide analytics info) next to getAnalytics (for one project) next to each other, instead of having the overview code in the main gather.ts script. Easier to read and maintain.
Instead of saving all output at very end of gather.ts script, save after each step (again, co-locating by domain). I originally did this because I wanted the whole script to either succeed or fail together and avoid accidentally creating partial/mixed states, e.g. failing on repo step and now half the json files are updated and half are not. Now that the output files are meant to be ephemeral (hopefully made clear by them being gitignored), this doesn't matter as much.
Use bun for everything instead of node.
Adds gathering of ORCID to project auth map as part of pipeline.

falquaddoomi

I did a high-level review and it looks good to me.

Once this is merged, I'm going to clone it and move over some of my code that was uncommitted and didn't make it into this copy. I'll also see if anything broke for me locally, and submit the updates + any fixes as a PR.

vincerubinetti · 2026-04-13T20:51:05Z

Since this PR is contingent on the backend serving all the public files as well, I'll hold off on merging. If I were to merge, the dashboard would at least still be live since it's deploying from the private repo, but the data would stop updating daily (not that it changes that much day-to-day usually).

Feel free to append any needed backend changes to this PR.

vincerubinetti · 2026-04-20T20:14:09Z

@seandavi FYI we're changing the architecture here. Just wanted to make you aware, I don't think it will affect anyone other than Faisal and I. Note that the poster I presented at the spring meeting is no longer true -- that's no longer how the pipeline works -- just so we don't misspeak about technical details in meetings.

…local API

…0.0.1 over localhost

falquaddoomi · 2026-04-22T02:12:31Z

Hey @vincerubinetti, after a few tweaks (committed) I got the frontend to contact the backend for user info, but it seems to not be issuing other requests, e.g. for core projects. I see this in the browser's console:

[Vue Router warn]: No match found for location with path "/core-project/"

(Sorry for the image of the traceback; Chrome gave me all sorts of nonsense when I tried to copy-paste the text.)

Also, the database starts out unpopulated as expected, and since the data isn't public I don't want to put the dump in this repo. If you'd like to try it locally, I made a dumpfile from the current prod database; I can send you it through another channel as well as instructions on how to load it.

vincerubinetti · 2026-04-22T15:01:43Z

Did you see the const mock = true; flag in /frontend/src/api/index.ts? Sorry I didn't document/mention that anywhere.

I'll look into the router matching issue.

If you could send me the db dump that'd be great. Maybe it's even small enough in this case to send through slack?

falquaddoomi · 2026-04-27T20:36:27Z

Did you see the const mock = true; flag in /frontend/src/api/index.ts? Sorry I didn't document/mention that anywhere.

Aha, that was it. Would it be ok for me to modify that to pull an env var and default to true if it's not found?

About the database, sure thing; it's extremely small (~1mb) and would be easy to send over, e.g. Slack.

One last thing, now that I have the frontend calling the backend I've noticed that this version of the backend code is still relying on some private JSON files being available on the backend's filesystem that I was intending to move into the database anyway. I'm thinking I should make that change in this PR rather than holding off on it since you've already done a lot of work removing the other private data dependencies.

vincerubinetti · 2026-04-27T21:06:12Z

I'm thinking I should make that change in this PR rather than holding off on it since you've already done a lot of work removing the other private data dependencies.

Sounds good to me, let's close it all out with this PR. Let me know if you run into difficulty running the pipeline locally to generate all the json files.

vincerubinetti · 2026-05-05T17:58:17Z

FYI, going to push some small changes today.

Also running git merge origin/main -s ours to make this branch be the full source of truth.

… an error as a result

… to be run in a container

…the "pipeline" service

…g a request

…rized stack

falquaddoomi · 2026-05-29T19:29:28Z

Hey @vincerubinetti, FYI I've updated the PR with the following significant changes:

addition of data models, endpoints for previously-unmodeled public data
modified gather to allow it to run headless when HEADLESS_BROWSER env var is "true"
frontend changes:
- prevents passing undefined values as querystring args when making requests against the API
- adds check for VITE_USE_MOCK env var
  - if "true" or undefined, uses mocking
  - if "false" (as it is in the dev config), disables mocking
added Dockerfile to data folder, added pipeline service to root Compose file
- added env var placeholders in .env.TEMPLATE needed for running the gather script, e.g. AUTH_ENTREZ, etc.
added a basic README regarding how to use the Compose stack
admin UI now displays ORCID-user mappings that contain a star as "all projects"

falquaddoomi · 2026-05-29T19:32:15Z

@vincerubinetti just an FYI, but I noticed when running the gather script I encountered the following error both inside the container and via the run_pipeline.sh script:

% ./run_pipeline.sh --gather
Running script: 'gather'
With params: ''
Running in /data
$ bun ./gather.ts
 ------------------------------------------------------------
 Opportunities
 ------------------------------------------------------------
 Scraping https://commonfund.nih.gov/dataecosystem/FundingOpportunities for documents
 Parsing 17 documents for opportunities
Processing 17 items with concurrency 1...
   https://grants.nih.gov/grants/guide/rfa-files/RFA-RM-24-006.html
     RFA-RM-24-006
   https://commonfund.nih.gov/sites/default/files/OTA-24-005.pdf
     NOT-RM-24-006
   https://grants.nih.gov/grants/guide/notice-files/NOT-RM-24-006.html
     NOT-RM-24-003
   https://grants.nih.gov/grants/guide/notice-files/NOT-RM-24-003.html
     RFA-RM-23-014
   https://commonfund.nih.gov/sites/default/files/OTA-24-004.pdf
     RFA-RM-23-013
   https://grants.nih.gov/grants/guide/rfa-files/RFA-RM-23-014.html
     RFA-RM-23-015
   https://grants.nih.gov/grants/guide/rfa-files/RFA-RM-23-013.html
     RFA-RM-23-002
   https://grants.nih.gov/grants/guide/rfa-files/RFA-RM-23-015.html
     OTA-23-004
   https://grants.nih.gov/grants/guide/rfa-files/RFA-RM-23-002.html
     NOT-RM-23-005
   https://commonfund.nih.gov/sites/default/files/OTA-23-004.pdf
     NOT-RM-23-006
   https://grants.nih.gov/grants/guide/notice-files/NOT-RM-23-005.html
     RFA-RM-23-003
   https://grants.nih.gov/grants/guide/notice-files/NOT-RM-23-006.html
     RFA-RM-22-007
   https://grants.nih.gov/grants/guide/rfa-files/RFA-RM-23-003.html
     RFA-RM-21-007
   https://grants.nih.gov/grants/guide/rfa-files/RFA-RM-22-007.html
     RFA-RM-19-012
   https://grants.nih.gov/grants/guide/rfa-files/RFA-RM-24-006.html
    ⚠ Error: Bad end offset: 2755512
   https://commonfund.nih.gov/sites/default/files/OTA-24-005.pdf
    ⚠ Error: Bad end offset: 1458176
   https://grants.nih.gov/grants/guide/notice-files/NOT-RM-24-006.html
    ⚠ Error: Bad end offset: 802816
 Including 17 manual opportunities
   RFA-RM-24-006
   OTA-24-005
   NOT-RM-24-006
   NOT-RM-24-003
   OTA-24-004
   RFA-RM-23-014
   RFA-RM-23-013
   RFA-RM-23-015
   RFA-RM-23-002
   OTA-23-004
   NOT-RM-23-005
   NOT-RM-23-006
   RFA-RM-23-003
   RFA-RM-22-007
   RFA-RM-21-007
   RFA-RM-19-012
   OTA-20-005
✓ 17 opportunities
 ------------------------------------------------------------
 Projects
 ------------------------------------------------------------
 Getting projects for 17 opportunities
 Including 4 manual core projects
✓ 91 core projects
✓ 166 projects
 ------------------------------------------------------------
 Publications
 ------------------------------------------------------------
 Getting publications for 91 core projects
 Getting metadata for 406 core publications
  ⚠ Number of RePORTER and iCite pubs don't match
     Not in iCite: 42115493, 42110661, 42163770
     Not in RePORTER: 
 ------------------------------------------------------------
 Journals
 ------------------------------------------------------------
 Getting 185 journals
 Downloading from https://www.scimagojr.com/journalrank.php?out=xls
 Getting full journal details
Processing 185 items with concurrency 1...
   Front Bioinform
   
   ...omitted..

   Front Physiol
   Front Oncol
   Eur J Cancer
   J Med Imaging Radiat Oncol
✓ 185 journals
 ------------------------------------------------------------
 Analytics
 ------------------------------------------------------------
 Getting Google Analytics properties
   properties/361964108 Glygen - GA4
   properties/503949232 GlyTableMaker

   ...omitted..

   properties/386147844 data.4dnucleome.org - GA4
   properties/367401149 Drug Central TiD - GA4
 Getting Google Analytics data for 23 properties
Processing 23 items with concurrency 10...
   Glygen - GA4 - Over time
   GlyTableMaker - Over time
   HuBMAP Portal - GA4 - Over time
   HuBMAP Consortium - GA4 - Over time

   ...omitted...

   data.4dnucleome.org - GA4 - Page views
   Drug Central TiD - GA4 - Page views
   data.4dnucleome.org - GA4 - Scroll events
   Drug Central TiD - GA4 - Scroll events
 [object Object]
  ⚠ Error: 4 DEADLINE_EXCEEDED: Deadline expired before operation could complete.
✗ 1 errors
24 |   level: keyof typeof levels = "",
25 |   indent = 0,
26 | ) => {
27 |   const { color, icon } = levels[level];
28 |   console.log("  ".repeat(indent) + color(icon, message));
29 |   if (level === "error") throw Error(message);
                                    ^
error: 1 errors
      at log (/Users/faisala/Checkouts/chai/cfde/icc-eval-core/data/util/log.ts:29:32)
      at <anonymous> (/Users/faisala/Checkouts/chai/cfde/icc-eval-core/data/gather/analytics.ts:85:22)

Bun v1.2.20 (macOS arm64)

Not sure if that's expected or if you have insight into what might be going on. I had successfully run the gather script a few days ago (this is after the changes I'd made to use the headless browser), but today I ran into this, so I can't imagine it's something I introduced. I may be wrong, though.

vincerubinetti · 2026-05-29T19:35:21Z

Looks like it's failing on the main branch too as of yesterday, I'll look into it

vincerubinetti · 2026-06-05T23:47:39Z

@falquaddoomi Regarding the pipeline failure, that is only happening on GitHub actions, it runs fine locally. The issue is that the "playwright install" step is hanging.

From some research, it seems like it's due to unreliability of apt install mirrors that the GitHub ubuntu-latest image uses:

microsoft/playwright#23388
NousResearch/hermes-agent#35166
Frazzled-Productions/poke-memory#644

It seems like we might be able to just wait out the problem, or pin the Ubuntu version possibly, or use an official Playwright Docker container on Actions:

microsoft/playwright#23388 (comment)

I'll look into whether that causes any problems with the pipeline on Actions. Have you experienced the same issue with the oven/bun image you're using for running the pipeline?

Regardless, the issue isn't due to anything in this PR and is unrelated.

falquaddoomi · 2026-06-10T18:17:24Z

@vincerubinetti thanks for the response; yeah, seems that error I was seeing was something transient, as I'm able to run the pipeline locally fine, both in the new pipeline container and on the host. I didn't encounter any issues loading the resulting scraped data into the database, either.

Given that, here's the remaining todos on my end:

enforce authentication only on the Repositories and Analytics views, via the user ORCID -> core project mapping
load users.json and replace the existing user->project mapping with its contents

vincerubinetti · 2026-07-01T02:08:36Z

@falquaddoomi Note my last changes.

I renamed the /repo-overview API endpoint in the backend to /repositories-overiew to match what the frontend expects. I started to go further and rename the models/serializers/etc, but stopped. You can do that if you feel the desire to.

Also, to simplify and normalize things in the Google Analytics, I made some changes. See /frontend/src/api/types/analytics.json for the new type definition for the analytics. Notice the addition of pageViews. Notice the addition of returningUsers metric to every dimension. Notice there is no more topDimension or byMetric prefixes, for simplicity and also clarity because I'm now essentially returning all results, not just the top 10.

I thought about trying to update the backend with these naming changes, and I could probably get most of the way there, but I realized I'd probably screw up the migrations especially. However, since we're starting from scratch with this PR, if you wanted to start fresh with the migrations, I can try to make these naming changes for you. Or maybe I can make a first pass at the backend renames, and you can take it from there.

vincerubinetti added 9 commits April 10, 2026 13:53

import api repo

c32f53e

rename app folder to frontend

23bebe4

update env, git ignore, setup instructions

cac4897

remove CACHE and PRIVATE

9a43749

fix gh-actions run script filename

9eb5da2

refactor

7dc9d35

refactor frontend to use api

ba1ce70

Merge remote-tracking branch 'origin/main' into refactor

cf63bc5

remove printing

d6ef29d

vincerubinetti requested review from falquaddoomi and removed request for falquaddoomi April 13, 2026 19:02

add orcid-to-project map

00fa066

vincerubinetti requested a review from falquaddoomi April 13, 2026 20:09

move auth setup instructions

c02328d

falquaddoomi approved these changes Apr 13, 2026

View reviewed changes

vincerubinetti requested a review from seandavi April 20, 2026 20:12

vincerubinetti and others added 4 commits April 20, 2026 16:32

add new users, reverse map direction

c300d5b

Adds frontend Dockerfile, updates env vars in dev environment to use …

b473dc7

…local API

Fixes recent psycopg2 build failure by switching to binary version

19339a1

Adds 127.0.0.1:5173 as CORS origin, since ORCID sandbox requires 127.…

20c8231

…0.0.1 over localhost

fix route warnings

78a0248

Merge remote-tracking branch 'origin/main' into refactor

0c100dd

upgrade frontend packages

2bbbe58

falquaddoomi added 9 commits May 27, 2026 10:45

Frontend using mock vs. real data settable via env var VITE_USE_MOCK

792d95a

Prevents frontend container from trying to launch browser and showing…

b0583ee

… an error as a result

Configures frontend container to use local API rather than mock data

4459888

Adds HEADLESS_BROWSER env var to data-gathering pipeline, allowing it…

7ba681c

… to be run in a container

Adds Dockerfile + ignore for data folder, so that it can be run as …

de691ca

…the "pipeline" service

Adds 'pipeline' service to root Docker Compose

79d72ed

Prevents passing undefined values as a querystring arg when performin…

e4a7d89

…g a request

Adds .env template placeholders for values needed during gather

f5a6876

Adds a README with some basic instructions on how to use the containe…

7e6f4a3

…rized stack

minor tweaks to previous changes

13cf3c7

vincerubinetti and others added 14 commits June 10, 2026 14:34

find+replace repo -> repository

1b553b5

Drops and recreates db before import

2aa0e70

Removed random 'a' in the middle of the page

f069a82

Allows backend to import either repositories.json or repos.json

a757352

Updates perms to restrict only Analytics, Repository

65e8c22

Updates tests for new perms model

e4262ae

Disabled old pipeline workflow, added new workflow that runs on GCP VM

771e46c

Updated workflow to add db import step. Added missing perms helpers

f339177

now just git pulls rather than checking out main

2187b6e

split new vs returning users

1e95ef7

repo -> repository

c1366d2

repo -> repository

fd5db08

fix repo-overview -> repositories-overview backend endpoint name

3e22976

simplify and normalize analytics dimensions

db83a3d

Uh oh!

Conversation

vincerubinetti commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

falquaddoomi left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vincerubinetti commented Apr 13, 2026

Uh oh!

vincerubinetti commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

falquaddoomi commented Apr 22, 2026

Uh oh!

vincerubinetti commented Apr 22, 2026

Uh oh!

falquaddoomi commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vincerubinetti commented Apr 27, 2026

Uh oh!

vincerubinetti commented May 5, 2026

Uh oh!

falquaddoomi commented May 29, 2026

Uh oh!

falquaddoomi commented May 29, 2026

Uh oh!

vincerubinetti commented May 29, 2026

Uh oh!

vincerubinetti commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

falquaddoomi commented Jun 10, 2026

Uh oh!

vincerubinetti commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

vincerubinetti commented Apr 13, 2026 •

edited

Loading

falquaddoomi left a comment •

edited

Loading

vincerubinetti commented Apr 20, 2026 •

edited

Loading

falquaddoomi commented Apr 27, 2026 •

edited

Loading

vincerubinetti commented Jun 5, 2026 •

edited

Loading