Skip to content

Refactor to use API as source of truth#48

Open
vincerubinetti wants to merge 52 commits into
mainfrom
refactor
Open

Refactor to use API as source of truth#48
vincerubinetti wants to merge 52 commits into
mainfrom
refactor

Conversation

@vincerubinetti

@vincerubinetti vincerubinetti commented Apr 13, 2026

Copy link
Copy Markdown
Collaborator

This PR makes the API/database the single source of truth for all of the data. As a result of this, the gather pipeline could be drastically simplified.

After this PR is merged and deployed, we will be able to delete icc-eval-core-private and icc-eval-core-api, making this repo a mono-repo for everything, which simplifies the cross-talk needed between the different systems, and increases transparency imo.

Major changes:

  • Imports @falquaddoomi 's current contents of icc-eval-core-api into this repo. I opted for this other than merging my code into that repo because this repo has a lot of commit history I'd like to preserve.
  • No more public/private mode in the pipeline. The pipeline always gathers everything, and expects keys/secrets where auth is needed.
  • Raw and output JSON is no longer tracked in the repo. Instead, a GitHub Action will still run daily/periodically (from this repo), run the pipeline, take the outputted JSON, import it into the database (on the cloud, so private data is still hidden), then the JSON will just be forgotten/discarded. So this will avoid data duplication, and also avoid potential git LFS issues for large files.
  • Since raw third-party API responses are no longer being tracked, I no longer prune them to keep the size down. This makes it slightly easier to include other fields in the future, e.g. if we want to keep each contributor's account age or something.
  • No more cache "mode"/flag in pipeline. Previously there was an env var to flip this on and off, but it adds complication and hasn't been necessary so far. The pipeline that runs daily has always had the cache off, as to make sure to get the latest data always. Caching in general is still present, but it is really only meant for development where you're doing repeated runs and don't want to have to wait a long time. The single cache file recently became too big to store in the repo, making the option much less useful for subsequent gh-actions runs.
  • Got rid of complicated custom logging utility functions. Logging is now simpler, linear, and more descriptive (at the cost of being more verbose). It was a fun exercise at the time, but is not really necessary.
  • Got rid of complicated query/queryMulti wrapper abstraction. Again, a fun exercise at the time, but not needed. Replaced with a third party util library to implement concurrency, and got rid of saving of intermediate /raw file.
  • Deleted CFDE Spring 2026 meeting poster source because what it says is no longer how the system is architected. It will still be available in the commit history for posterity.
  • Gets rid of PDF printed reports. @cgreene has said that we should be ready to provide PDFs of things if needed, but no one from NIH has ever specifically asked for PDFs of the dashboard. Since I first implemented this, more info and interactivity has been added to the dashboard, making it harder to maintain PDF-friendly styling. If someone ends up asking for this, they can either simply print the page right from the browser, or the script I used to print all pages will still be in the commit history. Either way though, I don't see value in tracking big PDF files in the repo and linking to them in the dashboard, when they could just be generated on-demand as artifacts.
  • Rename /app to /frontend
  • Refactor frontend to load all data from API queries. Reduce duplication of work, e.g. transforming publications once in one spot, having tanstack-query hooks for each data source, etc.
  • Co-locate summary code by domain. That is, put getAnalyticsOverview (CFDE-wide analytics info) next to getAnalytics (for one project) next to each other, instead of having the overview code in the main gather.ts script. Easier to read and maintain.
  • Instead of saving all output at very end of gather.ts script, save after each step (again, co-locating by domain). I originally did this because I wanted the whole script to either succeed or fail together and avoid accidentally creating partial/mixed states, e.g. failing on repo step and now half the json files are updated and half are not. Now that the output files are meant to be ephemeral (hopefully made clear by them being gitignored), this doesn't matter as much.
  • Use bun for everything instead of node.
  • Adds gathering of ORCID to project auth map as part of pipeline.

@vincerubinetti vincerubinetti requested review from falquaddoomi and removed request for falquaddoomi April 13, 2026 19:02

@falquaddoomi falquaddoomi left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a high-level review and it looks good to me.

Once this is merged, I'm going to clone it and move over some of my code that was uncommitted and didn't make it into this copy. I'll also see if anything broke for me locally, and submit the updates + any fixes as a PR.

@vincerubinetti

Copy link
Copy Markdown
Collaborator Author

Since this PR is contingent on the backend serving all the public files as well, I'll hold off on merging. If I were to merge, the dashboard would at least still be live since it's deploying from the private repo, but the data would stop updating daily (not that it changes that much day-to-day usually).

Feel free to append any needed backend changes to this PR.

@vincerubinetti vincerubinetti requested a review from seandavi April 20, 2026 20:12
@vincerubinetti

vincerubinetti commented Apr 20, 2026

Copy link
Copy Markdown
Collaborator Author

@seandavi FYI we're changing the architecture here. Just wanted to make you aware, I don't think it will affect anyone other than Faisal and I. Note that the poster I presented at the spring meeting is no longer true -- that's no longer how the pipeline works -- just so we don't misspeak about technical details in meetings.

@falquaddoomi

Copy link
Copy Markdown
Contributor

Hey @vincerubinetti, after a few tweaks (committed) I got the frontend to contact the backend for user info, but it seems to not be issuing other requests, e.g. for core projects. I see this in the browser's console:

[Vue Router warn]: No match found for location with path "/core-project/"
Screenshot 2026-04-21 at 8 10 01 PM

(Sorry for the image of the traceback; Chrome gave me all sorts of nonsense when I tried to copy-paste the text.)

Also, the database starts out unpopulated as expected, and since the data isn't public I don't want to put the dump in this repo. If you'd like to try it locally, I made a dumpfile from the current prod database; I can send you it through another channel as well as instructions on how to load it.

@vincerubinetti

Copy link
Copy Markdown
Collaborator Author

Did you see the const mock = true; flag in /frontend/src/api/index.ts? Sorry I didn't document/mention that anywhere.

I'll look into the router matching issue.

If you could send me the db dump that'd be great. Maybe it's even small enough in this case to send through slack?

@falquaddoomi

falquaddoomi commented Apr 27, 2026

Copy link
Copy Markdown
Contributor

Did you see the const mock = true; flag in /frontend/src/api/index.ts? Sorry I didn't document/mention that anywhere.

Aha, that was it. Would it be ok for me to modify that to pull an env var and default to true if it's not found?

About the database, sure thing; it's extremely small (~1mb) and would be easy to send over, e.g. Slack.

One last thing, now that I have the frontend calling the backend I've noticed that this version of the backend code is still relying on some private JSON files being available on the backend's filesystem that I was intending to move into the database anyway. I'm thinking I should make that change in this PR rather than holding off on it since you've already done a lot of work removing the other private data dependencies.

@vincerubinetti

Copy link
Copy Markdown
Collaborator Author

I'm thinking I should make that change in this PR rather than holding off on it since you've already done a lot of work removing the other private data dependencies.

Sounds good to me, let's close it all out with this PR. Let me know if you run into difficulty running the pipeline locally to generate all the json files.

@vincerubinetti

Copy link
Copy Markdown
Collaborator Author

FYI, going to push some small changes today.

Also running git merge origin/main -s ours to make this branch be the full source of truth.

@falquaddoomi

Copy link
Copy Markdown
Contributor

Hey @vincerubinetti, FYI I've updated the PR with the following significant changes:

  • addition of data models, endpoints for previously-unmodeled public data
  • modified gather to allow it to run headless when HEADLESS_BROWSER env var is "true"
  • frontend changes:
    • prevents passing undefined values as querystring args when making requests against the API
    • adds check for VITE_USE_MOCK env var
      • if "true" or undefined, uses mocking
      • if "false" (as it is in the dev config), disables mocking
  • added Dockerfile to data folder, added pipeline service to root Compose file
    • added env var placeholders in .env.TEMPLATE needed for running the gather script, e.g. AUTH_ENTREZ, etc.
  • added a basic README regarding how to use the Compose stack
  • admin UI now displays ORCID-user mappings that contain a star as "all projects"

@falquaddoomi

Copy link
Copy Markdown
Contributor

@vincerubinetti just an FYI, but I noticed when running the gather script I encountered the following error both inside the container and via the run_pipeline.sh script:

% ./run_pipeline.sh --gather
Running script: 'gather'
With params: ''
Running in /data
$ bun ./gather.ts
 ------------------------------------------------------------
 Opportunities
 ------------------------------------------------------------
 Scraping https://commonfund.nih.gov/dataecosystem/FundingOpportunities for documents
 Parsing 17 documents for opportunities
Processing 17 items with concurrency 1...
   https://grants.nih.gov/grants/guide/rfa-files/RFA-RM-24-006.html
     RFA-RM-24-006
   https://commonfund.nih.gov/sites/default/files/OTA-24-005.pdf
     NOT-RM-24-006
   https://grants.nih.gov/grants/guide/notice-files/NOT-RM-24-006.html
     NOT-RM-24-003
   https://grants.nih.gov/grants/guide/notice-files/NOT-RM-24-003.html
     RFA-RM-23-014
   https://commonfund.nih.gov/sites/default/files/OTA-24-004.pdf
     RFA-RM-23-013
   https://grants.nih.gov/grants/guide/rfa-files/RFA-RM-23-014.html
     RFA-RM-23-015
   https://grants.nih.gov/grants/guide/rfa-files/RFA-RM-23-013.html
     RFA-RM-23-002
   https://grants.nih.gov/grants/guide/rfa-files/RFA-RM-23-015.html
     OTA-23-004
   https://grants.nih.gov/grants/guide/rfa-files/RFA-RM-23-002.html
     NOT-RM-23-005
   https://commonfund.nih.gov/sites/default/files/OTA-23-004.pdf
     NOT-RM-23-006
   https://grants.nih.gov/grants/guide/notice-files/NOT-RM-23-005.html
     RFA-RM-23-003
   https://grants.nih.gov/grants/guide/notice-files/NOT-RM-23-006.html
     RFA-RM-22-007
   https://grants.nih.gov/grants/guide/rfa-files/RFA-RM-23-003.html
     RFA-RM-21-007
   https://grants.nih.gov/grants/guide/rfa-files/RFA-RM-22-007.html
     RFA-RM-19-012
   https://grants.nih.gov/grants/guide/rfa-files/RFA-RM-24-006.html
    ⚠ Error: Bad end offset: 2755512
   https://commonfund.nih.gov/sites/default/files/OTA-24-005.pdf
    ⚠ Error: Bad end offset: 1458176
   https://grants.nih.gov/grants/guide/notice-files/NOT-RM-24-006.html
    ⚠ Error: Bad end offset: 802816
 Including 17 manual opportunities
   RFA-RM-24-006
   OTA-24-005
   NOT-RM-24-006
   NOT-RM-24-003
   OTA-24-004
   RFA-RM-23-014
   RFA-RM-23-013
   RFA-RM-23-015
   RFA-RM-23-002
   OTA-23-004
   NOT-RM-23-005
   NOT-RM-23-006
   RFA-RM-23-003
   RFA-RM-22-007
   RFA-RM-21-007
   RFA-RM-19-012
   OTA-20-005
✓ 17 opportunities
 ------------------------------------------------------------
 Projects
 ------------------------------------------------------------
 Getting projects for 17 opportunities
 Including 4 manual core projects
✓ 91 core projects
✓ 166 projects
 ------------------------------------------------------------
 Publications
 ------------------------------------------------------------
 Getting publications for 91 core projects
 Getting metadata for 406 core publications
  ⚠ Number of RePORTER and iCite pubs don't match
     Not in iCite: 42115493, 42110661, 42163770
     Not in RePORTER: 
 ------------------------------------------------------------
 Journals
 ------------------------------------------------------------
 Getting 185 journals
 Downloading from https://www.scimagojr.com/journalrank.php?out=xls
 Getting full journal details
Processing 185 items with concurrency 1...
   Front Bioinform
   
   ...omitted..

   Front Physiol
   Front Oncol
   Eur J Cancer
   J Med Imaging Radiat Oncol
✓ 185 journals
 ------------------------------------------------------------
 Analytics
 ------------------------------------------------------------
 Getting Google Analytics properties
   properties/361964108 Glygen - GA4
   properties/503949232 GlyTableMaker

   ...omitted..

   properties/386147844 data.4dnucleome.org - GA4
   properties/367401149 Drug Central TiD - GA4
 Getting Google Analytics data for 23 properties
Processing 23 items with concurrency 10...
   Glygen - GA4 - Over time
   GlyTableMaker - Over time
   HuBMAP Portal - GA4 - Over time
   HuBMAP Consortium - GA4 - Over time

   ...omitted...

   data.4dnucleome.org - GA4 - Page views
   Drug Central TiD - GA4 - Page views
   data.4dnucleome.org - GA4 - Scroll events
   Drug Central TiD - GA4 - Scroll events
 [object Object]
  ⚠ Error: 4 DEADLINE_EXCEEDED: Deadline expired before operation could complete.
✗ 1 errors
24 |   level: keyof typeof levels = "",
25 |   indent = 0,
26 | ) => {
27 |   const { color, icon } = levels[level];
28 |   console.log("  ".repeat(indent) + color(icon, message));
29 |   if (level === "error") throw Error(message);
                                    ^
error: 1 errors
      at log (/Users/faisala/Checkouts/chai/cfde/icc-eval-core/data/util/log.ts:29:32)
      at <anonymous> (/Users/faisala/Checkouts/chai/cfde/icc-eval-core/data/gather/analytics.ts:85:22)

Bun v1.2.20 (macOS arm64)

Not sure if that's expected or if you have insight into what might be going on. I had successfully run the gather script a few days ago (this is after the changes I'd made to use the headless browser), but today I ran into this, so I can't imagine it's something I introduced. I may be wrong, though.

@vincerubinetti

Copy link
Copy Markdown
Collaborator Author

Looks like it's failing on the main branch too as of yesterday, I'll look into it

Screenshot 2026-05-29 at 3 35 02 PM

@vincerubinetti

vincerubinetti commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator Author

@falquaddoomi Regarding the pipeline failure, that is only happening on GitHub actions, it runs fine locally. The issue is that the "playwright install" step is hanging.

From some research, it seems like it's due to unreliability of apt install mirrors that the GitHub ubuntu-latest image uses:

microsoft/playwright#23388
NousResearch/hermes-agent#35166
Frazzled-Productions/poke-memory#644

It seems like we might be able to just wait out the problem, or pin the Ubuntu version possibly, or use an official Playwright Docker container on Actions:

microsoft/playwright#23388 (comment)

I'll look into whether that causes any problems with the pipeline on Actions. Have you experienced the same issue with the oven/bun image you're using for running the pipeline?

Regardless, the issue isn't due to anything in this PR and is unrelated.

@falquaddoomi

Copy link
Copy Markdown
Contributor

@vincerubinetti thanks for the response; yeah, seems that error I was seeing was something transient, as I'm able to run the pipeline locally fine, both in the new pipeline container and on the host. I didn't encounter any issues loading the resulting scraped data into the database, either.

Given that, here's the remaining todos on my end:

  • enforce authentication only on the Repositories and Analytics views, via the user ORCID -> core project mapping
  • load users.json and replace the existing user->project mapping with its contents

@vincerubinetti

Copy link
Copy Markdown
Collaborator Author

@falquaddoomi Note my last changes.

I renamed the /repo-overview API endpoint in the backend to /repositories-overiew to match what the frontend expects. I started to go further and rename the models/serializers/etc, but stopped. You can do that if you feel the desire to.

Also, to simplify and normalize things in the Google Analytics, I made some changes. See /frontend/src/api/types/analytics.json for the new type definition for the analytics. Notice the addition of pageViews. Notice the addition of returningUsers metric to every dimension. Notice there is no more topDimension or byMetric prefixes, for simplicity and also clarity because I'm now essentially returning all results, not just the top 10.

I thought about trying to update the backend with these naming changes, and I could probably get most of the way there, but I realized I'd probably screw up the migrations especially. However, since we're starting from scratch with this PR, if you wanted to start fresh with the migrations, I can try to make these naming changes for you. Or maybe I can make a first pass at the backend renames, and you can take it from there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants