Scholar Scraper

An automated academic paper scraper: it fetches Google Scholar results, enriches their metadata via Crossref, and verifies indexing in Web of Science (WOS) and Scopus.

Spanish documentation: README-ES.md

Pipeline

ScholarScraper (Firefox/Chrome) → queries/<folder>/page_N.html
       ↓
ParserScholarLite → Crossref (DOI) → Scopus API / WOS API
       ↓
output/<query>_output/<timestamp>/scraped_papers.xlsx
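The Crossref step in the diagram resolves a scraped title to a DOI. As a rough illustration of what such a lookup involves, here is a minimal standard-library sketch against the public Crossref REST API. The function names are hypothetical and this is not necessarily how the project's parser performs the lookup.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

CROSSREF_WORKS_URL = "https://api.crossref.org/works"


def build_crossref_query(title: str, rows: int = 1) -> str:
    """Return a Crossref /works URL that searches by bibliographic text."""
    params = {"query.bibliographic": title, "rows": rows}
    return f"{CROSSREF_WORKS_URL}?{urllib.parse.urlencode(params)}"


def extract_doi(payload: dict) -> Optional[str]:
    """Pick the DOI of the top-ranked match, or None if nothing matched."""
    items = payload.get("message", {}).get("items", [])
    return items[0].get("DOI") if items else None


def lookup_doi(title: str, timeout: float = 30.0) -> Optional[str]:
    """Network call: query Crossref and return the best-matching DOI."""
    url = build_crossref_query(title)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return extract_doi(json.load(resp))
```

The top-ranked Crossref match is not guaranteed to be the right paper, so a real pipeline would likely compare the returned title against the scraped one before trusting the DOI.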

Requirements

Python 3.11+ and uv (pip install uv)
Firefox Portable (included in browser/) with a pre-configured user profile
Institutional WOS access via Clarivate
API keys for Scopus Developer and Clarivate WOS

Setup

Install dependencies:

uv venv
uv pip install -r requirements.txt

Create a .env file at the project root:

SCOPUS_API_KEY=your_scopus_api_key
WOS_API_KEY=your_wos_api_key
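For reference, the two keys above can be read from Python with nothing but the standard library. The sketch below is an assumption about how a script might consume the .env file, not the project's own loader; `parse_env_text` and `load_api_keys` are hypothetical names.

```python
import os
from pathlib import Path
from typing import Dict

REQUIRED_KEYS = ("SCOPUS_API_KEY", "WOS_API_KEY")


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse KEY=value lines, skipping blanks, comments, and malformed lines."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_api_keys(env_path=".env") -> Dict[str, str]:
    """Return both API keys; real environment variables win over the file."""
    path = Path(env_path)
    file_values = parse_env_text(path.read_text(encoding="utf-8")) if path.is_file() else {}
    keys = {name: os.environ.get(name) or file_values.get(name) for name in REQUIRED_KEYS}
    missing = [name for name, value in keys.items() if not value]
    if missing:
        raise RuntimeError(f"Missing API keys: {', '.join(missing)}")
    return keys
```

Failing fast on a missing key is a deliberate choice here: it surfaces a configuration problem before a long scraping run starts.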

Usage

Full pipeline (recommended)

run_pipeline.py runs scraping and parsing in a single command:

# Scrape + parse (Firefox portable, 100 results)
uv run --with selenium run_pipeline.py "machine learning healthcare"

# More results with WOS/Scopus indexer verification
uv run --with selenium run_pipeline.py "deep learning NLP" --max 200 --indexers

# Chrome backend (better stealth, requires Chrome installed)
uv run --with nodriver run_pipeline.py "your query" --backend chrome --max 100

# Re-parse an existing folder without scraping again
uv run --with selenium run_pipeline.py "your query" --skip-scrape --out MyQuery

Output: output/<query>_output/<timestamp>/scraped_papers.xlsx
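Because each run writes into a new timestamp folder, it can be handy to locate the most recent workbook programmatically. The sketch below assumes only the directory layout shown above. Since the timestamp format is not documented here, it compares file modification times instead of parsing folder names. `latest_workbook` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional


def latest_workbook(output_root, run_folder: str) -> Optional[Path]:
    """Return the newest scraped_papers.xlsx under output_root/run_folder."""
    run_dir = Path(output_root) / run_folder
    if not run_dir.is_dir():
        return None
    # One level of timestamp folders, each holding a single workbook.
    candidates = list(run_dir.glob("*/scraped_papers.xlsx"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

For example, `latest_workbook("output", "MyQuery_output")` would return the newest workbook for a folder named MyQuery_output, or None if no run has completed yet.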

Step by step

1. Scrape Google Scholar

uv run --with selenium scholar_scraper_test.py "your query" --max 100 --out MyQuery

HTML pages are saved to queries/MyQuery/page_0.html, page_10.html, etc.
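The file names suggest that the number in page_N.html is a result offset that advances by 10 per page. Under that assumption (inferred from the page_0.html and page_10.html examples, not confirmed by the scraper's code), the files expected for a given --max value can be listed as follows. `page_filenames` is a hypothetical helper.

```python
from typing import List


def page_filenames(max_results: int, page_size: int = 10) -> List[str]:
    """List the page files expected for max_results, one per result offset."""
    return [f"page_{offset}.html" for offset in range(0, max_results, page_size)]
```

With `--max 100` this yields ten names, from page_0.html through page_90.html, which is a quick way to check whether a scrape stopped early.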

2. Authenticate WOS session

Open browser/FirefoxPortable/FirefoxPortable.exe, which ships with the institutional profile. Log in to the institutional library portal and open Clarivate WOS. The session can expire, so renew it before each run that has indexer verification enabled.

3. Parse saved HTMLs

python main_scrape.py

main_scrape.py iterates over every subfolder in queries/ and writes scraped_papers.xlsx to output/. A run can take up to 3 hours, depending on the number of papers.
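The folder walk described above can be sketched with pathlib. This is an illustration of the traversal only, under the assumption that saved pages follow the page_N.html naming; it does not reproduce the parsing done by main_scrape.py, and `iter_query_pages` is a hypothetical name.

```python
import re
from pathlib import Path

PAGE_RE = re.compile(r"page_(\d+)\.html")


def iter_query_pages(queries_root="queries"):
    """Yield (folder_name, pages) with pages sorted by numeric offset."""
    folders = sorted(p for p in Path(queries_root).iterdir() if p.is_dir())
    for folder in folders:
        pages = [p for p in folder.iterdir() if PAGE_RE.fullmatch(p.name)]
        # Sort numerically so page_100.html follows page_90.html,
        # not page_10.html as a plain string sort would place it.
        pages.sort(key=lambda p: int(PAGE_RE.fullmatch(p.name).group(1)))
        if pages:
            yield folder.name, pages
```

Folders without any saved pages are skipped, so an empty subfolder does not produce an empty result set.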

Tampermonkey (manual alternative)

tampermonkey/open_all_links.js enables manual scraping directly from the browser:

Key K: downloads the current page as HTML.
Mass-open button: opens results in batches of 100.

Install via the Tampermonkey extension and import open_all_links.js. Place downloaded HTMLs inside queries/<subfolder>/ before running main_scrape.py.

Scraping backends

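| Backend | uv flag | Requirement | Notes |
| --- | --- | --- | --- |
| `firefox` (default) | `--with selenium` | Firefox Portable (included) | Uses the pre-configured portable profile |
| `chrome` | `--with nodriver` | Chrome installed on the system | Better bot evasion; drives Chrome directly over CDP instead of WebDriver. Select it with `--backend chrome` |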
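When launching runs from another script, the backend choice and its matching uv dependency can be kept in sync with a small helper. The sketch below only assembles the command lines shown in the Usage section; `build_pipeline_command` is a hypothetical function, not part of the repository.

```python
from typing import List, Optional

# uv dependency that each backend needs, per the backends table.
BACKEND_DRIVERS = {"firefox": "selenium", "chrome": "nodriver"}


def build_pipeline_command(
    query: str,
    backend: str = "firefox",
    max_results: Optional[int] = None,
    indexers: bool = False,
    skip_scrape: bool = False,
    out: Optional[str] = None,
) -> List[str]:
    """Assemble the uv command for run_pipeline.py as an argument list."""
    if backend not in BACKEND_DRIVERS:
        raise ValueError(f"Unknown backend: {backend!r}")
    cmd = ["uv", "run", "--with", BACKEND_DRIVERS[backend], "run_pipeline.py", query]
    if backend != "firefox":  # firefox is the default, so the flag is omitted
        cmd += ["--backend", backend]
    if max_results is not None:
        cmd += ["--max", str(max_results)]
    if indexers:
        cmd.append("--indexers")
    if skip_scrape:
        cmd.append("--skip-scrape")
    if out:
        cmd += ["--out", out]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Building a list rather than a shell string avoids quoting problems with multi-word queries.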

Known issues

| Issue | Cause | Fix |
| --- | --- | --- |
| CAPTCHA on Google Scholar | Too many requests in a short window | Scraper pauses for manual resolution; increase delays between sessions. |
| Crossref 504 error | Service timeout under high load | Scraper retries automatically; reduce query volume if it persists. |
| Excel file not generated | Excel is open during execution | Close Excel before running the scraper. |
| Disk full from Firefox temp files | Firefox does not clean up its temp folder automatically | Run `delete_trash.bat` to clear the temp directory. |
| High RAM usage from Firefox | Selenium + Firefox baseline behavior | No fix yet; optimization is pending. |
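The table notes that Crossref timeouts are retried automatically but does not say how. One common approach is exponential backoff, sketched below. This is an illustration of the general technique, not the scraper's actual retry logic; `TransientError` and `retry_with_backoff` are hypothetical names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Raised by a fetch function for retryable failures such as HTTP 504."""


def retry_with_backoff(
    fetch: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return fetch()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter is injectable so the waits can be recorded or skipped in tests. With the defaults, a call waits 1, 2, and then 4 seconds between its four attempts.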
