Automated academic paper scraper — fetches Google Scholar results, enriches metadata via Crossref, and verifies indexing in WOS and Scopus.
Spanish documentation: README-ES.md
ScholarScraper (Firefox/Chrome) → queries/<folder>/page_N.html
↓
ParserScholarLite → Crossref (DOI) → Scopus API / WOS API
↓
output/<query>_output/<timestamp>/scraped_papers.xlsx
- Python 3.11+ and uv (pip install uv)
- Firefox Portable (included in browser/) with a pre-configured user profile
- Institutional WOS access via Clarivate
- API keys for Scopus Developer and Clarivate WOS
- Install dependencies:

      uv venv
      uv pip install -r requirements.txt

- Create a .env file at the project root:

      SCOPUS_API_KEY=your_scopus_api_key
      WOS_API_KEY=your_wos_api_key
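
As a rough illustration of how the two keys in .env might be read and checked before a run, here is a minimal sketch using only the standard library. The function names `load_env_file` and `missing_keys` are hypothetical and not part of this project; the actual scripts may load the file differently (for example with a dotenv package).

```python
def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse KEY=value lines from a .env-style file into a dict."""
    values: dict[str, str] = {}
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            # Skip blank lines, comments, and anything that is not KEY=value.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str],
                 required=("SCOPUS_API_KEY", "WOS_API_KEY")) -> list[str]:
    """Return the required key names that are absent or empty."""
    return [name for name in required if not values.get(name)]
```

Calling `missing_keys(load_env_file())` before starting a long run would surface an empty or misspelled key early instead of partway through indexer verification.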
run_pipeline.py runs scraping and parsing in a single command:

    # Scrape + parse (Firefox portable, 100 results)
    uv run --with selenium run_pipeline.py "machine learning healthcare"

    # More results with WOS/Scopus indexer verification
    uv run --with selenium run_pipeline.py "deep learning NLP" --max 200 --indexers

    # Chrome backend (better stealth, requires Chrome installed)
    uv run --with nodriver run_pipeline.py "your query" --backend chrome --max 100

    # Re-parse an existing folder without scraping again
    uv run --with selenium run_pipeline.py "your query" --skip-scrape --out MyQuery

Output: output/<query>_output/<timestamp>/scraped_papers.xlsx

    uv run --with selenium scholar_scraper_test.py "your query" --max 100 --out MyQuery

HTML pages are saved to queries/MyQuery/page_0.html, page_10.html, etc.
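
The page_0.html, page_10.html naming suggests that each file is named after the offset of its first result. The sketch below shows that pattern under two assumptions that the README does not confirm: 10 results per page, and a `--max` value that caps the total number of results. `page_filenames` is a hypothetical helper, not a function in this project.

```python
def page_filenames(max_results: int, page_size: int = 10) -> list[str]:
    """List the page files expected for a scrape of max_results results.

    Assumes each file is named after the offset of its first result,
    as in page_0.html, page_10.html.
    """
    if max_results <= 0 or page_size <= 0:
        return []
    return [f"page_{start}.html" for start in range(0, max_results, page_size)]
```

Under those assumptions, `page_filenames(100)` yields ten names from page_0.html to page_90.html, which can be compared against the contents of queries/MyQuery/ to spot pages that failed to download.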
Open browser/FirefoxPortable/FirefoxPortable.exe (includes the institutional profile). Log in to the institutional library portal and access Clarivate WOS. The session may expire — renew it before each run if indexer verification is enabled.

    python main_scrape.py

Iterates over all subfolders in queries/ and writes scraped_papers.xlsx to output/. Runtime can reach 3 hours depending on paper volume.
tampermonkey/open_all_links.js enables manual scraping directly from the browser:
- Key K: downloads the current page as HTML.
- Mass-open button: opens results in batches of 100.
Install via the Tampermonkey extension and import open_all_links.js. Place downloaded HTMLs inside queries/<subfolder>/ before running main_scrape.py.
| Backend | Flag | Requirement | Notes |
|---|---|---|---|
| firefox (default) | --with selenium | Firefox Portable included | Uses pre-configured portable profile |
| chrome | --with nodriver | Chrome installed on the system | Better bot evasion; does not use WebDriver |
| Issue | Cause | Fix |
|---|---|---|
| CAPTCHA on Google Scholar | Too many requests in a short window | Scraper pauses for manual resolution; increase delays between sessions. |
| Crossref 504 error | Service timeout under high load | Scraper retries automatically; reduce query volume if it persists. |
| Excel file not generated | Excel is open during execution | Close Excel before running the scraper. |
| Disk full from Firefox temp files | Firefox does not clean up its temp folder automatically | Run delete_trash.bat to clear the temp directory. |
| High RAM usage from Firefox | Selenium + Firefox baseline behavior | Pending optimization. |
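
The table notes that Crossref timeouts are retried automatically but does not say how. One common approach is retry with exponential backoff, sketched below. This is an illustration under assumptions, not the scraper's actual retry logic: `retry_with_backoff`, its defaults, and the use of `TimeoutError` as the retryable error are all hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(call: Callable[[], T],
                       attempts: int = 4,
                       base_delay: float = 1.0,
                       retry_on: tuple = (TimeoutError,),
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Run call(), retrying on the given exceptions with doubling delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == attempts - 1:
                raise
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between tries.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. If 504 errors persist even with backoff, reducing query volume as the table suggests is the more reliable fix.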