The open dataset and toolkit for global job market data. 3.3M+ live jobs from 400 000+ companies, scraped directly from the ATS platforms where companies actually post. No LinkedIn, no reposts, no recruiters.
```python
from jobhive import search

df = search(query="ml engineer", location="Paris", remote=True)
```

No API key, no auth, no rate limits. The dataset refreshes every 24 hours.
Most job aggregators scrape LinkedIn and Indeed — both full of duplicates, ghost listings, and reposts. jobhive goes one layer down: directly to the ATS platforms (Greenhouse, Lever, Ashby, Workday, BambooHR…) where companies actually post.
- Single source of truth — every row comes from the company's own ATS, so titles, locations, and salaries are accurate.
- No duplicates — one ATS posting = one row; see the spot-check after this list.
- Structured salary when the ATS exposes it (Ashby, Greenhouse Pay Transparency, Lever salaryRange, etc.).
- MIT licensed, fully open — fork the dataset, fork the scrapers.
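The dedup guarantee is easy to verify yourself. A minimal sketch, assuming `search` returns a pandas DataFrame with the schema columns documented below (the quickstart implies it does):

```python
from jobhive import search

# One ATS posting = one row, so global_id should never repeat
# within a slice. Query and ATS here are arbitrary examples.
df = search(query="backend engineer", ats="greenhouse")
print(df["global_id"].duplicated().sum())  # expect 0
```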
| Metric | Value |
|---|---|
| Live jobs | 3 376 000+ |
| Companies | 406 000+ |
| ATS platforms | 31 |
Top 10 by job count:
| ATS | Jobs |
|---|---|
| Bundesagentur (DE public-sector) | 931 049 |
| Workday | 653 041 |
| EURES (EU/EEA public-sector) | 626 783 |
| SmartRecruiters | 213 372 |
| SuccessFactors | 180 499 |
| Greenhouse | 110 071 |
| Oracle HCM | 107 464 |
| iCIMS | 92 211 |
| Lever | 60 342 |
| Phenom | 56 483 |
Counts come from the live manifest at
https://storage.stapply.ai/jobhive/v1/manifest.json — verify any time
with `jobhive list-ats`.
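The manifest is plain JSON, so you can also check it programmatically. A short sketch using `requests`; the manifest's internal layout isn't documented here, so dump it and inspect:

```python
import json

import requests

# Fetch the same live manifest the CLI reads its counts from.
url = "https://storage.stapply.ai/jobhive/v1/manifest.json"
manifest = requests.get(url, timeout=30).json()
print(json.dumps(manifest, indent=2)[:1000])  # peek at the structure
```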
```bash
pip install jobhive-py
```

Distributed as `jobhive-py` on PyPI; the import name is still `jobhive`.

Optional extras:

```bash
pip install "jobhive-py[parquet]"   # faster downloads via Apache Parquet
pip install "jobhive-py[scrapers]"  # build your own pipeline
pip install "jobhive-py[all]"
```

```python
from jobhive import search

# Free-text title + location + remote filter
df = search(query="rust", location="Berlin", remote=True, salary_min=80_000)

# Restrict to one ATS slice (smaller download)
df = search(query="data engineer", ats="ashby")

# Pandas all the way down
df.groupby("company").size().sort_values(ascending=False).head(20)
```

Every row carries:
```text
global_id, url, title, company, ats_type, ats_id,
location, country_iso, region, is_remote, lat, lon,
salary_min, salary_max, salary_currency, salary_period, salary_summary,
employment_type, commitment, experience, department, team,
description, posted_at, fetched_at, language,
requisition_id, apply_url, raw
```
Full per-field semantics (types, defaults, derivation rules, examples)
live in JOB_SCHEMA.md. `global_id` is the
cross-ATS unique key in the form `{ats_type}:{ats_id}`. Optional fields
are `None` when the source ATS doesn't expose them; `raw` keeps any
provider-specific fields the canonical schema doesn't represent.
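Because optional fields come back as `None`, filter before you aggregate. A small sketch of the pattern (query parameters are arbitrary):

```python
from jobhive import search

df = search(query="ml engineer", location="Paris", remote=True)

# Keep only rows where the ATS exposed structured salary, then
# summarize per currency so mixed currencies aren't averaged together.
with_salary = df.dropna(subset=["salary_min", "salary_currency"])
print(with_salary.groupby("salary_currency")["salary_min"].median())
```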
```python
from jobhive.scrapers import GreenhouseScraper, LeverScraper, AshbyScraper

jobs = GreenhouseScraper("anthropic").fetch()  # → list[Job]
jobs = LeverScraper("palantir").fetch()
jobs = AshbyScraper("openai").fetch()
```

Or pick by name:

```python
from jobhive.scrapers import get_scraper

scraper = get_scraper("ashby", "openai")
```

Multi-tenant ATS (pass the company's slug on that ATS):
Greenhouse, Lever, Ashby, SmartRecruiters, Workable,
Rippling, Personio, Gem, JoinCom, iCIMS, JazzHR, Breezy,
Teamtailor, Pinpoint, BambooHR, Cornerstone, Recruitee,
Recruiterbox, Eightfold, Avature, Phenom, Workday, Oracle,
SuccessFactors, Taleo, Mercor.
Custom big-tech APIs (single-tenant, slug ignored): Amazon,
Apple, Google, TikTok, Uber.
National public-sector aggregators: Bundesagentur (DE),
Arbetsformedlingen (SE), Eures (EU/EEA-wide).
Hybrid job boards: WelcomeToTheJungle.
Browser-required (run via Browserbase
remote sessions): Meta, Tesla. Set `JOBHIVE_USE_BROWSERBASE=1`
together with `BROWSERBASE_API_KEY` and `BROWSERBASE_PROJECT_ID` to
enable; without those env vars the scrapers log a warning and skip.
Tesla also needs a Browserbase project that bypasses Akamai (default
sessions are currently 403'd).
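The flags can also be set from Python before fetching; a minimal sketch with placeholder credentials:

```python
import os

# Browserbase-backed scrapers (Meta, Tesla) read these at run time;
# without them they log a warning and skip. Values are placeholders.
os.environ["JOBHIVE_USE_BROWSERBASE"] = "1"
os.environ["BROWSERBASE_API_KEY"] = "<your-browserbase-api-key>"
os.environ["BROWSERBASE_PROJECT_ID"] = "<your-browserbase-project-id>"
```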
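Whatever the category, scrapers share the same interface, so batching tenants is a short loop. A sketch reusing the example slugs from above:

```python
from jobhive.scrapers import get_scraper

# Fetch several tenants across ATSes through the shared interface.
targets = [("greenhouse", "anthropic"), ("lever", "palantir"), ("ashby", "openai")]
for ats, slug in targets:
    jobs = get_scraper(ats, slug).fetch()
    print(f"{ats}/{slug}: {len(jobs)} jobs")
```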
```bash
jobhive search "platform engineer" --location Paris --limit 20
jobhive scrape ashby openai
jobhive list-ats
```

The goal is the largest open-source live job dataset on the internet. That's a forever project, and there's a clear path to make it bigger:
- Add a new ATS scraper — every ATS we don't cover yet is a few
  thousand companies missing from the dataset. The scraper API is
  intentionally tiny: subclass `BaseScraper`, set `ats`, implement
  `fetch()`; see the sketch after this list. Any file under
  `src/jobhive/scrapers/` is a 50-line reference, and the `Job` model
  in `src/jobhive/models.py` defines the schema you populate.
- Improve coverage on an existing ATS — many scrapers extract
  description / salary / employment-type only when the ATS surfaces
  them. If you find a tenant where a field is structurally available
  but we're missing it, a one-line PR is welcome.
- Add new tenants — every supported ATS has a CSV under
  `ats-companies/`. New rows = new companies in the dataset. One-line
  PRs are welcome.
- Report broken scrapers — open an issue with the slug and the failure
  mode. ATS APIs drift; flagging a regression early keeps the dataset
  accurate for everyone.
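To make the shape concrete, here is a skeletal scraper. It is a sketch only: the `BaseScraper` import path, the `self.slug` attribute, the `Job` constructor arguments, and the `exampleats` endpoint are all assumptions; copy an existing file under `src/jobhive/scrapers/` for the real conventions.

```python
import requests

from jobhive.models import Job
from jobhive.scrapers import BaseScraper  # import path assumed; mirror an existing scraper


class ExampleATSScraper(BaseScraper):
    """Sketch for a made-up ATS; the endpoint and JSON field names are hypothetical."""

    ats = "exampleats"

    def fetch(self) -> list[Job]:
        # self.slug is assumed to hold the tenant's slug on this ATS.
        resp = requests.get(
            f"https://api.exampleats.example/v1/{self.slug}/postings", timeout=30
        )
        resp.raise_for_status()
        return [
            Job(  # constructor fields follow JOB_SCHEMA.md; check models.py
                ats_type=self.ats,
                ats_id=str(p["id"]),
                title=p["title"],
                company=self.slug,
                url=p["absolute_url"],
                raw=p,  # keep anything the canonical schema doesn't cover
            )
            for p in resp.json()["postings"]
        ]
```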
```bash
git clone https://github.com/stapply-ai/ats-scrapers
cd ats-scrapers
uv pip install -e ".[dev,scrapers]"
pytest
ruff check .
```

PRs welcome on main. CI is green for all six combinations of {3.11, 3.12, 3.13} ×
{ubuntu, macos}; please keep it that way.
MIT.
Built with Reverse API Engineer.
