An end-to-end data pipeline that collects, parses, and analyzes U.S. IPO performance from 2019–2025.
- Scrapes IPO listings (offer price, ticker, date) from stockanalysis.com
- Finds and downloads official S-1 filings from SEC EDGAR
- Parses structured fields from S-1 text: financials, underwriters, VC history
- Fetches post-IPO stock price performance at 30, 90, and 180 days
- Loads everything into a normalized SQLite database for analysis
| Source | What We Collected | Method |
|---|---|---|
| stockanalysis.com | IPO list, offer price, ticker | Web scraping |
| SEC EDGAR | S-1 filings: financials, underwriters, risk factors | Web scraping |
| yfinance | Post-IPO stock returns at 30/90/180 days | Python library |
Raw scrape: 2,655 IPOs
After removing SPACs: 1,723 IPOs (932 shell companies removed)
After price filter: 1,675 IPOs (48 with no IPO price removed)
After yfinance check: 1,198 IPOs (477 delisted or unavailable removed)
Final with full data: 702 IPOs (have 90-day return data)
SPACs (Special Purpose Acquisition Companies) were excluded because they are shell vehicles with no operating financials to analyze. 2020–2021 were peak SPAC years, accounting for most of the 932 removed.
5-table SQLite schema:
ipos (main table)
├── financials → revenue, net income, profitability
├── vc_history → founding year, funding rounds, total raised
├── price_performance → 30d, 90d, 180d stock returns
└── underwriters → banks that underwrote the IPO (many-to-one)
Row counts after full pipeline:
| Table | Rows |
|---|---|
| ipos | 1,191 |
| financials | 866 |
| underwriters | 2,648 |
| vc_history | 1,036 |
| price_performance | 702 |
| Year | IPOs | Avg 90-Day Return | Notes |
|---|---|---|---|
| 2019 | 124 | -0.3% | |
| 2020 | 159 | +45.8% 🚀 | Pandemic stimulus + low rates |
| 2021 | 319 | -23.2% | Peak SPAC era, rate hike fears begin |
| 2022 | 88 | -39.0% 📉 | Aggressive Fed rate hikes |
| 2023 | 106 | -28.3% | |
| 2024 | 167 | -22.5% | |
| 2025 | 228 | -4.9% | Partial year |
| Bank | Deals | Avg 90-Day Return |
|---|---|---|
| Goldman Sachs | 283 | +1.2% |
| Morgan Stanley | 280 | -1.4% |
| J.P. Morgan | 239 | +1.8% |
| Jefferies | 143 | +12.0% |
| Barclays | 131 | +7.1% |
Jefferies and Barclays outperformed the bulge-bracket banks on 90-day returns, which may reflect more selective deal sourcing at the mid-tier level.
ipo_tracker/
│
├── scrapers/
│ ├── master_list.py # Phase 1: scrape IPO list from stockanalysis.com
│ ├── edgar_scraper.py # Phase 2: find S-1 accession numbers on EDGAR
│ ├── fetch_s1_html.py # Phase 2: download and cache S-1 HTML files
│ ├── parse_s1.py # Phase 2: extract structured fields from S-1 text
│ ├── vc_history_from_s1.py # Phase 3: extract VC/funding history from S-1 text
│ └── fetch_prices.py # Phase 4: pull post-IPO stock price data
│
├── database/
│ ├── schema.sql # table definitions
│ ├── db.py # connection helper
│ └── load_data.py # loads cleaned CSVs into SQLite
│
├── notebooks/
│ ├── clean_master_list.ipynb # Phase 1: cleaning and validation
│ ├── data_quality.ipynb # Phase 5: QA and cross-table checks
│ └── ipo_analytics.ipynb # exploratory analysis and visualizations
│
├── data/
│ └── cleaned/ # cleaned CSVs (raw/ is gitignored due to size)
│ ├── ipo_master.csv
│ ├── ipo_master_validated.csv
│ ├── s1_accessions.csv
│ ├── s1_parsed.csv
│ ├── vc_history.csv
│ ├── price_performance_clean.csv
│ └── ipo_analysis_ready.csv ← final dataset
│
├── .env.example # template for API keys and config
├── requirements.txt
└── logs/ # runtime logs (gitignored)
git clone https://github.com/vedantt17/ipo_tracker.git
cd ipo_trackerpython -m venv .venv
source .venv/bin/activate # Mac/Linux
.venv\Scripts\activate # Windows
pip install -r requirements.txt
playwright install chromiumcp .env.example .env
# edit .env and add your Alpha Vantage keypython scrapers/master_list.py # Phase 1: build IPO universe
python scrapers/edgar_scraper.py # Phase 2: find S-1 accessions
python scrapers/fetch_s1_html.py # Phase 2: download S-1 files
python scrapers/parse_s1.py # Phase 2: parse structured fields
sqlite3 database/ipo_tracker.db < database/schema.sql # initialize DB
python database/load_data.py # Phase 2: load CSVs into SQLite
python scrapers/vc_history_from_s1.py # Phase 3: extract VC history
python scrapers/fetch_prices.py # Phase 4: pull price datajupyter notebook| Challenge | What Happened | Fix |
|---|---|---|
| Crunchbase blocked | Bot detection blocked all scraping attempts | Extracted VC data from S-1 text instead |
| SEC homepage cached | Fetcher saved SEC homepage instead of actual S-1 | Fixed link finder to only accept /Archives/ paths |
| CIK formatting bug | CIK numbers stored as floats broke EDGAR lookups | Converted with str(int(float(cik))).zfill(10) |
| UBS false positives | 3-letter string matched in 983 wrong filings | Removed UBS from underwriter parser, deleted bad rows |
| stockanalysis.com range | 404 errors for years before 2019 | Extended scrape forward to 2025 instead |
For academic research purposes only.