
jigangz/smart-scraper


Smart Scraper

A production-ready web scraping platform with a beautiful dark-themed dashboard UI, built with FastAPI and Next.js.

Features

  • Scrapling-Powered Scraping Engine - Three scraping modes powered by Scrapling:
    • Fast Mode - Pure HTTP fetching via Fetcher (fastest, for static sites)
    • Dynamic Mode - Playwright-based via DynamicFetcher (for JS-rendered pages)
    • Stealth Mode - Max anti-detection via StealthyFetcher (bypasses Cloudflare, WAFs, and bot protection)
    • Auto-fallback: fast → dynamic → stealth if no results found
  • Advanced Anti-Detection - Scrapling's built-in TLS fingerprinting, CDP leak fix, WebRTC leak fix, canvas noise injection, headless bypass, timezone matching, plus 50+ rotating User-Agents, header randomization, proxy rotation, CAPTCHA detection, exponential backoff
  • Beautiful Dashboard - Dark-themed UI built with Next.js, shadcn/ui, Recharts, and Framer Motion
  • Job Scheduling - Run scraping jobs on demand or schedule them with cron expressions
  • Real-time Updates - WebSocket support for live job progress tracking
  • Data Export - Export scraped data as CSV or JSON
  • Fully Dockerized - One command to start everything
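
As an illustration of the Data Export feature, here is a minimal sketch of serializing scraped rows to CSV or JSON with the Python standard library. It is not the project's exporter.py: the function name `export_rows` and the sample row fields are hypothetical, chosen only to show the idea.

```python
import csv
import io
import json
from typing import Dict, List


def export_rows(rows: List[Dict[str, str]], fmt: str = "json") -> str:
    """Serialize a list of dict rows to a CSV or JSON string (illustrative only)."""
    if fmt == "json":
        return json.dumps(rows, indent=2, ensure_ascii=False)
    if fmt == "csv":
        # Collect the union of keys in first-seen order so that rows with
        # missing fields still export with a consistent header.
        fieldnames: List[str] = []
        for row in rows:
            for key in row:
                if key not in fieldnames:
                    fieldnames.append(key)
        buffer = io.StringIO()
        writer = csv.DictWriter(
            buffer, fieldnames=fieldnames, restval="", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(rows)
        return buffer.getvalue()
    raise ValueError(f"unsupported export format: {fmt!r}")


# Hypothetical scraped rows; real result fields depend on the job's selectors.
rows = [
    {"title": "Example A", "price": "9.99"},
    {"title": "Example B", "url": "https://example.com/b"},
]
print(export_rows(rows, "csv"))
```

Missing fields are written as empty cells, so rows scraped from pages with slightly different layouts can share one CSV file.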

Screenshots

The frontend works standalone with realistic mock data: just run npm run dev to see the full UI.

  • Dashboard - Stats, activity charts, recent jobs
  • Jobs - Create, manage, and monitor scraping jobs
  • Results - Search, filter, and export scraped data
  • Settings - Configure proxies, anti-detection, and exports

Tech Stack

Backend

  • FastAPI - Async Python web framework
  • SQLAlchemy + aiosqlite - Async SQLite database
  • Scrapling - Advanced anti-bot scraping framework (3 fetcher modes)
  • httpx - Async HTTP client
  • Patchright + Playwright - Anti-detection headless browsers for JS-heavy pages
  • APScheduler - Job scheduling
  • BeautifulSoup4 - HTML parsing

Frontend

  • Next.js 14 - React framework with App Router
  • shadcn/ui - Radix UI + Tailwind CSS component library
  • Recharts - Charting library
  • Framer Motion - Animations
  • Lucide Icons - Icon set
  • TypeScript - Type safety

Quick Start

With Docker (Recommended)

docker-compose up --build

Manual Setup

Backend

cd backend
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
playwright install chromium
python -m patchright install chromium
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

Frontend

cd frontend
npm install
npm run dev

Open http://localhost:3000 in your browser.

API Endpoints

Method  Endpoint                                  Description
POST    /api/jobs                                 Create a scraping job
GET     /api/jobs                                 List all jobs
GET     /api/jobs/{id}                            Get job details + results
POST    /api/jobs/{id}/run                        Run a job immediately
DELETE  /api/jobs/{id}                            Delete a job
GET     /api/results/{job_id}                     Get scraped results
GET     /api/results/{job_id}/export?format=csv   Export as CSV
GET     /api/results/{job_id}/export?format=json  Export as JSON
GET     /api/stats                                Dashboard statistics
WS      /ws/jobs/{id}                             Real-time job progress
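
The endpoints above can be called from any HTTP client. The sketch below builds (but does not send) a job-creation request and an export URL using only the standard library. The base URL assumes the uvicorn command from Quick Start; the payload fields (`name`, `url`, `mode`) are hypothetical, since the request schema is not documented in this README (see backend/app/api/schemas.py for the real one).

```python
import json
import urllib.request
from urllib.parse import urlencode

# Assumption: backend started with the uvicorn command from Quick Start.
BASE_URL = "http://localhost:8000"


def create_job_request(payload: dict) -> urllib.request.Request:
    """Build a POST /api/jobs request without sending it."""
    return urllib.request.Request(
        f"{BASE_URL}/api/jobs",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def export_url(job_id: int, fmt: str = "csv") -> str:
    """Build the export URL for a job; fmt is 'csv' or 'json' per the table."""
    if fmt not in ("csv", "json"):
        raise ValueError(f"unsupported export format: {fmt!r}")
    return f"{BASE_URL}/api/results/{job_id}/export?{urlencode({'format': fmt})}"


# Hypothetical payload: these field names are placeholders, not the real schema.
req = create_job_request(
    {"name": "demo", "url": "https://example.com", "mode": "fast"}
)
print(req.get_method(), req.full_url)
print(export_url(1, "json"))
```

With the backend running, passing `req` to `urllib.request.urlopen` would send the request; the response shape is likewise defined by the backend and not shown here.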

Scraping Modes

Mode     Fetcher          Best For                                   Speed
Fast     Fetcher          Static HTML pages, APIs                    Fastest
Dynamic  DynamicFetcher   JS-rendered SPAs, infinite scroll          Medium
Stealth  StealthyFetcher  Cloudflare, DataDome, bot-protected sites  Slowest

The engine automatically falls back through modes (fast → dynamic → stealth) if no results are found.
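
The fallback order can be sketched as a loop over the three modes. This is an illustration under assumptions, not the code in engine.py: the fetchers are plain stub callables rather than Scrapling's Fetcher classes, and treating an exception like an empty result is a choice made for this sketch.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Escalation order described above: cheapest mode first.
MODE_ORDER = ["fast", "dynamic", "stealth"]


def scrape_with_fallback(
    url: str,
    fetchers: Dict[str, Callable[[str], List[str]]],
    start_mode: str = "fast",
) -> Tuple[Optional[str], List[str]]:
    """Try each mode from start_mode onward; return the first non-empty result."""
    start = MODE_ORDER.index(start_mode)
    for mode in MODE_ORDER[start:]:
        try:
            results = fetchers[mode](url)
        except Exception:
            # Assumption: a failed fetch is handled like "no results found".
            results = []
        if results:
            return mode, results
    return None, []


# Stub fetchers standing in for the real Scrapling-backed ones.
stub_fetchers = {
    "fast": lambda url: [],                       # static fetch finds nothing
    "dynamic": lambda url: [f"item from {url}"],  # JS rendering succeeds
    "stealth": lambda url: ["not reached in this demo"],
}

mode, results = scrape_with_fallback("https://example.com", stub_fetchers)
print(mode, results)  # dynamic ['item from https://example.com']
```

Starting with the cheapest mode keeps most jobs fast and reserves the slower browser-based modes for pages that actually need them.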

Anti-Detection Features

Scrapling built-in:

  • TLS fingerprint mimicry
  • CDP (Chrome DevTools Protocol) leak fix
  • WebRTC leak prevention
  • Canvas fingerprint noise injection
  • Headless browser detection bypass
  • Timezone and locale matching
  • Adaptive element tracking (resilient to site layout changes)

Additional layers:

  • 50+ real browser User-Agent strings
  • Randomized request headers (Accept-Language, Accept-Encoding, Sec-Fetch-*)
  • Configurable request delays (2-5s default)
  • HTTP/SOCKS5 proxy support with rotation
  • CAPTCHA detection (reCAPTCHA, hCaptcha, Cloudflare)
  • Referer chain simulation
  • Exponential backoff with configurable retries
  • Cookie and session management
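
A few of the additional layers above (request delays, header randomization, exponential backoff) can be sketched with the standard library. This is a minimal illustration, not the contents of anti_detect.py: the function names, the short User-Agent list, the jitter strategy, and the backoff base and cap are all assumptions; only the 2-5s default delay comes from the list above.

```python
import random

# Placeholder pool; the project describes 50+ strings, three are shown here.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]


def random_headers(rng=random) -> dict:
    """Pick a User-Agent and vary a few common headers per request."""
    return {
        "User-Agent": rng.choice(USER_AGENTS),
        "Accept-Language": rng.choice(["en-US,en;q=0.9", "en-GB,en;q=0.8"]),
        "Accept-Encoding": "gzip, deflate, br",
    }


def polite_delay(low: float = 2.0, high: float = 5.0, rng=random) -> float:
    """Random pause between requests, in seconds (2-5s default)."""
    return rng.uniform(low, high)


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0, rng=random) -> float:
    """Exponential backoff with full jitter (jitter is an assumption here).

    The ceiling doubles with each attempt, up to cap; the actual wait is
    drawn uniformly from [0, ceiling] to avoid synchronized retries.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


for attempt in range(4):
    print(f"retry {attempt}: wait up to {min(60.0, 2.0 * 2 ** attempt):.0f}s")
```

In a retry loop, the caller would sleep for `backoff_delay(attempt)` after each failed request and give up once a configured retry limit is reached.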

Project Structure

smart-scraper/
├── backend/
│   ├── app/
│   │   ├── main.py              # FastAPI app entry
│   │   ├── api/
│   │   │   ├── routes.py        # All API endpoints
│   │   │   └── schemas.py       # Pydantic models
│   │   ├── scraper/
│   │   │   ├── engine.py        # Core scraping engine
│   │   │   ├── anti_detect.py   # Anti-detection system
│   │   │   ├── parsers.py       # HTML parsing
│   │   │   └── scheduler.py     # Job scheduling
│   │   ├── db/
│   │   │   ├── database.py      # SQLite setup
│   │   │   └── models.py        # DB models
│   │   └── export/
│   │       └── exporter.py      # CSV/JSON export
│   ├── requirements.txt
│   └── Dockerfile
├── frontend/
│   ├── src/
│   │   ├── app/                 # Next.js pages
│   │   ├── components/          # React components
│   │   └── lib/                 # Utilities + API client
│   ├── package.json
│   └── Dockerfile
├── docker-compose.yml
└── README.md

Roadmap

See ROADMAP.md for planned features:

  • Phase 1 - JS Interactions (click, scroll, load more)
  • Phase 2 - Cookie/Session Injection (authenticated scraping)
  • Phase 3 - CAPTCHA Solving Service (2Captcha, CapSolver)
  • Phase 4 - LLM Auto-Selector (auto-generate CSS selectors)
  • Phase 5 - Site Template Library (pre-built configs for common sites)
  • Phase 6 - Webhook & Notifications

License

MIT
