iamanishroy/oncrawl

Oncrawl

A standalone, lightweight web scraping API to extract markdown content and categorized links from any URL.

Inspired by deepcrawl.

Deployable anywhere: Cloud Run, Vercel Serverless, Cloudflare Workers, or bare Node.js.

Quick Start

# Install dependencies
npm install

# Set your API key
export ONCRAWL_API_KEY=your-secret-key

# Start the server
npm run dev

The server starts on http://localhost:3000.

API Endpoints

Health Check

curl http://localhost:3000/

Read Endpoint

Extract markdown content and metadata from a URL.

GET (returns markdown text):

curl -H "Authorization: Bearer your-api-key" \
  "http://localhost:3000/read?url=https://example.com"
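When the target URL carries its own query string, the `url` parameter generally needs percent-encoding so that characters like `&` and `?` are not read as part of the Oncrawl request. The sketch below builds a GET URL for either endpoint. `buildGetUrl` is a hypothetical client-side helper, not part of Oncrawl, and it assumes the server decodes the query parameter in the standard way.

```typescript
// Hypothetical helper: builds a GET URL for the /read or /links endpoint.
// Assumes the server decodes the `url` query parameter in the standard way.
function buildGetUrl(
  baseUrl: string,
  endpoint: "read" | "links",
  target: string
): string {
  // Drop trailing slashes so the path is not doubled up.
  const base = baseUrl.replace(/\/+$/, "");
  // Percent-encode the target so its own "?" and "&" survive intact.
  return `${base}/${endpoint}?url=${encodeURIComponent(target)}`;
}

const readUrl = buildGetUrl(
  "http://localhost:3000",
  "read",
  "https://example.com/search?q=a&b=1"
);
// readUrl: http://localhost:3000/read?url=https%3A%2F%2Fexample.com%2Fsearch%3Fq%3Da%26b%3D1
```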

POST (returns JSON with options):

curl -X POST -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com","metadata":true}' \
  http://localhost:3000/read

POST body options:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| url | string | required | URL to scrape |
| markdown | boolean | true | Include markdown content |
| cleanedHtml | boolean | false | Include cleaned HTML |
| rawHtml | boolean | false | Include raw HTML |
| metadata | boolean | true | Include page metadata |
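For callers working from Node.js or a browser rather than curl, the same POST request can be assembled in TypeScript. This is a minimal sketch: `ReadOptions`, `PreparedRequest`, and `buildReadRequest` are hypothetical names introduced for illustration. Only the field names, the Bearer header, and the `/read` path come from the documentation above.

```typescript
// Hypothetical client-side types mirroring the POST body options listed above.
interface ReadOptions {
  url: string;
  markdown?: boolean;
  cleanedHtml?: boolean;
  rawHtml?: boolean;
  metadata?: boolean;
}

// Plain description of an HTTP request, independent of any HTTP library.
interface PreparedRequest {
  endpoint: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds the pieces of a POST /read request without sending it.
function buildReadRequest(
  baseUrl: string,
  apiKey: string,
  options: ReadOptions
): PreparedRequest {
  return {
    endpoint: `${baseUrl.replace(/\/+$/, "")}/read`,
    method: "POST",
    headers: {
      "Authorization": `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    // Omitted fields are left out so the server applies its own defaults.
    body: JSON.stringify(options),
  };
}

const readRequest = buildReadRequest("http://localhost:3000", "your-api-key", {
  url: "https://example.com",
  rawHtml: true,
});
```

The result can be handed to any HTTP client, for example `fetch(readRequest.endpoint, { method: readRequest.method, headers: readRequest.headers, body: readRequest.body })`.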

Links Endpoint

Extract and categorize links from a URL.

GET (returns internal links):

curl -H "Authorization: Bearer your-api-key" \
  "http://localhost:3000/links?url=https://example.com"

POST (returns JSON with options):

curl -X POST -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com","includeExternal":true,"includeMedia":true}' \
  http://localhost:3000/links

POST body options:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| url | string | required | URL to extract links from |
| includeExternal | boolean | false | Include external links |
| includeMedia | boolean | false | Include media (images, videos, documents) |
| metadata | boolean | true | Include page metadata |
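To make the defaults concrete, the following sketch shows the options that would be in effect for a given request body. It mirrors the documented defaults on the client side only. `LinksOptions` and `withLinksDefaults` are hypothetical names, and the server's own handling may differ in detail.

```typescript
// Hypothetical client-side mirror of the /links POST body options.
interface LinksOptions {
  url: string;
  includeExternal?: boolean;
  includeMedia?: boolean;
  metadata?: boolean;
}

// Fills in the documented defaults for any field the caller left out.
function withLinksDefaults(options: LinksOptions): Required<LinksOptions> {
  return {
    url: options.url,
    includeExternal: options.includeExternal ?? false,
    includeMedia: options.includeMedia ?? false,
    metadata: options.metadata ?? true,
  };
}

const effective = withLinksDefaults({
  url: "https://example.com",
  includeMedia: true,
});
// effective.includeExternal is false, effective.includeMedia is true,
// effective.metadata is true
```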

Environment Variables

| Variable | Description | Required |
| --- | --- | --- |
| ONCRAWL_API_KEY | API key for authentication | Yes |
| PORT | Server port (default: 3000) | No |
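One way to read these variables at startup is sketched below. It illustrates the rules in the table (required key, port defaulting to 3000) and is not Oncrawl's actual startup code. `OncrawlConfig` and `loadConfig` are hypothetical names, and the port range check is an added assumption.

```typescript
// Hypothetical config shape; not taken from the Oncrawl source.
interface OncrawlConfig {
  apiKey: string;
  port: number;
}

// Reads settings from an environment-like object (e.g. process.env in Node.js).
function loadConfig(env: Record<string, string | undefined>): OncrawlConfig {
  const apiKey = env["ONCRAWL_API_KEY"];
  if (!apiKey) {
    throw new Error("ONCRAWL_API_KEY is required");
  }

  // PORT is optional and falls back to 3000 when unset or empty.
  const rawPort = env["PORT"];
  const port = rawPort === undefined || rawPort === "" ? 3000 : Number(rawPort);

  // Assumption: reject anything that is not a whole number in 1..65535.
  if (!(port >= 1 && port <= 65535) || Math.floor(port) !== port) {
    throw new Error(`Invalid PORT: ${rawPort}`);
  }

  return { apiKey, port };
}
```

In a Node.js entry point this would be called as `loadConfig(process.env)`, so a missing key fails fast at startup.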

Deployment

Vercel (One-Click)

The fastest way to deploy is the Deploy to Vercel button, which prompts you for the required ONCRAWL_API_KEY.

Deploy with Vercel

Manual Deployment

Vercel CLI

Oncrawl is built with Hono, and Vercel automatically detects it as a Hono project.

vercel

Docker

Cloud Run

gcloud run deploy oncrawl \
  --source . \
  --set-env-vars ONCRAWL_API_KEY=your-key

License

MIT
