URL Metadata Extraction Service

A Docker-based service that extracts metadata (title, description, keywords) from URLs using headless Chrome and Playwright. The code is based on karakeep, simplified to provide only REST endpoints for metadata and content extraction.

Features:

Rich metadata extraction via metascraper (title, description, images, author, dates, favicon with automatic compression)
Site-specific plugins for YouTube, Amazon, X/Twitter, Spotify, SoundCloud
Readable content extraction via /content endpoint (Readability.js + DOMPurify)
SSRF protection with DNS caching and IP range validation
Bot detection evasion and GDPR consent handling
Rate limiting
Optional API key authentication
OpenAPI 3.1 documentation with Swagger UI
Production-ready with Traefik and Cloudflare Zero Trust tunnel support

Not implemented:

Reusing a persistent browser connection
Page screenshots
Archiving
Saving to PDF

Quick Start

# Start the service
docker compose up -d

# Extract metadata from a URL
curl -X POST http://localhost:3000/process \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

# Extract content from a URL
curl -X POST http://localhost:3000/content \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

# With API key authentication (if API_KEY is set)
curl -X POST http://localhost:3000/process \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-secret-api-key" \
  -d '{"url": "https://example.com"}'

# View API documentation
open http://localhost:3000/docs
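
For callers that prefer code over curl, the requests above can be sketched in TypeScript. This is a minimal sketch, not part of the service: `buildExtractionRequest` and its types are hypothetical names, and the helper only assembles the arguments for `fetch()`, because the response schema is not described here (see `/docs` for that).

```typescript
// Hypothetical helper (not part of the service): builds fetch() arguments
// for the /process and /content endpoints shown in the curl examples.
type Endpoint = "/process" | "/content";

interface ExtractionRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

function buildExtractionRequest(
  baseUrl: string,
  endpoint: Endpoint,
  targetUrl: string,
  apiKey?: string
): ExtractionRequest {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  // Send the Authorization header only when a key is supplied
  // (needed when the server has API_KEY set).
  if (apiKey) headers["Authorization"] = `Bearer ${apiKey}`;
  return {
    // Strip trailing slashes so the joined URL has exactly one separator.
    url: baseUrl.replace(/\/+$/, "") + endpoint,
    init: { method: "POST", headers, body: JSON.stringify({ url: targetUrl }) },
  };
}
```

A call would then look like `const { url, init } = buildExtractionRequest("http://localhost:3000", "/process", "https://example.com"); const res = await fetch(url, init);`.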

Architecture

┌─────────────────┐     ┌──────────────────┐
│   API Service   │────▶│  Chrome Browser  │
│   (Node.js)     │     │  (alpine-chrome) │
│   Port: 3000    │     │  Port: 9222      │
└─────────────────┘     └──────────────────┘

API Service: Hono web server that receives URL requests and returns extracted metadata
Chrome Browser: Headless Chrome instance accessed via Chrome DevTools Protocol (CDP)

Overview:

Playwright connects via CDP to an external Chrome instance rather than launching a bundled Chromium
Each page fetch uses an isolated browser context, closed after use
All URLs are validated against private IP ranges before fetching (SSRF protection)
Sub-requests during page load are also validated and blocked if targeting forbidden IPs
Metascraper plugins are ordered: site-specific (YouTube, Amazon, X) run before generic extractors
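
The IP range validation mentioned above can be illustrated with a small IPv4-only check. This is a sketch under stated assumptions, not the service's actual code: the range list and the names `FORBIDDEN_IPV4_RANGES`, `ipv4ToInt`, and `isForbiddenIpv4` are hypothetical, and IPv6 handling and DNS resolution are left out.

```typescript
// Illustrative blocklist of IPv4 CIDR ranges (assumption: the real list may differ).
const FORBIDDEN_IPV4_RANGES: ReadonlyArray<[string, number]> = [
  ["0.0.0.0", 8], // "this network"
  ["10.0.0.0", 8], // RFC 1918 private
  ["100.64.0.0", 10], // carrier-grade NAT
  ["127.0.0.0", 8], // loopback
  ["169.254.0.0", 16], // link-local, includes cloud metadata addresses
  ["172.16.0.0", 12], // RFC 1918 private
  ["192.168.0.0", 16], // RFC 1918 private
];

// Parse a dotted-quad IPv4 address into an unsigned 32-bit integer,
// or return null if the string is not a well-formed address.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

// True when the address falls inside any forbidden range.
// Unparseable input is treated as forbidden (fail closed).
function isForbiddenIpv4(ip: string): boolean {
  const addr = ipv4ToInt(ip);
  if (addr === null) return true;
  return FORBIDDEN_IPV4_RANGES.some(([base, prefix]) => {
    const baseInt = ipv4ToInt(base) as number;
    // Two addresses share a /prefix network when they land in the same block.
    const blockSize = 2 ** (32 - prefix);
    return Math.floor(addr / blockSize) === Math.floor(baseInt / blockSize);
  });
}
```

A check like this has to run against the addresses a hostname resolves to, not just the literal URL, since a public hostname can resolve to a private address.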

Environment Variables

Variable	Default	Description
`BROWSER_WEB_URL`	`http://chrome.localhost:9222`	Chrome CDP endpoint
`FETCH_TIMEOUT_MS`	`30000`	Page fetch timeout in milliseconds
`PORT`	`3000`	API server port
`DNS_RESOLVER_TIMEOUT_MS`	`3000`	DNS lookup timeout in milliseconds
`RATE_LIMIT_REQUESTS`	`5`	Max requests per client IP per window
`RATE_LIMIT_WINDOW_MS`	`60000`	Rate limit window in milliseconds
`CONSENT_COOKIES_PATH`	`config/consent-cookies.json`	Cookie bypass config
`API_KEY`	(none)	Optional API key; if set, requires `Authorization: Bearer <key>` header
`DOCS_USERNAME`	(none)	Optional basic auth username for `/doc` and `/docs` endpoints
`DOCS_PASSWORD`	(none)	Optional basic auth password for `/doc` and `/docs` endpoints
`FAVICON_SIZE`	`32`	Target max dimension in pixels for favicon compression
`FAVICON_MAX_SIZE_BYTES`	`3072`	Max favicon size in bytes (3 KB); larger favicons fall back to the URL
`FAVICON_OUTPUT_FORMAT`	`png`	Output format for favicon (`png` or `webp`)
`FAVICON_FETCH_TIMEOUT_MS`	`5000`	Timeout for fetching favicon URLs, in milliseconds
`CLOUDFLARE_TUNNEL_TOKEN`	(none)	Cloudflare Tunnel token for production deployment
`API_HOST`	(none)	Public hostname for Traefik routing (e.g., `metadata.yourdomain.com`)
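
As an illustration of how defaults like these are typically applied, here is a minimal sketch of a config loader covering a few of the variables. The names `ServiceConfig`, `intFromEnv`, and `loadConfig` are hypothetical, and the fallback-on-invalid-value behaviour is an assumption; the service's own parsing may differ.

```typescript
// Hypothetical config shape covering a subset of the variables in the table.
interface ServiceConfig {
  browserWebUrl: string;
  fetchTimeoutMs: number;
  port: number;
  rateLimitRequests: number;
  rateLimitWindowMs: number;
  apiKey?: string;
}

// Read a positive integer, using the fallback when the variable is unset,
// empty, or not a positive integer (assumed behaviour).
function intFromEnv(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

// The env map is passed in so the function can be exercised without
// touching the real process environment.
function loadConfig(env: Record<string, string | undefined>): ServiceConfig {
  return {
    browserWebUrl: env["BROWSER_WEB_URL"] || "http://chrome.localhost:9222",
    fetchTimeoutMs: intFromEnv(env["FETCH_TIMEOUT_MS"], 30000),
    port: intFromEnv(env["PORT"], 3000),
    rateLimitRequests: intFromEnv(env["RATE_LIMIT_REQUESTS"], 5),
    rateLimitWindowMs: intFromEnv(env["RATE_LIMIT_WINDOW_MS"], 60000),
    // An empty API_KEY is treated as "authentication disabled".
    apiKey: env["API_KEY"] || undefined,
  };
}
```

In Node.js this would be called as `loadConfig(process.env)`.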

Production Deployment with Cloudflare Zero Trust

The service includes Traefik reverse proxy and Cloudflare Tunnel for secure production deployment. Local port access (:3000) remains available for development.

Architecture

Local dev:   localhost:3000 → API Container → Chrome Container

Production:  Internet → Cloudflare Tunnel → Traefik → API Container → Chrome Container

Setup Instructions

Go to Cloudflare Zero Trust dashboard
Navigate to Networks → Tunnels → Create a tunnel
Select Cloudflared connector type
Name your tunnel (e.g., metadata-extractor)
Copy the tunnel token and add it to your .env file:
```
CLOUDFLARE_TUNNEL_TOKEN=your-token-here
```
In the tunnel configuration, add a Public Hostname:
- Subdomain: your choice (e.g., metadata)
- Domain: select your Cloudflare domain
- Service Type: HTTP
- URL: traefik:80
Update API_HOST in .env to match your hostname:
```
API_HOST=metadata.yourdomain.com
```
Start the service:
```
docker compose up -d
```

Verification

# Check all containers are running
docker compose ps

# Test local access
curl http://localhost:3000/health

# Test tunnel access
curl https://metadata.yourdomain.com/health

Privacy

No Data Persistence

The service is stateless — it doesn't store fetched content, extracted metadata, or user requests
Each page fetch creates an isolated browser context that is closed immediately after use
No databases, caches, or logs retain user-submitted URLs or extracted data

Security Measures That Support Privacy

SSRF Protection — Validates all URLs against private/internal IP ranges before fetching. Sub-requests during page load are also validated and blocked if targeting forbidden IPs.
Rate Limiting — Sliding window rate limiting per IP prevents abuse (default: 5 requests per 60-second window).
Optional API Key Authentication — When API_KEY is set, requires Authorization: Bearer <key> header to prevent unauthorized access.
Content Sanitization — Uses DOMPurify to sanitize extracted content, removing potentially malicious scripts.
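
The sliding window rate limiting described above can be sketched as follows. This is an in-memory illustration under assumptions, not the service's implementation: the class name `SlidingWindowLimiter` and its `allow` method are hypothetical, and the defaults mirror `RATE_LIMIT_REQUESTS` and `RATE_LIMIT_WINDOW_MS`.

```typescript
// Hypothetical in-memory sliding-window limiter keyed by client IP.
class SlidingWindowLimiter {
  private readonly hits = new Map<string, number[]>();
  private readonly maxRequests: number;
  private readonly windowMs: number;

  constructor(maxRequests = 5, windowMs = 60_000) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
  }

  // Returns true and records the request if the client is under the limit;
  // returns false (without recording) if the window is already full.
  allow(clientIp: string, now: number = Date.now()): boolean {
    const cutoff = now - this.windowMs;
    // Keep only the timestamps that still fall inside the window.
    const recent = (this.hits.get(clientIp) ?? []).filter((t) => t > cutoff);
    const allowed = recent.length < this.maxRequests;
    if (allowed) recent.push(now);
    this.hits.set(clientIp, recent);
    return allowed;
  }
}
```

With the defaults, the sixth request from one IP inside any 60-second span is rejected, and capacity returns as older timestamps age out of the window. A production version would also evict idle IPs so the map does not grow without bound.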

What the Service Does Access

Fetches the provided URL using headless Chrome
May set consent cookies to bypass cookie dialogs (configured via consent-cookies.json)
Uses an adblocker to reduce tracking during page fetches

Recommendations for Operators

Deploy behind a reverse proxy with TLS
Set API_KEY to restrict access
Consider network isolation for the Chrome container

TODOs

Prometheus metrics - Request counts, latencies, error rates
Structured logging - Replace console.log with structured JSON logging for easier debugging and monitoring
Tests - Unit tests for extractor, integration tests for API
Bruno - Add a local API client (https://www.usebruno.com/)

Known Issues

YouTube descriptions are generic - Returns "Enjoy the videos and music you love..." instead of the actual video description. YouTube loads this dynamically via JS; it may need a longer wait or a different extraction strategy.
Some sites still detect the bot - The stealth plugin helps but isn't perfect. Sites like LinkedIn and Instagram may still block requests.
Memory usage unknown - No profiling has been done. A long-running service with the adblocker may accumulate memory.
