WaybackDiff - Wayback Machine Terms of Service Change Report Generator
A Python CLI tool that fetches historical snapshots of web pages from the Internet Archive and generates HTML diff reports showing all content changes over time.
Legal, compliance, and research professionals need to track changes to Terms of Service, Privacy Policies, and other legal documents over time. Manually comparing archived versions is tedious and error-prone. This tool automates the process of:
- Finding all archived versions of a URL
- Identifying when content actually changed (vs. cosmetic updates)
- Generating a readable, navigable diff report
A Python script that:
- Queries the Wayback Machine CDX API for all snapshots of a target URL
- Deduplicates by content hash to find truly unique versions
- Fetches and extracts text content from each unique version
- Generates unified diffs between consecutive versions
- Produces a comprehensive HTML report with timeline, statistics, and color-coded diffs
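The first two steps above can be sketched as follows. This is a minimal illustration rather than the tool's actual code: the function names (`build_cdx_url`, `unique_versions`, `snapshot_url`) are hypothetical, the sample rows are made up, and the sketch assumes the CDX API is queried with `output=json`, which returns a header row followed by one row per capture.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(target_url):
    """Build a CDX query that returns JSON rows for successful captures."""
    params = {
        "url": target_url,
        "output": "json",
        "fl": "timestamp,original,digest",  # fields to return per capture
        "filter": "statuscode:200",         # skip redirects and errors
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def unique_versions(rows):
    """Keep the first capture seen for each content digest.

    `rows` is the decoded JSON response: a header row, then data rows.
    """
    if not rows:
        return []
    header, *captures = rows
    seen = set()
    versions = []
    for capture in captures:
        record = dict(zip(header, capture))
        if record["digest"] not in seen:
            seen.add(record["digest"])
            versions.append(record)
    return versions


def snapshot_url(record):
    """URL of the archived page for one capture."""
    return f"https://web.archive.org/web/{record['timestamp']}/{record['original']}"


# Made-up sample rows in the shape returned with output=json.
sample_rows = [
    ["timestamp", "original", "digest"],
    ["20200101000000", "https://example.com/terms", "AAA"],
    ["20200201000000", "https://example.com/terms", "AAA"],
    ["20200301000000", "https://example.com/terms", "BBB"],
]
versions = unique_versions(sample_rows)
for version in versions:
    print(version["timestamp"], snapshot_url(version))
```

Deduplicating across the whole history, as here, means a page that reverts to an earlier text is not reported as a new version. Comparing only adjacent digests would catch such reverts, at the cost of more versions to fetch.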
Intended users:
- Legal professionals tracking contract and policy changes
- Compliance teams monitoring vendor terms
- Researchers studying policy evolution
- Anyone who needs historical documentation of web page changes
The tool is considered successful if it:
- Retrieves snapshots from the Wayback Machine for any valid URL that has been archived
- Accurately identifies content changes while ignoring layout and styling changes
- Produces a readable, navigable HTML report
- Handles rate limiting and network errors gracefully
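One way the last criterion could be met is a small retry wrapper around each request. The sketch below is an assumption, not the tool's actual code: `fetch_with_retry`, its parameters, and the growing wait time are hypothetical, and `fetch` stands for any callable that returns a response body or raises, such as a thin wrapper around `requests.get`. A simulated flaky fetcher is used so the example runs without network access.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.5, sleep=time.sleep):
    """Call fetch(url), retrying with a growing pause after each failure."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # a real tool would catch narrower types
            last_error = error
            if attempt < retries:
                sleep(delay * attempt)  # wait longer after each failure
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error


# Simulated fetcher that fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network error")
    return f"<html>content of {url}</html>"


waits = []  # record the pauses instead of actually sleeping
body = fetch_with_retry(flaky_fetch, "https://example.com/terms", sleep=waits.append)
print(body)
print(waits)
```

Passing `sleep` as a parameter keeps the wrapper testable without real delays.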
Platform: CLI tool (Python script)
Language: Python 3.x
Dependencies:
- requests: HTTP client for API calls and content fetching
- beautifulsoup4: HTML parsing and text extraction
- Standard library: difflib, json, datetime, re
Constraints:
- Throughput is limited by the Wayback Machine API; the tool waits 1.5 s between requests
- Only pages that the Internet Archive has archived can be analyzed
- Text extraction quality varies with page structure
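To illustrate why extraction quality depends on page structure, here is a dependency-free sketch of text extraction that drops navigation, header, footer, script, and style content. It uses the standard library's `html.parser` as a stand-in for beautifulsoup4, and the names `TextExtractor`, `extract_text`, and `SKIPPED_TAGS` are hypothetical. Pages that mark up their navigation with generic `<div>` elements would slip past a tag-based filter like this one.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = {"nav", "header", "footer", "script", "style"}


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the page text, one line per text fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample_html = """
<html><head><style>p { color: red; }</style></head>
<body>
<nav><a href="/">Home</a></nav>
<h1>Terms of Service</h1>
<p>You must be   13 or older.</p>
<script>track();</script>
<footer>Copyright Example</footer>
</body></html>
"""
text = extract_text(sample_html)
print(text)
```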
In scope:
- Query the Wayback Machine CDX API for snapshots
- Deduplicate by content hash (digest)
- Fetch archived page content
- Extract text from HTML (strip nav/header/footer/scripts)
- Generate unified diffs between versions
- Produce a styled HTML report with:
  - Executive summary with statistics
  - Version timeline with links to archived versions
  - Color-coded diff sections (green = added, red = removed)
  - Table of contents for navigation
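The diff and color-coding steps could look roughly like the sketch below, which uses `difflib.unified_diff` from the standard library. The helper names (`diff_lines`, `render_diff_html`) and the CSS class names are assumptions made for illustration; a stylesheet would map `added` to green and `removed` to red.

```python
import difflib
import html


def diff_lines(old_text, new_text, old_label, new_label):
    """Unified diff between two extracted texts, as a list of lines."""
    return list(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile=old_label, tofile=new_label, lineterm="",
    ))


def render_diff_html(lines):
    """Wrap each diff line in a <div> whose class names its role."""
    rendered = []
    for index, line in enumerate(lines):
        if index < 2:
            css_class = "header"   # the '---' and '+++' file lines
        elif line.startswith("@@"):
            css_class = "hunk"
        elif line.startswith("+"):
            css_class = "added"
        elif line.startswith("-"):
            css_class = "removed"
        else:
            css_class = "context"
        rendered.append(f'<div class="{css_class}">{html.escape(line)}</div>')
    return "\n".join(rendered)


old = "Terms of Service\nYou must be 13 or older."
new = "Terms of Service\nYou must be 16 or older."
lines = diff_lines(old, new, "2020-01-01", "2020-03-01")
report = render_diff_html(lines)
print(report)
```

Escaping each line with `html.escape` matters here because the extracted text may itself contain characters such as `<` or `&`.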
Out of scope:
- Graphical user interface (GUI)
- Multiple URL comparison
- PDF/Markdown export formats
- Database storage of results
- Scheduled/automated monitoring
- Email notifications for changes
Open questions:
- Should the tool accept command-line arguments instead of a hardcoded configuration?
- Should it support comparing specific date ranges or versions?
- Should it handle JavaScript-rendered pages? This would require a headless browser.
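If the first two questions are answered with yes, the interface might look something like this `argparse` sketch. Every option name here (`--from`, `--to`, `--delay`, `--output`) and the program name are hypothetical. The date options are assumed to map onto the CDX API's `from` and `to` query parameters.

```python
import argparse


def build_parser():
    """Hypothetical command-line interface for the report generator."""
    parser = argparse.ArgumentParser(
        prog="waybackdiff",
        description="Generate an HTML change report from Wayback Machine snapshots.",
    )
    parser.add_argument("url", help="page to track, e.g. https://example.com/terms")
    parser.add_argument("--from", dest="from_date", metavar="YYYYMMDD",
                        help="earliest snapshot date to include")
    parser.add_argument("--to", dest="to_date", metavar="YYYYMMDD",
                        help="latest snapshot date to include")
    parser.add_argument("--delay", type=float, default=1.5,
                        help="seconds to wait between requests")
    parser.add_argument("--output", default="report.html",
                        help="path of the generated HTML report")
    return parser


# Parse an explicit argument list so the example runs without a shell.
args = build_parser().parse_args(
    ["https://example.com/terms", "--from", "20200101", "--to", "20201231"]
)
print(args.url, args.from_date, args.to_date, args.delay, args.output)
```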