Project Overview

Project Name

WaybackDiff - Wayback Machine Terms of Service Change Report Generator

One-Line Summary

A Python CLI tool that fetches historical snapshots of web pages from the Internet Archive and generates HTML diff reports showing all content changes over time.

Problem Statement

Legal, compliance, and research professionals need to track changes to Terms of Service, Privacy Policies, and other legal documents over time. Manually comparing archived versions is tedious and error-prone. This tool automates the process of:

Finding all archived versions of a URL
Identifying when content actually changed (vs. cosmetic updates)
Generating a readable, navigable diff report

Proposed Solution

A Python script that:

Queries the Wayback Machine CDX API for all snapshots of a target URL
Deduplicates snapshots by content hash (the CDX digest field) so that only versions with distinct content remain
Fetches and extracts text content from each unique version
Generates unified diffs between consecutive versions
Produces a comprehensive HTML report with timeline, statistics, and color-coded diffs
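
The first two steps above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function names build_cdx_url and unique_snapshots are hypothetical, and the sketch assumes the CDX server's JSON output, which is a header row (urlkey, timestamp, original, mimetype, statuscode, digest, length) followed by one row per snapshot.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(target_url):
    """Build a CDX query URL asking for JSON output of successful captures."""
    params = {"url": target_url, "output": "json", "filter": "statuscode:200"}
    return CDX_ENDPOINT + "?" + urlencode(params)


def unique_snapshots(cdx_rows):
    """Keep the earliest snapshot for each distinct content digest.

    cdx_rows is the decoded JSON response: a header row, then data rows.
    """
    if not cdx_rows:
        return []
    header, *rows = cdx_rows
    timestamp_col = header.index("timestamp")
    seen = set()
    unique = []
    # Timestamps are 14-digit strings (YYYYMMDDhhmmss), so string order is time order.
    for row in sorted(rows, key=lambda r: r[timestamp_col]):
        record = dict(zip(header, row))
        if record["digest"] in seen:
            continue  # same content as an earlier snapshot
        seen.add(record["digest"])
        # Link to the archived copy, usable later in the version timeline.
        record["archive_url"] = "https://web.archive.org/web/{}/{}".format(
            record["timestamp"], record["original"]
        )
        unique.append(record)
    return unique
```

Note that a global "seen" set drops a version that reverts to earlier content (A, B, A keeps only A and B). If reverts should appear in the report, comparing each digest only with the previous one would be the alternative.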

Target Users

Legal professionals tracking contract/policy changes
Compliance teams monitoring vendor terms
Researchers studying policy evolution
Anyone needing historical documentation of web page changes

Success Criteria

Retrieves snapshots from the Wayback Machine for any valid URL that the Internet Archive has archived
Identifies changes to the page's text content and ignores layout-only or styling-only changes
Produces a readable, navigable HTML report
Handles rate limiting and network errors gracefully

Technical Context

Platform/Environment

CLI tool (Python script)

Primary Language/Framework

Python 3.x

Key Dependencies

requests - HTTP client for API calls and content fetching
beautifulsoup4 - HTML parsing and text extraction
Standard library: difflib, json, datetime, re

Constraints

Rate limited by the Wayback Machine API (1.5 s delay between requests)
Depends on pages having been archived by the Internet Archive
Text extraction quality varies based on page structure
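
One way to respect the rate limit and survive transient network errors is sketched below. It is an assumption about how the delay could be applied, not a description of the existing script: fetch_politely is a hypothetical helper, the retry count and backoff schedule are illustrative, and the fetch callable is injected so that the pacing logic can be exercised without network access.

```python
import time


def fetch_politely(fetch, urls, delay=1.5, retries=3, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between requests and retrying failures.

    fetch is any callable that returns the page body or raises an exception.
    Returns a dict mapping each URL to its body, or to None if every attempt failed.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # fixed pause between consecutive URLs
        results[url] = None
        for attempt in range(retries):
            try:
                results[url] = fetch(url)
                break
            except Exception:  # broad on purpose in this sketch
                if attempt < retries - 1:
                    # Back off before retrying: 2x, 4x, ... the base delay.
                    sleep(delay * 2 ** (attempt + 1))
    return results
```

With the requests dependency, the callable could be something like lambda url: requests.get(url, timeout=30).text, with a status check added as needed.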

Scope

In Scope (MVP)

Query Wayback Machine CDX API for snapshots
Deduplicate by content hash (digest)
Fetch archived page content
Extract text from HTML (strip nav/header/footer/scripts)
Generate unified diffs between versions
Produce a styled HTML report with:
- Executive summary with statistics
- Version timeline with links to archived versions
- Color-coded diff sections (green=added, red=removed)
- Table of contents for navigation
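
The diff and color-coding items above could look roughly like this, using difflib from the standard library. The function names and the CSS class names (meta, added, removed, context) are hypothetical; the report's stylesheet would map added to green and removed to red.

```python
import difflib
import html


def diff_versions(old_text, new_text, old_label, new_label):
    """Return unified-diff lines between two extracted text versions."""
    return list(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile=old_label, tofile=new_label, lineterm="",
    ))


def render_diff_html(diff_lines):
    """Wrap each diff line in a span whose class marks it as added or removed."""
    rendered = []
    for index, line in enumerate(diff_lines):
        # The first two lines of a unified diff are the ---/+++ file headers.
        if index < 2 or line.startswith("@@"):
            css = "meta"
        elif line.startswith("+"):
            css = "added"
        elif line.startswith("-"):
            css = "removed"
        else:
            css = "context"
        rendered.append('<span class="{}">{}</span>'.format(css, html.escape(line)))
    return "<pre>\n" + "\n".join(rendered) + "\n</pre>"
```

Using snapshot timestamps as the labels keeps each diff section traceable to the two archived versions it compares. Two identical versions produce an empty diff, which is a cheap second check after digest deduplication.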

Out of Scope (Future)

Graphical user interface (GUI)
Multiple URL comparison
PDF/Markdown export formats
Database storage of results
Scheduled/automated monitoring
Email notifications for changes

Open Questions

Should the tool accept command-line arguments instead of hardcoded config?
Should it support comparing specific date ranges or versions?
Should it handle JavaScript-rendered pages (would need a headless browser)?
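
If the first two questions are answered with yes, the interface might look something like the sketch below. Every flag name, the program name, and the defaults are hypothetical and only meant to make the question concrete; the date bounds are shown as plain strings that could be passed on to the CDX query.

```python
import argparse


def build_parser():
    """Build a command-line parser that replaces the hardcoded configuration."""
    parser = argparse.ArgumentParser(
        prog="waybackdiff",
        description="Generate an HTML change report from Wayback Machine snapshots.",
    )
    parser.add_argument("url", help="page to track, e.g. a Terms of Service URL")
    parser.add_argument("--from", dest="from_date", default=None,
                        help="earliest snapshot date to include (YYYYMMDD)")
    parser.add_argument("--to", dest="to_date", default=None,
                        help="latest snapshot date to include (YYYYMMDD)")
    parser.add_argument("--delay", type=float, default=1.5,
                        help="seconds to wait between requests")
    parser.add_argument("--output", default="report.html",
                        help="path of the HTML report to write")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```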
