Project Overview

Project Name

WaybackDiff - Wayback Machine Terms of Service Change Report Generator

One-Line Summary

A Python CLI tool that fetches historical snapshots of web pages from the Internet Archive and generates HTML diff reports showing all content changes over time.

Problem Statement

Legal, compliance, and research professionals need to track changes to Terms of Service, Privacy Policies, and other legal documents over time. Manually comparing archived versions is tedious and error-prone. This tool automates the process of:

  • Finding all archived versions of a URL
  • Identifying when content actually changed (vs. cosmetic updates)
  • Generating a readable, navigable diff report

Proposed Solution

A Python script that:

  1. Queries the Wayback Machine CDX API for all snapshots of a target URL
  2. Deduplicates by content hash to find truly unique versions
  3. Fetches and extracts text content from each unique version
  4. Generates unified diffs between consecutive versions
  5. Produces a comprehensive HTML report with timeline, statistics, and color-coded diffs
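
As a rough illustration of steps 1 and 2, the sketch below builds a CDX query and deduplicates the result by digest. It assumes the CDX server's JSON output, where the first row is a header and each later row is one snapshot. The function names and the choice of fields are illustrative and are not taken from the tool's code.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(target_url):
    """Build a CDX query for successful captures of target_url."""
    params = {
        "url": target_url,
        "output": "json",
        "fl": "timestamp,original,digest",  # fields to return
        "filter": "statuscode:200",         # skip redirects and errors
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_rows(rows):
    """Turn CDX JSON output (header row + data rows) into dicts."""
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


def unique_by_digest(snapshots):
    """Keep the earliest snapshot for each distinct content digest."""
    seen = set()
    unique = []
    for snap in sorted(snapshots, key=lambda s: s["timestamp"]):
        if snap["digest"] not in seen:
            seen.add(snap["digest"])
            unique.append(snap)
    return unique


def snapshot_url(snap):
    """URL of the archived capture; the id_ suffix requests the page
    as originally captured, without the Wayback toolbar."""
    return f"https://web.archive.org/web/{snap['timestamp']}id_/{snap['original']}"
```

The rows would come from something like `requests.get(build_cdx_url(url), timeout=30).json()`. Keeping only the first occurrence of each digest means a page that reverts to earlier wording is not reported as a new version, which may or may not be the desired behaviour.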

Target Users

  • Legal professionals tracking contract/policy changes
  • Compliance teams monitoring vendor terms
  • Researchers studying policy evolution
  • Anyone needing historical documentation of web page changes

Success Criteria

  • Retrieves snapshots from the Wayback Machine for any valid URL that has been archived
  • Accurately identifies content changes (ignores layout/styling changes)
  • Produces a readable, navigable HTML report
  • Handles rate limiting and network errors gracefully

Technical Context

Platform/Environment

CLI tool (Python script)

Primary Language/Framework

Python 3.x

Key Dependencies

  • requests - HTTP client for API calls and content fetching
  • beautifulsoup4 - HTML parsing and text extraction
  • Standard library: difflib, json, datetime, re
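
To show how beautifulsoup4 could serve the text-extraction step, here is a minimal sketch that drops script, style, nav, header, and footer elements before collecting the visible text. The `extract_text` name and the exact tag list are assumptions; real pages may need site-specific handling.

```python
from bs4 import BeautifulSoup

# Elements that rarely carry the legal text itself (assumed list).
STRIP_TAGS = ["script", "style", "nav", "header", "footer"]


def extract_text(html):
    """Return the visible text of an HTML page, one non-empty line per row."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove unwanted elements together with everything inside them.
    for tag in soup(STRIP_TAGS):
        tag.decompose()
    # Separate text nodes with newlines, then drop blank lines.
    lines = (line.strip() for line in soup.get_text("\n").splitlines())
    return "\n".join(line for line in lines if line)
```

Normalising whitespace this way keeps purely cosmetic markup changes from showing up as differences later on.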

Constraints

  • Subject to Wayback Machine rate limits (the tool waits 1.5s between requests)
  • Depends on pages having been archived by the Internet Archive
  • Text extraction quality varies based on page structure
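
One way to honour the 1.5s delay and recover from transient network errors is a small wrapper that pauses before every attempt and backs off after each failure. This is a sketch only: `fetch_with_retry`, the retry count, and the doubling backoff are assumptions, and the actual download function is passed in so the wrapper stays independent of the HTTP client.

```python
import time

REQUEST_DELAY_SECONDS = 1.5  # delay stated in the constraints above


def fetch_with_retry(fetch, url, retries=3, delay=REQUEST_DELAY_SECONDS,
                     sleep=time.sleep):
    """Call fetch(url), waiting before each attempt and doubling the wait
    after every failure. Raises RuntimeError once all attempts fail."""
    last_error = None
    for attempt in range(retries):
        sleep(delay * (2 ** attempt))  # 1.5s, 3s, 6s, ... with the defaults
        try:
            return fetch(url)
        except OSError as exc:  # requests' exceptions derive from OSError
            last_error = exc
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error
```

With requests, the call could look like `fetch_with_retry(lambda u: requests.get(u, timeout=30).text, url)`; calling `raise_for_status()` inside the lambda would also turn HTTP error responses into retries.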

Scope

In Scope (MVP)

  • Query Wayback Machine CDX API for snapshots
  • Deduplicate by content hash (digest)
  • Fetch archived page content
  • Extract text from HTML (strip nav/header/footer/scripts)
  • Generate unified diffs between versions
  • Produce styled HTML report with:
    • Executive summary with statistics
    • Version timeline with links to archived versions
    • Color-coded diff sections (green=added, red=removed)
    • Table of contents for navigation
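
The diff and color-coding items above can be sketched with difflib and html from the standard library. The function names, CSS class names, and colors below are hypothetical; the report's real markup may differ.

```python
import difflib
import html

# Assumed styling: green background for additions, red for removals.
DIFF_CSS = ".added { background: #e6ffed; } .removed { background: #ffeef0; }"


def diff_lines(old_text, new_text, old_label, new_label):
    """Unified diff between two versions as a list of lines (empty if equal)."""
    return list(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile=old_label, tofile=new_label, lineterm="",
    ))


def diff_to_html(diff):
    """Wrap each diff line in a span whose class reflects its role."""
    out = ['<pre class="diff">']
    for index, line in enumerate(diff):
        # The first two lines are the ---/+++ file headers.
        if index < 2 or line.startswith("@@"):
            css = "meta"
        elif line.startswith("+"):
            css = "added"
        elif line.startswith("-"):
            css = "removed"
        else:
            css = "context"
        # Escape the text so archived content cannot inject markup.
        out.append(f'<span class="{css}">{html.escape(line)}</span>')
    out.append("</pre>")
    return "\n".join(out)


def diff_stats(diff):
    """Count added and removed lines, ignoring the two header lines."""
    body = diff[2:]
    added = sum(1 for line in body if line.startswith("+"))
    removed = sum(1 for line in body if line.startswith("-"))
    return added, removed
```

The labels passed as `old_label` and `new_label` could be the snapshot timestamps, which would tie each diff section back to the version timeline.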

Out of Scope (Future)

  • Graphical user interface (GUI)
  • Multiple URL comparison
  • PDF/Markdown export formats
  • Database storage of results
  • Scheduled/automated monitoring
  • Email notifications for changes

Open Questions

  • Should the tool accept command-line arguments instead of hardcoded config?
  • Should it support comparing specific date ranges or versions?
  • Should it handle JavaScript-rendered pages (would need headless browser)?
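
If the first two questions are answered with yes, the command-line interface might look roughly like the argparse sketch below. Every option name and default here is hypothetical; the date bounds are included because the CDX API accepts `from` and `to` parameters.

```python
import argparse


def build_parser():
    """Hypothetical CLI replacing the hardcoded configuration."""
    parser = argparse.ArgumentParser(
        prog="waybackdiff",
        description="Generate an HTML change report from Wayback Machine snapshots.",
    )
    parser.add_argument("url", help="page to track, e.g. example.com/terms")
    parser.add_argument("--from", dest="from_date", default=None,
                        help="earliest snapshot date (YYYY or YYYYMMDD)")
    parser.add_argument("--to", dest="to_date", default=None,
                        help="latest snapshot date (YYYY or YYYYMMDD)")
    parser.add_argument("--delay", type=float, default=1.5,
                        help="seconds to wait between requests")
    parser.add_argument("--output", default="report.html",
                        help="path of the generated HTML report")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```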