A single-file Python script that orchestrates a six-step pipeline: snapshot discovery → deduplication → content fetching → text extraction → diff generation → HTML report generation. All operations run sequentially, with rate limiting applied to external API calls.
Purpose: Query the Wayback Machine for snapshot metadata

Responsibilities:
- Build CDX API query with parameters
- Parse JSON response into snapshot records
- Handle API errors and empty results

Functions:
- `fetch_cdx_data()`
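
The query-building and parsing responsibilities might look roughly like the sketch below. It is not the script's actual code: `build_cdx_params` and `parse_cdx_response` are hypothetical helper names, and the chosen `fl` fields and `statuscode:200` filter are assumptions. It relies on the CDX server's JSON output being a list of rows whose first row is the header.

```python
import json

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_params(url, from_date, to_date):
    """Return query parameters for a CDX search (field list is an assumption)."""
    return {
        "url": url,
        "from": from_date,  # e.g. "20200101"
        "to": to_date,      # e.g. "20241231"
        "output": "json",
        "fl": "timestamp,original,digest,statuscode",
        "filter": "statuscode:200",
    }


def parse_cdx_response(body):
    """Turn the CDX JSON body (header row + data rows) into a list of dicts."""
    try:
        rows = json.loads(body)
    except json.JSONDecodeError as exc:
        print(f"Invalid JSON from CDX API: {exc}")
        return []
    if not isinstance(rows, list) or len(rows) < 2:
        return []  # empty result, or a header row with no data
    header, *records = rows
    return [dict(zip(header, record)) for record in records]
```

With `requests`, the response text from `requests.get(CDX_ENDPOINT, params=build_cdx_params(...), timeout=30)` could be passed straight to `parse_cdx_response`.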
Purpose: Filter snapshots to unique content versions

Responsibilities:
- Track seen content hashes (digests)
- Return only first occurrence of each unique version

Functions:
- `deduplicate_by_digest()`
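
A minimal sketch of this step, assuming each snapshot is a dict with a `digest` key (the record shape and the signature are assumptions, not taken from the script):

```python
def deduplicate_by_digest(snapshots):
    """Keep the first snapshot seen for each digest, preserving input order."""
    seen = set()
    unique = []
    for snap in snapshots:
        digest = snap["digest"]
        if digest in seen:
            continue  # same content as an earlier snapshot
        seen.add(digest)
        unique.append(snap)
    return unique
```

Because only first occurrences are kept, a page that later reverts to earlier content would not reappear in the result under this approach.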
Purpose: Retrieve archived HTML from the Wayback Machine

Responsibilities:
- Construct Wayback Machine URLs with the `id_` flag
- Implement retry logic with exponential backoff
- Rate limit requests (configurable delay)

Functions:
- `fetch_wayback_content()`
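
The URL construction and retry behaviour could be sketched as follows. Appending `id_` to the timestamp asks the Wayback Machine for the capture as originally archived, without the injected toolbar or rewritten links. The helper names, the injectable `fetch` and `sleep` callables, and the default of 3 attempts with a 2-second base delay are all assumptions made for illustration.

```python
import time


def build_wayback_url(timestamp, original_url):
    """Append `id_` to the timestamp so the archive returns the page as captured."""
    return f"https://web.archive.org/web/{timestamp}id_/{original_url}"


def fetch_with_retry(url, fetch, max_retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url); on failure wait base_delay * 2**attempt, then try again."""
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except Exception as exc:  # real code should catch network errors only
            if attempt == max_retries - 1:
                print(f"Giving up on {url}: {exc}")
                return None
            sleep(base_delay * 2 ** attempt)  # 2s, 4s, ... between attempts
    return None
```

In the script itself, `fetch` would presumably wrap a `requests.get` call and return the response text; it is passed in here so the retry logic can be exercised without network access.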
Purpose: Convert HTML to clean, comparable text

Responsibilities:
- Parse HTML with BeautifulSoup
- Remove non-content elements (scripts, styles, nav, etc.)
- Locate main content area using CSS selectors
- Preserve document structure with heading markers
- Clean excessive whitespace

Functions:
- `extract_text_from_html()`
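
To illustrate the idea without third-party dependencies, the sketch below swaps BeautifulSoup for the standard library's `html.parser`. It covers element removal, heading markers, and whitespace cleanup, but not the CSS-selector step (with BeautifulSoup that would typically use `select_one`). The tag sets and the `#` heading-marker style are assumptions.

```python
import re
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "nav", "footer", "noscript"}
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BLOCK_TAGS = HEADING_TAGS | {"p", "div", "li", "br", "tr", "section", "article"}


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a non-content element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            if tag in BLOCK_TAGS:
                self.parts.append("\n")
            if tag in HEADING_TAGS:
                # "h2" becomes "## " so headings stay visible in plain text
                self.parts.append("#" * int(tag[1]) + " ")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Collapse newlines inside text so they do not split a paragraph
            self.parts.append(re.sub(r"\s+", " ", data))


def extract_text(html):
    """Return one cleaned line per block element, dropping empty lines."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    raw_lines = "".join(parser.parts).splitlines()
    lines = (re.sub(r"\s+", " ", line).strip() for line in raw_lines)
    return "\n".join(line for line in lines if line)
```

Emitting one line per block element keeps the later line-based diff focused on paragraphs rather than on markup.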
Purpose: Compare versions and identify changes

Responsibilities:
- Generate unified diff between text versions
- Calculate addition/deletion statistics
- Filter out trivial diffs

Functions:
- `generate_diff()`
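
A sketch of these three responsibilities using `difflib.unified_diff`. The signature, the returned dict, and the `min_changes` threshold for "trivial" are assumptions; the script may define triviality differently.

```python
import difflib


def generate_diff(old_text, new_text, old_label="old", new_label="new", min_changes=1):
    """Return unified diff lines plus added/removed counts, or None if trivial."""
    diff_lines = list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",
        )
    )
    # The first two lines are the "---" / "+++" file headers; skip them
    # so they are not counted as removals or additions.
    body = diff_lines[2:]
    added = sum(1 for line in body if line.startswith("+"))
    removed = sum(1 for line in body if line.startswith("-"))
    if added + removed < min_changes:
        return None
    return {"lines": diff_lines, "added": added, "removed": removed}
```

Passing snapshot timestamps as `old_label` and `new_label` would make each diff header identify the two versions being compared.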
Purpose: Produce final HTML output

Responsibilities:
- Build HTML document with embedded CSS
- Render executive summary and statistics
- Create version timeline with links
- Format diff sections with syntax highlighting

Functions:
- `generate_html_report()`
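
The embedded-CSS and diff-formatting parts could be sketched as below. The class names, colours, and the `render_diff_line` / `render_report` helpers are hypothetical; the summary and timeline sections are omitted. All page-derived text goes through `html.escape` before being placed in the document.

```python
from html import escape

CSS = """
body { font-family: sans-serif; margin: 2rem; }
.add { background: #e6ffed; }
.del { background: #ffeef0; }
.hunk { color: #6a737d; }
"""


def render_diff_line(line):
    """Wrap one unified-diff line in a span whose class reflects its type."""
    if line.startswith("@@"):
        css_class = "hunk"
    elif line.startswith("+"):
        css_class = "add"
    elif line.startswith("-"):
        css_class = "del"
    else:
        css_class = "ctx"
    return f'<span class="{css_class}">{escape(line)}</span>'


def render_report(title, diff_lines):
    """Build a standalone HTML page with the stylesheet embedded in <style>."""
    body = "\n".join(render_diff_line(line) for line in diff_lines)
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{escape(title)}</title>'
        f"<style>{CSS}</style></head>\n"
        f"<body><h1>{escape(title)}</h1>\n<pre>{body}</pre></body></html>"
    )
```

Writing the returned string to a file such as `changes_report.html` yields a single shareable document with no external assets.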
[Target URL + Date Range]
│
▼
┌─────────────────────────┐
│ CDX API Query │ ──→ Snapshot metadata (timestamp, digest)
└─────────────────────────┘
│
▼
┌─────────────────────────┐
│ Deduplication │ ──→ Unique snapshots only
└─────────────────────────┘
│
▼
┌─────────────────────────┐
│ Content Fetching │ ──→ Raw HTML for each version
│ (rate limited) │
└─────────────────────────┘
│
▼
┌─────────────────────────┐
│ Text Extraction │ ──→ Clean text content
└─────────────────────────┘
│
▼
┌─────────────────────────┐
│ Diff Generation │ ──→ Changes between versions
└─────────────────────────┘
│
▼
┌─────────────────────────┐
│ HTML Report Generation │ ──→ Final report file
└─────────────────────────┘
| Layer | Technology | Rationale |
|---|---|---|
| Language | Python 3.x | Rapid development, rich ecosystem |
| HTTP Client | requests | De facto standard, simple API |
| HTML Parser | BeautifulSoup4 | Robust, forgiving HTML parsing |
| Diff Engine | difflib (stdlib) | Unified diff format, no dependencies |
| Date Handling | datetime (stdlib) | Timestamp parsing and formatting |
| Output | HTML + CSS | Universal, no dependencies to view |
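
As an illustration of the date-handling row, Wayback Machine timestamps are 14-digit strings in `YYYYMMDDhhmmss` form, which `datetime.strptime` parses directly. The helper names and the display format below are assumptions.

```python
from datetime import datetime


def parse_wayback_timestamp(timestamp):
    """Parse a 14-digit Wayback timestamp such as '20230415123045'."""
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S")


def format_for_report(timestamp):
    """Render a Wayback timestamp in a human-readable form for the report."""
    return parse_wayback_timestamp(timestamp).strftime("%Y-%m-%d %H:%M")
```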
/WaybackDiff
├── PROJECT.md # Project overview and scope
├── REQUIREMENTS.md # Detailed requirements
├── ARCHITECTURE.md # This file
├── README.md # User documentation
├── requirements.txt # Python dependencies
├── wayback_tos_diff_report.py # Main script
├── changes_report.html # Generated output (gitignored)
├── snapshots/ # Saved HTML snapshots (gitignored)
└── Devlog/ # Development logs
    └── .gitkeep
Context: Need a tool that's easy to run and share
Decision: Keep everything in one Python file with no custom modules
Consequences: Easy to distribute; harder to test individual components; may need refactoring if scope grows

Context: Users need to specify target URL and date range
Decision: Configuration variables at top of script rather than CLI arguments
Consequences: Easy to modify for power users; less flexible for ad-hoc usage; good candidate for future enhancement

Context: Report needs to be standalone and shareable
Decision: Inline all styles in the HTML file
Consequences: Single file output; no external dependencies; larger file size; harder to customize styling

Context: Wayback Machine API has usage limits
Decision: Fixed 1.5s delay between requests
Consequences: Respectful to API; slower execution; could be optimized with adaptive rate limiting
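
The fixed-delay decision amounts to a sequential loop with a pause between requests. A minimal sketch, assuming a hypothetical `fetch_all` helper and injectable `fetch` and `sleep` callables:

```python
import time

REQUEST_DELAY_SECONDS = 1.5  # fixed delay between requests


def fetch_all(snapshots, fetch, delay=REQUEST_DELAY_SECONDS, sleep=time.sleep):
    """Fetch snapshots one at a time, pausing between consecutive requests."""
    results = []
    for index, snap in enumerate(snapshots):
        if index > 0:
            sleep(delay)  # pause between requests, not before the first one
        results.append(fetch(snap))
    return results
```

With this shape, total run time grows linearly: N snapshots cost at least (N - 1) × 1.5 seconds of waiting on top of the request time itself.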
- No authentication or secrets management required
- User-Agent spoofing to avoid bot detection (common practice)
- No user input sanitization needed (hardcoded config)
- HTML output escapes user-controlled content to prevent XSS
| Error Type | Handling |
|---|---|
| Network timeout | Retry up to 3 times with 2s delay |
| HTTP errors | Log and skip that snapshot, continue processing |
| Invalid JSON | Log error, return empty result |
| No snapshots found | Exit gracefully with helpful message |
| Insufficient versions | Exit gracefully (need ≥2 for diff) |
| Content extraction failure | Skip version, continue with others |
- CLI Arguments: Accept URL, date range, output file via argparse
- Multiple URLs: Compare changes across related pages
- Export Formats: Add PDF, Markdown, JSON output options
- Caching: Store fetched content to avoid re-downloading
- Async Fetching: Use aiohttp for parallel requests (with rate limiting)
- Scheduled Runs: Add cron/scheduler support for monitoring
- Notifications: Email/webhook when changes detected
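
For the CLI arguments item, one possible shape using `argparse` is sketched below. The flag names and defaults are assumptions (the default output name reuses `changes_report.html` from the file layout); nothing here exists in the script today.

```python
import argparse


def build_parser():
    """Build a parser for the target URL, date range, and output path."""
    parser = argparse.ArgumentParser(description="Diff archived versions of a page.")
    parser.add_argument("url", help="Target URL to look up in the Wayback Machine")
    parser.add_argument("--from", dest="from_date", default=None, help="Start date, YYYYMMDD")
    parser.add_argument("--to", dest="to_date", default=None, help="End date, YYYYMMDD")
    parser.add_argument("-o", "--output", default="changes_report.html", help="Report path")
    return parser
```

The `dest` overrides are needed because `from` is a Python keyword and could not otherwise be read as `args.from`.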