Architecture

System Overview

A single-file Python script that orchestrates a six-step pipeline: snapshot discovery → deduplication → content fetching → text extraction → diff generation → HTML report generation. All steps run sequentially, and requests to the Wayback Machine are rate limited.

Components

Component 1: CDX API Client

Purpose: Query Wayback Machine for snapshot metadata

Responsibilities:

Build CDX API query with parameters
Parse JSON response into snapshot records
Handle API errors and empty results
Functions: fetch_cdx_data()
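
As a rough illustration of the responsibilities above, the sketch below builds a CDX query URL and parses the JSON response, in which the first row is a header and each later row is one capture. The helper names `build_cdx_url` and `parse_cdx_response`, the chosen field list, and the status filter are assumptions for illustration, not the actual body of `fetch_cdx_data()`.

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(target_url, start_date, end_date):
    """Build a CDX query URL for captures of target_url in a date range."""
    params = {
        "url": target_url,
        "from": start_date,  # YYYYMMDD, e.g. "20230101"
        "to": end_date,
        "output": "json",
        "fl": "timestamp,original,digest,statuscode",  # assumed field list
        "filter": "statuscode:200",  # assumed: keep successful captures only
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_response(body):
    """Turn a CDX JSON body into a list of dicts; return [] if it is invalid or empty."""
    try:
        rows = json.loads(body)
    except (json.JSONDecodeError, TypeError):
        return []
    # A usable response has a header row plus at least one capture row.
    if not isinstance(rows, list) or len(rows) < 2:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]
```

Returning an empty list for both invalid JSON and empty results lets the caller treat "no snapshots" as a single case.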

Component 2: Deduplication Engine

Purpose: Filter snapshots to unique content versions

Responsibilities:

Track seen content hashes (digests)
Return only first occurrence of each unique version
Functions: deduplicate_by_digest()
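
A minimal sketch of first-occurrence deduplication, assuming each snapshot is a dict with a `digest` key (the record shape is an assumption, not confirmed by this document):

```python
def deduplicate_by_digest(snapshots):
    """Keep the first snapshot seen for each digest, preserving input order."""
    seen = set()
    unique = []
    for snapshot in snapshots:
        digest = snapshot["digest"]
        if digest in seen:
            continue  # same content as an earlier capture
        seen.add(digest)
        unique.append(snapshot)
    return unique
```

Because every digest seen so far is remembered, a page that reverts to an earlier version is not reported a second time under this approach.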

Component 3: Content Fetcher

Purpose: Retrieve archived HTML from Wayback Machine

Responsibilities:

Construct Wayback Machine URLs with id_ flag
Implement retry logic with exponential backoff
Rate limit requests (configurable delay)
Functions: fetch_wayback_content()
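
The sketch below shows one way the URL construction and retry logic could look. The `id_` suffix after the timestamp requests the capture as archived, without the Wayback toolbar or rewritten links. The helper names, the injected `fetch` callable, and the default retry count and base delay are assumptions for illustration, not the actual `fetch_wayback_content()`.

```python
import time


def build_wayback_url(timestamp, original_url):
    """Return the raw-capture URL: the id_ flag follows the 14-digit timestamp."""
    return f"https://web.archive.org/web/{timestamp}id_/{original_url}"


def fetch_with_backoff(fetch, url, max_retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url); on failure wait base_delay, then double it, up to max_retries attempts.

    Returns the fetched content, or None if every attempt fails.
    """
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except Exception as exc:  # a real client would catch narrower errors
            if attempt == max_retries - 1:
                return None
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)
    return None
```

Passing `fetch` and `sleep` in as arguments keeps the retry logic testable without network access; in the script, `fetch` would presumably wrap an HTTP GET made with requests.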

Component 4: Text Extractor

Purpose: Convert HTML to clean, comparable text

Responsibilities:

Parse HTML with BeautifulSoup
Remove non-content elements (scripts, styles, nav, etc.)
Locate main content area using CSS selectors
Preserve document structure with heading markers
Clean excessive whitespace
Functions: extract_text_from_html()
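
To make the extraction steps concrete, here is a dependency-free sketch. Note that it swaps in the standard library's `html.parser` in place of BeautifulSoup so it runs without third-party packages, and it omits the CSS-selector step for locating the main content area. The tag sets and the `#`-style heading markers are assumptions.

```python
import re
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "nav", "header", "footer"}  # assumed non-content tags
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BLOCK_TAGS = HEADING_TAGS | {"p", "div", "li", "br", "tr", "section"}


class _TextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            if tag in BLOCK_TAGS:
                self.parts.append("\n")
            if tag in HEADING_TAGS:
                # Mark headings by level, e.g. "## " for <h2>.
                self.parts.append("#" * int(tag[1]) + " ")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif self.skip_depth == 0 and tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Collapse runs of whitespace, including newlines inside a paragraph.
            self.parts.append(re.sub(r"\s+", " ", data))


def extract_text_from_html(html):
    """Return visible text, one block per line, with heading markers and no blank lines."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    text = "".join(collector.parts)
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

Normalizing whitespace here matters for the diff step: without it, reflowed markup would show up as spurious changes.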

Component 5: Diff Generator

Purpose: Compare versions and identify changes

Responsibilities:

Generate unified diff between text versions
Calculate addition/deletion statistics
Filter out trivial diffs
Functions: generate_diff()
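
A sketch of how these steps might fit together with difflib. The signature, the return shape, and the `min_changes` threshold used to decide what counts as a trivial diff are assumptions, since the document does not define them.

```python
import difflib


def generate_diff(old_text, new_text, old_label="old", new_label="new", min_changes=1):
    """Return (diff_lines, added, removed), or None if fewer than min_changes lines differ."""
    diff_lines = list(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",
    ))
    # The first two lines are the "---"/"+++" file headers; count only hunk lines.
    body = diff_lines[2:]
    added = sum(1 for line in body if line.startswith("+"))
    removed = sum(1 for line in body if line.startswith("-"))
    if added + removed < min_changes:
        return None
    return diff_lines, added, removed
```

Skipping the two header lines by position avoids miscounting a content line that itself begins with `++` or `--`.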

Component 6: Report Generator

Purpose: Produce final HTML output

Responsibilities:

Build HTML document with embedded CSS
Render executive summary and statistics
Create version timeline with links
Format diff sections with syntax highlighting
Functions: generate_html_report()
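
The following sketch shows the general shape of a standalone report: styles embedded in a `<style>` element, diff lines colored by CSS class, and all inserted text passed through `html.escape`. The class names, the CSS, and the two-argument signature are assumptions; the summary and timeline sections are left out for brevity, and the caller is assumed to pass hunk lines without the file headers.

```python
from html import escape

# Embedded stylesheet so the report is a single shareable file.
CSS = """
body { font-family: sans-serif; }
pre span { display: block; }
.added { background: #e6ffed; }
.removed { background: #ffeef0; }
.hunk { color: #6a737d; }
"""


def render_diff(diff_lines):
    """Wrap each diff line in a span whose class selects its color."""
    rows = []
    for line in diff_lines:
        if line.startswith("@@"):
            css_class = "hunk"
        elif line.startswith("+"):
            css_class = "added"
        elif line.startswith("-"):
            css_class = "removed"
        else:
            css_class = "context"
        # Escape the text so archived markup is shown, not interpreted.
        rows.append(f'<span class="{css_class}">{escape(line)}</span>')
    return "\n".join(rows)


def generate_html_report(title, diff_lines):
    """Return a complete HTML document as a string."""
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        f"<title>{escape(title)}</title>"
        f"<style>{CSS}</style></head>\n"
        f"<body><h1>{escape(title)}</h1>\n"
        f"<pre>\n{render_diff(diff_lines)}\n</pre></body></html>"
    )
```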

Data Flow

[Target URL + Date Range]
         │
         ▼
┌─────────────────────────┐
│  CDX API Query          │ ──→ Snapshot metadata (timestamp, digest)
└─────────────────────────┘
         │
         ▼
┌─────────────────────────┐
│  Deduplication          │ ──→ Unique snapshots only
└─────────────────────────┘
         │
         ▼
┌─────────────────────────┐
│  Content Fetching       │ ──→ Raw HTML for each version
│  (rate limited)         │
└─────────────────────────┘
         │
         ▼
┌─────────────────────────┐
│  Text Extraction        │ ──→ Clean text content
└─────────────────────────┘
         │
         ▼
┌─────────────────────────┐
│  Diff Generation        │ ──→ Changes between versions
└─────────────────────────┘
         │
         ▼
┌─────────────────────────┐
│  HTML Report Generation │ ──→ Final report file
└─────────────────────────┘

Technology Stack

Layer	Technology	Rationale
Language	Python 3.x	Rapid development, rich ecosystem
HTTP Client	requests	De facto standard, simple API
HTML Parser	BeautifulSoup4	Robust, forgiving HTML parsing
Diff Engine	difflib (stdlib)	Unified diff format, no dependencies
Date Handling	datetime (stdlib)	Timestamp parsing and formatting
Output	HTML + CSS	Universal, no dependencies to view

Directory Structure

/WaybackDiff
├── PROJECT.md                           # Project overview and scope
├── REQUIREMENTS.md                      # Detailed requirements
├── ARCHITECTURE.md                      # This file
├── README.md                            # User documentation
├── requirements.txt                     # Python dependencies
├── wayback_tos_diff_report.py          # Main script
├── changes_report.html                  # Generated output (gitignored)
├── snapshots/                           # Saved HTML snapshots (gitignored)
└── Devlog/                             # Development logs
    └── .gitkeep

Key Design Decisions

Decision 1: Single-file script

Context: Need a tool that's easy to run and share
Decision: Keep everything in one Python file with no custom modules
Consequences: Easy to distribute; harder to test individual components; may need refactoring if scope grows

Decision 2: Hardcoded configuration

Context: Users need to specify target URL and date range
Decision: Configuration variables at top of script rather than CLI arguments
Consequences: Easy to modify for power users; less flexible for ad-hoc usage; good candidate for future enhancement

Decision 3: Embedded CSS in HTML output

Context: Report needs to be standalone and shareable
Decision: Inline all styles in the HTML file
Consequences: Single file output; no external dependencies; larger file size; harder to customize styling

Decision 4: Rate limiting via sleep

Context: Wayback Machine API has usage limits
Decision: Fixed 1.5s delay between requests
Consequences: Respectful to API; slower execution; could be optimized with adaptive rate limiting
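
A fixed-delay rate limiter can be as small as the sketch below. The names `REQUEST_DELAY_SECONDS` and `fetch_all`, and the injected `fetch_one` and `sleep` callables, are hypothetical; only the 1.5s value comes from the decision above.

```python
import time

REQUEST_DELAY_SECONDS = 1.5  # fixed pause between consecutive requests


def fetch_all(snapshots, fetch_one, delay=REQUEST_DELAY_SECONDS, sleep=time.sleep):
    """Fetch each snapshot in order, pausing between requests but not before the first."""
    results = []
    for index, snapshot in enumerate(snapshots):
        if index > 0:
            sleep(delay)
        results.append(fetch_one(snapshot))
    return results
```

With N unique snapshots this adds (N - 1) × 1.5s to the run time, which is the cost the consequences above refer to.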

Security Considerations

No authentication or secrets management required
User-Agent spoofing to avoid bot detection (common practice)
No user input sanitization needed (hardcoded config)
HTML output escapes user-controlled content to prevent XSS

Error Handling Strategy

Error Type	Handling
Network timeout	Retry up to 3 times with 2s delay
HTTP errors	Log and skip that snapshot, continue processing
Invalid JSON	Log error, return empty result
No snapshots found	Exit gracefully with helpful message
Insufficient versions	Exit gracefully (need ≥2 for diff)
Content extraction failure	Skip version, continue with others

Future Enhancement Opportunities

CLI Arguments: Accept URL, date range, output file via argparse
Multiple URLs: Compare changes across related pages
Export Formats: Add PDF, Markdown, JSON output options
Caching: Store fetched content to avoid re-downloading
Async Fetching: Use aiohttp for parallel requests (with rate limiting)
Scheduled Runs: Add cron/scheduler support for monitoring
Notifications: Email/webhook when changes detected
