Architecture

System Overview

A single-file Python script that orchestrates a 6-step pipeline: snapshot discovery → deduplication → content fetching → text extraction → diff generation → HTML report generation. All steps run sequentially, and requests to the Wayback Machine are rate limited.

Components

Component 1: CDX API Client

Purpose: Query Wayback Machine for snapshot metadata
Responsibilities:

  • Build CDX API query with parameters
  • Parse JSON response into snapshot records
  • Handle API errors and empty results
  • Functions: fetch_cdx_data()
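
The query-building and parsing responsibilities might look roughly like the sketch below. It assumes the public CDX endpoint with `output=json`, where the first row of the response is a header row. The helper names `build_cdx_url` and `parse_cdx_rows` are hypothetical and are not taken from the script; the HTTP call itself is left out.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(target_url, from_date, to_date):
    """Build a CDX query URL asking for JSON rows of successful captures."""
    params = {
        "url": target_url,
        "from": from_date,  # e.g. "20200101"
        "to": to_date,      # e.g. "20241231"
        "output": "json",
        "fl": "timestamp,original,digest",  # only the fields the pipeline needs
        "filter": "statuscode:200",         # assumption: skip redirects and errors
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_rows(rows):
    """Turn CDX JSON output (header row + data rows) into a list of dicts."""
    if not rows or len(rows) < 2:
        return []  # empty result or header only
    header, *data = rows
    return [dict(zip(header, row)) for row in data]
```

The returned URL can be passed to `requests.get(...)`, and the decoded JSON body to `parse_cdx_rows`.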

Component 2: Deduplication Engine

Purpose: Filter snapshots to unique content versions
Responsibilities:

  • Track seen content hashes (digests)
  • Return only first occurrence of each unique version
  • Functions: deduplicate_by_digest()
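
A minimal sketch of this step, assuming each snapshot record is a dict with a `digest` key and that the input is already in chronological order. The body is illustrative and may differ from the script's actual `deduplicate_by_digest()`.

```python
def deduplicate_by_digest(snapshots):
    """Keep the first snapshot seen for each content digest, preserving order."""
    seen = set()
    unique = []
    for snap in snapshots:
        digest = snap["digest"]
        if digest in seen:
            continue  # same content as an earlier capture
        seen.add(digest)
        unique.append(snap)
    return unique
```

Because only the first occurrence is kept, a page that changes and later reverts to an earlier version will not show the revert as a new version.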

Component 3: Content Fetcher

Purpose: Retrieve archived HTML from Wayback Machine
Responsibilities:

  • Construct Wayback Machine URLs with id_ flag
  • Implement retry logic with exponential backoff
  • Rate limit requests (configurable delay)
  • Functions: fetch_wayback_content()
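
The URL construction and retry behaviour could be sketched as follows. The helper names, the retry count, and the base delay are assumptions for illustration. The HTTP call is injected as `fetch` so the retry logic can be read and tested on its own.

```python
import time


def build_wayback_url(timestamp, original_url):
    """Return the capture URL; the id_ suffix requests the page as archived,
    without the Wayback toolbar or rewritten links."""
    return f"https://web.archive.org/web/{timestamp}id_/{original_url}"


def fetch_with_backoff(fetch, url, max_retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url), retrying on network errors with exponentially growing delays.

    Delays are base_delay * 2**attempt (2s, 4s, ... with the defaults).
    OSError covers built-in timeouts and connection errors; the exceptions
    raised by requests also derive from it.
    """
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except OSError:
            if attempt == max_retries - 1:
                raise  # out of retries, let the caller decide what to do
            sleep(base_delay * 2 ** attempt)
```

In the script, `fetch` would typically wrap `requests.get(url, timeout=...)` followed by `raise_for_status()`.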

Component 4: Text Extractor

Purpose: Convert HTML to clean, comparable text
Responsibilities:

  • Parse HTML with BeautifulSoup
  • Remove non-content elements (scripts, styles, nav, etc.)
  • Locate main content area using CSS selectors
  • Preserve document structure with heading markers
  • Clean excessive whitespace
  • Functions: extract_text_from_html()
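
To illustrate the idea without third-party packages, the sketch below uses the standard-library `html.parser` as a stand-in for BeautifulSoup. It covers dropping non-content elements, marking headings, and cleaning whitespace; locating the main content area with CSS selectors is omitted. The tag sets and the `#` heading marker format are assumptions, not the script's actual choices.

```python
import re
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "nav", "header", "footer", "noscript"}
BLOCK_TAGS = {"p", "div", "li", "br", "tr", "section", "article"}
HEADING_RE = re.compile(r"h([1-6])$")


class _TextCollector(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a non-content element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        heading = HEADING_RE.match(tag)
        if heading:
            # Mark headings so structural changes stay visible in the diff.
            self.parts.append("\n" + "#" * int(heading.group(1)) + " ")
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and (HEADING_RE.match(tag) or tag in BLOCK_TAGS):
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def extract_text_sketch(html_text):
    """Return visible text, one block per line, with collapsed whitespace."""
    parser = _TextCollector()
    parser.feed(html_text)
    parser.close()
    lines = (re.sub(r"\s+", " ", line).strip() for line in "".join(parser.parts).split("\n"))
    return "\n".join(line for line in lines if line)
```

Normalising whitespace before diffing matters because archived copies of the same page often differ only in indentation or line wrapping.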

Component 5: Diff Generator

Purpose: Compare versions and identify changes
Responsibilities:

  • Generate unified diff between text versions
  • Calculate addition/deletion statistics
  • Filter out trivial diffs
  • Functions: generate_diff()
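
A sketch of these three responsibilities using `difflib.unified_diff`. The function name, the return shape, and the `min_changes` threshold for "trivial" are assumptions; the script may define triviality differently.

```python
import difflib


def generate_diff_sketch(old_text, new_text, old_label="old", new_label="new", min_changes=1):
    """Return the unified diff and change counts, or None if the diff is trivial."""
    diff_lines = list(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",
    ))
    # The first two lines are the ---/+++ file headers; skip them when counting.
    body = diff_lines[2:]
    added = sum(1 for line in body if line.startswith("+"))
    removed = sum(1 for line in body if line.startswith("-"))
    if added + removed < min_changes:
        return None
    return {"diff": "\n".join(diff_lines), "added": added, "removed": removed}
```

Skipping the two header lines by position, rather than by matching `+++` or `---`, avoids miscounting content lines that happen to start with those characters.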

Component 6: Report Generator

Purpose: Produce final HTML output
Responsibilities:

  • Build HTML document with embedded CSS
  • Render executive summary and statistics
  • Create version timeline with links
  • Format diff sections with syntax highlighting
  • Functions: generate_html_report()
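
One way the diff sections could be rendered is sketched below: each diff line is escaped with `html.escape` and given a CSS class that the embedded stylesheet can colour. The function name and class names (`add`, `del`, `hunk`, `meta`, `ctx`) are hypothetical.

```python
import html


def render_diff_section(title, diff_text):
    """Render one unified diff as an HTML section with a CSS class per line type."""
    rows = []
    for line in diff_text.splitlines():
        if line.startswith(("+++", "---")):
            css = "meta"   # file header lines
        elif line.startswith("+"):
            css = "add"
        elif line.startswith("-"):
            css = "del"
        elif line.startswith("@@"):
            css = "hunk"
        else:
            css = "ctx"    # unchanged context
        # Escape archived page text so it cannot inject markup into the report.
        rows.append(f'<div class="{css}">{html.escape(line)}</div>')
    return f"<section><h3>{html.escape(title)}</h3>\n" + "\n".join(rows) + "\n</section>"
```

Escaping every line is what keeps archived page content, which the tool does not control, from being interpreted as HTML in the report.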

Data Flow

[Target URL + Date Range]
         │
         ▼
┌─────────────────────────┐
│  CDX API Query          │ ──→ Snapshot metadata (timestamp, digest)
└─────────────────────────┘
         │
         ▼
┌─────────────────────────┐
│  Deduplication          │ ──→ Unique snapshots only
└─────────────────────────┘
         │
         ▼
┌─────────────────────────┐
│  Content Fetching       │ ──→ Raw HTML for each version
│  (rate limited)         │
└─────────────────────────┘
         │
         ▼
┌─────────────────────────┐
│  Text Extraction        │ ──→ Clean text content
└─────────────────────────┘
         │
         ▼
┌─────────────────────────┐
│  Diff Generation        │ ──→ Changes between versions
└─────────────────────────┘
         │
         ▼
┌─────────────────────────┐
│  HTML Report Generation │ ──→ Final report file
└─────────────────────────┘
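
The flow above can be read as a straight-line function. In the sketch below each stage is passed in as a callable, so the wiring is visible without committing to any implementation. The function name, the parameter names, and the choice to diff each version against the one immediately before it are assumptions.

```python
def run_pipeline(target_url, from_date, to_date, *,
                 discover, dedupe, fetch, extract, diff, report):
    """Run the stages in order and return whatever the report stage produces."""
    snapshots = dedupe(discover(target_url, from_date, to_date))
    texts = [(snap, extract(fetch(snap))) for snap in snapshots]

    changes = []
    # Compare each version with its immediate predecessor.
    for (old_snap, old_text), (new_snap, new_text) in zip(texts, texts[1:]):
        result = diff(old_text, new_text)
        if result is not None:  # None means the diff was filtered out as trivial
            changes.append((old_snap, new_snap, result))

    return report(snapshots, changes)
```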

Technology Stack

| Layer | Technology | Rationale |
| --- | --- | --- |
| Language | Python 3.x | Rapid development, rich ecosystem |
| HTTP Client | requests | De facto standard, simple API |
| HTML Parser | BeautifulSoup4 | Robust, forgiving HTML parsing |
| Diff Engine | difflib (stdlib) | Unified diff format, no dependencies |
| Date Handling | datetime (stdlib) | Timestamp parsing and formatting |
| Output | HTML + CSS | Universal, no dependencies to view |

Directory Structure

/WaybackDiff
├── PROJECT.md                           # Project overview and scope
├── REQUIREMENTS.md                      # Detailed requirements
├── ARCHITECTURE.md                      # This file
├── README.md                            # User documentation
├── requirements.txt                     # Python dependencies
├── wayback_tos_diff_report.py          # Main script
├── changes_report.html                  # Generated output (gitignored)
├── snapshots/                           # Saved HTML snapshots (gitignored)
└── Devlog/                             # Development logs
    └── .gitkeep

Key Design Decisions

Decision 1: Single-file script

Context: Need a tool that's easy to run and share
Decision: Keep everything in one Python file with no custom modules
Consequences: Easy to distribute; harder to test individual components; may need refactoring if scope grows

Decision 2: Hardcoded configuration

Context: Users need to specify target URL and date range
Decision: Configuration variables at top of script rather than CLI arguments
Consequences: Easy to modify for power users; less flexible for ad-hoc usage; good candidate for future enhancement
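
For illustration, such a configuration block might look like the following. All variable names and the URL and date values are hypothetical; the delay, retry count, and output filename mirror figures stated elsewhere in this document.

```python
# Hypothetical configuration block at the top of the script.
TARGET_URL = "example.com/terms"     # placeholder page to track
FROM_DATE = "20200101"               # placeholder start of range, YYYYMMDD
TO_DATE = "20241231"                 # placeholder end of range, YYYYMMDD
REQUEST_DELAY_SECONDS = 1.5          # fixed delay from Decision 4
MAX_RETRIES = 3                      # retry count from the error handling table
OUTPUT_FILE = "changes_report.html"  # output name from the directory structure
```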

Decision 3: Embedded CSS in HTML output

Context: Report needs to be standalone and shareable
Decision: Inline all styles in the HTML file
Consequences: Single file output; no external dependencies; larger file size; harder to customize styling

Decision 4: Rate limiting via sleep

Context: Wayback Machine API has usage limits
Decision: Fixed 1.5s delay between requests
Consequences: Respectful to API; slower execution; could be optimized with adaptive rate limiting
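
A minimal sketch of sleep-based rate limiting, assuming the delay is applied between consecutive requests rather than before the first one. `fetch_all` and `fetch_one` are hypothetical names.

```python
import time


def fetch_all(snapshots, fetch_one, delay_seconds=1.5, sleep=time.sleep):
    """Fetch each snapshot in order, pausing a fixed time between requests."""
    results = []
    for index, snap in enumerate(snapshots):
        if index > 0:
            sleep(delay_seconds)  # no pause is needed before the first request
        results.append(fetch_one(snap))
    return results
```

With N unique snapshots this adds (N - 1) × 1.5 seconds to a run, which is the "slower execution" cost noted above.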


Security Considerations

  • No authentication or secrets management required
  • User-Agent spoofing to avoid bot detection (common practice)
  • No user input sanitization needed (hardcoded config)
  • HTML output escapes user-controlled content to prevent XSS

Error Handling Strategy

| Error Type | Handling |
| --- | --- |
| Network timeout | Retry up to 3 times with 2s delay |
| HTTP errors | Log and skip that snapshot, continue processing |
| Invalid JSON | Log error, return empty result |
| No snapshots found | Exit gracefully with helpful message |
| Insufficient versions | Exit gracefully (need ≥2 for diff) |
| Content extraction failure | Skip version, continue with others |
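
Two of these rows, sketched under assumptions: the helper names, the logger name, and the exit messages are hypothetical, and the script may report errors differently (for example with `print`).

```python
import json
import logging
import sys

log = logging.getLogger("wayback_diff")  # hypothetical logger name


def parse_json_or_empty(body):
    """Invalid JSON: log the error and return an empty result."""
    try:
        return json.loads(body)
    except json.JSONDecodeError as exc:
        log.error("Invalid JSON from CDX API: %s", exc)
        return []


def require_two_versions(snapshots):
    """No snapshots / insufficient versions: exit with a helpful message."""
    if not snapshots:
        sys.exit("No snapshots found for the configured URL and date range.")
    if len(snapshots) < 2:
        sys.exit("Need at least 2 unique versions to generate a diff.")
```

Passing a string to `sys.exit` prints it to stderr and exits with status 1, so the message reaches the user without a traceback.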

Future Enhancement Opportunities

  1. CLI Arguments: Accept URL, date range, output file via argparse
  2. Multiple URLs: Compare changes across related pages
  3. Export Formats: Add PDF, Markdown, JSON output options
  4. Caching: Store fetched content to avoid re-downloading
  5. Async Fetching: Use aiohttp for parallel requests (with rate limiting)
  6. Scheduled Runs: Add cron/scheduler support for monitoring
  7. Notifications: Email/webhook when changes detected
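
As a starting point for enhancement 1, the hardcoded configuration could be replaced by something like the sketch below. The flag names and defaults are hypothetical; only the use of `argparse` and the URL, date range, and output file inputs come from the list above.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options; argv=None means read from sys.argv."""
    parser = argparse.ArgumentParser(description="Diff archived versions of a page.")
    parser.add_argument("url", help="page to track, e.g. example.com/terms")
    parser.add_argument("--from-date", default=None, help="start of range, YYYYMMDD")
    parser.add_argument("--to-date", default=None, help="end of range, YYYYMMDD")
    parser.add_argument("--output", default="changes_report.html", help="report file path")
    return parser.parse_args(argv)
```

Keeping the current values as defaults would let existing users run the script unchanged while allowing ad-hoc overrides.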