Data Diff Checker

A memory-optimized CSV diff tool with streaming processing for comparing large datasets between environments.

╔══════════════════════════════════════════════════════════════════════════════╗
║   ██████╗  █████╗ ████████╗ █████╗     ██████╗ ██╗███████╗███████╗           ║
║   ██╔══██╗██╔══██╗╚══██╔══╝██╔══██╗    ██╔══██╗██║██╔════╝██╔════╝           ║
║   ██║  ██║███████║   ██║   ███████║    ██║  ██║██║█████╗  █████╗             ║
║   ██║  ██║██╔══██║   ██║   ██╔══██║    ██║  ██║██║██╔══╝  ██╔══╝             ║
║   ██████╔╝██║  ██║   ██║   ██║  ██║    ██████╔╝██║██║     ██║                ║
║   ╚═════╝ ╚═╝  ╚═╝   ╚═╝   ╚═╝  ╚═╝    ╚═════╝ ╚═╝╚═╝     ╚═╝                ║
║            ██████╗██╗  ██╗███████╗ ██████╗██╗  ██╗███████╗██████╗            ║
║           ██╔════╝██║  ██║██╔════╝██╔════╝██║ ██╔╝██╔════╝██╔══██╗           ║
║           ██║     ███████║█████╗  ██║     █████╔╝ █████╗  ██████╔╝           ║
║           ╚██████╗██║  ██║███████╗╚██████╗██║  ██╗███████╗██║  ██║           ║
║            ╚═════╝╚═╝  ╚═╝╚══════╝ ╚═════╝╚═╝  ╚═╝╚══════╝╚═╝  ╚═╝           ║
╚══════════════════════════════════════════════════════════════════════════════╝

Features

Memory Efficient: True streaming CSV processing - never loads entire files into memory
Fast Comparison: Hash-based row comparison with two-pass algorithm
Composite Keys: Support for multi-column primary keys
Smart Detection: Auto-detects CSV delimiters and escape styles
Parallel Processing: Concurrent fetches and diffs with configurable parallelism
Rich Output: Detailed JSON reports with example IDs and line numbers
Cross-Platform: Works on Linux, macOS, and Windows (with ANSI support)

Installation

From GitHub

# Clone the repository
git clone https://github.com/snowthen-o7/data-diff-checker.git
cd data-diff-checker

# Upgrade pip first (required for editable installs with pyproject.toml)
pip3 install --upgrade pip

# Install in editable mode (recommended for easy updates)
pip3 install -e .

# Or install normally
pip3 install .

Note: Editable installs (-e) require pip 21.3 or newer. If you see an error about setup.py not found, upgrade pip first with pip3 install --upgrade pip.

Running the tool

After installation, you can run the tool in several ways:

# Option 1: Direct command (if ~/.local/bin is in your PATH)
data-diff --help

# Option 2: Run as a Python module (always works)
python3 -m data_diff_checker --help

If data-diff gives "command not found", either:

Add ~/.local/bin to your PATH: export PATH="$HOME/.local/bin:$PATH" (add to ~/.zshrc or ~/.bashrc)
Or just use python3 -m data_diff_checker instead

Updating

If you installed in editable mode (pip3 install -e .):

cd data-diff-checker
git pull
# Changes are immediately available - no reinstall needed!

If you installed normally (pip3 install .):

cd data-diff-checker
git pull
pip3 install .  # Reinstall after pulling

Quick Start

Compare two local CSV files

python3 -m data_diff_checker --local-prod production.csv --local-dev development.csv

Use a composite primary key

python3 -m data_diff_checker --local-prod prod.csv --local-dev dev.csv --primary-key "sku,locale"

Batch process a folder of file pairs

python3 -m data_diff_checker --local-folder ./responses/

Quick test with row limit

python3 -m data_diff_checker --local-prod large_prod.csv --local-dev large_dev.csv --diff-rows 1000

Usage

Modes of Operation

1. Local File Mode

Compare two CSV files directly:

python3 -m data_diff_checker --local-prod baseline.csv --local-dev compare.csv

2. Folder Mode

Batch process file pairs matching the pattern prod_response_<N>_<hash>.txt and dev_response_<N>_<hash>.txt:

python3 -m data_diff_checker --local-folder ./test_responses/

3. URL Mode

Fetch and compare CSV responses from APIs:

python3 -m data_diff_checker --params-file test_cases.csv \
  --prod-url "https://api.prod.example.com/endpoint" \
  --dev-url "https://api.dev.example.com/endpoint"

The params file should be a CSV with a params column containing URL query strings:

params
"connection_info[shop_name]=store1&connection_info[api_key]=xxx"
"connection_info[shop_name]=store2&connection_info[api_key]=yyy"

Authentication and custom headers

Attach a JWT or arbitrary headers to every request with CLI flags:

# Bearer token on every fetch
python3 -m data_diff_checker --params-file test_cases.csv --jwt "$MY_JWT"

# Arbitrary headers (repeat -H for multiple)
python3 -m data_diff_checker --params-file test_cases.csv \
  -H "X-Api-Key=abc123" -H "X-Tenant=store1"

For per-row headers, add an optional headers column to the params CSV containing a JSON object. Per-row headers override CLI headers on key conflicts; empty or missing cells fall back to CLI headers.

params,headers
"connection_info[shop_name]=store1","{""Authorization"":""Bearer token-for-store1""}"
"connection_info[shop_name]=store2","{""X-Tenant"":""store2"",""Authorization"":""Bearer token-for-store2""}"
"connection_info[shop_name]=store3",

Precedence (lowest → highest): jwt from .data-diff.json → --jwt → --header / -H → headers column.

Command Line Options

Core Options:
  --primary-key, -k     Primary key column(s) for row matching (default: id)
  --timeout, -t         HTTP request timeout in seconds (default: 900)
  --max-examples, -m    Max example IDs in output (default: 10)
  --diff-rows, -r       Max rows to process per file (default: no limit)
  --source-limit, -l    Limit test cases from params file

Input Sources:
  --local-prod          Local production CSV file
  --local-dev           Local development CSV file
  --local-folder, -f    Folder with response file pairs
  --params-file, -p     CSV file with URL parameters (default: params.csv)

URL Mode:
  --prod-url            Base URL for production environment
  --dev-url             Base URL for development environment
  --jwt                 Bearer token sent as Authorization header on every fetch
  --header, -H          Extra header (KEY=VALUE). Repeatable. Overrides --jwt
                        when KEY is Authorization.

Concurrency:
  --max-concurrent-fetches, -F   Max parallel URL fetches (default: 250)
  --max-concurrent-diffs, -c     Max parallel diff operations (default: 10)

Output:
  --output-dir, -o      Directory for response files (default: responses)
  --summary-dir, -s     Directory for JSON summaries (default: summaries)

Other:
  --verbose, -v         Enable verbose output
  --help, -h            Show help message

Output Format

Local Mode Summary

{
  "mode": "local",
  "prod_file": "production.csv",
  "dev_file": "development.csv",
  "rows_added": 5,
  "rows_removed": 2,
  "rows_updated": 150,
  "rows_updated_excluded_only": 30,
  "detailed_key_update_counts": {
    "title": 45,
    "price": 30,
    "description": 75
  },
  "example_ids": {
    "SKU-001": {"prod_line_num": 42, "dev_line_num": 43},
    "SKU-002": {"prod_line_num": 108, "dev_line_num": 110}
  },
  "common_keys": ["id", "title", "price", "availability"],
  "prod_only_keys": ["legacy_field"],
  "dev_only_keys": ["new_field"],
  "prod_row_count": 10000,
  "dev_row_count": 10003,
  "runtime_seconds": 2.34
}

URL Mode Summary (per test case)

URL mode emits one object per test case inside the diffs_summary_*.json arrays. Additional fields on top of the local-mode shape:

{
  "test_case": 0,
  "diff_percentage": 0.15,
  "prod_row_count": 10000,
  "dev_row_count": 10003,
  "prod_status": 200,
  "dev_status": 200,
  "shop_name": "store1",
  "request_params": "connection_info[shop_name]=store1",
  "rows_added": 5,
  "rows_removed": 2,
  "rows_updated": 150,
  "rows_updated_excluded_only": 30,
  "detailed_key_update_counts": {
    "title": 45,
    "price": 30
  },
  "example_ids": {
    "SKU-001": {"prod_line_num": 42, "dev_line_num": 43}
  },
  "example_ids_added": ["SKU-900", "SKU-901"],
  "example_ids_removed": ["SKU-050"],
  "common_keys": ["id", "title", "price", "availability"],
  "prod_only_keys": [],
  "dev_only_keys": ["new_field"],
  "prod_in_stock_percentage": 87.4,
  "dev_in_stock_percentage": 87.1,
  "in_stock_percentage_difference": 0.3,
  "prod_fetch_duration_seconds": 1.42,
  "dev_fetch_duration_seconds": 1.61,
  "runtime_seconds": 3.58
}

Error rows keep test_case, prod_status, dev_status, fetch durations, and runtime_seconds, plus error (object with msg) and non_200: true. shop_name, request_params, example_ids*, and *_in_stock_percentage are only present when the underlying data provides them.

Key Metrics

Field	Description
`rows_added`	Rows present in dev but not in prod
`rows_removed`	Rows present in prod but not in dev
`rows_updated`	Rows with meaningful changes (excludes inventory/availability columns)
`rows_updated_excluded_only`	Rows where only excluded columns changed
`detailed_key_update_counts`	Per-column change counts
`example_ids`	Sample changed rows with line numbers for debugging
`prod_fetch_duration_seconds`	URL mode only — wall-clock time to fetch the prod response
`dev_fetch_duration_seconds`	URL mode only — wall-clock time to fetch the dev response

URL Mode On-Disk Layout

URL mode creates timestamped run folders:

responses/
└── 20241126_143052_params_pk_sku-locale/
    ├── run_metadata.json          # Run configuration
    ├── summary.json               # Diff results
    ├── prod_response_0_abc123.txt # Response files
    ├── dev_response_0_abc123.txt
    └── ...

summaries/
├── diffs_summary_<timestamp>.json         # All results
├── diffs_summary_updates_<timestamp>.json # Only differences  
└── diffs_summary_errors_<timestamp>.json  # Only errors

Memory Optimizations

Data Diff Checker is designed to handle large files efficiently:

Streaming I/O: Rows are processed one at a time, never loading entire files
Hash-based comparison: Stores MD5 hashes instead of full row data
Cached metadata: Headers and row counts are computed once and cached
Two-pass algorithm: Quick hash comparison first, detailed diff only for changes
Incremental GC: Garbage collection between operations

Python API

from data_diff_checker import StreamingCSVReader, EfficientDiffer

# Read CSV with automatic format detection
reader = StreamingCSVReader("data.csv")
headers = reader.read_headers()
for row in reader.iterate_rows():
    print(row)

# Compare two files
differ = EfficientDiffer(primary_keys=["id"])
result = differ.compute_diff("prod.csv", "dev.csv")
print(f"Rows updated: {result['rows_updated']}")

Configuration

Excluded Columns

By default, columns containing these patterns are excluded from "meaningful change" detection:

inventory
availability
_fdx

Changes to these columns are tracked separately in rows_updated_excluded_only.

Customizing via Code

from data_diff_checker import EfficientDiffer

differ = EfficientDiffer(
    primary_keys=["sku", "locale"],
    max_examples=20,
    max_rows=10000,  # Limit for testing
    excluded_patterns=["inventory", "stock", "qty"],
    case_sensitive=True,  # Set to False to ignore case differences
    trim_whitespace=True,  # Set to False to preserve leading/trailing spaces
)

Comparison Options

Option	Default	Description
`primary_keys`	Required	Column(s) that uniquely identify rows
`max_examples`	10	Maximum example IDs to collect per change type
`max_rows`	None	Limit rows processed per file (for testing large files)
`excluded_patterns`	`["inventory", "availability", "_fdx"]`	Column patterns excluded from meaningful changes
`case_sensitive`	`True`	Whether to treat "ABC" and "abc" as different
`trim_whitespace`	`True`	Whether to ignore leading/trailing whitespace

Local Configuration File

Create a .data-diff.json file in your project directory to set defaults (this file is gitignored):

{
  "prod_url": "https://api.prod.example.com/endpoint",
  "dev_url": "https://api.dev.example.com/endpoint",
  "dedup_keys": ["connection_info[store_hash]"],
  "jwt": "eyJhbGciOi..."
}

The jwt key, if set, is used as a default for --jwt (sent as Authorization: Bearer <token>). Keep it only in the gitignored .data-diff.json, never in the params CSV you commit.

With this config, you can run URL mode without specifying URLs every time:

python3 -m data_diff_checker --params-file test_cases.csv

The tool searches for .data-diff.json in the current directory and parent directories (up to your home directory).

Development

Setup

git clone https://github.com/snowthen-o7/data-diff-checker.git
cd data-diff-checker
pip3 install --upgrade pip
pip3 install -e ".[dev]"

Run Tests

pytest

Code Quality

# Format
black src tests

# Lint  
ruff check src tests

# Type check
mypy src

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Changelog

1.3.0 (2026)

Added prod_fetch_duration_seconds and dev_fetch_duration_seconds to URL-mode summaries (including non-200 / error rows) so per-environment fetch latency is visible

1.2.0 (2026)

Added --jwt flag for sending Authorization: Bearer <token> on URL-mode fetches
Added repeatable --header / -H KEY=VALUE flag for arbitrary request headers
Added optional headers column in the params CSV (JSON object) for per-row header overrides
Added jwt key support in .data-diff.json for storing the default token locally

1.1.0 (2025)

Added case_sensitive option for case-insensitive comparison
Added trim_whitespace option to control whitespace handling

1.0.0 (2025)

Initial public release
Modular package structure
Streaming CSV processing
Hash-based diff algorithm
Composite primary key support
Parallel processing for URL mode
Cross-platform terminal UI

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
src/data_diff_checker		src/data_diff_checker
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
data_diff_checker_standalone.py		data_diff_checker_standalone.py
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

Data Diff Checker

Features

Installation

From GitHub

Running the tool

Updating

Quick Start

Compare two local CSV files

Use a composite primary key

Batch process a folder of file pairs

Quick test with row limit

Usage

Modes of Operation

1. Local File Mode

2. Folder Mode

3. URL Mode

Authentication and custom headers

Command Line Options

Output Format

Local Mode Summary

URL Mode Summary (per test case)

Key Metrics

URL Mode On-Disk Layout

Memory Optimizations

Python API

Configuration

Excluded Columns

Customizing via Code

Comparison Options

Local Configuration File

Development

Setup

Run Tests

Code Quality

Contributing

License

Changelog

1.3.0 (2026)

1.2.0 (2026)

1.1.0 (2025)

1.0.0 (2025)

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages