A memory-optimized CSV diff tool with streaming processing for comparing large datasets between environments.
╔══════════════════════════════════════════════════════════════════════════════╗
║ ██████╗ █████╗ ████████╗ █████╗ ██████╗ ██╗███████╗███████╗ ║
║ ██╔══██╗██╔══██╗╚══██╔══╝██╔══██╗ ██╔══██╗██║██╔════╝██╔════╝ ║
║ ██║ ██║███████║ ██║ ███████║ ██║ ██║██║█████╗ █████╗ ║
║ ██║ ██║██╔══██║ ██║ ██╔══██║ ██║ ██║██║██╔══╝ ██╔══╝ ║
║ ██████╔╝██║ ██║ ██║ ██║ ██║ ██████╔╝██║██║ ██║ ║
║ ╚═════╝ ╚═╝ ╚═╝ ╚═╝ ╚═╝ ╚═╝ ╚═════╝ ╚═╝╚═╝ ╚═╝ ║
║ ██████╗██╗ ██╗███████╗ ██████╗██╗ ██╗███████╗██████╗ ║
║ ██╔════╝██║ ██║██╔════╝██╔════╝██║ ██╔╝██╔════╝██╔══██╗ ║
║ ██║ ███████║█████╗ ██║ █████╔╝ █████╗ ██████╔╝ ║
║ ╚██████╗██║ ██║███████╗╚██████╗██║ ██╗███████╗██║ ██║ ║
║ ╚═════╝╚═╝ ╚═╝╚══════╝ ╚═════╝╚═╝ ╚═╝╚══════╝╚═╝ ╚═╝ ║
╚══════════════════════════════════════════════════════════════════════════════╝
- Memory Efficient: True streaming CSV processing - never loads entire files into memory
- Fast Comparison: Hash-based row comparison with two-pass algorithm
- Composite Keys: Support for multi-column primary keys
- Smart Detection: Auto-detects CSV delimiters and escape styles
- Parallel Processing: Concurrent fetches and diffs with configurable parallelism
- Rich Output: Detailed JSON reports with example IDs and line numbers
- Cross-Platform: Works on Linux, macOS, and Windows (with ANSI support)
# Clone the repository
git clone https://github.com/snowthen-o7/data-diff-checker.git
cd data-diff-checker
# Upgrade pip first (required for editable installs with pyproject.toml)
pip3 install --upgrade pip
# Install in editable mode (recommended for easy updates)
pip3 install -e .
# Or install normally
pip3 install .Note: Editable installs (
-e) require pip 21.3 or newer. If you see an error aboutsetup.pynot found, upgrade pip first withpip3 install --upgrade pip.
After installation, you can run the tool in several ways:
# Option 1: Direct command (if ~/.local/bin is in your PATH)
data-diff --help
# Option 2: Run as a Python module (always works)
python3 -m data_diff_checker --helpIf data-diff gives "command not found", either:
- Add
~/.local/binto your PATH:export PATH="$HOME/.local/bin:$PATH"(add to~/.zshrcor~/.bashrc) - Or just use
python3 -m data_diff_checkerinstead
If you installed in editable mode (pip3 install -e .):
cd data-diff-checker
git pull
# Changes are immediately available - no reinstall needed!If you installed normally (pip3 install .):
cd data-diff-checker
git pull
pip3 install . # Reinstall after pullingpython3 -m data_diff_checker --local-prod production.csv --local-dev development.csvpython3 -m data_diff_checker --local-prod prod.csv --local-dev dev.csv --primary-key "sku,locale"python3 -m data_diff_checker --local-folder ./responses/python3 -m data_diff_checker --local-prod large_prod.csv --local-dev large_dev.csv --diff-rows 1000Compare two CSV files directly:
python3 -m data_diff_checker --local-prod baseline.csv --local-dev compare.csvBatch process file pairs matching the pattern prod_response_<N>_<hash>.txt and dev_response_<N>_<hash>.txt:
python3 -m data_diff_checker --local-folder ./test_responses/Fetch and compare CSV responses from APIs:
python3 -m data_diff_checker --params-file test_cases.csv \
--prod-url "https://api.prod.example.com/endpoint" \
--dev-url "https://api.dev.example.com/endpoint"The params file should be a CSV with a params column containing URL query strings:
params
"connection_info[shop_name]=store1&connection_info[api_key]=xxx"
"connection_info[shop_name]=store2&connection_info[api_key]=yyy"Attach a JWT or arbitrary headers to every request with CLI flags:
# Bearer token on every fetch
python3 -m data_diff_checker --params-file test_cases.csv --jwt "$MY_JWT"
# Arbitrary headers (repeat -H for multiple)
python3 -m data_diff_checker --params-file test_cases.csv \
-H "X-Api-Key=abc123" -H "X-Tenant=store1"For per-row headers, add an optional headers column to the params CSV containing a JSON object. Per-row headers override CLI headers on key conflicts; empty or missing cells fall back to CLI headers.
params,headers
"connection_info[shop_name]=store1","{""Authorization"":""Bearer token-for-store1""}"
"connection_info[shop_name]=store2","{""X-Tenant"":""store2"",""Authorization"":""Bearer token-for-store2""}"
"connection_info[shop_name]=store3",Precedence (lowest → highest): jwt from .data-diff.json → --jwt → --header / -H → headers column.
Core Options:
--primary-key, -k Primary key column(s) for row matching (default: id)
--timeout, -t HTTP request timeout in seconds (default: 900)
--max-examples, -m Max example IDs in output (default: 10)
--diff-rows, -r Max rows to process per file (default: no limit)
--source-limit, -l Limit test cases from params file
Input Sources:
--local-prod Local production CSV file
--local-dev Local development CSV file
--local-folder, -f Folder with response file pairs
--params-file, -p CSV file with URL parameters (default: params.csv)
URL Mode:
--prod-url Base URL for production environment
--dev-url Base URL for development environment
--jwt Bearer token sent as Authorization header on every fetch
--header, -H Extra header (KEY=VALUE). Repeatable. Overrides --jwt
when KEY is Authorization.
Concurrency:
--max-concurrent-fetches, -F Max parallel URL fetches (default: 250)
--max-concurrent-diffs, -c Max parallel diff operations (default: 10)
Output:
--output-dir, -o Directory for response files (default: responses)
--summary-dir, -s Directory for JSON summaries (default: summaries)
Other:
--verbose, -v Enable verbose output
--help, -h Show help message
{
"mode": "local",
"prod_file": "production.csv",
"dev_file": "development.csv",
"rows_added": 5,
"rows_removed": 2,
"rows_updated": 150,
"rows_updated_excluded_only": 30,
"detailed_key_update_counts": {
"title": 45,
"price": 30,
"description": 75
},
"example_ids": {
"SKU-001": {"prod_line_num": 42, "dev_line_num": 43},
"SKU-002": {"prod_line_num": 108, "dev_line_num": 110}
},
"common_keys": ["id", "title", "price", "availability"],
"prod_only_keys": ["legacy_field"],
"dev_only_keys": ["new_field"],
"prod_row_count": 10000,
"dev_row_count": 10003,
"runtime_seconds": 2.34
}URL mode emits one object per test case inside the diffs_summary_*.json arrays. Additional fields on top of the local-mode shape:
{
"test_case": 0,
"diff_percentage": 0.15,
"prod_row_count": 10000,
"dev_row_count": 10003,
"prod_status": 200,
"dev_status": 200,
"shop_name": "store1",
"request_params": "connection_info[shop_name]=store1",
"rows_added": 5,
"rows_removed": 2,
"rows_updated": 150,
"rows_updated_excluded_only": 30,
"detailed_key_update_counts": {
"title": 45,
"price": 30
},
"example_ids": {
"SKU-001": {"prod_line_num": 42, "dev_line_num": 43}
},
"example_ids_added": ["SKU-900", "SKU-901"],
"example_ids_removed": ["SKU-050"],
"common_keys": ["id", "title", "price", "availability"],
"prod_only_keys": [],
"dev_only_keys": ["new_field"],
"prod_in_stock_percentage": 87.4,
"dev_in_stock_percentage": 87.1,
"in_stock_percentage_difference": 0.3,
"prod_fetch_duration_seconds": 1.42,
"dev_fetch_duration_seconds": 1.61,
"runtime_seconds": 3.58
}Error rows keep test_case, prod_status, dev_status, fetch durations, and runtime_seconds, plus error (object with msg) and non_200: true. shop_name, request_params, example_ids*, and *_in_stock_percentage are only present when the underlying data provides them.
| Field | Description |
|---|---|
rows_added |
Rows present in dev but not in prod |
rows_removed |
Rows present in prod but not in dev |
rows_updated |
Rows with meaningful changes (excludes inventory/availability columns) |
rows_updated_excluded_only |
Rows where only excluded columns changed |
detailed_key_update_counts |
Per-column change counts |
example_ids |
Sample changed rows with line numbers for debugging |
prod_fetch_duration_seconds |
URL mode only — wall-clock time to fetch the prod response |
dev_fetch_duration_seconds |
URL mode only — wall-clock time to fetch the dev response |
URL mode creates timestamped run folders:
responses/
└── 20241126_143052_params_pk_sku-locale/
├── run_metadata.json # Run configuration
├── summary.json # Diff results
├── prod_response_0_abc123.txt # Response files
├── dev_response_0_abc123.txt
└── ...
summaries/
├── diffs_summary_<timestamp>.json # All results
├── diffs_summary_updates_<timestamp>.json # Only differences
└── diffs_summary_errors_<timestamp>.json # Only errors
Data Diff Checker is designed to handle large files efficiently:
- Streaming I/O: Rows are processed one at a time, never loading entire files
- Hash-based comparison: Stores MD5 hashes instead of full row data
- Cached metadata: Headers and row counts are computed once and cached
- Two-pass algorithm: Quick hash comparison first, detailed diff only for changes
- Incremental GC: Garbage collection between operations
from data_diff_checker import StreamingCSVReader, EfficientDiffer
# Read CSV with automatic format detection
reader = StreamingCSVReader("data.csv")
headers = reader.read_headers()
for row in reader.iterate_rows():
print(row)
# Compare two files
differ = EfficientDiffer(primary_keys=["id"])
result = differ.compute_diff("prod.csv", "dev.csv")
print(f"Rows updated: {result['rows_updated']}")By default, columns containing these patterns are excluded from "meaningful change" detection:
inventoryavailability_fdx
Changes to these columns are tracked separately in rows_updated_excluded_only.
from data_diff_checker import EfficientDiffer
differ = EfficientDiffer(
primary_keys=["sku", "locale"],
max_examples=20,
max_rows=10000, # Limit for testing
excluded_patterns=["inventory", "stock", "qty"],
case_sensitive=True, # Set to False to ignore case differences
trim_whitespace=True, # Set to False to preserve leading/trailing spaces
)| Option | Default | Description |
|---|---|---|
primary_keys |
Required | Column(s) that uniquely identify rows |
max_examples |
10 | Maximum example IDs to collect per change type |
max_rows |
None | Limit rows processed per file (for testing large files) |
excluded_patterns |
["inventory", "availability", "_fdx"] |
Column patterns excluded from meaningful changes |
case_sensitive |
True |
Whether to treat "ABC" and "abc" as different |
trim_whitespace |
True |
Whether to ignore leading/trailing whitespace |
Create a .data-diff.json file in your project directory to set defaults (this file is gitignored):
{
"prod_url": "https://api.prod.example.com/endpoint",
"dev_url": "https://api.dev.example.com/endpoint",
"dedup_keys": ["connection_info[store_hash]"],
"jwt": "eyJhbGciOi..."
}The jwt key, if set, is used as a default for --jwt (sent as Authorization: Bearer <token>). Keep it only in the gitignored .data-diff.json, never in the params CSV you commit.
With this config, you can run URL mode without specifying URLs every time:
python3 -m data_diff_checker --params-file test_cases.csvThe tool searches for .data-diff.json in the current directory and parent directories (up to your home directory).
git clone https://github.com/snowthen-o7/data-diff-checker.git
cd data-diff-checker
pip3 install --upgrade pip
pip3 install -e ".[dev]"pytest# Format
black src tests
# Lint
ruff check src tests
# Type check
mypy srcContributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (
git checkout -b feature/amazing-feature) - Commit your changes (
git commit -m 'Add amazing feature') - Push to the branch (
git push origin feature/amazing-feature) - Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Added
prod_fetch_duration_secondsanddev_fetch_duration_secondsto URL-mode summaries (including non-200 / error rows) so per-environment fetch latency is visible
- Added
--jwtflag for sendingAuthorization: Bearer <token>on URL-mode fetches - Added repeatable
--header/-H KEY=VALUEflag for arbitrary request headers - Added optional
headerscolumn in the params CSV (JSON object) for per-row header overrides - Added
jwtkey support in.data-diff.jsonfor storing the default token locally
- Added
case_sensitiveoption for case-insensitive comparison - Added
trim_whitespaceoption to control whitespace handling
- Initial public release
- Modular package structure
- Streaming CSV processing
- Hash-based diff algorithm
- Composite primary key support
- Parallel processing for URL mode
- Cross-platform terminal UI