DatasetVision is a robust, production-ready CLI engineered to enforce strict data governance, detect dataset drift, and monitor data health for computer vision workflows. Catch data anomalies before your model decays.
Whether you're battling label noise, near-duplicates, or semantic shifts in production datasets, DatasetVision provides fast analysis layers that validate your data pipelines deterministically.
- 🛡️ Anomaly Detection Layer: Detect anomalous and out-of-distribution images using deep cv2 feature embeddings and Z-score outlier analysis.
- 📉 Data Drift Intelligence: Compare two datasets and quantify semantic drift using Centroid Distance Tracking and Class Anomaly Tracking.
- 🔍 Dataset Scanner: Automatically flag corrupted files, blank images, and extreme aspect ratios.
- 👯 Duplicate Hunter: Locate exact duplicates (via MD5 hashing) and near-duplicates (via Perceptual Hashing and Hamming distance).
- 📋 Governance Check: Enforce rules on class imbalance and label noise, with strict CI/CD pipeline compatibility.
- 📊 HTML Reports: Export shareable visual reports of your dataset's structural health to a local file.
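The anomaly and drift features above both work on image embeddings. The following is a minimal sketch of the general idea, not DatasetVision's actual implementation: it assumes embeddings are already available as plain lists of floats, and the function names (`centroid`, `centroid_distance`, `zscore_outliers`) and the default threshold of 3.0 are hypothetical choices made for illustration.

```python
import math
import statistics


def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [statistics.fmean(column) for column in zip(*vectors)]


def centroid_distance(source, target):
    """Euclidean distance between the centroids of two embedding sets.

    A larger value suggests the two datasets occupy different regions
    of feature space, which is one simple way to quantify drift.
    """
    return math.dist(centroid(source), centroid(target))


def zscore_outliers(vectors, threshold=3.0):
    """Indices of vectors whose distance to the centroid has a Z-score above threshold."""
    center = centroid(vectors)
    distances = [math.dist(v, center) for v in vectors]
    mean = statistics.fmean(distances)
    std = statistics.pstdev(distances)
    if std == 0:
        return []  # every point is equally far from the centroid
    return [i for i, d in enumerate(distances) if (d - mean) / std > threshold]


# Hypothetical data: 20 tightly clustered embeddings plus one far-away point.
points = [[(i % 5) * 0.1, (i // 5) * 0.1] for i in range(20)] + [[10.0, 10.0]]
print(zscore_outliers(points))  # [20]
print(centroid_distance([[0.0, 0.0]], [[3.0, 4.0]]))  # 5.0
```

Note that with a population Z-score, a single outlier among n points can score at most (n - 1) / sqrt(n), so a threshold of 3.0 can only fire on datasets with at least 11 samples.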
DatasetVision requires Python 3.10+.
# Clone the repository
git clone https://github.com/nibir-ai/datasetvision.git
cd datasetvision
# Install via pip
pip install -e .
To install development dependencies (for testing):
pip install -e '.[dev]'
DatasetVision provides several intuitive CLI commands powered by Typer.
Analyze your dataset's health, anomalies, and verify it passes governance rules:
datasetvision intelligence /path/to/dataset
Evaluate domain or semantic drift between a source and target dataset:
datasetvision drift /path/to/old_data /path/to/new_data
Find blank, corrupt, or fundamentally broken images instantly:
datasetvision scan /path/to/dataset --output report.json
Discover redundant data dragging down your model training speed:
# Find near duplicates using Perceptual Hashing
datasetvision duplicates /path/to/dataset --near
# Find exact duplicates using MD5
datasetvision duplicates /path/to/dataset --exact
Export the intelligence findings to a static, self-contained HTML file:
datasetvision report /path/to/dataset output_report.html
graph TD;
CLI[CLI Layer - Typer] --> Core[Intelligence Engine];
Core --> Drift[Drift Analysis];
Core --> Anomaly[OOD Anomaly Detection];
Core --> Duplicate[Duplicates & Hashing];
Core --> Scanner[Corruption Scanner];
Core --> Policy[Governance Engine];
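To make the Duplicates & Hashing node in the diagram more concrete, here is a rough sketch of how exact and near-duplicate checks can work in principle. It is an illustration under assumptions, not the project's code: the README names MD5 and perceptual hashing but not a specific perceptual algorithm, so this uses a simple average hash on a tiny grayscale grid, and the helper names and the `max_distance=2` threshold are hypothetical.

```python
import hashlib


def md5_digest(data: bytes) -> str:
    """Hex MD5 of raw file bytes; equal digests indicate byte-identical files."""
    return hashlib.md5(data).hexdigest()


def average_hash(pixels):
    """Average hash of a small grayscale grid (rows of 0-255 ints).

    Each bit is 1 when the pixel is brighter than the grid mean.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def is_near_duplicate(pixels_a, pixels_b, max_distance=2):
    """True when the two average hashes differ in at most max_distance bits."""
    distance = hamming_distance(average_hash(pixels_a), average_hash(pixels_b))
    return distance <= max_distance


# Hypothetical 2x2 "images": a uniform brightness shift keeps the hash stable,
# while inverting the image flips every bit.
original = [[10, 200], [220, 30]]
brighter = [[20, 210], [230, 40]]
inverted = [[245, 55], [35, 225]]

print(is_near_duplicate(original, brighter))  # True
print(is_near_duplicate(original, inverted))  # False
```

In practice an image would first be downscaled (commonly to something like 8x8) and converted to grayscale with an imaging library before hashing; that step is omitted here to keep the sketch dependency-free.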
- 📜 License: MIT
- 📚 Changelog: Track our progress
- 🤝 Contributing: Help us grow
- ⚖️ Code of Conduct: Our community commitment
We welcome pull requests! Please read our Contributing Guidelines before submitting.
- Fork the repo.
- Create your feature branch (`git checkout -b feature/AmazingFeature`).
- Commit your changes (`git commit -m 'Add some AmazingFeature'`).
- Ensure tests pass (`pytest tests/`).
- Push to the branch (`git push origin feature/AmazingFeature`).
- Open a Pull Request.