
nibir-ai/datasetvision


DatasetVision 👁️

Industry-Grade Dataset Governance & Drift Intelligence CLI for Computer Vision

Python 3.10+ | License | Build Status | Code Style: Black


🚀 Overview

DatasetVision is a robust, production-ready CLI engineered to enforce strict data governance, detect dataset drift, and monitor data health for computer vision workflows. Catch data anomalies before your model decays.

Whether you're battling label noise, near-duplicates, or semantic shifts in production datasets, DatasetVision provides lightning-fast intelligence layers to validate your data pipelines deterministically.


✨ Enterprise-Grade Features

  • 🛡️ Anomaly Detection Layer
    Detect anomalous and out-of-distribution images using deep cv2 feature embeddings and Z-score outlier analysis.
  • 📉 Data Drift Intelligence
    Compare two datasets and accurately quantify semantic drift using Centroid Distance Tracking and Class Anomaly Tracking.
  • 🔍 Dataset Scanner
    Automatically flag corrupted files, blank images, and extreme aspect ratios.
  • 👯 Duplicate Hunter
    Locate exact duplicates (via MD5 hashing) and near-duplicates (via perceptual hashing and Hamming distance).
  • 📋 Governance Check
    Enforce rules on class imbalance and label noise in a way that is compatible with CI/CD pipelines.
  • 📊 HTML Reports
    Export shareable visual reports of your dataset's structural health to a local HTML file.
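
To make the Z-score outlier analysis mentioned under Anomaly Detection more concrete, here is a minimal sketch in plain Python. It is not DatasetVision's implementation: the function name `zscore_outliers`, the threshold of 3.0, and the idea of scoring each image by a single number (for example, its embedding's distance to the dataset centroid) are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def zscore_outliers(scores, threshold=3.0):
    """Return the indices of scores whose |z-score| exceeds `threshold`.

    `scores` is a hypothetical list with one number per image, such as the
    distance of that image's embedding to the dataset centroid.
    """
    mu = mean(scores)
    sigma = pstdev(scores)  # population standard deviation
    if sigma == 0:
        return []  # every score is identical, so nothing stands out
    return [i for i, s in enumerate(scores) if abs(s - mu) / sigma > threshold]


# Ten typical images and one that sits far from the rest (made-up numbers).
scores = [1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 1.02, 0.98, 1.03, 0.97, 9.0]
print(zscore_outliers(scores))
```

One design caveat: with very small datasets a single extreme value inflates the standard deviation, so a fixed threshold of 3.0 may never trigger. A real tool would probably need a more robust statistic or a minimum sample size.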

📦 Installation

DatasetVision requires Python 3.10+.

# Clone the repository
git clone https://github.com/nibir-ai/datasetvision.git
cd datasetvision

# Install via pip
pip install -e .

To install development dependencies (for testing):

pip install -e '.[dev]'

⚡ Quickstart Guide

DatasetVision provides several intuitive CLI commands powered by typer.

1. Generate Intelligence & Enforce Policy

Analyze your dataset's health and anomalies, and verify that it passes governance rules:

datasetvision intelligence /path/to/dataset
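
As a rough illustration of what a class-imbalance governance rule could look like, the sketch below computes the ratio between the largest and smallest class and compares it to a limit. The names `imbalance_ratio` and `check_policy` and the default `max_ratio` of 10.0 are hypothetical; the rules and thresholds DatasetVision actually applies are not documented here.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Ratio of the most frequent class count to the least frequent one."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    return max(counts.values()) / min(counts.values())


def check_policy(labels, max_ratio=10.0):
    """Return (passed, ratio) for a hypothetical class-imbalance rule."""
    ratio = imbalance_ratio(labels)
    return ratio <= max_ratio, ratio


# Made-up labels: 90 cats and 10 dogs.
labels = ["cat"] * 90 + ["dog"] * 10
passed, ratio = check_policy(labels, max_ratio=10.0)
print(passed, ratio)
```

In a CI job, a check like this would typically map a failed rule to a non-zero exit code so the pipeline stops; whether the `intelligence` command behaves that way should be confirmed against the tool itself.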

2. Compare Datasets (Drift Analysis)

Evaluate domain or semantic drift between a source and target dataset:

datasetvision drift /path/to/old_data /path/to/new_data
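
The idea behind centroid distance as a drift signal can be sketched in a few lines: average the feature vectors of each dataset and measure how far apart the two averages are. This is an illustrative assumption about the technique, not the tool's code; `centroid` and `centroid_distance` are hypothetical names, and the vectors are made-up stand-ins for image embeddings.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def centroid_distance(old_vectors, new_vectors):
    """Euclidean distance between the centroids of two sets of vectors."""
    return math.dist(centroid(old_vectors), centroid(new_vectors))


# Made-up 2-D "embeddings" for an old and a new dataset.
old_data = [[-1.0, 0.0], [1.0, 0.0]]   # centroid at (0, 0)
new_data = [[2.0, 4.0], [4.0, 4.0]]    # centroid at (3, 4)
print(centroid_distance(old_data, new_data))
```

A distance near zero suggests the two datasets occupy a similar region of feature space. How large a distance counts as drift depends on the embedding scale, so any threshold has to be calibrated per project.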

3. Scan Dataset for Corruption

Find blank, corrupt, or fundamentally broken images instantly:

datasetvision scan /path/to/dataset --output report.json
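
For intuition, two of the checks named above (blank images and extreme aspect ratios) can be sketched on an already-decoded grayscale image represented as a list of pixel rows. Everything here is an assumption for illustration: the function names, the `max_aspect` limit of 5.0, and the issue labels are hypothetical, and detecting corrupt files is left out because it requires an image decoder.

```python
def is_blank(pixels, tolerance=0):
    """True when all pixel values lie within `tolerance` of each other."""
    flat = [value for row in pixels for value in row]
    return max(flat) - min(flat) <= tolerance


def aspect_ratio(width, height):
    """Long side divided by short side, so the result is always >= 1."""
    return max(width, height) / min(width, height)


def scan_image(pixels, max_aspect=5.0):
    """Return a list of issue labels for one grayscale image (rows of ints)."""
    height = len(pixels)
    width = len(pixels[0]) if pixels else 0
    if width == 0 or height == 0:
        return ["empty"]
    issues = []
    if is_blank(pixels):
        issues.append("blank")
    if aspect_ratio(width, height) > max_aspect:
        issues.append("extreme_aspect_ratio")
    return issues


# A 1x8 all-black strip is both blank and very elongated.
strip = [[0] * 8]
print(scan_image(strip))
```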

4. Find Duplicates

Discover redundant data dragging down your model training speed:

# Find near-duplicates using Perceptual Hashing
datasetvision duplicates /path/to/dataset --near

# Find exact duplicates using MD5
datasetvision duplicates /path/to/dataset --exact
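
The two strategies differ in what they compare: MD5 matches files byte for byte, while a perceptual hash summarizes what the image looks like, so small edits change only a few bits. The sketch below illustrates both using the standard library. It assumes a simple average hash as the perceptual hash, which may not be the variant DatasetVision uses, and all function names are hypothetical.

```python
import hashlib
from collections import defaultdict


def exact_duplicate_groups(blobs):
    """Group names whose bytes share an MD5 digest. `blobs` maps name -> bytes."""
    groups = defaultdict(list)
    for name, data in blobs.items():
        groups[hashlib.md5(data).hexdigest()].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


def average_hash(pixels):
    """Average hash: one bit per pixel, set when the pixel is above the mean."""
    flat = [value for row in pixels for value in row]
    avg = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > avg else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


# Made-up data: two identical files, plus tiny 2x2 grayscale "images".
blobs = {"a.png": b"x", "b.png": b"x", "c.png": b"y"}
original = [[0, 255], [255, 0]]
retouched = [[10, 250], [240, 5]]   # same pattern, slightly altered values
inverted = [[255, 0], [0, 255]]

print(exact_duplicate_groups(blobs))
print(hamming(average_hash(original), average_hash(retouched)))
print(hamming(average_hash(original), average_hash(inverted)))
```

In practice, images are first resized to a small fixed grid (commonly 8x8) so that hashes are comparable across resolutions, and a pair counts as a near-duplicate when its Hamming distance falls below a chosen cutoff.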

5. Generate Visual HTML Report

Export the intelligence findings to a static, self-contained HTML file:

datasetvision report /path/to/dataset output_report.html

🏗️ Architecture

graph TD;
    CLI[CLI Layer - Typer] --> Core[Intelligence Engine];
    Core --> Drift[Drift Analysis];
    Core --> Anomaly[OOD Anomaly Detection];
    Core --> Duplicate[Duplicates & Hashing];
    Core --> Scanner[Corruption Scanner];
    Core --> Policy[Governance Engine];

🔗 Project Links

  • 📜 License: MIT
  • 📚 Changelog: Track our progress
  • 🤝 Contributing: Help us grow
  • ⚖️ Code of Conduct: Our community commitment

🤝 Contributing

We welcome pull requests! Please read our Contributing Guidelines before submitting.

  1. Fork the repo.
  2. Create your feature branch (git checkout -b feature/AmazingFeature).
  3. Commit your changes (git commit -m 'Add some AmazingFeature').
  4. Ensure tests pass (pytest tests/).
  5. Push to the branch (git push origin feature/AmazingFeature).
  6. Open a Pull Request.

Maintained with ❤️ by Nibir Biswas

About

Offline computer vision dataset auditing CLI tool for validating image datasets before model training.
