Data Anonymization and Synthetic Data Generation Tool

A versatile command-line tool written in Python for handling sensitive patient data. It provides functionalities to generate dummy datasets, anonymize existing data using techniques like K-anonymity, and create synthetic datasets that mimic the statistical properties of the anonymized data, all while preserving patient privacy.

✨ Features

  • Dummy Data Generation: Quickly create a sample dataset for testing and development.
  • Data Anonymization:
    • Pseudonymization: Replaces direct identifiers (e.g., patient_id) with unique IDs that cannot be traced back to the original values.
    • Generalization: Groups quasi-identifiers (e.g., age into bins like "20-29", zip_code into prefixes) to achieve K-anonymity.
    • Suppression: Optionally removes highly sensitive columns (e.g., diagnosis).
  • Synthetic Data Generation: Generate new, artificial datasets that statistically resemble the anonymized data, useful for research and testing without compromising privacy.
  • Command-Line Interface (CLI): Easy-to-use commands for each operation.
  • CSV Output: All generated and processed data is saved as CSV files.
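To make the pseudonymization and generalization steps concrete, here is a minimal sketch using only the standard library. It is not taken from the tool's source: the function names (generalize_age, generalize_zip, pseudonymize), the bin width of 10, the three-character ZIP prefix, and the use of "*" as a mask character are all assumptions chosen for illustration.

```python
import uuid


def generalize_age(age, bin_width=10):
    """Map an exact age to a range label such as '20-29' (hypothetical helper)."""
    low = (age // bin_width) * bin_width
    return f"{low}-{low + bin_width - 1}"


def generalize_zip(zip_code, keep=3):
    """Keep the first `keep` characters of a ZIP code and mask the rest.

    Masking with '*' is an assumption; the tool may simply truncate.
    """
    zip_code = str(zip_code)
    return zip_code[:keep] + "*" * max(len(zip_code) - keep, 0)


def pseudonymize(ids):
    """Return a dict mapping each distinct original ID to a random replacement ID."""
    return {original: uuid.uuid4().hex for original in dict.fromkeys(ids)}


record = {"patient_id": "P001", "age": 23, "zip_code": "94110"}
id_map = pseudonymize([record["patient_id"]])
generalized = {
    "patient_id": id_map[record["patient_id"]],
    "age": generalize_age(record["age"]),
    "zip_code": generalize_zip(record["zip_code"]),
}
print(generalized["age"], generalized["zip_code"])  # 20-29 941**
```

Wider bins and shorter prefixes put more records into each group, which makes a given k easier to reach at the cost of less detailed data.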

🚀 Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

  • Python 3.8 or higher installed on your system.
  • pip (Python package installer), usually included with Python.

Installation

  1. Clone the repository:

    git clone https://github.com/Barontex1000/data-anonymization-simulation-tool.git
    cd data-anonymization-simulation-tool
  2. Create a virtual environment (recommended): A virtual environment isolates your project's dependencies from other Python projects.

    python -m venv venv
  3. Activate the virtual environment:

    • On Windows:
      .\venv\Scripts\activate
    • On macOS/Linux:
      source venv/bin/activate
  4. Install dependencies:

    pip install -r requirements.txt

⚙️ Usage

The tool is executed via the command line using python data_anonymization_simulation_tool.py <command> [options].

Commands:

  1. generate_dummy: Generates a dummy dataset.

    python data_anonymization_simulation_tool.py generate_dummy --size <num_records> --output <output_file.csv>
    • <num_records>: (Optional, default: 100) Number of dummy records to generate.
    • <output_file.csv>: (Required) Path to save the generated dummy data.

    Example:

    python data_anonymization_simulation_tool.py generate_dummy --size 150 --output data/patient_records_dummy.csv
  2. anonymize: Anonymizes an input dataset.

    python data_anonymization_simulation_tool.py anonymize --input <input_file.csv> --k <k_value> [--suppress_diagnosis] --output <output_file.csv>
    • <input_file.csv>: (Required) Path to the input CSV file to anonymize.
    • <k_value>: (Optional, default: 5) The K-anonymity value. The tool warns about any group of records that share identical quasi-identifier values but contains fewer than k_value records.
    • --suppress_diagnosis: (Optional) Flag to remove the diagnosis column entirely from the anonymized data.
    • <output_file.csv>: (Required) Path to save the anonymized data.

    Example:

    python data_anonymization_simulation_tool.py anonymize --input data/patient_records_dummy.csv --k 5 --suppress_diagnosis --output data/patient_records_anonymized.csv
  3. generate_synthetic: Generates synthetic data based on an anonymized dataset.

    python data_anonymization_simulation_tool.py generate_synthetic --input_anonymized <anonymized_input_file.csv> --size <num_synthetic_records> --output <output_file.csv>
    • <anonymized_input_file.csv>: (Required) Path to the anonymized CSV file from which to learn distributions.
    • <num_synthetic_records>: (Optional, default: 500) Number of synthetic records to generate.
    • <output_file.csv>: (Required) Path to save the generated synthetic data.

    Example:

    python data_anonymization_simulation_tool.py generate_synthetic --input_anonymized data/patient_records_anonymized.csv --size 500 --output data/patient_records_synthetic.csv
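The K-anonymity warning described for the anonymize command's --k option can be pictured as a group-size check over the quasi-identifier columns. The sketch below is an illustration under assumptions, not the tool's actual code: find_small_groups is a hypothetical helper, and the rows are hand-written sample values.

```python
from collections import Counter


def find_small_groups(rows, quasi_identifiers, k=5):
    """Return {quasi-identifier values: count} for groups with fewer than k rows."""
    counts = Counter(tuple(row[col] for col in quasi_identifiers) for row in rows)
    return {group: n for group, n in counts.items() if n < k}


# Hand-written sample rows, already generalized (illustrative values only).
rows = [
    {"age": "20-29", "zip_code": "941**"},
    {"age": "20-29", "zip_code": "941**"},
    {"age": "30-39", "zip_code": "100**"},
]

for group, n in find_small_groups(rows, ["age", "zip_code"], k=2).items():
    print(f"Warning: group {group} has only {n} record(s), fewer than k=2")
```

A dataset satisfies K-anonymity for a given k only when this check returns no groups. Coarser generalization or suppressing the offending rows are the usual ways to get there.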

Example Workflow:

# 1. Generate 150 dummy patient records
python data_anonymization_simulation_tool.py generate_dummy --size 150 --output data/patient_records_dummy.csv

# 2. Anonymize the dummy records with K=5, suppressing diagnosis
python data_anonymization_simulation_tool.py anonymize --input data/patient_records_dummy.csv --k 5 --suppress_diagnosis --output data/patient_records_anonymized.csv

# 3. Generate 500 synthetic records from the anonymized data
python data_anonymization_simulation_tool.py generate_synthetic --input_anonymized data/patient_records_anonymized.csv --size 500 --output data/patient_records_synthetic.csv
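
This README does not say how generate_synthetic learns distributions from the anonymized file. One simple possibility, sketched here purely as an assumption, is to sample each column independently from the values observed in the anonymized data. The helper name sample_synthetic and the sample rows are hypothetical. Independent sampling preserves each column's marginal distribution but not the correlations between columns.

```python
import random


def sample_synthetic(rows, size, seed=None):
    """Draw `size` records, sampling each column independently from its observed values."""
    rng = random.Random(seed)
    columns = list(rows[0].keys())
    # Each column's list of observed values acts as its empirical distribution.
    observed = {col: [row[col] for row in rows] for col in columns}
    return [
        {col: rng.choice(observed[col]) for col in columns}
        for _ in range(size)
    ]


# Hand-written anonymized rows (illustrative values only).
anonymized = [
    {"age": "20-29", "zip_code": "941**"},
    {"age": "20-29", "zip_code": "941**"},
    {"age": "30-39", "zip_code": "100**"},
]

synthetic = sample_synthetic(anonymized, size=5, seed=42)
print(len(synthetic))  # 5
```

Because columns are drawn separately, a synthetic record may combine values that never appeared together in the input, such as age "30-39" with ZIP "941**".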

📂 Project Structure
.
├── data_anonymization_simulation_tool.py # The core Python script
├── README.md                             # This file
├── LICENSE                               # Licensing information (MIT License)
├── requirements.txt                      # Python dependencies
├── .gitignore                            # Files/folders to ignore by Git
└── data/                                 # Directory for example input/output data
    ├── patient_records_dummy.csv         # Example dummy data output
    ├── patient_records_anonymized.csv    # Example anonymized data output
    └── patient_records_synthetic.csv     # Example synthetic data output

🤝 Contributing
Contributions are welcome! If you have suggestions for improvements, find bugs, or want to add new features, please feel free to:

  1. Fork the repository.
  2. Create a new branch (git checkout -b feature/your-feature-name).
  3. Make your changes.
  4. Commit your changes (git commit -m 'Add new feature').
  5. Push to the branch (git push origin feature/your-feature-name).
  6. Open a Pull Request.

📄 License
This project is licensed under the MIT License - see the LICENSE file for details.

✉️ Contact
For questions or inquiries, please open an issue on the GitHub repository or contact Baron E. Turyatemba (baronturyatemba596@gmail.com, https://github.com/Barontex1000).
