Data Anonymization and Synthetic Data Generation Tool

A command-line tool written in Python for handling sensitive patient data. It generates dummy datasets, anonymizes existing data using techniques such as K-anonymity, and creates synthetic datasets that mimic the statistical properties of the anonymized data, with the goal of preserving patient privacy.

✨ Features

Dummy Data Generation: Quickly create a sample dataset for testing and development.
Data Anonymization:
- Pseudo-anonymization: Replaces sensitive identifiers (e.g., patient_id) with non-traceable unique IDs.
- Generalization: Coarsens quasi-identifiers (e.g., age into bins like "20-29", zip_code into prefixes) so that more records share the same values, working toward K-anonymity.
- Suppression: Optionally removes highly sensitive columns (e.g., diagnosis).
Synthetic Data Generation: Generate new, artificial datasets that statistically resemble the anonymized data, useful for research and testing without compromising privacy.
Command-Line Interface (CLI): Easy-to-use commands for each operation.
CSV Output: All generated and processed data is saved as CSV files.
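
To make the pseudo-anonymization and generalization steps above concrete, here is a minimal sketch using only the standard library. The function names (`pseudonymize`, `generalize_age`, `generalize_zip`), the bin width, and the prefix length are assumptions made for illustration; the tool's actual implementation may differ.

```python
import uuid


def pseudonymize(ids):
    """Map each distinct identifier to a random UUID-based pseudonym."""
    mapping = {}
    for original in ids:
        if original not in mapping:
            mapping[original] = uuid.uuid4().hex
    return [mapping[original] for original in ids]


def generalize_age(age, width=10):
    """Coarsen an exact age into a bin label such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def generalize_zip(zip_code, keep=3):
    """Keep the first `keep` characters of a ZIP code and mask the rest."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


# Hypothetical sample records; column names follow the examples in this README.
records = [
    {"patient_id": "P001", "age": 23, "zip_code": "90210"},
    {"patient_id": "P002", "age": 29, "zip_code": "90213"},
]
new_ids = pseudonymize([r["patient_id"] for r in records])
anonymized = [
    {
        "patient_id": new_id,
        "age": generalize_age(r["age"]),
        "zip_code": generalize_zip(r["zip_code"]),
    }
    for new_id, r in zip(new_ids, records)
]
print(anonymized)
```

After generalization both sample records share the quasi-identifier values ("20-29", "902**"), which is what lets them fall into the same K-anonymity group.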

🚀 Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

Python 3.8 or higher installed on your system.
pip (Python package installer), usually included with Python.

Installation

Clone the repository:

```
git clone https://github.com/Barontex1000/data-anonymization-simulation-tool.git
cd data-anonymization-simulation-tool
```

Create a virtual environment (recommended): A virtual environment isolates your project's dependencies from other Python projects.
```
python -m venv venv
```
Activate the virtual environment:
- On Windows:
```
.\venv\Scripts\activate
```
- On macOS/Linux:
```
source venv/bin/activate
```
Install dependencies:
```
pip install -r requirements.txt
```

⚙️ Usage

The tool is executed via the command line using python data_anonymization_simulation_tool.py <command> [options].
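
A command layout like this is commonly built with `argparse` subcommands. The sketch below is an assumption about how such a parser could be wired up, using only the commands, flags, and defaults documented in this README. The helper name `build_parser` is hypothetical and not taken from the script.

```python
import argparse


def build_parser():
    """Build a parser with one subcommand per documented operation."""
    parser = argparse.ArgumentParser(prog="data_anonymization_simulation_tool.py")
    sub = parser.add_subparsers(dest="command", required=True)

    dummy = sub.add_parser("generate_dummy", help="Generate a dummy dataset")
    dummy.add_argument("--size", type=int, default=100)
    dummy.add_argument("--output", required=True)

    anon = sub.add_parser("anonymize", help="Anonymize an input dataset")
    anon.add_argument("--input", required=True)
    anon.add_argument("--k", type=int, default=5)
    anon.add_argument("--suppress_diagnosis", action="store_true")
    anon.add_argument("--output", required=True)

    synth = sub.add_parser("generate_synthetic", help="Generate synthetic data")
    synth.add_argument("--input_anonymized", required=True)
    synth.add_argument("--size", type=int, default=500)
    synth.add_argument("--output", required=True)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["anonymize", "--input", "in.csv", "--output", "out.csv"])
print(args.command, args.k, args.suppress_diagnosis)  # anonymize 5 False
```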

Commands:

generate_dummy: Generates a dummy dataset.
```
python data_anonymization_simulation_tool.py generate_dummy --size <num_records> --output <output_file.csv>
```
- <num_records>: (Optional, default: 100) Number of dummy records to generate.
- <output_file.csv>: (Required) Path to save the generated dummy data.
Example:
```
python data_anonymization_simulation_tool.py generate_dummy --size 150 --output data/patient_records_dummy.csv
```
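
As a rough illustration of what dummy generation involves, the following sketch produces random patient records and writes them as CSV. The column set, value ranges, diagnosis list, and function names are all assumptions chosen for the example. They are not confirmed to match the script's real schema.

```python
import csv
import io
import random

# Hypothetical value pool, for illustration only.
DIAGNOSES = ["Flu", "Diabetes", "Asthma", "Hypertension"]


def generate_dummy_records(size=100, seed=None):
    """Create `size` random patient records (assumed schema)."""
    rng = random.Random(seed)
    return [
        {
            "patient_id": f"P{i:05d}",
            "age": rng.randint(18, 90),
            "zip_code": str(rng.randint(10000, 99999)),
            "diagnosis": rng.choice(DIAGNOSES),
        }
        for i in range(1, size + 1)
    ]


def write_csv(records, handle):
    """Write records to an open text handle, with a header row."""
    writer = csv.DictWriter(handle, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)


records = generate_dummy_records(size=150, seed=42)
buffer = io.StringIO()  # stands in for a file opened with open(path, "w", newline="")
write_csv(records, buffer)
print(buffer.getvalue().splitlines()[0])  # patient_id,age,zip_code,diagnosis
```

Passing a seed keeps test datasets reproducible between runs, which is convenient when later steps are compared against earlier output.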

anonymize: Anonymizes an input dataset.
```
python data_anonymization_simulation_tool.py anonymize --input <input_file.csv> --k <k_value> [--suppress_diagnosis] --output <output_file.csv>
```
- <input_file.csv>: (Required) Path to the input CSV file to anonymize.
- <k_value>: (Optional, default: 5) The K-anonymity value. The tool warns about any group in which fewer than k_value records share the same quasi-identifier values.
- --suppress_diagnosis: (Optional) Flag to remove the diagnosis column entirely from the anonymized data.
- <output_file.csv>: (Required) Path to save the anonymized data.
Example:
```
python data_anonymization_simulation_tool.py anonymize --input data/patient_records_dummy.csv --k 5 --suppress_diagnosis --output data/patient_records_anonymized.csv
```
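
The K-anonymity warning and the --suppress_diagnosis flag described above can be sketched as follows. This is a minimal illustration under the assumption that age and zip_code are the quasi-identifiers. The function names are hypothetical and the script's own logic may differ.

```python
from collections import Counter


def find_small_groups(records, quasi_identifiers, k=5):
    """Return quasi-identifier combinations shared by fewer than k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return {group: n for group, n in counts.items() if n < k}


def suppress_column(records, column):
    """Return copies of the records without the given column."""
    return [{key: v for key, v in r.items() if key != column} for r in records]


# Hypothetical, already generalized records.
records = [
    {"age": "20-29", "zip_code": "902**", "diagnosis": "Flu"},
    {"age": "20-29", "zip_code": "902**", "diagnosis": "Asthma"},
    {"age": "30-39", "zip_code": "100**", "diagnosis": "Flu"},
]

small = find_small_groups(records, ["age", "zip_code"], k=2)
for group, n in small.items():
    print(f"Warning: group {group} has only {n} record(s), below k=2")

suppressed = suppress_column(records, "diagnosis")
```

Here the ("30-39", "100**") group contains a single record, so it is reported. A dataset satisfies K-anonymity only when no group is reported.
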
generate_synthetic: Generates synthetic data based on an anonymized dataset.
```
python data_anonymization_simulation_tool.py generate_synthetic --input_anonymized <anonymized_input_file.csv> --size <num_synthetic_records> --output <output_file.csv>
```
- <anonymized_input_file.csv>: (Required) Path to the anonymized CSV file from which to learn distributions.
- <num_synthetic_records>: (Optional, default: 500) Number of synthetic records to generate.
- <output_file.csv>: (Required) Path to save the generated synthetic data.
Example:
```
python data_anonymization_simulation_tool.py generate_synthetic --input_anonymized data/patient_records_anonymized.csv --size 500 --output data/patient_records_synthetic.csv
```
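
The README does not spell out how distributions are learned, so the sketch below shows one simple possibility: count how often each value occurs per column, then sample every column independently in proportion to those counts. This is an assumption for illustration, not the tool's confirmed method, and the function names are hypothetical.

```python
import random
from collections import Counter


def learn_marginals(records):
    """Count how often each value occurs in each column."""
    columns = list(records[0].keys())
    return {col: Counter(r[col] for r in records) for col in columns}


def sample_synthetic(marginals, size=500, seed=None):
    """Draw each column independently, weighted by observed frequency."""
    rng = random.Random(seed)
    rows = []
    for _ in range(size):
        row = {}
        for col, counts in marginals.items():
            values = list(counts.keys())
            weights = list(counts.values())
            row[col] = rng.choices(values, weights=weights, k=1)[0]
        rows.append(row)
    return rows


# Hypothetical anonymized input (identifier columns omitted).
anonymized = [
    {"age": "20-29", "zip_code": "902**"},
    {"age": "20-29", "zip_code": "902**"},
    {"age": "30-39", "zip_code": "100**"},
]
synthetic = sample_synthetic(learn_marginals(anonymized), size=500, seed=0)
print(len(synthetic), synthetic[0])
```

Independent sampling reproduces each column's distribution but discards correlations between columns. Preserving those would require sampling from joint counts or fitting a model.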

Example Workflow:

```
# 1. Generate 150 dummy patient records
python data_anonymization_simulation_tool.py generate_dummy --size 150 --output data/patient_records_dummy.csv

# 2. Anonymize the dummy records with K=5, suppressing diagnosis
python data_anonymization_simulation_tool.py anonymize --input data/patient_records_dummy.csv --k 5 --suppress_diagnosis --output data/patient_records_anonymized.csv

# 3. Generate 500 synthetic records from the anonymized data
python data_anonymization_simulation_tool.py generate_synthetic --input_anonymized data/patient_records_anonymized.csv --size 500 --output data/patient_records_synthetic.csv
```

📂 Project Structure
```
.
├── data_anonymization_simulation_tool.py # The core Python script
├── README.md                             # This file
├── LICENSE                               # Licensing information (e.g., MIT License)
├── requirements.txt                      # Python dependencies
├── .gitignore                            # Files/folders to ignore by Git
└── data/                                 # Directory for example input/output data
    ├── patient_records_dummy.csv         # Example dummy data output
    ├── patient_records_anonymized.csv    # Example anonymized data output
    └── patient_records_synthetic.csv     # Example synthetic data output
```

🤝 Contributing
Contributions are welcome! If you have suggestions for improvements, find bugs, or want to add new features, please feel free to:

- Fork the repository.
- Create a new branch (git checkout -b feature/your-feature-name).
- Make your changes.
- Commit your changes (git commit -m 'Add new feature').
- Push to the branch (git push origin feature/your-feature-name).
- Open a Pull Request.

📄 License
This project is licensed under the MIT License - see the LICENSE file for details.

✉️ Contact
For questions or inquiries, please open an issue on the GitHub repository or contact Baron E. Turyatemba (baronturyatemba596@gmail.com, https://github.com/Barontex1000).
