A versatile command-line tool written in Python for handling sensitive patient data. It generates dummy datasets, anonymizes existing data using techniques such as K-anonymity, and creates synthetic datasets that mimic the statistical properties of the anonymized data, all while preserving patient privacy.
- Dummy Data Generation: Quickly create a sample dataset for testing and development.
- Data Anonymization:
  - Pseudo-anonymization: Replaces sensitive identifiers (e.g., patient_id) with non-traceable unique IDs.
  - Generalization: Groups quasi-identifiers (e.g., age into bins like "20-29", zip_code into prefixes) to achieve K-anonymity.
  - Suppression: Optionally removes highly sensitive columns (e.g., diagnosis).
- Synthetic Data Generation: Generate new, artificial datasets that statistically resemble the anonymized data, useful for research and testing without compromising privacy.
- Command-Line Interface (CLI): Easy-to-use commands for each operation.
- CSV Output: All generated and processed data is saved as CSV files.
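The three anonymization techniques listed above can be illustrated with a minimal sketch. It assumes records are plain dictionaries with the columns patient_id, age, zip_code, and diagnosis named in this README; the helper names (pseudonymize, generalize_age, generalize_zip, anonymize_record), the 10-year bin width, and the 3-digit ZIP prefix are hypothetical choices for illustration and may differ from what the tool actually does.

```python
import uuid


def pseudonymize(records, id_field="patient_id"):
    """Replace each distinct identifier with a random ID that cannot be derived from it."""
    mapping = {}
    out = []
    for rec in records:
        original = rec[id_field]
        if original not in mapping:
            mapping[original] = uuid.uuid4().hex
        out.append({**rec, id_field: mapping[original]})
    return out


def generalize_age(age, width=10):
    """Map an exact age to a bin label such as "20-29" (bin width is an assumption)."""
    low = (int(age) // width) * width
    return f"{low}-{low + width - 1}"


def generalize_zip(zip_code, keep=3):
    """Keep the leading characters of a ZIP code and mask the rest (prefix length is an assumption)."""
    z = str(zip_code)
    return z[:keep] + "*" * (len(z) - keep)


def anonymize_record(rec, suppress=("diagnosis",)):
    """Drop suppressed columns, then generalize the quasi-identifiers."""
    out = {key: value for key, value in rec.items() if key not in suppress}
    out["age"] = generalize_age(rec["age"])
    out["zip_code"] = generalize_zip(rec["zip_code"])
    return out


# Made-up sample rows, not real patient data.
records = [
    {"patient_id": "P001", "age": 27, "zip_code": "90210", "diagnosis": "flu"},
    {"patient_id": "P002", "age": 23, "zip_code": "90233", "diagnosis": "asthma"},
]
anonymized = [anonymize_record(r) for r in pseudonymize(records)]
```

After generalization both sample rows share the same age bin and ZIP prefix, which is what allows records to fall into a common group for K-anonymity.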
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
- Python 3.8 or higher installed on your system.
- pip (Python package installer), usually included with Python.
- Clone the repository:
  git clone https://github.com/Barontex1000/data-anonymization-simulation-tool.git
  cd data-anonymization-simulation-tool
- Create a virtual environment (recommended): A virtual environment isolates your project's dependencies from other Python projects.
  python -m venv venv
- Activate the virtual environment:
  - On Windows:
    .\venv\Scripts\activate
  - On macOS/Linux:
    source venv/bin/activate
- Install dependencies:
  pip install -r requirements.txt
The tool is executed via the command line using python data_anonymization_simulation_tool.py <command> [options].
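The command-plus-options shape of this invocation maps naturally onto argparse subcommands. The sketch below is an assumption about how such a CLI could be wired, not the script's actual source: the command names, flags, and defaults are taken from this README, while build_parser is a hypothetical helper name.

```python
import argparse


def build_parser():
    """Build a parser with one subcommand per documented operation."""
    parser = argparse.ArgumentParser(prog="data_anonymization_simulation_tool.py")
    sub = parser.add_subparsers(dest="command", required=True)

    dummy = sub.add_parser("generate_dummy", help="Generate a dummy dataset")
    dummy.add_argument("--size", type=int, default=100)
    dummy.add_argument("--output", required=True)

    anon = sub.add_parser("anonymize", help="Anonymize an input dataset")
    anon.add_argument("--input", required=True)
    anon.add_argument("--k", type=int, default=5)
    anon.add_argument("--suppress_diagnosis", action="store_true")
    anon.add_argument("--output", required=True)

    synth = sub.add_parser("generate_synthetic", help="Generate synthetic data")
    synth.add_argument("--input_anonymized", required=True)
    synth.add_argument("--size", type=int, default=500)
    synth.add_argument("--output", required=True)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["anonymize", "--input", "in.csv", "--output", "out.csv"]
)
```

With this layout, omitting --k or --size falls back to the documented defaults, and omitting a required path makes argparse exit with a usage message.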
- generate_dummy: Generates a dummy dataset.
  python data_anonymization_simulation_tool.py generate_dummy --size <num_records> --output <output_file.csv>
  - <num_records>: (Optional, default: 100) Number of dummy records to generate.
  - <output_file.csv>: (Required) Path to save the generated dummy data.
  Example:
  python data_anonymization_simulation_tool.py generate_dummy --size 150 --output data/patient_records_dummy.csv
- anonymize: Anonymizes an input dataset.
  python data_anonymization_simulation_tool.py anonymize --input <input_file.csv> --k <k_value> [--suppress_diagnosis] --output <output_file.csv>
  - <input_file.csv>: (Required) Path to the input CSV file to anonymize.
  - <k_value>: (Optional, default: 5) The K-anonymity value. The tool warns about any group of records with identical quasi-identifiers that contains fewer than k_value records.
  - --suppress_diagnosis: (Optional) Flag to remove the diagnosis column entirely from the anonymized data.
  - <output_file.csv>: (Required) Path to save the anonymized data.
  Example:
  python data_anonymization_simulation_tool.py anonymize --input data/patient_records_dummy.csv --k 5 --suppress_diagnosis --output data/patient_records_anonymized.csv
- generate_synthetic: Generates synthetic data based on an anonymized dataset.
  python data_anonymization_simulation_tool.py generate_synthetic --input_anonymized <anonymized_input_file.csv> --size <num_synthetic_records> --output <output_file.csv>
  - <anonymized_input_file.csv>: (Required) Path to the anonymized CSV file from which to learn distributions.
  - <num_synthetic_records>: (Optional, default: 500) Number of synthetic records to generate.
  - <output_file.csv>: (Required) Path to save the generated synthetic data.
  Example:
  python data_anonymization_simulation_tool.py generate_synthetic --input_anonymized data/patient_records_anonymized.csv --size 500 --output data/patient_records_synthetic.csv
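One simple way to learn distributions from an anonymized file is to resample each column from its observed values. The sketch below assumes exactly that: columns are sampled independently, which preserves per-column frequencies but not correlations between columns. This is an illustrative assumption, since the README does not say which method the tool uses, and generate_synthetic here is a hypothetical stand-in function, not the tool's code.

```python
import random


def generate_synthetic(records, size=500, seed=None):
    """Draw each column independently from its empirical distribution."""
    rng = random.Random(seed)
    columns = list(records[0].keys())
    # Keeping duplicates in each pool makes frequent values proportionally more likely.
    pools = {col: [rec[col] for rec in records] for col in columns}
    return [{col: rng.choice(pools[col]) for col in columns} for _ in range(size)]


# Made-up anonymized rows for illustration.
anonymized = [
    {"age": "20-29", "zip_code": "902**"},
    {"age": "20-29", "zip_code": "902**"},
    {"age": "30-39", "zip_code": "100**"},
]
synthetic = generate_synthetic(anonymized, size=500, seed=0)
```

Because every synthetic value is drawn from the anonymized pool, the output never contains a value that was absent from the input, though it can contain combinations of values that were.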
# 1. Generate 150 dummy patient records
python data_anonymization_simulation_tool.py generate_dummy --size 150 --output data/patient_records_dummy.csv
# 2. Anonymize the dummy records with K=5, suppressing diagnosis
python data_anonymization_simulation_tool.py anonymize --input data/patient_records_dummy.csv --k 5 --suppress_diagnosis --output data/patient_records_anonymized.csv
# 3. Generate 500 synthetic records from the anonymized data
python data_anonymization_simulation_tool.py generate_synthetic --input_anonymized data/patient_records_anonymized.csv --size 500 --output data/patient_records_synthetic.csv
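After the anonymize step in this workflow, the K-anonymity warning can be reproduced by counting how many records share each combination of quasi-identifier values. This is a minimal sketch, assuming age and zip_code are the quasi-identifiers (the README names them as examples); small_groups is a hypothetical helper, not a function of the tool.

```python
from collections import Counter


def small_groups(records, quasi_identifiers, k=5):
    """Return each quasi-identifier combination that occurs fewer than k times."""
    counts = Counter(tuple(rec[q] for q in quasi_identifiers) for rec in records)
    return {group: n for group, n in counts.items() if n < k}


# Made-up generalized rows: one group of five, one group of two.
records = (
    [{"age": "20-29", "zip_code": "902**"}] * 5
    + [{"age": "30-39", "zip_code": "100**"}] * 2
)
violations = small_groups(records, ["age", "zip_code"], k=5)
for group, n in violations.items():
    print(f"Warning: group {group} has only {n} record(s), fewer than k=5")
```

A dataset satisfies K-anonymity for the chosen quasi-identifiers only when this check returns no groups; coarser bins or shorter ZIP prefixes are the usual way to merge small groups.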
📂 Project Structure
.
├── data_anonymization_simulation_tool.py # The core Python script
├── README.md # This file
├── LICENSE # Licensing information (e.g., MIT License)
├── requirements.txt # Python dependencies
├── .gitignore # Files and folders ignored by Git
└── data/ # Directory for example input/output data
    ├── patient_records_dummy.csv # Example dummy data output
    ├── patient_records_anonymized.csv # Example anonymized data output
    └── patient_records_synthetic.csv # Example synthetic data output
🤝 Contributing
Contributions are welcome! If you have suggestions for improvements, find bugs, or want to add new features, please feel free to:
- Fork the repository.
- Create a new branch (git checkout -b feature/your-feature-name).
- Make your changes.
- Commit your changes (git commit -m 'Add new feature').
- Push to the branch (git push origin feature/your-feature-name).
- Open a Pull Request.
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
✉️ Contact
For any questions or inquiries, please open an issue on the GitHub repository or contact Baron E. Turyatemba (baronturyatemba596@gmail.com, https://github.com/Barontex1000).