A powerful and configurable Python script (afanon.py) for identifying and anonymizing sensitive information within text documents. This tool employs a multi-layered approach, combining Named Entity Recognition (NER), regular expressions, custom wordlists, and fallback heuristics to provide comprehensive PII redaction.
- Multi-Layered Detection:
  - Named Entity Recognition (NER): Leverages NLTK to identify entities like `PERSON`, `ORGANIZATION`, `LOCATION`, `DATE`, `MONEY`, etc. (configurable).
  - Regular Expression (Regex) Matching: Pre-defined and customizable regex patterns to detect common PII like emails, phone numbers, SSNs, credit card numbers, IP addresses, physical addresses, and postal codes.
  - Custom Wordlist Anonymization: Allows users to supply their own lists of sensitive words/phrases (e.g., passwords, specific object names, dog breeds, project codenames) for targeted redaction.
  - Fallback Capitalized Word Detection: A heuristic to catch potential names that might be missed by other methods.
- Flexible Anonymization Strategies:
  - `placeholder`: Replaces sensitive data with generic placeholders (e.g., `[PERSON]`, `[EMAIL]`).
  - `counter`: Replaces with unique, numbered placeholders (e.g., `[PERSON_1]`, `[PERSON_2]`).
  - `fake`: Substitutes PII with realistic-looking but synthetic data generated by the Faker library (e.g., "John Doe" for a name, a fake address for a real one).
- Intelligent Overlap Resolution: When multiple detection methods identify overlapping text, the script prioritizes matches based on specificity (regex/custom lists > NER > fallback) and length to ensure accurate anonymization.
- Highly Configurable:
  - Specify which NER entities to target.
  - Provide custom files for passwords, objects, dog breeds, or a generic wordlist.
  - Choose the anonymization strategy.
  - (Advanced) Modify built-in regex patterns directly in the code.
- Command-Line Interface (CLI): Easy to integrate into workflows for processing text files.
- Detailed Logging: Offers insights into the anonymization process, including found entities and applied replacements (with verbose/debug mode).
- Automatic NLTK Data Check: Ensures necessary NLTK resources are available and attempts to download them if missing.
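
The detection, overlap-resolution, and replacement steps listed above can be sketched roughly as follows. This is a minimal illustration, not the actual code in `afanon.py`: the function names (`find_matches`, `resolve_overlaps`, `anonymize`), the simplified email pattern, the numeric priority values, and the choice to reuse a counter for repeated values are all assumptions made for the example. NER and custom wordlists are omitted to keep it dependency-free.

```python
import re
from collections import defaultdict

# Assumed numeric priorities mirroring "regex/custom lists > NER > fallback".
PRIORITY = {"regex": 3, "ner": 2, "fallback": 1}

# Deliberately simplified patterns, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
CAPITALIZED_RE = re.compile(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b")


def find_matches(text):
    """Collect candidate spans as (start, end, label, source) tuples."""
    matches = [(m.start(), m.end(), "EMAIL", "regex")
               for m in EMAIL_RE.finditer(text)]
    matches += [(m.start(), m.end(), "PERSON", "fallback")
                for m in CAPITALIZED_RE.finditer(text)]
    return matches


def resolve_overlaps(matches):
    """Keep non-overlapping spans, preferring higher priority, then longer spans."""
    ranked = sorted(matches, key=lambda m: (-PRIORITY[m[3]], -(m[1] - m[0]), m[0]))
    kept = []
    for start, end, label, source in ranked:
        # Accept a span only if it does not intersect any span already kept.
        if all(end <= k[0] or start >= k[1] for k in kept):
            kept.append((start, end, label, source))
    return sorted(kept)


def anonymize(text, strategy="placeholder"):
    """Replace resolved spans with [LABEL] or [LABEL_n] placeholders."""
    counters = defaultdict(int)
    numbers = {}  # assumption: a repeated value keeps the number it got first
    pieces, cursor = [], 0
    for start, end, label, _source in resolve_overlaps(find_matches(text)):
        pieces.append(text[cursor:start])
        if strategy == "counter":
            key = (label, text[start:end])
            if key not in numbers:
                counters[label] += 1
                numbers[key] = counters[label]
            pieces.append(f"[{label}_{numbers[key]}]")
        else:
            pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


print(anonymize("write to Alice Smith at Alice.Smith@example.com today"))
# write to [PERSON] at [EMAIL] today
```

In this sketch the capitalized-word heuristic also fires on `Alice` and `Smith` inside the email address, but those spans are discarded because the higher-priority regex match already covers them.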
- Python 3.7+
- Required Python packages (see `requirements.txt`):
  - `nltk`
  - `Faker`
- Clone the repository: `git clone https://github.com/your-username/advanced-text-anonymizer.git`, then `cd advanced-text-anonymizer`.
- Install dependencies: `pip install -r requirements.txt`
- NLTK Data: The script (`afanon.py`) will automatically check and attempt to download required NLTK data packages (`punkt`, `averaged_perceptron_tagger`, `maxent_ne_chunker`, `words`) on its first run. Ensure you have an internet connection.
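
A check of this kind might look like the sketch below. It is an assumption about the approach, not a copy of what `afanon.py` does: the helper name `ensure_nltk_data` and the optional `find`/`download` parameters are hypothetical, and the resource paths are the usual locations for these packages in NLTK's data directory. Newer NLTK releases may expect differently named packages, so treat the mapping as illustrative.

```python
# Resource path inside NLTK's data directory -> package name for the downloader.
NLTK_RESOURCES = {
    "tokenizers/punkt": "punkt",
    "taggers/averaged_perceptron_tagger": "averaged_perceptron_tagger",
    "chunkers/maxent_ne_chunker": "maxent_ne_chunker",
    "corpora/words": "words",
}


def ensure_nltk_data(resources=None, find=None, download=None):
    """Download any missing NLTK packages and return the names that were fetched.

    `find` and `download` default to nltk.data.find and nltk.download; they are
    parameters only so the helper can be exercised without NLTK or a network.
    """
    resources = NLTK_RESOURCES if resources is None else resources
    if find is None or download is None:
        import nltk  # imported lazily, only when the real functions are needed
        find = find or nltk.data.find
        download = download or nltk.download

    fetched = []
    for path, package in resources.items():
        try:
            find(path)  # raises LookupError when the resource is not installed
        except LookupError:
            download(package, quiet=True)
            fetched.append(package)
    return fetched
```

Calling `ensure_nltk_data()` with no arguments once at startup would then fetch only the packages that are missing.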
The script is run from the command line:
`python afanon.py <input_file> <output_file> [options]`